diff mbox series

[v14,08/14] mm: multi-gen LRU: support page table walks

Message ID 20220815071332.627393-9-yuzhao@google.com (mailing list archive)
State New, archived
Headers show
Series Multi-Gen LRU Framework | expand

Commit Message

Yu Zhao Aug. 15, 2022, 7:13 a.m. UTC
To further exploit spatial locality, the aging prefers to walk page
tables to search for young PTEs and promote hot pages. A kill switch
will be added in the next patch to disable this behavior. When
disabled, the aging relies on the rmap only.

NB: this behavior has nothing similar with the page table scanning in
the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
pages to swapcache and unmaps them.

To avoid confusion, the term "iteration" specifically means the
traversal of an entire mm_struct list; the term "walk" will be applied
to page tables and the rmap, as usual.

An mm_struct list is maintained for each memcg, and an mm_struct
follows its owner task to the new memcg when this task is migrated.
Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot
pages before it increments max_seq.

When multiple page table walkers iterate the same list, each of them
gets a unique mm_struct; therefore they can run concurrently. Page
table walkers ignore any misplaced pages, e.g., if an mm_struct was
migrated, pages it left in the previous memcg will not be promoted
when its current memcg is under reclaim. Similarly, page table walkers
will not promote pages from nodes other than the one under reclaim.

This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
   page table walkers can skip processes that have been sleeping since
   the last iteration.
2. It uses generational Bloom filters to record populated branches so
   that page table walkers can reduce their search space based on the
   query results, e.g., to skip page tables containing mostly holes or
   misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
   CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
   spanning multiple VMAs. IOW, it finishes all the VMAs within the
   range of the same PMD table before it returns to a PGD table. This
   improves the cache performance for workloads that have large
   numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[8, 10]%
                Ops/sec      KB/sec
      patch1-7: 1147696.57   44640.29
      patch1-8: 1245274.91   48435.66

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-7
      48.16%  lzo1x_1_do_compress (real work)
       8.20%  page_vma_mapped_walk (overhead)
       7.06%  _raw_spin_unlock_irq
       2.92%  ptep_clear_flush
       2.53%  __zram_bvec_write
       2.11%  do_raw_spin_lock
       2.02%  memmove
       1.93%  lru_gen_look_around
       1.56%  free_unref_page_list
       1.40%  memset

    patch1-8
      49.44%  lzo1x_1_do_compress (real work)
       6.19%  page_vma_mapped_walk (overhead)
       5.97%  _raw_spin_unlock_irq
       3.13%  get_pfn_folio
       2.85%  ptep_clear_flush
       2.42%  __zram_bvec_write
       2.08%  do_raw_spin_lock
       1.92%  memmove
       1.44%  alloc_zspage
       1.36%  memset

  Configurations:
    no change

Thanks to the following developers for their efforts [3].
  kernel test robot <lkp@intel.com>

[1] https://lwn.net/Articles/23732/
[2] https://llvm.org/docs/ScudoHardenedAllocator.html
[3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 fs/exec.c                  |    2 +
 include/linux/memcontrol.h |    5 +
 include/linux/mm_types.h   |   77 +++
 include/linux/mmzone.h     |   56 +-
 include/linux/swap.h       |    4 +
 kernel/exit.c              |    1 +
 kernel/fork.c              |    9 +
 kernel/sched/core.c        |    1 +
 mm/memcontrol.c            |   25 +
 mm/vmscan.c                | 1012 +++++++++++++++++++++++++++++++++++-
 10 files changed, 1175 insertions(+), 17 deletions(-)

Comments

Peter Zijlstra Oct. 13, 2022, 3:04 p.m. UTC | #1
On Mon, Aug 15, 2022 at 01:13:27AM -0600, Yu Zhao wrote:
> +	for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
> +		pmd_t val = pmd_read_atomic(pmd + i);
> +
> +		/* for pmd_read_atomic() */
> +		barrier();

Please clarify the above. This is an entirely inadequate ordering
comment.

> +
> +		next = pmd_addr_end(addr, end);
Yu Zhao Oct. 19, 2022, 5:51 a.m. UTC | #2
On Thu, Oct 13, 2022 at 9:04 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Aug 15, 2022 at 01:13:27AM -0600, Yu Zhao wrote:
> > +     for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
> > +             pmd_t val = pmd_read_atomic(pmd + i);
> > +
> > +             /* for pmd_read_atomic() */
> > +             barrier();
>
> Please clarify the above. This is an entirely inadequate ordering
> comment.

If it's acceptable, I'll copy what we have in
pmd_none_or_clear_bad_unless_trans_huge():

        pmd_t pmdval = pmd_read_atomic(pmd);

       /* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
        barrier();
#endif

        if (pmd_none(pmdval))
                return 1;

pmd_read_atomic() should have a built-in READ_ONCE() in the first
place. If we have to use pmd_read_atomic(), it means we are not under
PMD PTL. So we can also race with pte_alloc(), regardless of THP
split. In this case, compiler reordering probably won't cause any real
damage, but technically not having barrier() is still a bug and will
trigger KCSAN warnings, I think.
Linus Torvalds Oct. 19, 2022, 5:40 p.m. UTC | #3
On Tue, Oct 18, 2022 at 10:51 PM Yu Zhao <yuzhao@google.com> wrote:
>
> pmd_read_atomic() should have a built-in READ_ONCE() in the first
> place.

I really think that is the right thing to do. Let's just move the
barrier in *there* instead.

It really should use 'READ_ONCE()', but it sadly cannot do that
portably, because 'pmd_t' may be a multi-word structure.

Of course, the x86-32 code does this all *almost* right, and
implements its own version of pmd_read_atomic(), but then sadly does
it _without_ actually using READ_ONCE() there.

So even when we could do it right, we don't.

But the x86-32 implementation of pmd_read_atomic() would be trivial to
fix to just use READ_ATOMIC, and the generic implementation should
just have a "barrier()" in it so that we wouldn't need crazy barriers
in the users.

Because as you say, the function is already called "read_atomic", and
it should damn well *act* that way then.

Hmm?

                 Linus
Peter Zijlstra Oct. 20, 2022, 2:13 p.m. UTC | #4
On Wed, Oct 19, 2022 at 10:40:40AM -0700, Linus Torvalds wrote:

> Because as you say, the function is already called "read_atomic", and
> it should damn well *act* that way then.

So I've been sitting on these here patches (and never having time to
repost them), which is how I noticed in the first place:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/mm.pae
Yu Zhao Oct. 20, 2022, 5:29 p.m. UTC | #5
On Thu, Oct 20, 2022 at 8:14 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 19, 2022 at 10:40:40AM -0700, Linus Torvalds wrote:
>
> > Because as you say, the function is already called "read_atomic", and
> > it should damn well *act* that way then.
>
> So I've been sitting on these here patches (and never having time to
> repost them), which is how I noticed in the first place:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/mm.pae

This looks good to me. It'll help get rid of all those open-coded
barrier()s and fix a couple of missing barrier()s.
Linus Torvalds Oct. 20, 2022, 5:35 p.m. UTC | #6
On Thu, Oct 20, 2022 at 7:14 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> So I've been sitting on these here patches (and never having time to
> repost them), which is how I noticed in the first place:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/mm.pae

Well, that seems an improvement. I don't love how GUP_GET_PTE_LOW_HIGH
now affects the PMD too, but if it's ok for all the three users, I
guess it's ok. Maybe rename it now that it's not just the PTE?

That said, I reacted to that cmpxchg loop:

        } while (cmpxchg64(&pmdp->pmd, old.pmd, 0ULL) != old.pmd);

is this series just so old (but rebased) that it doesn't use "try_cmpxchg64()"?


               Linus
Peter Zijlstra Oct. 20, 2022, 6:55 p.m. UTC | #7
On Thu, Oct 20, 2022 at 10:35:28AM -0700, Linus Torvalds wrote:
> On Thu, Oct 20, 2022 at 7:14 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > So I've been sitting on these here patches (and never having time to
> > repost them), which is how I noticed in the first place:
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/mm.pae
> 
> Well, that seems an improvement. I don't love how GUP_GET_PTE_LOW_HIGH
> now affects the PMD too, but if it's ok for all the three users, I
> guess it's ok. Maybe rename it now that it's not just the PTE?
> 
> That said, I reacted to that cmpxchg loop:
> 
>         } while (cmpxchg64(&pmdp->pmd, old.pmd, 0ULL) != old.pmd);
> 
> is this series just so old (but rebased) that it doesn't use "try_cmpxchg64()"?

Yep -- it's *that* old :-/ You're not in fact the first to point that
out.

I'll make time tomorrow to fix it up and respin and send out. This is as
good a time as any to get rid of carrying these patches myself.
Linus Torvalds Oct. 21, 2022, 2:10 a.m. UTC | #8
On Thu, Oct 20, 2022 at 11:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Oct 20, 2022 at 10:35:28AM -0700, Linus Torvalds wrote:
> > That said, I reacted to that cmpxchg loop:
> >
> >         } while (cmpxchg64(&pmdp->pmd, old.pmd, 0ULL) != old.pmd);
> >
> > is this series just so old (but rebased) that it doesn't use "try_cmpxchg64()"?
>
> Yep -- it's *that* old :-/ You're not in fact the first to point that
> out.
>
> I'll make time tomorrow to fix it up and respin and send out. This is as
> good a time as any to get rid of carrying these patches myself.

Hmm. Thinking some more about it, even if you do it as a
try_cmpxchg64() loop, it ends up being duplicated twice.

Maybe we should just bite the bullet, and say that we only support
x86-32 with 'cmpxchg8b' (ie Pentium and later).

Get rid of all the "emulate 64-bit atomics with cli/sti, knowing that
nobody has SMP on those CPU's anyway", and implement a generic x86-32
xchg() setup using that try_cmpxchg64 loop.

I think most (all?) distros already enable X86_PAE anyway, which makes
that X86_CMPXCHG64 be part of the base requirement.

Not that I'm convinced most distros even do 32-bit development anyway
these days.

(Of course, if we require X86_CMPXCHG64, we'll also hit some of the
odd clone CPU's that actually *do* support the instruction, but do not
report it in cpuid due to an odd old Windows NT bug. IOW, things like
the Cyrix and Transmeta CPU's did support the instruction, but had the
CX8 bit clear because otherwise NT wouldn't boot. We may or may not
get those cases right, but I doubt anybody really has any of those old
CPUs).

We got rid of i386 support back in 2012. Maybe it's time to get rid of
i486 support in 2022?

That way we could finally get rid of CONFIG_MATH_EMULATION too.

               Linus
Matthew Wilcox (Oracle) Oct. 21, 2022, 3:38 a.m. UTC | #9
On Thu, Oct 20, 2022 at 07:10:46PM -0700, Linus Torvalds wrote:
> Maybe we should just bite the bullet, and say that we only support
> x86-32 with 'cmpxchg8b' (ie Pentium and later).
> 
> Get rid of all the "emulate 64-bit atomics with cli/sti, knowing that
> nobody has SMP on those CPU's anyway", and implement a generic x86-32
> xchg() setup using that try_cmpxchg64 loop.
> 
> I think most (all?) distros already enable X86_PAE anyway, which makes
> that X86_CMPXCHG64 be part of the base requirement.
> 
> Not that I'm convinced most distros even do 32-bit development anyway
> these days.
> 
> (Of course, if we require X86_CMPXCHG64, we'll also hit some of the
> odd clone CPU's that actually *do* support the instruction, but do not
> report it in cpuid due to an odd old Windows NT bug. IOW, things like
> the Cyrix and Transmeta CPU's did support the instruction, but had the
> CX8 bit clear because otherwise NT wouldn't boot. We may or may not
> get those cases right, but I doubt anybody really has any of those old
> CPUs).
> 
> We got rid of i386 support back in 2012. Maybe it's time to get rid of
> i486 support in 2022?

Arnd suggested removing i486 last year and got a bit of pushback.
The most convincing to my mind was Maciej:

https://lore.kernel.org/lkml/alpine.DEB.2.21.2110231853170.38243@angie.orcam.me.uk/

but you can see a few other responses indicating that people have
shipped new 486-class hardware within the last few years (!)
Peter Zijlstra Oct. 21, 2022, 10:12 a.m. UTC | #10
On Thu, Oct 20, 2022 at 07:10:46PM -0700, Linus Torvalds wrote:

> Maybe we should just bite the bullet, and say that we only support
> x86-32 with 'cmpxchg8b' (ie Pentium and later).
> 
> Get rid of all the "emulate 64-bit atomics with cli/sti, knowing that
> nobody has SMP on those CPU's anyway", and implement a generic x86-32
> xchg() setup using that try_cmpxchg64 loop.
> 
> I think most (all?) distros already enable X86_PAE anyway, which makes
> that X86_CMPXCHG64 be part of the base requirement.
> 
> Not that I'm convinced most distros even do 32-bit development anyway
> these days.
> 
> (Of course, if we require X86_CMPXCHG64, we'll also hit some of the
> odd clone CPU's that actually *do* support the instruction, but do not
> report it in cpuid due to an odd old Windows NT bug. IOW, things like
> the Cyrix and Transmeta CPU's did support the instruction, but had the
> CX8 bit clear because otherwise NT wouldn't boot. We may or may not
> get those cases right, but I doubt anybody really has any of those old
> CPUs).
> 
> We got rid of i386 support back in 2012. Maybe it's time to get rid of
> i486 support in 2022?
> 
> That way we could finally get rid of CONFIG_MATH_EMULATION too.

Would love that; but as pointed out, there's still a few stragglers out
there. OTOH, they *could* use one of the LTS kernels.
Linus Torvalds Oct. 21, 2022, 4:50 p.m. UTC | #11
On Thu, Oct 20, 2022 at 8:38 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Oct 20, 2022 at 07:10:46PM -0700, Linus Torvalds wrote:
> >
> > We got rid of i386 support back in 2012. Maybe it's time to get rid of
> > i486 support in 2022?
>
> Arnd suggested removing i486 last year and got a bit of pushback.
> The most convincing to my mind was Maciej:

Hmm. Maciej added to the cc.

I suspect we can just say "oh, well, use LTS kernels".

> but you can see a few other responses indicating that people have
> shipped new 486-class hardware within the last few years (!)

Hmm.. I can only find references to PC104 boards with Bay Trail
(Pentium-class Atom core, iirc), and some possible FPGA
implementations through MISTer.

There's apparently also a Vortex86DX board too, and I don't know what
core that is based off, but judging from the fact that it has a
dual-core version, it had *better* support cmpxchg8b anyway, because
without that our 64-bit atomics aren't actually atomic.

(Trying to google around, it looks like the older Vortex86SX was a
486-class CPU and did indeed lack CX8)

Intel had some embedded cores (Quark) that were based on the i486
pipeline (as can be seen from the timings), but they actually reported
themselves as Pentium-class and supported all the classic (pre-MMX)
Pentium features.

So I *really* don't think i486 class hardware is relevant any more.
Yes, I'm sure it exists (Maciej being an example), but from a kernel
development standpoint I don't think they are really relevant.

At some point, people have them as museum pieces. They might as well
run museum kernels.

Moving up to requiring cmpxchg8b doesn't sound unreasonable to me.

            Linus
David Gow Oct. 23, 2022, 2:44 p.m. UTC | #12
Le 22/10/22 à 00:50, Linus Torvalds a écrit :
> On Thu, Oct 20, 2022 at 8:38 PM Matthew Wilcox<willy@infradead.org>  wrote:
>> On Thu, Oct 20, 2022 at 07:10:46PM -0700, Linus Torvalds wrote:
>>> We got rid of i386 support back in 2012. Maybe it's time to get rid of
>>> i486 support in 2022?
>> Arnd suggested removing i486 last year and got a bit of pushback.
>> The most convincing to my mind was Maciej:
> Hmm. Maciej added to the cc.
> 
> I suspect we can just say "oh, well, use LTS kernels".
> 

To jump in early on the inevitable pile-on, I'm doing my occasional 
32-bit x86 KUnit test runs on an old 486 DX/2. Now, this is _mostly_ 
just a party trick -- and there are lots of people running 32-bit builds 
under QEMU et al -- but personally, the only non-amd64-capable x86 
machines I have lying around are all 486 class (including a new Vortex86 
board).

(But, at the very least, I can confirm that the latest torvalds/master 
does build, run, and pass KUnit tests on a real 486 at the moment.)

So while dropping i486 wouldn't affect anything particularly important 
for me, it'd be a minor inconvenience and make me a bit sad.

That being said, I have no objection to dropping support for 486SX CPUs 
and CONFIG_MATH_EMULATION.

Cheers,
-- David
Maciej W. Rozycki Oct. 23, 2022, 5:55 p.m. UTC | #13
On Fri, 21 Oct 2022, Linus Torvalds wrote:

> > > We got rid of i386 support back in 2012. Maybe it's time to get rid of
> > > i486 support in 2022?
> >
> > Arnd suggested removing i486 last year and got a bit of pushback.
> > The most convincing to my mind was Maciej:
> 
> Hmm. Maciej added to the cc.

 Thanks!

> So I *really* don't think i486 class hardware is relevant any more.
> Yes, I'm sure it exists (Maciej being an example), but from a kernel
> development standpoint I don't think they are really relevant.
> 
> At some point, people have them as museum pieces. They might as well
> run museum kernels.
> 
> Moving up to requiring cmpxchg8b doesn't sound unreasonable to me.

 But is it really a problem?  I mean unlike MIPS R2000/R3000 class gear 
that has no atomics at all at the CPU level (SMP R3000 machines did exist 
and necessarily had atomics, actually via gating storage implemented by 
board hardware in systems we have never had support for even for UP) we 
have had atomics in x86 since forever.  Just not 64-bit ones.

 Given the presence of generic atomics we can emulate CMPXCHG8B easily 
LL/SC-style using a spinlock with XCHG even on SMP let alone UP.  So all 
the kernel code can just assume the presence of CMPXCHG8B, but any 
invocations of CMPXCHG8B would be diverted to the emulation, perhaps even 
at the assembly level via a GAS macro called `cmpxchg8b' (why not?).  All 
the maintenance burden is then shifted to that macro and said emulation 
code.

 Proof of concept wrapper:

#define LOCK_PREFIX ""

#define CC_SET(c) "\n\t/* output condition code " #c "*/\n"
#define CC_OUT(c) "=@cc" #c

#define unlikely(x) __builtin_expect(!!(x), 0)

__extension__ typedef unsigned long long __u64;
typedef unsigned int __u32;
typedef __u64 u64;
typedef __u32 u32;

typedef _Bool bool;

__asm__(
	".macro		cmpxchg8b arg\n\t"
	"pushl		%eax\n\t"
	"leal		\\arg, %eax\n\t"
	"xchgl		%eax, (%esp)\n\t"
	"call		cmpxchg8b_emu\n\t"
	".endm\n\t");

bool __try_cmpxchg64(volatile u64 *ptr, u64 *pold, u64 new)
{
	bool success;
	u64 old = *pold;
	asm volatile(LOCK_PREFIX "cmpxchg8b %[ptr]"
		     CC_SET(z)
		     : CC_OUT(z) (success),
		       [ptr] "+m" (*ptr),
		       "+A" (old)
		     : "b" ((u32)new),
		       "c" ((u32)(new >> 32))
		     : "memory");

	if (unlikely(!success))
		*pold = old;
	return success;
}

This assembles to:

cmpxchg8b.o:     file format elf32-i386

Disassembly of section .text:

00000000 <__try_cmpxchg64>:
   0:	55                   	push   %ebp
   1:	89 e5                	mov    %esp,%ebp
   3:	57                   	push   %edi
   4:	56                   	push   %esi
   5:	89 d7                	mov    %edx,%edi
   7:	53                   	push   %ebx
   8:	89 c6                	mov    %eax,%esi
   a:	8b 4d 0c             	mov    0xc(%ebp),%ecx
   d:	8b 02                	mov    (%edx),%eax
   f:	8b 5d 08             	mov    0x8(%ebp),%ebx
  12:	8b 52 04             	mov    0x4(%edx),%edx
  15:	50                   	push   %eax
  16:	8d 06                	lea    (%esi),%eax
  18:	87 04 24             	xchg   %eax,(%esp)
  1b:	e8 fc ff ff ff       	call   1c <__try_cmpxchg64+0x1c>
			1c: R_386_PC32	cmpxchg8b_emu
  20:	0f 94 c1             	sete   %cl
  23:	75 0b                	jne    30 <__try_cmpxchg64+0x30>
  25:	5b                   	pop    %ebx
  26:	88 c8                	mov    %cl,%al
  28:	5e                   	pop    %esi
  29:	5f                   	pop    %edi
  2a:	5d                   	pop    %ebp
  2b:	c3                   	ret    
  2c:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
  30:	5b                   	pop    %ebx
  31:	89 07                	mov    %eax,(%edi)
  33:	5e                   	pop    %esi
  34:	89 57 04             	mov    %edx,0x4(%edi)
  37:	5f                   	pop    %edi
  38:	88 c8                	mov    %cl,%al
  3a:	5d                   	pop    %ebp
  3b:	c3                   	ret    

Of course there's a minor ABI nit for `cmpxchg8b_emu' to return a result 
in ZF and the wrapper relies on CONFIG_FRAME_POINTER for correct `arg' 
evaluation in all cases.  But that shouldn't be a big deal, should it?

 Then long-term maintenance would be minimal to nil and all the code 
except for the wrapper and the emulation handler need not be concerned 
about the 486 obscurity.  I can volunteer to maintain said wrapper and 
emulation (and for that matter generic 486 support) if that helped to keep 
the 486 alive.

 Eventually we may choose to drop 486 support after all, but CMPXCHG8B 
alone seems too small a reason to me for that to happen right now.

 NB MIPS R2000 dates back to 1985, solid 4 years before the 486, and we 
continue supporting it with minimal effort.  We do have atomic emulation 
for userland of course.

  Maciej
Linus Torvalds Oct. 23, 2022, 6:35 p.m. UTC | #14
On Sun, Oct 23, 2022 at 10:55 AM Maciej W. Rozycki <macro@orcam.me.uk> wrote:
>
>  Given the presence of generic atomics we can emulate CMPXCHG8B easily
> LL/SC-style using a spinlock with XCHG even on SMP let alone UP.

We already do that (admittedly badly - it's not SMP safe, but
486-class SMP machines have never been supported even if they
technically did exist), see

  arch/x86/lib/cmpxchg8b_emu.S
  arch/x86/lib/atomic64_386_32.S

for some pretty disgusting code.

But it's all the other infrastructure to support this that is just an
unnecessary weight. Grep for CONFIG_X86_CMPXCHG64 and X86_FEATURE_CX8.

We already have increasingly bad coverage testing for x86-32 - and
your example of MIPS really doesn't strengthen your argument all that
much, because MIPS has never been very widely used in the first place,
and doesn't affect any mainline development.

The odd features and CPU selection really do not help.

Honestly, I wouldn't mind upgrading the minimum requirements to at
least M586TSC - leaving some of those early "fake Pentium" clones
behind too. Because 'rdtsc' is probably an even worse issue than
CMPXCHG8B.

In fact, I don't understand how current kernels work on an i486 at
all, since it looks like

  exit_to_user_mode_prepare ->
    arch_exit_to_user_mode_prepare

ends up having an unconditional 'rdtsc' instruction in it.

I'm guessing that you don't have RANDOMIZE_KSTACK_OFFSET enabled?

In other words, our non-Pentium support is ACTIVELY BUGGY AND BROKEN right now.

This is not some theoretical issue, but very much a "look, ma, this
has never been tested, and cannot actually work" issue, that nobody
has ever noticed because nobody really cares.

It took me a couple of minutes of "let's go hunting" to find that
thing, and it's just an example of how broken our current support is.
That RANDOMIZE_KSTACK_OFFSET code *compiles* just fine. It just
doesn't actually work.

That's the kind of maintenance burden we simply shouldn't have - no
developer actually cares (correctly), nobody really tests that
situation (also correctly - it's old and irrelevant hardware), but it
also means that code just randomly doesn't actually work.

                      Linus
Arnd Bergmann Oct. 24, 2022, 7:30 a.m. UTC | #15
On Sun, Oct 23, 2022, at 20:35, Linus Torvalds wrote:
>
> Honestly, I wouldn't mind upgrading the minimum requirements to at
> least M586TSC - leaving some of those early "fake Pentium" clones
> behind too. Because 'rdtsc' is probably an even worse issue than
> CMPXCHG8B.

Kconfig treats X86_CMPXCHG64 as a strict subset of X86_TSC (except
when enabling X86_PAE, which relies on cx8), so requiring both
sounds like a good idea.

From the Kconfig history, I see you initially only enabled
cx8 unconditionally for a couple of CPUs in 982d007a6eec ("x86:
Optimize cmpxchg64() at build-time some more"), and Matthew
Whitehead extended that list in f960cfd12650 ("x86/Kconfig:
Add missing i586-class CPUs to the X86_CMPXCHG64 Kconfig group").

There are still a handful of CPUs that according to [1] 
claim cx8 support that we leave disabled, specifically the
Kconfig symbols for MWINCHIP3D, MCRUSOE, MEFFICEON, MCYRIXIII,
MVIAC3_2 and MVIAC7 should have both tsc and cx8, while the
older MWINCHIPC6 and a small subset of M586 (Cyrix 6x86mx, C-II
and AMD K5) apparently have cx8 but not tsc.

Would you drop support for the 686-class chips that currently
don't use cmpxchg8b, or just remove CONFIG_X86_CMPXCHG64 and
assume they work after all?

       Arnd

[1] https://reactos.org/wiki/Supported_Hardware/CPU
Serentty Oct. 24, 2022, 6:20 p.m. UTC | #16
As someone who still regularly uses hardware from this era, and often 
runs Linux on it, this would definitely be a blow to which machines I 
can actively use. Linux support is a big part of how I use these 
machines, since DOS and Windows 95 really can’t keep up with modern 
networking standards.

I would be very disappointed, and impacted, if Linux dropped 486 support.

On 10/20/22 23:10, Linus Torvalds wrote:
> On Thu, Oct 20, 2022 at 11:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
>> On Thu, Oct 20, 2022 at 10:35:28AM -0700, Linus Torvalds wrote:
>>> That said, I reacted to that cmpxchg loop:
>>>
>>>          } while (cmpxchg64(&pmdp->pmd, old.pmd, 0ULL) != old.pmd);
>>>
>>> is this series just so old (but rebased) that it doesn't use "try_cmpxchg64()"?
>> Yep -- it's *that* old :-/ You're not in fact the first to point that
>> out.
>>
>> I'll make time tomorrow to fix it up and respin and send out. This is as
>> good a time as any to get rid of carrying these patches myself.
> Hmm. Thinking some more about it, even if you do it as a
> try_cmpxchg64() loop, it ends up being duplicated twice.
>
> Maybe we should just bite the bullet, and say that we only support
> x86-32 with 'cmpxchg8b' (ie Pentium and later).
>
> Get rid of all the "emulate 64-bit atomics with cli/sti, knowing that
> nobody has SMP on those CPU's anyway", and implement a generic x86-32
> xchg() setup using that try_cmpxchg64 loop.
>
> I think most (all?) distros already enable X86_PAE anyway, which makes
> that X86_CMPXCHG64 be part of the base requirement.
>
> Not that I'm convinced most distros even do 32-bit development anyway
> these days.
>
> (Of course, if we require X86_CMPXCHG64, we'll also hit some of the
> odd clone CPU's that actually *do* support the instruction, but do not
> report it in cpuid due to an odd old Windows NT bug. IOW, things like
> the Cyrix and Transmeta CPU's did support the instruction, but had the
> CX8 bit clear because otherwise NT wouldn't boot. We may or may not
> get those cases right, but I doubt anybody really has any of those old
> CPUs).
>
> We got rid of i386 support back in 2012. Maybe it's time to get rid of
> i486 support in 2022?
>
> That way we could finally get rid of CONFIG_MATH_EMULATION too.
>
>                 Linus
>
Serentty Oct. 24, 2022, 7:28 p.m. UTC | #17
Trying sending again since I am not sure if it went through the first time. I’m not used to the mailing list.

As someone who still regularly uses hardware from this era, and often runs Linux on it, this would definitely be a blow to which machines I can actively use. Linux support is a big part of how I use these machines, since I usually have them connected to my LAN and the internet, and networking support on MS-DOS and Windows 9x rots more each year in terms of support for modern protocols and security. I would be very disappointed, and impacted, if Linux dropped 486 support.
Maciej W. Rozycki Oct. 25, 2022, 4:28 p.m. UTC | #18
On Sun, 23 Oct 2022, Linus Torvalds wrote:

> >  Given the presence of generic atomics we can emulate CMPXCHG8B easily
> > LL/SC-style using a spinlock with XCHG even on SMP let alone UP.
> 
> We already do that (admittedly badly - it's not SMP safe, but
> 486-class SMP machines have never been supported even if they
> technically did exist), see
> 
>   arch/x86/lib/cmpxchg8b_emu.S
>   arch/x86/lib/atomic64_386_32.S
> 
> for some pretty disgusting code.

 I skimmed over these, yeah, before writing my previous reply and hence 
my proposal to make the approach somewhat less disgusting.

 FWIW Intel talks about 486 SMP systems in their MP spec, but even back 25
years ago I was unable to track down a single mention of a product name 
for an APIC-based 486 SMP system let alone a specimen.  I guess the MP 
spec and the APIC simply came too late in the game for the 486.

 Compaq did however make 486 SMP systems based on their proprietary 
solution (I can't remember the name offhand), which they also propagated 
to their later Pentium products, some of which could be switched into the 
APIC mode instead via a BIOS setting.  ISTR a 16-way brand new Xeon box 
still using the Compaq solution in early 2000s.  Compaq never bothered to 
publish the spec for their solution and nobody was determined enough to 
reverse-engineer it, so we never had support for it.

> But it's all the other infrastructure to support this that is just an
> unnecessary weight. Grep for CONFIG_X86_CMPXCHG64 and X86_FEATURE_CX8.

 Some of these are syntactic sugar really, but I agree there seem to be 
too many of them and I guess even that perhaps could be simplified at the 
expense of some performance loss with 486 systems, by assuming universal 
presence of CMPXCHG8B and emulating the instruction in #UD handler where 
unavailable.  I could live with that, and that could get away with no 
conditionals (except maybe one to have the emulation handler optimised 
away where not needed based on a single CONFIG_X86_CMPXCHG64 instance).

> We already have increasingly bad coverage testing for x86-32 - and
> your example of MIPS really doesn't strengthen your argument all that
> much, because MIPS has never been very widely used in the first place,
> and doesn't affect any mainline development.

 TBH by the number of pieces of hardware I am fairly sure there have been 
significantly more MIPS Linux deployments in the world than x86 ones, and
second only to ARM ones, though indeed the diversity of configurations may 
have been smaller.  And all of them seem to have survived having ancient 
MIPS CPU support alongside.

> The odd features and CPU selection really do not help.
> 
> Honestly, I wouldn't mind upgrading the minimum requirements to at
> least M586TSC - leaving some of those early "fake Pentium" clones
> behind too. Because 'rdtsc' is probably an even worse issue than
> CMPXCHG8B.
> 
> In fact, I don't understand how current kernels work on an i486 at
> all, since it looks like
> 
>   exit_to_user_mode_prepare ->
>     arch_exit_to_user_mode_prepare
> 
> ends up having an unconditional 'rdtsc' instruction in it.
> 
> I'm guessing that you don't have RANDOMIZE_KSTACK_OFFSET enabled?

 I have checked and I have not moved past 5.11.0 yet for my 486 box, and 
that's before the addition of RANDOMIZE_KSTACK_OFFSET.

 Sigh, time flies by and there's been too much breakage around for me to 
deal with to schedule an upgrade of what has been mostly a stable 
trouble-free configuration for me.  E.g. after three unrelated bug fixes 
the parport_pc driver still does not work with my RISC-V box, and that's 
of course only once I figured out how to work around a hardware erratum 
with a pair of upstream PCIe switches that let the PCIe parallel port 
option card to be reachable in the first place.  So I gave those issues 
priority over upgrading the 486 kernel, though obviously I'll get to it 
sooner or later.

 The fix here is obviously and trivially:

	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET if !M486SX && !M486

Not all features have to be available everywhere and there are compromises 
to make.  I can live with my 486 box being less secure: I don't intend to 
hand out accounts to people I don't trust or run a web server there.

 I don't know offhand how much we rely on RDTSC and if necessary how much 
a trivial emulation referring to jiffies plus maybe reading the 8254 PIT
counter would suck.  Maybe it's not a big deal.  Again, it all depends on 
the application.

 NB I have seen this in the logs appearing since a while ago with my dual
Pentium-MMX box:

clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc-early' as unstable because the skew is too large:
clocksource:                       'refined-jiffies' wd_nsec: 500000000 wd_now:ffff8b3e wd_last: ffff8b0c mask: ffffffff
clocksource:                       'tsc-early' cs_nsec: 829966007 cs_now: 6049e0005 cs_last: 5f91b3cff mask: ffffffffffffffff
clocksource:                       No current clocksource.
tsc: Marking TSC unstable due to clocksource watchdog

so I guess this system would qualify as one without RDTSC too (even though 
it's M586MMX, i.e. a superset of M586TSC, so hardly early "fake Pentium"), 
right?  It's interesting to see how we can't (anymore) make use of any of 
the various timers this system has (i.e. the PIT, the TSC, the APIC timer) 
for timekeeping.

> In other words, our non-Pentium support is ACTIVELY BUGGY AND BROKEN 
> right now.

 We have plenty of bugs elsewhere too.  I hit them all the time and try to 
fix as my time permits; I guess other people do too.  I think it's just a 
matter of willing to deal with issues.  And we won't ever fix all the bugs 
we have.  There will always be some remaining even if the exact set 
changes.

> This is not some theoretical issue, but very much a "look, ma, this
> has never been tested, and cannot actually work" issue, that nobody
> has ever noticed because nobody really cares.

 Same with parport_pc on RISC-V.  That just happens with more rarely used 
features, even if it's brand new hardware (I bought both pieces new in 
retail boxes last year or so).

> It took me a couple of minutes of "let's go hunting" to find that
> thing, and it's just an example of how broken our current support is.
> That RANDOMIZE_KSTACK_OFFSET code *compiles* just fine. It just
> doesn't actually work.

 Nobody tried it, and that's just it.  We may have bots building random 
configs, but not all will ever be tried at run time.  Bugs creep in all 
the time, because nobody has the ability to foresee all the scenarios, or 
sometimes they are genuine human errors.

> That's the kind of maintenance burden we simply shouldn't have - no
> developer actually cares (correctly), nobody really tests that
> situation (also correctly - it's old and irrelevant hardware), but it
> also means that code just randomly doesn't actually work.

 I think this is a strawman's argument really.  For various reasons there 
are always combinations of hardware that do not work just because one 
cannot verify everything and the age of hardware may or may not be the 
culprit.  It's more about the user base.  Niche use gets less coverage.  
If something doesn't work and someone actually cares about it, they will 
come and fix it.

 The only argument I buy is extra maintenance burden caused *elsewhere*, 
so if support for old 486 systems staying around causes extra work for 
mainstream x86-64 systems, only then I will consider it a valid concern.

 So what's the actual burden from keeping this support around?  Would my 
proposal to emulate CMPXCHG8B (and possibly RDTSC) in #UD handler help?  

 Getting the decoding of x86 address modes in software right is a pain and 
tedious (I've seen fixes to get the corner cases right in the disassembler 
in binutils fly by quite recently, after so many years), but I guess we 
can try if we don't have it implemented already for another purpose (I 
haven't checked; I've been hardly involved with x86 recently).

 Thoughts?

  Maciej
Arnd Bergmann Oct. 26, 2022, 3:43 p.m. UTC | #19
On Tue, Oct 25, 2022, at 18:28, Maciej W. Rozycki wrote:
> On Sun, 23 Oct 2022, Linus Torvalds wrote:
>
>> In fact, I don't understand how current kernels work on an i486 at
>> all, since it looks like
>> 
>>   exit_to_user_mode_prepare ->
>>     arch_exit_to_user_mode_prepare
>> 
>> ends up having an unconditional 'rdtsc' instruction in it.
>> >
>  The fix here is obviously and trivially:
>
> 	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET if !M486SX && !M486

I think that would be "if X86_TSC", otherwise you still include the
TSC-less 586-class (5x86, 6x86, Elan, Winchip C6, MediaGX, ...)

> So what's the actual burden from keeping this support around?  Would my 
> proposal to emulate CMPXCHG8B (and possibly RDTSC) in #UD handler help?

That sounds worse to me than the current use of runtime alternatives
for picking between cmpxchg8b_emu and the native instruction.

For arm32, we have a combination of two other approaches:

- On the oldest processors that never had SMP support (ARMv5 and
  earlier), it is not possible to enable support for SMP at all.
  Using a Kconfig 'depends on X86_CMPXCHG64' for CONFIG_SMP would
  still allow building 486 kernels, but completely avoid the problem
  of trying to make the same kernel work on later SMP machines.

- For the special case of early ARMv6 hardware that has 32-bit
  atomics but not 64-bit ones, the kernel just falls back to
  CONFIG_GENERIC_ATOMIC64 and no cmpxchg64(). The same should work
  for an i486+SMP kernel. It's obviously slower, but most users
  can trivially avoid this by either running an i686 SMP kernel
  or an i486 UP kernel.

       Arnd
Maciej W. Rozycki Oct. 27, 2022, 11:08 p.m. UTC | #20
On Wed, 26 Oct 2022, Arnd Bergmann wrote:

> >> In fact, I don't understand how current kernels work on an i486 at
> >> all, since it looks like
> >> 
> >>   exit_to_user_mode_prepare ->
> >>     arch_exit_to_user_mode_prepare
> >> 
> >> ends up having an unconditional 'rdtsc' instruction in it.
> >> >
> >  The fix here is obviously and trivially:
> >
> > 	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET if !M486SX && !M486
> 
> I think that would be "if X86_TSC", otherwise you still include the
> TSC-less 586-class (5x86, 6x86, Elan, Winchip C6, MediaGX, ...)

 Right, I tend to forget about these more exotic chips from the 1990s era.  
I'll run some verification and come up with the actual fix in the next 
several days.

> > So what's the actual burden from keeping this support around?  Would my 
> > proposal to emulate CMPXCHG8B (and possibly RDTSC) in #UD handler help?
> 
> That sounds worse to me than the current use of runtime alternatives
> for picking between cmpxchg8b_emu and the native instruction.

 Why is that so?  Because of the trap-and-emulate technique?  It's been 
around since forever and specified in some processor ISAs even, where some 
machine instructions are explicitly allowed to be omitted from actual 
hardware and delegated to OS emulation without making affected hardware 
non-compliant.  VAX had it back from 1970s and RISC-V has it now.  We've 
been using it to retrofit operations ourselves, though maybe not with the 
x86 arch.

 Or is it because of the complex address decoding x86 requires?  Well, I 
have actually realised we do have it already, in the x87 CR0.EM emulator.  
While IEEE-754 exceptions can make use of the address of the operand 
recorded in the FPU environment full emulation requires decoding by hand.

> For arm32, we have a combination of two other approaches:
> 
> - On the oldest processors that never had SMP support (ARMv5 and
>   earlier), it is not possible to enable support for SMP at all.
>   Using a Kconfig 'depends on X86_CMPXCHG64' for CONFIG_SMP would
>   still allow building 486 kernels, but completely avoid the problem
>   of trying to make the same kernel work on later SMP machines.

 That would be fine with me of course.

> - For the special case of early ARMv6 hardware that has 32-bit
>   atomics but not 64-bit ones, the kernel just falls back to
>   CONFIG_GENERIC_ATOMIC64 and no cmpxchg64(). The same should work
>   for an i486+SMP kernel. It's obviously slower, but most users
>   can trivially avoid this by either running an i686 SMP kernel
>   or an i486 UP kernel.

 You meant an M586TSC+ SMP kernel presumably (I have such a machine), but 
otherwise I'd be fine with such an approach too.

 So it looks to me like we have at least three options to keep 486 alive,
two of which seem fairly straightforward to deploy and maintain long-term.  
I like your last proposal the most, FWIW.  Do we have a consensus here?

  Maciej
Arnd Bergmann Oct. 28, 2022, 7:27 a.m. UTC | #21
On Fri, Oct 28, 2022, at 01:08, Maciej W. Rozycki wrote:
> On Wed, 26 Oct 2022, Arnd Bergmann wrote:

>> - For the special case of early ARMv6 hardware that has 32-bit
>>   atomics but not 64-bit ones, the kernel just falls back to
>>   CONFIG_GENERIC_ATOMIC64 and no cmpxchg64(). The same should work
>>   for an i486+SMP kernel. It's obviously slower, but most users
>>   can trivially avoid this by either running an i686 SMP kernel
>>   or an i486 UP kernel.
>
>  You meant an M586TSC+ SMP kernel presumably (I have such a machine), but 
> otherwise I'd be fine with such an approach too.

Sure. I just gave i686 as the example since that's already the
baseline in 90% of the remaining x86-32 distros. Slackware, ALT
and Mageia are notable exceptions that target i586, and some
others already have separate installers for i486 and i686.
The i586 distros all seem to have separate SMP/PAE kernels in
addition to the minimal i586.

    Arnd
diff mbox series

Patch

diff --git a/fs/exec.c b/fs/exec.c
index f793221f4eb6..5fd98ca569ff 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1014,6 +1014,7 @@  static int exec_mmap(struct mm_struct *mm)
 	active_mm = tsk->active_mm;
 	tsk->active_mm = mm;
 	tsk->mm = mm;
+	lru_gen_add_mm(mm);
 	/*
 	 * This prevents preemption while active_mm is being loaded and
 	 * it and mm are being updated, which could cause problems for
@@ -1029,6 +1030,7 @@  static int exec_mmap(struct mm_struct *mm)
 	tsk->mm->vmacache_seqnum = 0;
 	vmacache_flush(tsk);
 	task_unlock(tsk);
+	lru_gen_use_mm(mm);
 
 	if (vfork)
 		timens_on_fork(tsk->nsproxy, tsk);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 47829f378fcb..ea6a78bb896c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -350,6 +350,11 @@  struct mem_cgroup {
 	struct deferred_split deferred_split_queue;
 #endif
 
+#ifdef CONFIG_LRU_GEN
+	/* per-memcg mm_struct list */
+	struct lru_gen_mm_list mm_list;
+#endif
+
 	struct mem_cgroup_per_node *nodeinfo[];
 };
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cf97f3884fda..1952e70ec099 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -3,6 +3,7 @@ 
 #define _LINUX_MM_TYPES_H
 
 #include <linux/mm_types_task.h>
+#include <linux/sched.h>
 
 #include <linux/auxvec.h>
 #include <linux/kref.h>
@@ -17,6 +18,7 @@ 
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
+#include <linux/mmdebug.h>
 
 #include <asm/mmu.h>
 
@@ -672,6 +674,22 @@  struct mm_struct {
 		 */
 		unsigned long ksm_merging_pages;
 #endif
+#ifdef CONFIG_LRU_GEN
+		struct {
+			/* this mm_struct is on lru_gen_mm_list */
+			struct list_head list;
+			/*
+			 * Set when switching to this mm_struct, as a hint of
+			 * whether it has been used since the last time per-node
+			 * page table walkers cleared the corresponding bits.
+			 */
+			unsigned long bitmap;
+#ifdef CONFIG_MEMCG
+			/* points to the memcg of "owner" above */
+			struct mem_cgroup *memcg;
+#endif
+		} lru_gen;
+#endif /* CONFIG_LRU_GEN */
 	} __randomize_layout;
 
 	/*
@@ -698,6 +716,65 @@  static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 	return (struct cpumask *)&mm->cpu_bitmap;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+struct lru_gen_mm_list {
+	/* mm_struct list for page table walkers */
+	struct list_head fifo;
+	/* protects the list above */
+	spinlock_t lock;
+};
+
+void lru_gen_add_mm(struct mm_struct *mm);
+void lru_gen_del_mm(struct mm_struct *mm);
+#ifdef CONFIG_MEMCG
+void lru_gen_migrate_mm(struct mm_struct *mm);
+#endif
+
+static inline void lru_gen_init_mm(struct mm_struct *mm)
+{
+	INIT_LIST_HEAD(&mm->lru_gen.list);
+	mm->lru_gen.bitmap = 0;
+#ifdef CONFIG_MEMCG
+	mm->lru_gen.memcg = NULL;
+#endif
+}
+
+static inline void lru_gen_use_mm(struct mm_struct *mm)
+{
+	/* unlikely but not a bug when racing with lru_gen_migrate_mm() */
+	VM_WARN_ON_ONCE(list_empty(&mm->lru_gen.list));
+
+	if (!(current->flags & PF_KTHREAD))
+		WRITE_ONCE(mm->lru_gen.bitmap, -1);
+}
+
+#else /* !CONFIG_LRU_GEN */
+
+static inline void lru_gen_add_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_del_mm(struct mm_struct *mm)
+{
+}
+
+#ifdef CONFIG_MEMCG
+static inline void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+}
+#endif
+
+static inline void lru_gen_init_mm(struct mm_struct *mm)
+{
+}
+
+static inline void lru_gen_use_mm(struct mm_struct *mm)
+{
+}
+
+#endif /* CONFIG_LRU_GEN */
+
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 850c6171af68..51e521465742 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -408,7 +408,7 @@  enum {
  * min_seq behind.
  *
  * The number of pages in each generation is eventually consistent and therefore
- * can be transiently negative.
+ * can be transiently negative when reset_batch_size() is pending.
  */
 struct lru_gen_struct {
 	/* the aging increments the youngest generation number */
@@ -430,6 +430,53 @@  struct lru_gen_struct {
 	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 };
 
+enum {
+	MM_LEAF_TOTAL,		/* total leaf entries */
+	MM_LEAF_OLD,		/* old leaf entries */
+	MM_LEAF_YOUNG,		/* young leaf entries */
+	MM_NONLEAF_TOTAL,	/* total non-leaf entries */
+	MM_NONLEAF_FOUND,	/* non-leaf entries found in Bloom filters */
+	MM_NONLEAF_ADDED,	/* non-leaf entries added to Bloom filters */
+	NR_MM_STATS
+};
+
+/* double-buffering Bloom filters */
+#define NR_BLOOM_FILTERS	2
+
+struct lru_gen_mm_state {
+	/* set to max_seq after each iteration */
+	unsigned long seq;
+	/* where the current iteration continues (inclusive) */
+	struct list_head *head;
+	/* where the last iteration ended (exclusive) */
+	struct list_head *tail;
+	/* to wait for the last page table walker to finish */
+	struct wait_queue_head wait;
+	/* Bloom filters flip after each iteration */
+	unsigned long *filters[NR_BLOOM_FILTERS];
+	/* the mm stats for debugging */
+	unsigned long stats[NR_HIST_GENS][NR_MM_STATS];
+	/* the number of concurrent page table walkers */
+	int nr_walkers;
+};
+
+struct lru_gen_mm_walk {
+	/* the lruvec under reclaim */
+	struct lruvec *lruvec;
+	/* unstable max_seq from lru_gen_struct */
+	unsigned long max_seq;
+	/* the next address within an mm to scan */
+	unsigned long next_addr;
+	/* to batch promoted pages */
+	int nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
+	/* to batch the mm stats */
+	int mm_stats[NR_MM_STATS];
+	/* total batched items */
+	int batched;
+	bool can_swap;
+	bool force_scan;
+};
+
 void lru_gen_init_lruvec(struct lruvec *lruvec);
 void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
 
@@ -480,6 +527,8 @@  struct lruvec {
 #ifdef CONFIG_LRU_GEN
 	/* evictable pages divided into generations */
 	struct lru_gen_struct		lrugen;
+	/* to concurrently iterate lru_gen_mm_list */
+	struct lru_gen_mm_state		mm_state;
 #endif
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
@@ -1174,6 +1223,11 @@  typedef struct pglist_data {
 
 	unsigned long		flags;
 
+#ifdef CONFIG_LRU_GEN
+	/* kswap mm walk data */
+	struct lru_gen_mm_walk	mm_walk;
+#endif
+
 	ZONE_PADDING(_pad2_)
 
 	/* Per-node vmstats */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 43150b9bbc5c..6308150b234a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -162,6 +162,10 @@  union swap_header {
  */
 struct reclaim_state {
 	unsigned long reclaimed_slab;
+#ifdef CONFIG_LRU_GEN
+	/* per-thread mm walk data */
+	struct lru_gen_mm_walk *mm_walk;
+#endif
 };
 
 #ifdef __KERNEL__
diff --git a/kernel/exit.c b/kernel/exit.c
index 84021b24f79e..98a33bd7c25c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -466,6 +466,7 @@  void mm_update_next_owner(struct mm_struct *mm)
 		goto retry;
 	}
 	WRITE_ONCE(mm->owner, c);
+	lru_gen_migrate_mm(mm);
 	task_unlock(c);
 	put_task_struct(c);
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index 90c85b17bf69..d2da065442af 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1152,6 +1152,7 @@  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 		goto fail_nocontext;
 
 	mm->user_ns = get_user_ns(user_ns);
+	lru_gen_init_mm(mm);
 	return mm;
 
 fail_nocontext:
@@ -1194,6 +1195,7 @@  static inline void __mmput(struct mm_struct *mm)
 	}
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
+	lru_gen_del_mm(mm);
 	mmdrop(mm);
 }
 
@@ -2694,6 +2696,13 @@  pid_t kernel_clone(struct kernel_clone_args *args)
 		get_task_struct(p);
 	}
 
+	if (IS_ENABLED(CONFIG_LRU_GEN) && !(clone_flags & CLONE_VM)) {
+		/* lock the task to synchronize with memcg migration */
+		task_lock(p);
+		lru_gen_add_mm(p->mm);
+		task_unlock(p);
+	}
+
 	wake_up_new_task(p);
 
 	/* forking complete and child started to run, tell ptracer */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8fccd8721bb8..2c605bdede47 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5180,6 +5180,7 @@  context_switch(struct rq *rq, struct task_struct *prev,
 		 * finish_task_switch()'s mmdrop().
 		 */
 		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+		lru_gen_use_mm(next->mm);
 
 		if (!prev->mm) {                        // from kernel
 			/* will mmdrop() in finish_task_switch(). */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 882180866e31..2121a9bcbb54 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6199,6 +6199,30 @@  static void mem_cgroup_move_task(void)
 }
 #endif
 
+#ifdef CONFIG_LRU_GEN
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+	struct task_struct *task;
+	struct cgroup_subsys_state *css;
+
+	/* find the first leader if there is any */
+	cgroup_taskset_for_each_leader(task, css, tset)
+		break;
+
+	if (!task)
+		return;
+
+	task_lock(task);
+	if (task->mm && task->mm->owner == task)
+		lru_gen_migrate_mm(task->mm);
+	task_unlock(task);
+}
+#else
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+}
+#endif /* CONFIG_LRU_GEN */
+
 static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
 {
 	if (value == PAGE_COUNTER_MAX)
@@ -6604,6 +6628,7 @@  struct cgroup_subsys memory_cgrp_subsys = {
 	.css_reset = mem_cgroup_css_reset,
 	.css_rstat_flush = mem_cgroup_css_rstat_flush,
 	.can_attach = mem_cgroup_can_attach,
+	.attach = mem_cgroup_attach,
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.post_attach = mem_cgroup_move_task,
 	.dfl_cftypes = memory_files,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f365386eb441..d1dfc0a77b6f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -49,6 +49,8 @@ 
 #include <linux/printk.h>
 #include <linux/dax.h>
 #include <linux/psi.h>
+#include <linux/pagewalk.h>
+#include <linux/shmem_fs.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -3082,7 +3084,7 @@  static bool can_age_anon_pages(struct pglist_data *pgdat,
 		for ((type) = 0; (type) < ANON_AND_FILE; (type)++)	\
 			for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
 
-static struct lruvec __maybe_unused *get_lruvec(struct mem_cgroup *memcg, int nid)
+static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
 {
 	struct pglist_data *pgdat = NODE_DATA(nid);
 
@@ -3127,6 +3129,372 @@  static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
 	       get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS;
 }
 
+/******************************************************************************
+ *                          mm_struct list
+ ******************************************************************************/
+
+static struct lru_gen_mm_list *get_mm_list(struct mem_cgroup *memcg)
+{
+	static struct lru_gen_mm_list mm_list = {
+		.fifo = LIST_HEAD_INIT(mm_list.fifo),
+		.lock = __SPIN_LOCK_UNLOCKED(mm_list.lock),
+	};
+
+#ifdef CONFIG_MEMCG
+	if (memcg)
+		return &memcg->mm_list;
+#endif
+	VM_WARN_ON_ONCE(!mem_cgroup_disabled());
+
+	return &mm_list;
+}
+
+void lru_gen_add_mm(struct mm_struct *mm)
+{
+	int nid;
+	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+
+	VM_WARN_ON_ONCE(!list_empty(&mm->lru_gen.list));
+#ifdef CONFIG_MEMCG
+	VM_WARN_ON_ONCE(mm->lru_gen.memcg);
+	mm->lru_gen.memcg = memcg;
+#endif
+	spin_lock(&mm_list->lock);
+
+	for_each_node_state(nid, N_MEMORY) {
+		struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+		if (!lruvec)
+			continue;
+
+		/* the first addition since the last iteration */
+		if (lruvec->mm_state.tail == &mm_list->fifo)
+			lruvec->mm_state.tail = &mm->lru_gen.list;
+	}
+
+	list_add_tail(&mm->lru_gen.list, &mm_list->fifo);
+
+	spin_unlock(&mm_list->lock);
+}
+
+void lru_gen_del_mm(struct mm_struct *mm)
+{
+	int nid;
+	struct lru_gen_mm_list *mm_list;
+	struct mem_cgroup *memcg = NULL;
+
+	if (list_empty(&mm->lru_gen.list))
+		return;
+
+#ifdef CONFIG_MEMCG
+	memcg = mm->lru_gen.memcg;
+#endif
+	mm_list = get_mm_list(memcg);
+
+	spin_lock(&mm_list->lock);
+
+	for_each_node(nid) {
+		struct lruvec *lruvec = get_lruvec(memcg, nid);
+
+		if (!lruvec)
+			continue;
+
+		/* where the last iteration ended (exclusive) */
+		if (lruvec->mm_state.tail == &mm->lru_gen.list)
+			lruvec->mm_state.tail = lruvec->mm_state.tail->next;
+
+		/* where the current iteration continues (inclusive) */
+		if (lruvec->mm_state.head != &mm->lru_gen.list)
+			continue;
+
+		lruvec->mm_state.head = lruvec->mm_state.head->next;
+		/* the deletion ends the current iteration */
+		if (lruvec->mm_state.head == &mm_list->fifo)
+			WRITE_ONCE(lruvec->mm_state.seq, lruvec->mm_state.seq + 1);
+	}
+
+	list_del_init(&mm->lru_gen.list);
+
+	spin_unlock(&mm_list->lock);
+
+#ifdef CONFIG_MEMCG
+	mem_cgroup_put(mm->lru_gen.memcg);
+	mm->lru_gen.memcg = NULL;
+#endif
+}
+
+#ifdef CONFIG_MEMCG
+void lru_gen_migrate_mm(struct mm_struct *mm)
+{
+	struct mem_cgroup *memcg;
+
+	lockdep_assert_held(&mm->owner->alloc_lock);
+
+	/* for mm_update_next_owner() */
+	if (mem_cgroup_disabled())
+		return;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+	rcu_read_unlock();
+	if (memcg == mm->lru_gen.memcg)
+		return;
+
+	VM_WARN_ON_ONCE(!mm->lru_gen.memcg);
+	VM_WARN_ON_ONCE(list_empty(&mm->lru_gen.list));
+
+	lru_gen_del_mm(mm);
+	lru_gen_add_mm(mm);
+}
+#endif
+
+/*
+ * Bloom filters with m=1<<15, k=2 and the false positive rates of ~1/5 when
+ * n=10,000 and ~1/2 when n=20,000, where, conventionally, m is the number of
+ * bits in a bitmap, k is the number of hash functions and n is the number of
+ * inserted items.
+ *
+ * Page table walkers use one of the two filters to reduce their search space.
+ * To get rid of non-leaf entries that no longer have enough leaf entries, the
+ * aging uses the double-buffering technique to flip to the other filter each
+ * time it produces a new generation. For non-leaf entries that have enough
+ * leaf entries, the aging carries them over to the next generation in
+ * walk_pmd_range(); the eviction also report them when walking the rmap
+ * in lru_gen_look_around().
+ *
+ * For future optimizations:
+ * 1. It's not necessary to keep both filters all the time. The spare one can be
+ *    freed after the RCU grace period and reallocated if needed again.
+ * 2. And when reallocating, it's worth scaling its size according to the number
+ *    of inserted entries in the other filter, to reduce the memory overhead on
+ *    small systems and false positives on large systems.
+ * 3. Jenkins' hash function is an alternative to Knuth's.
+ */
+#define BLOOM_FILTER_SHIFT	15
+
+static inline int filter_gen_from_seq(unsigned long seq)
+{
+	return seq % NR_BLOOM_FILTERS;
+}
+
+static void get_item_key(void *item, int *key)
+{
+	u32 hash = hash_ptr(item, BLOOM_FILTER_SHIFT * 2);
+
+	BUILD_BUG_ON(BLOOM_FILTER_SHIFT * 2 > BITS_PER_TYPE(u32));
+
+	key[0] = hash & (BIT(BLOOM_FILTER_SHIFT) - 1);
+	key[1] = hash >> BLOOM_FILTER_SHIFT;
+}
+
+static void reset_bloom_filter(struct lruvec *lruvec, unsigned long seq)
+{
+	unsigned long *filter;
+	int gen = filter_gen_from_seq(seq);
+
+	filter = lruvec->mm_state.filters[gen];
+	if (filter) {
+		bitmap_clear(filter, 0, BIT(BLOOM_FILTER_SHIFT));
+		return;
+	}
+
+	filter = bitmap_zalloc(BIT(BLOOM_FILTER_SHIFT),
+			       __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+	WRITE_ONCE(lruvec->mm_state.filters[gen], filter);
+}
+
+static void update_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item)
+{
+	int key[2];
+	unsigned long *filter;
+	int gen = filter_gen_from_seq(seq);
+
+	filter = READ_ONCE(lruvec->mm_state.filters[gen]);
+	if (!filter)
+		return;
+
+	get_item_key(item, key);
+
+	if (!test_bit(key[0], filter))
+		set_bit(key[0], filter);
+	if (!test_bit(key[1], filter))
+		set_bit(key[1], filter);
+}
+
+static bool test_bloom_filter(struct lruvec *lruvec, unsigned long seq, void *item)
+{
+	int key[2];
+	unsigned long *filter;
+	int gen = filter_gen_from_seq(seq);
+
+	filter = READ_ONCE(lruvec->mm_state.filters[gen]);
+	if (!filter)
+		return true;
+
+	get_item_key(item, key);
+
+	return test_bit(key[0], filter) && test_bit(key[1], filter);
+}
+
+static void reset_mm_stats(struct lruvec *lruvec, struct lru_gen_mm_walk *walk, bool last)
+{
+	int i;
+	int hist;
+
+	lockdep_assert_held(&get_mm_list(lruvec_memcg(lruvec))->lock);
+
+	if (walk) {
+		hist = lru_hist_from_seq(walk->max_seq);
+
+		for (i = 0; i < NR_MM_STATS; i++) {
+			WRITE_ONCE(lruvec->mm_state.stats[hist][i],
+				   lruvec->mm_state.stats[hist][i] + walk->mm_stats[i]);
+			walk->mm_stats[i] = 0;
+		}
+	}
+
+	if (NR_HIST_GENS > 1 && last) {
+		hist = lru_hist_from_seq(lruvec->mm_state.seq + 1);
+
+		for (i = 0; i < NR_MM_STATS; i++)
+			WRITE_ONCE(lruvec->mm_state.stats[hist][i], 0);
+	}
+}
+
+static bool should_skip_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+	int type;
+	unsigned long size = 0;
+	struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+	int key = pgdat->node_id % BITS_PER_TYPE(mm->lru_gen.bitmap);
+
+	if (!walk->force_scan && !test_bit(key, &mm->lru_gen.bitmap))
+		return true;
+
+	clear_bit(key, &mm->lru_gen.bitmap);
+
+	for (type = !walk->can_swap; type < ANON_AND_FILE; type++) {
+		size += type ? get_mm_counter(mm, MM_FILEPAGES) :
+			       get_mm_counter(mm, MM_ANONPAGES) +
+			       get_mm_counter(mm, MM_SHMEMPAGES);
+	}
+
+	if (size < MIN_LRU_BATCH)
+		return true;
+
+	if (mm_is_oom_victim(mm))
+		return true;
+
+	return !mmget_not_zero(mm);
+}
+
+static bool iterate_mm_list(struct lruvec *lruvec, struct lru_gen_mm_walk *walk,
+			    struct mm_struct **iter)
+{
+	bool first = false;
+	bool last = true;
+	struct mm_struct *mm = NULL;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+	struct lru_gen_mm_state *mm_state = &lruvec->mm_state;
+
+	/*
+	 * There are four interesting cases for this page table walker:
+	 * 1. It tries to start a new iteration of mm_list with a stale max_seq;
+	 *    there is nothing left to do.
+	 * 2. It's the first of the current generation, and it needs to reset
+	 *    the Bloom filter for the next generation.
+	 * 3. It reaches the end of mm_list, and it needs to increment
+	 *    mm_state->seq; the iteration is done.
+	 * 4. It's the last of the current generation, and it needs to reset the
+	 *    mm stats counters for the next generation.
+	 */
+	spin_lock(&mm_list->lock);
+
+	VM_WARN_ON_ONCE(mm_state->seq + 1 < walk->max_seq);
+	VM_WARN_ON_ONCE(*iter && mm_state->seq > walk->max_seq);
+	VM_WARN_ON_ONCE(*iter && !mm_state->nr_walkers);
+
+	if (walk->max_seq <= mm_state->seq) {
+		if (!*iter)
+			last = false;
+		goto done;
+	}
+
+	if (!mm_state->nr_walkers) {
+		VM_WARN_ON_ONCE(mm_state->head && mm_state->head != &mm_list->fifo);
+
+		mm_state->head = mm_list->fifo.next;
+		first = true;
+	}
+
+	while (!mm && mm_state->head != &mm_list->fifo) {
+		mm = list_entry(mm_state->head, struct mm_struct, lru_gen.list);
+
+		mm_state->head = mm_state->head->next;
+
+		/* force scan for those added after the last iteration */
+		if (!mm_state->tail || mm_state->tail == &mm->lru_gen.list) {
+			mm_state->tail = mm_state->head;
+			walk->force_scan = true;
+		}
+
+		if (should_skip_mm(mm, walk))
+			mm = NULL;
+	}
+
+	if (mm_state->head == &mm_list->fifo)
+		WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+done:
+	if (*iter && !mm)
+		mm_state->nr_walkers--;
+	if (!*iter && mm)
+		mm_state->nr_walkers++;
+
+	if (mm_state->nr_walkers)
+		last = false;
+
+	if (*iter || last)
+		reset_mm_stats(lruvec, walk, last);
+
+	spin_unlock(&mm_list->lock);
+
+	if (mm && first)
+		reset_bloom_filter(lruvec, walk->max_seq + 1);
+
+	if (*iter)
+		mmput_async(*iter);
+
+	*iter = mm;
+
+	return last;
+}
+
+static bool iterate_mm_list_nowalk(struct lruvec *lruvec, unsigned long max_seq)
+{
+	bool success = false;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	struct lru_gen_mm_list *mm_list = get_mm_list(memcg);
+	struct lru_gen_mm_state *mm_state = &lruvec->mm_state;
+
+	spin_lock(&mm_list->lock);
+
+	VM_WARN_ON_ONCE(mm_state->seq + 1 < max_seq);
+
+	if (max_seq > mm_state->seq && !mm_state->nr_walkers) {
+		VM_WARN_ON_ONCE(mm_state->head && mm_state->head != &mm_list->fifo);
+
+		WRITE_ONCE(mm_state->seq, mm_state->seq + 1);
+		reset_mm_stats(lruvec, NULL, true);
+		success = true;
+	}
+
+	spin_unlock(&mm_list->lock);
+
+	return success;
+}
+
 /******************************************************************************
  *                          refault feedback loop
  ******************************************************************************/
@@ -3277,6 +3645,118 @@  static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
 	return new_gen;
 }
 
+static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
+			      int old_gen, int new_gen)
+{
+	int type = folio_is_file_lru(folio);
+	int zone = folio_zonenum(folio);
+	int delta = folio_nr_pages(folio);
+
+	VM_WARN_ON_ONCE(old_gen >= MAX_NR_GENS);
+	VM_WARN_ON_ONCE(new_gen >= MAX_NR_GENS);
+
+	walk->batched++;
+
+	walk->nr_pages[old_gen][type][zone] -= delta;
+	walk->nr_pages[new_gen][type][zone] += delta;
+}
+
+static void reset_batch_size(struct lruvec *lruvec, struct lru_gen_mm_walk *walk)
+{
+	int gen, type, zone;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	walk->batched = 0;
+
+	for_each_gen_type_zone(gen, type, zone) {
+		enum lru_list lru = type * LRU_INACTIVE_FILE;
+		int delta = walk->nr_pages[gen][type][zone];
+
+		if (!delta)
+			continue;
+
+		walk->nr_pages[gen][type][zone] = 0;
+		WRITE_ONCE(lrugen->nr_pages[gen][type][zone],
+			   lrugen->nr_pages[gen][type][zone] + delta);
+
+		if (lru_gen_is_active(lruvec, gen))
+			lru += LRU_ACTIVE;
+		__update_lru_size(lruvec, lru, zone, delta);
+	}
+}
+
+static int should_skip_vma(unsigned long start, unsigned long end, struct mm_walk *args)
+{
+	struct address_space *mapping;
+	struct vm_area_struct *vma = args->vma;
+	struct lru_gen_mm_walk *walk = args->private;
+
+	if (!vma_is_accessible(vma))
+		return true;
+
+	if (is_vm_hugetlb_page(vma))
+		return true;
+
+	if (vma->vm_flags & (VM_LOCKED | VM_SPECIAL | VM_SEQ_READ | VM_RAND_READ))
+		return true;
+
+	if (vma == get_gate_vma(vma->vm_mm))
+		return true;
+
+	if (vma_is_anonymous(vma))
+		return !walk->can_swap;
+
+	if (WARN_ON_ONCE(!vma->vm_file || !vma->vm_file->f_mapping))
+		return true;
+
+	mapping = vma->vm_file->f_mapping;
+	if (mapping_unevictable(mapping))
+		return true;
+
+	if (shmem_mapping(mapping))
+		return !walk->can_swap;
+
+	/* to exclude special mappings like dax, etc. */
+	return !mapping->a_ops->read_folio;
+}
+
+/*
+ * Some userspace memory allocators map many single-page VMAs. Instead of
+ * returning back to the PGD table for each of such VMAs, finish an entire PMD
+ * table to reduce zigzags and improve cache performance.
+ */
+static bool get_next_vma(unsigned long mask, unsigned long size, struct mm_walk *args,
+			 unsigned long *vm_start, unsigned long *vm_end)
+{
+	unsigned long start = round_up(*vm_end, size);
+	unsigned long end = (start | ~mask) + 1;
+
+	VM_WARN_ON_ONCE(mask & size);
+	VM_WARN_ON_ONCE((start & mask) != (*vm_start & mask));
+
+	while (args->vma) {
+		if (start >= args->vma->vm_end) {
+			args->vma = args->vma->vm_next;
+			continue;
+		}
+
+		if (end && end <= args->vma->vm_start)
+			return false;
+
+		if (should_skip_vma(args->vma->vm_start, args->vma->vm_end, args)) {
+			args->vma = args->vma->vm_next;
+			continue;
+		}
+
+		*vm_start = max(start, args->vma->vm_start);
+		*vm_end = min(end - 1, args->vma->vm_end - 1) + 1;
+
+		return true;
+	}
+
+	return false;
+}
+
 static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr)
 {
 	unsigned long pfn = pte_pfn(pte);
@@ -3295,8 +3775,28 @@  static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned
 	return pfn;
 }
 
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
+static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr)
+{
+	unsigned long pfn = pmd_pfn(pmd);
+
+	VM_WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end);
+
+	if (!pmd_present(pmd) || is_huge_zero_pmd(pmd))
+		return -1;
+
+	if (WARN_ON_ONCE(pmd_devmap(pmd)))
+		return -1;
+
+	if (WARN_ON_ONCE(!pfn_valid(pfn)))
+		return -1;
+
+	return pfn;
+}
+#endif
+
 static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
-				   struct pglist_data *pgdat)
+				   struct pglist_data *pgdat, bool can_swap)
 {
 	struct folio *folio;
 
@@ -3311,9 +3811,378 @@  static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
 	if (folio_memcg_rcu(folio) != memcg)
 		return NULL;
 
+	/* file VMAs can contain anon pages from COW */
+	if (!folio_is_file_lru(folio) && !can_swap)
+		return NULL;
+
 	return folio;
 }
 
+static bool suitable_to_scan(int total, int young)
+{
+	int n = clamp_t(int, cache_line_size() / sizeof(pte_t), 2, 8);
+
+	/* suitable if the average number of young PTEs per cacheline is >=1 */
+	return young * n >= total;
+}
+
+static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
+			   struct mm_walk *args)
+{
+	int i;
+	pte_t *pte;
+	spinlock_t *ptl;
+	unsigned long addr;
+	int total = 0;
+	int young = 0;
+	struct lru_gen_mm_walk *walk = args->private;
+	struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+	int old_gen, new_gen = lru_gen_from_seq(walk->max_seq);
+
+	VM_WARN_ON_ONCE(pmd_leaf(*pmd));
+
+	ptl = pte_lockptr(args->mm, pmd);
+	if (!spin_trylock(ptl))
+		return false;
+
+	arch_enter_lazy_mmu_mode();
+
+	pte = pte_offset_map(pmd, start & PMD_MASK);
+restart:
+	for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
+		unsigned long pfn;
+		struct folio *folio;
+
+		total++;
+		walk->mm_stats[MM_LEAF_TOTAL]++;
+
+		pfn = get_pte_pfn(pte[i], args->vma, addr);
+		if (pfn == -1)
+			continue;
+
+		if (!pte_young(pte[i])) {
+			walk->mm_stats[MM_LEAF_OLD]++;
+			continue;
+		}
+
+		folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
+		if (!folio)
+			continue;
+
+		if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
+			continue;
+
+		young++;
+		walk->mm_stats[MM_LEAF_YOUNG]++;
+
+		if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
+		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+		      !folio_test_swapcache(folio)))
+			folio_mark_dirty(folio);
+
+		old_gen = folio_update_gen(folio, new_gen);
+		if (old_gen >= 0 && old_gen != new_gen)
+			update_batch_size(walk, folio, old_gen, new_gen);
+	}
+
+	if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
+		goto restart;
+
+	pte_unmap(pte);
+
+	arch_leave_lazy_mmu_mode();
+	spin_unlock(ptl);
+
+	return suitable_to_scan(total, young);
+}
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
+static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma,
+				  struct mm_walk *args, unsigned long *bitmap, unsigned long *start)
+{
+	int i;
+	pmd_t *pmd;
+	spinlock_t *ptl;
+	struct lru_gen_mm_walk *walk = args->private;
+	struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec);
+	struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+	int old_gen, new_gen = lru_gen_from_seq(walk->max_seq);
+
+	VM_WARN_ON_ONCE(pud_leaf(*pud));
+
+	/* try to batch at most 1+MIN_LRU_BATCH+1 entries */
+	if (*start == -1) {
+		*start = next;
+		return;
+	}
+
+	i = next == -1 ? 0 : pmd_index(next) - pmd_index(*start);
+	if (i && i <= MIN_LRU_BATCH) {
+		__set_bit(i - 1, bitmap);
+		return;
+	}
+
+	pmd = pmd_offset(pud, *start);
+
+	ptl = pmd_lockptr(args->mm, pmd);
+	if (!spin_trylock(ptl))
+		goto done;
+
+	arch_enter_lazy_mmu_mode();
+
+	do {
+		unsigned long pfn;
+		struct folio *folio;
+		unsigned long addr = i ? (*start & PMD_MASK) + i * PMD_SIZE : *start;
+
+		pfn = get_pmd_pfn(pmd[i], vma, addr);
+		if (pfn == -1)
+			goto next;
+
+		if (!pmd_trans_huge(pmd[i])) {
+			if (IS_ENABLED(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG))
+				pmdp_test_and_clear_young(vma, addr, pmd + i);
+			goto next;
+		}
+
+		folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
+		if (!folio)
+			goto next;
+
+		if (!pmdp_test_and_clear_young(vma, addr, pmd + i))
+			goto next;
+
+		walk->mm_stats[MM_LEAF_YOUNG]++;
+
+		if (pmd_dirty(pmd[i]) && !folio_test_dirty(folio) &&
+		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+		      !folio_test_swapcache(folio)))
+			folio_mark_dirty(folio);
+
+		old_gen = folio_update_gen(folio, new_gen);
+		if (old_gen >= 0 && old_gen != new_gen)
+			update_batch_size(walk, folio, old_gen, new_gen);
+next:
+		i = i > MIN_LRU_BATCH ? 0 : find_next_bit(bitmap, MIN_LRU_BATCH, i) + 1;
+	} while (i <= MIN_LRU_BATCH);
+
+	arch_leave_lazy_mmu_mode();
+	spin_unlock(ptl);
+done:
+	*start = -1;
+	bitmap_zero(bitmap, MIN_LRU_BATCH);
+}
+#else
+static void walk_pmd_range_locked(pud_t *pud, unsigned long next, struct vm_area_struct *vma,
+				  struct mm_walk *args, unsigned long *bitmap, unsigned long *start)
+{
+}
+#endif
+
+static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
+			   struct mm_walk *args)
+{
+	int i;
+	pmd_t *pmd;
+	unsigned long next;
+	unsigned long addr;
+	struct vm_area_struct *vma;
+	unsigned long pos = -1;
+	struct lru_gen_mm_walk *walk = args->private;
+	unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {};
+
+	VM_WARN_ON_ONCE(pud_leaf(*pud));
+
+	/*
+	 * Finish an entire PMD in two passes: the first only reaches to PTE
+	 * tables to avoid taking the PMD lock; the second, if necessary, takes
+	 * the PMD lock to clear the accessed bit in PMD entries.
+	 */
+	pmd = pmd_offset(pud, start & PUD_MASK);
+restart:
+	/* walk_pte_range() may call get_next_vma() */
+	vma = args->vma;
+	for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
+		pmd_t val = pmd_read_atomic(pmd + i);
+
+		/* for pmd_read_atomic() */
+		barrier();
+
+		next = pmd_addr_end(addr, end);
+
+		if (!pmd_present(val) || is_huge_zero_pmd(val)) {
+			walk->mm_stats[MM_LEAF_TOTAL]++;
+			continue;
+		}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		if (pmd_trans_huge(val)) {
+			unsigned long pfn = pmd_pfn(val);
+			struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+
+			walk->mm_stats[MM_LEAF_TOTAL]++;
+
+			if (!pmd_young(val)) {
+				walk->mm_stats[MM_LEAF_OLD]++;
+				continue;
+			}
+
+			/* try to avoid unnecessary memory loads */
+			if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+				continue;
+
+			walk_pmd_range_locked(pud, addr, vma, args, bitmap, &pos);
+			continue;
+		}
+#endif
+		walk->mm_stats[MM_NONLEAF_TOTAL]++;
+
+#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
+		if (!pmd_young(val))
+			continue;
+
+		walk_pmd_range_locked(pud, addr, vma, args, bitmap, &pos);
+#endif
+		if (!walk->force_scan && !test_bloom_filter(walk->lruvec, walk->max_seq, pmd + i))
+			continue;
+
+		walk->mm_stats[MM_NONLEAF_FOUND]++;
+
+		if (!walk_pte_range(&val, addr, next, args))
+			continue;
+
+		walk->mm_stats[MM_NONLEAF_ADDED]++;
+
+		/* carry over to the next generation */
+		update_bloom_filter(walk->lruvec, walk->max_seq + 1, pmd + i);
+	}
+
+	walk_pmd_range_locked(pud, -1, vma, args, bitmap, &pos);
+
+	if (i < PTRS_PER_PMD && get_next_vma(PUD_MASK, PMD_SIZE, args, &start, &end))
+		goto restart;
+}
+
+static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
+			  struct mm_walk *args)
+{
+	int i;
+	pud_t *pud;
+	unsigned long addr;
+	unsigned long next;
+	struct lru_gen_mm_walk *walk = args->private;
+
+	VM_WARN_ON_ONCE(p4d_leaf(*p4d));
+
+	pud = pud_offset(p4d, start & P4D_MASK);
+restart:
+	for (i = pud_index(start), addr = start; addr != end; i++, addr = next) {
+		pud_t val = READ_ONCE(pud[i]);
+
+		next = pud_addr_end(addr, end);
+
+		if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val)))
+			continue;
+
+		walk_pmd_range(&val, addr, next, args);
+
+		if (mm_is_oom_victim(args->mm))
+			return 1;
+
+		/* a racy check to curtail the waiting time */
+		if (wq_has_sleeper(&walk->lruvec->mm_state.wait))
+			return 1;
+
+		if (need_resched() || walk->batched >= MAX_LRU_BATCH) {
+			end = (addr | ~PUD_MASK) + 1;
+			goto done;
+		}
+	}
+
+	if (i < PTRS_PER_PUD && get_next_vma(P4D_MASK, PUD_SIZE, args, &start, &end))
+		goto restart;
+
+	end = round_up(end, P4D_SIZE);
+done:
+	if (!end || !args->vma)
+		return 1;
+
+	walk->next_addr = max(end, args->vma->vm_start);
+
+	return -EAGAIN;
+}
+
+static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+	static const struct mm_walk_ops mm_walk_ops = {
+		.test_walk = should_skip_vma,
+		.p4d_entry = walk_pud_range,
+	};
+
+	int err;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	walk->next_addr = FIRST_USER_ADDRESS;
+
+	do {
+		err = -EBUSY;
+
+		/* folio_update_gen() requires stable folio_memcg() */
+		if (!mem_cgroup_trylock_pages(memcg))
+			break;
+
+		/* the caller might be holding the lock for write */
+		if (mmap_read_trylock(mm)) {
+			err = walk_page_range(mm, walk->next_addr, ULONG_MAX, &mm_walk_ops, walk);
+
+			mmap_read_unlock(mm);
+		}
+
+		mem_cgroup_unlock_pages();
+
+		if (walk->batched) {
+			spin_lock_irq(&lruvec->lru_lock);
+			reset_batch_size(lruvec, walk);
+			spin_unlock_irq(&lruvec->lru_lock);
+		}
+
+		cond_resched();
+	} while (err == -EAGAIN);
+}
+
+static struct lru_gen_mm_walk *set_mm_walk(struct pglist_data *pgdat)
+{
+	struct lru_gen_mm_walk *walk = current->reclaim_state->mm_walk;
+
+	if (pgdat && current_is_kswapd()) {
+		VM_WARN_ON_ONCE(walk);
+
+		walk = &pgdat->mm_walk;
+	} else if (!pgdat && !walk) {
+		VM_WARN_ON_ONCE(current_is_kswapd());
+
+		walk = kzalloc(sizeof(*walk), __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+	}
+
+	current->reclaim_state->mm_walk = walk;
+
+	return walk;
+}
+
+static void clear_mm_walk(void)
+{
+	struct lru_gen_mm_walk *walk = current->reclaim_state->mm_walk;
+
+	VM_WARN_ON_ONCE(walk && memchr_inv(walk->nr_pages, 0, sizeof(walk->nr_pages)));
+	VM_WARN_ON_ONCE(walk && memchr_inv(walk->mm_stats, 0, sizeof(walk->mm_stats)));
+
+	current->reclaim_state->mm_walk = NULL;
+
+	if (!current_is_kswapd())
+		kfree(walk);
+}
+
 static void inc_min_seq(struct lruvec *lruvec, int type)
 {
 	struct lru_gen_struct *lrugen = &lruvec->lrugen;
@@ -3365,7 +4234,7 @@  static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
 	return success;
 }
 
-static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_swap)
+static void inc_max_seq(struct lruvec *lruvec, bool can_swap)
 {
 	int prev, next;
 	int type, zone;
@@ -3375,9 +4244,6 @@  static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_s
 
 	VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
 
-	if (max_seq != lrugen->max_seq)
-		goto unlock;
-
 	for (type = 0; type < ANON_AND_FILE; type++) {
 		if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
 			continue;
@@ -3415,10 +4281,76 @@  static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_s
 
 	/* make sure preceding modifications appear */
 	smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
-unlock:
+
 	spin_unlock_irq(&lruvec->lru_lock);
 }
 
+static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
+			       struct scan_control *sc, bool can_swap)
+{
+	bool success;
+	struct lru_gen_mm_walk *walk;
+	struct mm_struct *mm = NULL;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	VM_WARN_ON_ONCE(max_seq > READ_ONCE(lrugen->max_seq));
+
+	/* see the comment in iterate_mm_list() */
+	if (max_seq <= READ_ONCE(lruvec->mm_state.seq)) {
+		success = false;
+		goto done;
+	}
+
+	/*
+	 * If the hardware doesn't automatically set the accessed bit, fallback
+	 * to lru_gen_look_around(), which only clears the accessed bit in a
+	 * handful of PTEs. Spreading the work out over a period of time usually
+	 * is less efficient, but it avoids bursty page faults.
+	 */
+	if (!arch_has_hw_pte_young()) {
+		success = iterate_mm_list_nowalk(lruvec, max_seq);
+		goto done;
+	}
+
+	walk = set_mm_walk(NULL);
+	if (!walk) {
+		success = iterate_mm_list_nowalk(lruvec, max_seq);
+		goto done;
+	}
+
+	walk->lruvec = lruvec;
+	walk->max_seq = max_seq;
+	walk->can_swap = can_swap;
+	walk->force_scan = false;
+
+	do {
+		success = iterate_mm_list(lruvec, walk, &mm);
+		if (mm)
+			walk_mm(lruvec, mm, walk);
+
+		cond_resched();
+	} while (mm);
+done:
+	if (!success) {
+		if (sc->priority < DEF_PRIORITY - 2)
+			wait_event_killable(lruvec->mm_state.wait,
+					    max_seq < READ_ONCE(lrugen->max_seq));
+
+		return max_seq < READ_ONCE(lrugen->max_seq);
+	}
+
+	VM_WARN_ON_ONCE(max_seq != READ_ONCE(lrugen->max_seq));
+
+	inc_max_seq(lruvec, can_swap);
+	/* either this sees any waiters or they will see updated max_seq */
+	if (wq_has_sleeper(&lruvec->mm_state.wait))
+		wake_up_all(&lruvec->mm_state.wait);
+
+	wakeup_flusher_threads(WB_REASON_VMSCAN);
+
+	return true;
+}
+
 static unsigned long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
 				      unsigned long *min_seq, bool can_swap, bool *need_aging)
 {
@@ -3496,7 +4428,7 @@  static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	nr_to_scan >>= mem_cgroup_online(memcg) ? sc->priority : 0;
 
 	if (nr_to_scan && need_aging)
-		inc_max_seq(lruvec, max_seq, swappiness);
+		try_to_inc_max_seq(lruvec, max_seq, sc, swappiness);
 }
 
 static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
@@ -3505,6 +4437,8 @@  static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 
 	VM_WARN_ON_ONCE(!current_is_kswapd());
 
+	set_mm_walk(pgdat);
+
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
@@ -3513,11 +4447,16 @@  static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 
 		cond_resched();
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+	clear_mm_walk();
 }
 
 /*
  * This function exploits spatial locality when shrink_page_list() walks the
- * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages.
+ * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages. If
+ * the scan was done cacheline efficiently, it adds the PMD entry pointing to
+ * the PTE table to the Bloom filter. This forms a feedback loop between the
+ * eviction and the aging.
  */
 void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 {
@@ -3526,6 +4465,8 @@  void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	unsigned long start;
 	unsigned long end;
 	unsigned long addr;
+	struct lru_gen_mm_walk *walk;
+	int young = 0;
 	unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {};
 	struct folio *folio = pfn_folio(pvmw->pfn);
 	struct mem_cgroup *memcg = folio_memcg(folio);
@@ -3555,6 +4496,7 @@  void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	}
 
 	pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE;
+	walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
 
 	rcu_read_lock();
 	arch_enter_lazy_mmu_mode();
@@ -3569,13 +4511,15 @@  void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 		if (!pte_young(pte[i]))
 			continue;
 
-		folio = get_pfn_folio(pfn, memcg, pgdat);
+		folio = get_pfn_folio(pfn, memcg, pgdat, !walk || walk->can_swap);
 		if (!folio)
 			continue;
 
 		if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
 			continue;
 
+		young++;
+
 		if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
 		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
 		      !folio_test_swapcache(folio)))
@@ -3591,7 +4535,11 @@  void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	arch_leave_lazy_mmu_mode();
 	rcu_read_unlock();
 
-	if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
+	/* feedback from rmap walkers to page table walkers */
+	if (suitable_to_scan(i, young))
+		update_bloom_filter(lruvec, max_seq, pvmw->pmd);
+
+	if (!walk && bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
 		for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
 			folio = pfn_folio(pte_pfn(pte[i]));
 			folio_activate(folio);
@@ -3603,8 +4551,10 @@  void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 	if (!mem_cgroup_trylock_pages(memcg))
 		return;
 
-	spin_lock_irq(&lruvec->lru_lock);
-	new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
+	if (!walk) {
+		spin_lock_irq(&lruvec->lru_lock);
+		new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
+	}
 
 	for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
 		folio = pfn_folio(pte_pfn(pte[i]));
@@ -3615,10 +4565,14 @@  void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 		if (old_gen < 0 || old_gen == new_gen)
 			continue;
 
-		lru_gen_update_size(lruvec, folio, old_gen, new_gen);
+		if (walk)
+			update_batch_size(walk, folio, old_gen, new_gen);
+		else
+			lru_gen_update_size(lruvec, folio, old_gen, new_gen);
 	}
 
-	spin_unlock_irq(&lruvec->lru_lock);
+	if (!walk)
+		spin_unlock_irq(&lruvec->lru_lock);
 
 	mem_cgroup_unlock_pages();
 }
@@ -3901,6 +4855,7 @@  static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 	struct folio *folio;
 	enum vm_event_item item;
 	struct reclaim_stat stat;
+	struct lru_gen_mm_walk *walk;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
@@ -3937,6 +4892,10 @@  static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 
 	move_pages_to_lru(lruvec, &list);
 
+	walk = current->reclaim_state->mm_walk;
+	if (walk && walk->batched)
+		reset_batch_size(lruvec, walk);
+
 	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, reclaimed);
@@ -3953,6 +4912,11 @@  static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 	return scanned;
 }
 
+/*
+ * For future optimizations:
+ * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
+ *    reclaim.
+ */
 static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
 				    bool can_swap, unsigned long reclaimed)
 {
@@ -3999,7 +4963,8 @@  static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *
 	if (current_is_kswapd())
 		return 0;
 
-	inc_max_seq(lruvec, max_seq, can_swap);
+	if (try_to_inc_max_seq(lruvec, max_seq, sc, can_swap))
+		return nr_to_scan;
 done:
 	return min_seq[!can_swap] + MIN_NR_GENS <= max_seq ? nr_to_scan : 0;
 }
@@ -4014,6 +4979,8 @@  static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 
 	blk_start_plug(&plug);
 
+	set_mm_walk(lruvec_pgdat(lruvec));
+
 	while (true) {
 		int delta;
 		int swappiness;
@@ -4041,6 +5008,8 @@  static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 		cond_resched();
 	}
 
+	clear_mm_walk();
+
 	blk_finish_plug(&plug);
 }
 
@@ -4057,15 +5026,21 @@  void lru_gen_init_lruvec(struct lruvec *lruvec)
 
 	for_each_gen_type_zone(gen, type, zone)
 		INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
+
+	lruvec->mm_state.seq = MIN_NR_GENS;
+	init_waitqueue_head(&lruvec->mm_state.wait);
 }
 
 #ifdef CONFIG_MEMCG
 void lru_gen_init_memcg(struct mem_cgroup *memcg)
 {
+	INIT_LIST_HEAD(&memcg->mm_list.fifo);
+	spin_lock_init(&memcg->mm_list.lock);
 }
 
 void lru_gen_exit_memcg(struct mem_cgroup *memcg)
 {
+	int i;
 	int nid;
 
 	for_each_node(nid) {
@@ -4073,6 +5048,11 @@  void lru_gen_exit_memcg(struct mem_cgroup *memcg)
 
 		VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
 					   sizeof(lruvec->lrugen.nr_pages)));
+
+		for (i = 0; i < NR_BLOOM_FILTERS; i++) {
+			bitmap_free(lruvec->mm_state.filters[i]);
+			lruvec->mm_state.filters[i] = NULL;
+		}
 	}
 }
 #endif