mbox series

[v10,00/14] Multi-Gen LRU Framework

Message ID 20220407031525.2368067-1-yuzhao@google.com (mailing list archive)
Headers show
Series Multi-Gen LRU Framework | expand

Message

Yu Zhao April 7, 2022, 3:15 a.m. UTC
TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and
straightforward.

Patchset overview
=================
The design and implementation overview is in patch 14:
https://lore.kernel.org/r/20220407031525.2368067-15-yuzhao@google.com/

01. mm: x86, arm64: add arch_has_hw_pte_young()
02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
Take advantage of hardware features when trying to clear the accessed
bit in many PTEs.

03. mm/vmscan.c: refactor shrink_node()
04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
    its sole caller"
Minor refactors to improve readability for the following patches.

05. mm: multi-gen LRU: groundwork
Adds the basic data structure and the functions that insert pages to
and remove pages from the multi-gen LRU (MGLRU) lists.

06. mm: multi-gen LRU: minimal implementation
A minimal implementation without optimizations.

07. mm: multi-gen LRU: exploit locality in rmap
Exploits spatial locality to improve efficiency when using the rmap.

08. mm: multi-gen LRU: support page table walks
Further exploits spatial locality by optionally scanning page tables.

09. mm: multi-gen LRU: optimize multiple memcgs
Optimizes the overall performance for multiple memcgs running mixed
types of workloads.

10. mm: multi-gen LRU: kill switch
Adds a kill switch to enable or disable MGLRU at runtime.

11. mm: multi-gen LRU: thrashing prevention
12. mm: multi-gen LRU: debugfs interface
Provide userspace with features like thrashing prevention, working set
estimation and proactive reclaim.

13. mm: multi-gen LRU: admin guide
14. mm: multi-gen LRU: design doc
Add an admin guide and a design doc.

Benchmark results
=================
Independent lab results
-----------------------
Based on the popularity of searches [01] and the memory usage in
Google's public cloud, the most popular open-source memory-hungry
applications, in alphabetical order, are:
      Apache Cassandra      Memcached
      Apache Hadoop         MongoDB
      Apache Spark          PostgreSQL
      MariaDB (MySQL)       Redis

An independent lab evaluated MGLRU with the most widely used benchmark
suites for the above applications. They posted 960 data points along
with kernel metrics and perf profiles collected over more than 500
hours of total benchmark time. Their final reports show that, with 95%
confidence intervals (CIs), the above applications all performed
significantly better for at least part of their benchmark matrices.

On 5.14:
1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
   less wall time to sort three billion random integers, respectively,
   under the medium- and the high-concurrency conditions, when
   overcommitting memory. There were no statistically significant
   changes in wall time for the rest of the benchmark matrix.
2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
   more transactions per minute (TPM), respectively, under the medium-
   and the high-concurrency conditions, when overcommitting memory.
   There were no statistically significant changes in TPM for the rest
   of the benchmark matrix.
3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
   and [21.59, 30.02]% more operations per second (OPS), respectively,
   for sequential access, random access and Gaussian (distribution)
   access, when THP=always; 95% CIs [13.85, 15.97]% and
   [23.94, 29.92]% more OPS, respectively, for random access and
   Gaussian access, when THP=never. There were no statistically
   significant changes in OPS for the rest of the benchmark matrix.
4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
   [2.16, 3.55]% more operations per second (OPS), respectively, for
   exponential (distribution) access, random access and Zipfian
   (distribution) access, when underutilizing memory; 95% CIs
   [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
   respectively, for exponential access, random access and Zipfian
   access, when overcommitting memory.

On 5.15:
5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
   and [4.11, 7.50]% more operations per second (OPS), respectively,
   for exponential (distribution) access, random access and Zipfian
   (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
   [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
   exponential access, random access and Zipfian access, when swap was
   on.
6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
   less average wall time to finish twelve parallel TeraSort jobs,
   respectively, under the medium- and the high-concurrency
   conditions, when swap was on. There were no statistically
   significant changes in average wall time for the rest of the
   benchmark matrix.
7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
   minute (TPM) under the high-concurrency condition, when swap was
   off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
   respectively, under the medium- and the high-concurrency
   conditions, when swap was on. There were no statistically
   significant changes in TPM for the rest of the benchmark matrix.
8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
   [11.47, 19.36]% more total operations per second (OPS),
   respectively, for sequential access, random access and Gaussian
   (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
   [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
   for sequential access, random access and Gaussian access, when
   THP=never.

Our lab results
---------------
To supplement the above results, we ran the following benchmark suites
on 5.16-rc7 and found no regressions [10]. (These synthetic benchmarks
are popular among MM developers, but we prefer large-scale A/B
experiments to validate improvements.)
      fs_fio_bench_hdd_mq      pft
      fs_lmbench               pgsql-hammerdb
      fs_parallelio            redis
      fs_postmark              stream
      hackbench                sysbenchthread
      kernbench                tpcc_spark
      memcached                unixbench
      multichase               vm-scalability
      mutilate                 will-it-scale
      nginx

[01] https://trends.google.com
[02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
[03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
[04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
[05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
[06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
[07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
[08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
[09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
[10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/

Read-world applications
=======================
Third-party testimonials
------------------------
Konstantin reported [11]:
   I have Archlinux with 8G RAM + zswap + swap. While developing, I
   have lots of apps opened such as multiple LSP-servers for different
   langs, chats, two browsers, etc... Usually, my system gets quickly
   to a point of SWAP-storms, where I have to kill LSP-servers,
   restart browsers to free memory, etc, otherwise the system lags
   heavily and is barely usable.
   
   1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
   patchset, and I started up by opening lots of apps to create memory
   pressure, and worked for a day like this. Till now I had not a
   single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
   getting to the point of 3G in SWAP before without a single
   SWAP-storm.

Vaibhav from IBM reported [12]:
   In a synthetic MongoDB Benchmark, seeing an average of ~19%
   throughput improvement on POWER10(Radix MMU + 64K Page Size) with
   MGLRU patches on top of v5.16 kernel for MongoDB + YCSB across
   three different request distributions, namely, Exponential, Uniform
   and Zipfan.

Shuang from U of Rochester reported [13]:
   With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
   and [9.26, 10.36]% higher throughput, respectively, for random
   access, Zipfian (distribution) access and Gaussian (distribution)
   access, when the average number of jobs per CPU is 1; 95% CIs
   [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
   throughput, respectively, for random access, Zipfian access and
   Gaussian access, when the average number of jobs per CPU is 2.

Daniel from Michigan Tech reported [14]:
   With Memcached allocating ~100GB of byte-addressable Optante,
   performance improvement in terms of throughput (measured as queries
   per second) was about 10% for a series of workloads.

Large-scale deployments
-----------------------
The downstream kernels that have been using MGLRU include:
1. Android ARCVM [15]
2. Arch Linux Zen [16]
3. Chrome OS [17]
4. Liquorix [18]
5. post-factum [19]
6. XanMod [20]

We've rolled out MGLRU to tens of millions of Chrome OS users and
about a million Android users. Google's fleetwide profiling [21] shows
an overall 40% decrease in kswapd CPU usage, in addition to
improvements in other UX metrics, e.g., an 85% decrease in the number
of low-memory kills at the 75th percentile and an 18% decrease in
app launch time at the 50th percentile.

[11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
[12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
[13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
[14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
[15] https://chromium.googlesource.com/chromiumos/third_party/kernel/
[16] https://archlinux.org
[17] https://chromium.org
[18] https://liquorix.net
[19] https://gitlab.com/post-factum/pf-kernel/
[20] https://xanmod.org
[21] https://research.google/pubs/pub44271/

Summery
=======
The facts are:
1. The independent lab results and the real-world applications
   indicate substantial improvements; there are no known regressions.
2. Thrashing prevention, working set estimation and proactive reclaim
   work out of the box; there are no equivalent solutions.
3. There is a lot of new code; nobody has demonstrated smaller changes
   with similar effects.

Our options, accordingly, are:
1. Given the amount of evidence, the reported improvements will likely
   materialize for a wide range of workloads.
2. Gauging the interest from the past discussions [22][23][24], the
   new features will likely be put to use for both personal computers
   and data centers.
3. Based on Google's track record, the new code will likely be well
   maintained in the long term. It'd be more difficult if not
   impossible to achieve similar effects on top of the current
   active/inactive LRU.

[22] https://lore.kernel.org/r/20201005081313.732745-1-andrea.righi@canonical.com/
[23] https://lore.kernel.org/r/20210716081449.22187-1-sj38.park@gmail.com/
[24] https://lore.kernel.org/r/20211130201652.2218636d@mail.inbox.lv/

Yu Zhao (14):
  mm: x86, arm64: add arch_has_hw_pte_young()
  mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
  mm/vmscan.c: refactor shrink_node()
  Revert "include/linux/mm_inline.h: fold __update_lru_size() into its
    sole caller"
  mm: multi-gen LRU: groundwork
  mm: multi-gen LRU: minimal implementation
  mm: multi-gen LRU: exploit locality in rmap
  mm: multi-gen LRU: support page table walks
  mm: multi-gen LRU: optimize multiple memcgs
  mm: multi-gen LRU: kill switch
  mm: multi-gen LRU: thrashing prevention
  mm: multi-gen LRU: debugfs interface
  mm: multi-gen LRU: admin guide
  mm: multi-gen LRU: design doc

 Documentation/admin-guide/mm/index.rst        |    1 +
 Documentation/admin-guide/mm/multigen_lru.rst |  152 +
 Documentation/vm/index.rst                    |    1 +
 Documentation/vm/multigen_lru.rst             |  160 +
 arch/Kconfig                                  |    9 +
 arch/arm64/include/asm/pgtable.h              |   14 +-
 arch/x86/Kconfig                              |    1 +
 arch/x86/include/asm/pgtable.h                |    9 +-
 arch/x86/mm/pgtable.c                         |    5 +-
 fs/exec.c                                     |    2 +
 fs/fuse/dev.c                                 |    3 +-
 include/linux/cgroup.h                        |   15 +-
 include/linux/memcontrol.h                    |   36 +
 include/linux/mm.h                            |    7 +
 include/linux/mm_inline.h                     |  217 +-
 include/linux/mm_types.h                      |   78 +
 include/linux/mmzone.h                        |  211 ++
 include/linux/nodemask.h                      |    1 +
 include/linux/page-flags-layout.h             |   11 +-
 include/linux/page-flags.h                    |    4 +-
 include/linux/pgtable.h                       |   17 +-
 include/linux/sched.h                         |    4 +
 include/linux/swap.h                          |    4 +
 kernel/bounds.c                               |    7 +
 kernel/cgroup/cgroup-internal.h               |    1 -
 kernel/exit.c                                 |    1 +
 kernel/fork.c                                 |    9 +
 kernel/sched/core.c                           |    1 +
 mm/Kconfig                                    |   26 +
 mm/huge_memory.c                              |    3 +-
 mm/internal.h                                 |    1 +
 mm/memcontrol.c                               |   27 +
 mm/memory.c                                   |   39 +-
 mm/mm_init.c                                  |    6 +-
 mm/mmzone.c                                   |    2 +
 mm/rmap.c                                     |    7 +
 mm/swap.c                                     |   55 +-
 mm/vmscan.c                                   | 2856 ++++++++++++++++-
 mm/workingset.c                               |  119 +-
 39 files changed, 3975 insertions(+), 147 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst
 create mode 100644 Documentation/vm/multigen_lru.rst

Comments

Yu Zhao April 7, 2022, 3:24 a.m. UTC | #1
On Wed, Apr 6, 2022 at 9:15 PM Yu Zhao <yuzhao@google.com> wrote:
>
> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and it
> often makes poor choices about what to evict. This patchset offers an
> alternative solution that is performant, versatile and
> straightforward.

<snipped>

> Summery
> =======
> The facts are:
> 1. The independent lab results and the real-world applications
>    indicate substantial improvements; there are no known regressions.
> 2. Thrashing prevention, working set estimation and proactive reclaim
>    work out of the box; there are no equivalent solutions.
> 3. There is a lot of new code; nobody has demonstrated smaller changes
>    with similar effects.
>
> Our options, accordingly, are:
> 1. Given the amount of evidence, the reported improvements will likely
>    materialize for a wide range of workloads.
> 2. Gauging the interest from the past discussions [22][23][24], the
>    new features will likely be put to use for both personal computers
>    and data centers.
> 3. Based on Google's track record, the new code will likely be well
>    maintained in the long term. It'd be more difficult if not
>    impossible to achieve similar effects on top of the current
>    active/inactive LRU.

Hi Stephen,

Can you please include this patchset in linux-next? Git repo for you to fetch:

https://linux-mm.googlesource.com/mglru for-linux-next

My goal is to get additional test coverage before I send a pull
request for 5.19 to Linus.

I've explored all avenues, but ultimately I've failed to rally
substantial support from the MM stakeholders [1]. There are no pending
technical issues against this patchset [2]. What is more concerning
are the fundamental disagreements on priorities, methodologies, etc.
that are not specific to this patchset and have been hindering our
progress as a collective. (Cheers to the mutual dissatisfaction.)

While we plan to discuss those issues during the LSFMM next month, it
doesn't seem reasonable to leave this patchset hanging in the air,
since it has reached its maturity a while ago and there are strong
demands from downstream kernels as well as a large user base. Thus I
sent that pull request to Linus a couple of weeks ago, implying that
he would have to make the final decision soon.

I hope this gives enough background about what's been going on with
this patchset. If you decide to take it and it causes you any
troubles, please feel free to yell at me.

Thanks!

[1] https://lore.kernel.org/r/20220104202227.2903605-1-yuzhao@google.com/
[2] https://lore.kernel.org/r/20220326010003.3155137-1-yuzhao@google.com/
Stephen Rothwell April 7, 2022, 8:31 a.m. UTC | #2
Hi Yu,

On Wed, 6 Apr 2022 21:24:27 -0600 Yu Zhao <yuzhao@google.com> wrote:
>
> Can you please include this patchset in linux-next? Git repo for you to fetch:
> 
> https://linux-mm.googlesource.com/mglru for-linux-next

I get a message saying "This repository is empty. Push to it to show
branches and history." :-(

> My goal is to get additional test coverage before I send a pull
> request for 5.19 to Linus.

Good idea :-)

> I've explored all avenues, but ultimately I've failed to rally
> substantial support from the MM stakeholders [1]. There are no pending
> technical issues against this patchset [2]. What is more concerning
> are the fundamental disagreements on priorities, methodologies, etc.
> that are not specific to this patchset and have been hindering our
> progress as a collective. (Cheers to the mutual dissatisfaction.)

I have not been following the discussion as I am not an mm person, but
this is not a good sign.

> While we plan to discuss those issues during the LSFMM next month, it
> doesn't seem reasonable to leave this patchset hanging in the air,
> since it has reached its maturity a while ago and there are strong
> demands from downstream kernels as well as a large user base. Thus I
> sent that pull request to Linus a couple of weeks ago, implying that
> he would have to make the final decision soon.
> 
> I hope this gives enough background about what's been going on with
> this patchset. If you decide to take it and it causes you any
> troubles, please feel free to yell at me.
> 
> Thanks!
> 
> [1] https://lore.kernel.org/r/20220104202227.2903605-1-yuzhao@google.com/
> [2] https://lore.kernel.org/r/20220326010003.3155137-1-yuzhao@google.com/

I had a look at those threads and I guess things are better that your
comment above implies.

So, a couple of questions:

Have you done a trial merge with a current linux-next tree to see what
sort of mess/pain we may already be in?

Is it all stable enough now that it could be sent as a patch series for
Andrew to include in mmotm (with perhaps just smallish followup patches)?
Yu Zhao April 7, 2022, 9:08 a.m. UTC | #3
On Thu, Apr 7, 2022 at 2:31 AM Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> Hi Yu,
>
> On Wed, 6 Apr 2022 21:24:27 -0600 Yu Zhao <yuzhao@google.com> wrote:
> >
> > Can you please include this patchset in linux-next? Git repo for you to fetch:
> >
> > https://linux-mm.googlesource.com/mglru for-linux-next
>
> I get a message saying "This repository is empty. Push to it to show
> branches and history." :-(

Sorry about that

> > My goal is to get additional test coverage before I send a pull
> > request for 5.19 to Linus.
>
> Good idea :-)
>
> > I've explored all avenues, but ultimately I've failed to rally
> > substantial support from the MM stakeholders [1]. There are no pending
> > technical issues against this patchset [2]. What is more concerning
> > are the fundamental disagreements on priorities, methodologies, etc.
> > that are not specific to this patchset and have been hindering our
> > progress as a collective. (Cheers to the mutual dissatisfaction.)
>
> I have not been following the discussion as I am not an mm person, but
> this is not a good sign.
>
> > While we plan to discuss those issues during the LSFMM next month, it
> > doesn't seem reasonable to leave this patchset hanging in the air,
> > since it has reached its maturity a while ago and there are strong
> > demands from downstream kernels as well as a large user base. Thus I
> > sent that pull request to Linus a couple of weeks ago, implying that
> > he would have to make the final decision soon.
> >
> > I hope this gives enough background about what's been going on with
> > this patchset. If you decide to take it and it causes you any
> > troubles, please feel free to yell at me.
> >
> > Thanks!
> >
> > [1] https://lore.kernel.org/r/20220104202227.2903605-1-yuzhao@google.com/
> > [2] https://lore.kernel.org/r/20220326010003.3155137-1-yuzhao@google.com/
>
> I had a look at those threads and I guess things are better that your
> comment above implies.
>
> So, a couple of questions:
>
> Have you done a trial merge with a current linux-next tree to see what
> sort of mess/pain we may already be in?
>
> Is it all stable enough now that it could be sent as a patch series for
> Andrew to include in mmotm (with perhaps just smallish followup patches)?
>
> --
> Cheers,
> Stephen Rothwell
Yu Zhao April 7, 2022, 9:41 a.m. UTC | #4
On Thu, Apr 7, 2022 at 2:31 AM Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> Hi Yu,
>
> On Wed, 6 Apr 2022 21:24:27 -0600 Yu Zhao <yuzhao@google.com> wrote:
> >
> > Can you please include this patchset in linux-next? Git repo for you to fetch:
> >
> > https://linux-mm.googlesource.com/mglru for-linux-next
>
> I get a message saying "This repository is empty. Push to it to show
> branches and history." :-(

Sorry about this. It should work now.

> > My goal is to get additional test coverage before I send a pull
> > request for 5.19 to Linus.
>
> Good idea :-)
>
> > I've explored all avenues, but ultimately I've failed to rally
> > substantial support from the MM stakeholders [1]. There are no pending
> > technical issues against this patchset [2]. What is more concerning
> > are the fundamental disagreements on priorities, methodologies, etc.
> > that are not specific to this patchset and have been hindering our
> > progress as a collective. (Cheers to the mutual dissatisfaction.)
>
> I have not been following the discussion as I am not an mm person, but
> this is not a good sign.
>
> > While we plan to discuss those issues during the LSFMM next month, it
> > doesn't seem reasonable to leave this patchset hanging in the air,
> > since it has reached its maturity a while ago and there are strong
> > demands from downstream kernels as well as a large user base. Thus I
> > sent that pull request to Linus a couple of weeks ago, implying that
> > he would have to make the final decision soon.
> >
> > I hope this gives enough background about what's been going on with
> > this patchset. If you decide to take it and it causes you any
> > troubles, please feel free to yell at me.
> >
> > Thanks!
> >
> > [1] https://lore.kernel.org/r/20220104202227.2903605-1-yuzhao@google.com/
> > [2] https://lore.kernel.org/r/20220326010003.3155137-1-yuzhao@google.com/
>
> I had a look at those threads and I guess things are better that your
> comment above implies.
>
> So, a couple of questions:
>
> Have you done a trial merge with a current linux-next tree to see what
> sort of mess/pain we may already be in?

Yes, the repo I prepared for you is based on the latest linux-next.
There shouldn't be any conflicts.

> Is it all stable enough now that it could be sent as a patch series for
> Andrew to include in mmotm (with perhaps just smallish followup patches)?

Yes, on multiple occasions, e.g., [1][2][3], I've claimed this
patchset has an unprecedented test coverage and nobody has proven
otherwise so far.

Andrew suggested a cycle in linux-next [4]. So here we are :)

[1] https://lore.kernel.org/all/YdSuSHa%2FVjl6bPkg@google.com/
[2] https://lore.kernel.org/r/YdiKVJlClB3h1Kmg@google.com/
[3] https://lore.kernel.org/r/YgR+MfXjpg82QyBT@google.com/
[4] https://lore.kernel.org/r/20220326134928.ad739eeecd5d0855dbdc6257@linux-foundation.org/
Stephen Rothwell April 7, 2022, 12:13 p.m. UTC | #5
Hi,

On Thu, 7 Apr 2022 03:41:15 -0600 Yu Zhao <yuzhao@google.com> wrote:
>
> > So, a couple of questions:
> >
> > Have you done a trial merge with a current linux-next tree to see what
> > sort of mess/pain we may already be in?  
> 
> Yes, the repo I prepared for you is based on the latest linux-next.
> There shouldn't be any conflicts.

Ah, that is a problem :-(  I can't merge a branch into linux-next if
that branch is based on linux-next itself.  linux-next rebases
everyday, so that merge would bring in the previous version of
linux-next - including other branches that may have rebased :-(

All the branches in linux-next need to be based on Linus' tree or some
tree that does not rebase (or one you can keep up with if it does
rebase).

The only exception is part of Andrew's patch series which is rebased
(by me) on top of linux-next each day.

> > Is it all stable enough now that it could be sent as a patch series for
> > Andrew to include in mmotm (with perhaps just smallish followup patches)?  
> 
> Yes, on multiple occasions, e.g., [1][2][3], I've claimed this
> patchset has an unprecedented test coverage and nobody has proven
> otherwise so far.
> 
> Andrew suggested a cycle in linux-next [4]. So here we are :)

So the easiest thing for me is if Andrew takes it into his mmotm patch
series (most of which ends up in linux-next).  Otherwise I am probably
at some point going to need help fixing the conflicts.
Yu Zhao April 8, 2022, 2:08 a.m. UTC | #6
On Thu, Apr 7, 2022 at 6:14 AM Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> Hi,
>
> On Thu, 7 Apr 2022 03:41:15 -0600 Yu Zhao <yuzhao@google.com> wrote:
> >
> > > So, a couple of questions:
> > >
> > > Have you done a trial merge with a current linux-next tree to see what
> > > sort of mess/pain we may already be in?
> >
> > Yes, the repo I prepared for you is based on the latest linux-next.
> > There shouldn't be any conflicts.
>
> Ah, that is a problem :-(  I can't merge a branch into linux-next if
> that branch is based on linux-next itself.  linux-next rebases
> everyday, so that merge would bring in the previous version of
> linux-next - including other branches that may have rebased :-(
>
> All the branches in linux-next need to be based on Linus' tree or some
> tree that does not rebase (or one you can keep up with if it does
> rebase).
>
> The only exception is part of Andrew's patch series which is rebased
> (by me) on top of linux-next each day.

Gotcha.

> > > Is it all stable enough now that it could be sent as a patch series for
> > > Andrew to include in mmotm (with perhaps just smallish followup patches)?
> >
> > Yes, on multiple occasions, e.g., [1][2][3], I've claimed this
> > patchset has an unprecedented test coverage and nobody has proven
> > otherwise so far.
> >
> > Andrew suggested a cycle in linux-next [4]. So here we are :)
>
> So the easiest thing for me is if Andrew takes it into his mmotm patch
> series (most of which ends up in linux-next).

Agreed.

> Otherwise I am probably
> at some point going to need help fixing the conflicts.

Yes, very likely.

I see three options at the moment:
1. I grab mmotm, rebase it, apply this patchset atop and then route it
to you. Based on my rough understanding of your workflow, this might
reduce the work on your side.
2. I skip linux-next, and when I send the pull request for 5.19, I'll
include incentive for Linus to potentially forgo the required
linux-next cycle.
3. You or Andrew would have to do something you/he might not enjoy :)

Please let me know. Thanks!
Andrew Morton April 12, 2022, 2:15 a.m. UTC | #7
It's pretty clear from the results and from the test coverage to date
that Linux wants this.

I do think additional review is needed.  For the usual code quality
reasons, but also to help others get up to speed with the proposed
changes and to ensure that we have something which is well maintainable
going forward.

The code at present is unnecessarily difficult to review and that
review will be less effective than it might be, because of its lack of
commentary.

I'll merge the v10 series into -mm and -next for additional runtime
exposure, but I'll be asking for a broad update.

The data structures appear to be adequately documented (this is rare)
but the code itself is lacking.  Please go through each and every new
function and ask "is this function so self-evident that it can be
presented to a newcomer with no explanatory comments at all".

If "yes" then check again.  If still "yes" then fine.  If "no" then
let's craft comments which tell the reader things which are not evident
from the code itself.  Things which are helpful to that reader.  Design
concepts, side-effects, preconditions, unobvious traps, etc.

Please ensure that the current group of mglru developers review these
comment additions as closely as they review code changes.

Alas, it's late in the process to be adding these things.  But it can
be made better.

I'll make some high-level comments on the patches themselves now, but I
see little point in attempting detailed review when a better-explained
version is hopefully forthcoming.


A few gratuitous notebook entries:

- lru_gen_struct.timestamps works in jiffies?  Why?  jiffies can vary
  based on platform and config - why add inter-platform and
  inter-Kconfig variability like this?  Time is measured in seconds!

  And the amount of work which can be performed in one second varies
  by, guess, 100x on machines which we care about.  This also has
  potential to introduce tremendous inter-platform variability in the
  behaviour of this feature.

- did lru_gen_use_mm() really need to test nodes_full()?  Is that
  likely enough to be a net benefit?

- seq_is_valid() sounds like it belongs to the seq_file() subsystem

- does get_nr_evictable() need to return and use signed types (long)?
  It sums the contents of lru_gen_struct.nr_pages[][][], which is
  ulong, so I think "no".
Andrew Morton April 14, 2022, 5:06 a.m. UTC | #8
The code uses 

	struct lru_gen_mm_walk *walk

in some places and

	struct mm_walk *walk

in others.  This is already driving me nuts.  Can we please use
different identifiers for different types?
Yu Zhao April 20, 2022, 12:50 a.m. UTC | #9
On Wed, Apr 13, 2022 at 11:06 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> The code uses
>
>         struct lru_gen_mm_walk *walk
>
> in some places and
>
>         struct mm_walk *walk
>
> in others.  This is already driving me nuts.  Can we please use
> different identifiers for different types?

Sorry.

Going with "struct lru_gen_mm_walk *args", unless there are objections.

("struct mm_walk *walk" is an existing convention so I don't think I
should change that.)