Message ID | 951fb7edab535cf522def4f5f2613947ed7b7d28.1701853894.git.henry.hj@antgroup.com (mailing list archive)
---|---
State | New
Series | [RFC,v2] mm: Multi-Gen LRU: fix use mm/page_idle/bitmap
On Wed, Dec 6, 2023 at 5:51 AM Henry Huang <henry.hj@antgroup.com> wrote: > > Multi-Gen LRU page-table walker clears pte young flag, but it doesn't > clear page idle flag. When we use /sys/kernel/mm/page_idle/bitmap to check > whether one page is accessed, it would tell us this page is idle, > but actually this page has been accessed. > > For those unmapped filecache pages, page idle flag would not been > cleared in folio_mark_accessed if Multi-Gen LRU is enabled. > So we couln't use /sys/kernel/mm/page_idle/bitmap to check whether > a filecache page is read or written. > > What's more, /sys/kernel/mm/page_idle/bitmap also clears pte young flag. > If one page is accessed, it would set page young flag. Multi-Gen LRU > page-table walker should check both page&pte young flags. > > how-to-reproduce-problem > > idle_page_track > a tools to track process accessed memory during a specific time > usage > idle_page_track $pid $time > how-it-works > 1. scan process vma from /proc/$pid/maps > 2. vfn --> pfn from /proc/$pid/pagemap > 3. write /sys/kernel/mm/page_idle/bitmap to > mark phy page idle flag and clear pte young flag > 4. sleep $time > 5. read /sys/kernel/mm/page_idle/bitmap to > test_and_clear pte young flag and > return whether phy page is accessed > > test ---- test program > > #include <stdio.h> > #include <stdlib.h> > #include <string.h> > #include <unistd.h> > #include <sys/types.h> > #include <sys/stat.h> > #include <fcntl.h> > > int main(int argc, const char *argv[]) > { > char *buf = NULL; > char pipe_info[4096]; > int n; > int fd = -1; > > buf = malloc(1024*1024*1024UL); > memset(buf, 0, 1024*1024*1024UL); > fd = open("access.pipe", O_RDONLY); > if (fd < 0) > goto out; > while (1) { > n = read(fd, pipe_info, sizeof(pipe_info)); > if (!n) { > sleep(1); > continue; > } else if (n < 0) { > break; > } > memset(buf, 0, 1024*1024*1024UL); > puts("finish access"); > } > out: > if (fd >=0) > close(fd); > if (buf) > free(buf); > > return 0; > } > > prepare: > mkfifo access.pipe > ./test > ps -ef | grep test > root 4106 3148 8 06:47 pts/0 00:00:01 ./test > > We use /sys/kernel/debug/lru_gen to simulate mglru page-table scan. > > case 1: mglru walker break page_idle > ./idle_page_track 4106 60 & > sleep 5; echo 1 > access.pipe > sleep 5; echo '+ 8 0 6 1 1' > /sys/kernel/debug/lru_gen > > the output of idle_page_track is: > Est(s) Ref(MB) > 64.822 1.00 > only found 1MB were accessed during 64.822s, but actually 1024MB were > accessed. > > case 2: page_idle break mglru walker > echo 1 > access.pipe > ./idle_page_track 4106 10 > echo '+ 8 0 7 1 1' > /sys/kernel/debug/lru_gen > lru gen status: > memcg 8 /user.slice > node 0 > 5 772458 1065 9735 > 6 737435 262244 72 > 7 538053 1184 632 > 8 59404 6422 0 > almost pages should be in max_seq-1 queue, but actually not. > > Signed-off-by: Henry Huang <henry.hj@antgroup.com> It's never intended for MGLRU to support page_idle/bitmap or PG_idle/young because: 1. page_idle/bitmap isn't a capable interface at all -- yes, Google proposed the idea [1], but we don't really use it anymore because of its poor scalability. 2. PG_idle/young, being a boolean value, has poor granularity. If anyone must use page_idle/bitmap for some specific reason, I'd recommend exporting generation numbers instead. [1] https://lore.kernel.org/cover.1426706637.git.vdavydov@parallels.com/
Thanks for replying to this RFC. > 1. page_idle/bitmap isn't a capable interface at all -- yes, Google > proposed the idea [1], but we don't really use it anymore because of > its poor scalability. In our environment, we use /sys/kernel/mm/page_idle/bitmap to check whether pages were accessed during a period of time. We manage the idle time of all pages in userspace, then use a prediction algorithm to select pages to reclaim. These pages are more likely to stay idle for a long time. We only need the kernel to tell us whether a page was accessed; a boolean value in the kernel is enough for our case. > 2. PG_idle/young, being a boolean value, has poor granularity. If > anyone must use page_idle/bitmap for some specific reason, I'd > recommend exporting generation numbers instead. Yes, at first we tried using the multi-gen LRU proactive scan and exporting the generation & refs numbers to do the same thing. But there are several problems: 1. Multi-gen LRU only cares about pages in its own memcg. In our environment, it is common for processes in different memcgs to share pages. Multi-gen LRU only updates the gen of pages in the current memcg, so it's hard to judge whether a page was accessed based on gen updates alone. We still have no idea how to solve this problem. 2. We set swappiness to 0 and use proactive scan to select cold pages & proactive reclaim to swap anon pages. But we can't control the passive scan (can_swap = false), which causes cold/hot inversion of anon pages in inc_min_seq. Would it be a good idea to add an interface to configure the passive scan behaviour?
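For reference, a minimal userspace sketch of the page_idle/bitmap round trip described above (mark a PFN idle, wait, check whether it is still idle). The helper names are ours and error handling is trimmed; the bitmap layout (one bit per PFN, read/written in aligned 8-byte, 64-PFN words) follows Documentation/admin-guide/mm/idle_page_tracking.rst.

/*
 * Sketch only: mark the 64-PFN word containing a PFN idle (the write path
 * walks rmap and clears the PTE accessed bits), then read it back later to
 * see whether the page was accessed in between. Needs CAP_SYS_ADMIN.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BITMAP_PATH "/sys/kernel/mm/page_idle/bitmap"

/* Mark all 64 PFNs in the word containing @pfn idle. */
static int mark_idle(int fd, uint64_t pfn)
{
	uint64_t word = ~0ULL;

	return pwrite(fd, &word, sizeof(word), (pfn >> 6) * 8) == sizeof(word) ? 0 : -1;
}

/* Return 1 if @pfn is still idle (not accessed since mark_idle()), 0 if accessed. */
static int is_idle(int fd, uint64_t pfn)
{
	uint64_t word;

	if (pread(fd, &word, sizeof(word), (pfn >> 6) * 8) != sizeof(word))
		return -1;
	return (word >> (pfn & 63)) & 1;
}

int main(int argc, char **argv)
{
	uint64_t pfn;
	int fd;

	if (argc < 2)
		return 1;
	pfn = strtoull(argv[1], NULL, 0);	/* PFN, e.g. from /proc/$pid/pagemap */
	fd = open(BITMAP_PATH, O_RDWR);
	if (fd < 0)
		return 1;
	mark_idle(fd, pfn);
	sleep(60);				/* observation window */
	printf("pfn %#llx %s\n", (unsigned long long)pfn,
	       is_idle(fd, pfn) ? "idle" : "accessed");
	close(fd);
	return 0;
}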
On Fri, Dec 8, 2023 at 12:12 AM Henry Huang <henry.hj@antgroup.com> wrote: > > Thanks for replying this RFC. > > > 1. page_idle/bitmap isn't a capable interface at all -- yes, Google > > proposed the idea [1], but we don't really use it anymore because of > > its poor scalability. > > In our environment, we use /sys/kernel/mm/page_idle/bitmap to check > pages whether were accessed during a peroid of time. Is it a production environment? If so, what's your 1. scan interval 2. memory size I'm trying to understand why scalability isn't a problem for you. On an average server, there are hundreds of millions of PFNs, so it'd be very expensive to use that ABI even for a time interval of minutes. > We manage all pages > idle time in userspace. Then use a prediction algorithm to select pages > to reclaim. These pages would more likely be idled for a long time. "There is a system in place now that is based on a user-space process that reads a bitmap stored in sysfs, but it has a high CPU and memory overhead, so a new approach is being tried." https://lwn.net/Articles/787611/ Could you elaborate how you solved this problem? > We only need kernel to tell use whether a page is accessed, a boolean > value in kernel is enough for our case. How do you define "accessed"? I.e., through page tables or file descriptors or both? > > 2. PG_idle/young, being a boolean value, has poor granularity. If > > anyone must use page_idle/bitmap for some specific reason, I'd > > recommend exporting generation numbers instead. > > Yes, at first time, we try using multi-gen LRU proactvie scan and > exporting generation&refs number to do the same thing. > > But there are serveral problems: > > 1. multi-gen LRU only care about self-memcg pages. In our environment, > it's likely to see that different memcg's process share pages. This is related to my question above: are those pages mapped into different memcgs or not? > multi-gen LRU only update gen of pages in current memcg. It's hard to > judge a page whether is accessed depends on gen update. This depends. I'd be glad to elaborate after you clarify the above. > We still have no ideas how to solve this problem. > > 2. We set swappiness 0, and use proactive scan to select cold pages > & proactive reclaim to swap anon pages. But we can't control passive > scan(can_swap = false), which would make anon pages cold/hot inversion > in inc_min_seq. There is an option to prevent the inversion, IIUC, the force_scan option is what you are looking for.
On Wed, Dec 6, 2023 at 5:51 AM Henry Huang <henry.hj@antgroup.com> wrote: > > Multi-Gen LRU page-table walker clears pte young flag, but it doesn't > clear page idle flag. When we use /sys/kernel/mm/page_idle/bitmap to check > whether one page is accessed, it would tell us this page is idle, > but actually this page has been accessed. > > For those unmapped filecache pages, page idle flag would not been > cleared in folio_mark_accessed if Multi-Gen LRU is enabled. > So we couln't use /sys/kernel/mm/page_idle/bitmap to check whether > a filecache page is read or written. > > What's more, /sys/kernel/mm/page_idle/bitmap also clears pte young flag. > If one page is accessed, it would set page young flag. Multi-Gen LRU > page-table walker should check both page&pte young flags. > > how-to-reproduce-problem > > idle_page_track > a tools to track process accessed memory during a specific time > usage > idle_page_track $pid $time > how-it-works > 1. scan process vma from /proc/$pid/maps > 2. vfn --> pfn from /proc/$pid/pagemap > 3. write /sys/kernel/mm/page_idle/bitmap to > mark phy page idle flag and clear pte young flag > 4. sleep $time > 5. read /sys/kernel/mm/page_idle/bitmap to > test_and_clear pte young flag and > return whether phy page is accessed > > test ---- test program > > #include <stdio.h> > #include <stdlib.h> > #include <string.h> > #include <unistd.h> > #include <sys/types.h> > #include <sys/stat.h> > #include <fcntl.h> > > int main(int argc, const char *argv[]) > { > char *buf = NULL; > char pipe_info[4096]; > int n; > int fd = -1; > > buf = malloc(1024*1024*1024UL); > memset(buf, 0, 1024*1024*1024UL); > fd = open("access.pipe", O_RDONLY); > if (fd < 0) > goto out; > while (1) { > n = read(fd, pipe_info, sizeof(pipe_info)); > if (!n) { > sleep(1); > continue; > } else if (n < 0) { > break; > } > memset(buf, 0, 1024*1024*1024UL); > puts("finish access"); > } > out: > if (fd >=0) > close(fd); > if (buf) > free(buf); > > return 0; > } > > prepare: > mkfifo access.pipe > ./test > ps -ef | grep test > root 4106 3148 8 06:47 pts/0 00:00:01 ./test > > We use /sys/kernel/debug/lru_gen to simulate mglru page-table scan. > > case 1: mglru walker break page_idle > ./idle_page_track 4106 60 & > sleep 5; echo 1 > access.pipe > sleep 5; echo '+ 8 0 6 1 1' > /sys/kernel/debug/lru_gen > > the output of idle_page_track is: > Est(s) Ref(MB) > 64.822 1.00 > only found 1MB were accessed during 64.822s, but actually 1024MB were > accessed. > > case 2: page_idle break mglru walker > echo 1 > access.pipe > ./idle_page_track 4106 10 > echo '+ 8 0 7 1 1' > /sys/kernel/debug/lru_gen > lru gen status: > memcg 8 /user.slice > node 0 > 5 772458 1065 9735 > 6 737435 262244 72 > 7 538053 1184 632 > 8 59404 6422 0 > almost pages should be in max_seq-1 queue, but actually not. > > Signed-off-by: Henry Huang <henry.hj@antgroup.com> Regarding the change itself, it'd cause a slight regression to other use cases (details below). > @@ -3355,6 +3359,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, > unsigned long pfn; > struct folio *folio; > pte_t ptent = ptep_get(pte + i); > + bool is_pte_young; > > total++; > walk->mm_stats[MM_LEAF_TOTAL]++; > @@ -3363,16 +3368,20 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, > if (pfn == -1) > continue; > > - if (!pte_young(ptent)) { > - walk->mm_stats[MM_LEAF_OLD]++; Most overhead from page table scanning normally comes from get_pfn_folio() because it almost always causes a cache miss. 
This is like a pointer dereference, whereas scanning PTEs is like streaming an array (bad vs good cache performance). pte_young() is here to avoid an unnecessary cache miss from get_pfn_folio(). Also see the first comment in get_pfn_folio(). It should be easy to verify the regression -- FlameGraph from the memcached benchmark in the original commit message should do it. Would a tracepoint here work for you? > + is_pte_young = !!pte_young(ptent); > + folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap, is_pte_young); > + if (!folio) { > + if (!is_pte_young) > + walk->mm_stats[MM_LEAF_OLD]++; > continue; > } > > - folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap); > - if (!folio) > + if (!folio_test_clear_young(folio) && !is_pte_young) { > + walk->mm_stats[MM_LEAF_OLD]++; > continue; > + } > > - if (!ptep_test_and_clear_young(args->vma, addr, pte + i)) > + if (is_pte_young && !ptep_test_and_clear_young(args->vma, addr, pte + i)) > VM_WARN_ON_ONCE(true); > > young++;
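For reference, a tracepoint at that spot might look roughly like the sketch below. The event name, fields and header file are purely illustrative, not an existing upstream tracepoint; the call site would be a single trace_mglru_walk_pte(addr, pfn, pte_young(ptent)) in walk_pte_range() before the pte_young() check.

/* Illustrative only: a minimal MGLRU walker tracepoint definition. */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM mglru

#if !defined(_TRACE_MGLRU_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_MGLRU_H

#include <linux/tracepoint.h>

TRACE_EVENT(mglru_walk_pte,

	TP_PROTO(unsigned long addr, unsigned long pfn, bool young),

	TP_ARGS(addr, pfn, young),

	TP_STRUCT__entry(
		__field(unsigned long,	addr)
		__field(unsigned long,	pfn)
		__field(bool,		young)
	),

	TP_fast_assign(
		__entry->addr	= addr;
		__entry->pfn	= pfn;
		__entry->young	= young;
	),

	TP_printk("addr=%#lx pfn=%#lx young=%d",
		  __entry->addr, __entry->pfn, __entry->young)
);

#endif /* _TRACE_MGLRU_H */

/* This part must be outside protection */
#include <trace/define_trace.h>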
On Fri, Dec 15, 2023 at 14:46 PM Yu Zhao <yuzhao@google.com> wrote: > > > > Thanks for replying this RFC. > > > > > 1. page_idle/bitmap isn't a capable interface at all -- yes, Google > > > proposed the idea [1], but we don't really use it anymore because of > > > its poor scalability. > > > > In our environment, we use /sys/kernel/mm/page_idle/bitmap to check > > pages whether were accessed during a peroid of time. > > Is it a production environment? If so, what's your > 1. scan interval > 2. memory size > I'm trying to understand why scalability isn't a problem for you. On > an average server, there are hundreds of millions of PFNs, so it'd be > very expensive to use that ABI even for a time interval of minutes. Thanks for replying. Our scan interval is 10 minutes and total memory size is 512GB. We perferred to reclaim pages which idle age > 1 hour at least. > > We manage all pages > > idle time in userspace. Then use a prediction algorithm to select pages > > to reclaim. These pages would more likely be idled for a long time. > "There is a system in place now that is based on a user-space process > that reads a bitmap stored in sysfs, but it has a high CPU and memory > overhead, so a new approach is being tried." > https://lwn.net/Articles/787611/ > > Could you elaborate how you solved this problem? In out environment, we found that we take average 0.4 core and 300MB memory to do scan, basic analyse and reclaim idle pages. For reducing cpu & memroy usage, we do: 1. We implement a ratelimiter to control rate of scan and reclaim. 2. All pages info & idle age were stored in local DB file. Our prediction algorithm don't need all pages info in memory at the same time. In out environment, about 1/3 memory was attemped to allocate as THP, which may save some cpu usage of scan. > > We only need kernel to tell use whether a page is accessed, a boolean > > value in kernel is enough for our case. > > How do you define "accessed"? I.e., through page tables or file > descriptors or both? both > > > 2. PG_idle/young, being a boolean value, has poor granularity. If > > > anyone must use page_idle/bitmap for some specific reason, I'd > > > recommend exporting generation numbers instead. > > > > Yes, at first time, we try using multi-gen LRU proactvie scan and > > exporting generation&refs number to do the same thing. > > > > But there are serveral problems: > > > > 1. multi-gen LRU only care about self-memcg pages. In our environment, > > it's likely to see that different memcg's process share pages. > > This is related to my question above: are those pages mapped into > different memcgs or not? There is a case: There are two cgroup A, B (B is child cgroup of A) Process in A create a file and use mmap to read/write this file. Process in B mmap this file and usually read this file. > > We still have no ideas how to solve this problem. > > > > 2. We set swappiness 0, and use proactive scan to select cold pages > > & proactive reclaim to swap anon pages. But we can't control passive > > scan(can_swap = false), which would make anon pages cold/hot inversion > > in inc_min_seq. > > There is an option to prevent the inversion, IIUC, the force_scan > option is what you are looking for. It seems that doesn't work now. static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan) { ...... 
for (type = ANON_AND_FILE - 1; type >= 0; type--) { if (get_nr_gens(lruvec, type) != MAX_NR_GENS) continue; VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap)); if (inc_min_seq(lruvec, type, can_swap)) continue; spin_unlock_irq(&lruvec->lru_lock); cond_resched(); goto restart; } ..... } force_scan is not a parameter of inc_min_seq. In our environment, swappiness is 0, so can_swap would be false. static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap) { int zone; int remaining = MAX_LRU_BATCH; struct lru_gen_folio *lrugen = &lruvec->lrugen; int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); if (type == LRU_GEN_ANON && !can_swap) goto done; ...... } If can_swap is false, would pass anon lru list. What's more, in passive scan, force_scan is also false. static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool can_swap) { ...... /* skip this lruvec as it's low on cold folios */ return try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false) ? -1 : 0; } Is it a good idea to include a global parameter no_inversion, and modify inc_min_seq like this: static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap) { int zone; int remaining = MAX_LRU_BATCH; struct lru_gen_folio *lrugen = &lruvec->lrugen; int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); - if (type == LRU_GEN_ANON && !can_swap) + if (type == LRU_GEN_ANON && !can_swap && !no_inversion) goto done; ...... }
On Fri, Dec 15, 2023 at 15:23 PM Yu Zhao <yuzhao@google.com> wrote: >Regarding the change itself, it'd cause a slight regression to other >use cases (details below). > > > @@ -3355,6 +3359,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, > > unsigned long pfn; > > struct folio *folio; > > pte_t ptent = ptep_get(pte + i); > > + bool is_pte_young; > > > > total++; > > walk->mm_stats[MM_LEAF_TOTAL]++; > > @@ -3363,16 +3368,20 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, > > if (pfn == -1) > > continue; > > > > - if (!pte_young(ptent)) { > > - walk->mm_stats[MM_LEAF_OLD]++; > > Most overhead from page table scanning normally comes from > get_pfn_folio() because it almost always causes a cache miss. This is > like a pointer dereference, whereas scanning PTEs is like streaming an > array (bad vs good cache performance). > > pte_young() is here to avoid an unnecessary cache miss from > get_pfn_folio(). Also see the first comment in get_pfn_folio(). It > should be easy to verify the regression -- FlameGraph from the > memcached benchmark in the original commit message should do it. > > Would a tracepoint here work for you? > > > > > + is_pte_young = !!pte_young(ptent); > > + folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap, is_pte_young); > > + if (!folio) { > > + if (!is_pte_young) > > + walk->mm_stats[MM_LEAF_OLD]++; > > continue; > > } > > > > - folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap); > > - if (!folio) > > + if (!folio_test_clear_young(folio) && !is_pte_young) { > > + walk->mm_stats[MM_LEAF_OLD]++; > > continue; > > + } > > > > - if (!ptep_test_and_clear_young(args->vma, addr, pte + i)) > > + if (is_pte_young && !ptep_test_and_clear_young(args->vma, addr, pte + i)) > > VM_WARN_ON_ONCE(true); > > > > young++; Thanks for replying. For avoiding below: 1. confict between page_idle/bitmap and mglru scan 2. performance downgrade in mglru page-table scan if call get_pfn_folio for each pte. We have a new idea: 1. Include a new api under /sys/kernel/mm/page_idle, support mark idle flag only, without rmap walking or clearing pte young. 2. Use mglru proactive scan to clear page idle flag. workflows: t1 t2 mark pages idle mglru scan and check pages idle It's easy for us to know that whether a page is accessed during t1~t2. Some code changes like these: We clear idle flags in get_pfn_folio, and in walk_pte_range we still follow original design. static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg, struct pglist_data *pgdat, bool can_swap, bool clear_idle) { struct folio *folio; /* try to avoid unnecessary memory loads */ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) return NULL; folio = pfn_folio(pfn); + + if (clear_idle && folio_test_idle(folio)) + folio_clear_idle(folio); + if (folio_nid(folio) != pgdat->node_id) return NULL; if (folio_memcg_rcu(folio) != memcg) return NULL; /* file VMAs can contain anon pages from COW */ if (!folio_is_file_lru(folio) && !can_swap) return NULL; return folio; } static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, struct mm_walk *args) { ... 
for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) { unsigned long pfn; struct folio *folio; pte_t ptent = ptep_get(pte + i); total++; walk->mm_stats[MM_LEAF_TOTAL]++; pfn = get_pte_pfn(ptent, args->vma, addr); if (pfn == -1) continue; if (!pte_young(ptent)) { walk->mm_stats[MM_LEAF_OLD]++; continue; } + folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap, true); - folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap); if (!folio) continue; ... } Is it a good idea or not ?
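To make the proposed t1/t2 workflow concrete, a userspace sketch is shown below under two assumptions that do NOT exist in any kernel today: a mark-only page_idle file (called "bitmap_mark_only" here purely for illustration) and the proposed behaviour where the MGLRU walker clears PG_idle. The lru_gen debugfs command is the same one already used elsewhere in this thread; error handling is trimmed.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_str(const char *path, const char *s)
{
	int fd = open(path, O_WRONLY);
	int ret = -1;

	if (fd >= 0) {
		ret = write(fd, s, strlen(s)) == (ssize_t)strlen(s) ? 0 : -1;
		close(fd);
	}
	return ret;
}

int main(void)
{
	uint64_t word = ~0ULL, after;
	/* hypothetical mark-only interface; today only "bitmap" exists */
	int fd = open("/sys/kernel/mm/page_idle/bitmap_mark_only", O_RDWR);

	if (fd < 0)
		return 1;
	/* t1: mark the first 64 PFNs idle, without touching PTE young bits */
	pwrite(fd, &word, sizeof(word), 0);
	sleep(600);					/* observation window */
	/* t2: trigger MGLRU aging of memcg 8, node 0, up to max_gen 7 */
	write_str("/sys/kernel/debug/lru_gen", "+ 8 0 7 1 1");
	/* bits cleared by the walker belong to pages accessed in t1..t2 */
	pread(fd, &after, sizeof(after), 0);
	printf("accessed mask: %#llx\n", (unsigned long long)(word & ~after));
	close(fd);
	return 0;
}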
On Fri, Dec 15, 2023 at 3:53 AM Henry Huang <henry.hj@antgroup.com> wrote: > > On Fri, Dec 15, 2023 at 14:46 PM Yu Zhao <yuzhao@google.com> wrote: > > > > > > Thanks for replying this RFC. > > > > > > > 1. page_idle/bitmap isn't a capable interface at all -- yes, Google > > > > proposed the idea [1], but we don't really use it anymore because of > > > > its poor scalability. > > > > > > In our environment, we use /sys/kernel/mm/page_idle/bitmap to check > > > pages whether were accessed during a peroid of time. > > > > Is it a production environment? If so, what's your > > 1. scan interval > > 2. memory size > > > I'm trying to understand why scalability isn't a problem for you. On > > an average server, there are hundreds of millions of PFNs, so it'd be > > very expensive to use that ABI even for a time interval of minutes. > > Thanks for replying. > > Our scan interval is 10 minutes and total memory size is 512GB. > We perferred to reclaim pages which idle age > 1 hour at least. Yes, that makes sense. We have similar use cases, details below. > > > We manage all pages > > > idle time in userspace. Then use a prediction algorithm to select pages > > > to reclaim. These pages would more likely be idled for a long time. > > > "There is a system in place now that is based on a user-space process > > that reads a bitmap stored in sysfs, but it has a high CPU and memory > > overhead, so a new approach is being tried." > > https://lwn.net/Articles/787611/ > > > > Could you elaborate how you solved this problem? > > In out environment, we found that we take average 0.4 core and 300MB memory > to do scan, basic analyse and reclaim idle pages. > > For reducing cpu & memroy usage, we do: > 1. We implement a ratelimiter to control rate of scan and reclaim. > 2. All pages info & idle age were stored in local DB file. Our prediction > algorithm don't need all pages info in memory at the same time. > > In out environment, about 1/3 memory was attemped to allocate as THP, > which may save some cpu usage of scan. > > > > We only need kernel to tell use whether a page is accessed, a boolean > > > value in kernel is enough for our case. > > > > How do you define "accessed"? I.e., through page tables or file > > descriptors or both? > > both > > > > > 2. PG_idle/young, being a boolean value, has poor granularity. If > > > > anyone must use page_idle/bitmap for some specific reason, I'd > > > > recommend exporting generation numbers instead. > > > > > > Yes, at first time, we try using multi-gen LRU proactvie scan and > > > exporting generation&refs number to do the same thing. > > > > > > But there are serveral problems: > > > > > > 1. multi-gen LRU only care about self-memcg pages. In our environment, > > > it's likely to see that different memcg's process share pages. > > > > This is related to my question above: are those pages mapped into > > different memcgs or not? > > There is a case: > There are two cgroup A, B (B is child cgroup of A) > Process in A create a file and use mmap to read/write this file. > Process in B mmap this file and usually read this file. Yes, actually we have a private patch to solve a similar problem. Basically it finds VMAs from other processes in different memcgs that share a mapping and jumps to those VMAs to scan them. We can upstream it for you if you find it useful too. > > > We still have no ideas how to solve this problem. > > > > > > 2. We set swappiness 0, and use proactive scan to select cold pages > > > & proactive reclaim to swap anon pages. 
But we can't control passive > > > scan(can_swap = false), which would make anon pages cold/hot inversion > > > in inc_min_seq. > > > > There is an option to prevent the inversion, IIUC, the force_scan > > option is what you are looking for. > > It seems that doesn't work now. > > static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan) > { > ...... > for (type = ANON_AND_FILE - 1; type >= 0; type--) { > if (get_nr_gens(lruvec, type) != MAX_NR_GENS) > continue; > > VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap)); > > if (inc_min_seq(lruvec, type, can_swap)) > continue; > > spin_unlock_irq(&lruvec->lru_lock); > cond_resched(); > goto restart; > } > ..... > } > > force_scan is not a parameter of inc_min_seq. > In our environment, swappiness is 0, so can_swap would be false. > > static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap) > { > int zone; > int remaining = MAX_LRU_BATCH; > struct lru_gen_folio *lrugen = &lruvec->lrugen; > int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); > > if (type == LRU_GEN_ANON && !can_swap) > goto done; > ...... > } > > If can_swap is false, would pass anon lru list. > > What's more, in passive scan, force_scan is also false. Ok, I see what you mean. (I thought "passive" means proactive scans triggered by the debugfs interface, but it actually means "reactive" scans triggered by memory pressure.) We actually have a private patch too to solve this. But there is a corner case here: that private change, which is essentially the same as what you suggested, can stall direct reclaim when there is tons of cold anon memory. E.g., if there is 300GB anon memory in the oldest generation which can't be swapped, calling inc_min_seq() with can_swap being true would stall the direct reclaim. Does it make sense? Let me check the state of those private patches and get back to you in a couple of days. Thanks!
On Sun, Dec 17, 2023 at 05:07 AM Yu Zhao <yuzhao@google.com> wrote: > Let me check the state of those private patches and get back to you in > a couple of days. Thanks! Thanks very much! Looking forward to your patches.
Hi Henry, I have a question on memcg charging for the shared pages. Replied inline. On Fri, Dec 15, 2023 at 2:53 AM Henry Huang <henry.hj@antgroup.com> wrote: > > On Fri, Dec 15, 2023 at 14:46 PM Yu Zhao <yuzhao@google.com> wrote: > > > > > > Thanks for replying this RFC. > > > > > > > 1. page_idle/bitmap isn't a capable interface at all -- yes, Google > > > > proposed the idea [1], but we don't really use it anymore because of > > > > its poor scalability. > > > > > > In our environment, we use /sys/kernel/mm/page_idle/bitmap to check > > > pages whether were accessed during a peroid of time. > > > > Is it a production environment? If so, what's your > > 1. scan interval > > 2. memory size > > > I'm trying to understand why scalability isn't a problem for you. On > > an average server, there are hundreds of millions of PFNs, so it'd be > > very expensive to use that ABI even for a time interval of minutes. > > Thanks for replying. > > Our scan interval is 10 minutes and total memory size is 512GB. > We perferred to reclaim pages which idle age > 1 hour at least. > > > > We manage all pages > > > idle time in userspace. Then use a prediction algorithm to select pages > > > to reclaim. These pages would more likely be idled for a long time. > > > "There is a system in place now that is based on a user-space process > > that reads a bitmap stored in sysfs, but it has a high CPU and memory > > overhead, so a new approach is being tried." > > https://lwn.net/Articles/787611/ > > > > Could you elaborate how you solved this problem? > > In out environment, we found that we take average 0.4 core and 300MB memory > to do scan, basic analyse and reclaim idle pages. > > For reducing cpu & memroy usage, we do: > 1. We implement a ratelimiter to control rate of scan and reclaim. > 2. All pages info & idle age were stored in local DB file. Our prediction > algorithm don't need all pages info in memory at the same time. > > In out environment, about 1/3 memory was attemped to allocate as THP, > which may save some cpu usage of scan. > > > > We only need kernel to tell use whether a page is accessed, a boolean > > > value in kernel is enough for our case. > > > > How do you define "accessed"? I.e., through page tables or file > > descriptors or both? > > both > > > > > 2. PG_idle/young, being a boolean value, has poor granularity. If > > > > anyone must use page_idle/bitmap for some specific reason, I'd > > > > recommend exporting generation numbers instead. > > > > > > Yes, at first time, we try using multi-gen LRU proactvie scan and > > > exporting generation&refs number to do the same thing. > > > > > > But there are serveral problems: > > > > > > 1. multi-gen LRU only care about self-memcg pages. In our environment, > > > it's likely to see that different memcg's process share pages. > > > > This is related to my question above: are those pages mapped into > > different memcgs or not? > > There is a case: > There are two cgroup A, B (B is child cgroup of A) > Process in A create a file and use mmap to read/write this file. > Process in B mmap this file and usually read this file.\ How does the shared memory get charged to the cgroups? Does it all go to cgroup A or B exclusively, or do some pages get charged to each one? > > > > We still have no ideas how to solve this problem. > > > > > > 2. We set swappiness 0, and use proactive scan to select cold pages > > > & proactive reclaim to swap anon pages. 
But we can't control passive > > > scan(can_swap = false), which would make anon pages cold/hot inversion > > > in inc_min_seq. > > > > There is an option to prevent the inversion, IIUC, the force_scan > > option is what you are looking for. > > It seems that doesn't work now. > > static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan) > { > ...... > for (type = ANON_AND_FILE - 1; type >= 0; type--) { > if (get_nr_gens(lruvec, type) != MAX_NR_GENS) > continue; > > VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap)); > > if (inc_min_seq(lruvec, type, can_swap)) > continue; > > spin_unlock_irq(&lruvec->lru_lock); > cond_resched(); > goto restart; > } > ..... > } > > force_scan is not a parameter of inc_min_seq. > In our environment, swappiness is 0, so can_swap would be false. > > static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap) > { > int zone; > int remaining = MAX_LRU_BATCH; > struct lru_gen_folio *lrugen = &lruvec->lrugen; > int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); > > if (type == LRU_GEN_ANON && !can_swap) > goto done; > ...... > } > > If can_swap is false, would pass anon lru list. > > What's more, in passive scan, force_scan is also false. > > static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool can_swap) > { > ...... > /* skip this lruvec as it's low on cold folios */ > return try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false) ? -1 : 0; > } > > Is it a good idea to include a global parameter no_inversion, and modify inc_min_seq > like this: > > static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap) > { > int zone; > int remaining = MAX_LRU_BATCH; > struct lru_gen_folio *lrugen = &lruvec->lrugen; > int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); > > - if (type == LRU_GEN_ANON && !can_swap) > + if (type == LRU_GEN_ANON && !can_swap && !no_inversion) > goto done; > ...... > } > > -- > 2.43.0 > >
Thanks for replying. On Fri, Dec 22, 2023 at 07:16 AM Yuanchu Xie wrote: > How does the shared memory get charged to the cgroups? > Does it all go to cgroup A or B exclusively, or do some pages get > charged to each one? Some pages get charged to cgroup A, and the others get charged to cgroup B.
On Thu, Dec 21, 2023 at 7:45 PM Henry Huang <henry.hj@antgroup.com> wrote: > > Thanks for replyting. > > On Fri, Dec 22, 2023 at 07:16 AM Yuanchu Xie wrote: > > How does the shared memory get charged to the cgroups? > > Does it all go to cgroup A or B exclusively, or do some pages get > > charged to each one? > > Some pages get charged to cgroup A, and the other get charged to cgroup B. Just a side note: We can potentially "fix" this, but it doesn't mean this is a good practice. In fact, I think this is an anti-pattern to memcgs: resources should be preferably isolated between memcgs, or if a resource has to be shared between memcgs, it should be charged in a predetermined way, not randomly to one of the memcgs sharing it.
On Thu, 21 Dec 2023, Yu Zhao wrote: > > Thanks for replyting. > > > > On Fri, Dec 22, 2023 at 07:16 AM Yuanchu Xie wrote: > > > How does the shared memory get charged to the cgroups? > > > Does it all go to cgroup A or B exclusively, or do some pages get > > > charged to each one? > > > > Some pages get charged to cgroup A, and the other get charged to cgroup B. > > Just a side note: > We can potentially "fix" this, but it doesn't mean this is a good > practice. In fact, I think this is an anti-pattern to memcgs: > resources should be preferably isolated between memcgs, or if a > resource has to be shared between memcgs, it should be charged in a > predetermined way, not randomly to one of the memcgs sharing it. > Very interesting thread, a few questions for Henry to understand the situation better: - is the lack of predeterministic charging a problem for you? Are you initially faulting it in a manner that charges it to the "right" memcg and the refault of it after periodic reclaim can causing the charge to appear "randomly," i.e. to whichever process happened to access it next? - are pages ever shared between different memcg hierarchies? You mentioned sharing between processes in A and A/B, but I'm wondering if there is sharing between two different memcg hierarchies where root is the only common ancestor? - do you anticipate a shorter scan period at some point? Proactively reclaiming all memory colder than one hour is a long time :) Are you concerned at all about the cost of doing your current idle bit harvesting approach becoming too expensive if you significantly reduce the scan period? - is proactive reclaim being driven by writing to memory.reclaim, by enforcing a smaller memory.high, or something else? Looking forward to learning more about your particular issue.
Thanks for replying. On Fri, Dec 22, 2023 at 13:14 PM David Rientjes wrote: > - is the lack of predeterministic charging a problem for you? Are you > initially faulting it in a manner that charges it to the "right" memcg > and the refault of it after periodic reclaim can causing the charge to > appear "randomly," i.e. to whichever process happened to access it > next? Actually, at the beginning all pages are charged to cgroup A, but under memory pressure or after proactive reclaim some pages get dropped or swapped. If a task in cgroup B then visits this shared memory before a task in cgroup A, those pages get charged to cgroup B. This is common in our environment. > - are pages ever shared between different memcg hierarchies? You > mentioned sharing between processes in A and A/B, but I'm wondering > if there is sharing between two different memcg hierarchies where root > is the only common ancestor? Yes, there is another really common case: if the docker graph driver is overlayfs, different docker containers that use the same image, or share the same lower layers, share the file cache of common binaries or libraries (e.g. libc.so). > - do you anticipate a shorter scan period at some point? Proactively > reclaiming all memory colder than one hour is a long time :) Are you > concerned at all about the cost of doing your current idle bit > harvesting approach becoming too expensive if you significantly reduce > the scan period? We don't want the owners of applications to feel a significant performance downgrade when using swap, and there is a high risk in reclaiming pages whose idle age is less than 1 hour; we have internal tests and data analysis to support this. We disabled global swappiness and memcg swappiness, so only proactive reclaim can swap anon pages. What's more, we see that mglru has a more efficient way to scan pte accessed bits, so we would prefer to use the mglru scan to help us scan and select idle pages. > - is proactive reclaim being driven by writing to memory.reclaim, by > enforcing a smaller memory.high, or something else? Because all page info and idle ages are stored in userspace, the kernel can't get this information directly. We have a private patch that adds a new reclaim interface to support reclaiming pages with specific pfns.
On Fri, Dec 22, 2023 at 7:40 AM Henry Huang <henry.hj@antgroup.com> wrote: > > - are pages ever shared between different memcg hierarchies? You > > mentioned sharing between processes in A and A/B, but I'm wondering > > if there is sharing between two different memcg hierarchies where root > > is the only common ancestor? > > Yes, there is a another really common case: > If docker graph driver is overlayfs, different docker containers use the > same image, or share same low layers, would share file cache of public bin or > lib(i.e libc.so). Does this present a problem with setting memcg limits or OOMs? It seems like deterministically charging shared pages would be highly desirable. Mina Almasry previously proposed a memcg= mount option to implement deterministic charging[1], but it wasn't a generic sharing mechanism. Nonetheless, the problem remains, and it would be interesting to learn if this presents any issues for you. [1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/ > > > - do you anticipate a shorter scan period at some point? Proactively > > reclaiming all memory colder than one hour is a long time :) Are you > > concerned at all about the cost of doing your current idle bit > > harvesting approach becoming too expensive if you significantly reduce > > the scan period? > > We don't want the owner of the application to feel a significant > performance downgrade when using swap. There is a high risk to reclaim pages > which idle age are less than 1 hour. We have internal test and > data analysis to support it. > > We disabled global swappiness and memcg swapinness. > Only proactive reclaim can swap anon pages. > > What's more, we see that mglru has a more efficient way to scan pte access bit. > We perferred to use mglru scan help us scan and select idle pages. I'm working on a kernel driver/per-memcg interface to perform aging with MGLRU, including configuration for the MGLRU page scanning optimizations. I suspect scanning the PTE accessed bits for pages charged to a foreign memcg ad-hoc has some performance implications, and the more general solution is to charge in a predetermined way, which makes the scanning on behalf of the foreign memcg a bit cleaner. This is possible nonetheless, but a bit hacky. Let me know you have any ideas. > > > - is proactive reclaim being driven by writing to memory.reclaim, by > > enforcing a smaller memory.high, or something else? > > Because all pages info and idle age are stored in userspace, kernel can't get > these information directly. We have a private patch include a new reclaim interface > to support reclaim pages with specific pfns. Thanks for sharing! It's been enlightening to learn about different prod environments.
Thanks for replying. On Thu, Jan 11, 2024 at 03:24 AM Yuanchu Xie <yuanchu@google.com> wrote: > Does this present a problem with setting memcg limits or OOMs? It > seems like deterministically charging shared pages would be highly > desirable. Mina Almasry previously proposed a memcg= mount option to > implement deterministic charging[1], but it wasn't a generic sharing > mechanism. Nonetheless, the problem remains, and it would be > interesting to learn if this presents any issues for you. > > [1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/ In this case, the total size of the shared memory is usually small (< 100MB). What's more, almost all shared pages are file cache, so it doesn't present any problems. > I'm working on a kernel driver/per-memcg interface to perform aging > with MGLRU, including configuration for the MGLRU page scanning > optimizations. I suspect scanning the PTE accessed bits for pages > charged to a foreign memcg ad-hoc has some performance implications, > and the more general solution is to charge in a predetermined way, > which makes the scanning on behalf of the foreign memcg a bit cleaner. > This is possible nonetheless, but a bit hacky. Let me know you have > any ideas. Wow! A per-memcg interface is also what we need. It's hard to control users' behaviour in our environment; we can't guarantee that all processes sharing memory are in the same memcg. Maybe the kernel should provide a new interface to make the charging of shared pages more predictable, though I think that would add some overhead. The key problem for us is that we can't check whether such a page was accessed (no gen or ref changed) in this case: the page belongs to A, but a process in B may read/write this page more frequently, so we may treat a page that is actually hot as cold. Maybe just call folio_mark_accessed instead of folio_update_gen (which would need to hold the memcg lru lock) for those remote-memcg pages?
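A rough, untested sketch of that suggestion, purely to make the idea concrete. The extra "accessed" parameter and this exact placement are assumptions, not part of the posted patch, and whether folio_mark_accessed() is safe to call from the walker context would still need checking.

/*
 * Sketch: when the MGLRU walker finds a folio charged to a different memcg,
 * record the access through the classic referenced path instead of updating
 * its generation, so the access is not silently dropped.
 */
static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
				   struct pglist_data *pgdat, bool can_swap,
				   bool accessed)
{
	struct folio *folio;

	/* try to avoid unnecessary memory loads */
	if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
		return NULL;

	folio = pfn_folio(pfn);
	if (folio_nid(folio) != pgdat->node_id)
		return NULL;

	if (folio_memcg_rcu(folio) != memcg) {
		/* remote-memcg folio: don't touch its gen, but don't lose the access */
		if (accessed)
			folio_mark_accessed(folio);
		return NULL;
	}

	/* file VMAs can contain anon pages from COW */
	if (!folio_is_file_lru(folio) && !can_swap)
		return NULL;

	return folio;
}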
diff --git a/mm/swap.c b/mm/swap.c index cd8f0150ba3a..4bd14aabdc10 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -416,6 +416,9 @@ static void folio_inc_refs(struct folio *folio) { unsigned long new_flags, old_flags = READ_ONCE(folio->flags); + if (folio_test_idle(folio)) + folio_clear_idle(folio); + if (folio_test_unevictable(folio)) return; diff --git a/mm/vmscan.c b/mm/vmscan.c index 96abaa5a973e..c1dd9e7bc315 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3297,7 +3297,7 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned #endif static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg, - struct pglist_data *pgdat, bool can_swap) + struct pglist_data *pgdat, bool can_swap, bool clear_idle) { struct folio *folio; @@ -3306,6 +3306,10 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg, return NULL; folio = pfn_folio(pfn); + + if (clear_idle && folio_test_idle(folio)) + folio_clear_idle(folio); + if (folio_nid(folio) != pgdat->node_id) return NULL; @@ -3355,6 +3359,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, unsigned long pfn; struct folio *folio; pte_t ptent = ptep_get(pte + i); + bool is_pte_young; total++; walk->mm_stats[MM_LEAF_TOTAL]++; @@ -3363,16 +3368,20 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, if (pfn == -1) continue; - if (!pte_young(ptent)) { - walk->mm_stats[MM_LEAF_OLD]++; + is_pte_young = !!pte_young(ptent); + folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap, is_pte_young); + if (!folio) { + if (!is_pte_young) + walk->mm_stats[MM_LEAF_OLD]++; continue; } - folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap); - if (!folio) + if (!folio_test_clear_young(folio) && !is_pte_young) { + walk->mm_stats[MM_LEAF_OLD]++; continue; + } - if (!ptep_test_and_clear_young(args->vma, addr, pte + i)) + if (is_pte_young && !ptep_test_and_clear_young(args->vma, addr, pte + i)) VM_WARN_ON_ONCE(true); young++; @@ -3435,6 +3444,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area do { unsigned long pfn; struct folio *folio; + bool is_pmd_young; /* don't round down the first address */ addr = i ? 
(*first & PMD_MASK) + i * PMD_SIZE : *first; @@ -3449,11 +3459,15 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area goto next; } - folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap); + is_pmd_young = !!pmd_young(pmd[i]); + folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap, is_pmd_young); if (!folio) goto next; - if (!pmdp_test_and_clear_young(vma, addr, pmd + i)) + if (is_pmd_young && !pmdp_test_and_clear_young(vma, addr, pmd + i)) + VM_WARN_ON_ONCE(true); + + if (!folio_test_clear_young(folio) && !is_pmd_young) goto next; walk->mm_stats[MM_LEAF_YOUNG]++; @@ -4025,19 +4039,21 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) { unsigned long pfn; pte_t ptent = ptep_get(pte + i); + bool is_pte_young; pfn = get_pte_pfn(ptent, pvmw->vma, addr); if (pfn == -1) continue; - if (!pte_young(ptent)) + is_pte_young = !!pte_young(ptent); + folio = get_pfn_folio(pfn, memcg, pgdat, can_swap, is_pte_young); + if (!folio) continue; - folio = get_pfn_folio(pfn, memcg, pgdat, can_swap); - if (!folio) + if (!folio_test_clear_young(folio) && !is_pte_young) continue; - if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i)) + if (is_pte_young && !ptep_test_and_clear_young(pvmw->vma, addr, pte + i)) VM_WARN_ON_ONCE(true); young++;
The Multi-Gen LRU page-table walker clears the pte young flag, but it doesn't clear the page idle flag. When we use /sys/kernel/mm/page_idle/bitmap to check whether a page was accessed, it tells us the page is idle even though it has actually been accessed.

For unmapped filecache pages, the page idle flag is not cleared in folio_mark_accessed if Multi-Gen LRU is enabled, so we can't use /sys/kernel/mm/page_idle/bitmap to check whether a filecache page was read or written.

What's more, /sys/kernel/mm/page_idle/bitmap also clears the pte young flag; when a page has been accessed, it sets the page young flag instead. The Multi-Gen LRU page-table walker should therefore check both the page and pte young flags.

how-to-reproduce-problem

idle_page_track
    a tool to track the memory a process accessed during a specific time
usage
    idle_page_track $pid $time
how-it-works
    1. scan process VMAs from /proc/$pid/maps
    2. vfn --> pfn from /proc/$pid/pagemap
    3. write /sys/kernel/mm/page_idle/bitmap to
       mark the phy page idle flag and clear the pte young flag
    4. sleep $time
    5. read /sys/kernel/mm/page_idle/bitmap to
       test_and_clear the pte young flag and
       return whether the phy page was accessed

test ---- test program

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, const char *argv[])
{
	char *buf = NULL;
	char pipe_info[4096];
	int n;
	int fd = -1;

	buf = malloc(1024*1024*1024UL);
	memset(buf, 0, 1024*1024*1024UL);
	fd = open("access.pipe", O_RDONLY);
	if (fd < 0)
		goto out;
	while (1) {
		n = read(fd, pipe_info, sizeof(pipe_info));
		if (!n) {
			sleep(1);
			continue;
		} else if (n < 0) {
			break;
		}
		memset(buf, 0, 1024*1024*1024UL);
		puts("finish access");
	}
out:
	if (fd >= 0)
		close(fd);
	if (buf)
		free(buf);

	return 0;
}

prepare:
mkfifo access.pipe
./test
ps -ef | grep test
root      4106  3148  8 06:47 pts/0    00:00:01 ./test

We use /sys/kernel/debug/lru_gen to simulate the mglru page-table scan.

case 1: mglru walker breaks page_idle
./idle_page_track 4106 60 &
sleep 5; echo 1 > access.pipe
sleep 5; echo '+ 8 0 6 1 1' > /sys/kernel/debug/lru_gen

the output of idle_page_track is:
Est(s)     Ref(MB)
64.822     1.00
only 1MB was found to be accessed during the 64.822s, but actually 1024MB was accessed.

case 2: page_idle breaks mglru walker
echo 1 > access.pipe
./idle_page_track 4106 10
echo '+ 8 0 7 1 1' > /sys/kernel/debug/lru_gen

lru gen status:
memcg     8   /user.slice
 node     0
          5     772458      1065      9735
          6     737435    262244        72
          7     538053      1184       632
          8      59404      6422         0
almost all pages should be in the max_seq-1 queue, but actually they are not.

Signed-off-by: Henry Huang <henry.hj@antgroup.com>
---
 mm/swap.c   |  3 +++
 mm/vmscan.c | 40 ++++++++++++++++++++++++++++------------
 2 files changed, 31 insertions(+), 12 deletions(-)
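For reference, step 2 of the how-to-reproduce recipe above (vfn --> pfn via /proc/$pid/pagemap) can be done with a small helper like the sketch below. The entry layout (bit 63 = present, bits 0-54 = PFN, PFNs hidden without CAP_SYS_ADMIN) follows Documentation/admin-guide/mm/pagemap.rst; error handling is trimmed and the helper name is ours.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define PAGE_SHIFT	12	/* assumption: 4K pages */

/* Look up the PFN backing a virtual address; returns 0 if not present. */
static uint64_t vaddr_to_pfn(int pagemap_fd, uint64_t vaddr)
{
	uint64_t entry;
	off_t off = (vaddr >> PAGE_SHIFT) * sizeof(entry);

	if (pread(pagemap_fd, &entry, sizeof(entry), off) != sizeof(entry))
		return 0;
	if (!(entry & (1ULL << 63)))		/* not present (may be swapped) */
		return 0;
	return entry & ((1ULL << 55) - 1);	/* bits 0-54: PFN */
}

int main(int argc, char **argv)
{
	char path[64];
	uint64_t vaddr, pfn;
	int fd;

	if (argc < 3)
		return 1;
	snprintf(path, sizeof(path), "/proc/%s/pagemap", argv[1]);
	vaddr = strtoull(argv[2], NULL, 0);
	fd = open(path, O_RDONLY);		/* needs CAP_SYS_ADMIN to see PFNs */
	if (fd < 0)
		return 1;
	pfn = vaddr_to_pfn(fd, vaddr);
	printf("vaddr %#llx -> pfn %#llx\n",
	       (unsigned long long)vaddr, (unsigned long long)pfn);
	close(fd);
	return 0;
}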