Message ID | 951fb7edab535cf522def4f5f2613947ed7b7d28.1701853894.git.henry.hj@antgroup.com (mailing list archive)
---|---
State | New
Series | [RFC,v2] mm: Multi-Gen LRU: fix use mm/page_idle/bitmap
On Wed, Dec 6, 2023 at 5:51 AM Henry Huang <henry.hj@antgroup.com> wrote: > > Multi-Gen LRU page-table walker clears pte young flag, but it doesn't > clear page idle flag. When we use /sys/kernel/mm/page_idle/bitmap to check > whether one page is accessed, it would tell us this page is idle, > but actually this page has been accessed. > > For those unmapped filecache pages, page idle flag would not been > cleared in folio_mark_accessed if Multi-Gen LRU is enabled. > So we couln't use /sys/kernel/mm/page_idle/bitmap to check whether > a filecache page is read or written. > > What's more, /sys/kernel/mm/page_idle/bitmap also clears pte young flag. > If one page is accessed, it would set page young flag. Multi-Gen LRU > page-table walker should check both page&pte young flags. > > how-to-reproduce-problem > > idle_page_track > a tools to track process accessed memory during a specific time > usage > idle_page_track $pid $time > how-it-works > 1. scan process vma from /proc/$pid/maps > 2. vfn --> pfn from /proc/$pid/pagemap > 3. write /sys/kernel/mm/page_idle/bitmap to > mark phy page idle flag and clear pte young flag > 4. sleep $time > 5. read /sys/kernel/mm/page_idle/bitmap to > test_and_clear pte young flag and > return whether phy page is accessed > > test ---- test program > > #include <stdio.h> > #include <stdlib.h> > #include <string.h> > #include <unistd.h> > #include <sys/types.h> > #include <sys/stat.h> > #include <fcntl.h> > > int main(int argc, const char *argv[]) > { > char *buf = NULL; > char pipe_info[4096]; > int n; > int fd = -1; > > buf = malloc(1024*1024*1024UL); > memset(buf, 0, 1024*1024*1024UL); > fd = open("access.pipe", O_RDONLY); > if (fd < 0) > goto out; > while (1) { > n = read(fd, pipe_info, sizeof(pipe_info)); > if (!n) { > sleep(1); > continue; > } else if (n < 0) { > break; > } > memset(buf, 0, 1024*1024*1024UL); > puts("finish access"); > } > out: > if (fd >=0) > close(fd); > if (buf) > free(buf); > > return 0; > } > > prepare: > mkfifo access.pipe > ./test > ps -ef | grep test > root 4106 3148 8 06:47 pts/0 00:00:01 ./test > > We use /sys/kernel/debug/lru_gen to simulate mglru page-table scan. > > case 1: mglru walker break page_idle > ./idle_page_track 4106 60 & > sleep 5; echo 1 > access.pipe > sleep 5; echo '+ 8 0 6 1 1' > /sys/kernel/debug/lru_gen > > the output of idle_page_track is: > Est(s) Ref(MB) > 64.822 1.00 > only found 1MB were accessed during 64.822s, but actually 1024MB were > accessed. > > case 2: page_idle break mglru walker > echo 1 > access.pipe > ./idle_page_track 4106 10 > echo '+ 8 0 7 1 1' > /sys/kernel/debug/lru_gen > lru gen status: > memcg 8 /user.slice > node 0 > 5 772458 1065 9735 > 6 737435 262244 72 > 7 538053 1184 632 > 8 59404 6422 0 > almost pages should be in max_seq-1 queue, but actually not. > > Signed-off-by: Henry Huang <henry.hj@antgroup.com> It's never intended for MGLRU to support page_idle/bitmap or PG_idle/young because: 1. page_idle/bitmap isn't a capable interface at all -- yes, Google proposed the idea [1], but we don't really use it anymore because of its poor scalability. 2. PG_idle/young, being a boolean value, has poor granularity. If anyone must use page_idle/bitmap for some specific reason, I'd recommend exporting generation numbers instead. [1] https://lore.kernel.org/cover.1426706637.git.vdavydov@parallels.com/
Thanks for replying to this RFC. > 1. page_idle/bitmap isn't a capable interface at all -- yes, Google > proposed the idea [1], but we don't really use it anymore because of > its poor scalability. In our environment, we use /sys/kernel/mm/page_idle/bitmap to check whether pages were accessed during a period of time. We manage the idle time of all pages in userspace, then use a prediction algorithm to select pages to reclaim. These pages are more likely to stay idle for a long time. We only need the kernel to tell us whether a page was accessed; a boolean value in the kernel is enough for our case. > 2. PG_idle/young, being a boolean value, has poor granularity. If > anyone must use page_idle/bitmap for some specific reason, I'd > recommend exporting generation numbers instead. Yes, at first we tried using the multi-gen LRU proactive scan and exporting the generation & refs numbers to do the same thing. But there are several problems: 1. Multi-gen LRU only cares about pages in its own memcg. In our environment, it is common for processes in different memcgs to share pages. Multi-gen LRU only updates the gen of pages in the current memcg, so it's hard to judge whether a page was accessed based on gen updates alone. We still have no idea how to solve this problem. 2. We set swappiness to 0 and use proactive scan to select cold pages & proactive reclaim to swap anon pages. But we can't control the passive scan (can_swap = false), which causes cold/hot inversion of anon pages in inc_min_seq. Would it be a good idea to add an interface to configure the passive scan behaviour?
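For reference, a minimal userspace sketch of the page_idle/bitmap round trip described above (mark a PFN idle, wait, check whether it is still idle). The helper names are ours and error handling is trimmed; the bitmap layout (one bit per PFN, read/written in aligned 8-byte, 64-PFN words) follows Documentation/admin-guide/mm/idle_page_tracking.rst.

/*
 * Sketch only: mark the 64-PFN word containing a PFN idle (the write path
 * walks rmap and clears the PTE accessed bits), then read it back later to
 * see whether the page was accessed in between. Needs CAP_SYS_ADMIN.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BITMAP_PATH "/sys/kernel/mm/page_idle/bitmap"

/* Mark all 64 PFNs in the word containing @pfn idle. */
static int mark_idle(int fd, uint64_t pfn)
{
	uint64_t word = ~0ULL;

	return pwrite(fd, &word, sizeof(word), (pfn >> 6) * 8) == sizeof(word) ? 0 : -1;
}

/* Return 1 if @pfn is still idle (not accessed since mark_idle()), 0 if accessed. */
static int is_idle(int fd, uint64_t pfn)
{
	uint64_t word;

	if (pread(fd, &word, sizeof(word), (pfn >> 6) * 8) != sizeof(word))
		return -1;
	return (word >> (pfn & 63)) & 1;
}

int main(int argc, char **argv)
{
	uint64_t pfn;
	int fd;

	if (argc < 2)
		return 1;
	pfn = strtoull(argv[1], NULL, 0);	/* PFN, e.g. from /proc/$pid/pagemap */
	fd = open(BITMAP_PATH, O_RDWR);
	if (fd < 0)
		return 1;
	mark_idle(fd, pfn);
	sleep(60);				/* observation window */
	printf("pfn %#llx %s\n", (unsigned long long)pfn,
	       is_idle(fd, pfn) ? "idle" : "accessed");
	close(fd);
	return 0;
}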
On Fri, Dec 8, 2023 at 12:12 AM Henry Huang <henry.hj@antgroup.com> wrote: > > Thanks for replying this RFC. > > > 1. page_idle/bitmap isn't a capable interface at all -- yes, Google > > proposed the idea [1], but we don't really use it anymore because of > > its poor scalability. > > In our environment, we use /sys/kernel/mm/page_idle/bitmap to check > pages whether were accessed during a peroid of time. Is it a production environment? If so, what's your 1. scan interval 2. memory size I'm trying to understand why scalability isn't a problem for you. On an average server, there are hundreds of millions of PFNs, so it'd be very expensive to use that ABI even for a time interval of minutes. > We manage all pages > idle time in userspace. Then use a prediction algorithm to select pages > to reclaim. These pages would more likely be idled for a long time. "There is a system in place now that is based on a user-space process that reads a bitmap stored in sysfs, but it has a high CPU and memory overhead, so a new approach is being tried." https://lwn.net/Articles/787611/ Could you elaborate how you solved this problem? > We only need kernel to tell use whether a page is accessed, a boolean > value in kernel is enough for our case. How do you define "accessed"? I.e., through page tables or file descriptors or both? > > 2. PG_idle/young, being a boolean value, has poor granularity. If > > anyone must use page_idle/bitmap for some specific reason, I'd > > recommend exporting generation numbers instead. > > Yes, at first time, we try using multi-gen LRU proactvie scan and > exporting generation&refs number to do the same thing. > > But there are serveral problems: > > 1. multi-gen LRU only care about self-memcg pages. In our environment, > it's likely to see that different memcg's process share pages. This is related to my question above: are those pages mapped into different memcgs or not? > multi-gen LRU only update gen of pages in current memcg. It's hard to > judge a page whether is accessed depends on gen update. This depends. I'd be glad to elaborate after you clarify the above. > We still have no ideas how to solve this problem. > > 2. We set swappiness 0, and use proactive scan to select cold pages > & proactive reclaim to swap anon pages. But we can't control passive > scan(can_swap = false), which would make anon pages cold/hot inversion > in inc_min_seq. There is an option to prevent the inversion, IIUC, the force_scan option is what you are looking for.
On Wed, Dec 6, 2023 at 5:51 AM Henry Huang <henry.hj@antgroup.com> wrote: > > Multi-Gen LRU page-table walker clears pte young flag, but it doesn't > clear page idle flag. When we use /sys/kernel/mm/page_idle/bitmap to check > whether one page is accessed, it would tell us this page is idle, > but actually this page has been accessed. > > For those unmapped filecache pages, page idle flag would not been > cleared in folio_mark_accessed if Multi-Gen LRU is enabled. > So we couln't use /sys/kernel/mm/page_idle/bitmap to check whether > a filecache page is read or written. > > What's more, /sys/kernel/mm/page_idle/bitmap also clears pte young flag. > If one page is accessed, it would set page young flag. Multi-Gen LRU > page-table walker should check both page&pte young flags. > > how-to-reproduce-problem > > idle_page_track > a tools to track process accessed memory during a specific time > usage > idle_page_track $pid $time > how-it-works > 1. scan process vma from /proc/$pid/maps > 2. vfn --> pfn from /proc/$pid/pagemap > 3. write /sys/kernel/mm/page_idle/bitmap to > mark phy page idle flag and clear pte young flag > 4. sleep $time > 5. read /sys/kernel/mm/page_idle/bitmap to > test_and_clear pte young flag and > return whether phy page is accessed > > test ---- test program > > #include <stdio.h> > #include <stdlib.h> > #include <string.h> > #include <unistd.h> > #include <sys/types.h> > #include <sys/stat.h> > #include <fcntl.h> > > int main(int argc, const char *argv[]) > { > char *buf = NULL; > char pipe_info[4096]; > int n; > int fd = -1; > > buf = malloc(1024*1024*1024UL); > memset(buf, 0, 1024*1024*1024UL); > fd = open("access.pipe", O_RDONLY); > if (fd < 0) > goto out; > while (1) { > n = read(fd, pipe_info, sizeof(pipe_info)); > if (!n) { > sleep(1); > continue; > } else if (n < 0) { > break; > } > memset(buf, 0, 1024*1024*1024UL); > puts("finish access"); > } > out: > if (fd >=0) > close(fd); > if (buf) > free(buf); > > return 0; > } > > prepare: > mkfifo access.pipe > ./test > ps -ef | grep test > root 4106 3148 8 06:47 pts/0 00:00:01 ./test > > We use /sys/kernel/debug/lru_gen to simulate mglru page-table scan. > > case 1: mglru walker break page_idle > ./idle_page_track 4106 60 & > sleep 5; echo 1 > access.pipe > sleep 5; echo '+ 8 0 6 1 1' > /sys/kernel/debug/lru_gen > > the output of idle_page_track is: > Est(s) Ref(MB) > 64.822 1.00 > only found 1MB were accessed during 64.822s, but actually 1024MB were > accessed. > > case 2: page_idle break mglru walker > echo 1 > access.pipe > ./idle_page_track 4106 10 > echo '+ 8 0 7 1 1' > /sys/kernel/debug/lru_gen > lru gen status: > memcg 8 /user.slice > node 0 > 5 772458 1065 9735 > 6 737435 262244 72 > 7 538053 1184 632 > 8 59404 6422 0 > almost pages should be in max_seq-1 queue, but actually not. > > Signed-off-by: Henry Huang <henry.hj@antgroup.com> Regarding the change itself, it'd cause a slight regression to other use cases (details below). > @@ -3355,6 +3359,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, > unsigned long pfn; > struct folio *folio; > pte_t ptent = ptep_get(pte + i); > + bool is_pte_young; > > total++; > walk->mm_stats[MM_LEAF_TOTAL]++; > @@ -3363,16 +3368,20 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, > if (pfn == -1) > continue; > > - if (!pte_young(ptent)) { > - walk->mm_stats[MM_LEAF_OLD]++; Most overhead from page table scanning normally comes from get_pfn_folio() because it almost always causes a cache miss. 
This is like a pointer dereference, whereas scanning PTEs is like streaming an array (bad vs good cache performance). pte_young() is here to avoid an unnecessary cache miss from get_pfn_folio(). Also see the first comment in get_pfn_folio(). It should be easy to verify the regression -- FlameGraph from the memcached benchmark in the original commit message should do it. Would a tracepoint here work for you? > + is_pte_young = !!pte_young(ptent); > + folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap, is_pte_young); > + if (!folio) { > + if (!is_pte_young) > + walk->mm_stats[MM_LEAF_OLD]++; > continue; > } > > - folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap); > - if (!folio) > + if (!folio_test_clear_young(folio) && !is_pte_young) { > + walk->mm_stats[MM_LEAF_OLD]++; > continue; > + } > > - if (!ptep_test_and_clear_young(args->vma, addr, pte + i)) > + if (is_pte_young && !ptep_test_and_clear_young(args->vma, addr, pte + i)) > VM_WARN_ON_ONCE(true); > > young++;
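For reference, a tracepoint at that spot might look roughly like the sketch below. The event name, fields and header file are purely illustrative, not an existing upstream tracepoint; the call site would be a single trace_mglru_walk_pte(addr, pfn, pte_young(ptent)) in walk_pte_range() before the pte_young() check.

/* Illustrative only: a minimal MGLRU walker tracepoint definition. */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM mglru

#if !defined(_TRACE_MGLRU_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_MGLRU_H

#include <linux/tracepoint.h>

TRACE_EVENT(mglru_walk_pte,

	TP_PROTO(unsigned long addr, unsigned long pfn, bool young),

	TP_ARGS(addr, pfn, young),

	TP_STRUCT__entry(
		__field(unsigned long,	addr)
		__field(unsigned long,	pfn)
		__field(bool,		young)
	),

	TP_fast_assign(
		__entry->addr	= addr;
		__entry->pfn	= pfn;
		__entry->young	= young;
	),

	TP_printk("addr=%#lx pfn=%#lx young=%d",
		  __entry->addr, __entry->pfn, __entry->young)
);

#endif /* _TRACE_MGLRU_H */

/* This part must be outside protection */
#include <trace/define_trace.h>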
On Fri, Dec 15, 2023 at 14:46 PM Yu Zhao <yuzhao@google.com> wrote: > > > > Thanks for replying this RFC. > > > > > 1. page_idle/bitmap isn't a capable interface at all -- yes, Google > > > proposed the idea [1], but we don't really use it anymore because of > > > its poor scalability. > > > > In our environment, we use /sys/kernel/mm/page_idle/bitmap to check > > pages whether were accessed during a peroid of time. > > Is it a production environment? If so, what's your > 1. scan interval > 2. memory size > I'm trying to understand why scalability isn't a problem for you. On > an average server, there are hundreds of millions of PFNs, so it'd be > very expensive to use that ABI even for a time interval of minutes. Thanks for replying. Our scan interval is 10 minutes and total memory size is 512GB. We perferred to reclaim pages which idle age > 1 hour at least. > > We manage all pages > > idle time in userspace. Then use a prediction algorithm to select pages > > to reclaim. These pages would more likely be idled for a long time. > "There is a system in place now that is based on a user-space process > that reads a bitmap stored in sysfs, but it has a high CPU and memory > overhead, so a new approach is being tried." > https://lwn.net/Articles/787611/ > > Could you elaborate how you solved this problem? In out environment, we found that we take average 0.4 core and 300MB memory to do scan, basic analyse and reclaim idle pages. For reducing cpu & memroy usage, we do: 1. We implement a ratelimiter to control rate of scan and reclaim. 2. All pages info & idle age were stored in local DB file. Our prediction algorithm don't need all pages info in memory at the same time. In out environment, about 1/3 memory was attemped to allocate as THP, which may save some cpu usage of scan. > > We only need kernel to tell use whether a page is accessed, a boolean > > value in kernel is enough for our case. > > How do you define "accessed"? I.e., through page tables or file > descriptors or both? both > > > 2. PG_idle/young, being a boolean value, has poor granularity. If > > > anyone must use page_idle/bitmap for some specific reason, I'd > > > recommend exporting generation numbers instead. > > > > Yes, at first time, we try using multi-gen LRU proactvie scan and > > exporting generation&refs number to do the same thing. > > > > But there are serveral problems: > > > > 1. multi-gen LRU only care about self-memcg pages. In our environment, > > it's likely to see that different memcg's process share pages. > > This is related to my question above: are those pages mapped into > different memcgs or not? There is a case: There are two cgroup A, B (B is child cgroup of A) Process in A create a file and use mmap to read/write this file. Process in B mmap this file and usually read this file. > > We still have no ideas how to solve this problem. > > > > 2. We set swappiness 0, and use proactive scan to select cold pages > > & proactive reclaim to swap anon pages. But we can't control passive > > scan(can_swap = false), which would make anon pages cold/hot inversion > > in inc_min_seq. > > There is an option to prevent the inversion, IIUC, the force_scan > option is what you are looking for. It seems that doesn't work now. static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan) { ...... 
for (type = ANON_AND_FILE - 1; type >= 0; type--) { if (get_nr_gens(lruvec, type) != MAX_NR_GENS) continue; VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap)); if (inc_min_seq(lruvec, type, can_swap)) continue; spin_unlock_irq(&lruvec->lru_lock); cond_resched(); goto restart; } ..... } force_scan is not a parameter of inc_min_seq. In our environment, swappiness is 0, so can_swap would be false. static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap) { int zone; int remaining = MAX_LRU_BATCH; struct lru_gen_folio *lrugen = &lruvec->lrugen; int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); if (type == LRU_GEN_ANON && !can_swap) goto done; ...... } If can_swap is false, would pass anon lru list. What's more, in passive scan, force_scan is also false. static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool can_swap) { ...... /* skip this lruvec as it's low on cold folios */ return try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false) ? -1 : 0; } Is it a good idea to include a global parameter no_inversion, and modify inc_min_seq like this: static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap) { int zone; int remaining = MAX_LRU_BATCH; struct lru_gen_folio *lrugen = &lruvec->lrugen; int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); - if (type == LRU_GEN_ANON && !can_swap) + if (type == LRU_GEN_ANON && !can_swap && !no_inversion) goto done; ...... }
On Fri, Dec 15, 2023 at 15:23 PM Yu Zhao <yuzhao@google.com> wrote: >Regarding the change itself, it'd cause a slight regression to other >use cases (details below). > > > @@ -3355,6 +3359,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, > > unsigned long pfn; > > struct folio *folio; > > pte_t ptent = ptep_get(pte + i); > > + bool is_pte_young; > > > > total++; > > walk->mm_stats[MM_LEAF_TOTAL]++; > > @@ -3363,16 +3368,20 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, > > if (pfn == -1) > > continue; > > > > - if (!pte_young(ptent)) { > > - walk->mm_stats[MM_LEAF_OLD]++; > > Most overhead from page table scanning normally comes from > get_pfn_folio() because it almost always causes a cache miss. This is > like a pointer dereference, whereas scanning PTEs is like streaming an > array (bad vs good cache performance). > > pte_young() is here to avoid an unnecessary cache miss from > get_pfn_folio(). Also see the first comment in get_pfn_folio(). It > should be easy to verify the regression -- FlameGraph from the > memcached benchmark in the original commit message should do it. > > Would a tracepoint here work for you? > > > > > + is_pte_young = !!pte_young(ptent); > > + folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap, is_pte_young); > > + if (!folio) { > > + if (!is_pte_young) > > + walk->mm_stats[MM_LEAF_OLD]++; > > continue; > > } > > > > - folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap); > > - if (!folio) > > + if (!folio_test_clear_young(folio) && !is_pte_young) { > > + walk->mm_stats[MM_LEAF_OLD]++; > > continue; > > + } > > > > - if (!ptep_test_and_clear_young(args->vma, addr, pte + i)) > > + if (is_pte_young && !ptep_test_and_clear_young(args->vma, addr, pte + i)) > > VM_WARN_ON_ONCE(true); > > > > young++; Thanks for replying. For avoiding below: 1. confict between page_idle/bitmap and mglru scan 2. performance downgrade in mglru page-table scan if call get_pfn_folio for each pte. We have a new idea: 1. Include a new api under /sys/kernel/mm/page_idle, support mark idle flag only, without rmap walking or clearing pte young. 2. Use mglru proactive scan to clear page idle flag. workflows: t1 t2 mark pages idle mglru scan and check pages idle It's easy for us to know that whether a page is accessed during t1~t2. Some code changes like these: We clear idle flags in get_pfn_folio, and in walk_pte_range we still follow original design. static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg, struct pglist_data *pgdat, bool can_swap, bool clear_idle) { struct folio *folio; /* try to avoid unnecessary memory loads */ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat)) return NULL; folio = pfn_folio(pfn); + + if (clear_idle && folio_test_idle(folio)) + folio_clear_idle(folio); + if (folio_nid(folio) != pgdat->node_id) return NULL; if (folio_memcg_rcu(folio) != memcg) return NULL; /* file VMAs can contain anon pages from COW */ if (!folio_is_file_lru(folio) && !can_swap) return NULL; return folio; } static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, struct mm_walk *args) { ... 
for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) { unsigned long pfn; struct folio *folio; pte_t ptent = ptep_get(pte + i); total++; walk->mm_stats[MM_LEAF_TOTAL]++; pfn = get_pte_pfn(ptent, args->vma, addr); if (pfn == -1) continue; if (!pte_young(ptent)) { walk->mm_stats[MM_LEAF_OLD]++; continue; } + folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap, true); - folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap); if (!folio) continue; ... } Is it a good idea or not ?
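To make the proposed t1/t2 workflow concrete, a userspace sketch is shown below under two assumptions that do NOT exist in any kernel today: a mark-only page_idle file (called "bitmap_mark_only" here purely for illustration) and the proposed behaviour where the MGLRU walker clears PG_idle. The lru_gen debugfs command is the same one already used elsewhere in this thread; error handling is trimmed.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_str(const char *path, const char *s)
{
	int fd = open(path, O_WRONLY);
	int ret = -1;

	if (fd >= 0) {
		ret = write(fd, s, strlen(s)) == (ssize_t)strlen(s) ? 0 : -1;
		close(fd);
	}
	return ret;
}

int main(void)
{
	uint64_t word = ~0ULL, after;
	/* hypothetical mark-only interface; today only "bitmap" exists */
	int fd = open("/sys/kernel/mm/page_idle/bitmap_mark_only", O_RDWR);

	if (fd < 0)
		return 1;
	/* t1: mark the first 64 PFNs idle, without touching PTE young bits */
	pwrite(fd, &word, sizeof(word), 0);
	sleep(600);					/* observation window */
	/* t2: trigger MGLRU aging of memcg 8, node 0, up to max_gen 7 */
	write_str("/sys/kernel/debug/lru_gen", "+ 8 0 7 1 1");
	/* bits cleared by the walker belong to pages accessed in t1..t2 */
	pread(fd, &after, sizeof(after), 0);
	printf("accessed mask: %#llx\n", (unsigned long long)(word & ~after));
	close(fd);
	return 0;
}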
On Fri, Dec 15, 2023 at 3:53 AM Henry Huang <henry.hj@antgroup.com> wrote: > > On Fri, Dec 15, 2023 at 14:46 PM Yu Zhao <yuzhao@google.com> wrote: > > > > > > Thanks for replying this RFC. > > > > > > > 1. page_idle/bitmap isn't a capable interface at all -- yes, Google > > > > proposed the idea [1], but we don't really use it anymore because of > > > > its poor scalability. > > > > > > In our environment, we use /sys/kernel/mm/page_idle/bitmap to check > > > pages whether were accessed during a peroid of time. > > > > Is it a production environment? If so, what's your > > 1. scan interval > > 2. memory size > > > I'm trying to understand why scalability isn't a problem for you. On > > an average server, there are hundreds of millions of PFNs, so it'd be > > very expensive to use that ABI even for a time interval of minutes. > > Thanks for replying. > > Our scan interval is 10 minutes and total memory size is 512GB. > We perferred to reclaim pages which idle age > 1 hour at least. Yes, that makes sense. We have similar use cases, details below. > > > We manage all pages > > > idle time in userspace. Then use a prediction algorithm to select pages > > > to reclaim. These pages would more likely be idled for a long time. > > > "There is a system in place now that is based on a user-space process > > that reads a bitmap stored in sysfs, but it has a high CPU and memory > > overhead, so a new approach is being tried." > > https://lwn.net/Articles/787611/ > > > > Could you elaborate how you solved this problem? > > In out environment, we found that we take average 0.4 core and 300MB memory > to do scan, basic analyse and reclaim idle pages. > > For reducing cpu & memroy usage, we do: > 1. We implement a ratelimiter to control rate of scan and reclaim. > 2. All pages info & idle age were stored in local DB file. Our prediction > algorithm don't need all pages info in memory at the same time. > > In out environment, about 1/3 memory was attemped to allocate as THP, > which may save some cpu usage of scan. > > > > We only need kernel to tell use whether a page is accessed, a boolean > > > value in kernel is enough for our case. > > > > How do you define "accessed"? I.e., through page tables or file > > descriptors or both? > > both > > > > > 2. PG_idle/young, being a boolean value, has poor granularity. If > > > > anyone must use page_idle/bitmap for some specific reason, I'd > > > > recommend exporting generation numbers instead. > > > > > > Yes, at first time, we try using multi-gen LRU proactvie scan and > > > exporting generation&refs number to do the same thing. > > > > > > But there are serveral problems: > > > > > > 1. multi-gen LRU only care about self-memcg pages. In our environment, > > > it's likely to see that different memcg's process share pages. > > > > This is related to my question above: are those pages mapped into > > different memcgs or not? > > There is a case: > There are two cgroup A, B (B is child cgroup of A) > Process in A create a file and use mmap to read/write this file. > Process in B mmap this file and usually read this file. Yes, actually we have a private patch to solve a similar problem. Basically it finds VMAs from other processes in different memcgs that share a mapping and jumps to those VMAs to scan them. We can upstream it for you if you find it useful too. > > > We still have no ideas how to solve this problem. > > > > > > 2. We set swappiness 0, and use proactive scan to select cold pages > > > & proactive reclaim to swap anon pages. 
But we can't control passive > > > scan(can_swap = false), which would make anon pages cold/hot inversion > > > in inc_min_seq. > > > > There is an option to prevent the inversion, IIUC, the force_scan > > option is what you are looking for. > > It seems that doesn't work now. > > static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan) > { > ...... > for (type = ANON_AND_FILE - 1; type >= 0; type--) { > if (get_nr_gens(lruvec, type) != MAX_NR_GENS) > continue; > > VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap)); > > if (inc_min_seq(lruvec, type, can_swap)) > continue; > > spin_unlock_irq(&lruvec->lru_lock); > cond_resched(); > goto restart; > } > ..... > } > > force_scan is not a parameter of inc_min_seq. > In our environment, swappiness is 0, so can_swap would be false. > > static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap) > { > int zone; > int remaining = MAX_LRU_BATCH; > struct lru_gen_folio *lrugen = &lruvec->lrugen; > int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); > > if (type == LRU_GEN_ANON && !can_swap) > goto done; > ...... > } > > If can_swap is false, would pass anon lru list. > > What's more, in passive scan, force_scan is also false. Ok, I see what you mean. (I thought "passive" means proactive scans triggered by the debugfs interface, but it actually means "reactive" scans triggered by memory pressure.) We actually have a private patch too to solve this. But there is a corner case here: that private change, which is essentially the same as what you suggested, can stall direct reclaim when there is tons of cold anon memory. E.g., if there is 300GB anon memory in the oldest generation which can't be swapped, calling inc_min_seq() with can_swap being true would stall the direct reclaim. Does it make sense? Let me check the state of those private patches and get back to you in a couple of days. Thanks!
On Sun, Dec 17, 2023 at 05:07 AM Yu Zhao <yuzhao@google.com> wrote: > Let me check the state of those private patches and get back to you in > a couple of days. Thanks! Thanks very much! Looking forward to your patches.
Hi Henry, I have a question on memcg charging for the shared pages. Replied inline. On Fri, Dec 15, 2023 at 2:53 AM Henry Huang <henry.hj@antgroup.com> wrote: > > On Fri, Dec 15, 2023 at 14:46 PM Yu Zhao <yuzhao@google.com> wrote: > > > > > > Thanks for replying this RFC. > > > > > > > 1. page_idle/bitmap isn't a capable interface at all -- yes, Google > > > > proposed the idea [1], but we don't really use it anymore because of > > > > its poor scalability. > > > > > > In our environment, we use /sys/kernel/mm/page_idle/bitmap to check > > > pages whether were accessed during a peroid of time. > > > > Is it a production environment? If so, what's your > > 1. scan interval > > 2. memory size > > > I'm trying to understand why scalability isn't a problem for you. On > > an average server, there are hundreds of millions of PFNs, so it'd be > > very expensive to use that ABI even for a time interval of minutes. > > Thanks for replying. > > Our scan interval is 10 minutes and total memory size is 512GB. > We perferred to reclaim pages which idle age > 1 hour at least. > > > > We manage all pages > > > idle time in userspace. Then use a prediction algorithm to select pages > > > to reclaim. These pages would more likely be idled for a long time. > > > "There is a system in place now that is based on a user-space process > > that reads a bitmap stored in sysfs, but it has a high CPU and memory > > overhead, so a new approach is being tried." > > https://lwn.net/Articles/787611/ > > > > Could you elaborate how you solved this problem? > > In out environment, we found that we take average 0.4 core and 300MB memory > to do scan, basic analyse and reclaim idle pages. > > For reducing cpu & memroy usage, we do: > 1. We implement a ratelimiter to control rate of scan and reclaim. > 2. All pages info & idle age were stored in local DB file. Our prediction > algorithm don't need all pages info in memory at the same time. > > In out environment, about 1/3 memory was attemped to allocate as THP, > which may save some cpu usage of scan. > > > > We only need kernel to tell use whether a page is accessed, a boolean > > > value in kernel is enough for our case. > > > > How do you define "accessed"? I.e., through page tables or file > > descriptors or both? > > both > > > > > 2. PG_idle/young, being a boolean value, has poor granularity. If > > > > anyone must use page_idle/bitmap for some specific reason, I'd > > > > recommend exporting generation numbers instead. > > > > > > Yes, at first time, we try using multi-gen LRU proactvie scan and > > > exporting generation&refs number to do the same thing. > > > > > > But there are serveral problems: > > > > > > 1. multi-gen LRU only care about self-memcg pages. In our environment, > > > it's likely to see that different memcg's process share pages. > > > > This is related to my question above: are those pages mapped into > > different memcgs or not? > > There is a case: > There are two cgroup A, B (B is child cgroup of A) > Process in A create a file and use mmap to read/write this file. > Process in B mmap this file and usually read this file.\ How does the shared memory get charged to the cgroups? Does it all go to cgroup A or B exclusively, or do some pages get charged to each one? > > > > We still have no ideas how to solve this problem. > > > > > > 2. We set swappiness 0, and use proactive scan to select cold pages > > > & proactive reclaim to swap anon pages. 
But we can't control passive > > > scan(can_swap = false), which would make anon pages cold/hot inversion > > > in inc_min_seq. > > > > There is an option to prevent the inversion, IIUC, the force_scan > > option is what you are looking for. > > It seems that doesn't work now. > > static void inc_max_seq(struct lruvec *lruvec, bool can_swap, bool force_scan) > { > ...... > for (type = ANON_AND_FILE - 1; type >= 0; type--) { > if (get_nr_gens(lruvec, type) != MAX_NR_GENS) > continue; > > VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap)); > > if (inc_min_seq(lruvec, type, can_swap)) > continue; > > spin_unlock_irq(&lruvec->lru_lock); > cond_resched(); > goto restart; > } > ..... > } > > force_scan is not a parameter of inc_min_seq. > In our environment, swappiness is 0, so can_swap would be false. > > static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap) > { > int zone; > int remaining = MAX_LRU_BATCH; > struct lru_gen_folio *lrugen = &lruvec->lrugen; > int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); > > if (type == LRU_GEN_ANON && !can_swap) > goto done; > ...... > } > > If can_swap is false, would pass anon lru list. > > What's more, in passive scan, force_scan is also false. > > static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool can_swap) > { > ...... > /* skip this lruvec as it's low on cold folios */ > return try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, false) ? -1 : 0; > } > > Is it a good idea to include a global parameter no_inversion, and modify inc_min_seq > like this: > > static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap) > { > int zone; > int remaining = MAX_LRU_BATCH; > struct lru_gen_folio *lrugen = &lruvec->lrugen; > int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]); > > - if (type == LRU_GEN_ANON && !can_swap) > + if (type == LRU_GEN_ANON && !can_swap && !no_inversion) > goto done; > ...... > } > > -- > 2.43.0 > >
Thanks for replying. On Fri, Dec 22, 2023 at 07:16 AM Yuanchu Xie wrote: > How does the shared memory get charged to the cgroups? > Does it all go to cgroup A or B exclusively, or do some pages get > charged to each one? Some pages get charged to cgroup A, and the others get charged to cgroup B.
On Thu, Dec 21, 2023 at 7:45 PM Henry Huang <henry.hj@antgroup.com> wrote: > > Thanks for replyting. > > On Fri, Dec 22, 2023 at 07:16 AM Yuanchu Xie wrote: > > How does the shared memory get charged to the cgroups? > > Does it all go to cgroup A or B exclusively, or do some pages get > > charged to each one? > > Some pages get charged to cgroup A, and the other get charged to cgroup B. Just a side note: We can potentially "fix" this, but it doesn't mean this is a good practice. In fact, I think this is an anti-pattern to memcgs: resources should be preferably isolated between memcgs, or if a resource has to be shared between memcgs, it should be charged in a predetermined way, not randomly to one of the memcgs sharing it.
On Thu, 21 Dec 2023, Yu Zhao wrote: > > Thanks for replyting. > > > > On Fri, Dec 22, 2023 at 07:16 AM Yuanchu Xie wrote: > > > How does the shared memory get charged to the cgroups? > > > Does it all go to cgroup A or B exclusively, or do some pages get > > > charged to each one? > > > > Some pages get charged to cgroup A, and the other get charged to cgroup B. > > Just a side note: > We can potentially "fix" this, but it doesn't mean this is a good > practice. In fact, I think this is an anti-pattern to memcgs: > resources should be preferably isolated between memcgs, or if a > resource has to be shared between memcgs, it should be charged in a > predetermined way, not randomly to one of the memcgs sharing it. > Very interesting thread, a few questions for Henry to understand the situation better: - is the lack of predeterministic charging a problem for you? Are you initially faulting it in a manner that charges it to the "right" memcg and the refault of it after periodic reclaim can causing the charge to appear "randomly," i.e. to whichever process happened to access it next? - are pages ever shared between different memcg hierarchies? You mentioned sharing between processes in A and A/B, but I'm wondering if there is sharing between two different memcg hierarchies where root is the only common ancestor? - do you anticipate a shorter scan period at some point? Proactively reclaiming all memory colder than one hour is a long time :) Are you concerned at all about the cost of doing your current idle bit harvesting approach becoming too expensive if you significantly reduce the scan period? - is proactive reclaim being driven by writing to memory.reclaim, by enforcing a smaller memory.high, or something else? Looking forward to learning more about your particular issue.
Thanks for replying. On Fri, Dec 22, 2023 at 13:14 PM David Rientjes wrote: > - is the lack of predeterministic charging a problem for you? Are you > initially faulting it in a manner that charges it to the "right" memcg > and the refault of it after periodic reclaim can causing the charge to > appear "randomly," i.e. to whichever process happened to access it > next? Actually, at the beginning all pages are charged to cgroup A, but under memory pressure or after proactive reclaim some pages get dropped or swapped. If a task in cgroup B then visits this shared memory before a task in cgroup A, those pages get charged to cgroup B. This is common in our environment. > - are pages ever shared between different memcg hierarchies? You > mentioned sharing between processes in A and A/B, but I'm wondering > if there is sharing between two different memcg hierarchies where root > is the only common ancestor? Yes, there is another really common case: if the docker graph driver is overlayfs, different docker containers that use the same image, or share the same lower layers, share the file cache of common binaries or libraries (e.g. libc.so). > - do you anticipate a shorter scan period at some point? Proactively > reclaiming all memory colder than one hour is a long time :) Are you > concerned at all about the cost of doing your current idle bit > harvesting approach becoming too expensive if you significantly reduce > the scan period? We don't want the owners of applications to feel a significant performance downgrade when using swap, and there is a high risk in reclaiming pages whose idle age is less than 1 hour; we have internal tests and data analysis to support this. We disabled global swappiness and memcg swappiness, so only proactive reclaim can swap anon pages. What's more, we see that mglru has a more efficient way to scan pte accessed bits, so we would prefer to use the mglru scan to help us scan and select idle pages. > - is proactive reclaim being driven by writing to memory.reclaim, by > enforcing a smaller memory.high, or something else? Because all page info and idle ages are stored in userspace, the kernel can't get this information directly. We have a private patch that adds a new reclaim interface to support reclaiming pages with specific pfns.
On Fri, Dec 22, 2023 at 7:40 AM Henry Huang <henry.hj@antgroup.com> wrote: > > - are pages ever shared between different memcg hierarchies? You > > mentioned sharing between processes in A and A/B, but I'm wondering > > if there is sharing between two different memcg hierarchies where root > > is the only common ancestor? > > Yes, there is a another really common case: > If docker graph driver is overlayfs, different docker containers use the > same image, or share same low layers, would share file cache of public bin or > lib(i.e libc.so). Does this present a problem with setting memcg limits or OOMs? It seems like deterministically charging shared pages would be highly desirable. Mina Almasry previously proposed a memcg= mount option to implement deterministic charging[1], but it wasn't a generic sharing mechanism. Nonetheless, the problem remains, and it would be interesting to learn if this presents any issues for you. [1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/ > > > - do you anticipate a shorter scan period at some point? Proactively > > reclaiming all memory colder than one hour is a long time :) Are you > > concerned at all about the cost of doing your current idle bit > > harvesting approach becoming too expensive if you significantly reduce > > the scan period? > > We don't want the owner of the application to feel a significant > performance downgrade when using swap. There is a high risk to reclaim pages > which idle age are less than 1 hour. We have internal test and > data analysis to support it. > > We disabled global swappiness and memcg swapinness. > Only proactive reclaim can swap anon pages. > > What's more, we see that mglru has a more efficient way to scan pte access bit. > We perferred to use mglru scan help us scan and select idle pages. I'm working on a kernel driver/per-memcg interface to perform aging with MGLRU, including configuration for the MGLRU page scanning optimizations. I suspect scanning the PTE accessed bits for pages charged to a foreign memcg ad-hoc has some performance implications, and the more general solution is to charge in a predetermined way, which makes the scanning on behalf of the foreign memcg a bit cleaner. This is possible nonetheless, but a bit hacky. Let me know you have any ideas. > > > - is proactive reclaim being driven by writing to memory.reclaim, by > > enforcing a smaller memory.high, or something else? > > Because all pages info and idle age are stored in userspace, kernel can't get > these information directly. We have a private patch include a new reclaim interface > to support reclaim pages with specific pfns. Thanks for sharing! It's been enlightening to learn about different prod environments.
Thanks for replying. On Thu, Jan 11, 2024 at 03:24 AM Yuanchu Xie <yuanchu@google.com> wrote: > Does this present a problem with setting memcg limits or OOMs? It > seems like deterministically charging shared pages would be highly > desirable. Mina Almasry previously proposed a memcg= mount option to > implement deterministic charging[1], but it wasn't a generic sharing > mechanism. Nonetheless, the problem remains, and it would be > interesting to learn if this presents any issues for you. > > [1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/ In this case, the total size of the shared memory is usually small (< 100MB). What's more, almost all shared pages are file cache, so it doesn't present any problems. > I'm working on a kernel driver/per-memcg interface to perform aging > with MGLRU, including configuration for the MGLRU page scanning > optimizations. I suspect scanning the PTE accessed bits for pages > charged to a foreign memcg ad-hoc has some performance implications, > and the more general solution is to charge in a predetermined way, > which makes the scanning on behalf of the foreign memcg a bit cleaner. > This is possible nonetheless, but a bit hacky. Let me know you have > any ideas. Wow! A per-memcg interface is also what we need. It's hard to control users' behaviour in our environment; we can't guarantee that all processes sharing memory are in the same memcg. Maybe the kernel should provide a new interface to make the charging of shared pages more predictable, though I think that would add some overhead. The key problem for us is that we can't check whether such a page was accessed (no gen or ref changed) in this case: the page belongs to A, but a process in B may read/write this page more frequently, so we may treat a page that is actually hot as cold. Maybe just call folio_mark_accessed instead of folio_update_gen (which would need to hold the memcg lru lock) for those remote-memcg pages?
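A rough, untested sketch of that suggestion, purely to make the idea concrete. The extra "accessed" parameter and this exact placement are assumptions, not part of the posted patch, and whether folio_mark_accessed() is safe to call from the walker context would still need checking.

/*
 * Sketch: when the MGLRU walker finds a folio charged to a different memcg,
 * record the access through the classic referenced path instead of updating
 * its generation, so the access is not silently dropped.
 */
static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
				   struct pglist_data *pgdat, bool can_swap,
				   bool accessed)
{
	struct folio *folio;

	/* try to avoid unnecessary memory loads */
	if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
		return NULL;

	folio = pfn_folio(pfn);
	if (folio_nid(folio) != pgdat->node_id)
		return NULL;

	if (folio_memcg_rcu(folio) != memcg) {
		/* remote-memcg folio: don't touch its gen, but don't lose the access */
		if (accessed)
			folio_mark_accessed(folio);
		return NULL;
	}

	/* file VMAs can contain anon pages from COW */
	if (!folio_is_file_lru(folio) && !can_swap)
		return NULL;

	return folio;
}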
diff --git a/mm/swap.c b/mm/swap.c index cd8f0150ba3a..4bd14aabdc10 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -416,6 +416,9 @@ static void folio_inc_refs(struct folio *folio) { unsigned long new_flags, old_flags = READ_ONCE(folio->flags); + if (folio_test_idle(folio)) + folio_clear_idle(folio); + if (folio_test_unevictable(folio)) return; diff --git a/mm/vmscan.c b/mm/vmscan.c index 96abaa5a973e..c1dd9e7bc315 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3297,7 +3297,7 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned #endif static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg, - struct pglist_data *pgdat, bool can_swap) + struct pglist_data *pgdat, bool can_swap, bool clear_idle) { struct folio *folio; @@ -3306,6 +3306,10 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg, return NULL; folio = pfn_folio(pfn); + + if (clear_idle && folio_test_idle(folio)) + folio_clear_idle(folio); + if (folio_nid(folio) != pgdat->node_id) return NULL; @@ -3355,6 +3359,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, unsigned long pfn; struct folio *folio; pte_t ptent = ptep_get(pte + i); + bool is_pte_young; total++; walk->mm_stats[MM_LEAF_TOTAL]++; @@ -3363,16 +3368,20 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, if (pfn == -1) continue; - if (!pte_young(ptent)) { - walk->mm_stats[MM_LEAF_OLD]++; + is_pte_young = !!pte_young(ptent); + folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap, is_pte_young); + if (!folio) { + if (!is_pte_young) + walk->mm_stats[MM_LEAF_OLD]++; continue; } - folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap); - if (!folio) + if (!folio_test_clear_young(folio) && !is_pte_young) { + walk->mm_stats[MM_LEAF_OLD]++; continue; + } - if (!ptep_test_and_clear_young(args->vma, addr, pte + i)) + if (is_pte_young && !ptep_test_and_clear_young(args->vma, addr, pte + i)) VM_WARN_ON_ONCE(true); young++; @@ -3435,6 +3444,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area do { unsigned long pfn; struct folio *folio; + bool is_pmd_young; /* don't round down the first address */ addr = i ? 
(*first & PMD_MASK) + i * PMD_SIZE : *first; @@ -3449,11 +3459,15 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area goto next; } - folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap); + is_pmd_young = !!pmd_young(pmd[i]); + folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap, is_pmd_young); if (!folio) goto next; - if (!pmdp_test_and_clear_young(vma, addr, pmd + i)) + if (is_pmd_young && !pmdp_test_and_clear_young(vma, addr, pmd + i)) + VM_WARN_ON_ONCE(true); + + if (!folio_test_clear_young(folio) && !is_pmd_young) goto next; walk->mm_stats[MM_LEAF_YOUNG]++; @@ -4025,19 +4039,21 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) { unsigned long pfn; pte_t ptent = ptep_get(pte + i); + bool is_pte_young; pfn = get_pte_pfn(ptent, pvmw->vma, addr); if (pfn == -1) continue; - if (!pte_young(ptent)) + is_pte_young = !!pte_young(ptent); + folio = get_pfn_folio(pfn, memcg, pgdat, can_swap, is_pte_young); + if (!folio) continue; - folio = get_pfn_folio(pfn, memcg, pgdat, can_swap); - if (!folio) + if (!folio_test_clear_young(folio) && !is_pte_young) continue; - if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i)) + if (is_pte_young && !ptep_test_and_clear_young(pvmw->vma, addr, pte + i)) VM_WARN_ON_ONCE(true); young++;
The Multi-Gen LRU page-table walker clears the pte young flag, but it doesn't clear the page idle flag. When we use /sys/kernel/mm/page_idle/bitmap to check whether a page was accessed, it tells us the page is idle even though it has actually been accessed.

For unmapped filecache pages, the page idle flag is not cleared in folio_mark_accessed if Multi-Gen LRU is enabled, so we can't use /sys/kernel/mm/page_idle/bitmap to check whether a filecache page was read or written.

What's more, /sys/kernel/mm/page_idle/bitmap also clears the pte young flag; when a page has been accessed, it sets the page young flag instead. The Multi-Gen LRU page-table walker should therefore check both the page and pte young flags.

how-to-reproduce-problem

idle_page_track
    a tool to track the memory a process accessed during a specific time
usage
    idle_page_track $pid $time
how-it-works
    1. scan process VMAs from /proc/$pid/maps
    2. vfn --> pfn from /proc/$pid/pagemap
    3. write /sys/kernel/mm/page_idle/bitmap to
       mark the phy page idle flag and clear the pte young flag
    4. sleep $time
    5. read /sys/kernel/mm/page_idle/bitmap to
       test_and_clear the pte young flag and
       return whether the phy page was accessed

test ---- test program

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, const char *argv[])
{
	char *buf = NULL;
	char pipe_info[4096];
	int n;
	int fd = -1;

	buf = malloc(1024*1024*1024UL);
	memset(buf, 0, 1024*1024*1024UL);
	fd = open("access.pipe", O_RDONLY);
	if (fd < 0)
		goto out;
	while (1) {
		n = read(fd, pipe_info, sizeof(pipe_info));
		if (!n) {
			sleep(1);
			continue;
		} else if (n < 0) {
			break;
		}
		memset(buf, 0, 1024*1024*1024UL);
		puts("finish access");
	}
out:
	if (fd >= 0)
		close(fd);
	if (buf)
		free(buf);

	return 0;
}

prepare:
mkfifo access.pipe
./test
ps -ef | grep test
root      4106  3148  8 06:47 pts/0    00:00:01 ./test

We use /sys/kernel/debug/lru_gen to simulate the mglru page-table scan.

case 1: mglru walker breaks page_idle
./idle_page_track 4106 60 &
sleep 5; echo 1 > access.pipe
sleep 5; echo '+ 8 0 6 1 1' > /sys/kernel/debug/lru_gen

the output of idle_page_track is:
Est(s)     Ref(MB)
64.822     1.00
only 1MB was found to be accessed during the 64.822s, but actually 1024MB was accessed.

case 2: page_idle breaks mglru walker
echo 1 > access.pipe
./idle_page_track 4106 10
echo '+ 8 0 7 1 1' > /sys/kernel/debug/lru_gen

lru gen status:
memcg     8   /user.slice
 node     0
          5     772458      1065      9735
          6     737435    262244        72
          7     538053      1184       632
          8      59404      6422         0
almost all pages should be in the max_seq-1 queue, but actually they are not.

Signed-off-by: Henry Huang <henry.hj@antgroup.com>
---
 mm/swap.c   |  3 +++
 mm/vmscan.c | 40 ++++++++++++++++++++++++++++------------
 2 files changed, 31 insertions(+), 12 deletions(-)
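For reference, step 2 of the how-to-reproduce recipe above (vfn --> pfn via /proc/$pid/pagemap) can be done with a small helper like the sketch below. The entry layout (bit 63 = present, bits 0-54 = PFN, PFNs hidden without CAP_SYS_ADMIN) follows Documentation/admin-guide/mm/pagemap.rst; error handling is trimmed and the helper name is ours.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define PAGE_SHIFT	12	/* assumption: 4K pages */

/* Look up the PFN backing a virtual address; returns 0 if not present. */
static uint64_t vaddr_to_pfn(int pagemap_fd, uint64_t vaddr)
{
	uint64_t entry;
	off_t off = (vaddr >> PAGE_SHIFT) * sizeof(entry);

	if (pread(pagemap_fd, &entry, sizeof(entry), off) != sizeof(entry))
		return 0;
	if (!(entry & (1ULL << 63)))		/* not present (may be swapped) */
		return 0;
	return entry & ((1ULL << 55) - 1);	/* bits 0-54: PFN */
}

int main(int argc, char **argv)
{
	char path[64];
	uint64_t vaddr, pfn;
	int fd;

	if (argc < 3)
		return 1;
	snprintf(path, sizeof(path), "/proc/%s/pagemap", argv[1]);
	vaddr = strtoull(argv[2], NULL, 0);
	fd = open(path, O_RDONLY);		/* needs CAP_SYS_ADMIN to see PFNs */
	if (fd < 0)
		return 1;
	pfn = vaddr_to_pfn(fd, vaddr);
	printf("vaddr %#llx -> pfn %#llx\n",
	       (unsigned long long)vaddr, (unsigned long long)pfn);
	close(fd);
	return 0;
}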