Message ID | 20220407031525.2368067-13-yuzhao@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | Multi-Gen LRU Framework |
On Wed, 6 Apr 2022 21:15:24 -0600 Yu Zhao <yuzhao@google.com> wrote:

> Add /sys/kernel/debug/lru_gen for working set estimation and proactive
> reclaim. These features are required to optimize job scheduling (bin
> packing) in data centers [1][2].

debugfs is for ephemeral temp stuff which can and will change or
disappear at any time. Anything which is "required" by userspace
should not be in debugfs.

Presumably sysfs is the place. Fully documented and with usage
examples in the changelog so we can carefully review the proposed
extensions to Linux's ABI. Extensions which must be maintained
unchanged for all time.

On Mon, Apr 11, 2022 at 8:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 6 Apr 2022 21:15:24 -0600 Yu Zhao <yuzhao@google.com> wrote:
>
> > Add /sys/kernel/debug/lru_gen for working set estimation and proactive
> > reclaim. These features are required to optimize job scheduling (bin
> > packing) in data centers [1][2].
>
> debugfs is for ephemeral temp stuff which can and will change or
> disappear at any time. Anything which is "required" by userspace
> should not be in debugfs.

Right. "required" is probably a poor choice of words. "These techniques
are commonly used to optimize job scheduling" would sound better.

> Presumably sysfs is the place. Fully documented and with usage
> examples in the changelog so we can carefully review the proposed
> extensions to Linux's ABI. Extensions which must be maintained
> unchanged for all time.

Eventually, yes. There still is a long way to go. Rest assured, this
is something Google will keep investing resources on.

On Fri, 15 Apr 2022 18:03:16 -0600 Yu Zhao <yuzhao@google.com> wrote:

> > Presumably sysfs is the place. Fully documented and with usage
> > examples in the changelog so we can carefully review the proposed
> > extensions to Linux's ABI. Extensions which must be maintained
> > unchanged for all time.
>
> Eventually, yes. There still is a long way to go. Rest assured, this
> is something Google will keep investing resources on.

So. The plan is to put these interfaces in debugfs for now, with a
view to migrating stabilized interfaces into sysfs (or procfs or
whatever) once end-user requirements and use cases are better
understood?

If so, that sounds totally great to me. But it should have been in
the darn changelog! This is the sort of thing which we care about most
keenly.

It would be helpful for reviewers to understand the proposed timeline
for this process, because the entire feature isn't really real until
this is completed, is it? I do think we should get this nailed down
relatively rapidly, otherwise people will be reluctant to invest much
into a moving target.

And I must say, I see dissonance between the overall maturity of the
feature as described in these emails versus the immaturity of these
userspace control interfaces. What's happening there?

On Fri, Apr 15, 2022 at 10:20 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 15 Apr 2022 18:03:16 -0600 Yu Zhao <yuzhao@google.com> wrote:
>
> > > Presumably sysfs is the place. Fully documented and with usage
> > > examples in the changelog so we can carefully review the proposed
> > > extensions to Linux's ABI. Extensions which must be maintained
> > > unchanged for all time.
> >
> > Eventually, yes. There still is a long way to go. Rest assured, this
> > is something Google will keep investing resources on.
>
> So. The plan is to put these interfaces in debugfs for now, with a
> view to migrating stabilized interfaces into sysfs (or procfs or
> whatever) once end-user requirements and use cases are better
> understood?

The requirements are well understood and the use cases are proven,
e.g., Google [1], Meta [2] and Alibaba [3].

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://gitee.com/anolis/cloud-kernel/blob/release-5.10/mm/kidled.c

> If so, that sounds totally great to me. But it should have been in
> the darn changelog! This is the sort of thing which we care about most
> keenly.
>
> It would be helpful for reviewers to understand the proposed timeline
> for this process, because the entire feature isn't really real until
> this is completed, is it? I do think we should get this nailed down
> relatively rapidly, otherwise people will be reluctant to invest much
> into a moving target.
>
> And I must say, I see dissonance between the overall maturity of the
> feature as described in these emails versus the immaturity of these
> userspace control interfaces. What's happening there?

Very observant. To answer both of the questions above: each iteration
of the entire stack is a multi-year effort.

Given its ROI, companies I know of constantly pour money into this
area. Given its scale, this debugfs is the least of their concerns. A
good example is the proactive reclaim sysfs interface [4]. It's been
used at Google for many years and at Meta for a few years. We only
started finalizing it recently.

[4] https://lore.kernel.org/r/20220425190040.2475377-1-yosryahmed@google.com/

On Tue, 26 Apr 2022 00:59:37 -0600 Yu Zhao <yuzhao@google.com> wrote:

> On Fri, Apr 15, 2022 at 10:20 PM Andrew Morton
> <akpm@linux-foundation.org> wrote:
> >
> > On Fri, 15 Apr 2022 18:03:16 -0600 Yu Zhao <yuzhao@google.com> wrote:
> >
> > > > Presumably sysfs is the place. Fully documented and with usage
> > > > examples in the changelog so we can carefully review the proposed
> > > > extensions to Linux's ABI. Extensions which must be maintained
> > > > unchanged for all time.
> > >
> > > Eventually, yes. There still is a long way to go. Rest assured, this
> > > is something Google will keep investing resources on.
> >
> > So. The plan is to put these interfaces in debugfs for now, with a
> > view to migrating stabilized interfaces into sysfs (or procfs or
> > whatever) once end-user requirements and use cases are better
> > understood?
>
> The requirements are well understood and the use cases are proven,
> e.g., Google [1], Meta [2] and Alibaba [3].
>
> [1] https://dl.acm.org/doi/10.1145/3297858.3304053
> [2] https://dl.acm.org/doi/10.1145/3503222.3507731
> [3] https://gitee.com/anolis/cloud-kernel/blob/release-5.10/mm/kidled.c

So will these interfaces be moved into sysfs?

> > If so, that sounds totally great to me. But it should have been in
> > the darn changelog! This is the sort of thing which we care about most
> > keenly.
> >
> > It would be helpful for reviewers to understand the proposed timeline
> > for this process, because the entire feature isn't really real until
> > this is completed, is it? I do think we should get this nailed down
> > relatively rapidly, otherwise people will be reluctant to invest much
> > into a moving target.
> >
> > And I must say, I see dissonance between the overall maturity of the
> > feature as described in these emails versus the immaturity of these
> > userspace control interfaces. What's happening there?
>
> Very observant. To answer both of the questions above: each iteration
> of the entire stack is a multi-year effort.
>
> Given its ROI, companies I know of constantly pour money into this
> area. Given its scale, this debugfs is the least of their concerns. A
> good example is the proactive reclaim sysfs interface [4]. It's been
> used at Google for many years and at Meta for a few years. We only
> started finalizing it recently.
>
> [4] https://lore.kernel.org/r/20220425190040.2475377-1-yosryahmed@google.com/

Sure, if one organization is involved in both the userspace code and
the kernel interfaces then the alteration of kernel interfaces can be
handled in a coordinated fashion.

But releasing interfaces to the whole world is a different deal. It's
acceptable to say "this is in debugfs for now because it's a work in
progress" but it sounds like mglru's interfaces are beyond that stage?

On Tue, Apr 26, 2022 at 3:30 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 26 Apr 2022 00:59:37 -0600 Yu Zhao <yuzhao@google.com> wrote:
>
> > On Fri, Apr 15, 2022 at 10:20 PM Andrew Morton
> > <akpm@linux-foundation.org> wrote:
> > >
> > > On Fri, 15 Apr 2022 18:03:16 -0600 Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > > > Presumably sysfs is the place. Fully documented and with usage
> > > > > examples in the changelog so we can carefully review the proposed
> > > > > extensions to Linux's ABI. Extensions which must be maintained
> > > > > unchanged for all time.
> > > >
> > > > Eventually, yes. There still is a long way to go. Rest assured, this
> > > > is something Google will keep investing resources on.
> > >
> > > So. The plan is to put these interfaces in debugfs for now, with a
> > > view to migrating stabilized interfaces into sysfs (or procfs or
> > > whatever) once end-user requirements and use cases are better
> > > understood?
> >
> > The requirements are well understood and the use cases are proven,
> > e.g., Google [1], Meta [2] and Alibaba [3].
> >
> > [1] https://dl.acm.org/doi/10.1145/3297858.3304053
> > [2] https://dl.acm.org/doi/10.1145/3503222.3507731
> > [3] https://gitee.com/anolis/cloud-kernel/blob/release-5.10/mm/kidled.c
>
> So will these interfaces be moved into sysfs?

So the debugfs interface from this patch provides:
1. proactive reclaim,
2. working set estimation.

The sysfs interface for item 1 is being finalized by [4], and it's a
subset of this debugfs interface because we want it to be more
general.

The sysfs interface for item 2 will be eventually proposed as well,
with the same approach. It will look like a histogram in which the
active/inactive LRU has two bins whereas MGLRU has more bins. Bins
contain pages and multiple bins represent different
hotness/coldness. Since [4] took about two years, I'd say this
histogram-like interface would take no less than a couple of years as
well.

This debugfs interface stays even after that, and it will serve its
true purpose (debugging), not as a substitute for the sysfs interfaces.

> > > If so, that sounds totally great to me. But it should have been in
> > > the darn changelog! This is the sort of thing which we care about most
> > > keenly.
> > >
> > > It would be helpful for reviewers to understand the proposed timeline
> > > for this process, because the entire feature isn't really real until
> > > this is completed, is it? I do think we should get this nailed down
> > > relatively rapidly, otherwise people will be reluctant to invest much
> > > into a moving target.
> > >
> > > And I must say, I see dissonance between the overall maturity of the
> > > feature as described in these emails versus the immaturity of these
> > > userspace control interfaces. What's happening there?
> >
> > Very observant. To answer both of the questions above: each iteration
> > of the entire stack is a multi-year effort.
> >
> > Given its ROI, companies I know of constantly pour money into this
> > area. Given its scale, this debugfs is the least of their concerns. A
> > good example is the proactive reclaim sysfs interface [4]. It's been
> > used at Google for many years and at Meta for a few years. We only
> > started finalizing it recently.
> >
> > [4] https://lore.kernel.org/r/20220425190040.2475377-1-yosryahmed@google.com/
>
> Sure, if one organization is involved in both the userspace code and
> the kernel interfaces then the alteration of kernel interfaces can be
> handled in a coordinated fashion.
>
> But releasing interfaces to the whole world is a different deal. It's
> acceptable to say "this is in debugfs for now because it's a work
> in progress" but it sounds like mglru's interfaces are beyond that
> stage?

Correct. It's a WIP in the sense of "know what needs to be done but
can't get it done immediately", not "don't know what's next; try this
for now".

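For readers trying to picture the histogram-like view described in the reply above, here is a minimal userspace sketch in C. It is not part of the patch and not a proposed ABI: the struct gen_bin layout and both helper names are hypothetical, chosen only to mirror the per-generation columns that the patch's lru_gen_seq_show() prints (sequence number, age in milliseconds, anon pages, file pages). The idea is that each generation is one bin, and a working set estimate is the sum of the bins younger than some age threshold.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical userspace copy of one generation ("bin"). The fields
 * mirror the columns printed per generation by lru_gen_seq_show();
 * the struct itself is an assumption made for this sketch.
 */
struct gen_bin {
	unsigned long seq;	/* generation sequence number */
	unsigned int age_ms;	/* time since the generation was created */
	unsigned long anon;	/* anon pages in this generation */
	unsigned long file;	/* file pages in this generation */
};

/* Pages in generations no older than max_age_ms: a working set estimate. */
static unsigned long working_set_pages(const struct gen_bin *bins, size_t nr,
				       unsigned int max_age_ms)
{
	unsigned long total = 0;
	size_t i;

	assert(bins || !nr);

	for (i = 0; i < nr; i++) {
		if (bins[i].age_ms <= max_age_ms)
			total += bins[i].anon + bins[i].file;
	}

	return total;
}

/* Pages in generations older than max_age_ms: candidates for proactive reclaim. */
static unsigned long cold_pages(const struct gen_bin *bins, size_t nr,
				unsigned int max_age_ms)
{
	unsigned long total = 0;
	size_t i;

	assert(bins || !nr);

	for (i = 0; i < nr; i++) {
		if (bins[i].age_ms > max_age_ms)
			total += bins[i].anon + bins[i].file;
	}

	return total;
}
```

With only two bins (active/inactive) the same helpers still work, but the age threshold can only fall in one place; more generations give a finer split between hot and cold pages.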
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 567c3ddba2c4..90840c459abc 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -486,6 +486,7 @@ static inline int num_node_state(enum node_states state)
 #define first_online_node 0
 #define first_memory_node 0
 #define next_online_node(nid) (MAX_NUMNODES)
+#define next_memory_node(nid) (MAX_NUMNODES)
 #define nr_node_ids 1U
 #define nr_online_nodes 1U
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4b7da68b8750..913c28805236 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -53,6 +53,7 @@
 #include <linux/pagewalk.h>
 #include <linux/shmem_fs.h>
 #include <linux/ctype.h>
+#include <linux/debugfs.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -5191,6 +5192,348 @@ static struct attribute_group lru_gen_attr_group = {
 	.attrs = lru_gen_attrs,
 };
 
+/******************************************************************************
+ * debugfs interface
+ ******************************************************************************/
+
+static void *lru_gen_seq_start(struct seq_file *m, loff_t *pos)
+{
+	struct mem_cgroup *memcg;
+	loff_t nr_to_skip = *pos;
+
+	m->private = kvmalloc(PATH_MAX, GFP_KERNEL);
+	if (!m->private)
+		return ERR_PTR(-ENOMEM);
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		int nid;
+
+		for_each_node_state(nid, N_MEMORY) {
+			if (!nr_to_skip--)
+				return get_lruvec(memcg, nid);
+		}
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+	return NULL;
+}
+
+static void lru_gen_seq_stop(struct seq_file *m, void *v)
+{
+	if (!IS_ERR_OR_NULL(v))
+		mem_cgroup_iter_break(NULL, lruvec_memcg(v));
+
+	kvfree(m->private);
+	m->private = NULL;
+}
+
+static void *lru_gen_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	int nid = lruvec_pgdat(v)->node_id;
+	struct mem_cgroup *memcg = lruvec_memcg(v);
+
+	++*pos;
+
+	nid = next_memory_node(nid);
+	if (nid == MAX_NUMNODES) {
+		memcg = mem_cgroup_iter(NULL, memcg, NULL);
+		if (!memcg)
+			return NULL;
+
+		nid = first_memory_node;
+	}
+
+	return get_lruvec(memcg, nid);
+}
+
+static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec,
+				  unsigned long max_seq, unsigned long *min_seq,
+				  unsigned long seq)
+{
+	int i;
+	int type, tier;
+	int hist = lru_hist_from_seq(seq);
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+
+	for (tier = 0; tier < MAX_NR_TIERS; tier++) {
+		seq_printf(m, " %10d", tier);
+		for (type = 0; type < ANON_AND_FILE; type++) {
+			unsigned long n[3] = {};
+
+			if (seq == max_seq) {
+				n[0] = READ_ONCE(lrugen->avg_refaulted[type][tier]);
+				n[1] = READ_ONCE(lrugen->avg_total[type][tier]);
+
+				seq_printf(m, " %10luR %10luT %10lu ", n[0], n[1], n[2]);
+			} else if (seq == min_seq[type] || NR_HIST_GENS > 1) {
+				n[0] = atomic_long_read(&lrugen->refaulted[hist][type][tier]);
+				n[1] = atomic_long_read(&lrugen->evicted[hist][type][tier]);
+				if (tier)
+					n[2] = READ_ONCE(lrugen->protected[hist][type][tier - 1]);
+
+				seq_printf(m, " %10lur %10lue %10lup", n[0], n[1], n[2]);
+			} else
+				seq_puts(m, " 0 0 0 ");
+		}
+		seq_putc(m, '\n');
+	}
+
+	seq_puts(m, " ");
+	for (i = 0; i < NR_MM_STATS; i++) {
+		if (seq == max_seq && NR_HIST_GENS == 1)
+			seq_printf(m, " %10lu%c", READ_ONCE(lruvec->mm_state.stats[hist][i]),
+				   toupper(MM_STAT_CODES[i]));
+		else if (seq != max_seq && NR_HIST_GENS > 1)
+			seq_printf(m, " %10lu%c", READ_ONCE(lruvec->mm_state.stats[hist][i]),
+				   MM_STAT_CODES[i]);
+		else
+			seq_puts(m, " 0 ");
+	}
+	seq_putc(m, '\n');
+}
+
+static int lru_gen_seq_show(struct seq_file *m, void *v)
+{
+	unsigned long seq;
+	bool full = !debugfs_real_fops(m->file)->write;
+	struct lruvec *lruvec = v;
+	struct lru_gen_struct *lrugen = &lruvec->lrugen;
+	int nid = lruvec_pgdat(lruvec)->node_id;
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+	DEFINE_MAX_SEQ(lruvec);
+	DEFINE_MIN_SEQ(lruvec);
+
+	if (nid == first_memory_node) {
+		const char *path = memcg ? m->private : "";
+
+#ifdef CONFIG_MEMCG
+		if (memcg)
+			cgroup_path(memcg->css.cgroup, m->private, PATH_MAX);
+#endif
+		seq_printf(m, "memcg %5hu %s\n", mem_cgroup_id(memcg), path);
+	}
+
+	seq_printf(m, " node %5d\n", nid);
+
+	if (!full)
+		seq = min_seq[LRU_GEN_ANON];
+	else if (max_seq >= MAX_NR_GENS)
+		seq = max_seq - MAX_NR_GENS + 1;
+	else
+		seq = 0;
+
+	for (; seq <= max_seq; seq++) {
+		int type, zone;
+		int gen = lru_gen_from_seq(seq);
+		unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
+
+		seq_printf(m, " %10lu %10u", seq, jiffies_to_msecs(jiffies - birth));
+
+		for (type = 0; type < ANON_AND_FILE; type++) {
+			long size = 0;
+			char mark = full && seq < min_seq[type] ? 'x' : ' ';
+
+			for (zone = 0; zone < MAX_NR_ZONES; zone++)
+				size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
+
+			seq_printf(m, " %10lu%c", max(size, 0L), mark);
+		}
+
+		seq_putc(m, '\n');
+
+		if (full)
+			lru_gen_seq_show_full(m, lruvec, max_seq, min_seq, seq);
+	}
+
+	return 0;
+}
+
+static const struct seq_operations lru_gen_seq_ops = {
+	.start = lru_gen_seq_start,
+	.stop = lru_gen_seq_stop,
+	.next = lru_gen_seq_next,
+	.show = lru_gen_seq_show,
+};
+
+static int run_aging(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
+		     bool can_swap, bool full_scan)
+{
+	DEFINE_MAX_SEQ(lruvec);
+
+	if (seq == max_seq)
+		try_to_inc_max_seq(lruvec, max_seq, sc, can_swap, full_scan);
+
+	return seq > max_seq ? -EINVAL : 0;
+}
+
+static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
+			int swappiness, unsigned long nr_to_reclaim)
+{
+	struct blk_plug plug;
+	int err = -EINTR;
+	DEFINE_MAX_SEQ(lruvec);
+
+	if (seq + MIN_NR_GENS > max_seq)
+		return -EINVAL;
+
+	sc->nr_reclaimed = 0;
+
+	blk_start_plug(&plug);
+
+	while (!signal_pending(current)) {
+		DEFINE_MIN_SEQ(lruvec);
+
+		if (seq < min_seq[!swappiness] || sc->nr_reclaimed >= nr_to_reclaim ||
+		    !evict_folios(lruvec, sc, swappiness, NULL)) {
+			err = 0;
+			break;
+		}
+
+		cond_resched();
+	}
+
+	blk_finish_plug(&plug);
+
+	return err;
+}
+
+static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
+		   struct scan_control *sc, int swappiness, unsigned long opt)
+{
+	struct lruvec *lruvec;
+	int err = -EINVAL;
+	struct mem_cgroup *memcg = NULL;
+
+	if (!mem_cgroup_disabled()) {
+		rcu_read_lock();
+		memcg = mem_cgroup_from_id(memcg_id);
+#ifdef CONFIG_MEMCG
+		if (memcg && !css_tryget(&memcg->css))
+			memcg = NULL;
+#endif
+		rcu_read_unlock();
+
+		if (!memcg)
+			goto done;
+	}
+	if (memcg_id != mem_cgroup_id(memcg))
+		goto done;
+
+	if (nid < 0 || nid >= MAX_NUMNODES || !node_state(nid, N_MEMORY))
+		goto done;
+
+	lruvec = get_lruvec(memcg, nid);
+
+	if (swappiness < 0)
+		swappiness = get_swappiness(lruvec, sc);
+	else if (swappiness > 200)
+		goto done;
+
+	switch (cmd) {
+	case '+':
+		err = run_aging(lruvec, seq, sc, swappiness, opt);
+		break;
+	case '-':
+		err = run_eviction(lruvec, seq, sc, swappiness, opt);
+		break;
+	}
+done:
+	mem_cgroup_put(memcg);
+
+	return err;
+}
+
+static ssize_t lru_gen_seq_write(struct file *file, const char __user *src,
+				 size_t len, loff_t *pos)
+{
+	void *buf;
+	char *cur, *next;
+	unsigned int flags;
+	int err = 0;
+	struct scan_control sc = {
+		.may_writepage = true,
+		.may_unmap = true,
+		.may_swap = true,
+		.reclaim_idx = MAX_NR_ZONES - 1,
+		.gfp_mask = GFP_KERNEL,
+	};
+
+	buf = kvmalloc(len + 1, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	if (copy_from_user(buf, src, len)) {
+		kvfree(buf);
+		return -EFAULT;
+	}
+
+	next = buf;
+	next[len] = '\0';
+
+	sc.reclaim_state.mm_walk = alloc_mm_walk();
+	if (!sc.reclaim_state.mm_walk) {
+		kvfree(buf);
+		return -ENOMEM;
+	}
+
+	set_task_reclaim_state(current, &sc.reclaim_state);
+	flags = memalloc_noreclaim_save();
+
+	while ((cur = strsep(&next, ",;\n"))) {
+		int n;
+		int end;
+		char cmd;
+		unsigned int memcg_id;
+		unsigned int nid;
+		unsigned long seq;
+		unsigned int swappiness = -1;
+		unsigned long opt = -1;
+
+		cur = skip_spaces(cur);
+		if (!*cur)
+			continue;
+
+		n = sscanf(cur, "%c %u %u %lu %n %u %n %lu %n", &cmd, &memcg_id, &nid,
+			   &seq, &end, &swappiness, &end, &opt, &end);
+		if (n < 4 || cur[end]) {
+			err = -EINVAL;
+			break;
+		}
+
+		err = run_cmd(cmd, memcg_id, nid, seq, &sc, swappiness, opt);
+		if (err)
+			break;
+	}
+
+	memalloc_noreclaim_restore(flags);
+	set_task_reclaim_state(current, NULL);
+
+	free_mm_walk(sc.reclaim_state.mm_walk);
+	kvfree(buf);
+
+	return err ? : len;
+}
+
+static int lru_gen_seq_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &lru_gen_seq_ops);
+}
+
+static const struct file_operations lru_gen_rw_fops = {
+	.open = lru_gen_seq_open,
+	.read = seq_read,
+	.write = lru_gen_seq_write,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
+static const struct file_operations lru_gen_ro_fops = {
+	.open = lru_gen_seq_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
 /******************************************************************************
  * initialization
  ******************************************************************************/
@@ -5249,6 +5592,9 @@ static int __init init_lru_gen(void)
 	if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
 		pr_err("lru_gen: failed to create sysfs group\n");
 
+	debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops);
+	debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops);
+
 	return 0;
 };
 late_initcall(init_lru_gen);
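
Since the thread asks for usage examples, the command syntax accepted by lru_gen_seq_write() can be illustrated with a small userspace sketch. It reuses the sscanf() format string from the patch, so each command reads as "cmd memcg_id nid seq [swappiness [opt]]", where cmd is '+' (aging) or '-' (eviction). The struct lru_gen_cmd type and the parse_lru_gen_cmd() helper are hypothetical names invented for this example, and libc's sscanf() is assumed to behave like the kernel's for these inputs.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical container for one parsed command; not a kernel type. */
struct lru_gen_cmd {
	char cmd;			/* '+' runs aging, '-' runs eviction */
	unsigned int memcg_id;
	unsigned int nid;
	unsigned long seq;
	unsigned int swappiness;	/* (unsigned int)-1 when omitted */
	unsigned long opt;		/* (unsigned long)-1 when omitted */
};

/*
 * Parse a single command using the same format string as the patch.
 * Returns 0 on success and -1 on malformed input (where the kernel
 * code would return -EINVAL).
 */
static int parse_lru_gen_cmd(const char *line, struct lru_gen_cmd *out)
{
	int end = 0;
	int n;

	assert(line && out);

	/* The kernel calls skip_spaces() before parsing. */
	while (isspace((unsigned char)*line))
		line++;

	out->swappiness = -1;
	out->opt = -1;

	/* %n records how far parsing got; it does not count toward n. */
	n = sscanf(line, "%c %u %u %lu %n %u %n %lu %n", &out->cmd,
		   &out->memcg_id, &out->nid, &out->seq, &end,
		   &out->swappiness, &end, &out->opt, &end);

	/* Four fields are mandatory and nothing may trail the last one. */
	if (n < 4 || line[end] != '\0')
		return -1;

	/* run_cmd() only acts on these two commands. */
	if (out->cmd != '+' && out->cmd != '-')
		return -1;

	return 0;
}
```

Going by the patch, "+ 1 0 5" would ask for aging on memcg 1, node 0 when 5 is the current max_seq, and "- 1 0 3 100 4096" would evict generations up to seq 3 with swappiness 100 until 4096 pages are reclaimed. In the patch, several commands can be written at once, separated by ',', ';' or a newline, for example with echo into /sys/kernel/debug/lru_gen.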