
[-next] mm: introduce per-node proactive reclaim interface

Message ID 20240904162740.1043168-1-dave@stgolabs.net (mailing list archive)
State New

Commit Message

Davidlohr Bueso Sept. 4, 2024, 4:27 p.m. UTC
This adds support for proactive reclaim in general on a
NUMA system. A per-node interface extends support beyond the
memcg-specific interface, preserving the current semantics of
memory.reclaim: respecting LRU aging and not supporting
artificially triggered eviction on nodes belonging to non-bottom
tiers.

This patch allows userspace to do:

     echo 512M swappiness=10 > /sys/devices/system/node/nodeX/reclaim

One of the premises for this is to align semantically as closely
as possible with memory.reclaim. For a brief time memcg did
support a nodemask, until 55ab834a86a9 (Revert "mm: add nodes=
arg to memory.reclaim"), where the semantics around reclaim
(eviction) vs demotion were not clear, leaving charging
expectations broken.

With this approach:

1. Users who do not use memcg can benefit from proactive reclaim.

2. Proactive reclaim on top tiers will trigger demotion, for which
memory is still byte-addressable. Reclaiming on the bottom nodes
will trigger eviction to swap (the traditional sense of reclaim).
This follows the semantics of what is today part of the aging process
on tiered memory, mirroring what every other form of reclaim does
(reactive and memcg proactive reclaim). Furthermore, per-node proactive
reclaim is not as susceptible to the memcg charging problem mentioned
above.

3. Unlike memcg, there should be no surprises for callers expecting
reclaim but instead getting demotion. Essentially this relies on the
behavior of shrink_folio_list() after 6b426d071419 (mm: disable top-tier
fallback to reclaim on proactive reclaim), without the expectations
of try_to_free_mem_cgroup_pages().

4. Unlike the nodes= arg, this interface avoids confusing semantics,
such as what exactly the user wants when mixing top-tier and low-tier
nodes in the nodemask. Further, the per-node interface is less exposed to
"free up memory in my container" usecases, where eviction is intended.

5. Users that *really* want to free up memory can use proactive reclaim
on nodes known to be on the bottom tiers to force eviction in a
natural way - higher access latencies are still better than swap.
If compelled, while no guarantees and perhaps not worth the effort,
users could also potentially follow a ladder-like approach to
eventually free up the memory, as sketched below. Alternatively, perhaps
an 'evict' option could be added to the parameters for both memory.reclaim
and per-node interfaces to force this action unconditionally.
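
As an example of that ladder, on a hypothetical two-tier machine where
node0 is top-tier DRAM and node1 a bottom-tier (e.g. CXL) node backed
by swap:

     # demote cold memory off the top tier (no eviction happens here)
     echo 512M swappiness=10 > /sys/devices/system/node/node0/reclaim
     # then reclaim on the bottom tier, evicting to swap
     echo 512M swappiness=10 > /sys/devices/system/node/node1/reclaim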

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---

This topic has been brought up in the past without much resolution.
But today, I believe a number of semantics and expectations have become
clearer (per the changelog), which could merit revisiting this.

 Documentation/ABI/stable/sysfs-devices-node |  11 ++
 drivers/base/node.c                         |   2 +
 include/linux/swap.h                        |  16 ++
 mm/vmscan.c                                 | 154 ++++++++++++++++----
 4 files changed, 156 insertions(+), 27 deletions(-)

Comments

Andrew Morton Sept. 4, 2024, 8:18 p.m. UTC | #1
On Wed,  4 Sep 2024 09:27:40 -0700 Davidlohr Bueso <dave@stgolabs.net> wrote:

> This adds support for allowing proactive reclaim in general on a
> NUMA system. A per-node interface extends support for beyond a
> memcg-specific interface, respecting the current semantics of
> memory.reclaim: respecting aging LRU and not supporting
> artificially triggering eviction on nodes belonging to non-bottom
> tiers.
> 
> This patch allows userspace to do:
> 
>      echo 512M swappiness=10 > /sys/devices/system/node/nodeX/reclaim

One value per sysfs file is a rule.

> One of the premises for this is to semantically align as best as
> possible with memory.reclaim. During a brief time memcg did
> support nodemask until 55ab834a86a9 (Revert "mm: add nodes=
> arg to memory.reclaim"), for which semantics around reclaim
> (eviction) vs demotion were not clear, rendering charging
> expectations to be broken.
> 
> With this approach:
> 
> 1. Users who do not use memcg can benefit from proactive reclaim.
> 
> 2. Proactive reclaim on top tiers will trigger demotion, for which
> memory is still byte-addressable. Reclaiming on the bottom nodes
> will trigger evicting to swap (the traditional sense of reclaim).
> This follows the semantics of what is today part of the aging process
> on tiered memory, mirroring what every other form of reclaim does
> (reactive and memcg proactive reclaim). Furthermore per-node proactive
> reclaim is not as susceptible to the memcg charging problem mentioned
> above.
> 
> 3. Unlike memcg, there should be no surprises of callers expecting
> reclaim but instead got a demotion. Essentially relying on behavior
> of shrink_folio_list() after 6b426d071419 (mm: disable top-tier
> fallback to reclaim on proactive reclaim), without the expectations
> of try_to_free_mem_cgroup_pages().
> 
> 4. Unlike the nodes= arg, this interface avoids confusing semantics,
> such as what exactly the user wants when mixing top-tier and low-tier
> nodes in the nodemask. Further per-node interface is less exposed to
> "free up memory in my container" usecases, where eviction is intended.
> 
> 5. Users that *really* want to free up memory can use proactive reclaim
> on nodes knowingly to be on the bottom tiers to force eviction in a
> natural way - higher access latencies are still better than swap.
> If compelled, while no guarantees and perhaps not worth the effort,
> users could also also potentially follow a ladder-like approach to
> eventually free up the memory. Alternatively, perhaps an 'evict' option
> could be added to the parameters for both memory.reclaim and per-node
> interfaces to force this action unconditionally.
> 
> ...
>
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -221,3 +221,14 @@ Contact:	Jiaqi Yan <jiaqiyan@google.com>
>  Description:
>  		Of the raw poisoned pages on a NUMA node, how many pages are
>  		recovered by memory error recovery attempt.
> +
> +What:		/sys/devices/system/node/nodeX/reclaim
> +Date:		September 2024
> +Contact:	Linux Memory Management list <linux-mm@kvack.org>
> +Description:
> +		This is write-only nested-keyed file which accepts the number of

"is a write-only".

What does "nested keyed" mean?

> +		bytes to reclaim as well as the swappiness for this particular
> +		operation. Write the amount of bytes to induce memory reclaim in
> +		this node. When it completes successfully, the specified amount
> +		or more memory will have been reclaimed, and -EAGAIN if less
> +		bytes are reclaimed than the specified amount.

Could be that this feature would benefit from a more expansive
treatment under Documentation/somewhere.

>
> ...
>
> +#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> +
> +enum {
> +	MEMORY_RECLAIM_SWAPPINESS = 0,
> +	MEMORY_RECLAIM_NULL,
> +};
> +
> +static const match_table_t tokens = {
> +	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
> +	{ MEMORY_RECLAIM_NULL, NULL },
> +};
> +
> +static ssize_t reclaim_store(struct device *dev,
> +			     struct device_attribute *attr,
> +			     const char *buf, size_t count)
> +{
> +	int nid = dev->id;
> +	gfp_t gfp_mask = GFP_KERNEL;
> +	struct pglist_data *pgdat = NODE_DATA(nid);
> +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> +	int swappiness = -1;
> +	char *old_buf, *start;
> +	substring_t args[MAX_OPT_ARGS];
> +	struct scan_control sc = {
> +		.gfp_mask = current_gfp_context(gfp_mask),
> +		.reclaim_idx = gfp_zone(gfp_mask),
> +		.priority = DEF_PRIORITY,
> +		.may_writepage = !laptop_mode,
> +		.may_unmap = 1,
> +		.may_swap = 1,
> +		.proactive = 1,
> +	};
> +
> +	buf = strstrip((char *)buf);
> +
> +	old_buf = (char *)buf;
> +	nr_to_reclaim = memparse(buf, (char **)&buf) / PAGE_SIZE;
> +	if (buf == old_buf)
> +		return -EINVAL;
> +
> +	buf = strstrip((char *)buf);
> +
> +	while ((start = strsep((char **)&buf, " ")) != NULL) {
> +		if (!strlen(start))
> +			continue;
> +		switch (match_token(start, tokens, args)) {
> +		case MEMORY_RECLAIM_SWAPPINESS:
> +			if (match_int(&args[0], &swappiness))
> +				return -EINVAL;
> +			if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
> +				return -EINVAL;

Code forgot to use local `swappiness' for any purpose?

> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +	}
> +
>
> ...
>
Yosry Ahmed Sept. 4, 2024, 9:29 p.m. UTC | #2
On Wed, Sep 4, 2024 at 9:28 AM Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> This adds support for allowing proactive reclaim in general on a
> NUMA system. A per-node interface extends support for beyond a
> memcg-specific interface, respecting the current semantics of
> memory.reclaim: respecting aging LRU and not supporting
> artificially triggering eviction on nodes belonging to non-bottom
> tiers.
>
> This patch allows userspace to do:
>
>      echo 512M swappiness=10 > /sys/devices/system/node/nodeX/reclaim
>
> One of the premises for this is to semantically align as best as
> possible with memory.reclaim. During a brief time memcg did
> support nodemask until 55ab834a86a9 (Revert "mm: add nodes=
> arg to memory.reclaim"), for which semantics around reclaim
> (eviction) vs demotion were not clear, rendering charging
> expectations to be broken.
>
> With this approach:
>
> 1. Users who do not use memcg can benefit from proactive reclaim.
>
> 2. Proactive reclaim on top tiers will trigger demotion, for which
> memory is still byte-addressable. Reclaiming on the bottom nodes
> will trigger evicting to swap (the traditional sense of reclaim).
> This follows the semantics of what is today part of the aging process
> on tiered memory, mirroring what every other form of reclaim does
> (reactive and memcg proactive reclaim). Furthermore per-node proactive
> reclaim is not as susceptible to the memcg charging problem mentioned
> above.
>
> 3. Unlike memcg, there should be no surprises of callers expecting
> reclaim but instead got a demotion. Essentially relying on behavior
> of shrink_folio_list() after 6b426d071419 (mm: disable top-tier
> fallback to reclaim on proactive reclaim), without the expectations
> of try_to_free_mem_cgroup_pages().
>
> 4. Unlike the nodes= arg, this interface avoids confusing semantics,
> such as what exactly the user wants when mixing top-tier and low-tier
> nodes in the nodemask. Further per-node interface is less exposed to
> "free up memory in my container" usecases, where eviction is intended.
>
> 5. Users that *really* want to free up memory can use proactive reclaim
> on nodes knowingly to be on the bottom tiers to force eviction in a
> natural way - higher access latencies are still better than swap.
> If compelled, while no guarantees and perhaps not worth the effort,
> users could also also potentially follow a ladder-like approach to
> eventually free up the memory. Alternatively, perhaps an 'evict' option
> could be added to the parameters for both memory.reclaim and per-node
> interfaces to force this action unconditionally.
>
> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
> ---
>
> This topic has been brought up in the past without much resolution.
> But today, I believe a number of semantics and expectations have become
> clearer (per the changelog), which could merit revisiting this.
>
>  Documentation/ABI/stable/sysfs-devices-node |  11 ++
>  drivers/base/node.c                         |   2 +
>  include/linux/swap.h                        |  16 ++
>  mm/vmscan.c                                 | 154 ++++++++++++++++----
>  4 files changed, 156 insertions(+), 27 deletions(-)
>
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index 402af4b2b905..5d69ee956cf9 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -221,3 +221,14 @@ Contact:   Jiaqi Yan <jiaqiyan@google.com>
>  Description:
>                 Of the raw poisoned pages on a NUMA node, how many pages are
>                 recovered by memory error recovery attempt.
> +
> +What:          /sys/devices/system/node/nodeX/reclaim
> +Date:          September 2024
> +Contact:       Linux Memory Management list <linux-mm@kvack.org>
> +Description:
> +               This is write-only nested-keyed file which accepts the number of
> +               bytes to reclaim as well as the swappiness for this particular
> +               operation. Write the amount of bytes to induce memory reclaim in
> +               this node. When it completes successfully, the specified amount
> +               or more memory will have been reclaimed, and -EAGAIN if less
> +               bytes are reclaimed than the specified amount.
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index eb72580288e6..d8ed19f8565b 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -626,6 +626,7 @@ static int register_node(struct node *node, int num)
>         } else {
>                 hugetlb_register_node(node);
>                 compaction_register_node(node);
> +               reclaim_register_node(node);
>         }
>
>         return error;
> @@ -642,6 +643,7 @@ void unregister_node(struct node *node)
>  {
>         hugetlb_unregister_node(node);
>         compaction_unregister_node(node);
> +       reclaim_unregister_node(node);
>         node_remove_accesses(node);
>         node_remove_caches(node);
>         device_unregister(&node->dev);
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 248db1dd7812..456e3aedb964 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -423,6 +423,22 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
>  extern int vm_swappiness;
>  long remove_mapping(struct address_space *mapping, struct folio *folio);
>
> +#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> +extern int reclaim_register_node(struct node *node);
> +extern void reclaim_unregister_node(struct node *node);
> +
> +#else
> +
> +static inline int reclaim_register_node(struct node *node)
> +{
> +       return 0;
> +}
> +
> +static inline void reclaim_unregister_node(struct node *node)
> +{
> +}
> +#endif /* CONFIG_SYSFS && CONFIG_NUMA */
> +
>  #ifdef CONFIG_NUMA
>  extern int node_reclaim_mode;
>  extern int sysctl_min_unmapped_ratio;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5dc96a843466..56ddf54366e4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -56,6 +56,7 @@
>  #include <linux/khugepaged.h>
>  #include <linux/rculist_nulls.h>
>  #include <linux/random.h>
> +#include <linux/parser.h>
>
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -92,10 +93,8 @@ struct scan_control {
>         unsigned long   anon_cost;
>         unsigned long   file_cost;
>
> -#ifdef CONFIG_MEMCG
>         /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
>         int *proactive_swappiness;
> -#endif
>
>         /* Can active folios be deactivated as part of reclaim? */
>  #define DEACTIVATE_ANON 1
> @@ -266,6 +265,9 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>
>  static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
>  {
> +       if (sc->proactive && sc->proactive_swappiness)
> +               return *sc->proactive_swappiness;
> +

This code is already upstream, right?

>         return READ_ONCE(vm_swappiness);
>  }
>  #endif
> @@ -7470,36 +7472,28 @@ static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
>  /*
>   * Try to free up some pages from this node through reclaim.
>   */
> -static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
> +static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask,
> +                         unsigned long nr_pages, struct scan_control *sc)
>  {
> -       /* Minimum pages needed in order to stay on node */
> -       const unsigned long nr_pages = 1 << order;
>         struct task_struct *p = current;
>         unsigned int noreclaim_flag;
> -       struct scan_control sc = {
> -               .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> -               .gfp_mask = current_gfp_context(gfp_mask),
> -               .order = order,
> -               .priority = NODE_RECLAIM_PRIORITY,
> -               .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
> -               .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
> -               .may_swap = 1,
> -               .reclaim_idx = gfp_zone(gfp_mask),
> -       };
>         unsigned long pflags;
>
> -       trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
> -                                          sc.gfp_mask);
> +       trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, sc->order,
> +                                          sc->gfp_mask);
>
>         cond_resched();
> -       psi_memstall_enter(&pflags);
> +
> +       if (!sc->proactive)
> +               psi_memstall_enter(&pflags);
> +
>         delayacct_freepages_start();
> -       fs_reclaim_acquire(sc.gfp_mask);
> +       fs_reclaim_acquire(sc->gfp_mask);
>         /*
>          * We need to be able to allocate from the reserves for RECLAIM_UNMAP
>          */
>         noreclaim_flag = memalloc_noreclaim_save();
> -       set_task_reclaim_state(p, &sc.reclaim_state);
> +       set_task_reclaim_state(p, &sc->reclaim_state);
>
>         if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages ||
>             node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B) > pgdat->min_slab_pages) {
> @@ -7508,24 +7502,38 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
>                  * priorities until we have enough memory freed.
>                  */
>                 do {
> -                       shrink_node(pgdat, &sc);
> -               } while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
> +                       shrink_node(pgdat, sc);
> +               } while (sc->nr_reclaimed < nr_pages && --sc->priority >= 0);
>         }
>
>         set_task_reclaim_state(p, NULL);
>         memalloc_noreclaim_restore(noreclaim_flag);
> -       fs_reclaim_release(sc.gfp_mask);
> -       psi_memstall_leave(&pflags);
> +       fs_reclaim_release(sc->gfp_mask);
>         delayacct_freepages_end();
>
> -       trace_mm_vmscan_node_reclaim_end(sc.nr_reclaimed);
> +       if (!sc->proactive)
> +               psi_memstall_leave(&pflags);
> +
> +       trace_mm_vmscan_node_reclaim_end(sc->nr_reclaimed);
>
> -       return sc.nr_reclaimed >= nr_pages;
> +       return sc->nr_reclaimed;
>  }
>
>  int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>  {
>         int ret;
> +       /* Minimum pages needed in order to stay on node */
> +       const unsigned long nr_pages = 1 << order;
> +       struct scan_control sc = {
> +               .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> +               .gfp_mask = current_gfp_context(gfp_mask),
> +               .order = order,
> +               .priority = NODE_RECLAIM_PRIORITY,
> +               .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
> +               .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
> +               .may_swap = 1,
> +               .reclaim_idx = gfp_zone(gfp_mask),
> +       };
>
>         /*
>          * Node reclaim reclaims unmapped file backed pages and
> @@ -7560,7 +7568,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>         if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
>                 return NODE_RECLAIM_NOSCAN;
>
> -       ret = __node_reclaim(pgdat, gfp_mask, order);
> +       ret = __node_reclaim(pgdat, gfp_mask, nr_pages, &sc) >= nr_pages;
>         clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
>
>         if (ret)
> @@ -7617,3 +7625,95 @@ void check_move_unevictable_folios(struct folio_batch *fbatch)
>         }
>  }
>  EXPORT_SYMBOL_GPL(check_move_unevictable_folios);
> +
> +#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> +
> +enum {
> +       MEMORY_RECLAIM_SWAPPINESS = 0,
> +       MEMORY_RECLAIM_NULL,
> +};
> +
> +static const match_table_t tokens = {
> +       { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
> +       { MEMORY_RECLAIM_NULL, NULL },
> +};
> +
> +static ssize_t reclaim_store(struct device *dev,
> +                            struct device_attribute *attr,
> +                            const char *buf, size_t count)
> +{
> +       int nid = dev->id;
> +       gfp_t gfp_mask = GFP_KERNEL;
> +       struct pglist_data *pgdat = NODE_DATA(nid);
> +       unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +       unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> +       int swappiness = -1;
> +       char *old_buf, *start;
> +       substring_t args[MAX_OPT_ARGS];
> +       struct scan_control sc = {
> +               .gfp_mask = current_gfp_context(gfp_mask),
> +               .reclaim_idx = gfp_zone(gfp_mask),
> +               .priority = DEF_PRIORITY,
> +               .may_writepage = !laptop_mode,
> +               .may_unmap = 1,
> +               .may_swap = 1,
> +               .proactive = 1,
> +       };
> +
> +       buf = strstrip((char *)buf);
> +
> +       old_buf = (char *)buf;
> +       nr_to_reclaim = memparse(buf, (char **)&buf) / PAGE_SIZE;
> +       if (buf == old_buf)
> +               return -EINVAL;
> +
> +       buf = strstrip((char *)buf);
> +
> +       while ((start = strsep((char **)&buf, " ")) != NULL) {
> +               if (!strlen(start))
> +                       continue;
> +               switch (match_token(start, tokens, args)) {
> +               case MEMORY_RECLAIM_SWAPPINESS:
> +                       if (match_int(&args[0], &swappiness))
> +                               return -EINVAL;
> +                       if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
> +                               return -EINVAL;
> +                       break;
> +               default:
> +                       return -EINVAL;
> +               }
> +       }
> +
> +       sc.nr_to_reclaim = max(nr_to_reclaim, SWAP_CLUSTER_MAX);
> +       while (nr_reclaimed < nr_to_reclaim) {
> +               unsigned long reclaimed;
> +
> +               if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
> +                       return -EAGAIN;

Can the PGDAT_RECLAIM_LOCKED check be moved into __node_reclaim()?
They are duplicated in node_reclaim().
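
I.e., something like the following sketch, where __node_reclaim_locked()
is a hypothetical name for the current function body:

	/* sketch: take the per-node reclaim trylock in one place */
	static long __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask,
				   unsigned long nr_pages, struct scan_control *sc)
	{
		long nr_reclaimed;

		if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
			return -EAGAIN;

		nr_reclaimed = __node_reclaim_locked(pgdat, gfp_mask, nr_pages, sc);

		clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
		return nr_reclaimed;
	}

Callers would then check for -EAGAIN rather than taking the bit themselves.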

> +
> +               /* does cond_resched() */
> +               reclaimed = __node_reclaim(pgdat, gfp_mask,
> +                                          nr_to_reclaim - nr_reclaimed, &sc);
> +
> +               clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
> +
> +               if (!reclaimed && !nr_retries--)
> +                       break;
> +
> +               nr_reclaimed += reclaimed;
> +       }

In the memcg code (i.e. memory_reclaim()) we also check for pending
signals, and drain the LRUs before the last iteration. Do we need this
here as well?
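
For reference, the relevant part of memory_reclaim()'s loop is roughly:

	if (signal_pending(current))
		return -EINTR;

	/*
	 * This is the final attempt, drain percpu lru caches in the
	 * hope of introducing more evictable pages.
	 */
	if (!nr_retries)
		lru_add_drain_all();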

This leads to my next question: there is a lot of common code with
memory_reclaim(). Should we refactor some of it? At least the
arguments parsing part looks almost identical.
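
E.g., a shared helper along these lines (the name and signature are made
up) could serve both memory_reclaim() and reclaim_store():

	/* sketch: parse "<bytes> [swappiness=<n>]" into pages and swappiness */
	static int reclaim_parse_args(char *buf, unsigned long *nr_to_reclaim,
				      int *swappiness)
	{
		char *old_buf, *start;
		substring_t args[MAX_OPT_ARGS];

		buf = strstrip(buf);
		old_buf = buf;
		*nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
		if (buf == old_buf)
			return -EINVAL;

		buf = strstrip(buf);
		while ((start = strsep(&buf, " ")) != NULL) {
			if (!strlen(start))
				continue;
			switch (match_token(start, tokens, args)) {
			case MEMORY_RECLAIM_SWAPPINESS:
				if (match_int(&args[0], swappiness))
					return -EINVAL;
				if (*swappiness < MIN_SWAPPINESS ||
				    *swappiness > MAX_SWAPPINESS)
					return -EINVAL;
				break;
			default:
				return -EINVAL;
			}
		}
		return 0;
	}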

> +
> +       return nr_reclaimed < nr_to_reclaim ? -EAGAIN : count;
> +}
> +
> +static DEVICE_ATTR_WO(reclaim);
> +int reclaim_register_node(struct node *node)
> +{
> +       return device_create_file(&node->dev, &dev_attr_reclaim);
> +}
> +
> +void reclaim_unregister_node(struct node *node)
> +{
> +       return device_remove_file(&node->dev, &dev_attr_reclaim);
> +}
> +#endif
> --
> 2.39.2
>
Davidlohr Bueso Sept. 5, 2024, 1:08 a.m. UTC | #3
On Wed, 04 Sep 2024, Andrew Morton wrote:
>On Wed,  4 Sep 2024 09:27:40 -0700 Davidlohr Bueso <dave@stgolabs.net> wrote:
>
>> This adds support for allowing proactive reclaim in general on a
>> NUMA system. A per-node interface extends support for beyond a
>> memcg-specific interface, respecting the current semantics of
>> memory.reclaim: respecting aging LRU and not supporting
>> artificially triggering eviction on nodes belonging to non-bottom
>> tiers.
>>
>> This patch allows userspace to do:
>>
>>      echo 512M swappiness=10 > /sys/devices/system/node/nodeX/reclaim
>
>One value per sysfs file is a rule.

I wasn't aware of it as a rule - is this documented somewhere?

I ask because I see some others are using space-separated parameters, ie:

/sys/bus/usb/drivers/foo/new_id

... or colons. What would be acceptable? echo "512M:10" > ... ?

>> +What:		/sys/devices/system/node/nodeX/reclaim
>> +Date:		September 2024
>> +Contact:	Linux Memory Management list <linux-mm@kvack.org>
>> +Description:
>> +		This is write-only nested-keyed file which accepts the number of
>
>"is a write-only".
>
>What does "nested keyed" mean?

Will re-phrase.

>
>> +		bytes to reclaim as well as the swappiness for this particular
>> +		operation. Write the amount of bytes to induce memory reclaim in
>> +		this node. When it completes successfully, the specified amount
>> +		or more memory will have been reclaimed, and -EAGAIN if less
>> +		bytes are reclaimed than the specified amount.
>
>Could be that this feature would benefit from a more expansive
>treatment under Documentation/somewhere.

Sure.

>
>>
>> ...
>>
>> +#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
>> +
>> +enum {
>> +	MEMORY_RECLAIM_SWAPPINESS = 0,
>> +	MEMORY_RECLAIM_NULL,
>> +};
>> +
>> +static const match_table_t tokens = {
>> +	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
>> +	{ MEMORY_RECLAIM_NULL, NULL },
>> +};
>> +
>> +static ssize_t reclaim_store(struct device *dev,
>> +			     struct device_attribute *attr,
>> +			     const char *buf, size_t count)
>> +{
>> +	int nid = dev->id;
>> +	gfp_t gfp_mask = GFP_KERNEL;
>> +	struct pglist_data *pgdat = NODE_DATA(nid);
>> +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
>> +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
>> +	int swappiness = -1;
>> +	char *old_buf, *start;
>> +	substring_t args[MAX_OPT_ARGS];
>> +	struct scan_control sc = {
>> +		.gfp_mask = current_gfp_context(gfp_mask),
>> +		.reclaim_idx = gfp_zone(gfp_mask),
>> +		.priority = DEF_PRIORITY,
>> +		.may_writepage = !laptop_mode,
>> +		.may_unmap = 1,
>> +		.may_swap = 1,
>> +		.proactive = 1,
>> +	};
>> +
>> +	buf = strstrip((char *)buf);
>> +
>> +	old_buf = (char *)buf;
>> +	nr_to_reclaim = memparse(buf, (char **)&buf) / PAGE_SIZE;
>> +	if (buf == old_buf)
>> +		return -EINVAL;
>> +
>> +	buf = strstrip((char *)buf);
>> +
>> +	while ((start = strsep((char **)&buf, " ")) != NULL) {
>> +		if (!strlen(start))
>> +			continue;
>> +		switch (match_token(start, tokens, args)) {
>> +		case MEMORY_RECLAIM_SWAPPINESS:
>> +			if (match_int(&args[0], &swappiness))
>> +				return -EINVAL;
>> +			if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
>> +				return -EINVAL;
>
>Code forgot to use local `swappiness' for any purpose?

Bleh, yeah sc.proactive_swappiness needs to be set here.
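
I.e., something like this after the parsing loop (a sketch, mirroring
what memcg's memory_reclaim() does):

	/* only override vm_swappiness if the user actually passed a value */
	sc.proactive_swappiness = swappiness == -1 ? NULL : &swappiness;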

>
>> +			break;
>> +		default:
>> +			return -EINVAL;
>> +		}
>> +	}
>> +
>>
>> ...
>>
Andrew Morton Sept. 5, 2024, 1:15 a.m. UTC | #4
On Wed, 4 Sep 2024 18:08:05 -0700 Davidlohr Bueso <dave@stgolabs.net> wrote:

> On Wed, 04 Sep 2024, Andrew Morton wrote:
> >On Wed,  4 Sep 2024 09:27:40 -0700 Davidlohr Bueso <dave@stgolabs.net> wrote:
> >
> >> This adds support for allowing proactive reclaim in general on a
> >> NUMA system. A per-node interface extends support for beyond a
> >> memcg-specific interface, respecting the current semantics of
> >> memory.reclaim: respecting aging LRU and not supporting
> >> artificially triggering eviction on nodes belonging to non-bottom
> >> tiers.
> >>
> >> This patch allows userspace to do:
> >>
> >>      echo 512M swappiness=10 > /sys/devices/system/node/nodeX/reclaim
> >
> >One value per sysfs file is a rule.
> 
> I wasn't aware of it as a rule - is this documented somewhere?

Documentation/filesystems/sysfs.rst, line 62.  Also lots of gregkh
grumpygrams :)

> I ask because I see some others are using space-separated parameters, ie:
> 
> /sys/bus/usb/drivers/foo/new_id
> 
> ... or colons. What would be acceptable? echo "512M:10" > ... ?

Kinda cheating.  But the rule gets violated a lot.
Davidlohr Bueso Sept. 5, 2024, 3:35 a.m. UTC | #5
On Wed, 04 Sep 2024, Andrew Morton wrote:
>On Wed, 4 Sep 2024 18:08:05 -0700 Davidlohr Bueso <dave@stgolabs.net> wrote:
>
>> On Wed, 04 Sep 2024, Andrew Morton wrote:
>> >On Wed,  4 Sep 2024 09:27:40 -0700 Davidlohr Bueso <dave@stgolabs.net> wrote:
>> >
>> >> This adds support for allowing proactive reclaim in general on a
>> >> NUMA system. A per-node interface extends support for beyond a
>> >> memcg-specific interface, respecting the current semantics of
>> >> memory.reclaim: respecting aging LRU and not supporting
>> >> artificially triggering eviction on nodes belonging to non-bottom
>> >> tiers.
>> >>
>> >> This patch allows userspace to do:
>> >>
>> >>      echo 512M swappiness=10 > /sys/devices/system/node/nodeX/reclaim
>> >
>> >One value per sysfs file is a rule.
>>
>> I wasn't aware of it as a rule - is this documented somewhere?
>
>Documentation/filesystems/sysfs.rst, line 62.  Also lots of gregkh
>grumpygrams :)
>
>> I ask because I see some others are using space-separated parameters, ie:
>>
>> /sys/bus/usb/drivers/foo/new_id
>>
>> ... or colons. What would be acceptable? echo "512M:10" > ... ?
>
>Kinda cheating.  But the rule gets violated a lot.

The only other alternative I can think of is to have a separate file
for swappiness, which of course sucks. So I will go with the colon
approach unless somebody shouts - I still prefer it as is in this patch,
if we are going to violate the rule altogether...
Yosry Ahmed Sept. 5, 2024, 7:31 a.m. UTC | #6
On Wed, Sep 4, 2024 at 8:35 PM Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> On Wed, 04 Sep 2024, Andrew Morton wrote:
> >On Wed, 4 Sep 2024 18:08:05 -0700 Davidlohr Bueso <dave@stgolabs.net> wrote:
> >
> >> On Wed, 04 Sep 2024, Andrew Morton wrote:
> >> >On Wed,  4 Sep 2024 09:27:40 -0700 Davidlohr Bueso <dave@stgolabs.net> wrote:
> >> >
> >> >> This adds support for allowing proactive reclaim in general on a
> >> >> NUMA system. A per-node interface extends support for beyond a
> >> >> memcg-specific interface, respecting the current semantics of
> >> >> memory.reclaim: respecting aging LRU and not supporting
> >> >> artificially triggering eviction on nodes belonging to non-bottom
> >> >> tiers.
> >> >>
> >> >> This patch allows userspace to do:
> >> >>
> >> >>      echo 512M swappiness=10 > /sys/devices/system/node/nodeX/reclaim
> >> >
> >> >One value per sysfs file is a rule.
> >>
> >> I wasn't aware of it as a rule - is this documented somewhere?
> >
> >Documentation/filesystems/sysfs.rst, line 62.  Also lots of gregkh
> >grumpygrams :)
> >
> >> I ask because I see some others are using space-separated parameters, ie:
> >>
> >> /sys/bus/usb/drivers/foo/new_id
> >>
> >> ... or colons. What would be acceptable? echo "512M:10" > ... ?
> >
> >Kinda cheating.  But the rule gets violated a lot.
>
> The only other alternative I can think of is to have a separate file
> for swappiness, which of course sucks. So I will go with the colon
> approach unless somebody shouts - I still prefer it as is in this patch,
> if we are going to violate the rule altogether...

I also prefer this patch's approach. It'd be really confusing if the
per-node and per-memcg proactive reclaim interfaces have the same
semantics but different syntax.
Hillf Danton Sept. 5, 2024, 9:59 p.m. UTC | #7
On Wed,  4 Sep 2024 09:27:40 -0700 Davidlohr Bueso <dave@stgolabs.net> wrote:

> This adds support for allowing proactive reclaim in general on a
> NUMA system. A per-node interface extends support for beyond a
> memcg-specific interface, respecting the current semantics of
> memory.reclaim: respecting aging LRU and not supporting
> artificially triggering eviction on nodes belonging to non-bottom
> tiers.
> 
> This patch allows userspace to do:
> 
>      echo 512M swappiness=10 > /sys/devices/system/node/nodeX/reclaim
>
The proactive reclaim on the cmdline looks like a waste of cpu cycles before
the cases where kswapd fails to work are spotted. It is not correct to add
it just because you can type the code.
Davidlohr Bueso Sept. 5, 2024, 11:29 p.m. UTC | #8
On Fri, 06 Sep 2024, Hillf Danton wrote:
>The proactive reclaim on the cmdline looks like waste of cpu cycles before
>the cases where kswapd fails to work are spotted. It is not correct to add
>it because you can type the code.

Are you against proactive reclaim altogether (ie: memcg) or this patch in
particular, which extends its availability?

The benefits of proactive reclaim are well documented, and the community has
been overall favorable towards it. This operation is not meant to be generally
used, but there are real latency benefits to be had which are completely
unrelated to watermarks. Similarly, we have 'compact' as an alternative to
kcompactd (which was once upon a time part of kswapd).
Hillf Danton Sept. 6, 2024, 11:04 a.m. UTC | #9
On Thu, 5 Sep 2024 16:29:41 -0700 Davidlohr Bueso <dave@stgolabs.net>
> On Fri, 06 Sep 2024, Hillf Danton wrote:
> >The proactive reclaim on the cmdline looks like waste of cpu cycles before
> >the cases where kswapd fails to work are spotted. It is not correct to add
> >it because you can type the code.
> 
> Are you against proactive reclaim altogether (ie: memcg) or this patch in
> particular, which extends its availability?
> 
The against makes no sense to me because I know your patch is never able to
escape a standing ovation.

> The benefits of proactive reclaim are well documented, and the community has
> been overall favorable towards it. This operation is not meant to be generally
> used, but there are real latency benefits to be had which are completely
> unrelated to watermarks. Similarly, we have 'compact' as an alternative to
> kcompactd (which was once upon a time part of kswapd).
>
Because kswapd is responsible for watermarks instead of high order pages,
compact does not justify proactive reclaim from the beginning.
Michal Hocko Sept. 9, 2024, 7:12 a.m. UTC | #10
On Fri 06-09-24 19:04:19, Hillf Danton wrote:
> On Thu, 5 Sep 2024 16:29:41 -0700 Davidlohr Bueso <dave@stgolabs.net>
> > On Fri, 06 Sep 2024, Hillf Danton wrote:
> > >The proactive reclaim on the cmdline looks like waste of cpu cycles before
> > >the cases where kswapd fails to work are spotted. It is not correct to add
> > >it because you can type the code.
> > 
> > Are you against proactive reclaim altogether (ie: memcg) or this patch in
> > particular, which extends its availability?
> > 
> The against makes no sense to me because I know your patch is never able to
> escape standing ovation.

I fail to understand your reasoning. Do you have any actual technical
arguments why this is a bad idea?

> > The benefits of proactive reclaim are well documented, and the community has
> > been overall favorable towards it. This operation is not meant to be generally
> > used, but there are real latency benefits to be had which are completely
> > unrelated to watermarks. Similarly, we have 'compact' as an alternative to
> > kcompactd (which was once upon a time part of kswapd).
> >
> Because kswapd is responsible for watermark instead of high order pages,
> compact does not justify proactive reclaim from the begining.

What do you mean? How does keeping a global watermark help to trigger
per NUMA node specific aging - e.g. demotion? Or do you dispute the
overall idea and have a different idea of how to achieve those usecases?
Michal Hocko Sept. 9, 2024, 7:20 a.m. UTC | #11
On Wed 04-09-24 09:27:40, Davidlohr Bueso wrote:
> This adds support for allowing proactive reclaim in general on a
> NUMA system. A per-node interface extends support for beyond a
> memcg-specific interface, respecting the current semantics of
> memory.reclaim: respecting aging LRU and not supporting
> artificially triggering eviction on nodes belonging to non-bottom
> tiers.
> 
> This patch allows userspace to do:
> 
>      echo 512M swappiness=10 > /sys/devices/system/node/nodeX/reclaim
> 
> One of the premises for this is to semantically align as best as
> possible with memory.reclaim. During a brief time memcg did
> support nodemask until 55ab834a86a9 (Revert "mm: add nodes=
> arg to memory.reclaim"), for which semantics around reclaim
> (eviction) vs demotion were not clear, rendering charging
> expectations to be broken.
> 
> With this approach:
> 
> 1. Users who do not use memcg can benefit from proactive reclaim.

It would be great to have some specific examples here. Is there a
specific reason memcg is not used?

> 2. Proactive reclaim on top tiers will trigger demotion, for which
> memory is still byte-addressable. Reclaiming on the bottom nodes
> will trigger evicting to swap (the traditional sense of reclaim).
> This follows the semantics of what is today part of the aging process
> on tiered memory, mirroring what every other form of reclaim does
> (reactive and memcg proactive reclaim). Furthermore per-node proactive
> reclaim is not as susceptible to the memcg charging problem mentioned
> above.
> 
> 3. Unlike memcg, there should be no surprises of callers expecting
> reclaim but instead got a demotion. Essentially relying on behavior
> of shrink_folio_list() after 6b426d071419 (mm: disable top-tier
> fallback to reclaim on proactive reclaim), without the expectations
> of try_to_free_mem_cgroup_pages().

I am not sure I understand. If you demote then you effectively reclaim
because you free up memory on the specific node. Or do I just misread
what you mean? Maybe you meant to say that the overall memory
consumption on all nodes is not affected?

Your points 4 and 5 follow up on this so we had better clarify that
before going there.
Hillf Danton Sept. 9, 2024, 10:51 a.m. UTC | #12
On Mon, 9 Sep 2024 09:12:03 +0200 Michal Hocko <mhocko@suse.com>
> On Fri 06-09-24 19:04:19, Hillf Danton wrote:
> > On Thu, 5 Sep 2024 16:29:41 -0700 Davidlohr Bueso <dave@stgolabs.net>
> > > On Fri, 06 Sep 2024, Hillf Danton wrote:
> > > >The proactive reclaim on the cmdline looks like waste of cpu cycles before
> > > >the cases where kswapd fails to work are spotted. It is not correct to add
> > > >it because you can type the code.
> > > 
> > > Are you against proactive reclaim altogether (ie: memcg) or this patch in
> > > particular, which extends its availability?
> > > 
> > The against makes no sense to me because I know your patch is never able to
> > escape standing ovation.
> 
> I fail to understand your reasoning. Do you have any actual technical
> arguments why this is a bad idea?
> 
> > > The benefits of proactive reclaim are well documented, and the community has
> > > been overall favorable towards it. This operation is not meant to be generally
> > > used, but there are real latency benefits to be had which are completely
> > > unrelated to watermarks. Similarly, we have 'compact' as an alternative to
> > > kcompactd (which was once upon a time part of kswapd).
> > >
> > Because kswapd is responsible for watermark instead of high order pages,
> > compact does not justify proactive reclaim from the begining.
> 
> What do you mean? How does keeping a global watermark helps to trigger
> per NUMA node specific aging - e.g. demotion?
>
In addition to the cost of pro/demotion, the percpu pages prevent random aging
from making any sense without memory pressure, because I think it is aging that
rolls out the red carpet for multi-gen lru.
Michal Hocko Sept. 9, 2024, 2:50 p.m. UTC | #13
On Mon 09-09-24 18:51:57, Hillf Danton wrote:
> On Date: Mon, 9 Sep 2024 09:12:03 +0200 Michal Hocko <mhocko@suse.com>
> > On Fri 06-09-24 19:04:19, Hillf Danton wrote:
> > > On Thu, 5 Sep 2024 16:29:41 -0700 Davidlohr Bueso <dave@stgolabs.net>
> > > > On Fri, 06 Sep 2024, Hillf Danton wrote:
> > > > >The proactive reclaim on the cmdline looks like waste of cpu cycles before
> > > > >the cases where kswapd fails to work are spotted. It is not correct to add
> > > > >it because you can type the code.
> > > > 
> > > > Are you against proactive reclaim altogether (ie: memcg) or this patch in
> > > > particular, which extends its availability?
> > > > 
> > > The against makes no sense to me because I know your patch is never able to
> > > escape standing ovation.
> > 
> > I fail to understand your reasoning. Do you have any actual technical
> > arguments why this is a bad idea?
> > 
> > > > The benefits of proactive reclaim are well documented, and the community has
> > > > been overall favorable towards it. This operation is not meant to be generally
> > > > used, but there are real latency benefits to be had which are completely
> > > > unrelated to watermarks. Similarly, we have 'compact' as an alternative to
> > > > kcompactd (which was once upon a time part of kswapd).
> > > >
> > > Because kswapd is responsible for watermark instead of high order pages,
> > > compact does not justify proactive reclaim from the begining.
> > 
> > What do you mean? How does keeping a global watermark helps to trigger
> > per NUMA node specific aging - e.g. demotion?
> >
> In addition to the cost of pro/demorion, the percpu pages prevent random aging
> from making any sense without memory pressue, because I think it is aging that
> rolls out red carpet for multi-gen lru.

I am sorry but I do not get what you are trying to say. Can you be
_much_more_ specific?
Davidlohr Bueso Sept. 10, 2024, 4:31 p.m. UTC | #14
On Mon, 09 Sep 2024, Michal Hocko wrote:

>On Wed 04-09-24 09:27:40, Davidlohr Bueso wrote:
>> 1. Users who do not use memcg can benefit from proactive reclaim.
>
>It would be great to have some specific examples here. Is there a
>specific reason memcg is not used?

I know cases of people wanting to use this to free up fast memory
without incurring extra latency spikes before a promotion occurs.
I do not have details as to why memcg is not used. I can also see
this for virtual machines running on specific nodes, reclaiming "extra"
memory based on wss and qos, as well as potential hibernation optimizations.

>> 2. Proactive reclaim on top tiers will trigger demotion, for which
>> memory is still byte-addressable. Reclaiming on the bottom nodes
>> will trigger evicting to swap (the traditional sense of reclaim).
>> This follows the semantics of what is today part of the aging process
>> on tiered memory, mirroring what every other form of reclaim does
>> (reactive and memcg proactive reclaim). Furthermore per-node proactive
>> reclaim is not as susceptible to the memcg charging problem mentioned
>> above.
>>
>> 3. Unlike memcg, there should be no surprises of callers expecting
>> reclaim but instead got a demotion. Essentially relying on behavior
>> of shrink_folio_list() after 6b426d071419 (mm: disable top-tier
>> fallback to reclaim on proactive reclaim), without the expectations
>> of try_to_free_mem_cgroup_pages().
>
>I am not sure I understand. If you demote then you effectively reclaim
>because you free up memory on the specific node. Or do I just misread
>what you mean? Maybe you meant to say that the overall memory
>consumption on all nodes is not affected?

Yes, exactly, that is what I meant to say.

>Your point 4 and 5 follows up on this so we should better clarify that
>before going there.
Michal Hocko Sept. 11, 2024, 6:49 a.m. UTC | #15
On Tue 10-09-24 09:31:15, Davidlohr Bueso wrote:
> On Mon, 09 Sep 2024, Michal Hocko wrote:
> 
> > On Wed 04-09-24 09:27:40, Davidlohr Bueso wrote:
> > > 1. Users who do not use memcg can benefit from proactive reclaim.
> > 
> > It would be great to have some specific examples here. Is there a
> > specific reason memcg is not used?
> 
> I know cases of people wanting to use this to free up fast memory
> without incurring in extra latency spikes before a promotion occurs.

Please give us more information about those because this might have an
impact on how the interface is shaped. E.g. we might need to plan for 
future extension.

> I do not have details as to why memcg is not used.

I am not saying this is crucial to clarify but it is a natural question.
We have a ready interface to achieve preemptive reclaim, so why not use
that rather than introduce something new? A plausible argument could be
that the memcg interface is not NUMA aware and there are usecases that
focus on NUMA balancing rather than workload memory footprint.

> I can also see
> this for virtual machines running on specific nodes, reclaiming "extra"
> memory based on wss and qos, as well as potential hibernation optimizations.

Do virtual solutions not have their own ways to manage overcommit/memory
balancing (memory ballooning etc.)? Does such an interface fit into the
existing picture?

Patch

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 402af4b2b905..5d69ee956cf9 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -221,3 +221,14 @@  Contact:	Jiaqi Yan <jiaqiyan@google.com>
 Description:
 		Of the raw poisoned pages on a NUMA node, how many pages are
 		recovered by memory error recovery attempt.
+
+What:		/sys/devices/system/node/nodeX/reclaim
+Date:		September 2024
+Contact:	Linux Memory Management list <linux-mm@kvack.org>
+Description:
+		This is write-only nested-keyed file which accepts the number of
+		bytes to reclaim as well as the swappiness for this particular
+		operation. Write the amount of bytes to induce memory reclaim in
+		this node. When it completes successfully, the specified amount
+		or more memory will have been reclaimed, and -EAGAIN if less
+		bytes are reclaimed than the specified amount.
diff --git a/drivers/base/node.c b/drivers/base/node.c
index eb72580288e6..d8ed19f8565b 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -626,6 +626,7 @@  static int register_node(struct node *node, int num)
 	} else {
 		hugetlb_register_node(node);
 		compaction_register_node(node);
+		reclaim_register_node(node);
 	}
 
 	return error;
@@ -642,6 +643,7 @@  void unregister_node(struct node *node)
 {
 	hugetlb_unregister_node(node);
 	compaction_unregister_node(node);
+	reclaim_unregister_node(node);
 	node_remove_accesses(node);
 	node_remove_caches(node);
 	device_unregister(&node->dev);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 248db1dd7812..456e3aedb964 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -423,6 +423,22 @@  extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 long remove_mapping(struct address_space *mapping, struct folio *folio);
 
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+extern int reclaim_register_node(struct node *node);
+extern void reclaim_unregister_node(struct node *node);
+
+#else
+
+static inline int reclaim_register_node(struct node *node)
+{
+	return 0;
+}
+
+static inline void reclaim_unregister_node(struct node *node)
+{
+}
+#endif /* CONFIG_SYSFS && CONFIG_NUMA */
+
 #ifdef CONFIG_NUMA
 extern int node_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5dc96a843466..56ddf54366e4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -56,6 +56,7 @@ 
 #include <linux/khugepaged.h>
 #include <linux/rculist_nulls.h>
 #include <linux/random.h>
+#include <linux/parser.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -92,10 +93,8 @@  struct scan_control {
 	unsigned long	anon_cost;
 	unsigned long	file_cost;
 
-#ifdef CONFIG_MEMCG
 	/* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
 	int *proactive_swappiness;
-#endif
 
 	/* Can active folios be deactivated as part of reclaim? */
 #define DEACTIVATE_ANON 1
@@ -266,6 +265,9 @@  static bool writeback_throttling_sane(struct scan_control *sc)
 
 static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
 {
+	if (sc->proactive && sc->proactive_swappiness)
+		return *sc->proactive_swappiness;
+
 	return READ_ONCE(vm_swappiness);
 }
 #endif
@@ -7470,36 +7472,28 @@  static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
 /*
  * Try to free up some pages from this node through reclaim.
  */
-static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
+static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask,
+			  unsigned long nr_pages, struct scan_control *sc)
 {
-	/* Minimum pages needed in order to stay on node */
-	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	unsigned int noreclaim_flag;
-	struct scan_control sc = {
-		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = current_gfp_context(gfp_mask),
-		.order = order,
-		.priority = NODE_RECLAIM_PRIORITY,
-		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
-		.may_swap = 1,
-		.reclaim_idx = gfp_zone(gfp_mask),
-	};
 	unsigned long pflags;
 
-	trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,
-					   sc.gfp_mask);
+	trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, sc->order,
+					   sc->gfp_mask);
 
 	cond_resched();
-	psi_memstall_enter(&pflags);
+
+	if (!sc->proactive)
+		psi_memstall_enter(&pflags);
+
 	delayacct_freepages_start();
-	fs_reclaim_acquire(sc.gfp_mask);
+	fs_reclaim_acquire(sc->gfp_mask);
 	/*
 	 * We need to be able to allocate from the reserves for RECLAIM_UNMAP
 	 */
 	noreclaim_flag = memalloc_noreclaim_save();
-	set_task_reclaim_state(p, &sc.reclaim_state);
+	set_task_reclaim_state(p, &sc->reclaim_state);
 
 	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages ||
 	    node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B) > pgdat->min_slab_pages) {
@@ -7508,24 +7502,38 @@  static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		 * priorities until we have enough memory freed.
 		 */
 		do {
-			shrink_node(pgdat, &sc);
-		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
+			shrink_node(pgdat, sc);
+		} while (sc->nr_reclaimed < nr_pages && --sc->priority >= 0);
 	}
 
 	set_task_reclaim_state(p, NULL);
 	memalloc_noreclaim_restore(noreclaim_flag);
-	fs_reclaim_release(sc.gfp_mask);
-	psi_memstall_leave(&pflags);
+	fs_reclaim_release(sc->gfp_mask);
 	delayacct_freepages_end();
 
-	trace_mm_vmscan_node_reclaim_end(sc.nr_reclaimed);
+	if (!sc->proactive)
+		psi_memstall_leave(&pflags);
+
+	trace_mm_vmscan_node_reclaim_end(sc->nr_reclaimed);
 
-	return sc.nr_reclaimed >= nr_pages;
+	return sc->nr_reclaimed;
 }
 
 int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 {
 	int ret;
+	/* Minimum pages needed in order to stay on node */
+	const unsigned long nr_pages = 1 << order;
+	struct scan_control sc = {
+		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
+		.gfp_mask = current_gfp_context(gfp_mask),
+		.order = order,
+		.priority = NODE_RECLAIM_PRIORITY,
+		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
+		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
+		.may_swap = 1,
+		.reclaim_idx = gfp_zone(gfp_mask),
+	};
 
 	/*
 	 * Node reclaim reclaims unmapped file backed pages and
@@ -7560,7 +7568,7 @@  int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
 		return NODE_RECLAIM_NOSCAN;
 
-	ret = __node_reclaim(pgdat, gfp_mask, order);
+	ret = __node_reclaim(pgdat, gfp_mask, nr_pages, &sc) >= nr_pages;
 	clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
 
 	if (ret)
@@ -7617,3 +7625,95 @@  void check_move_unevictable_folios(struct folio_batch *fbatch)
 	}
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_folios);
+
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+
+enum {
+	MEMORY_RECLAIM_SWAPPINESS = 0,
+	MEMORY_RECLAIM_NULL,
+};
+
+static const match_table_t tokens = {
+	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
+	{ MEMORY_RECLAIM_NULL, NULL },
+};
+
+static ssize_t reclaim_store(struct device *dev,
+			     struct device_attribute *attr,
+			     const char *buf, size_t count)
+{
+	int nid = dev->id;
+	gfp_t gfp_mask = GFP_KERNEL;
+	struct pglist_data *pgdat = NODE_DATA(nid);
+	unsigned long nr_to_reclaim, nr_reclaimed = 0;
+	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
+	int swappiness = -1;
+	char *old_buf, *start;
+	substring_t args[MAX_OPT_ARGS];
+	struct scan_control sc = {
+		.gfp_mask = current_gfp_context(gfp_mask),
+		.reclaim_idx = gfp_zone(gfp_mask),
+		.priority = DEF_PRIORITY,
+		.may_writepage = !laptop_mode,
+		.may_unmap = 1,
+		.may_swap = 1,
+		.proactive = 1,
+	};
+
+	buf = strstrip((char *)buf);
+
+	old_buf = (char *)buf;
+	nr_to_reclaim = memparse(buf, (char **)&buf) / PAGE_SIZE;
+	if (buf == old_buf)
+		return -EINVAL;
+
+	buf = strstrip((char *)buf);
+
+	while ((start = strsep((char **)&buf, " ")) != NULL) {
+		if (!strlen(start))
+			continue;
+		switch (match_token(start, tokens, args)) {
+		case MEMORY_RECLAIM_SWAPPINESS:
+			if (match_int(&args[0], &swappiness))
+				return -EINVAL;
+			if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
+				return -EINVAL;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+
+	sc.nr_to_reclaim = max(nr_to_reclaim, SWAP_CLUSTER_MAX);
+	while (nr_reclaimed < nr_to_reclaim) {
+		unsigned long reclaimed;
+
+		if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
+			return -EAGAIN;
+
+		/* does cond_resched() */
+	        reclaimed = __node_reclaim(pgdat, gfp_mask,
+					   nr_to_reclaim - nr_reclaimed, &sc);
+
+		clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
+
+		if (!reclaimed && !nr_retries--)
+			break;
+
+		nr_reclaimed += reclaimed;
+	}
+
+	return nr_reclaimed < nr_to_reclaim ? -EAGAIN : count;
+}
+
+static DEVICE_ATTR_WO(reclaim);
+int reclaim_register_node(struct node *node)
+{
+	return device_create_file(&node->dev, &dev_attr_reclaim);
+}
+
+void reclaim_unregister_node(struct node *node)
+{
+	return device_remove_file(&node->dev, &dev_attr_reclaim);
+}
+#endif