[RFC,8/8] mm/numa: new reclaim mode to enable reclaim-based migration

Message ID 20200629234517.A7EC4BD3@viggo.jf.intel.com (mailing list archive)
State New, archived
Series: Migrate Pages in lieu of discard

Commit Message

Dave Hansen June 29, 2020, 11:45 p.m. UTC
From: Dave Hansen <dave.hansen@linux.intel.com>

Some method is obviously needed to enable reclaim-based migration.

Just like traditional autonuma, there will be some workloads that
benefit, such as workloads with more "static" configurations where
hot pages stay hot and cold pages stay cold.  If pages come and go
from the hot and cold sets, the benefits of this approach will be
more limited.

The benefits are truly workload-based and *not* hardware-based.
We do not believe that there is a viable threshold where certain
hardware configurations should have this mechanism enabled while
others do not.

To be conservative, earlier work defaulted to disabling reclaim-
based migration and did not include a mechanism to enable it.
This proposes extending the existing "zone_reclaim_mode" (now
really node_reclaim_mode) as a method to enable it.

We are open to any alternative that allows end users to enable
this mechanism or disable it if workload harm is detected (just
like traditional autonuma).
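
For example, with this patch an administrator could opt the system in
by writing the new bit (value 8 here) to the existing sysctl file.  A
minimal userspace sketch, assuming this patch's bit assignment:

/* Sketch: set vm.zone_reclaim_mode to RECLAIM_MIGRATE (8) only.
 * A real deployment might OR in the other reclaim bits as well. */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/proc/sys/vm/zone_reclaim_mode", O_WRONLY);

	if (fd < 0)
		return 1;
	if (write(fd, "8\n", 2) != 2) {
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}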

The implementation here is pretty simple and entirely unoptimized.
On any memory hotplug events, assume that a node was added or
removed and recalculate all migration targets.  This ensures that
the node_demotion[] array is always ready to be used in case the
new reclaim mode is enabled.  This recalculation is far from
optimal, most glaringly in that it does not even attempt to figure
out whether nodes are actually coming or going.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---

 b/Documentation/admin-guide/sysctl/vm.rst |    9 ++++
 b/mm/migrate.c                            |   61 +++++++++++++++++++++++++++++-
 b/mm/vmscan.c                             |    7 +--
 3 files changed, 73 insertions(+), 4 deletions(-)

Comments

Huang, Ying June 30, 2020, 7:23 a.m. UTC | #1
Hi, Dave,

Dave Hansen <dave.hansen@linux.intel.com> writes:

> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Some method is obviously needed to enable reclaim-based migration.
>
> Just like traditional autonuma, there will be some workloads that
> benefit, such as workloads with more "static" configurations where
> hot pages stay hot and cold pages stay cold.  If pages come and go
> from the hot and cold sets, the benefits of this approach will be
> more limited.
>
> The benefits are truly workload-based and *not* hardware-based.
> We do not believe that there is a viable threshold where certain
> hardware configurations should have this mechanism enabled while
> others do not.
>
> To be conservative, earlier work defaulted to disabling reclaim-
> based migration and did not include a mechanism to enable it.
> This proposes extending the existing "zone_reclaim_mode" (now
> really node_reclaim_mode) as a method to enable it.
>
> We are open to any alternative that allows end users to enable
> this mechanism or disable it if workload harm is detected (just
> like traditional autonuma).
>
> The implementation here is pretty simple and entirely unoptimized.
> On any memory hotplug events, assume that a node was added or
> removed and recalculate all migration targets.  This ensures that
> the node_demotion[] array is always ready to be used in case the
> new reclaim mode is enabled.  This recalculation is far from
> optimal, most glaringly in that it does not even attempt to figure
> out whether nodes are actually coming or going.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> ---
>
>  b/Documentation/admin-guide/sysctl/vm.rst |    9 ++++
>  b/mm/migrate.c                            |   61 +++++++++++++++++++++++++++++-
>  b/mm/vmscan.c                             |    7 +--
>  3 files changed, 73 insertions(+), 4 deletions(-)
>
> diff -puN Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion Documentation/admin-guide/sysctl/vm.rst
> --- a/Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion	2020-06-29 16:35:01.012312549 -0700
> +++ b/Documentation/admin-guide/sysctl/vm.rst	2020-06-29 16:35:01.021312549 -0700
> @@ -941,6 +941,7 @@ This is value OR'ed together of
>  1	(bit currently ignored)
>  2	Zone reclaim writes dirty pages out
>  4	Zone reclaim swaps pages
> +8	Zone reclaim migrates pages
>  =	===================================
>  
>  zone_reclaim_mode is disabled by default.  For file servers or workloads
> @@ -965,3 +966,11 @@ of other processes running on other node
>  Allowing regular swap effectively restricts allocations to the local
>  node unless explicitly overridden by memory policies or cpuset
>  configurations.
> +
> +Page migration during reclaim is intended for systems with tiered memory
> +configurations.  These systems have multiple types of memory with varied
> +performance characteristics instead of plain NUMA systems where the same
> +kind of memory is found at varied distances.  Allowing page migration
> +during reclaim enables these systems to migrate pages from fast tiers to
> +slow tiers when the fast tier is under pressure.  This migration is
> +performed before swap.
> diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
> --- a/mm/migrate.c~enable-numa-demotion	2020-06-29 16:35:01.015312549 -0700
> +++ b/mm/migrate.c	2020-06-29 16:35:01.022312549 -0700
> @@ -49,6 +49,7 @@
>  #include <linux/sched/mm.h>
>  #include <linux/ptrace.h>
>  #include <linux/oom.h>
> +#include <linux/memory.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -3165,6 +3166,10 @@ void set_migration_target_nodes(void)
>  	 * Avoid any oddities like cycles that could occur
>  	 * from changes in the topology.  This will leave
>  	 * a momentary gap when migration is disabled.
> +	 *
> +	 * This is superfluous for memory offlining since
> +	 * MEM_GOING_OFFLINE does it independently, but it
> +	 * does not hurt to do it a second time.
>  	 */
>  	disable_all_migrate_targets();
>  
> @@ -3211,6 +3216,60 @@ again:
>  	/* Is another pass necessary? */
>  	if (!nodes_empty(next_pass))
>  		goto again;
> +}
>  
> -	put_online_mems();
> +/*
> + * React to hotplug events that might online or offline
> + * NUMA nodes.
> + *
> + * This leaves migrate-on-reclaim transiently disabled
> + * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
> + * This runs whether RECLAIM_MIGRATE is enabled or not.
> + * That ensures that the user can turn RECLAIM_MIGRATE
> + * on without needing to recalculate migration targets.
> + */
> +#if defined(CONFIG_MEMORY_HOTPLUG)
> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> +						 unsigned long action, void *arg)
> +{
> +	switch (action) {
> +	case MEM_GOING_OFFLINE:
> +		/*
> +		 * Make sure there are not transient states where
> +		 * an offline node is a migration target.  This
> +		 * will leave migration disabled until the offline
> +		 * completes and the MEM_OFFLINE case below runs.
> +		 */
> +		disable_all_migrate_targets();
> +		break;
> +	case MEM_OFFLINE:
> +	case MEM_ONLINE:
> +		/*
> +		 * Recalculate the target nodes once the node
> +		 * reaches its final state (online or offline).
> +		 */
> +		set_migration_target_nodes();
> +		break;
> +	case MEM_CANCEL_OFFLINE:
> +		/*
> +		 * MEM_GOING_OFFLINE disabled all the migration
> +		 * targets.  Reenable them.
> +		 */
> +		set_migration_target_nodes();
> +		break;
> +	case MEM_GOING_ONLINE:
> +	case MEM_CANCEL_ONLINE:
> +		break;
> +	}
> +
> +	return notifier_from_errno(0);
>  }
> +
> +static int __init migrate_on_reclaim_init(void)
> +{
> +	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
> +	return 0;
> +}
> +late_initcall(migrate_on_reclaim_init);
> +#endif /* CONFIG_MEMORY_HOTPLUG */
> +
> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
> --- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
> +++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>   * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>   * ABI.  New bits are OK, but existing bits can never change.
>   */
> -#define RECLAIM_RSVD  (1<<0)	/* (currently ignored/unused) */
> -#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
> -#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
> +#define RECLAIM_RSVD	(1<<0)	/* (currently ignored/unused) */
> +#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
> +#define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
> +#define RECLAIM_MIGRATE	(1<<3)	/* Migrate pages during reclaim */
>  
>  /*
>   * Priority for NODE_RECLAIM. This determines the fraction of pages

I found that RECLAIM_MIGRATE is defined but never referenced in the
patch.

If my understanding of the code is correct, shrink_do_demote_mapping()
is called by shrink_page_list(), which is used by kswapd and direct
reclaim.  So as long as the persistent memory node is onlined,
reclaim-based migration will be enabled regardless of node reclaim mode.
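
In other words, as posted there appears to be no gate along these lines
anywhere in the series (a sketch only; reclaim_migrate_enabled() is a
made-up name and the surrounding demotion-queueing code is illustrative):

/* Sketch: consult the new mode bit before demoting on reclaim. */
static inline bool reclaim_migrate_enabled(void)
{
	return !!(node_reclaim_mode & RECLAIM_MIGRATE);
}

/* e.g. in shrink_page_list(), before attempting demotion: */
if (reclaim_migrate_enabled() && is_demote_ok(page_to_nid(page))) {
	list_add(&page->lru, &demote_pages);
	unlock_page(page);
	continue;
}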

Best Regards,
Huang, Ying
Yang Shi June 30, 2020, 5:50 p.m. UTC | #2
On 6/30/20 12:23 AM, Huang, Ying wrote:
> Hi, Dave,
>
> Dave Hansen <dave.hansen@linux.intel.com> writes:
>
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> Some method is obviously needed to enable reclaim-based migration.
>>
>> Just like traditional autonuma, there will be some workloads that
>> benefit, such as workloads with more "static" configurations where
>> hot pages stay hot and cold pages stay cold.  If pages come and go
>> from the hot and cold sets, the benefits of this approach will be
>> more limited.
>>
>> The benefits are truly workload-based and *not* hardware-based.
>> We do not believe that there is a viable threshold where certain
>> hardware configurations should have this mechanism enabled while
>> others do not.
>>
>> To be conservative, earlier work defaulted to disabling reclaim-
>> based migration and did not include a mechanism to enable it.
>> This proposes extending the existing "zone_reclaim_mode" (now
>> really node_reclaim_mode) as a method to enable it.
>>
>> We are open to any alternative that allows end users to enable
>> this mechanism or disable it if workload harm is detected (just
>> like traditional autonuma).
>>
>> The implementation here is pretty simple and entirely unoptimized.
>> On any memory hotplug events, assume that a node was added or
>> removed and recalculate all migration targets.  This ensures that
>> the node_demotion[] array is always ready to be used in case the
>> new reclaim mode is enabled.  This recalculation is far from
>> optimal, most glaringly in that it does not even attempt to figure
>> out whether nodes are actually coming or going.
>>
>> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Yang Shi <yang.shi@linux.alibaba.com>
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> ---
>>
>>   b/Documentation/admin-guide/sysctl/vm.rst |    9 ++++
>>   b/mm/migrate.c                            |   61 +++++++++++++++++++++++++++++-
>>   b/mm/vmscan.c                             |    7 +--
>>   3 files changed, 73 insertions(+), 4 deletions(-)
>>
>> diff -puN Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion Documentation/admin-guide/sysctl/vm.rst
>> --- a/Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion	2020-06-29 16:35:01.012312549 -0700
>> +++ b/Documentation/admin-guide/sysctl/vm.rst	2020-06-29 16:35:01.021312549 -0700
>> @@ -941,6 +941,7 @@ This is value OR'ed together of
>>   1	(bit currently ignored)
>>   2	Zone reclaim writes dirty pages out
>>   4	Zone reclaim swaps pages
>> +8	Zone reclaim migrates pages
>>   =	===================================
>>   
>>   zone_reclaim_mode is disabled by default.  For file servers or workloads
>> @@ -965,3 +966,11 @@ of other processes running on other node
>>   Allowing regular swap effectively restricts allocations to the local
>>   node unless explicitly overridden by memory policies or cpuset
>>   configurations.
>> +
>> +Page migration during reclaim is intended for systems with tiered memory
>> +configurations.  These systems have multiple types of memory with varied
>> +performance characteristics instead of plain NUMA systems where the same
>> +kind of memory is found at varied distances.  Allowing page migration
>> +during reclaim enables these systems to migrate pages from fast tiers to
>> +slow tiers when the fast tier is under pressure.  This migration is
>> +performed before swap.
>> diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
>> --- a/mm/migrate.c~enable-numa-demotion	2020-06-29 16:35:01.015312549 -0700
>> +++ b/mm/migrate.c	2020-06-29 16:35:01.022312549 -0700
>> @@ -49,6 +49,7 @@
>>   #include <linux/sched/mm.h>
>>   #include <linux/ptrace.h>
>>   #include <linux/oom.h>
>> +#include <linux/memory.h>
>>   
>>   #include <asm/tlbflush.h>
>>   
>> @@ -3165,6 +3166,10 @@ void set_migration_target_nodes(void)
>>   	 * Avoid any oddities like cycles that could occur
>>   	 * from changes in the topology.  This will leave
>>   	 * a momentary gap when migration is disabled.
>> +	 *
>> +	 * This is superfluous for memory offlining since
>> +	 * MEM_GOING_OFFLINE does it independently, but it
>> +	 * does not hurt to do it a second time.
>>   	 */
>>   	disable_all_migrate_targets();
>>   
>> @@ -3211,6 +3216,60 @@ again:
>>   	/* Is another pass necessary? */
>>   	if (!nodes_empty(next_pass))
>>   		goto again;
>> +}
>>   
>> -	put_online_mems();
>> +/*
>> + * React to hotplug events that might online or offline
>> + * NUMA nodes.
>> + *
>> + * This leaves migrate-on-reclaim transiently disabled
>> + * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
>> + * This runs whether RECLAIM_MIGRATE is enabled or not.
>> + * That ensures that the user can turn RECLAIM_MIGRATE
>> + * on without needing to recalculate migration targets.
>> + */
>> +#if defined(CONFIG_MEMORY_HOTPLUG)
>> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
>> +						 unsigned long action, void *arg)
>> +{
>> +	switch (action) {
>> +	case MEM_GOING_OFFLINE:
>> +		/*
>> +		 * Make sure there are not transient states where
>> +		 * an offline node is a migration target.  This
>> +		 * will leave migration disabled until the offline
>> +		 * completes and the MEM_OFFLINE case below runs.
>> +		 */
>> +		disable_all_migrate_targets();
>> +		break;
>> +	case MEM_OFFLINE:
>> +	case MEM_ONLINE:
>> +		/*
>> +		 * Recalculate the target nodes once the node
>> +		 * reaches its final state (online or offline).
>> +		 */
>> +		set_migration_target_nodes();
>> +		break;
>> +	case MEM_CANCEL_OFFLINE:
>> +		/*
>> +		 * MEM_GOING_OFFLINE disabled all the migration
>> +		 * targets.  Reenable them.
>> +		 */
>> +		set_migration_target_nodes();
>> +		break;
>> +	case MEM_GOING_ONLINE:
>> +	case MEM_CANCEL_ONLINE:
>> +		break;
>> +	}
>> +
>> +	return notifier_from_errno(0);
>>   }
>> +
>> +static int __init migrate_on_reclaim_init(void)
>> +{
>> +	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
>> +	return 0;
>> +}
>> +late_initcall(migrate_on_reclaim_init);
>> +#endif /* CONFIG_MEMORY_HOTPLUG */
>> +
>> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
>> --- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
>> +++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
>> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>>    * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>>    * ABI.  New bits are OK, but existing bits can never change.
>>    */
>> -#define RECLAIM_RSVD  (1<<0)	/* (currently ignored/unused) */
>> -#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
>> -#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
>> +#define RECLAIM_RSVD	(1<<0)	/* (currently ignored/unused) */
>> +#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
>> +#define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
>> +#define RECLAIM_MIGRATE	(1<<3)	/* Migrate pages during reclaim */
>>   
>>   /*
>>    * Priority for NODE_RECLAIM. This determines the fraction of pages
> I found that RECLAIM_MIGRATE is defined but never referenced in the
> patch.
>
> If my understanding of the code is correct, shrink_do_demote_mapping()
> is called by shrink_page_list(), which is used by kswapd and direct
> reclaim.  So as long as the persistent memory node is onlined,
> reclaim-based migration will be enabled regardless of node reclaim mode.

It looks so according to the code. But the intention of the new node
reclaim mode is to do migration on reclaim *only when* RECLAIM_MIGRATE
is enabled by the user.

It looks like the patch just clears the migration target node masks if
the memory is offlined.

So, I suppose you need to check whether node_reclaim is enabled before
doing migration in shrink_page_list(), and also need to make node
reclaim adopt the new mode.

Please refer to 
https://lore.kernel.org/linux-mm/1560468577-101178-6-git-send-email-yang.shi@linux.alibaba.com/

I copied the related chunks here:

+	if (is_demote_ok(page_to_nid(page))) {   <--- check if node reclaim is enabled
+		list_add(&page->lru, &demote_pages);
+		unlock_page(page);
+		continue;
+	}

and

@@ -4084,8 +4179,10 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		.gfp_mask = current_gfp_context(gfp_mask),
 		.order = order,
 		.priority = NODE_RECLAIM_PRIORITY,
-		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
+		.may_writepage = !!((node_reclaim_mode & RECLAIM_WRITE) ||
+				    (node_reclaim_mode & RECLAIM_MIGRATE)),
+		.may_unmap = !!((node_reclaim_mode & RECLAIM_UNMAP) ||
+				(node_reclaim_mode & RECLAIM_MIGRATE)),
 		.may_swap = 1,
 		.reclaim_idx = gfp_zone(gfp_mask),
 	};
@@ -4105,7 +4202,8 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
+	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages ||
+	    (node_reclaim_mode & RECLAIM_MIGRATE)) {
 		/*
 		 * Free memory by calling shrink node with increasing
 		 * priorities until we have enough memory freed.
@@ -4138,9 +4236,12 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	 * thrown out if the node is overallocated. So we do not reclaim
 	 * if less than a specified percentage of the node is used by
 	 * unmapped file backed pages.
+	 *
+	 * Migrate mode doesn't care the above restrictions.
 	 */
 	if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages &&
-	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
+	    node_page_state(pgdat, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages &&
+	    !(node_reclaim_mode & RECLAIM_MIGRATE))
 		return NODE_RECLAIM_FULL;

>
> Best Regards,
> Huang, Ying
Huang, Ying July 1, 2020, 12:48 a.m. UTC | #3
Hi, Yang,

Yang Shi <yang.shi@linux.alibaba.com> writes:

>>> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
>>> --- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
>>> +++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
>>> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>>>    * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>>>    * ABI.  New bits are OK, but existing bits can never change.
>>>    */
>>> -#define RECLAIM_RSVD  (1<<0)	/* (currently ignored/unused) */
>>> -#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
>>> -#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
>>> +#define RECLAIM_RSVD	(1<<0)	/* (currently ignored/unused) */
>>> +#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
>>> +#define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
>>> +#define RECLAIM_MIGRATE	(1<<3)	/* Migrate pages during reclaim */
>>>     /*
>>>    * Priority for NODE_RECLAIM. This determines the fraction of pages
>> I found that RECLAIM_MIGRATE is defined but never referenced in the
>> patch.
>>
>> If my understanding of the code is correct, shrink_do_demote_mapping()
>> is called by shrink_page_list(), which is used by kswapd and direct
>> reclaim.  So as long as the persistent memory node is onlined,
>> reclaim-based migration will be enabled regardless of node reclaim mode.
>
> It looks so according to the code. But the intention of the new node
> reclaim mode is to do migration on reclaim *only when* RECLAIM_MIGRATE
> is enabled by the user.
>
> It looks like the patch just clears the migration target node masks if
> the memory is offlined.
>
> So, I suppose you need to check whether node_reclaim is enabled before
> doing migration in shrink_page_list(), and also need to make node
> reclaim adopt the new mode.

But why shouldn't we migrate in kswapd and direct reclaim?  I think that
we may need a way to control it, but shouldn't disable it
unconditionally.

> Please refer to
> https://lore.kernel.org/linux-mm/1560468577-101178-6-git-send-email-yang.shi@linux.alibaba.com/
>

Best Regards,
Huang, Ying
Yang Shi July 1, 2020, 1:12 a.m. UTC | #4
On 6/30/20 5:48 PM, Huang, Ying wrote:
> Hi, Yang,
>
> Yang Shi <yang.shi@linux.alibaba.com> writes:
>
>>>> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
>>>> --- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
>>>> +++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
>>>> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>>>>     * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>>>>     * ABI.  New bits are OK, but existing bits can never change.
>>>>     */
>>>> -#define RECLAIM_RSVD  (1<<0)	/* (currently ignored/unused) */
>>>> -#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
>>>> -#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
>>>> +#define RECLAIM_RSVD	(1<<0)	/* (currently ignored/unused) */
>>>> +#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
>>>> +#define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
>>>> +#define RECLAIM_MIGRATE	(1<<3)	/* Migrate pages during reclaim */
>>>>      /*
>>>>     * Priority for NODE_RECLAIM. This determines the fraction of pages
>>> I found that RECLAIM_MIGRATE is defined but never referenced in the
>>> patch.
>>>
>>> If my understanding of the code is correct, shrink_do_demote_mapping()
>>> is called by shrink_page_list(), which is used by kswapd and direct
>>> reclaim.  So as long as the persistent memory node is onlined,
>>> reclaim-based migration will be enabled regardless of node reclaim mode.
>> It looks so according to the code. But the intention of the new node
>> reclaim mode is to do migration on reclaim *only when* RECLAIM_MIGRATE
>> is enabled by the user.
>>
>> It looks like the patch just clears the migration target node masks if
>> the memory is offlined.
>>
>> So, I suppose you need to check whether node_reclaim is enabled before
>> doing migration in shrink_page_list(), and also need to make node
>> reclaim adopt the new mode.
> But why shouldn't we migrate in kswapd and direct reclaim?  I think that
> we may need a way to control it, but shouldn't disable it
> unconditionally.

Let me share some background. In the past discussions on LKML and at
last year's LSFMM, the opt-in approach was preferred since the new
feature might not be stable and mature.  So the new node reclaim mode
was suggested by both Mel and Michal. I suppose this is still a valid
point now.

Once it is mature and stable enough, we definitely could make it the
universally preferred and default behavior.

>
>> Please refer to
>> https://lore.kernel.org/linux-mm/1560468577-101178-6-git-send-email-yang.shi@linux.alibaba.com/
>>
> Best Regards,
> Huang, Ying
Huang, Ying July 1, 2020, 1:28 a.m. UTC | #5
Yang Shi <yang.shi@linux.alibaba.com> writes:

> On 6/30/20 5:48 PM, Huang, Ying wrote:
>> Hi, Yang,
>>
>> Yang Shi <yang.shi@linux.alibaba.com> writes:
>>
>>>>> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
>>>>> --- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
>>>>> +++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
>>>>> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>>>>>     * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>>>>>     * ABI.  New bits are OK, but existing bits can never change.
>>>>>     */
>>>>> -#define RECLAIM_RSVD  (1<<0)	/* (currently ignored/unused) */
>>>>> -#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
>>>>> -#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
>>>>> +#define RECLAIM_RSVD	(1<<0)	/* (currently ignored/unused) */
>>>>> +#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
>>>>> +#define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
>>>>> +#define RECLAIM_MIGRATE	(1<<3)	/* Migrate pages during reclaim */
>>>>>      /*
>>>>>     * Priority for NODE_RECLAIM. This determines the fraction of pages
>>>> I found that RECLAIM_MIGRATE is defined but never referenced in the
>>>> patch.
>>>>
>>>> If my understanding of the code is correct, shrink_do_demote_mapping()
>>>> is called by shrink_page_list(), which is used by kswapd and direct
>>>> reclaim.  So as long as the persistent memory node is onlined,
>>>> reclaim-based migration will be enabled regardless of node reclaim mode.
>>> It looks so according to the code. But the intention of the new node
>>> reclaim mode is to do migration on reclaim *only when* RECLAIM_MIGRATE
>>> is enabled by the user.
>>>
>>> It looks like the patch just clears the migration target node masks if
>>> the memory is offlined.
>>>
>>> So, I suppose you need to check whether node_reclaim is enabled before
>>> doing migration in shrink_page_list(), and also need to make node
>>> reclaim adopt the new mode.
>> But why shouldn't we migrate in kswapd and direct reclaim?  I think that
>> we may need a way to control it, but shouldn't disable it
>> unconditionally.
>
> Let me share some background. In the past discussions on LKML and at
> last year's LSFMM, the opt-in approach was preferred since the new
> feature might not be stable and mature.  So the new node reclaim mode
> was suggested by both Mel and Michal. I suppose this is still a valid
> point now.

Is there any technical reason?  I think the code isn't very complex.  If
we really worry about stability and maturity, isn't it enough to provide
some way to enable/disable the feature?  Even for kswapd and direct
reclaim?
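
For example, a single global knob consulted by both paths might be
enough.  A sketch (the name and its userspace plumbing are made up; it
assumes next_demotion_node() from earlier in the series returns
NUMA_NO_NODE when a node has no demotion target):

/* Sketch: one gate for demotion, checked by kswapd and direct reclaim. */
static bool demotion_enabled __read_mostly;

static inline bool can_demote_pages(int nid)
{
	return demotion_enabled &&
	       next_demotion_node(nid) != NUMA_NO_NODE;
}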

Best Regards,
Huang, Ying

> Once it is mature and stable enough, we definitely could make it the
> universally preferred and default behavior.
>
>>
>>> Please refer to
>>> https://lore.kernel.org/linux-mm/1560468577-101178-6-git-send-email-yang.shi@linux.alibaba.com/
>>>
>> Best Regards,
>> Huang, Ying
Dave Hansen July 1, 2020, 4:02 p.m. UTC | #6
On 6/30/20 10:50 AM, Yang Shi wrote:
> So, I suppose you need to check whether node_reclaim is enabled before
> doing migration in shrink_page_list(), and also need to make node
> reclaim adopt the new mode.
> 
> Please refer to
> https://lore.kernel.org/linux-mm/1560468577-101178-6-git-send-email-yang.shi@linux.alibaba.com/
> 
> I copied the related chunks here:

Thanks for those!  I'll incorporate them for the next version.
Huang, Ying July 3, 2020, 9:30 a.m. UTC | #7
Dave Hansen <dave.hansen@linux.intel.com> writes:
> +/*
> + * React to hotplug events that might online or offline
> + * NUMA nodes.
> + *
> + * This leaves migrate-on-reclaim transiently disabled
> + * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
> + * This runs whether RECLAIM_MIGRATE is enabled or not.
> + * That ensures that the user can turn RECLAIM_MIGRATE
> + * on without needing to recalculate migration targets.
> + */
> +#if defined(CONFIG_MEMORY_HOTPLUG)
> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> +						 unsigned long action, void *arg)
> +{
> +	switch (action) {
> +	case MEM_GOING_OFFLINE:
> +		/*
> +		 * Make sure there are not transient states where
> +		 * an offline node is a migration target.  This
> +		 * will leave migration disabled until the offline
> +		 * completes and the MEM_OFFLINE case below runs.
> +		 */
> +		disable_all_migrate_targets();
> +		break;
> +	case MEM_OFFLINE:
> +	case MEM_ONLINE:
> +		/*
> +		 * Recalculate the target nodes once the node
> +		 * reaches its final state (online or offline).
> +		 */
> +		set_migration_target_nodes();
> +		break;
> +	case MEM_CANCEL_OFFLINE:
> +		/*
> +		 * MEM_GOING_OFFLINE disabled all the migration
> +		 * targets.  Reenable them.
> +		 */
> +		set_migration_target_nodes();
> +		break;
> +	case MEM_GOING_ONLINE:
> +	case MEM_CANCEL_ONLINE:
> +		break;

I think we need to call
disable_all_migrate_targets()/set_migration_target_nodes() for CPU
online/offline events too, because those will influence node_state(nid,
N_CPU), which in turn influences the node demotion relationship.
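
Something along these lines, perhaps (a sketch against the cpuhp state
machine; the callback and state names are made up):

static int demotion_cpu_change(unsigned int cpu)
{
	/*
	 * A node gaining its first CPU or losing its last one can
	 * change node_state(nid, N_CPU), so redo target selection.
	 */
	set_migration_target_nodes();
	return 0;
}

/* e.g. from migrate_on_reclaim_init(): */
if (cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mm/demotion:online",
		      demotion_cpu_change, demotion_cpu_change) < 0)
	pr_warn("demotion: failed to register CPU hotplug callbacks\n");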

> +	}
> +
> +	return notifier_from_errno(0);
>  }
> +

Best Regards,
Huang, Ying

Patch

diff -puN Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion Documentation/admin-guide/sysctl/vm.rst
--- a/Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion	2020-06-29 16:35:01.012312549 -0700
+++ b/Documentation/admin-guide/sysctl/vm.rst	2020-06-29 16:35:01.021312549 -0700
@@ -941,6 +941,7 @@  This is value OR'ed together of
 1	(bit currently ignored)
 2	Zone reclaim writes dirty pages out
 4	Zone reclaim swaps pages
+8	Zone reclaim migrates pages
 =	===================================
 
 zone_reclaim_mode is disabled by default.  For file servers or workloads
@@ -965,3 +966,11 @@  of other processes running on other node
 Allowing regular swap effectively restricts allocations to the local
 node unless explicitly overridden by memory policies or cpuset
 configurations.
+
+Page migration during reclaim is intended for systems with tiered memory
+configurations.  These systems have multiple types of memory with varied
+performance characteristics instead of plain NUMA systems where the same
+kind of memory is found at varied distances.  Allowing page migration
+during reclaim enables these systems to migrate pages from fast tiers to
+slow tiers when the fast tier is under pressure.  This migration is
+performed before swap.
diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
--- a/mm/migrate.c~enable-numa-demotion	2020-06-29 16:35:01.015312549 -0700
+++ b/mm/migrate.c	2020-06-29 16:35:01.022312549 -0700
@@ -49,6 +49,7 @@ 
 #include <linux/sched/mm.h>
 #include <linux/ptrace.h>
 #include <linux/oom.h>
+#include <linux/memory.h>
 
 #include <asm/tlbflush.h>
 
@@ -3165,6 +3166,10 @@  void set_migration_target_nodes(void)
 	 * Avoid any oddities like cycles that could occur
 	 * from changes in the topology.  This will leave
 	 * a momentary gap when migration is disabled.
+	 *
+	 * This is superfluous for memory offlining since
+	 * MEM_GOING_OFFLINE does it independently, but it
+	 * does not hurt to do it a second time.
 	 */
 	disable_all_migrate_targets();
 
@@ -3211,6 +3216,60 @@  again:
 	/* Is another pass necessary? */
 	if (!nodes_empty(next_pass))
 		goto again;
+}
 
-	put_online_mems();
+/*
+ * React to hotplug events that might online or offline
+ * NUMA nodes.
+ *
+ * This leaves migrate-on-reclaim transiently disabled
+ * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
+ * This runs whether RECLAIM_MIGRATE is enabled or not.
+ * That ensures that the user can turn RECLAIM_MIGRATE
+ * on without needing to recalculate migration targets.
+ */
+#if defined(CONFIG_MEMORY_HOTPLUG)
+static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
+						 unsigned long action, void *arg)
+{
+	switch (action) {
+	case MEM_GOING_OFFLINE:
+		/*
+		 * Make sure there are not transient states where
+		 * an offline node is a migration target.  This
+		 * will leave migration disabled until the offline
+		 * completes and the MEM_OFFLINE case below runs.
+		 */
+		disable_all_migrate_targets();
+		break;
+	case MEM_OFFLINE:
+	case MEM_ONLINE:
+		/*
+		 * Recalculate the target nodes once the node
+		 * reaches its final state (online or offline).
+		 */
+		set_migration_target_nodes();
+		break;
+	case MEM_CANCEL_OFFLINE:
+		/*
+		 * MEM_GOING_OFFLINE disabled all the migration
+		 * targets.  Reenable them.
+		 */
+		set_migration_target_nodes();
+		break;
+	case MEM_GOING_ONLINE:
+	case MEM_CANCEL_ONLINE:
+		break;
+	}
+
+	return notifier_from_errno(0);
 }
+
+static int __init migrate_on_reclaim_init(void)
+{
+	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
+	return 0;
+}
+late_initcall(migrate_on_reclaim_init);
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
--- a/mm/vmscan.c~enable-numa-demotion	2020-06-29 16:35:01.017312549 -0700
+++ b/mm/vmscan.c	2020-06-29 16:35:01.023312549 -0700
@@ -4165,9 +4165,10 @@  int node_reclaim_mode __read_mostly;
  * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
  * ABI.  New bits are OK, but existing bits can never change.
  */
-#define RECLAIM_RSVD  (1<<0)	/* (currently ignored/unused) */
-#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
-#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
+#define RECLAIM_RSVD	(1<<0)	/* (currently ignored/unused) */
+#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
+#define RECLAIM_UNMAP	(1<<2)	/* Unmap pages during reclaim */
+#define RECLAIM_MIGRATE	(1<<3)	/* Migrate pages during reclaim */
 
 /*
  * Priority for NODE_RECLAIM. This determines the fraction of pages