
[RFC,-V5] autonuma: Migrate on fault among multiple bound nodes

Message ID 20201118051952.39097-1-ying.huang@intel.com (mailing list archive)
State New, archived

Commit Message

Huang, Ying Nov. 18, 2020, 5:19 a.m. UTC
Now, AutoNUMA can only optimize the page placement among the NUMA
nodes if the default memory policy is used, because an explicitly
specified memory policy takes precedence.  But this seems too strict
in some situations.  For example, on a system with 4 NUMA nodes, if
the memory of an application is bound to nodes 0 and 1, AutoNUMA can
potentially migrate the pages between nodes 0 and 1 to reduce
cross-node accesses without breaking the explicit memory binding
policy.

So in this patch, we add the MPOL_F_AUTONUMA mode flag to
set_mempolicy().  With the flag specified, AutoNUMA will be enabled
for the thread to optimize the page placement within the constraints
of the specified memory binding policy.  With the newly added flag,
the NUMA balancing control mechanism becomes:

- the sysctl knob numa_balancing can enable/disable NUMA balancing
  globally.

- even if sysctl numa_balancing is enabled, NUMA balancing is disabled
  by default for memory areas or applications with an explicit memory
  policy.

- MPOL_F_AUTONUMA can be used to enable NUMA balancing for an
  application even when an explicit memory policy is specified.

Various page placement optimizations based on NUMA balancing can be
done with this flag.  As the first step, in this patch, if the memory
of the application is bound to multiple nodes (MPOL_BIND) and, in the
hint page fault handler, the accessing node is in the policy nodemask,
the kernel will try to migrate the page to the accessing node to
reduce cross-node accesses.
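
For illustration only (not part of the kernel patch itself), a thread
that wants its memory bound to nodes 0 and 1 while still allowing NUMA
balancing between those two nodes could do roughly the following.  The
flag value is the one proposed here and may need to be defined by hand
until the uapi headers catch up:

  #include <numaif.h>   /* set_mempolicy(), MPOL_BIND; link with -lnuma */

  #ifndef MPOL_F_AUTONUMA
  #define MPOL_F_AUTONUMA (1 << 13)     /* value proposed in this patch */
  #endif

  /* Bind this thread's memory to nodes 0 and 1, but let NUMA balancing
   * migrate its pages between those two nodes on hint page faults.
   */
  static long bind_with_numa_balancing(void)
  {
          unsigned long nodemask = (1UL << 0) | (1UL << 1);

          return set_mempolicy(MPOL_BIND | MPOL_F_AUTONUMA,
                               &nodemask, 8 * sizeof(nodemask));
  }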

In the previous version of the patch, we tried to reuse MPOL_MF_LAZY
for mbind().  But that flag is tied to MPOL_MF_MOVE.*, so it doesn't
seem to be a good API/ABI for the purpose of this patch.

And because it's not clear whether it's necessary to enable AutoNUMA
for a specific memory area inside an application, we only add the flag
at the thread level (set_mempolicy()) instead of the memory area level
(mbind()).  We can do that later when it becomes necessary.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>

Changes:

v5:

- Remove mbind() support, because it's not clear that it's necessary.

v4:

- Use new flags instead of reuse MPOL_MF_LAZY.

v3:

- Rebased on latest upstream (v5.10-rc3)

- Revised the change log.

v2:

- Rebased on latest upstream (v5.10-rc1)

Huang Ying (2):
  mempolicy: Rename MPOL_F_MORON to MPOL_F_MOPRON
  autonuma: Migrate on fault among multiple bound nodes
---
 include/uapi/linux/mempolicy.h | 4 +++-
 mm/mempolicy.c                 | 9 +++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

Comments

Mel Gorman Nov. 18, 2020, 9:56 a.m. UTC | #1
On Wed, Nov 18, 2020 at 01:19:52PM +0800, Huang Ying wrote:
> Now, AutoNUMA can only optimize the page placement among the NUMA

Note that the feature is referred to as NUMA_BALANCING in the kernel
configs because AUTONUMA as it was first presented was not merged. The
sysctl for it is kernel.numa_balancing as you already note. So use NUMAB
or NUMA_BALANCING but not AUTONUMA, because at least a new person grepping
for NUMA_BALANCING or variants will find it, whereas autonuma only crept
into the powerpc arch code.

If exposing to userspace, the naming should definitely be consistent.

> - sysctl knob numa_balancing can enable/disable the NUMA balancing
>   globally.
> 
> - even if sysctl numa_balancing is enabled, the NUMA balancing will be
>   disabled for the memory areas or applications with the explicit memory
>   policy by default.
> 
> - MPOL_F_AUTONUMA can be used to enable the NUMA balancing for the
>   applications when specifying the explicit memory policy.
> 

MPOL_F_NUMAB

> Various page placement optimization based on the NUMA balancing can be
> done with these flags.  As the first step, in this patch, if the
> memory of the application is bound to multiple nodes (MPOL_BIND), and
> in the hint page fault handler the accessing node are in the policy
> nodemask, the page will be tried to be migrated to the accessing node
> to reduce the cross-node accessing.
> 

The patch still lacks supporting data. It really should have a basic
benchmark of some sort serving as an example of how the policies should
be set, and a before/after comparison showing that the throughput of
MPOL_BIND accesses spanning 2 or more nodes is faster when NUMA balancing
is enabled.

A man page update should also be added clearly outlining when an
application should consider using it with the linux-api people cc'd
for review.

The main limitation is that if this requires application modification,
it may never be used. For example, if an application uses OpenMP places
that translate into bind, then OpenMP needs knowledge of the flag.
Similar limitations apply to MPI. This feature has a risk that no one
uses it.

> Huang Ying (2):
>   mempolicy: Rename MPOL_F_MORON to MPOL_F_MOPRON
>   autonuma: Migrate on fault among multiple bound nodes
> ---
>  include/uapi/linux/mempolicy.h | 4 +++-
>  mm/mempolicy.c                 | 9 +++++++++
>  2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index 3354774af61e..adb49f13840e 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -28,12 +28,14 @@ enum {
>  /* Flags for set_mempolicy */
>  #define MPOL_F_STATIC_NODES	(1 << 15)
>  #define MPOL_F_RELATIVE_NODES	(1 << 14)
> +#define MPOL_F_AUTONUMA		(1 << 13) /* Optimize with AutoNUMA if possible */
>  

Order by flag usage, correct the naming.

>  /*
>   * MPOL_MODE_FLAGS is the union of all possible optional mode flags passed to
>   * either set_mempolicy() or mbind().
>   */
> -#define MPOL_MODE_FLAGS	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES)
> +#define MPOL_MODE_FLAGS							\
> +	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES | MPOL_F_AUTONUMA)
>  
>  /* Flags for get_mempolicy */
>  #define MPOL_F_NODE	(1<<0)	/* return next IL mode instead of node mask */

How does an application discover if MPOL_F_NUMAB is supported by the
current running kernel? It looks like they might receive -EINVAL (didn't
check for sure). In that case, a manual page is definitely needed to
explain that an error can be returned if the flag is used and the kernel
does not support it, so the application can recover by falling back to a
strict binding. If it fails silently, then that also needs to be documented
because it'll lead to different behaviour depending on the running
kernel.
Huang, Ying Nov. 19, 2020, 6:17 a.m. UTC | #2
Mel Gorman <mgorman@suse.de> writes:

> On Wed, Nov 18, 2020 at 01:19:52PM +0800, Huang Ying wrote:
>> Now, AutoNUMA can only optimize the page placement among the NUMA
>
> Note that the feature is referred to as NUMA_BALANCING in the kernel
> configs as AUTONUMA as it was first presented was not merged. The sysctl
> for it is kernel.numa_balancing as you already note. So use NUMAB or
> NUMA_BALANCING but not AUTONUMA because at least a new person grepping
> for NUMA_BALANCING or variants will find it where as autonuma only creeped
> into the powerpc arch code.

Sure.  Will change this.

>
> If exposing to userspace, the naming should definitely be consistent.
>
>> - sysctl knob numa_balancing can enable/disable the NUMA balancing
>>   globally.
>> 
>> - even if sysctl numa_balancing is enabled, the NUMA balancing will be
>>   disabled for the memory areas or applications with the explicit memory
>>   policy by default.
>> 
>> - MPOL_F_AUTONUMA can be used to enable the NUMA balancing for the
>>   applications when specifying the explicit memory policy.
>> 
>
> MPOL_F_NUMAB

Sure, will change it to MPOL_F_NUMA_BALANCING.

>> Various page placement optimization based on the NUMA balancing can be
>> done with these flags.  As the first step, in this patch, if the
>> memory of the application is bound to multiple nodes (MPOL_BIND), and
>> in the hint page fault handler the accessing node are in the policy
>> nodemask, the page will be tried to be migrated to the accessing node
>> to reduce the cross-node accessing.
>> 
>
> The patch still lacks supporting data. It really should have a basic
> benchmark of some sort serving as an example of how the policies should
> be set and a before/after comparison showing the throughput of MPOL_BIND
> accesses spanning 2 or more nodes is faster when numa balancing is enabled.

Sure.  Will add some basic benchmark data and usage example.

> A man page update should also be added clearly outlining when an
> application should consider using it with the linux-api people cc'd
> for review.

Yes.  Will Cc linux-api for review and will submit patches to
manpages.git after the API is finalized.

> The main limitation is that if this requires application modification,
> it may never be used. For example, if an application uses openmp places
> that translates into bind then openmp needs knowledge of the flag.
> Similar limitations apply to MPI. This feature has a risk that no one
> uses it.

My plan is to add a new option to `numactl`
(https://github.com/numactl/numactl/), so users who want to enable NUMA
balancing within the constraints of NUMA binding can use that.  I can
reach out to some OpenStack and Kubernetes developers to check whether
it's possible to add the support to that software.  For other
applications, yes, it may take a long time for the new flag to be used.

>> Huang Ying (2):
>>   mempolicy: Rename MPOL_F_MORON to MPOL_F_MOPRON
>>   autonuma: Migrate on fault among multiple bound nodes
>> ---
>>  include/uapi/linux/mempolicy.h | 4 +++-
>>  mm/mempolicy.c                 | 9 +++++++++
>>  2 files changed, 12 insertions(+), 1 deletion(-)
>> 
>> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
>> index 3354774af61e..adb49f13840e 100644
>> --- a/include/uapi/linux/mempolicy.h
>> +++ b/include/uapi/linux/mempolicy.h
>> @@ -28,12 +28,14 @@ enum {
>>  /* Flags for set_mempolicy */
>>  #define MPOL_F_STATIC_NODES	(1 << 15)
>>  #define MPOL_F_RELATIVE_NODES	(1 << 14)
>> +#define MPOL_F_AUTONUMA		(1 << 13) /* Optimize with AutoNUMA if possible */
>>  
>
> Order by flag usage, correct the naming.

I will correct the naming.  Sorry, what does "order" refer to?

>>  /*
>>   * MPOL_MODE_FLAGS is the union of all possible optional mode flags passed to
>>   * either set_mempolicy() or mbind().
>>   */
>> -#define MPOL_MODE_FLAGS	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES)
>> +#define MPOL_MODE_FLAGS							\
>> +	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES | MPOL_F_AUTONUMA)
>>  
>>  /* Flags for get_mempolicy */
>>  #define MPOL_F_NODE	(1<<0)	/* return next IL mode instead of node mask */
>
> How does an application discover if MPOL_F_NUMAB is supported by the
> current running kernel? It looks like they might receive -EINVAL (didn't
> check for sure).

Yes.

> In that case, a manual page is defintely needed to
> explain that an error can be returned if the flag is used and the kernel
> does not support it so the application can cover by falling back to a
> strict binding. If it fails silently, then that also needs to be documented
> because it'll lead to different behaviour depending on the running
> kernel.

Sure.  Will describe this in the manual page.
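
For example (just a sketch, and the final flag name may still change),
an application that wants the optimization but must keep working on
older kernels could fall back on EINVAL like this:

  #include <numaif.h>   /* set_mempolicy(), MPOL_BIND; link with -lnuma */
  #include <errno.h>
  #include <stdio.h>

  #ifndef MPOL_F_AUTONUMA
  #define MPOL_F_AUTONUMA (1 << 13)     /* proposed flag */
  #endif

  /* Bind to nodes 0 and 1 with NUMA balancing if the kernel supports
   * the new flag, otherwise fall back to a plain strict binding.
   */
  static long bind_nodes_0_1(void)
  {
          unsigned long nodemask = (1UL << 0) | (1UL << 1);
          unsigned long maxnode = 8 * sizeof(nodemask);

          if (!set_mempolicy(MPOL_BIND | MPOL_F_AUTONUMA, &nodemask, maxnode))
                  return 0;
          if (errno != EINVAL) {          /* unrelated failure */
                  perror("set_mempolicy");
                  return -1;
          }
          /* Older kernel: the flag is unknown, use strict binding only. */
          return set_mempolicy(MPOL_BIND, &nodemask, maxnode);
  }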

Best Regards,
Huang, Ying
Mel Gorman Nov. 19, 2020, 7:50 a.m. UTC | #3
On Thu, Nov 19, 2020 at 02:17:21PM +0800, Huang, Ying wrote:
> >> Various page placement optimization based on the NUMA balancing can be
> >> done with these flags.  As the first step, in this patch, if the
> >> memory of the application is bound to multiple nodes (MPOL_BIND), and
> >> in the hint page fault handler the accessing node are in the policy
> >> nodemask, the page will be tried to be migrated to the accessing node
> >> to reduce the cross-node accessing.
> >> 
> >
> > The patch still lacks supporting data. It really should have a basic
> > benchmark of some sort serving as an example of how the policies should
> > be set and a before/after comparison showing the throughput of MPOL_BIND
> > accesses spanning 2 or more nodes is faster when numa balancing is enabled.
> 
> Sure.  Will add some basic benchmark data and usage example.
> 

Thanks

> > A man page update should also be added clearly outlining when an
> > application should consider using it with the linux-api people cc'd
> > for review.
> 
> Yes.  Will Cc linux-api for review and will submit patches to
> manpages.git after the API is finalized.
> 

Add the manpages patch to this series. While it is not merged through
the kernel tree, it's important for review purposes.

> > The main limitation is that if this requires application modification,
> > it may never be used. For example, if an application uses openmp places
> > that translates into bind then openmp needs knowledge of the flag.
> > Similar limitations apply to MPI. This feature has a risk that no one
> > uses it.
> 
> My plan is to add a new option to `numactl`
> (https://github.com/numactl/numactl/), so users who want to enable NUMA
> balancing within the constrains of NUMA binding can use that.  I can
> reach some Openstack and Kubernate developers to check whether it's
> possible to add the support to these software.  For other applications,
> Yes, it may take long time for the new flag to be used.
> 

Patch for numactl should also be included to see what it looks like in
practice. Document what happens if the flag does not exist in the
running kernel.

I know this is awkward, but it's an interface exposed to userspace and,
as it is expected that applications will exist that then try to run on
older kernels, it needs to be very up-front about what happens on older
kernels. It would not be a complete surprise for openmp and openmpi
packages to be updated on distributions with older kernels (either by
source or via packaging), leading to surprises.

> >> Huang Ying (2):
> >>   mempolicy: Rename MPOL_F_MORON to MPOL_F_MOPRON
> >>   autonuma: Migrate on fault among multiple bound nodes
> >> ---
> >>  include/uapi/linux/mempolicy.h | 4 +++-
> >>  mm/mempolicy.c                 | 9 +++++++++
> >>  2 files changed, 12 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> >> index 3354774af61e..adb49f13840e 100644
> >> --- a/include/uapi/linux/mempolicy.h
> >> +++ b/include/uapi/linux/mempolicy.h
> >> @@ -28,12 +28,14 @@ enum {
> >>  /* Flags for set_mempolicy */
> >>  #define MPOL_F_STATIC_NODES	(1 << 15)
> >>  #define MPOL_F_RELATIVE_NODES	(1 << 14)
> >> +#define MPOL_F_AUTONUMA		(1 << 13) /* Optimize with AutoNUMA if possible */
> >>  
> >
> > Order by flag usage, correct the naming.
> 
> I will correct the naming.  Sorry, what does "order" refer to?
> 

Never mind, it was already in reverse order, it was a silly comment.
Just fix the name.

> >>  /*
> >>   * MPOL_MODE_FLAGS is the union of all possible optional mode flags passed to
> >>   * either set_mempolicy() or mbind().
> >>   */
> >> -#define MPOL_MODE_FLAGS	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES)
> >> +#define MPOL_MODE_FLAGS							\
> >> +	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES | MPOL_F_AUTONUMA)
> >>  
> >>  /* Flags for get_mempolicy */
> >>  #define MPOL_F_NODE	(1<<0)	/* return next IL mode instead of node mask */
> >
> > How does an application discover if MPOL_F_NUMAB is supported by the
> > current running kernel? It looks like they might receive -EINVAL (didn't
> > check for sure).
> 
> Yes.
> 

Needs to be documented so applications know they can recover. It also
needs to be determined how numactl should behave if the flag does not
exist. Likely it will simply fail, in which case the error should be clear.

> > In that case, a manual page is defintely needed to
> > explain that an error can be returned if the flag is used and the kernel
> > does not support it so the application can cover by falling back to a
> > strict binding. If it fails silently, then that also needs to be documented
> > because it'll lead to different behaviour depending on the running
> > kernel.
> 
> Sure.  Will describe this in the manual page.
> 

Thanks.
Huang, Ying Nov. 19, 2020, 8:02 a.m. UTC | #4
Mel Gorman <mgorman@suse.de> writes:

> On Thu, Nov 19, 2020 at 02:17:21PM +0800, Huang, Ying wrote:
>> >> Various page placement optimization based on the NUMA balancing can be
>> >> done with these flags.  As the first step, in this patch, if the
>> >> memory of the application is bound to multiple nodes (MPOL_BIND), and
>> >> in the hint page fault handler the accessing node are in the policy
>> >> nodemask, the page will be tried to be migrated to the accessing node
>> >> to reduce the cross-node accessing.
>> >> 
>> >
>> > The patch still lacks supporting data. It really should have a basic
>> > benchmark of some sort serving as an example of how the policies should
>> > be set and a before/after comparison showing the throughput of MPOL_BIND
>> > accesses spanning 2 or more nodes is faster when numa balancing is enabled.
>> 
>> Sure.  Will add some basic benchmark data and usage example.
>> 
>
> Thanks
>
>> > A man page update should also be added clearly outlining when an
>> > application should consider using it with the linux-api people cc'd
>> > for review.
>> 
>> Yes.  Will Cc linux-api for review and will submit patches to
>> manpages.git after the API is finalized.
>> 
>
> Add the manpages patch to this series. While it is not merged through
> the kernel, it's important for review purposes.
>
>> > The main limitation is that if this requires application modification,
>> > it may never be used. For example, if an application uses openmp places
>> > that translates into bind then openmp needs knowledge of the flag.
>> > Similar limitations apply to MPI. This feature has a risk that no one
>> > uses it.
>> 
>> My plan is to add a new option to `numactl`
>> (https://github.com/numactl/numactl/), so users who want to enable NUMA
>> balancing within the constrains of NUMA binding can use that.  I can
>> reach some Openstack and Kubernate developers to check whether it's
>> possible to add the support to these software.  For other applications,
>> Yes, it may take long time for the new flag to be used.
>> 
>
> Patch for numactl should also be included to see what it looks like in
> practice. Document what happens if the flag does not exist in the
> running kernel.
>
> I know this is awkward, but it's an interface exposed to userspace and
> as it is expected that applications will exist that then try run on
> older kernels, it needs to be very up-front about what happens on older
> kernels. It would not be a complete surprise for openmp and openmpi
> packages to be updated on distributions with older kernels (either by
> source or via packaging) leading to surprises.

Sure.  I understand that we should be careful about the user space
interface.  I will send out a new version together with the man pages
and numactl patches with all your comments addressed.

Best Regards,
Huang, Ying

Patch

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 3354774af61e..adb49f13840e 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -28,12 +28,14 @@  enum {
 /* Flags for set_mempolicy */
 #define MPOL_F_STATIC_NODES	(1 << 15)
 #define MPOL_F_RELATIVE_NODES	(1 << 14)
+#define MPOL_F_AUTONUMA		(1 << 13) /* Optimize with AutoNUMA if possible */
 
 /*
  * MPOL_MODE_FLAGS is the union of all possible optional mode flags passed to
  * either set_mempolicy() or mbind().
  */
-#define MPOL_MODE_FLAGS	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES)
+#define MPOL_MODE_FLAGS							\
+	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES | MPOL_F_AUTONUMA)
 
 /* Flags for get_mempolicy */
 #define MPOL_F_NODE	(1<<0)	/* return next IL mode instead of node mask */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3ca4898f3f24..dc77827e8c08 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -875,6 +875,9 @@  static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 		goto out;
 	}
 
+	if (new && new->mode == MPOL_BIND && (flags & MPOL_F_AUTONUMA))
+		new->flags |= (MPOL_F_MOF | MPOL_F_MORON);
+
 	ret = mpol_set_nodemask(new, nodes, scratch);
 	if (ret) {
 		mpol_put(new);
@@ -2490,6 +2493,12 @@  int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		break;
 
 	case MPOL_BIND:
+		/* Optimize placement among multiple nodes via NUMA balancing */
+		if (pol->flags & MPOL_F_MORON) {
+			if (node_isset(thisnid, pol->v.nodes))
+				break;
+			goto out;
+		}
 
 		/*
 		 * allows binding to multiple nodes.