
[v1] mm/numa_balancing: Fix the memory thrashing problem in the single-threaded process

Message ID 20240723053250.3263125-1-hezhongkun.hzk@bytedance.com (mailing list archive)
State New

Commit Message

Zhongkun He July 23, 2024, 5:32 a.m. UTC
I found a problem on my test machine: the memory of a process is
repeatedly migrated between two nodes and never stops.

1. Test steps and the machine.
------------
VM machine: 4 NUMA nodes, 10GB per node.

stress --vm 1 --vm-bytes 12g --vm-keep

The numa_stat info:
while :;do cat memory.numa_stat | grep -w anon;sleep 5;done
anon N0=98304 N1=0 N2=10250747904 N3=2634334208
anon N0=98304 N1=0 N2=10250747904 N3=2634334208
anon N0=98304 N1=0 N2=9937256448 N3=2947825664
anon N0=98304 N1=0 N2=8863514624 N3=4021567488
anon N0=98304 N1=0 N2=7789772800 N3=5095309312
anon N0=98304 N1=0 N2=6716030976 N3=6169051136
anon N0=98304 N1=0 N2=5642289152 N3=7242792960
anon N0=98304 N1=0 N2=5105442816 N3=7779639296
anon N0=98304 N1=0 N2=5105442816 N3=7779639296
anon N0=98304 N1=0 N2=4837007360 N3=8048074752
anon N0=98304 N1=0 N2=3763265536 N3=9121816576
anon N0=98304 N1=0 N2=2689523712 N3=10195558400
anon N0=98304 N1=0 N2=2515148800 N3=10369933312
anon N0=98304 N1=0 N2=2515148800 N3=10369933312
anon N0=98304 N1=0 N2=2515148800 N3=10369933312
anon N0=98304 N1=0 N2=3320455168 N3=9564626944
anon N0=98304 N1=0 N2=4394196992 N3=8490885120
anon N0=98304 N1=0 N2=5105442816 N3=7779639296
anon N0=98304 N1=0 N2=6174195712 N3=6710886400
anon N0=98304 N1=0 N2=7247937536 N3=5637144576
anon N0=98304 N1=0 N2=8321679360 N3=4563402752
anon N0=98304 N1=0 N2=9395421184 N3=3489660928
anon N0=98304 N1=0 N2=10247872512 N3=2637209600
anon N0=98304 N1=0 N2=10247872512 N3=2637209600

2. Root cause:
Since commit 3e32158767b0 ("mm/mprotect.c: don't touch single threaded
PTEs which are on the right node"), the PTEs of local pages are no longer
changed in change_pte_range() for a single-threaded process, so no
page-fault information is generated for them in do_numa_page(). If a
single-threaded process has memory on another node, it will
unconditionally migrate all of its local memory to that node,
even if the remote node holds only one page.

So, let's fix it. The memory of a single-threaded process should follow
the CPU, not the NUMA faults info, in order to avoid memory thrashing.

After a long run of testing, there was no memory thrashing
at any point.

while :;do cat memory.numa_stat | grep -w anon;sleep 5;done
anon N0=2548117504 N1=10336903168 N2=139264 N3=0
anon N0=2548117504 N1=10336903168 N2=139264 N3=0
anon N0=2548117504 N1=10336903168 N2=139264 N3=0
anon N0=2548117504 N1=10336903168 N2=139264 N3=0
anon N0=2548117504 N1=10336903168 N2=139264 N3=0
anon N0=2548117504 N1=10336903168 N2=139264 N3=0
anon N0=2548117504 N1=10336903168 N2=139264 N3=0
anon N0=2548117504 N1=10336903168 N2=139264 N3=0
anon N0=2548117504 N1=10336903168 N2=139264 N3=0
anon N0=2548117504 N1=10336903168 N2=139264 N3=0
anon N0=2548117504 N1=10336903168 N2=139264 N3=0
anon N0=2548117504 N1=10336903168 N2=139264 N3=0
anon N0=2548117504 N1=10336903168 N2=139264 N3=0
anon N0=2548117504 N1=10336903168 N2=139264 N3=0

V1:
-- Added the test results (NUMA stats) per Ying's feedback

Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
Acked-by: "Huang, Ying" <ying.huang@intel.com>
---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Anshuman Khandual July 23, 2024, 6:15 a.m. UTC | #1
On 7/23/24 11:02, Zhongkun He wrote:
> I found a problem in my test machine that the memory of a process is
> repeatedly migrated between two nodes and does not stop.
> 
> 1.Test step and the machines.
> ------------
> VM machine: 4 numa nodes and 10GB per node.
> 
> stress --vm 1 --vm-bytes 12g --vm-keep
> 
> The info of numa stat:
> while :;do cat memory.numa_stat | grep -w anon;sleep 5;done
> anon N0=98304 N1=0 N2=10250747904 N3=2634334208
> anon N0=98304 N1=0 N2=10250747904 N3=2634334208
> anon N0=98304 N1=0 N2=9937256448 N3=2947825664
> anon N0=98304 N1=0 N2=8863514624 N3=4021567488
> anon N0=98304 N1=0 N2=7789772800 N3=5095309312
> anon N0=98304 N1=0 N2=6716030976 N3=6169051136
> anon N0=98304 N1=0 N2=5642289152 N3=7242792960
> anon N0=98304 N1=0 N2=5105442816 N3=7779639296
> anon N0=98304 N1=0 N2=5105442816 N3=7779639296
> anon N0=98304 N1=0 N2=4837007360 N3=8048074752
> anon N0=98304 N1=0 N2=3763265536 N3=9121816576
> anon N0=98304 N1=0 N2=2689523712 N3=10195558400
> anon N0=98304 N1=0 N2=2515148800 N3=10369933312
> anon N0=98304 N1=0 N2=2515148800 N3=10369933312
> anon N0=98304 N1=0 N2=2515148800 N3=10369933312
> anon N0=98304 N1=0 N2=3320455168 N3=9564626944
> anon N0=98304 N1=0 N2=4394196992 N3=8490885120
> anon N0=98304 N1=0 N2=5105442816 N3=7779639296
> anon N0=98304 N1=0 N2=6174195712 N3=6710886400
> anon N0=98304 N1=0 N2=7247937536 N3=5637144576
> anon N0=98304 N1=0 N2=8321679360 N3=4563402752
> anon N0=98304 N1=0 N2=9395421184 N3=3489660928
> anon N0=98304 N1=0 N2=10247872512 N3=2637209600
> anon N0=98304 N1=0 N2=10247872512 N3=2637209600
> 
> 2. Root cause:
> Since commit 3e32158767b0 ("mm/mprotect.c: don't touch single threaded
> PTEs which are on the right node"), the PTEs of local pages are no longer
> changed in change_pte_range() for a single-threaded process, so no
> page-fault information is generated for them in do_numa_page(). If a
> single-threaded process has memory on another node, it will
> unconditionally migrate all of its local memory to that node,
> even if the remote node holds only one page.
> 
> So, let's fix it. The memory of single-threaded process should follow
> the cpu, not the numa faults info in order to avoid memory thrashing.
> 
> After a long time of testing, there is no memory thrashing
> from the beginning.
> 
> while :;do cat memory.numa_stat | grep -w anon;sleep 5;done
> anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> 
> V1:
> -- Add the test results (numa stats) from Ying's feedback
> 
> Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> Acked-by: "Huang, Ying" <ying.huang@intel.com>
> ---
>  kernel/sched/fair.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 24dda708b699..d7cbbda568fb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2898,6 +2898,12 @@ static void task_numa_placement(struct task_struct *p)
>  		numa_group_count_active_nodes(ng);
>  		spin_unlock_irq(group_lock);
>  		max_nid = preferred_group_nid(p, max_nid);
> +	} else if (atomic_read(&p->mm->mm_users) == 1) {
> +		/*
> +		 * The memory of a single-threaded process should
> +		 * follow the CPU in order to avoid memory thrashing.
> +		 */
> +		max_nid = numa_node_id();
>  	}
>  
>  	if (max_faults) {

This in fact makes sense for a single-threaded process, but just
wondering: could there be any other unwanted side effects?
Zhongkun He July 23, 2024, 7 a.m. UTC | #2
On Tue, Jul 23, 2024 at 2:16 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
>
> On 7/23/24 11:02, Zhongkun He wrote:
> > I found a problem in my test machine that the memory of a process is
> > repeatedly migrated between two nodes and does not stop.
> >
> > 1.Test step and the machines.
> > ------------
> > VM machine: 4 numa nodes and 10GB per node.
> >
> > stress --vm 1 --vm-bytes 12g --vm-keep
> >
> > The info of numa stat:
> > while :;do cat memory.numa_stat | grep -w anon;sleep 5;done
> > anon N0=98304 N1=0 N2=10250747904 N3=2634334208
> > anon N0=98304 N1=0 N2=10250747904 N3=2634334208
> > anon N0=98304 N1=0 N2=9937256448 N3=2947825664
> > anon N0=98304 N1=0 N2=8863514624 N3=4021567488
> > anon N0=98304 N1=0 N2=7789772800 N3=5095309312
> > anon N0=98304 N1=0 N2=6716030976 N3=6169051136
> > anon N0=98304 N1=0 N2=5642289152 N3=7242792960
> > anon N0=98304 N1=0 N2=5105442816 N3=7779639296
> > anon N0=98304 N1=0 N2=5105442816 N3=7779639296
> > anon N0=98304 N1=0 N2=4837007360 N3=8048074752
> > anon N0=98304 N1=0 N2=3763265536 N3=9121816576
> > anon N0=98304 N1=0 N2=2689523712 N3=10195558400
> > anon N0=98304 N1=0 N2=2515148800 N3=10369933312
> > anon N0=98304 N1=0 N2=2515148800 N3=10369933312
> > anon N0=98304 N1=0 N2=2515148800 N3=10369933312
> > anon N0=98304 N1=0 N2=3320455168 N3=9564626944
> > anon N0=98304 N1=0 N2=4394196992 N3=8490885120
> > anon N0=98304 N1=0 N2=5105442816 N3=7779639296
> > anon N0=98304 N1=0 N2=6174195712 N3=6710886400
> > anon N0=98304 N1=0 N2=7247937536 N3=5637144576
> > anon N0=98304 N1=0 N2=8321679360 N3=4563402752
> > anon N0=98304 N1=0 N2=9395421184 N3=3489660928
> > anon N0=98304 N1=0 N2=10247872512 N3=2637209600
> > anon N0=98304 N1=0 N2=10247872512 N3=2637209600
> >
> > 2. Root cause:
> > Since commit 3e32158767b0 ("mm/mprotect.c: don't touch single threaded
> > PTEs which are on the right node"), the PTEs of local pages are no longer
> > changed in change_pte_range() for a single-threaded process, so no
> > page-fault information is generated for them in do_numa_page(). If a
> > single-threaded process has memory on another node, it will
> > unconditionally migrate all of its local memory to that node,
> > even if the remote node holds only one page.
> >
> > So, let's fix it. The memory of single-threaded process should follow
> > the cpu, not the numa faults info in order to avoid memory thrashing.
> >
> > After a long time of testing, there is no memory thrashing
> > from the beginning.
> >
> > while :;do cat memory.numa_stat | grep -w anon;sleep 5;done
> > anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> > anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> > anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> > anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> > anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> > anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> > anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> > anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> > anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> > anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> > anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> > anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> > anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> > anon N0=2548117504 N1=10336903168 N2=139264 N3=0
> >
> > V1:
> > -- Add the test results (numa stats) from Ying's feedback
> >
> > Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> > Acked-by: "Huang, Ying" <ying.huang@intel.com>
> > ---
> >  kernel/sched/fair.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 24dda708b699..d7cbbda568fb 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2898,6 +2898,12 @@ static void task_numa_placement(struct task_struct *p)
> >               numa_group_count_active_nodes(ng);
> >               spin_unlock_irq(group_lock);
> >               max_nid = preferred_group_nid(p, max_nid);
> > +     } else if (atomic_read(&p->mm->mm_users) == 1) {
> > +             /*
> > +              * The memory of a single-threaded process should
> > +              * follow the CPU in order to avoid memory thrashing.
> > +              */
> > +             max_nid = numa_node_id();
> >       }
> >
> >       if (max_faults) {
>
> This in fact makes sense for a single threaded process but just
> wondering could there be any other unwanted side effects ?

Hi Anshuman,

This fix only takes effect for a single-threaded process because of the
check 'atomic_read(&p->mm->mm_users) == 1', so I don't think there are
any other side effects.
Abel Wu July 23, 2024, 1:38 p.m. UTC | #3
Hi Zhongkun,

On 7/23/24 1:32 PM, Zhongkun He Wrote:
> I found a problem in my test machine that the memory of a process is
> repeatedly migrated between two nodes and does not stop.
> 
> 1.Test step and the machines.
> ------------
> VM machine: 4 numa nodes and 10GB per node.
> 
> stress --vm 1 --vm-bytes 12g --vm-keep
> 
> The info of numa stat:
> while :;do cat memory.numa_stat | grep -w anon;sleep 5;done
> anon N0=98304 N1=0 N2=10250747904 N3=2634334208

I am curious: what exactly made the worker migrate
to N3? And later...

> anon N0=98304 N1=0 N2=10250747904 N3=2634334208
> anon N0=98304 N1=0 N2=9937256448 N3=2947825664
> anon N0=98304 N1=0 N2=8863514624 N3=4021567488
> anon N0=98304 N1=0 N2=7789772800 N3=5095309312
> anon N0=98304 N1=0 N2=6716030976 N3=6169051136
> anon N0=98304 N1=0 N2=5642289152 N3=7242792960
> anon N0=98304 N1=0 N2=5105442816 N3=7779639296
> anon N0=98304 N1=0 N2=5105442816 N3=7779639296
> anon N0=98304 N1=0 N2=4837007360 N3=8048074752
> anon N0=98304 N1=0 N2=3763265536 N3=9121816576
> anon N0=98304 N1=0 N2=2689523712 N3=10195558400
> anon N0=98304 N1=0 N2=2515148800 N3=10369933312
> anon N0=98304 N1=0 N2=2515148800 N3=10369933312
> anon N0=98304 N1=0 N2=2515148800 N3=10369933312

... why was it moved back to N2?

> anon N0=98304 N1=0 N2=3320455168 N3=9564626944
> anon N0=98304 N1=0 N2=4394196992 N3=8490885120
> anon N0=98304 N1=0 N2=5105442816 N3=7779639296
> anon N0=98304 N1=0 N2=6174195712 N3=6710886400
> anon N0=98304 N1=0 N2=7247937536 N3=5637144576
> anon N0=98304 N1=0 N2=8321679360 N3=4563402752
> anon N0=98304 N1=0 N2=9395421184 N3=3489660928
> anon N0=98304 N1=0 N2=10247872512 N3=2637209600
> anon N0=98304 N1=0 N2=10247872512 N3=2637209600
> 
> 2. Root cause:
> Since commit 3e32158767b0 ("mm/mprotect.c: don't touch single threaded
> PTEs which are on the right node"), the PTEs of local pages are no longer
> changed in change_pte_range() for a single-threaded process, so no
> page-fault information is generated for them in do_numa_page(). If a
> single-threaded process has memory on another node, it will
> unconditionally migrate all of its local memory to that node,
> even if the remote node holds only one page.

IIUC the remote pages will be moved to the node where the worker
is running since local (private) PTEs are not set to protnone and
won't be faulted on.

> 
> So, let's fix it. The memory of single-threaded process should follow
> the cpu, not the numa faults info in order to avoid memory thrashing.

Don't forget the 'Fixes' tag for bugfix patches :)

> 
> ...> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 24dda708b699..d7cbbda568fb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2898,6 +2898,12 @@ static void task_numa_placement(struct task_struct *p)
>   		numa_group_count_active_nodes(ng);
>   		spin_unlock_irq(group_lock);
>   		max_nid = preferred_group_nid(p, max_nid);
> +	} else if (atomic_read(&p->mm->mm_users) == 1) {
> +		/*
> +		 * The memory of a single-threaded process should
> +		 * follow the CPU in order to avoid memory thrashing.
> +		 */
> +		max_nid = numa_node_id();
>   	}
>   
>   	if (max_faults) {

Since you don't want to respect the faults info, can we simply
skip task placement?
Zhongkun He July 24, 2024, 3:55 a.m. UTC | #4
On Tue, Jul 23, 2024 at 9:39 PM Abel Wu <wuyun.abel@bytedance.com> wrote:
>
> Hi Zhongkun,
>
> On 7/23/24 1:32 PM, Zhongkun He Wrote:
> > I found a problem in my test machine that the memory of a process is
> > repeatedly migrated between two nodes and does not stop.
> >
> > 1.Test step and the machines.
> > ------------
> > VM machine: 4 numa nodes and 10GB per node.
> >
> > stress --vm 1 --vm-bytes 12g --vm-keep
> >
> > The info of numa stat:
> > while :;do cat memory.numa_stat | grep -w anon;sleep 5;done
> > anon N0=98304 N1=0 N2=10250747904 N3=2634334208
>
> I am curious what was the exact reason made the worker migrated
> to N3? And later...

The maximum capacity of each node is 10GB, but the process needs 12GB,
so there are always 2GB on other nodes. Because of commit 3e32158767b0,
we only get page faults on remote nodes, never locally, so we keep
migrating pages to the remote node, because p->numa_preferred_nid is
always the other node.

>
> > anon N0=98304 N1=0 N2=10250747904 N3=2634334208
> > anon N0=98304 N1=0 N2=9937256448 N3=2947825664
> > anon N0=98304 N1=0 N2=8863514624 N3=4021567488
> > anon N0=98304 N1=0 N2=7789772800 N3=5095309312
> > anon N0=98304 N1=0 N2=6716030976 N3=6169051136
> > anon N0=98304 N1=0 N2=5642289152 N3=7242792960
> > anon N0=98304 N1=0 N2=5105442816 N3=7779639296
> > anon N0=98304 N1=0 N2=5105442816 N3=7779639296
> > anon N0=98304 N1=0 N2=4837007360 N3=8048074752
> > anon N0=98304 N1=0 N2=3763265536 N3=9121816576
> > anon N0=98304 N1=0 N2=2689523712 N3=10195558400
> > anon N0=98304 N1=0 N2=2515148800 N3=10369933312
> > anon N0=98304 N1=0 N2=2515148800 N3=10369933312
> > anon N0=98304 N1=0 N2=2515148800 N3=10369933312
>
> .. why it was moved back to N2?

The private page-fault count on N2 becomes larger than that on N3.

>
> > anon N0=98304 N1=0 N2=3320455168 N3=9564626944
> > anon N0=98304 N1=0 N2=4394196992 N3=8490885120
> > anon N0=98304 N1=0 N2=5105442816 N3=7779639296
> > anon N0=98304 N1=0 N2=6174195712 N3=6710886400
> > anon N0=98304 N1=0 N2=7247937536 N3=5637144576
> > anon N0=98304 N1=0 N2=8321679360 N3=4563402752
> > anon N0=98304 N1=0 N2=9395421184 N3=3489660928
> > anon N0=98304 N1=0 N2=10247872512 N3=2637209600
> > anon N0=98304 N1=0 N2=10247872512 N3=2637209600
> >
> > 2. Root cause:
> > Since commit 3e32158767b0 ("mm/mprotect.c: don't touch single threaded
> > PTEs which are on the right node"), the PTEs of local pages are no longer
> > changed in change_pte_range() for a single-threaded process, so no
> > page-fault information is generated for them in do_numa_page(). If a
> > single-threaded process has memory on another node, it will
> > unconditionally migrate all of its local memory to that node,
> > even if the remote node holds only one page.
>
> IIUC the remote pages will be moved to the node where the worker
> is running since local (private) PTEs are not set to protnone and
> won't be faulted on.
>

Yes.

> >
> > So, let's fix it. The memory of single-threaded process should follow
> > the cpu, not the numa faults info in order to avoid memory thrashing.
>
> Don't forget the 'Fixes' tag for bugfix patches :)

OK, thanks.

>
> >
> > ...>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 24dda708b699..d7cbbda568fb 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2898,6 +2898,12 @@ static void task_numa_placement(struct task_struct *p)
> >               numa_group_count_active_nodes(ng);
> >               spin_unlock_irq(group_lock);
> >               max_nid = preferred_group_nid(p, max_nid);
> > +     } else if (atomic_read(&p->mm->mm_users) == 1) {
> > +             /*
> > +              * The memory of a single-threaded process should
> > +              * follow the CPU in order to avoid memory thrashing.
> > +              */
> > +             max_nid = numa_node_id();
> >       }
> >
> >       if (max_faults) {
>
> Since you don't want to respect the faults info, can we simply
> skip task placement?

This is a good suggestion. It would be even better to get some
feedback from others.
Abel Wu July 24, 2024, 12:11 p.m. UTC | #5
On 7/24/24 11:55 AM, Zhongkun He Wrote:
> On Tue, Jul 23, 2024 at 9:39 PM Abel Wu <wuyun.abel@bytedance.com> wrote:
>>
>> Hi Zhongkun,
>>
>> On 7/23/24 1:32 PM, Zhongkun He Wrote:
>>> I found a problem in my test machine that the memory of a process is
>>> repeatedly migrated between two nodes and does not stop.
>>>
>>> 1.Test step and the machines.
>>> ------------
>>> VM machine: 4 numa nodes and 10GB per node.
>>>
>>> stress --vm 1 --vm-bytes 12g --vm-keep
>>>
>>> The info of numa stat:
>>> while :;do cat memory.numa_stat | grep -w anon;sleep 5;done
>>> anon N0=98304 N1=0 N2=10250747904 N3=2634334208
>>
>> I am curious what was the exact reason made the worker migrated
>> to N3? And later...
> 
> The maximum capacity of each node is 10 GB, but it requires 12GB,
> so there's always 2G on other nodes. With the patch below we only
> have page_faults in other nodes, not local. so we will migrate pages
> to other nodes because p->numa_preferred_nid is always the other node.

Ahh sorry, I didn't notice the size of the node...

> 
>>
>>> anon N0=98304 N1=0 N2=10250747904 N3=2634334208
>>> anon N0=98304 N1=0 N2=9937256448 N3=2947825664
>>> anon N0=98304 N1=0 N2=8863514624 N3=4021567488
>>> anon N0=98304 N1=0 N2=7789772800 N3=5095309312
>>> anon N0=98304 N1=0 N2=6716030976 N3=6169051136
>>> anon N0=98304 N1=0 N2=5642289152 N3=7242792960
>>> anon N0=98304 N1=0 N2=5105442816 N3=7779639296
>>> anon N0=98304 N1=0 N2=5105442816 N3=7779639296
>>> anon N0=98304 N1=0 N2=4837007360 N3=8048074752
>>> anon N0=98304 N1=0 N2=3763265536 N3=9121816576
>>> anon N0=98304 N1=0 N2=2689523712 N3=10195558400
>>> anon N0=98304 N1=0 N2=2515148800 N3=10369933312
>>> anon N0=98304 N1=0 N2=2515148800 N3=10369933312
>>> anon N0=98304 N1=0 N2=2515148800 N3=10369933312
>>
>> .. why it was moved back to N2?
> 
> The private page_faults on N2 are larger than that on N3.
> 
>>
>>> anon N0=98304 N1=0 N2=3320455168 N3=9564626944
>>> anon N0=98304 N1=0 N2=4394196992 N3=8490885120
>>> anon N0=98304 N1=0 N2=5105442816 N3=7779639296
>>> anon N0=98304 N1=0 N2=6174195712 N3=6710886400
>>> anon N0=98304 N1=0 N2=7247937536 N3=5637144576
>>> anon N0=98304 N1=0 N2=8321679360 N3=4563402752
>>> anon N0=98304 N1=0 N2=9395421184 N3=3489660928
>>> anon N0=98304 N1=0 N2=10247872512 N3=2637209600
>>> anon N0=98304 N1=0 N2=10247872512 N3=2637209600
>>>
>>> 2. Root cause:
>>> Since commit 3e32158767b0 ("mm/mprotect.c: don't touch single threaded
>>> PTEs which are on the right node"), the PTEs of local pages are no longer
>>> changed in change_pte_range() for a single-threaded process, so no
>>> page-fault information is generated for them in do_numa_page(). If a
>>> single-threaded process has memory on another node, it will
>>> unconditionally migrate all of its local memory to that node,
>>> even if the remote node holds only one page.
>>
>> IIUC the remote pages will be moved to the node where the worker
>> is running since local (private) PTEs are not set to protnone and
>> won't be faulted on.
>>
> 
> Yes.
> 
>>>
>>> So, let's fix it. The memory of single-threaded process should follow
>>> the cpu, not the numa faults info in order to avoid memory thrashing.
>>
>> Don't forget the 'Fixes' tag for bugfix patches :)
> 
> OK, thanks.
> 
>>
>>>
>>> ...>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 24dda708b699..d7cbbda568fb 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -2898,6 +2898,12 @@ static void task_numa_placement(struct task_struct *p)
>>>                numa_group_count_active_nodes(ng);
>>>                spin_unlock_irq(group_lock);
>>>                max_nid = preferred_group_nid(p, max_nid);
>>> +     } else if (atomic_read(&p->mm->mm_users) == 1) {
>>> +             /*
>>> +              * The memory of a single-threaded process should
>>> +              * follow the CPU in order to avoid memory thrashing.
>>> +              */
>>> +             max_nid = numa_node_id();
>>>        }
>>>
>>>        if (max_faults) {
>>
>> Since you don't want to respect the faults info, can we simply
>> skip task placement?
> 
> This is a good suggestion. It would be even better if there were some
> feedback from others.

Although it is still a pity that, if a remote node holds more hot pages
than p->numa_preferred_nid, we have to migrate them to the local node
rather than simply migrating the task to the hot node.

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 24dda708b699..d7cbbda568fb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2898,6 +2898,12 @@  static void task_numa_placement(struct task_struct *p)
 		numa_group_count_active_nodes(ng);
 		spin_unlock_irq(group_lock);
 		max_nid = preferred_group_nid(p, max_nid);
+	} else if (atomic_read(&p->mm->mm_users) == 1) {
+		/*
+		 * The memory of a single-threaded process should
+		 * follow the CPU in order to avoid memory thrashing.
+		 */
+		max_nid = numa_node_id();
 	}
 
 	if (max_faults) {