diff mbox series

mm/mempolicy: Fix an incorrect rebind node in mpol_rebind_nodemask

Message ID 1558768043-23184-1-git-send-email-zhongjiang@huawei.com (mailing list archive)
State New, archived
Headers show
Series mm/mempolicy: Fix an incorrect rebind node in mpol_rebind_nodemask | expand

Commit Message

zhong jiang May 25, 2019, 7:07 a.m. UTC
We bind an different node to different vma, Unluckily,
it will bind different vma to same node by checking the /proc/pid/numa_maps.   
Commit 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
has introduced the issue.  when we change memory policy by seting cpuset.mems,
A process will rebind the specified policy more than one times. 
if the cpuset_mems_allowed is not equal to user specified nodes. hence the issue will trigger.
Maybe result in the out of memory which allocating memory from same node.

Fixes: 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets") 
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
---
 mm/mempolicy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Andrew Morton May 25, 2019, 6:28 p.m. UTC | #1
(Cc Vlastimil)

On Sat, 25 May 2019 15:07:23 +0800 zhong jiang <zhongjiang@huawei.com> wrote:

> We bind an different node to different vma, Unluckily,
> it will bind different vma to same node by checking the /proc/pid/numa_maps.   
> Commit 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
> has introduced the issue.  when we change memory policy by seting cpuset.mems,
> A process will rebind the specified policy more than one times. 
> if the cpuset_mems_allowed is not equal to user specified nodes. hence the issue will trigger.
> Maybe result in the out of memory which allocating memory from same node.
> 
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -345,7 +345,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
>  	else {
>  		nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed,
>  								*nodes);
> -		pol->w.cpuset_mems_allowed = tmp;
> +		pol->w.cpuset_mems_allowed = *nodes;
>  	}
>  
>  	if (nodes_empty(tmp))

hm, I'm not surprised the code broke.  What the heck is going on in
there?  It used to have a perfunctory comment, but Vlastimil deleted
it.

Could someone please propose a comment for the above code block
explaining why we're doing what we do?
Vlastimil Babka May 27, 2019, 12:23 p.m. UTC | #2
On 5/25/19 8:28 PM, Andrew Morton wrote:
> (Cc Vlastimil)

Oh dear, 2 years and I forgot all the details about how this works.

> On Sat, 25 May 2019 15:07:23 +0800 zhong jiang <zhongjiang@huawei.com> wrote:
> 
>> We bind an different node to different vma, Unluckily,
>> it will bind different vma to same node by checking the /proc/pid/numa_maps.   
>> Commit 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
>> has introduced the issue.  when we change memory policy by seting cpuset.mems,
>> A process will rebind the specified policy more than one times. 
>> if the cpuset_mems_allowed is not equal to user specified nodes. hence the issue will trigger.
>> Maybe result in the out of memory which allocating memory from same node.

I have a hard time understanding what the problem is. Could you please
write it as a (pseudo) reproducer? I.e. an example of the process/admin
mempolicy/cpuset actions that have some wrong observed results vs the
correct expected result.

>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -345,7 +345,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
>>  	else {
>>  		nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed,
>>  								*nodes);
>> -		pol->w.cpuset_mems_allowed = tmp;
>> +		pol->w.cpuset_mems_allowed = *nodes;

Looks like a mechanical error on my side when removing the code for
step1+step2 rebinding. Before my commit there was

pol->w.cpuset_mems_allowed = step ? tmp : *nodes;

Since 'step' was removed and thus 0, I should have used *nodes indeed.
Thanks for catching that.

>>  	}
>>  
>>  	if (nodes_empty(tmp))
> 
> hm, I'm not surprised the code broke.  What the heck is going on in
> there?  It used to have a perfunctory comment, but Vlastimil deleted
> it.

Yeah the comment was specific for the case that was being removed.

> Could someone please propose a comment for the above code block
> explaining why we're doing what we do?

I'll have to relearn this first...
zhong jiang May 27, 2019, 1:58 p.m. UTC | #3
On 2019/5/27 20:23, Vlastimil Babka wrote:
> On 5/25/19 8:28 PM, Andrew Morton wrote:
>> (Cc Vlastimil)
> Oh dear, 2 years and I forgot all the details about how this works.
>
>> On Sat, 25 May 2019 15:07:23 +0800 zhong jiang <zhongjiang@huawei.com> wrote:
>>
>>> We bind an different node to different vma, Unluckily,
>>> it will bind different vma to same node by checking the /proc/pid/numa_maps.   
>>> Commit 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
>>> has introduced the issue.  when we change memory policy by seting cpuset.mems,
>>> A process will rebind the specified policy more than one times. 
>>> if the cpuset_mems_allowed is not equal to user specified nodes. hence the issue will trigger.
>>> Maybe result in the out of memory which allocating memory from same node.
> I have a hard time understanding what the problem is. Could you please
> write it as a (pseudo) reproducer? I.e. an example of the process/admin
> mempolicy/cpuset actions that have some wrong observed results vs the
> correct expected result.
Sorry, I havn't an testcase to reproduce the issue. At first, It was disappeared by
my colleague to configure the xml to start an vm.  To his suprise, The bind mempolicy
doesn't work.

Thanks,
zhong jiang
>>> --- a/mm/mempolicy.c
>>> +++ b/mm/mempolicy.c
>>> @@ -345,7 +345,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
>>>  	else {
>>>  		nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed,
>>>  								*nodes);
>>> -		pol->w.cpuset_mems_allowed = tmp;
>>> +		pol->w.cpuset_mems_allowed = *nodes;
> Looks like a mechanical error on my side when removing the code for
> step1+step2 rebinding. Before my commit there was
>
> pol->w.cpuset_mems_allowed = step ? tmp : *nodes;
>
> Since 'step' was removed and thus 0, I should have used *nodes indeed.
> Thanks for catching that.
>
>>>  	}
>>>  
>>>  	if (nodes_empty(tmp))
>> hm, I'm not surprised the code broke.  What the heck is going on in
>> there?  It used to have a perfunctory comment, but Vlastimil deleted
>> it.
> Yeah the comment was specific for the case that was being removed.
>
>> Could someone please propose a comment for the above code block
>> explaining why we're doing what we do?
> I'll have to relearn this first...
>
>
Andrew Morton June 27, 2019, 3:57 a.m. UTC | #4
On Mon, 27 May 2019 21:58:17 +0800 zhong jiang <zhongjiang@huawei.com> wrote:

> On 2019/5/27 20:23, Vlastimil Babka wrote:
> > On 5/25/19 8:28 PM, Andrew Morton wrote:
> >> (Cc Vlastimil)
> > Oh dear, 2 years and I forgot all the details about how this works.
> >
> >> On Sat, 25 May 2019 15:07:23 +0800 zhong jiang <zhongjiang@huawei.com> wrote:
> >>
> >>> We bind an different node to different vma, Unluckily,
> >>> it will bind different vma to same node by checking the /proc/pid/numa_maps.   
> >>> Commit 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
> >>> has introduced the issue.  when we change memory policy by seting cpuset.mems,
> >>> A process will rebind the specified policy more than one times. 
> >>> if the cpuset_mems_allowed is not equal to user specified nodes. hence the issue will trigger.
> >>> Maybe result in the out of memory which allocating memory from same node.
> > I have a hard time understanding what the problem is. Could you please
> > write it as a (pseudo) reproducer? I.e. an example of the process/admin
> > mempolicy/cpuset actions that have some wrong observed results vs the
> > correct expected result.
> Sorry, I havn't an testcase to reproduce the issue. At first, It was disappeared by
> my colleague to configure the xml to start an vm.  To his suprise, The bind mempolicy
> doesn't work.

So... what do we do with this patch?

> Thanks,
> zhong jiang
> >>> --- a/mm/mempolicy.c
> >>> +++ b/mm/mempolicy.c
> >>> @@ -345,7 +345,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
> >>>  	else {
> >>>  		nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed,
> >>>  								*nodes);
> >>> -		pol->w.cpuset_mems_allowed = tmp;
> >>> +		pol->w.cpuset_mems_allowed = *nodes;
> > Looks like a mechanical error on my side when removing the code for
> > step1+step2 rebinding. Before my commit there was
> >
> > pol->w.cpuset_mems_allowed = step ? tmp : *nodes;
> >
> > Since 'step' was removed and thus 0, I should have used *nodes indeed.
> > Thanks for catching that.

Was that an ack?

> >>>  	}
> >>>  
> >>>  	if (nodes_empty(tmp))
> >> hm, I'm not surprised the code broke.  What the heck is going on in
> >> there?  It used to have a perfunctory comment, but Vlastimil deleted
> >> it.
> > Yeah the comment was specific for the case that was being removed.
> >
> >> Could someone please propose a comment for the above code block
> >> explaining why we're doing what we do?
> > I'll have to relearn this first...
> >
> >
>
Vlastimil Babka June 27, 2019, 7:47 a.m. UTC | #5
On 6/27/19 5:57 AM, Andrew Morton wrote:
> On Mon, 27 May 2019 21:58:17 +0800 zhong jiang <zhongjiang@huawei.com> wrote:
> 
>> On 2019/5/27 20:23, Vlastimil Babka wrote:
>>> On 5/25/19 8:28 PM, Andrew Morton wrote:
>>>> (Cc Vlastimil)
>>> Oh dear, 2 years and I forgot all the details about how this works.
>>>
>>>> On Sat, 25 May 2019 15:07:23 +0800 zhong jiang <zhongjiang@huawei.com> wrote:
>>>>
>>>>> We bind an different node to different vma, Unluckily,
>>>>> it will bind different vma to same node by checking the /proc/pid/numa_maps.   
>>>>> Commit 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
>>>>> has introduced the issue.  when we change memory policy by seting cpuset.mems,
>>>>> A process will rebind the specified policy more than one times. 
>>>>> if the cpuset_mems_allowed is not equal to user specified nodes. hence the issue will trigger.
>>>>> Maybe result in the out of memory which allocating memory from same node.
>>> I have a hard time understanding what the problem is. Could you please
>>> write it as a (pseudo) reproducer? I.e. an example of the process/admin
>>> mempolicy/cpuset actions that have some wrong observed results vs the
>>> correct expected result.
>> Sorry, I havn't an testcase to reproduce the issue. At first, It was disappeared by
>> my colleague to configure the xml to start an vm.  To his suprise, The bind mempolicy
>> doesn't work.
> 
> So... what do we do with this patch?
> 
>> Thanks,
>> zhong jiang
>>>>> --- a/mm/mempolicy.c
>>>>> +++ b/mm/mempolicy.c
>>>>> @@ -345,7 +345,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
>>>>>  	else {
>>>>>  		nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed,
>>>>>  								*nodes);
>>>>> -		pol->w.cpuset_mems_allowed = tmp;
>>>>> +		pol->w.cpuset_mems_allowed = *nodes;
>>> Looks like a mechanical error on my side when removing the code for
>>> step1+step2 rebinding. Before my commit there was
>>>
>>> pol->w.cpuset_mems_allowed = step ? tmp : *nodes;
>>>
>>> Since 'step' was removed and thus 0, I should have used *nodes indeed.
>>> Thanks for catching that.
> 
> Was that an ack?

The fix should be fine, but the description is vague. I'll try to
improve it.

>>>>>  	}
>>>>>  
>>>>>  	if (nodes_empty(tmp))
>>>> hm, I'm not surprised the code broke.  What the heck is going on in
>>>> there?  It used to have a perfunctory comment, but Vlastimil deleted
>>>> it.
>>> Yeah the comment was specific for the case that was being removed.
>>>
>>>> Could someone please propose a comment for the above code block
>>>> explaining why we're doing what we do?
>>> I'll have to relearn this first...
>>>
>>>
>>
Vlastimil Babka June 27, 2019, 9:59 a.m. UTC | #6
On 5/25/19 9:07 AM, zhong jiang wrote:
> We bind an different node to different vma, Unluckily,
> it will bind different vma to same node by checking the /proc/pid/numa_maps.   
> Commit 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
> has introduced the issue.  when we change memory policy by seting cpuset.mems,
> A process will rebind the specified policy more than one times. 
> if the cpuset_mems_allowed is not equal to user specified nodes. hence the issue will trigger.
> Maybe result in the out of memory which allocating memory from same node.

OK, how about this instead?

mpol_rebind_nodemask() is called for MPOL_BIND and MPOL_INTERLEAVE
mempoclicies when the tasks's cpuset's mems_allowed changes. For
policies created without MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES,
it works by remapping the policy's allowed nodes (stored in v.nodes)
using the previous value of mems_allowed (stored in
w.cpuset_mems_allowed) as the domain of map and the new mems_allowed
(passed as nodes) as the range of the map (see the comment of
bitmap_remap() for details).

The result of remapping is stored back as policy's nodemask in v.nodes,
and the new value of mems_allowed should be stored in
w.cpuset_mems_allowed to facilitate the next rebind, if it happens.

However, commit 213980c0f23b ("mm, mempolicy: simplify rebinding
mempolicies when updating cpusets") introduced a bug where the result of
remapping is stored in w.cpuset_mems_allowed instead. Thus, a
mempolicy's allowed nodes can evolve in an unexpected way after a series
of rebinding due to cpuset mems_allowed changes, possibly binding to a
wrong node or a smaller number of nodes which may e.g. overload them.
This patch fixes the bug so rebinding again works as intended.

> Fixes: 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets") 
> Signed-off-by: zhong jiang <zhongjiang@huawei.com>

(an example of what exactly was the sequence of set_mempolicy and cpuset
mems changes with expected wrt actual results would be nice, but I think
the above should be fine by itself)

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/mempolicy.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index e3ab1d9..a60a3be 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -345,7 +345,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
>  	else {
>  		nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed,
>  								*nodes);
> -		pol->w.cpuset_mems_allowed = tmp;
> +		pol->w.cpuset_mems_allowed = *nodes;
>  	}
>  
>  	if (nodes_empty(tmp))
>
diff mbox series

Patch

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e3ab1d9..a60a3be 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -345,7 +345,7 @@  static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
 	else {
 		nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed,
 								*nodes);
-		pol->w.cpuset_mems_allowed = tmp;
+		pol->w.cpuset_mems_allowed = *nodes;
 	}
 
 	if (nodes_empty(tmp))