slub: fix slub segmentation

Message ID 20240402031025.1097-1-yangming73@huawei.com (mailing list archive)
State New
Series slub: fix slub segmentation

Commit Message

Ming Yang April 2, 2024, 3:10 a.m. UTC
When one of the NUMA nodes runs out of memory while many processes are
still booting, slabinfo shows that a lot of slub segmentation exists.
The following shows some of it:

# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kmalloc-512        84309 380800   1024   32    8 : tunables    0    0    0 : slabdata  11900  11900      0
kmalloc-256        65869 365408    512   32    4 : tunables    0    0    0 : slabdata  11419  11419      0

365408 "kmalloc-256" objects are allocated but only 65869 of them are
used, while 380800 "kmalloc-512" objects are allocated but only 84309
of them are used.
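
For anyone who wants to reproduce this measurement, the low utilization
can be read directly from the first three columns of /proc/slabinfo.
Below is a minimal userspace sketch (the 50% threshold and the
"first three columns only" parsing shortcut are illustrative
assumptions, not part of the patch):

#include <stdio.h>

int main(void)
{
    char line[512], name[64];
    unsigned long active, total;
    FILE *f = fopen("/proc/slabinfo", "r");    /* usually needs root */

    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f)) {
        /* data lines look like: <name> <active_objs> <num_objs> ... */
        if (sscanf(line, "%63s %lu %lu", name, &active, &total) != 3)
            continue;
        if (total && active * 100 / total < 50)
            printf("%-20s %lu/%lu objects in use\n", name, active, total);
    }
    fclose(f);
    return 0;
}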

This problem exists in the following scenario:
1. Multiple NUMA nodes, e.g. four nodes.
2. Any one of the nodes runs out of memory.
3. Functions that allocate a lot of slab memory on specific NUMA nodes,
like alloc_fair_sched_group().

The slub segmentation is generated for the following reason:
In ___slab_alloc() a new slab is first requested via get_partial(). If
the 'node' argument is assigned but that node has neither partial slabs
nor buddy memory left, no slab can be obtained from it. The code then
tries to allocate a new slab from the buddy system and, since the
assigned node has no buddy memory left, the new slab may be allocated
directly from the buddy system of another node, no matter whether that
other node still has free partial slabs. As a result, slub segmentation
is generated.

The key point of the above allocation flow is that the slab should be
allocated from the partial lists of the other nodes first, instead of
directly from the buddy system of another node.

In this commit a new slub allocation flow is proposed:
1. Attempt to get a slab via get_partial() (the first step under the
new_objects label).
2. If no slab is obtained and 'node' is assigned, try to allocate a new
slab from the assigned node only, instead of from all nodes.
3. If no slab can be allocated from the assigned node, try to get a
slab from the partial lists of the other nodes.
4. If the allocation in step 3 fails, allocate a new slab from the
buddy system of all nodes.

Signed-off-by: Ming Yang <yangming73@huawei.com>
Signed-off-by: Liang Zhang <zhangliang5@huawei.com>
Signed-off-by: Zhigang Wang <wangzhigang17@huawei.com>
Reviewed-by: Shixin Liu <liushixin2@huawei.com>
---
This patch can be tested and verified with the following steps:
1. First, exhaust the memory on node0, e.g. echo 1000 (depending on your
memory) > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages.
2. Second, start 10000 (depending on your memory) processes that use the
setsid() system call, since setsid() is likely to call
alloc_fair_sched_group().
3. Last, check slabinfo: cat /proc/slabinfo.

Hardware info:
Memory : 8GiB
CPU (total #): 120
numa node: 4

Test C code example (built with clang):

#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *p = malloc(1024);    /* exercise the kmalloc caches */
    setsid();                  /* may trigger alloc_fair_sched_group() */
    while (1)
        ;
}
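
One possible way to start many such processes at once, sketched below
(NPROC and the 1024-byte allocation are illustrative and should be
adjusted to the available memory):

#include <stdlib.h>
#include <unistd.h>

#define NPROC 10000    /* illustrative; depends on available memory */

int main(void)
{
    for (int i = 0; i < NPROC; i++) {
        if (fork() == 0) {         /* child */
            malloc(1024);          /* exercise the kmalloc caches */
            setsid();              /* may allocate a fair sched group */
            pause();               /* keep the allocation alive */
            _exit(0);
        }
    }
    pause();    /* keep the parent (and its children) running */
    return 0;
}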

 mm/slub.c | 11 +++++++++++
 1 file changed, 11 insertions(+)
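
As a quick orientation, the numbered steps from the commit message map
onto the SLUB helpers roughly as sketched below. This is not the patch
itself (the actual mm/slub.c hunk is at the bottom of this page) and
the wrapper name is hypothetical; step 1 is the existing get_partial()
attempt that runs earlier in ___slab_alloc():

static struct slab *alloc_slab_prefer_remote_partial(struct kmem_cache *s,
		gfp_t gfpflags, int node, struct partial_context *pc)
{
	struct slab *slab;

	if (node != NUMA_NO_NODE) {
		/* step 2: buddy system of the assigned node only */
		slab = new_slab(s, gfpflags | __GFP_THISNODE, node);
		if (slab)
			return slab;

		/* step 3: partial lists of the other nodes */
		slab = get_any_partial(s, pc);
		if (slab)
			return slab;
	}

	/* step 4: buddy system of any node */
	return new_slab(s, gfpflags, node);
}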

Comments

Chengming Zhou April 2, 2024, 3:45 a.m. UTC | #1
On 2024/4/2 11:10, Ming Yang wrote:
> When one of numa nodes runs out of memory and lots of processes still
> booting, slabinfo shows much slub segmentation exits. The following
> shows some of them:
> 
> tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs>
> <num_slabs> <sharedavail>
> kmalloc-512        84309 380800   1024   32    8 :
> tunables    0    0    0 : slabdata  11900  11900      0
> kmalloc-256        65869 365408    512   32    4 :
> tunables    0    0    0 : slabdata  11419  11419      0
> 
> 365408 "kmalloc-256" objects are alloced but only 65869 of them are
> used; While 380800 "kmalloc-512" objects are alloced but only 84309
> of them are used.
> 
> This problem exits in the following senario:
> 1. Multiple numa nodes, e.g. four nodes.
> 2. Lack of memory in any one node.
> 3. Functions which alloc many slub memory in certain numa nodes,
> like alloc_fair_sched_group.
> 
> The slub segmentation generated because of the following reason:
> In function "___slab_alloc" a new slab is attempted to be gotten via
> function "get_partial". If the argument 'node' is assigned but there
> are neither partial memory nor buddy memory in that assigned node, no
> slab could be gotten. And then the program attempt to alloc new slub
> from buddy system, as mentationed before: no buddy memory in that
> assigned node left, a new slub might be alloced from the buddy system
> of other node directly, no matter whether there is free partil memory
> left on other node. As a result slub segmentation generated.
> 
> The key point of above allocation flow is: the slab should be alloced
> from the partial of other node first, instead of the buddy system of
> other node directly.
> 
> In this commit a new slub allocation flow is proposed:
> 1. Attempt to get a slab via function get_partial (first step in
> new_objects lable).
> 2. If no slab is gotten and 'node' is assigned, try to alloc a new
> slab just from the assigned node instead of all node.
> 3. If no slab could be alloced from the assigned node, try to alloc
> slub from partial of other node.
> 4. If the alloctation in step 3 fails, alloc a new slub from buddy
> system of all node.

FYI, there is another patch to the very same problem:

https://lore.kernel.org/all/20240330082335.29710-1-chenjun102@huawei.com/

> 
> Signed-off-by: Ming Yang <yangming73@huawei.com>
> Signed-off-by: Liang Zhang <zhangliang5@huawei.com>
> Signed-off-by: Zhigang Wang <wangzhigang17@huawei.com>
> Reviewed-by: Shixin Liu <liushixin2@huawei.com>
> ---
> This patch can be tested and verified by following steps:
> 1. First, try to run out memory on node0. echo 1000(depending on your memory) > 
> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages.
> 2. Second, boot 10000(depending on your memory) processes which use setsid 
> systemcall, as the setsid systemcall may likely call function 
> alloc_fair_sched_group.
> 3. Last, check slabinfo, cat /proc/slabinfo.
> 
> Hardware info:
> Memory : 8GiB
> CPU (total #): 120
> numa node: 4
> 
> Test clang code example:
> int main() {
>     void *p = malloc(1024);
>     setsid();
>     while(1);
> }
> 
>  mm/slub.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 1bb2a93cf7..3eb2e7d386 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3522,7 +3522,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	}
>  
>  	slub_put_cpu_ptr(s->cpu_slab);
> +	if (node != NUMA_NO_NODE) {
> +		slab = new_slab(s, gfpflags | __GFP_THISNODE, node);
> +		if (slab)
> +			goto slab_alloced;
> +
> +		slab = get_any_partial(s, &pc);
> +		if (slab)
> +			goto slab_alloced;
> +	}
>  	slab = new_slab(s, gfpflags, node);
> +
> +slab_alloced:
>  	c = slub_get_cpu_ptr(s->cpu_slab);
>  
>  	if (unlikely(!slab)) {

Vlastimil Babka April 2, 2024, 4:13 p.m. UTC | #2
On 4/2/24 5:45 AM, Chengming Zhou wrote:
> On 2024/4/2 11:10, Ming Yang wrote:
>> When one of numa nodes runs out of memory and lots of processes still
>> booting, slabinfo shows much slub segmentation exits. The following

You mean fragmentation not segmentation, right?

>> shows some of them:
>> 
>> tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs>
>> <num_slabs> <sharedavail>
>> kmalloc-512        84309 380800   1024   32    8 :
>> tunables    0    0    0 : slabdata  11900  11900      0
>> kmalloc-256        65869 365408    512   32    4 :
>> tunables    0    0    0 : slabdata  11419  11419      0
>> 
>> 365408 "kmalloc-256" objects are alloced but only 65869 of them are
>> used; While 380800 "kmalloc-512" objects are alloced but only 84309
>> of them are used.
>> 
>> This problem exits in the following senario:
>> 1. Multiple numa nodes, e.g. four nodes.
>> 2. Lack of memory in any one node.
>> 3. Functions which alloc many slub memory in certain numa nodes,
>> like alloc_fair_sched_group.
>> 
>> The slub segmentation generated because of the following reason:
>> In function "___slab_alloc" a new slab is attempted to be gotten via
>> function "get_partial". If the argument 'node' is assigned but there
>> are neither partial memory nor buddy memory in that assigned node, no
>> slab could be gotten. And then the program attempt to alloc new slub
>> from buddy system, as mentationed before: no buddy memory in that
>> assigned node left, a new slub might be alloced from the buddy system
>> of other node directly, no matter whether there is free partil memory
>> left on other node. As a result slub segmentation generated.
>> 
>> The key point of above allocation flow is: the slab should be alloced
>> from the partial of other node first, instead of the buddy system of
>> other node directly.
>> 
>> In this commit a new slub allocation flow is proposed:
>> 1. Attempt to get a slab via function get_partial (first step in
>> new_objects lable).
>> 2. If no slab is gotten and 'node' is assigned, try to alloc a new
>> slab just from the assigned node instead of all node.
>> 3. If no slab could be alloced from the assigned node, try to alloc
>> slub from partial of other node.
>> 4. If the alloctation in step 3 fails, alloc a new slub from buddy
>> system of all node.
> 
> FYI, there is another patch to the very same problem:
> 
> https://lore.kernel.org/all/20240330082335.29710-1-chenjun102@huawei.com/

Yeah and I have just taken that one to slab/for-6.10

>> 
>> Signed-off-by: Ming Yang <yangming73@huawei.com>
>> Signed-off-by: Liang Zhang <zhangliang5@huawei.com>
>> Signed-off-by: Zhigang Wang <wangzhigang17@huawei.com>
>> Reviewed-by: Shixin Liu <liushixin2@huawei.com>
>> ---
>> This patch can be tested and verified by following steps:
>> 1. First, try to run out memory on node0. echo 1000(depending on your memory) > 
>> /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages.
>> 2. Second, boot 10000(depending on your memory) processes which use setsid 
>> systemcall, as the setsid systemcall may likely call function 
>> alloc_fair_sched_group.
>> 3. Last, check slabinfo, cat /proc/slabinfo.
>> 
>> Hardware info:
>> Memory : 8GiB
>> CPU (total #): 120
>> numa node: 4
>> 
>> Test clang code example:
>> int main() {
>>     void *p = malloc(1024);
>>     setsid();
>>     while(1);
>> }
>> 
>>  mm/slub.c | 11 +++++++++++
>>  1 file changed, 11 insertions(+)
>> 
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 1bb2a93cf7..3eb2e7d386 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -3522,7 +3522,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>  	}
>>  
>>  	slub_put_cpu_ptr(s->cpu_slab);
>> +	if (node != NUMA_NO_NODE) {
>> +		slab = new_slab(s, gfpflags | __GFP_THISNODE, node);
>> +		if (slab)
>> +			goto slab_alloced;
>> +
>> +		slab = get_any_partial(s, &pc);
>> +		if (slab)
>> +			goto slab_alloced;
>> +	}
>>  	slab = new_slab(s, gfpflags, node);
>> +
>> +slab_alloced:
>>  	c = slub_get_cpu_ptr(s->cpu_slab);
>>  
>>  	if (unlikely(!slab)) {

Christoph Lameter (Ampere) April 4, 2024, 7:12 p.m. UTC | #3
On Tue, 2 Apr 2024, Ming Yang wrote:

> The key point of above allocation flow is: the slab should be alloced
> from the partial of other node first, instead of the buddy system of
> other node directly.


If you use GFP_THISNODE then you will trigger a reclaim pass on the remote 
node. That could generate a performance regression.

We already support this kind of behavior via the node_reclaim /
zone_reclaim setting in procfs. Please use that.

The remote buildup of the partial pages can be addressed by changing the 
remote_node_defrag_ratio in the slabs. This will make slub scan remote 
nodes for partial slabs before going into the page allocator.
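
For reference, the two knobs mentioned here are the zone_reclaim_mode
sysctl and the per-cache remote_node_defrag_ratio attribute under
/sys/kernel/slab/ (available on NUMA-enabled kernels). A minimal sketch
that only prints their current values, with "kmalloc-256" picked purely
as an example cache:

#include <stdio.h>

static void show(const char *path)
{
    char buf[64];
    FILE *f = fopen(path, "r");

    if (f && fgets(buf, sizeof(buf), f))
        printf("%s = %s", path, buf);
    if (f)
        fclose(f);
}

int main(void)
{
    show("/proc/sys/vm/zone_reclaim_mode");
    show("/sys/kernel/slab/kmalloc-256/remote_node_defrag_ratio");
    return 0;
}

Writing to the same paths (e.g. with echo) adjusts the behavior;
suitable values depend on the workload.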

Vlastimil Babka April 5, 2024, 9:05 a.m. UTC | #4
On 4/4/24 9:12 PM, Christoph Lameter (Ampere) wrote:
> On Tue, 2 Apr 2024, Ming Yang wrote:
> 
>> The key point of above allocation flow is: the slab should be alloced
>> from the partial of other node first, instead of the buddy system of
>> other node directly.
> 
> 
> If you use GFP_THISNODE then you will trigger a reclaim pass on the remote 
> node. That could generate a performance regression.

Note the alternative approach I merged doesn't have this issue because it
uses GFP_NOWAIT for that __GFP_THISNODE attempt.

https://lore.kernel.org/all/20240330082335.29710-1-chenjun102@huawei.com/


> We already support this kind of behavior via the node_reclaim / 
> zone_reclaiom setting in procfs. Please use that.
> 
> The remote buildup of the partial pages can be addressed by changing the 
> remote_node_defrag_ratio in the slabs. This will make slub scan remote 
> nodes for partial slabs before going into the page allocator.
> 
>

Patch

diff --git a/mm/slub.c b/mm/slub.c
index 1bb2a93cf7..3eb2e7d386 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3522,7 +3522,18 @@  static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	}
 
 	slub_put_cpu_ptr(s->cpu_slab);
+	if (node != NUMA_NO_NODE) {
+		slab = new_slab(s, gfpflags | __GFP_THISNODE, node);
+		if (slab)
+			goto slab_alloced;
+
+		slab = get_any_partial(s, &pc);
+		if (slab)
+			goto slab_alloced;
+	}
 	slab = new_slab(s, gfpflags, node);
+
+slab_alloced:
 	c = slub_get_cpu_ptr(s->cpu_slab);
 
 	if (unlikely(!slab)) {