[v2] mm/slub: Reduce memory consumption in extreme scenarios

Message ID	20240330082335.29710-1-chenjun102@huawei.com (mailing list archive)
State	New
Headers	Return-Path: <owner-linux-mm@kvack.org> From: Chen Jun <chenjun102@huawei.com> To: <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>, <cl@linux.com>, <penberg@kernel.org>, <rientjes@google.com>, <iamjoonsoo.kim@lge.com>, <akpm@linux-foundation.org>, <vbabka@suse.cz>, <roman.gushchin@linux.dev>, <42.hyeyoo@gmail.com> CC: <xuqiang36@huawei.com>, <chenjun102@huawei.com>, <wangkefeng.wang@huawei.com> Subject: [PATCH v2] mm/slub: Reduce memory consumption in extreme scenarios Date: Sat, 30 Mar 2024 16:23:35 +0800 Message-ID: <20240330082335.29710-1-chenjun102@huawei.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org Precedence: bulk

Chen Jun March 30, 2024, 8:23 a.m. UTC

When kmalloc_node() is called without __GFP_THISNODE and the target node
lacks sufficient memory, SLUB allocates a new folio from a different
node instead of taking a partial slab from that node.

However, since the allocated folio does not belong to the requested
node, it is deactivated and added to the partial slab list of the node
it belongs to.

This behavior can result in excessive memory usage when the requested
node has insufficient memory, as SLUB will repeatedly allocate folios
from other nodes without reusing the previously allocated ones.

To prevent memory wastage,
when (node != NUMA_NO_NODE) && !(gfpflags & __GFP_THISNODE) is true:
1) try to get a partial slab from the target node with GFP_NOWAIT |
   __GFP_THISNODE opportunistically.
2) if 1) failed, try to allocate a new slab from the target node with
   GFP_NOWAIT | __GFP_THISNODE, also opportunistically.
3) if 2) failed, retry 1) and 2) with the original gfpflags.

When node == NUMA_NO_NODE || (gfpflags & __GFP_THISNODE), the behavior
remains unchanged.
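
The three steps above can be sketched as a small userspace model. This
is not the patch itself: struct node_state, model_get_slab() and the
counters are hypothetical names made up for illustration, reclaim is
not modelled at all, and the NUMA_NO_NODE case simply scans nodes in
index order instead of preferring the local node.

```c
#include <assert.h>
#include <stdbool.h>

#define NUMA_NO_NODE (-1)
#define MAX_NODES 4

/* Hypothetical userspace model of per-node state; not a kernel structure. */
struct node_state {
	int partial_slabs;	/* slabs sitting on this node's partial list */
	int free_slabs;		/* new slabs the node can supply without reclaim */
};

enum slab_source { FROM_PARTIAL, FROM_NEW_SLAB, ALLOC_FAILED };

struct slab_result {
	enum slab_source src;
	int nid;		/* node that supplied the slab, or NUMA_NO_NODE */
};

/* Consume one unit from a counter if any is left. */
static bool take_one(int *counter)
{
	if (*counter <= 0)
		return false;
	(*counter)--;
	return true;
}

struct slab_result model_get_slab(struct node_state nodes[MAX_NODES],
				  int node, bool thisnode)
{
	int nid;

	assert(node == NUMA_NO_NODE || (node >= 0 && node < MAX_NODES));

	if (node != NUMA_NO_NODE) {
		/* Step 1: partial list of the target node only. */
		if (take_one(&nodes[node].partial_slabs))
			return (struct slab_result){ FROM_PARTIAL, node };
		/* Step 2: new slab from the target node only, no reclaim. */
		if (take_one(&nodes[node].free_slabs))
			return (struct slab_result){ FROM_NEW_SLAB, node };
		/* With __GFP_THISNODE there is no fallback to other nodes. */
		if (thisnode)
			return (struct slab_result){ ALLOC_FAILED, NUMA_NO_NODE };
	}

	/* Step 3: original flags; partial lists of any node come first ... */
	for (nid = 0; nid < MAX_NODES; nid++)
		if (take_one(&nodes[nid].partial_slabs))
			return (struct slab_result){ FROM_PARTIAL, nid };
	/* ... and only then a new slab from any node. */
	for (nid = 0; nid < MAX_NODES; nid++)
		if (take_one(&nodes[nid].free_slabs))
			return (struct slab_result){ FROM_NEW_SLAB, nid };

	return (struct slab_result){ ALLOC_FAILED, NUMA_NO_NODE };
}
```

In this model, a request for a full node 3 picks up an existing partial
slab from another node before any new slab is allocated there, which is
the reuse the changelog is after.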

Tested on qemu with 4 NUMA nodes, each with 1G of memory, using a test
module (ko) that calls kmalloc_node(196, GFP_KERNEL, 3)
(4 * 1024 + 4) * 1024 times.

Before this patch,
cat /proc/slabinfo shows:
kmalloc-256       4200530 13519712    256   32    2 : tunables..

after this patch,
cat /proc/slabinfo shows:
kmalloc-256       4200558 4200768    256   32    2 : tunables..
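
To put the two kmalloc-256 rows in terms of memory, the slabinfo
columns (active_objs, num_objs, objsize, objperslab, pagesperslab) can
be turned into slab counts and bytes. The helper names below are
hypothetical, and the 4096-byte page size used in the note that follows
is an assumption about the test machine, not something stated above.

```c
#include <assert.h>
#include <stdint.h>

/* One /proc/slabinfo row, restricted to the columns quoted above. */
struct slabinfo_row {
	uint64_t active_objs;
	uint64_t num_objs;
	uint64_t objsize;
	uint64_t objperslab;
	uint64_t pagesperslab;
};

/* Rows copied from the changelog: the first listing and the patched one. */
const struct slabinfo_row row_before = { 4200530, 13519712, 256, 32, 2 };
const struct slabinfo_row row_after  = { 4200558,  4200768, 256, 32, 2 };

/* Number of slabs backing the cache. */
uint64_t slab_count(const struct slabinfo_row *r)
{
	assert(r->objperslab > 0);
	return r->num_objs / r->objperslab;
}

/* Bytes of page memory held by the cache, for a given page size. */
uint64_t cache_bytes(const struct slabinfo_row *r, uint64_t page_size)
{
	return slab_count(r) * r->pagesperslab * page_size;
}

/* Share of objects in use, as a whole percentage rounded down. */
unsigned int utilization_percent(const struct slabinfo_row *r)
{
	assert(r->num_objs > 0);
	return (unsigned int)(r->active_objs * 100 / r->num_objs);
}
```

With 4096-byte pages this gives 422491 slabs (about 3.2 GiB) at 31%
utilization for the first row and 131274 slabs (about 1.0 GiB) at 99%
utilization for the second.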

Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
v2:
- try to alloc a partial slab or a new slab with GFP_NOWAIT (it includes
  __GFP_NOWARN) opportunistically, then fall back to the original
  gfpflags, as suggested by Vlastimil Babka
- update changelog

v1: https://lore.kernel.org/linux-mm/20230314123403.100158-1-chenjun102@huawei.com/

 mm/slub.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

Vlastimil Babka April 2, 2024, 4:08 p.m. UTC | #1

On 3/30/24 9:23 AM, Chen Jun wrote:
> When kmalloc_node() is called without __GFP_THISNODE and the target node
> lacks sufficient memory, SLUB allocates a folio from a different node
> other than the requested node, instead of taking a partial slab from it.
> 
> However, since the allocated folio does not belong to the requested
> node, it is deactivated and added to the partial slab list of the node
> it belongs to.
> 
> This behavior can result in excessive memory usage when the requested
> node has insufficient memory, as SLUB will repeatedly allocate folios
> from other nodes without reusing the previously allocated ones.
> 
> To prevent memory wastage,
> when (node != NUMA_NO_NODE) && !(gfpflags & __GFP_THISNODE) is true:
> 1) try to get a partial slab from the target node with GFP_NOWAIT |
>    __GFP_THISNODE opportunistically.
> 2) if 1) failed, try to allocate a new slab from the target node with
>    GFP_NOWAIT | __GFP_THISNODE, also opportunistically.
> 3) if 2) failed, retry 1) and 2) with the original gfpflags.
> 
> When node == NUMA_NO_NODE || (gfpflags & __GFP_THISNODE), the behavior
> remains unchanged.
> 
> On qemu with 4 numa nodes and each numa has 1G memory. Write a test ko
> to call kmalloc_node(196, GFP_KERNEL, 3) for (4 * 1024 + 4) * 1024 times.
> 
> cat /proc/slabinfo shows:
> kmalloc-256       4200530 13519712    256   32    2 : tunables..
> 
> after this patch,
> cat /proc/slabinfo shows:
> kmalloc-256       4200558 4200768    256   32    2 : tunables..
> 
> Signed-off-by: Chen Jun <chenjun102@huawei.com>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>

Slightly reworded and added an unlikely() to one of the tests, and included
in slab/for-6.10:

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/commit/?h=slab/for-6.10/cleanup&id=9198ffbd2b494daae3a67cac1d59c3a2754e64cd

Thanks!

Christoph Lameter (Ampere) April 5, 2024, 4:50 p.m. UTC | #2

On Sat, 30 Mar 2024, Chen Jun wrote:

> When kmalloc_node() is called without __GFP_THISNODE and the target node
> lacks sufficient memory, SLUB allocates a folio from a different node
> other than the requested node, instead of taking a partial slab from it.

Hmmm... This would mean that we do not consult the partial lists of the 
other nodes. That is something to be fixed in the allocator.

> However, since the allocated folio does not belong to the requested
> node, it is deactivated and added to the partial slab list of the node
> it belongs to.

That should only occur if a request for an object for node X follows a 
request for an object from node Y.

> This behavior can result in excessive memory usage when the requested
> node has insufficient memory, as SLUB will repeatedly allocate folios
> from other nodes without reusing the previously allocated ones.

That is bad. Can we avoid that by verifying proper allocator behavior
during deactivation and ensuring that it searches remote partial objects 
first before doing something as drastic as going to the page allocator?

> To prevent memory wastage,
> when (node != NUMA_NO_NODE) && !(gfpflags & __GFP_THISNODE) is,
> 1) try to get a partial slab from target node with GFP_NOWAIT |
>   __GFP_THISNODE opportunistically.

Did we check the partial lists of that node first for available 
objects before going to the page allocator?

get_any_partial() should do that. Maybe it is not called in the 
kmalloc_node case.

Vlastimil Babka April 8, 2024, 1:17 p.m. UTC | #3

On 4/5/24 6:50 PM, Christoph Lameter (Ampere) wrote:
> On Sat, 30 Mar 2024, Chen Jun wrote:
> 
>> When kmalloc_node() is called without __GFP_THISNODE and the target node
>> lacks sufficient memory, SLUB allocates a folio from a different node
>> other than the requested node, instead of taking a partial slab from it.
> 
> Hmmm... This would mean that we do not consult the partial lists of the 
> other nodes. That is something to be fixed in the allocator.

Which allocator? If you mean SLUB, this patch fixes it. If you mean page
allocator, I don't see how.

>> However, since the allocated folio does not belong to the requested
>> node, it is deactivated and added to the partial slab list of the node
>> it belongs to.
> 
> That should only occur if a request for an object for node X follows a 
> request for an object from node Y.

Are you sure? I think it's a stream of requests for node X happening on a
cpu of node Y. AFAICS the first attempt will allocate the slab page from a
node other than X (possibly node Y because it's local and has pages
available, unlike node X which is full). It does get installed as the cpu
slab, but the next request is also for node X, so the node matching
checks deactivate the slab and a new one gets allocated.
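
A worst-case model of that stream of requests, as a rough sketch:
simulate_requests() and OBJS_PER_SLAB are hypothetical names, the target
node is assumed to be full for the whole run, and the numbers are not
meant to reproduce the slabinfo output from the changelog.

```c
#include <assert.h>
#include <stdbool.h>

#define OBJS_PER_SLAB 32	/* objperslab of kmalloc-256 in the changelog */

/*
 * Count how many slabs get allocated from a remote node when every
 * request names a target node that is full.
 *
 * reuse_remote_partial == false: each request allocates a fresh remote
 * slab, takes one object, and the slab is deactivated because its node
 * does not match the requested one.
 *
 * reuse_remote_partial == true: a deactivated remote slab is found again
 * on a partial list and its remaining free objects are handed out before
 * another slab is allocated.
 */
long simulate_requests(long requests, bool reuse_remote_partial)
{
	long slabs_allocated = 0;
	long remote_free_objs = 0;	/* free objects left in the parked slab */
	long i;

	assert(requests >= 0);

	for (i = 0; i < requests; i++) {
		if (reuse_remote_partial && remote_free_objs > 0) {
			remote_free_objs--;
			continue;
		}
		/* No reusable object: allocate a new slab and take one object. */
		slabs_allocated++;
		if (reuse_remote_partial)
			remote_free_objs = OBJS_PER_SLAB - 1;
	}
	return slabs_allocated;
}
```

For 1000 such requests the model allocates 1000 slabs without reuse and
32 slabs with reuse.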

>> This behavior can result in excessive memory usage when the requested
>> node has insufficient memory, as SLUB will repeatedly allocate folios
>> from other nodes without reusing the previously allocated ones.
> 
> That is bad. Can we avoid that by verifying proper allocator behavior
> during deactivation and ensuring that it searches remote partial objects 
> first before doing something drastic as going to the page allocator?
> 
>> To prevent memory wastage,
>> when (node != NUMA_NO_NODE) && !(gfpflags & __GFP_THISNODE) is,
>> 1) try to get a partial slab from target node with GFP_NOWAIT |
>>   __GFP_THISNODE opportunistically.
> 
> Did we check the partial lists of that node first for available 
> objects before going to the page allocator?
> 
> get_any_partial() should do that. Maybe it is not called in the 
> kmalloc_node case.

Yes, get_any_partial() is currently skipped for requests of numa node
different from NUMA_NO_NODE.

I think it's a useful tradeoff to first try to satisfy the node preference
with a GFP_NOWAIT allocation. If it succeeds, the target node is not
overloaded, we get the page from the desired node and further allocations
for the same node will not deactivate it. If it doesn't succeed then we
indeed fall back to slabs on partial lists of other nodes before wastefully
allocating new pages from the other nodes, which addresses the scenario that
motivated this patch.

Christoph Lameter (Ampere) April 8, 2024, 6:17 p.m. UTC | #4

On Mon, 8 Apr 2024, Vlastimil Babka wrote:

> On 4/5/24 6:50 PM, Christoph Lameter (Ampere) wrote:
>> On Sat, 30 Mar 2024, Chen Jun wrote:
>>
>>> When kmalloc_node() is called without __GFP_THISNODE and the target node
>>> lacks sufficient memory, SLUB allocates a folio from a different node
>>> other than the requested node, instead of taking a partial slab from it.
>>
>> Hmmm... This would mean that we do not consult the partial lists of the
>> other nodes. That is something to be fixed in the allocator.
>
> Which allocator? If you mean SLUB, this patch fixes it. If you mean page
> allocator, I don't see how.

The SLUB allocator of course. And the patch does not fix it. It tries to 
convince the page allocator to give us a folio from the right node.

That kind of activity can be controlled within the page allocator via the 
node reclaim setting. No point in doing multiple calls into the page 
allocator.

>>> However, since the allocated folio does not belong to the requested
>>> node, it is deactivated and added to the partial slab list of the node
>>> it belongs to.
>>
>> That should only occur if a request for an object for node X follows a
>> request for an object from node Y.
>
> Are you sure? I think it's a stream of requests for node X happening on a
> cpu of node Y, AFAICS the first attempt will allocate the slab page from
> node different than X (possibly node Y because it's local and has pages
> available unlike node X which is full). It does get installed as the cpu
> slab, but then the next request is also for node X, so the node matching
> checks make the slab deactivate and allocate a new one.

Then there is something broken in the cpuslab logic.

The first request of CPU C for memory from node X should lead to:

1. deactivation of current cpu slab if it is not from node X
2. retrieval of a slab from node X and activation of that slab as cpuslab
3. Return of an object from that slab and therefore from node X.

Further allocations should be caught by the hot path, where we realize that 
there is a request for node X and the current cpuslab is from node X and 
therefore fastpath logic can be used to retrieve the next object.
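
The node check that decides between the fast path and deactivation can
be modelled roughly as below. This is a simplified userspace sketch:
model_node_match() is a hypothetical name, and the slab's node id is
passed in directly instead of being looked up from a slab.

```c
#include <assert.h>
#include <stdbool.h>

#define NUMA_NO_NODE (-1)

/*
 * Return true when the current cpu slab may serve the request:
 * either no particular node was requested, or the slab lives on the
 * requested node. A false result means the cpu slab gets deactivated.
 */
bool model_node_match(int slab_nid, int requested_node)
{
	assert(slab_nid >= 0);

	if (requested_node != NUMA_NO_NODE && slab_nid != requested_node)
		return false;
	return true;
}
```

A cpu slab from node 0 fails this check for every request naming node 3,
no matter how many free objects it still holds.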

>> get_any_partial() should do that. Maybe it is not called in the
>> kmalloc_node case.
>
> Yes, get_any_partial() is currently skipped for requests of numa node
> different from NUMA_NO_NODE.

Maybe we can use that function after checking that the page allocator is 
over the watermark on the node that we were wanting to allocate from. That 
check should be fast.

> I think it's a useful tradeoff to first try to satisfy the node preference with
> a GFP_NOWAIT allocation. If it succeeds, the target node is not overloaded,
> we get the page from the desired node and further allocations will of the
> same node will not deactivate it. If it doesn't succeed then we indeed
> fallback to slabs on partial list from other nodes before wastefully
> allocating new pages from the other nodes, which addresses the scenario that
> motivated this patch.

There are also the memory policies etc. to consider. For example, for the 
interleave policy the pages must come from different nodes in sequence to 
properly balance the allocations over multiple NUMA nodes. There are cases 
in which the allocations are forced to specific sets of nodes or where a 
node is preferred but fallback to local should occur.

If you now do multiple page allocator calls then the NUMA interleave 
policy etc. may no longer work. I have not looked too deep into those.

Vlastimil Babka April 9, 2024, 6:16 a.m. UTC | #5

On 4/8/24 8:17 PM, Christoph Lameter (Ampere) wrote:
> On Mon, 8 Apr 2024, Vlastimil Babka wrote:
> 
>> On 4/5/24 6:50 PM, Christoph Lameter (Ampere) wrote:
>>> On Sat, 30 Mar 2024, Chen Jun wrote:
>>>
>>>> When kmalloc_node() is called without __GFP_THISNODE and the target node
>>>> lacks sufficient memory, SLUB allocates a folio from a different node
>>>> other than the requested node, instead of taking a partial slab from it.
>>>
>>> Hmmm... This would mean that we do not consult the partial lists of the
>>> other nodes. That is something to be fixed in the allocator.
>>
>> Which allocator? If you mean SLUB, this patch fixes it. If you mean page
>> allocator, I don't see how.
> 
> 
> The SLUB allocator of course. And the patch does not fix it. It tries to 
> convince the page allocator to give us a folio from the right node.

The patch primarily makes slub use its partial lists before going to the
page allocator. The "give us a folio from the right node" is not
"convincing" but an opportunistic request.

> That kind of activity can be controlled within the page allocator via the 
> node reclaim setting. No point in doing multiple calls into the page 
> allocator.

That's assuming there's something to reclaim on that overloaded and
requested node in the first place. But yeah, such an unbalanced system will
likely have multiple issues and slub wouldn't be the only place where they
manifest. But if we can easily remove the pathological slub behavior on such
a system, we should.

>>>> However, since the allocated folio does not belong to the requested
>>>> node, it is deactivated and added to the partial slab list of the node
>>>> it belongs to.
>>>
>>> That should only occur if a request for an object for node X follows a
>>> request for an object from node Y.
>>
>> Are you sure? I think it's a stream of requests for node X happening on a
>> cpu of node Y, AFAICS the first attempt will allocate the slab page from
>> node different than X (possibly node Y because it's local and has pages
>> available unlike node X which is full). It does get installed as the cpu
>> slab, but then the next request is also for node X, so the node matching
>> checks make the slab deactivate and allocate a new one.
> 
> Then there is something broken in the cpuslab logic.
> 
> The first request of CPU C for memory from node X should lead to:
> 
> 1. deactivation of current cpu slab if it is not from node X
> 2. retrieval of a slab from node X and activation of that slab as cpuslab
> 3. Return of an object from that slab and therefore from node X.
> 
> Further allocation should be caught by the hotpatch where we realize that 
> there is a request from node X and the current cpuslab is from node X and 
> therefore fastpath logic can be used to retrieve the next object.

Yes and that logic AFAIK works. But here we are addressing a situation where
we won't get a slab from node X because it's just full.

>>> get_any_partial() should do that. Maybe it is not called in the
>>> kmalloc_node case.
>>
>> Yes, get_any_partial() is currently skipped for requests of numa node
>> different from NUMA_NO_NODE.
> 
> Maybe we can use that function after checking that the page allocator is 
> over the watermark on the node that we were wanting to allocate from. That 
> check should be fast.

A GFP_NOWAIT | __GFP_THISNODE attempt is basically that check. Maybe it
could be made a bit faster, but then we would either duplicate code or fail
to handle some corner case that the full attempt handles. This is not a fast
path and not a common case (kmalloc_node() vs kmalloc()), so to me it seems
better to call the page allocator.

>> I think it's a useful tradeoff to first try to satisfy the node preference with
>> a GFP_NOWAIT allocation. If it succeeds, the target node is not overloaded,
>> we get the page from the desired node and further allocations will of the
>> same node will not deactivate it. If it doesn't succeed then we indeed
>> fallback to slabs on partial list from other nodes before wastefully
>> allocating new pages from the other nodes, which addresses the scenario that
>> motivated this patch.
> 
> There are also the memory policies etc to consider. F.e. for the 
> interleave policy the pages must come from different nodes in sequence to 
> properly balance the allocations over multiple NUMA nodes. There are cases 
> in which the allocations are forced to specific sets of nodes or where a 
> node is preferred but fallback to local should occur.
> 
> If you now do multiple page allocator calls then the NUMA interleave 
> policy etc etc may no longer work. I have not looked to deep into those.

Yeah there are policies and there are kmalloc_node() calls with a specific
node. If they are incompatible, what should happen? Arguably kmalloc_node()
should win as it's a more specific call than a per-process policy?
I think as long as there's memory available on all nodes, things will
continue working fine and respecting policies. In the corner case we are
addressing, where a node is overloaded, observing policies becomes
infeasible anyway.

Let's look at the steps again:
+	 * 1) try to get a partial slab from target node with GFP_NOWAIT |
+	 *    __GFP_THISNODE opportunistically.

This was always the first step in get_partial() anyway.

+	 * 2) if 1) failed, try to allocate a new slab from target node with
+	 *    GFP_NOWAIT | __GFP_THISNODE opportunistically too.

This will try to satisfy the kmalloc_node() preferred node, which should be
more important than a policy.

+	 * 3) if 2) failed, retry 1) and 2) with original gfpflags.

This step, where we attempt to allocate memory from any node, uses the
original gfpflags.

This is where we reuse existing slabs on the partial lists of other nodes,
as we do for all allocations before resorting to the page allocator. So
that's unchanged. If that fails, we go to the page allocator with the
original flags, so if there are any policies, the page allocator will
satisfy them.
