mm/slub: improve count_partial() for CONFIG_SLUB_CPU_PARTIAL
diff mbox series

Message ID 20200222092428.99488-1-wenyang@linux.alibaba.com
State New
Headers show
Series
  • mm/slub: improve count_partial() for CONFIG_SLUB_CPU_PARTIAL
Related show

Commit Message

Wen Yang Feb. 22, 2020, 9:24 a.m. UTC
In the cloud server scenario, reading "/proc/slabinfo" can possibily
block the slab allocation on another CPU for a while, 200ms in extreme
cases. If the slab object is to carry network packet, targeting the
far-end disk array, it causes block IO jitter issues.

This is because the list_lock, which protecting the node partial list,
is taken when couting the free objects resident in that list. It introduces
locking contention when the page(s) is moved between CPU and node partial
lists in allocation path on another CPU.

We also observed that in this scenario, CONFIG_SLUB_CPU_PARTIAL is turned
on by default, and count_partial() is useless because the returned number
is far from the reality.

Therefore, we can simply return 0, then nr_free is also 0, and eventually
active_objects == total_objects. We do not introduce any regression, and
it's preferable to show the unrealistic uniform 100% slab utilization
rather than some very high but incorrect value.

Co-developed-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Xunlei Pang <xlpang@linux.alibaba.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/slub.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Christopher Lameter Feb. 24, 2020, 1:29 a.m. UTC | #1
On Sat, 22 Feb 2020, Wen Yang wrote:

> We also observed that in this scenario, CONFIG_SLUB_CPU_PARTIAL is turned
> on by default, and count_partial() is useless because the returned number
> is far from the reality.

Well its not useless. Its just not counting the partial objects in per cpu
partial slabs. Those are counted by a different counter it.

> Therefore, we can simply return 0, then nr_free is also 0, and eventually
> active_objects == total_objects. We do not introduce any regression, and
> it's preferable to show the unrealistic uniform 100% slab utilization
> rather than some very high but incorrect value.

I suggest that you simply use the number of partial slabs and multiply
them by the number of objects in a slab and use that as a value. Both
values are readily available via /sys/kernel/slab/<...>/

You dont need a patch to do that.
Roman Gushchin Feb. 24, 2020, 4:57 p.m. UTC | #2
On Mon, Feb 24, 2020 at 01:29:09AM +0000, Christoph Lameter wrote:
> On Sat, 22 Feb 2020, Wen Yang wrote:

Hello, Christopher!

> 
> > We also observed that in this scenario, CONFIG_SLUB_CPU_PARTIAL is turned
> > on by default, and count_partial() is useless because the returned number
> > is far from the reality.
> 
> Well its not useless. Its just not counting the partial objects in per cpu
> partial slabs. Those are counted by a different counter it.

Do you mean CPU_PARTIAL_ALLOC or something else?

"useless" isn't the most accurate wording, sorry for that.

The point is that the number of active objects displayed in /proc/slabinfo
is misleading if percpu partial lists are used. So it's strange to pay
for it by potentially slowing down concurrent allocations.

> 
> > Therefore, we can simply return 0, then nr_free is also 0, and eventually
> > active_objects == total_objects. We do not introduce any regression, and
> > it's preferable to show the unrealistic uniform 100% slab utilization
> > rather than some very high but incorrect value.
> 
> I suggest that you simply use the number of partial slabs and multiply
> them by the number of objects in a slab and use that as a value. Both
> values are readily available via /sys/kernel/slab/<...>/

So maybe something like this?

@@ -5907,7 +5907,9 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
 	for_each_kmem_cache_node(s, node, n) {
 		nr_slabs += node_nr_slabs(n);
 		nr_objs += node_nr_objs(n);
+#ifndef CONFIG_SLUB_CPU_PARTIAL
 		nr_free += count_partial(n, count_free);
+#endif
 	}
 
 	sinfo->active_objs = nr_objs - nr_free;


Thank you!
Vlastimil Babka Feb. 25, 2020, 3:49 p.m. UTC | #3
On 2/24/20 5:57 PM, Roman Gushchin wrote:
> On Mon, Feb 24, 2020 at 01:29:09AM +0000, Christoph Lameter wrote:
>> On Sat, 22 Feb 2020, Wen Yang wrote:
> 
> Hello, Christopher!
> 
>>
>>> We also observed that in this scenario, CONFIG_SLUB_CPU_PARTIAL is turned
>>> on by default, and count_partial() is useless because the returned number
>>> is far from the reality.
>>
>> Well its not useless. Its just not counting the partial objects in per cpu
>> partial slabs. Those are counted by a different counter it.
> 
> Do you mean CPU_PARTIAL_ALLOC or something else?
> 
> "useless" isn't the most accurate wording, sorry for that.
> 
> The point is that the number of active objects displayed in /proc/slabinfo
> is misleading if percpu partial lists are used. So it's strange to pay
> for it by potentially slowing down concurrent allocations.

Hmm, I wonder... kmem_cache_cpu has those quite detailed stats with
CONFIG_SLUB_STATS. Could perhaps the number of free object be
reconstructed from them by adding up / subtracting the relevant items
across all CPUs? Expensive, but the cost would be taken by the
/proc/slabinfo reader, without blocking anyone.

Then again, CONFIG_SLUB_STATS is disabled by default. But the same
percpu mechanism could be used to create some "stats light" variant that
doesn't count everything, just what's needed to track number of free
objects. Percpu should mean the atomic inc/decs wouldn't cause much
contention...

It's certainly useful to have an idea of slab fragmentation (low inuse
vs total object) from /proc/slabinfo. But if that remains available via
/sys/kernel/slab/ then I guess it's fine... until all continuous
monitoring tools that now read /proc/slabinfo periodically start reading
all those /sys/kernel/slab/ files periodically...
Christopher Lameter Feb. 26, 2020, 6:31 p.m. UTC | #4
On Mon, 24 Feb 2020, Roman Gushchin wrote:

> > I suggest that you simply use the number of partial slabs and multiply
> > them by the number of objects in a slab and use that as a value. Both
> > values are readily available via /sys/kernel/slab/<...>/
>
> So maybe something like this?
>
> @@ -5907,7 +5907,9 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
>  	for_each_kmem_cache_node(s, node, n) {
>  		nr_slabs += node_nr_slabs(n);
>  		nr_objs += node_nr_objs(n);
> +#ifndef CONFIG_SLUB_CPU_PARTIAL
>  		nr_free += count_partial(n, count_free);
> +#endif
>  	}

Why would not having cpu partials screws up the counting of objects in
partial slabs?


You dont need kernel mods for this. The numbers are exposed already in
/sys/kernel/slab/xxx.
Roman Gushchin Feb. 27, 2020, 6:35 p.m. UTC | #5
On Wed, Feb 26, 2020 at 06:31:28PM +0000, Christoph Lameter wrote:
> On Mon, 24 Feb 2020, Roman Gushchin wrote:
> 
> > > I suggest that you simply use the number of partial slabs and multiply
> > > them by the number of objects in a slab and use that as a value. Both
> > > values are readily available via /sys/kernel/slab/<...>/
> >
> > So maybe something like this?
> >
> > @@ -5907,7 +5907,9 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
> >  	for_each_kmem_cache_node(s, node, n) {
> >  		nr_slabs += node_nr_slabs(n);
> >  		nr_objs += node_nr_objs(n);
> > +#ifndef CONFIG_SLUB_CPU_PARTIAL
> >  		nr_free += count_partial(n, count_free);
> > +#endif
> >  	}
> 
> Why would not having cpu partials screws up the counting of objects in
> partial slabs?
> 
> 
> You dont need kernel mods for this. The numbers are exposed already in
> /sys/kernel/slab/xxx.

Stepping a little bit back, there are two independent problems:

1) Reading /proc/slabinfo can cause latency spikes in concurrent memory allocations.
   This is the problem which Wen raised initially.

2) The number of active objects as displayed in /proc/slabinfo is misleadingly
   big if CONFIG_SLUB_CPU_PARTIAL is set.

Maybe mixing them in a single workaround wasn't the best idea, but what do you
suggest instead?

Thank you!

Roman
Christopher Lameter March 3, 2020, 4:05 p.m. UTC | #6
On Thu, 27 Feb 2020, Roman Gushchin wrote:

> 1) Reading /proc/slabinfo can cause latency spikes in concurrent memory allocations.
>    This is the problem which Wen raised initially.

Ok. Maybe cache the values or do not read /proc/slabinfo? Use
/sys/kernel/slab/... instead?

> 2) The number of active objects as displayed in /proc/slabinfo is misleadingly
>    big if CONFIG_SLUB_CPU_PARTIAL is set.

Ok that cou.d be fixed by counting the partial slabs when computing
/proc/slabinfo output but that would increase the times even more.

> Maybe mixing them in a single workaround wasn't the best idea, but what do you
> suggest instead?

Read select values from /sys/kernel/slab/.... to determine the values you
need and avoid reading those that cause latency spikes. Use the number of
slabs for example to estimate the number of objects instead of the number
of objects which requires scanning through all the individual slab pages.

Patch
diff mbox series

diff --git a/mm/slub.c b/mm/slub.c
index 17dc00e..d5b7230 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2411,14 +2411,16 @@  static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
 static unsigned long count_partial(struct kmem_cache_node *n,
 					int (*get_count)(struct page *))
 {
-	unsigned long flags;
 	unsigned long x = 0;
+#ifndef CONFIG_SLUB_CPU_PARTIAL
+	unsigned long flags;
 	struct page *page;
 
 	spin_lock_irqsave(&n->list_lock, flags);
 	list_for_each_entry(page, &n->partial, slab_list)
 		x += get_count(page);
 	spin_unlock_irqrestore(&n->list_lock, flags);
+#endif
 	return x;
 }
 #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */