mm/slub: Detach node lock from counting free objects

Message ID	20200201031502.92218-1-wenyang@linux.alibaba.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=ykjP=3V=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org A7AEC20740 From: Wen Yang <wenyang@linux.alibaba.com> To: Christoph Lameter <cl@linux.com>, Pekka Enberg <penberg@kernel.org>, David Rientjes <rientjes@google.com>, Joonsoo Kim <iamjoonsoo.kim@lge.com>, Andrew Morton <akpm@linux-foundation.org> Cc: Wen Yang <wenyang@linux.alibaba.com>, Xunlei Pang <xlpang@linux.alibaba.com>, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: [PATCH] mm/slub: Detach node lock from counting free objects Date: Sat, 1 Feb 2020 11:15:02 +0800 Message-Id: <20200201031502.92218-1-wenyang@linux.alibaba.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm/slub: Detach node lock from counting free objects \| expand mm/slub: Detach node lock from counting free objects

Wen Yang Feb. 1, 2020, 3:15 a.m. UTC

The lock, protecting the node partial list, is taken when couting the free
objects resident in that list. It introduces locking contention when the
page(s) is moved between CPU and node partial lists in allocation path
on another CPU. So reading "/proc/slabinfo" can possibily block the slab
allocation on another CPU for a while, 200ms in extreme cases. If the
slab object is to carry network packet, targeting the far-end disk array,
it causes block IO jitter issue.

This fixes the block IO jitter issue by caching the total inuse objects in
the node in advance. The value is retrieved without taking the node partial
list lock on reading "/proc/slabinfo".

Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Xunlei Pang <xlpang@linux.alibaba.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
---
 mm/slab.h |  1 +
 mm/slub.c | 42 +++++++++++++++++++++++++-----------------
 2 files changed, 26 insertions(+), 17 deletions(-)

Wen Yang Feb. 8, 2020, 3:03 a.m. UTC | #1

Hi,

I would greatly appreciate it if you kindly give me some feedback on this patch.

--

Best wishes,
Wen


On 2020/2/1 11:15 上午, Wen Yang wrote:
> The lock, protecting the node partial list, is taken when couting the free
> objects resident in that list. It introduces locking contention when the
> page(s) is moved between CPU and node partial lists in allocation path
> on another CPU. So reading "/proc/slabinfo" can possibily block the slab
> allocation on another CPU for a while, 200ms in extreme cases. If the
> slab object is to carry network packet, targeting the far-end disk array,
> it causes block IO jitter issue.
>
> This fixes the block IO jitter issue by caching the total inuse objects in
> the node in advance. The value is retrieved without taking the node partial
> list lock on reading "/proc/slabinfo".
>
> Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Xunlei Pang <xlpang@linux.alibaba.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> ---
>   mm/slab.h |  1 +
>   mm/slub.c | 42 +++++++++++++++++++++++++-----------------
>   2 files changed, 26 insertions(+), 17 deletions(-)
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 7e94700aa78c..27d22837f7ff 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -619,6 +619,7 @@ struct kmem_cache_node {
>   #ifdef CONFIG_SLUB_DEBUG
>   	atomic_long_t nr_slabs;
>   	atomic_long_t total_objects;
> +	atomic_long_t total_inuse;
>   	struct list_head full;
>   #endif
>   #endif
> diff --git a/mm/slub.c b/mm/slub.c
> index 503e11b1c4e1..67640e797550 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1060,7 +1060,8 @@ static inline unsigned long node_nr_slabs(struct kmem_cache_node *n)
>   	return atomic_long_read(&n->nr_slabs);
>   }
>   
> -static inline void inc_slabs_node(struct kmem_cache *s, int node, int objects)
> +static inline void inc_slabs_node(struct kmem_cache *s, int node, int objects,
> +				  int inuse)
>   {
>   	struct kmem_cache_node *n = get_node(s, node);
>   
> @@ -1073,14 +1074,17 @@ static inline void inc_slabs_node(struct kmem_cache *s, int node, int objects)
>   	if (likely(n)) {
>   		atomic_long_inc(&n->nr_slabs);
>   		atomic_long_add(objects, &n->total_objects);
> +		atomic_long_add(inuse, &n->total_inuse);
>   	}
>   }
> -static inline void dec_slabs_node(struct kmem_cache *s, int node, int objects)
> +static inline void dec_slabs_node(struct kmem_cache *s, int node, int objects,
> +				  int inuse)
>   {
>   	struct kmem_cache_node *n = get_node(s, node);
>   
>   	atomic_long_dec(&n->nr_slabs);
>   	atomic_long_sub(objects, &n->total_objects);
> +	atomic_long_sub(inuse, &n->total_inuse);
>   }
>   
>   /* Object debug checks for alloc/free paths */
> @@ -1395,9 +1399,11 @@ static inline unsigned long slabs_node(struct kmem_cache *s, int node)
>   static inline unsigned long node_nr_slabs(struct kmem_cache_node *n)
>   							{ return 0; }
>   static inline void inc_slabs_node(struct kmem_cache *s, int node,
> -							int objects) {}
> +							int objects,
> +							int inuse) {}
>   static inline void dec_slabs_node(struct kmem_cache *s, int node,
> -							int objects) {}
> +							int objects,
> +							int inuse) {}
>   
>   #endif /* CONFIG_SLUB_DEBUG */
>   
> @@ -1708,7 +1714,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>   	if (!page)
>   		return NULL;
>   
> -	inc_slabs_node(s, page_to_nid(page), page->objects);
> +	inc_slabs_node(s, page_to_nid(page), page->objects, page->inuse);
>   
>   	return page;
>   }
> @@ -1768,7 +1774,9 @@ static void free_slab(struct kmem_cache *s, struct page *page)
>   
>   static void discard_slab(struct kmem_cache *s, struct page *page)
>   {
> -	dec_slabs_node(s, page_to_nid(page), page->objects);
> +	int inuse = page->objects;
> +
> +	dec_slabs_node(s, page_to_nid(page), page->objects, inuse);
>   	free_slab(s, page);
>   }
>   
> @@ -2396,9 +2404,9 @@ static inline int node_match(struct page *page, int node)
>   }
>   
>   #ifdef CONFIG_SLUB_DEBUG
> -static int count_free(struct page *page)
> +static inline unsigned long node_nr_inuse(struct kmem_cache_node *n)
>   {
> -	return page->objects - page->inuse;
> +	return atomic_long_read(&n->total_inuse);
>   }
>   
>   static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
> @@ -2448,14 +2456,14 @@ slab_out_of_memory(struct kmem_cache *s, gfp_t gfpflags, int nid)
>   	for_each_kmem_cache_node(s, node, n) {
>   		unsigned long nr_slabs;
>   		unsigned long nr_objs;
> -		unsigned long nr_free;
> +		unsigned long nr_inuse;
>   
> -		nr_free  = count_partial(n, count_free);
>   		nr_slabs = node_nr_slabs(n);
>   		nr_objs  = node_nr_objs(n);
> +		nr_inuse = node_nr_inuse(n);
>   
>   		pr_warn("  node %d: slabs: %ld, objs: %ld, free: %ld\n",
> -			node, nr_slabs, nr_objs, nr_free);
> +			node, nr_slabs, nr_objs, nr_objs - nr_inuse);
>   	}
>   #endif
>   }
> @@ -3348,6 +3356,7 @@ init_kmem_cache_node(struct kmem_cache_node *n)
>   #ifdef CONFIG_SLUB_DEBUG
>   	atomic_long_set(&n->nr_slabs, 0);
>   	atomic_long_set(&n->total_objects, 0);
> +	atomic_long_set(&n->total_inuse, 0);
>   	INIT_LIST_HEAD(&n->full);
>   #endif
>   }
> @@ -3411,7 +3420,7 @@ static void early_kmem_cache_node_alloc(int node)
>   	page->frozen = 0;
>   	kmem_cache_node->node[node] = n;
>   	init_kmem_cache_node(n);
> -	inc_slabs_node(kmem_cache_node, node, page->objects);
> +	inc_slabs_node(kmem_cache_node, node, page->objects, page->inuse);
>   
>   	/*
>   	 * No locks need to be taken here as it has just been
> @@ -4857,8 +4866,7 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
>   			if (flags & SO_TOTAL)
>   				x = atomic_long_read(&n->total_objects);
>   			else if (flags & SO_OBJECTS)
> -				x = atomic_long_read(&n->total_objects) -
> -					count_partial(n, count_free);
> +				x = atomic_long_read(&n->total_inuse);
>   			else
>   				x = atomic_long_read(&n->nr_slabs);
>   			total += x;
> @@ -5900,17 +5908,17 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
>   {
>   	unsigned long nr_slabs = 0;
>   	unsigned long nr_objs = 0;
> -	unsigned long nr_free = 0;
> +	unsigned long nr_inuse = 0;
>   	int node;
>   	struct kmem_cache_node *n;
>   
>   	for_each_kmem_cache_node(s, node, n) {
>   		nr_slabs += node_nr_slabs(n);
>   		nr_objs += node_nr_objs(n);
> -		nr_free += count_partial(n, count_free);
> +		nr_inuse += node_nr_inuse(n);
>   	}
>   
> -	sinfo->active_objs = nr_objs - nr_free;
> +	sinfo->active_objs = nr_inuse;
>   	sinfo->num_objs = nr_objs;
>   	sinfo->active_slabs = nr_slabs;
>   	sinfo->num_slabs = nr_slabs;

Christoph Lameter (Ampere) Feb. 8, 2020, 9:41 p.m. UTC | #2

On Sat, 8 Feb 2020, Wen Yang wrote:

> I would greatly appreciate it if you kindly give me some feedback on this
> patch.

I think the measure is too severe given its use and the general impact on code.

Maybe avoid taking the lock or reducing the time a lock is taken when reading /proc/slabinfo is
the best approach?

Also you could cache the value in the userspace application? Why is this
value read continually?

Andrew Morton Feb. 12, 2020, 10:52 p.m. UTC | #3

On Sat,  1 Feb 2020 11:15:02 +0800 Wen Yang <wenyang@linux.alibaba.com> wrote:

> The lock, protecting the node partial list, is taken when couting the free
> objects resident in that list. It introduces locking contention when the
> page(s) is moved between CPU and node partial lists in allocation path
> on another CPU. So reading "/proc/slabinfo" can possibily block the slab
> allocation on another CPU for a while, 200ms in extreme cases. If the
> slab object is to carry network packet, targeting the far-end disk array,
> it causes block IO jitter issue.
> 
> This fixes the block IO jitter issue by caching the total inuse objects in
> the node in advance. The value is retrieved without taking the node partial
> list lock on reading "/proc/slabinfo".
> 
> ...
>
> @@ -1768,7 +1774,9 @@ static void free_slab(struct kmem_cache *s, struct page *page)
>  
>  static void discard_slab(struct kmem_cache *s, struct page *page)
>  {
> -	dec_slabs_node(s, page_to_nid(page), page->objects);
> +	int inuse = page->objects;
> +
> +	dec_slabs_node(s, page_to_nid(page), page->objects, inuse);

Is this right?  dec_slabs_node(..., page->objects, page->objects)?

If no, we could simply pass the page* to inc_slabs_node/dec_slabs_node
and save a function argument.

If yes then why?

>  	free_slab(s, page);
>  }
>

Andrew Morton Feb. 12, 2020, 10:56 p.m. UTC | #4

On Sat, 8 Feb 2020 21:41:49 +0000 (UTC) Christopher Lameter <cl@linux.com> wrote:

> On Sat, 8 Feb 2020, Wen Yang wrote:
> 
> > I would greatly appreciate it if you kindly give me some feedback on this
> > patch.
> 
> I think the measure is too severe given its use and the general impact on code.

Severe in what way?  It's really a quite simple change, although quite
a few edits were needed.

> Maybe avoid taking the lock or reducing the time a lock is taken when reading /proc/slabinfo is
> the best approach?
> 
> Also you could cache the value in the userspace application? Why is this
> value read continually?

: reading "/proc/slabinfo" can possibly block the slab allocation on
: another CPU for a while, 200ms in extreme cases

That was bad of us.  It would be good to stop doing this.

Christoph Lameter (Ampere) Feb. 14, 2020, 2:16 a.m. UTC | #5

On Wed, 12 Feb 2020, Andrew Morton wrote:

> : reading "/proc/slabinfo" can possibly block the slab allocation on
> : another CPU for a while, 200ms in extreme cases
>
> That was bad of us.  It would be good to stop doing this.

The count is not needed for any operations. Just for the slabinfo output.
The value has no operational value for the allocator itself. So why use
extra logic to track it in potentially performance critical paths?

One could estimate the number of objects from the number of allocated
slabs instead?

Wen Yang Feb. 16, 2020, 4:15 a.m. UTC | #6

On 2020/2/13 6:52 上午, Andrew Morton wrote:
> On Sat,  1 Feb 2020 11:15:02 +0800 Wen Yang <wenyang@linux.alibaba.com> wrote:
> 
>> The lock, protecting the node partial list, is taken when couting the free
>> objects resident in that list. It introduces locking contention when the
>> page(s) is moved between CPU and node partial lists in allocation path
>> on another CPU. So reading "/proc/slabinfo" can possibily block the slab
>> allocation on another CPU for a while, 200ms in extreme cases. If the
>> slab object is to carry network packet, targeting the far-end disk array,
>> it causes block IO jitter issue.
>>
>> This fixes the block IO jitter issue by caching the total inuse objects in
>> the node in advance. The value is retrieved without taking the node partial
>> list lock on reading "/proc/slabinfo".
>>
>> ...
>>
>> @@ -1768,7 +1774,9 @@ static void free_slab(struct kmem_cache *s, struct page *page)
>>   
>>   static void discard_slab(struct kmem_cache *s, struct page *page)
>>   {
>> -	dec_slabs_node(s, page_to_nid(page), page->objects);
>> +	int inuse = page->objects;
>> +
>> +	dec_slabs_node(s, page_to_nid(page), page->objects, inuse);
> 
> Is this right?  dec_slabs_node(..., page->objects, page->objects)?
> 
> If no, we could simply pass the page* to inc_slabs_node/dec_slabs_node
> and save a function argument.
> 
> If yes then why?
> 

Thanks for your comments.
We are happy to improve this patch based on your suggestions.

When the user reads /proc/slabinfo, in order to obtain the active_objs 
information, the kernel traverses all slabs and executes the following 
code snippet:
static unsigned long count_partial(struct kmem_cache_node *n,
                                         int (*get_count)(struct page *))
{
         unsigned long flags;
         unsigned long x = 0;
         struct page *page;

         spin_lock_irqsave(&n->list_lock, flags);
         list_for_each_entry(page, &n->partial, slab_list)
                 x += get_count(page);
         spin_unlock_irqrestore(&n->list_lock, flags);
         return x;
}

It may cause performance issues.

Christoph suggested "you could cache the value in the userspace 
application? Why is this value read continually?", But reading the 
/proc/slabinfo is initiated by the user program. As a cloud provider, we 
cannot control user behavior. If a user program inadvertently executes 
cat /proc/slabinfo, it may affect other user programs.

As Christoph said: "The count is not needed for any operations. Just for 
the slabinfo output. The value has no operational value for the 
allocator itself. So why use extra logic to track it in potentially 
performance critical paths?"

In this way, could we show the approximate value of active_objs in the 
/proc/slabinfo?

Based on the following information:
In the discard_slab() function, page->inuse is equal to page->total_objects;
In the allocate_slab() function, page->inuse is also equal to 
page->total_objects (with one exception: for kmem_cache_node, page-> 
inuse equals 1);
page->inuse will only change continuously when the obj is constantly 
allocated or released. (This should be the performance critical path 
emphasized by Christoph)

When users query the global slabinfo information, we may use 
total_objects to approximate active_objs.

In this way, the modified patch is as follows:

diff --git a/mm/slub.c b/mm/slub.c
index a0b335d..ef0e6ac 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5900,17 +5900,15 @@ void get_slabinfo(struct kmem_cache *s, struct 
slabinfo *sinfo)
  {
         unsigned long nr_slabs = 0;
         unsigned long nr_objs = 0;
-       unsigned long nr_free = 0;
         int node;
         struct kmem_cache_node *n;

         for_each_kmem_cache_node(s, node, n) {
                 nr_slabs += node_nr_slabs(n);
                 nr_objs += node_nr_objs(n);
-               nr_free += count_partial(n, count_free);
         }

-       sinfo->active_objs = nr_objs - nr_free;
+       sinfo->active_objs = nr_objs;
         sinfo->num_objs = nr_objs;
         sinfo->active_slabs = nr_slabs;
         sinfo->num_slabs = nr_slabs;

In addition, when the user really needs to view the precise active_obj 
value of a slab, he can query this single slab info through an interface 
similar to the following, which avoids traversing all the slabs.

# cat /sys/kernel/slab/kmalloc-512/total_objects
1472 N0=1472
# cat /sys/kernel/slab/kmalloc-512/objects
1311 N0=1311

or
# cat /sys/kernel/slab/kmalloc-8k/total_objects
60 N0=60
# cat /sys/kernel/slab/kmalloc-8k/objects
60 N0=60

Best wishes,
Wen

>>   	free_slab(s, page);
>>   }
>>

Roman Gushchin Feb. 18, 2020, 8:53 p.m. UTC | #7

On Sun, Feb 16, 2020 at 12:15:54PM +0800, Wen Yang wrote:
> 
> 
> On 2020/2/13 6:52 上午, Andrew Morton wrote:
> > On Sat,  1 Feb 2020 11:15:02 +0800 Wen Yang <wenyang@linux.alibaba.com> wrote:
> > 
> > > The lock, protecting the node partial list, is taken when couting the free
> > > objects resident in that list. It introduces locking contention when the
> > > page(s) is moved between CPU and node partial lists in allocation path
> > > on another CPU. So reading "/proc/slabinfo" can possibily block the slab
> > > allocation on another CPU for a while, 200ms in extreme cases. If the
> > > slab object is to carry network packet, targeting the far-end disk array,
> > > it causes block IO jitter issue.
> > > 
> > > This fixes the block IO jitter issue by caching the total inuse objects in
> > > the node in advance. The value is retrieved without taking the node partial
> > > list lock on reading "/proc/slabinfo".
> > > 
> > > ...
> > > 
> > > @@ -1768,7 +1774,9 @@ static void free_slab(struct kmem_cache *s, struct page *page)
> > >   static void discard_slab(struct kmem_cache *s, struct page *page)
> > >   {
> > > -	dec_slabs_node(s, page_to_nid(page), page->objects);
> > > +	int inuse = page->objects;
> > > +
> > > +	dec_slabs_node(s, page_to_nid(page), page->objects, inuse);
> > 
> > Is this right?  dec_slabs_node(..., page->objects, page->objects)?
> > 
> > If no, we could simply pass the page* to inc_slabs_node/dec_slabs_node
> > and save a function argument.
> > 
> > If yes then why?
> > 
> 
> Thanks for your comments.
> We are happy to improve this patch based on your suggestions.
> 
> 
> When the user reads /proc/slabinfo, in order to obtain the active_objs
> information, the kernel traverses all slabs and executes the following code
> snippet:
> static unsigned long count_partial(struct kmem_cache_node *n,
>                                         int (*get_count)(struct page *))
> {
>         unsigned long flags;
>         unsigned long x = 0;
>         struct page *page;
> 
>         spin_lock_irqsave(&n->list_lock, flags);
>         list_for_each_entry(page, &n->partial, slab_list)
>                 x += get_count(page);
>         spin_unlock_irqrestore(&n->list_lock, flags);
>         return x;
> }
> 
> It may cause performance issues.
> 
> Christoph suggested "you could cache the value in the userspace application?
> Why is this value read continually?", But reading the /proc/slabinfo is
> initiated by the user program. As a cloud provider, we cannot control user
> behavior. If a user program inadvertently executes cat /proc/slabinfo, it
> may affect other user programs.
> 
> As Christoph said: "The count is not needed for any operations. Just for the
> slabinfo output. The value has no operational value for the allocator
> itself. So why use extra logic to track it in potentially performance
> critical paths?"
> 
> In this way, could we show the approximate value of active_objs in the
> /proc/slabinfo?
> 
> Based on the following information:
> In the discard_slab() function, page->inuse is equal to page->total_objects;
> In the allocate_slab() function, page->inuse is also equal to
> page->total_objects (with one exception: for kmem_cache_node, page-> inuse
> equals 1);
> page->inuse will only change continuously when the obj is constantly
> allocated or released. (This should be the performance critical path
> emphasized by Christoph)
> 
> When users query the global slabinfo information, we may use total_objects
> to approximate active_objs.

Well, from one point of view, it makes no sense, because the ratio between
these two numbers is very meaningful: it's the slab utilization rate.

On the other side, with enabled per-cpu partial lists active_objs has
nothing to do with the reality anyway, so I agree with you, calling
count_partial() is almost useless.

That said, I wonder if the right thing to do is something like the patch below?

Thanks!

Roman

--

diff --git a/mm/slub.c b/mm/slub.c
index 1d644143f93e..ba0505e75ecc 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2411,14 +2411,16 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
 static unsigned long count_partial(struct kmem_cache_node *n,
                                        int (*get_count)(struct page *))
 {
-       unsigned long flags;
        unsigned long x = 0;
+#ifdef CONFIG_SLUB_CPU_PARTIAL
+       unsigned long flags;
        struct page *page;
 
        spin_lock_irqsave(&n->list_lock, flags);
        list_for_each_entry(page, &n->partial, slab_list)
                x += get_count(page);
        spin_unlock_irqrestore(&n->list_lock, flags);
+#endif
        return x;
 }
 #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */

Wen Yang Feb. 20, 2020, 1:53 p.m. UTC | #8

On 2020/2/19 4:53 上午, Roman Gushchin wrote:
> On Sun, Feb 16, 2020 at 12:15:54PM +0800, Wen Yang wrote:
>>
>>
>> On 2020/2/13 6:52 上午, Andrew Morton wrote:
>>> On Sat,  1 Feb 2020 11:15:02 +0800 Wen Yang <wenyang@linux.alibaba.com> wrote:
>>>
>>>> The lock, protecting the node partial list, is taken when couting the free
>>>> objects resident in that list. It introduces locking contention when the
>>>> page(s) is moved between CPU and node partial lists in allocation path
>>>> on another CPU. So reading "/proc/slabinfo" can possibily block the slab
>>>> allocation on another CPU for a while, 200ms in extreme cases. If the
>>>> slab object is to carry network packet, targeting the far-end disk array,
>>>> it causes block IO jitter issue.
>>>>
>>>> This fixes the block IO jitter issue by caching the total inuse objects in
>>>> the node in advance. The value is retrieved without taking the node partial
>>>> list lock on reading "/proc/slabinfo".
>>>>
>>>> ...
>>>>
>>>> @@ -1768,7 +1774,9 @@ static void free_slab(struct kmem_cache *s, struct page *page)
>>>>    static void discard_slab(struct kmem_cache *s, struct page *page)
>>>>    {
>>>> -	dec_slabs_node(s, page_to_nid(page), page->objects);
>>>> +	int inuse = page->objects;
>>>> +
>>>> +	dec_slabs_node(s, page_to_nid(page), page->objects, inuse);
>>>
>>> Is this right?  dec_slabs_node(..., page->objects, page->objects)?
>>>
>>> If no, we could simply pass the page* to inc_slabs_node/dec_slabs_node
>>> and save a function argument.
>>>
>>> If yes then why?
>>>
>>
>> Thanks for your comments.
>> We are happy to improve this patch based on your suggestions.
>>
>>
>> When the user reads /proc/slabinfo, in order to obtain the active_objs
>> information, the kernel traverses all slabs and executes the following code
>> snippet:
>> static unsigned long count_partial(struct kmem_cache_node *n,
>>                                          int (*get_count)(struct page *))
>> {
>>          unsigned long flags;
>>          unsigned long x = 0;
>>          struct page *page;
>>
>>          spin_lock_irqsave(&n->list_lock, flags);
>>          list_for_each_entry(page, &n->partial, slab_list)
>>                  x += get_count(page);
>>          spin_unlock_irqrestore(&n->list_lock, flags);
>>          return x;
>> }
>>
>> It may cause performance issues.
>>
>> Christoph suggested "you could cache the value in the userspace application?
>> Why is this value read continually?", But reading the /proc/slabinfo is
>> initiated by the user program. As a cloud provider, we cannot control user
>> behavior. If a user program inadvertently executes cat /proc/slabinfo, it
>> may affect other user programs.
>>
>> As Christoph said: "The count is not needed for any operations. Just for the
>> slabinfo output. The value has no operational value for the allocator
>> itself. So why use extra logic to track it in potentially performance
>> critical paths?"
>>
>> In this way, could we show the approximate value of active_objs in the
>> /proc/slabinfo?
>>
>> Based on the following information:
>> In the discard_slab() function, page->inuse is equal to page->total_objects;
>> In the allocate_slab() function, page->inuse is also equal to
>> page->total_objects (with one exception: for kmem_cache_node, page-> inuse
>> equals 1);
>> page->inuse will only change continuously when the obj is constantly
>> allocated or released. (This should be the performance critical path
>> emphasized by Christoph)
>>
>> When users query the global slabinfo information, we may use total_objects
>> to approximate active_objs.
> 
> Well, from one point of view, it makes no sense, because the ratio between
> these two numbers is very meaningful: it's the slab utilization rate.
> 
> On the other side, with enabled per-cpu partial lists active_objs has
> nothing to do with the reality anyway, so I agree with you, calling
> count_partial() is almost useless.
> 
> That said, I wonder if the right thing to do is something like the patch below?
> 
> Thanks!
> 
> Roman
> 
> --
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 1d644143f93e..ba0505e75ecc 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2411,14 +2411,16 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
>   static unsigned long count_partial(struct kmem_cache_node *n,
>                                          int (*get_count)(struct page *))
>   {
> -       unsigned long flags;
>          unsigned long x = 0;
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
> +       unsigned long flags;
>          struct page *page;
>   
>          spin_lock_irqsave(&n->list_lock, flags);
>          list_for_each_entry(page, &n->partial, slab_list)
>                  x += get_count(page);
>          spin_unlock_irqrestore(&n->list_lock, flags);
> +#endif
>          return x;
>   }
>   #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */
> 

Hi Roman,

Thanks for your comments.

In the server scenario, SLUB_CPU_PARTIAL is turned on by default, and 
can improve the performance of the cloud server, as follows:

   default y
   depends on SLUB && SMP
   bool "SLUB per cpu partial cache"
   help
     Per cpu partial caches accelerate objects allocation and freeing
     that is local to a processor at the price of more indeterminism
     in the latency of the free. On overflow these caches will be cleared
     which requires the taking of locks that may cause latency spikes.
     Typically one would choose no for a realtime system.


Best wishes,
Wen

Roman Gushchin Feb. 20, 2020, 3:40 p.m. UTC | #9

On Thu, Feb 20, 2020 at 09:53:26PM +0800, Wen Yang wrote:
> 
> 
> On 2020/2/19 4:53 上午, Roman Gushchin wrote:
> > On Sun, Feb 16, 2020 at 12:15:54PM +0800, Wen Yang wrote:
> > > 
> > > 
> > > On 2020/2/13 6:52 上午, Andrew Morton wrote:
> > > > On Sat,  1 Feb 2020 11:15:02 +0800 Wen Yang <wenyang@linux.alibaba.com> wrote:
> > > > 
> > > > > The lock, protecting the node partial list, is taken when couting the free
> > > > > objects resident in that list. It introduces locking contention when the
> > > > > page(s) is moved between CPU and node partial lists in allocation path
> > > > > on another CPU. So reading "/proc/slabinfo" can possibily block the slab
> > > > > allocation on another CPU for a while, 200ms in extreme cases. If the
> > > > > slab object is to carry network packet, targeting the far-end disk array,
> > > > > it causes block IO jitter issue.
> > > > > 
> > > > > This fixes the block IO jitter issue by caching the total inuse objects in
> > > > > the node in advance. The value is retrieved without taking the node partial
> > > > > list lock on reading "/proc/slabinfo".
> > > > > 
> > > > > ...
> > > > > 
> > > > > @@ -1768,7 +1774,9 @@ static void free_slab(struct kmem_cache *s, struct page *page)
> > > > >    static void discard_slab(struct kmem_cache *s, struct page *page)
> > > > >    {
> > > > > -	dec_slabs_node(s, page_to_nid(page), page->objects);
> > > > > +	int inuse = page->objects;
> > > > > +
> > > > > +	dec_slabs_node(s, page_to_nid(page), page->objects, inuse);
> > > > 
> > > > Is this right?  dec_slabs_node(..., page->objects, page->objects)?
> > > > 
> > > > If no, we could simply pass the page* to inc_slabs_node/dec_slabs_node
> > > > and save a function argument.
> > > > 
> > > > If yes then why?
> > > > 
> > > 
> > > Thanks for your comments.
> > > We are happy to improve this patch based on your suggestions.
> > > 
> > > 
> > > When the user reads /proc/slabinfo, in order to obtain the active_objs
> > > information, the kernel traverses all slabs and executes the following code
> > > snippet:
> > > static unsigned long count_partial(struct kmem_cache_node *n,
> > >                                          int (*get_count)(struct page *))
> > > {
> > >          unsigned long flags;
> > >          unsigned long x = 0;
> > >          struct page *page;
> > > 
> > >          spin_lock_irqsave(&n->list_lock, flags);
> > >          list_for_each_entry(page, &n->partial, slab_list)
> > >                  x += get_count(page);
> > >          spin_unlock_irqrestore(&n->list_lock, flags);
> > >          return x;
> > > }
> > > 
> > > It may cause performance issues.
> > > 
> > > Christoph suggested "you could cache the value in the userspace application?
> > > Why is this value read continually?", But reading the /proc/slabinfo is
> > > initiated by the user program. As a cloud provider, we cannot control user
> > > behavior. If a user program inadvertently executes cat /proc/slabinfo, it
> > > may affect other user programs.
> > > 
> > > As Christoph said: "The count is not needed for any operations. Just for the
> > > slabinfo output. The value has no operational value for the allocator
> > > itself. So why use extra logic to track it in potentially performance
> > > critical paths?"
> > > 
> > > In this way, could we show the approximate value of active_objs in the
> > > /proc/slabinfo?
> > > 
> > > Based on the following information:
> > > In the discard_slab() function, page->inuse is equal to page->total_objects;
> > > In the allocate_slab() function, page->inuse is also equal to
> > > page->total_objects (with one exception: for kmem_cache_node, page-> inuse
> > > equals 1);
> > > page->inuse will only change continuously when the obj is constantly
> > > allocated or released. (This should be the performance critical path
> > > emphasized by Christoph)
> > > 
> > > When users query the global slabinfo information, we may use total_objects
> > > to approximate active_objs.
> > 
> > Well, from one point of view, it makes no sense, because the ratio between
> > these two numbers is very meaningful: it's the slab utilization rate.
> > 
> > On the other side, with enabled per-cpu partial lists active_objs has
> > nothing to do with the reality anyway, so I agree with you, calling
> > count_partial() is almost useless.
> > 
> > That said, I wonder if the right thing to do is something like the patch below?
> > 
> > Thanks!
> > 
> > Roman
> > 
> > --
> > 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 1d644143f93e..ba0505e75ecc 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2411,14 +2411,16 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
> >   static unsigned long count_partial(struct kmem_cache_node *n,
> >                                          int (*get_count)(struct page *))
> >   {
> > -       unsigned long flags;
> >          unsigned long x = 0;
> > +#ifdef CONFIG_SLUB_CPU_PARTIAL
> > +       unsigned long flags;
> >          struct page *page;
> >          spin_lock_irqsave(&n->list_lock, flags);
> >          list_for_each_entry(page, &n->partial, slab_list)
> >                  x += get_count(page);
> >          spin_unlock_irqrestore(&n->list_lock, flags);
> > +#endif
> >          return x;
> >   }
> >   #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */
> > 
> 
> Hi Roman,
> 
> Thanks for your comments.
> 
> In the server scenario, SLUB_CPU_PARTIAL is turned on by default, and can
> improve the performance of the cloud server, as follows:

Hello, Wen!

That's exactly my point: if CONFIG_SLUB_CPU_PARTIAL is on, count_partial() is useless
anyway because the returned number is far from the reality. So if we define
active_objects == total_objects, as you basically suggest, we do not introduce any
regression. Actually I think it's even preferable to show the unrealistic uniform 100%
slab utilization rather than some very high but incorrect value.

And on real-time systems uncontrolled readings of /proc/slabinfo is less
of a concern, I hope.

Thank you!

Wen Yang Feb. 22, 2020, 6:55 a.m. UTC | #10

On 2020/2/20 11:40 下午, Roman Gushchin wrote:
> On Thu, Feb 20, 2020 at 09:53:26PM +0800, Wen Yang wrote:
>>
>>
>> On 2020/2/19 4:53 上午, Roman Gushchin wrote:
>>> On Sun, Feb 16, 2020 at 12:15:54PM +0800, Wen Yang wrote:
>>>>
>>>>
>>>> On 2020/2/13 6:52 上午, Andrew Morton wrote:
>>>>> On Sat,  1 Feb 2020 11:15:02 +0800 Wen Yang <wenyang@linux.alibaba.com> wrote:
>>>>>
>>>>>> The lock, protecting the node partial list, is taken when couting the free
>>>>>> objects resident in that list. It introduces locking contention when the
>>>>>> page(s) is moved between CPU and node partial lists in allocation path
>>>>>> on another CPU. So reading "/proc/slabinfo" can possibily block the slab
>>>>>> allocation on another CPU for a while, 200ms in extreme cases. If the
>>>>>> slab object is to carry network packet, targeting the far-end disk array,
>>>>>> it causes block IO jitter issue.
>>>>>>
>>>>>> This fixes the block IO jitter issue by caching the total inuse objects in
>>>>>> the node in advance. The value is retrieved without taking the node partial
>>>>>> list lock on reading "/proc/slabinfo".
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> @@ -1768,7 +1774,9 @@ static void free_slab(struct kmem_cache *s, struct page *page)
>>>>>>     static void discard_slab(struct kmem_cache *s, struct page *page)
>>>>>>     {
>>>>>> -	dec_slabs_node(s, page_to_nid(page), page->objects);
>>>>>> +	int inuse = page->objects;
>>>>>> +
>>>>>> +	dec_slabs_node(s, page_to_nid(page), page->objects, inuse);
>>>>>
>>>>> Is this right?  dec_slabs_node(..., page->objects, page->objects)?
>>>>>
>>>>> If no, we could simply pass the page* to inc_slabs_node/dec_slabs_node
>>>>> and save a function argument.
>>>>>
>>>>> If yes then why?
>>>>>
>>>>
>>>> Thanks for your comments.
>>>> We are happy to improve this patch based on your suggestions.
>>>>
>>>>
>>>> When the user reads /proc/slabinfo, in order to obtain the active_objs
>>>> information, the kernel traverses all slabs and executes the following code
>>>> snippet:
>>>> static unsigned long count_partial(struct kmem_cache_node *n,
>>>>                                           int (*get_count)(struct page *))
>>>> {
>>>>           unsigned long flags;
>>>>           unsigned long x = 0;
>>>>           struct page *page;
>>>>
>>>>           spin_lock_irqsave(&n->list_lock, flags);
>>>>           list_for_each_entry(page, &n->partial, slab_list)
>>>>                   x += get_count(page);
>>>>           spin_unlock_irqrestore(&n->list_lock, flags);
>>>>           return x;
>>>> }
>>>>
>>>> It may cause performance issues.
>>>>
>>>> Christoph suggested "you could cache the value in the userspace application?
>>>> Why is this value read continually?", But reading the /proc/slabinfo is
>>>> initiated by the user program. As a cloud provider, we cannot control user
>>>> behavior. If a user program inadvertently executes cat /proc/slabinfo, it
>>>> may affect other user programs.
>>>>
>>>> As Christoph said: "The count is not needed for any operations. Just for the
>>>> slabinfo output. The value has no operational value for the allocator
>>>> itself. So why use extra logic to track it in potentially performance
>>>> critical paths?"
>>>>
>>>> In this way, could we show the approximate value of active_objs in the
>>>> /proc/slabinfo?
>>>>
>>>> Based on the following information:
>>>> In the discard_slab() function, page->inuse is equal to page->total_objects;
>>>> In the allocate_slab() function, page->inuse is also equal to
>>>> page->total_objects (with one exception: for kmem_cache_node, page-> inuse
>>>> equals 1);
>>>> page->inuse will only change continuously when the obj is constantly
>>>> allocated or released. (This should be the performance critical path
>>>> emphasized by Christoph)
>>>>
>>>> When users query the global slabinfo information, we may use total_objects
>>>> to approximate active_objs.
>>>
>>> Well, from one point of view, it makes no sense, because the ratio between
>>> these two numbers is very meaningful: it's the slab utilization rate.
>>>
>>> On the other side, with enabled per-cpu partial lists active_objs has
>>> nothing to do with the reality anyway, so I agree with you, calling
>>> count_partial() is almost useless.
>>>
>>> That said, I wonder if the right thing to do is something like the patch below?
>>>
>>> Thanks!
>>>
>>> Roman
>>>
>>> --
>>>
>>> diff --git a/mm/slub.c b/mm/slub.c
>>> index 1d644143f93e..ba0505e75ecc 100644
>>> --- a/mm/slub.c
>>> +++ b/mm/slub.c
>>> @@ -2411,14 +2411,16 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
>>>    static unsigned long count_partial(struct kmem_cache_node *n,
>>>                                           int (*get_count)(struct page *))
>>>    {
>>> -       unsigned long flags;
>>>           unsigned long x = 0;
>>> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>>> +       unsigned long flags;
>>>           struct page *page;
>>>           spin_lock_irqsave(&n->list_lock, flags);
>>>           list_for_each_entry(page, &n->partial, slab_list)
>>>                   x += get_count(page);
>>>           spin_unlock_irqrestore(&n->list_lock, flags);
>>> +#endif
>>>           return x;
>>>    }
>>>    #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */
>>>
>>
>> Hi Roman,
>>
>> Thanks for your comments.
>>
>> In the server scenario, SLUB_CPU_PARTIAL is turned on by default, and can
>> improve the performance of the cloud server, as follows:
> 
> Hello, Wen!
> 
> That's exactly my point: if CONFIG_SLUB_CPU_PARTIAL is on, count_partial() is useless
> anyway because the returned number is far from the reality. So if we define
> active_objects == total_objects, as you basically suggest, we do not introduce any
> regression. Actually I think it's even preferable to show the unrealistic uniform 100%
> slab utilization rather than some very high but incorrect value.
> 
> And on real-time systems uncontrolled readings of /proc/slabinfo is less
> of a concern, I hope.
> 
> Thank you!
> 

Great！
We only need to correct a typo to achieve this goal, as follows:
Change #ifdef CONFIG_SLUB_CPU_PARTIAL to #ifndef CONFIG_SLUB_CPU_PARTIAL

We will continue testing and send the modified patch soon.

Thank you very much.

Roman Gushchin Feb. 24, 2020, 5:01 p.m. UTC | #11

On Sat, Feb 22, 2020 at 02:55:39PM +0800, Wen Yang wrote:
> 
> 
> On 2020/2/20 11:40 下午, Roman Gushchin wrote:
> > On Thu, Feb 20, 2020 at 09:53:26PM +0800, Wen Yang wrote:
> > > 
> > > 
> > > On 2020/2/19 4:53 上午, Roman Gushchin wrote:
> > > > On Sun, Feb 16, 2020 at 12:15:54PM +0800, Wen Yang wrote:
> > > > > 
> > > > > 
> > > > > On 2020/2/13 6:52 上午, Andrew Morton wrote:
> > > > > > On Sat,  1 Feb 2020 11:15:02 +0800 Wen Yang <wenyang@linux.alibaba.com> wrote:
> > > > > > 
> > > > > > > The lock, protecting the node partial list, is taken when couting the free
> > > > > > > objects resident in that list. It introduces locking contention when the
> > > > > > > page(s) is moved between CPU and node partial lists in allocation path
> > > > > > > on another CPU. So reading "/proc/slabinfo" can possibily block the slab
> > > > > > > allocation on another CPU for a while, 200ms in extreme cases. If the
> > > > > > > slab object is to carry network packet, targeting the far-end disk array,
> > > > > > > it causes block IO jitter issue.
> > > > > > > 
> > > > > > > This fixes the block IO jitter issue by caching the total inuse objects in
> > > > > > > the node in advance. The value is retrieved without taking the node partial
> > > > > > > list lock on reading "/proc/slabinfo".
> > > > > > > 
> > > > > > > ...
> > > > > > > 
> > > > > > > @@ -1768,7 +1774,9 @@ static void free_slab(struct kmem_cache *s, struct page *page)
> > > > > > >     static void discard_slab(struct kmem_cache *s, struct page *page)
> > > > > > >     {
> > > > > > > -	dec_slabs_node(s, page_to_nid(page), page->objects);
> > > > > > > +	int inuse = page->objects;
> > > > > > > +
> > > > > > > +	dec_slabs_node(s, page_to_nid(page), page->objects, inuse);
> > > > > > 
> > > > > > Is this right?  dec_slabs_node(..., page->objects, page->objects)?
> > > > > > 
> > > > > > If no, we could simply pass the page* to inc_slabs_node/dec_slabs_node
> > > > > > and save a function argument.
> > > > > > 
> > > > > > If yes then why?
> > > > > > 
> > > > > 
> > > > > Thanks for your comments.
> > > > > We are happy to improve this patch based on your suggestions.
> > > > > 
> > > > > 
> > > > > When the user reads /proc/slabinfo, in order to obtain the active_objs
> > > > > information, the kernel traverses all slabs and executes the following code
> > > > > snippet:
> > > > > static unsigned long count_partial(struct kmem_cache_node *n,
> > > > >                                           int (*get_count)(struct page *))
> > > > > {
> > > > >           unsigned long flags;
> > > > >           unsigned long x = 0;
> > > > >           struct page *page;
> > > > > 
> > > > >           spin_lock_irqsave(&n->list_lock, flags);
> > > > >           list_for_each_entry(page, &n->partial, slab_list)
> > > > >                   x += get_count(page);
> > > > >           spin_unlock_irqrestore(&n->list_lock, flags);
> > > > >           return x;
> > > > > }
> > > > > 
> > > > > It may cause performance issues.
> > > > > 
> > > > > Christoph suggested "you could cache the value in the userspace application?
> > > > > Why is this value read continually?", But reading the /proc/slabinfo is
> > > > > initiated by the user program. As a cloud provider, we cannot control user
> > > > > behavior. If a user program inadvertently executes cat /proc/slabinfo, it
> > > > > may affect other user programs.
> > > > > 
> > > > > As Christoph said: "The count is not needed for any operations. Just for the
> > > > > slabinfo output. The value has no operational value for the allocator
> > > > > itself. So why use extra logic to track it in potentially performance
> > > > > critical paths?"
> > > > > 
> > > > > In this way, could we show the approximate value of active_objs in the
> > > > > /proc/slabinfo?
> > > > > 
> > > > > Based on the following information:
> > > > > In the discard_slab() function, page->inuse is equal to page->total_objects;
> > > > > In the allocate_slab() function, page->inuse is also equal to
> > > > > page->total_objects (with one exception: for kmem_cache_node, page-> inuse
> > > > > equals 1);
> > > > > page->inuse will only change continuously when the obj is constantly
> > > > > allocated or released. (This should be the performance critical path
> > > > > emphasized by Christoph)
> > > > > 
> > > > > When users query the global slabinfo information, we may use total_objects
> > > > > to approximate active_objs.
> > > > 
> > > > Well, from one point of view, it makes no sense, because the ratio between
> > > > these two numbers is very meaningful: it's the slab utilization rate.
> > > > 
> > > > On the other side, with enabled per-cpu partial lists active_objs has
> > > > nothing to do with the reality anyway, so I agree with you, calling
> > > > count_partial() is almost useless.
> > > > 
> > > > That said, I wonder if the right thing to do is something like the patch below?
> > > > 
> > > > Thanks!
> > > > 
> > > > Roman
> > > > 
> > > > --
> > > > 
> > > > diff --git a/mm/slub.c b/mm/slub.c
> > > > index 1d644143f93e..ba0505e75ecc 100644
> > > > --- a/mm/slub.c
> > > > +++ b/mm/slub.c
> > > > @@ -2411,14 +2411,16 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
> > > >    static unsigned long count_partial(struct kmem_cache_node *n,
> > > >                                           int (*get_count)(struct page *))
> > > >    {
> > > > -       unsigned long flags;
> > > >           unsigned long x = 0;
> > > > +#ifdef CONFIG_SLUB_CPU_PARTIAL
> > > > +       unsigned long flags;
> > > >           struct page *page;
> > > >           spin_lock_irqsave(&n->list_lock, flags);
> > > >           list_for_each_entry(page, &n->partial, slab_list)
> > > >                   x += get_count(page);
> > > >           spin_unlock_irqrestore(&n->list_lock, flags);
> > > > +#endif
> > > >           return x;
> > > >    }
> > > >    #endif /* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */
> > > > 
> > > 
> > > Hi Roman,
> > > 
> > > Thanks for your comments.
> > > 
> > > In the server scenario, SLUB_CPU_PARTIAL is turned on by default, and can
> > > improve the performance of the cloud server, as follows:
> > 
> > Hello, Wen!
> > 
> > That's exactly my point: if CONFIG_SLUB_CPU_PARTIAL is on, count_partial() is useless
> > anyway because the returned number is far from the reality. So if we define
> > active_objects == total_objects, as you basically suggest, we do not introduce any
> > regression. Actually I think it's even preferable to show the unrealistic uniform 100%
> > slab utilization rather than some very high but incorrect value.
> > 
> > And on real-time systems uncontrolled readings of /proc/slabinfo is less
> > of a concern, I hope.
> > 
> > Thank you!
> > 
> 
> Great！
> We only need to correct a typo to achieve this goal, as follows:
> Change #ifdef CONFIG_SLUB_CPU_PARTIAL to #ifndef CONFIG_SLUB_CPU_PARTIAL

Yes, you're obviously right.

Thanks!

mm/slub: Detach node lock from counting free objects

Commit Message

Comments

Patch