Message ID | 20210409231842.8840-1-longman@redhat.com (mailing list archive) |
---|---|
Series | mm/memcg: Reduce kmemcache memory accounting overhead |
On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
> With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free().
>
> Workloads that require a lot of kmemcache allocations and de-allocations may therefore experience a performance regression, as illustrated in [1].
>
> With a simple kernel module that performs a repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at module init, the execution times to load the kernel module with and without memory accounting were:
>
> with accounting = 6.798s
> w/o accounting  = 1.758s
>
> That is an increase of 5.04s (287%). With this patchset applied, the execution time became 4.254s. So the memory accounting overhead is now 2.496s, a 50% reduction.

Hi Waiman!

Thank you for working on it, it's indeed very useful!
A couple of questions:
1) did your config include lockdep or not?
2) do you have a (rough) estimate of how much each change contributes to the overall reduction?

Thanks!

> It was found that a major part of the memory accounting overhead is caused by the local_irq_save()/local_irq_restore() sequences used to update the local stock charge bytes and the vmstat array, at least on x86 systems. There are two such sequences in kmem_cache_alloc() and two in kmem_cache_free(). This patchset tries to reduce the use of such sequences as much as possible; in fact, it eliminates them in the common case. Another part of this patchset is to cache the vmstat data updates in the local stock as well, which also helps.
>
> [1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
>
> Waiman Long (5):
>   mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
>   mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
>   mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
>   mm/memcg: Separate out object stock data into its own struct
>   mm/memcg: Optimize user context object stock access
>
>  include/linux/memcontrol.h |  14 ++-
>  mm/memcontrol.c            | 198 ++++++++++++++++++++++++++++++++-----
>  mm/percpu.c                |   9 +-
>  mm/slab.h                  |  32 +++---
>  4 files changed, 195 insertions(+), 58 deletions(-)
>
> --
> 2.18.1
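For readers who want to reproduce numbers like these, here is a minimal sketch of such a benchmark module. It assumes a SLAB_ACCOUNT-flagged cache exercised from inside a non-root memory cgroup; the module name, cache name and timing details are illustrative, since the actual test module Waiman used is not posted in this thread.

```c
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/ktime.h>

static struct kmem_cache *test_cachep;

static int __init slab_bench_init(void)
{
	unsigned long i;
	void *obj;
	ktime_t start = ktime_get();

	/*
	 * SLAB_ACCOUNT enables memcg accounting for this cache; load the
	 * module from inside a non-root memory cgroup to exercise the
	 * accounting path, or drop the flag (or boot with
	 * cgroup.memory=nokmem) to measure the unaccounted baseline.
	 */
	test_cachep = kmem_cache_create("slab_bench_64", 64, 0,
					SLAB_ACCOUNT, NULL);
	if (!test_cachep)
		return -ENOMEM;

	/* Tight alloc/free loop of a single 64-byte object, 100M times. */
	for (i = 0; i < 100000000UL; i++) {
		obj = kmem_cache_alloc(test_cachep, GFP_KERNEL);
		if (!obj)
			break;
		kmem_cache_free(test_cachep, obj);
	}

	pr_info("slab_bench: %lu iterations in %lld ms\n",
		i, ktime_ms_delta(ktime_get(), start));
	return 0;
}

static void __exit slab_bench_exit(void)
{
	kmem_cache_destroy(test_cachep);
}

module_init(slab_bench_init);
module_exit(slab_bench_exit);
MODULE_LICENSE("GPL");
```

Timing the `insmod` (or reading the printed delta) with and without accounting gives numbers comparable in spirit to the ones quoted above, though absolute values obviously depend on the hardware and config.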
On 4/9/21 9:51 PM, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
>> With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free().
>>
>> Workloads that require a lot of kmemcache allocations and de-allocations may therefore experience a performance regression, as illustrated in [1].
>>
>> With a simple kernel module that performs a repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at module init, the execution times to load the kernel module with and without memory accounting were:
>>
>> with accounting = 6.798s
>> w/o accounting  = 1.758s
>>
>> That is an increase of 5.04s (287%). With this patchset applied, the execution time became 4.254s. So the memory accounting overhead is now 2.496s, a 50% reduction.
>
> Hi Waiman!
>
> Thank you for working on it, it's indeed very useful!
> A couple of questions:
> 1) did your config include lockdep or not?

The test kernel is based on a production kernel config, so lockdep isn't enabled.

> 2) do you have a (rough) estimate of how much each change contributes to the overall reduction?

I should have a better breakdown of the effect of the individual patches. I reran the benchmarking module with turbo-boosting disabled to reduce run-to-run variation. The execution times were:

Before patch:  time = 10.800s (with memory accounting), 2.848s (w/o accounting), overhead = 7.952s
After patch 2: time = 9.140s, overhead = 6.292s
After patch 3: time = 7.641s, overhead = 4.793s
After patch 5: time = 6.801s, overhead = 3.953s

Patches 1 & 4 are preparatory patches that should not affect performance.

So the memory accounting overhead was reduced by about half.

Cheers,
Longman
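As context for the breakdown above: patches 2 and 3 reduce how often the irq-save/restore sequences are needed by batching uncharges and caching vmstat updates in the per-cpu stock, while patch 5 ("Optimize user context object stock access") avoids the irq-save/restore pair entirely when the caller runs in task context. The snippet below is only a conceptual sketch of that last idea, not the actual mm/memcontrol.c code; the helper names are invented, and the scheme is only safe on the assumption (which the series' separate-struct patch enables) that task context and interrupt context operate on separate per-cpu copies of the object stock data.

```c
#include <linux/preempt.h>
#include <linux/irqflags.h>

/*
 * Conceptual sketch only -- not the real mm/memcontrol.c implementation.
 * In task context the per-cpu object stock can be protected by
 * preempt_disable() alone, which is much cheaper on x86 than the
 * local_irq_save()/local_irq_restore() pair.  This is only correct if
 * interrupt context updates its own, separate per-cpu stock copy.
 */
static inline unsigned long obj_stock_get(void)
{
	unsigned long flags = 0;

	if (in_task())
		preempt_disable();	/* common case: cheap fast path */
	else
		local_irq_save(flags);	/* hard/soft irq context */
	return flags;
}

static inline void obj_stock_put(unsigned long flags)
{
	if (in_task())
		preempt_enable();
	else
		local_irq_restore(flags);
}
```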
On Mon, Apr 12, 2021 at 10:03:13AM -0400, Waiman Long wrote:
> On 4/9/21 9:51 PM, Roman Gushchin wrote:
> > On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
> > > With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free().
> > >
> > > Workloads that require a lot of kmemcache allocations and de-allocations may therefore experience a performance regression, as illustrated in [1].
> > >
> > > With a simple kernel module that performs a repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at module init, the execution times to load the kernel module with and without memory accounting were:
> > >
> > > with accounting = 6.798s
> > > w/o accounting  = 1.758s
> > >
> > > That is an increase of 5.04s (287%). With this patchset applied, the execution time became 4.254s. So the memory accounting overhead is now 2.496s, a 50% reduction.
> > Hi Waiman!
> >
> > Thank you for working on it, it's indeed very useful!
> > A couple of questions:
> > 1) did your config include lockdep or not?
> The test kernel is based on a production kernel config, so lockdep isn't enabled.
> > 2) do you have a (rough) estimate of how much each change contributes to the overall reduction?
>
> I should have a better breakdown of the effect of the individual patches. I reran the benchmarking module with turbo-boosting disabled to reduce run-to-run variation. The execution times were:
>
> Before patch:  time = 10.800s (with memory accounting), 2.848s (w/o accounting), overhead = 7.952s
> After patch 2: time = 9.140s, overhead = 6.292s
> After patch 3: time = 7.641s, overhead = 4.793s
> After patch 5: time = 6.801s, overhead = 3.953s

Thank you! If there is a v2, I'd include this information in the commit logs.

> Patches 1 & 4 are preparatory patches that should not affect performance.
>
> So the memory accounting overhead was reduced by about half.

This is really great! Thanks!
On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
> With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free().
>
> Workloads that require a lot of kmemcache allocations and de-allocations may therefore experience a performance regression, as illustrated in [1].
>
> With a simple kernel module that performs a repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at module init, the execution times to load the kernel module with and without memory accounting were:
>
> with accounting = 6.798s
> w/o accounting  = 1.758s
>
> That is an increase of 5.04s (287%). With this patchset applied, the execution time became 4.254s. So the memory accounting overhead is now 2.496s, a 50% reduction.

Btw, there were two recent independent reports of benchmark regressions caused by the introduction of the per-object accounting:
1) Xing reported a hackbench regression: https://lkml.org/lkml/2021/1/13/1277
2) Masayoshi reported a pgbench regression: https://www.spinics.net/lists/linux-mm/msg252540.html

I wonder if you can run them (or at least one) and attach the results to the series? It would be very helpful.

Thank you!
On 4/12/21 1:47 PM, Roman Gushchin wrote:
> On Mon, Apr 12, 2021 at 10:03:13AM -0400, Waiman Long wrote:
>> On 4/9/21 9:51 PM, Roman Gushchin wrote:
>>> On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
>>>> With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free().
>>>>
>>>> Workloads that require a lot of kmemcache allocations and de-allocations may therefore experience a performance regression, as illustrated in [1].
>>>>
>>>> With a simple kernel module that performs a repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at module init, the execution times to load the kernel module with and without memory accounting were:
>>>>
>>>> with accounting = 6.798s
>>>> w/o accounting  = 1.758s
>>>>
>>>> That is an increase of 5.04s (287%). With this patchset applied, the execution time became 4.254s. So the memory accounting overhead is now 2.496s, a 50% reduction.
>>> Hi Waiman!
>>>
>>> Thank you for working on it, it's indeed very useful!
>>> A couple of questions:
>>> 1) did your config include lockdep or not?
>> The test kernel is based on a production kernel config, so lockdep isn't enabled.
>>> 2) do you have a (rough) estimate of how much each change contributes to the overall reduction?
>> I should have a better breakdown of the effect of the individual patches. I reran the benchmarking module with turbo-boosting disabled to reduce run-to-run variation. The execution times were:
>>
>> Before patch:  time = 10.800s (with memory accounting), 2.848s (w/o accounting), overhead = 7.952s
>> After patch 2: time = 9.140s, overhead = 6.292s
>> After patch 3: time = 7.641s, overhead = 4.793s
>> After patch 5: time = 6.801s, overhead = 3.953s
> Thank you! If there is a v2, I'd include this information in the commit logs.

Yes, I am planning to send out a v2 with this information in the cover letter. I am just waiting a bit to see if there is more feedback.

-Longman

>
>> Patches 1 & 4 are preparatory patches that should not affect performance.
>>
>> So the memory accounting overhead was reduced by about half.

BTW, the benchmark that I used is kind of a best-case behavior, as all the updates go to the percpu stocks. Real workloads will likely have a certain amount of updates to the memcg charges and vmstats, so the performance benefit will be less.

Cheers,
Longman
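To illustrate why a tight alloc/free loop is the best case: each kmem_cache_free() puts the just-uncharged bytes back into the per-cpu object stock, so the next kmem_cache_alloc() is satisfied from the stock without touching the atomic page_counter or the vmstat arrays. Below is a deliberately simplified sketch of that fast path; the struct and function names are invented, and the real code in mm/memcontrol.c must also handle draining, irq safety and the cached vmstat bytes.

```c
#include <linux/percpu.h>
#include <linux/memcontrol.h>

/* Simplified illustration only -- not the real memcg_stock_pcp code. */
struct obj_stock_sketch {
	struct obj_cgroup *cached_objcg;	/* objcg the bytes were charged to */
	unsigned int nr_bytes;			/* pre-charged bytes on this CPU */
};

static DEFINE_PER_CPU(struct obj_stock_sketch, obj_stock_sketch);

/*
 * Fast path: if the per-CPU stock already holds enough pre-charged bytes
 * for this objcg, charging is a plain subtraction with no atomics.
 * kmem_cache_free() refills the same counter, so a tight alloc/free loop
 * on one small object size stays on this path almost all the time.
 */
static bool obj_stock_try_charge(struct obj_cgroup *objcg, unsigned int nr_bytes)
{
	struct obj_stock_sketch *stock = this_cpu_ptr(&obj_stock_sketch);
	bool charged = false;

	if (stock->cached_objcg == objcg && stock->nr_bytes >= nr_bytes) {
		stock->nr_bytes -= nr_bytes;
		charged = true;
	}
	return charged;	/* false: fall back to the slow charge path */
}
```

Real workloads mix object sizes and cgroups, so the stock is drained and refilled more often and the slow path (atomic charges and vmstat updates) is taken more frequently, which is why the benefit there is smaller than in the benchmark.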
On 4/12/21 3:05 PM, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
>> With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free().
>>
>> Workloads that require a lot of kmemcache allocations and de-allocations may therefore experience a performance regression, as illustrated in [1].
>>
>> With a simple kernel module that performs a repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at module init, the execution times to load the kernel module with and without memory accounting were:
>>
>> with accounting = 6.798s
>> w/o accounting  = 1.758s
>>
>> That is an increase of 5.04s (287%). With this patchset applied, the execution time became 4.254s. So the memory accounting overhead is now 2.496s, a 50% reduction.
> Btw, there were two recent independent reports of benchmark regressions caused by the introduction of the per-object accounting:
> 1) Xing reported a hackbench regression: https://lkml.org/lkml/2021/1/13/1277
> 2) Masayoshi reported a pgbench regression: https://www.spinics.net/lists/linux-mm/msg252540.html
>
> I wonder if you can run them (or at least one) and attach the results to the series? It would be very helpful.

Actually, it was a bug report filed by Masayoshi-san that triggered me to work on reducing the memory accounting overhead. He is also on the cc line and so is aware of this. I will cc Xing on my v2 patch.

Cheers,
Longman