Message ID | 20210409231842.8840-1-longman@redhat.com (mailing list archive) |
---|---|
Series | mm/memcg: Reduce kmemcache memory accounting overhead |
On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
> With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free().
>
> Workloads that require a lot of kmemcache allocations and de-allocations may therefore experience a performance regression, as illustrated in [1].
>
> With a simple kernel module that performs a repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at module init, the execution times to load the kernel module with and without memory accounting were:
>
> with accounting = 6.798s
> w/o accounting  = 1.758s
>
> That is an increase of 5.04s (287%). With this patchset applied, the execution time became 4.254s. So the memory accounting overhead is now 2.496s, a 50% reduction.

Hi Waiman!

Thank you for working on it, it's indeed very useful!
A couple of questions:
1) did your config include lockdep or not?
2) do you have a (rough) estimate of how much each change contributes to the overall reduction?

Thanks!

> It was found that a major part of the memory accounting overhead is caused by the local_irq_save()/local_irq_restore() sequences used to update the local stock charge bytes and the vmstat array, at least on x86 systems. There are two such sequences in kmem_cache_alloc() and two in kmem_cache_free(). This patchset tries to reduce the use of such sequences as much as possible; in fact, it eliminates them in the common case. Another part of this patchset is to cache the vmstat data updates in the local stock as well, which also helps.
>
> [1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
>
> Waiman Long (5):
>   mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
>   mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
>   mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
>   mm/memcg: Separate out object stock data into its own struct
>   mm/memcg: Optimize user context object stock access
>
>  include/linux/memcontrol.h |  14 ++-
>  mm/memcontrol.c            | 198 ++++++++++++++++++++++++++++++++-----
>  mm/percpu.c                |   9 +-
>  mm/slab.h                  |  32 +++---
>  4 files changed, 195 insertions(+), 58 deletions(-)
>
> --
> 2.18.1
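For readers who want to reproduce numbers like these, here is a minimal sketch of such a benchmark module. It assumes a SLAB_ACCOUNT-flagged cache exercised from inside a non-root memory cgroup; the module name, cache name and timing details are illustrative, since the actual test module Waiman used is not posted in this thread.

```c
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/ktime.h>

static struct kmem_cache *test_cachep;

static int __init slab_bench_init(void)
{
	unsigned long i;
	void *obj;
	ktime_t start = ktime_get();

	/*
	 * SLAB_ACCOUNT enables memcg accounting for this cache; load the
	 * module from inside a non-root memory cgroup to exercise the
	 * accounting path, or drop the flag (or boot with
	 * cgroup.memory=nokmem) to measure the unaccounted baseline.
	 */
	test_cachep = kmem_cache_create("slab_bench_64", 64, 0,
					SLAB_ACCOUNT, NULL);
	if (!test_cachep)
		return -ENOMEM;

	/* Tight alloc/free loop of a single 64-byte object, 100M times. */
	for (i = 0; i < 100000000UL; i++) {
		obj = kmem_cache_alloc(test_cachep, GFP_KERNEL);
		if (!obj)
			break;
		kmem_cache_free(test_cachep, obj);
	}

	pr_info("slab_bench: %lu iterations in %lld ms\n",
		i, ktime_ms_delta(ktime_get(), start));
	return 0;
}

static void __exit slab_bench_exit(void)
{
	kmem_cache_destroy(test_cachep);
}

module_init(slab_bench_init);
module_exit(slab_bench_exit);
MODULE_LICENSE("GPL");
```

Timing the `insmod` (or reading the printed delta) with and without accounting gives numbers comparable in spirit to the ones quoted above, though absolute values obviously depend on the hardware and config.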
On 4/9/21 9:51 PM, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
>> With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free().
>>
>> Workloads that require a lot of kmemcache allocations and de-allocations may therefore experience a performance regression, as illustrated in [1].
>>
>> With a simple kernel module that performs a repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at module init, the execution times to load the kernel module with and without memory accounting were:
>>
>> with accounting = 6.798s
>> w/o accounting  = 1.758s
>>
>> That is an increase of 5.04s (287%). With this patchset applied, the execution time became 4.254s. So the memory accounting overhead is now 2.496s, a 50% reduction.
>
> Hi Waiman!
>
> Thank you for working on it, it's indeed very useful!
> A couple of questions:
> 1) did your config include lockdep or not?

The test kernel is based on a production kernel config, so lockdep isn't enabled.

> 2) do you have a (rough) estimate of how much each change contributes to the overall reduction?

I should have a better breakdown of the effect of the individual patches. I reran the benchmarking module with turbo-boosting disabled to reduce run-to-run variation. The execution times were:

Before patch:  time = 10.800s (with memory accounting), 2.848s (w/o accounting), overhead = 7.952s
After patch 2: time = 9.140s, overhead = 6.292s
After patch 3: time = 7.641s, overhead = 4.793s
After patch 5: time = 6.801s, overhead = 3.953s

Patches 1 & 4 are preparatory patches that should not affect performance.

So the memory accounting overhead was reduced by about half.

Cheers,
Longman
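As context for the breakdown above: patches 2 and 3 reduce how often the irq-save/restore sequences are needed by batching uncharges and caching vmstat updates in the per-cpu stock, while patch 5 ("Optimize user context object stock access") avoids the irq-save/restore pair entirely when the caller runs in task context. The snippet below is only a conceptual sketch of that last idea, not the actual mm/memcontrol.c code; the helper names are invented, and the scheme is only safe on the assumption (which the series' separate-struct patch enables) that task context and interrupt context operate on separate per-cpu copies of the object stock data.

```c
#include <linux/preempt.h>
#include <linux/irqflags.h>

/*
 * Conceptual sketch only -- not the real mm/memcontrol.c implementation.
 * In task context the per-cpu object stock can be protected by
 * preempt_disable() alone, which is much cheaper on x86 than the
 * local_irq_save()/local_irq_restore() pair.  This is only correct if
 * interrupt context updates its own, separate per-cpu stock copy.
 */
static inline unsigned long obj_stock_get(void)
{
	unsigned long flags = 0;

	if (in_task())
		preempt_disable();	/* common case: cheap fast path */
	else
		local_irq_save(flags);	/* hard/soft irq context */
	return flags;
}

static inline void obj_stock_put(unsigned long flags)
{
	if (in_task())
		preempt_enable();
	else
		local_irq_restore(flags);
}
```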
On Mon, Apr 12, 2021 at 10:03:13AM -0400, Waiman Long wrote:
> On 4/9/21 9:51 PM, Roman Gushchin wrote:
> > On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
> > > With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free().
> > >
> > > Workloads that require a lot of kmemcache allocations and de-allocations may therefore experience a performance regression, as illustrated in [1].
> > >
> > > With a simple kernel module that performs a repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at module init, the execution times to load the kernel module with and without memory accounting were:
> > >
> > > with accounting = 6.798s
> > > w/o accounting  = 1.758s
> > >
> > > That is an increase of 5.04s (287%). With this patchset applied, the execution time became 4.254s. So the memory accounting overhead is now 2.496s, a 50% reduction.
> > Hi Waiman!
> >
> > Thank you for working on it, it's indeed very useful!
> > A couple of questions:
> > 1) did your config include lockdep or not?
> The test kernel is based on a production kernel config, so lockdep isn't enabled.
> > 2) do you have a (rough) estimate of how much each change contributes to the overall reduction?
>
> I should have a better breakdown of the effect of the individual patches. I reran the benchmarking module with turbo-boosting disabled to reduce run-to-run variation. The execution times were:
>
> Before patch:  time = 10.800s (with memory accounting), 2.848s (w/o accounting), overhead = 7.952s
> After patch 2: time = 9.140s, overhead = 6.292s
> After patch 3: time = 7.641s, overhead = 4.793s
> After patch 5: time = 6.801s, overhead = 3.953s

Thank you! If there is a v2, I'd include this information in the commit logs.

> Patches 1 & 4 are preparatory patches that should not affect performance.
>
> So the memory accounting overhead was reduced by about half.

This is really great! Thanks!
On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
> With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free().
>
> Workloads that require a lot of kmemcache allocations and de-allocations may therefore experience a performance regression, as illustrated in [1].
>
> With a simple kernel module that performs a repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at module init, the execution times to load the kernel module with and without memory accounting were:
>
> with accounting = 6.798s
> w/o accounting  = 1.758s
>
> That is an increase of 5.04s (287%). With this patchset applied, the execution time became 4.254s. So the memory accounting overhead is now 2.496s, a 50% reduction.

Btw, there were two recent independent reports of benchmark regressions caused by the introduction of the per-object accounting:
1) Xing reported a hackbench regression: https://lkml.org/lkml/2021/1/13/1277
2) Masayoshi reported a pgbench regression: https://www.spinics.net/lists/linux-mm/msg252540.html

I wonder if you can run them (or at least one) and attach the results to the series? It would be very helpful.

Thank you!
On 4/12/21 1:47 PM, Roman Gushchin wrote:
> On Mon, Apr 12, 2021 at 10:03:13AM -0400, Waiman Long wrote:
>> On 4/9/21 9:51 PM, Roman Gushchin wrote:
>>> On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
>>>> With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free().
>>>>
>>>> Workloads that require a lot of kmemcache allocations and de-allocations may therefore experience a performance regression, as illustrated in [1].
>>>>
>>>> With a simple kernel module that performs a repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at module init, the execution times to load the kernel module with and without memory accounting were:
>>>>
>>>> with accounting = 6.798s
>>>> w/o accounting  = 1.758s
>>>>
>>>> That is an increase of 5.04s (287%). With this patchset applied, the execution time became 4.254s. So the memory accounting overhead is now 2.496s, a 50% reduction.
>>> Hi Waiman!
>>>
>>> Thank you for working on it, it's indeed very useful!
>>> A couple of questions:
>>> 1) did your config include lockdep or not?
>> The test kernel is based on a production kernel config, so lockdep isn't enabled.
>>> 2) do you have a (rough) estimate of how much each change contributes to the overall reduction?
>> I should have a better breakdown of the effect of the individual patches. I reran the benchmarking module with turbo-boosting disabled to reduce run-to-run variation. The execution times were:
>>
>> Before patch:  time = 10.800s (with memory accounting), 2.848s (w/o accounting), overhead = 7.952s
>> After patch 2: time = 9.140s, overhead = 6.292s
>> After patch 3: time = 7.641s, overhead = 4.793s
>> After patch 5: time = 6.801s, overhead = 3.953s
> Thank you! If there is a v2, I'd include this information in the commit logs.

Yes, I am planning to send out a v2 with this information in the cover letter. I am just waiting a bit to see if there is more feedback.

-Longman

>
>> Patches 1 & 4 are preparatory patches that should not affect performance.
>>
>> So the memory accounting overhead was reduced by about half.

BTW, the benchmark that I used is kind of a best-case behavior, as all the updates go to the percpu stocks. Real workloads will likely have a certain amount of updates to the memcg charges and vmstats, so the performance benefit will be less.

Cheers,
Longman
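To illustrate why a tight alloc/free loop is the best case: each kmem_cache_free() puts the just-uncharged bytes back into the per-cpu object stock, so the next kmem_cache_alloc() is satisfied from the stock without touching the atomic page_counter or the vmstat arrays. Below is a deliberately simplified sketch of that fast path; the struct and function names are invented, and the real code in mm/memcontrol.c must also handle draining, irq safety and the cached vmstat bytes.

```c
#include <linux/percpu.h>
#include <linux/memcontrol.h>

/* Simplified illustration only -- not the real memcg_stock_pcp code. */
struct obj_stock_sketch {
	struct obj_cgroup *cached_objcg;	/* objcg the bytes were charged to */
	unsigned int nr_bytes;			/* pre-charged bytes on this CPU */
};

static DEFINE_PER_CPU(struct obj_stock_sketch, obj_stock_sketch);

/*
 * Fast path: if the per-CPU stock already holds enough pre-charged bytes
 * for this objcg, charging is a plain subtraction with no atomics.
 * kmem_cache_free() refills the same counter, so a tight alloc/free loop
 * on one small object size stays on this path almost all the time.
 */
static bool obj_stock_try_charge(struct obj_cgroup *objcg, unsigned int nr_bytes)
{
	struct obj_stock_sketch *stock = this_cpu_ptr(&obj_stock_sketch);
	bool charged = false;

	if (stock->cached_objcg == objcg && stock->nr_bytes >= nr_bytes) {
		stock->nr_bytes -= nr_bytes;
		charged = true;
	}
	return charged;	/* false: fall back to the slow charge path */
}
```

Real workloads mix object sizes and cgroups, so the stock is drained and refilled more often and the slow path (atomic charges and vmstat updates) is taken more frequently, which is why the benefit there is smaller than in the benchmark.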
On 4/12/21 3:05 PM, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
>> With the recent introduction of the new slab memory controller, we eliminate the need for having separate kmemcaches for each memory cgroup and reduce overall kernel memory usage. However, we also add additional memory accounting overhead to each call of kmem_cache_alloc() and kmem_cache_free().
>>
>> Workloads that require a lot of kmemcache allocations and de-allocations may therefore experience a performance regression, as illustrated in [1].
>>
>> With a simple kernel module that performs a repeated loop of 100,000,000 kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at module init, the execution times to load the kernel module with and without memory accounting were:
>>
>> with accounting = 6.798s
>> w/o accounting  = 1.758s
>>
>> That is an increase of 5.04s (287%). With this patchset applied, the execution time became 4.254s. So the memory accounting overhead is now 2.496s, a 50% reduction.
> Btw, there were two recent independent reports of benchmark regressions caused by the introduction of the per-object accounting:
> 1) Xing reported a hackbench regression: https://lkml.org/lkml/2021/1/13/1277
> 2) Masayoshi reported a pgbench regression: https://www.spinics.net/lists/linux-mm/msg252540.html
>
> I wonder if you can run them (or at least one) and attach the results to the series? It would be very helpful.

Actually, it was a bug report filed by Masayoshi-san that triggered me to work on reducing the memory accounting overhead. He is also on the cc line and so is aware of this. I will cc Xing on my v2 patch.

Cheers,
Longman