diff mbox series

[PATCHv2] zsmalloc: allow only one active pool compaction context

Message ID 20230417135420.1836741-1-senozhatsky@chromium.org (mailing list archive)
State New
Headers show
Series [PATCHv2] zsmalloc: allow only one active pool compaction context | expand

Commit Message

Sergey Senozhatsky April 17, 2023, 1:54 p.m. UTC
zsmalloc pool can be compacted concurrently by many contexts,
e.g.

 cc1 handle_mm_fault()
      do_anonymous_page()
       __alloc_pages_slowpath()
        try_to_free_pages()
         do_try_to_free_pages(
          lru_gen_shrink_node()
           shrink_slab()
            do_shrink_slab()
             zs_shrinker_scan()
              zs_compact()

This creates unnecessary contention as all those processes
compete for access to the same classes. A single compaction
process is enough. Moreover contention that is created by
multiple compaction processes impact other zsmalloc functions,
e.g. zs_malloc(), since zsmalloc uses "global" pool->lock to
synchronize access to pool.

Introduce pool compaction mutex and permit only one compaction
context at a time. This reduces overall pool->lock contention.

/proc/lock-stat after make -j$((`nproc`+1)) linux kernel for
&pool->lock#3:

                Base           Patched
------------------------------------------
con-bounces     2035730        1540066
contentions     2343871        1774348
waittime-min    0.10           0.10
waittime-max    4004216.24     2745.22
waittime-total  101334168.29   67865414.91
waittime-avg    43.23          38.25
acq-bounces     2895765        2186745
acquisitions    6247686        5136943
holdtime-min    0.07           0.07
holdtime-max    2605507.97     482439.16
holdtime-total  9998599.59     5107151.01
holdtime-avg    1.60           0.99

Test run time:
Base
2775.15user 1709.13system 2:13.82elapsed 3350%CPU

Patched
2608.25user 1439.03system 2:03.63elapsed 3273%CPU

Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
---
 mm/zsmalloc.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Yosry Ahmed April 17, 2023, 6:32 p.m. UTC | #1
On Mon, Apr 17, 2023 at 6:54 AM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> zsmalloc pool can be compacted concurrently by many contexts,
> e.g.
>
>  cc1 handle_mm_fault()
>       do_anonymous_page()
>        __alloc_pages_slowpath()
>         try_to_free_pages()
>          do_try_to_free_pages(
>           lru_gen_shrink_node()
>            shrink_slab()
>             do_shrink_slab()
>              zs_shrinker_scan()
>               zs_compact()
>
> This creates unnecessary contention as all those processes
> compete for access to the same classes. A single compaction
> process is enough. Moreover contention that is created by
> multiple compaction processes impact other zsmalloc functions,
> e.g. zs_malloc(), since zsmalloc uses "global" pool->lock to
> synchronize access to pool.
>
> Introduce pool compaction mutex and permit only one compaction
> context at a time. This reduces overall pool->lock contention.
>
> /proc/lock-stat after make -j$((`nproc`+1)) linux kernel for
> &pool->lock#3:
>
>                 Base           Patched
> ------------------------------------------
> con-bounces     2035730        1540066
> contentions     2343871        1774348
> waittime-min    0.10           0.10
> waittime-max    4004216.24     2745.22
> waittime-total  101334168.29   67865414.91
> waittime-avg    43.23          38.25
> acq-bounces     2895765        2186745
> acquisitions    6247686        5136943
> holdtime-min    0.07           0.07
> holdtime-max    2605507.97     482439.16
> holdtime-total  9998599.59     5107151.01
> holdtime-avg    1.60           0.99

The numbers seem to be better when using an atomic vs. a mutex, is
this just noise or significant difference? (I am not familiar with
lock-stat).

>
> Test run time:
> Base
> 2775.15user 1709.13system 2:13.82elapsed 3350%CPU
>
> Patched
> 2608.25user 1439.03system 2:03.63elapsed 3273%CPU
>
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>

FWIW,
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>

> ---
>  mm/zsmalloc.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index cc81dfba05a0..dfec2fc6a30f 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -264,6 +264,7 @@ struct zs_pool {
>         struct work_struct free_work;
>  #endif
>         spinlock_t lock;
> +       atomic_t compaction_in_progress;
>  };
>
>  struct zspage {
> @@ -2274,6 +2275,9 @@ unsigned long zs_compact(struct zs_pool *pool)
>         struct size_class *class;
>         unsigned long pages_freed = 0;
>
> +       if (atomic_xchg(&pool->compaction_in_progress, 1))
> +               return 0;
> +
>         for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
>                 class = pool->size_class[i];
>                 if (class->index != i)
> @@ -2281,6 +2285,7 @@ unsigned long zs_compact(struct zs_pool *pool)
>                 pages_freed += __zs_compact(pool, class);
>         }
>         atomic_long_add(pages_freed, &pool->stats.pages_compacted);
> +       atomic_set(&pool->compaction_in_progress, 0);
>
>         return pages_freed;
>  }
> @@ -2388,6 +2393,7 @@ struct zs_pool *zs_create_pool(const char *name)
>
>         init_deferred_free(pool);
>         spin_lock_init(&pool->lock);
> +       atomic_set(&pool->compaction_in_progress, 0);
>
>         pool->name = kstrdup(name, GFP_KERNEL);
>         if (!pool->name)
> --
> 2.40.0.634.g4ca3ef3211-goog
>
Sergey Senozhatsky April 17, 2023, 11:58 p.m. UTC | #2
On (23/04/17 11:32), Yosry Ahmed wrote:
> > /proc/lock-stat after make -j$((`nproc`+1)) linux kernel for
> > &pool->lock#3:
> >
> >                 Base           Patched
> > ------------------------------------------
> > con-bounces     2035730        1540066
> > contentions     2343871        1774348
> > waittime-min    0.10           0.10
> > waittime-max    4004216.24     2745.22
> > waittime-total  101334168.29   67865414.91
> > waittime-avg    43.23          38.25
> > acq-bounces     2895765        2186745
> > acquisitions    6247686        5136943
> > holdtime-min    0.07           0.07
> > holdtime-max    2605507.97     482439.16
> > holdtime-total  9998599.59     5107151.01
> > holdtime-avg    1.60           0.99
> 
> The numbers seem to be better when using an atomic vs. a mutex, is
> this just noise or significant difference? (I am not familiar with
> lock-stat).

Pretty sure that's just noise. The test is make -j72 on a system
that swaps out, so it's terribly noisy.
Andrew Morton April 18, 2023, 12:41 a.m. UTC | #3
On Mon, 17 Apr 2023 22:54:20 +0900 Sergey Senozhatsky <senozhatsky@chromium.org> wrote:

> zsmalloc pool can be compacted concurrently by many contexts,
> e.g.
> 
>  cc1 handle_mm_fault()
>       do_anonymous_page()
>        __alloc_pages_slowpath()
>         try_to_free_pages()
>          do_try_to_free_pages(
>           lru_gen_shrink_node()
>            shrink_slab()
>             do_shrink_slab()
>              zs_shrinker_scan()
>               zs_compact()
> 
> This creates unnecessary contention as all those processes
> compete for access to the same classes. A single compaction
> process is enough. Moreover contention that is created by
> multiple compaction processes impact other zsmalloc functions,
> e.g. zs_malloc(), since zsmalloc uses "global" pool->lock to
> synchronize access to pool.
> 
> Introduce pool compaction mutex and permit only one compaction
> context at a time. This reduces overall pool->lock contention.

That isn't what the patch does!  Perhaps an earlier version used a mutex?

> /proc/lock-stat after make -j$((`nproc`+1)) linux kernel for
> &pool->lock#3:
> 
>                 Base           Patched
> ------------------------------------------
> con-bounces     2035730        1540066
> contentions     2343871        1774348
> waittime-min    0.10           0.10
> waittime-max    4004216.24     2745.22
> waittime-total  101334168.29   67865414.91
> waittime-avg    43.23          38.25
> acq-bounces     2895765        2186745
> acquisitions    6247686        5136943
> holdtime-min    0.07           0.07
> holdtime-max    2605507.97     482439.16
> holdtime-total  9998599.59     5107151.01
> holdtime-avg    1.60           0.99
> 
> Test run time:
> Base
> 2775.15user 1709.13system 2:13.82elapsed 3350%CPU
> 
> Patched
> 2608.25user 1439.03system 2:03.63elapsed 3273%CPU
> 
> ...
>
> @@ -2274,6 +2275,9 @@ unsigned long zs_compact(struct zs_pool *pool)
>  	struct size_class *class;
>  	unsigned long pages_freed = 0;
>  
> +	if (atomic_xchg(&pool->compaction_in_progress, 1))
> +		return 0;
> +

A code comment might be appropriate here.

Is the spin_is_contended() test in __zs_compact() still relevant?

And....  single-threading the operation seems a pretty sad way of
addressing a contention issue.  zs_compact() is fairly computationally
expensive - surely a large machine would like to be able to
concurrently run many instances of zs_compact()?
Yosry Ahmed April 18, 2023, 2:53 a.m. UTC | #4
On Mon, Apr 17, 2023 at 5:41 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 17 Apr 2023 22:54:20 +0900 Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
>
> > zsmalloc pool can be compacted concurrently by many contexts,
> > e.g.
> >
> >  cc1 handle_mm_fault()
> >       do_anonymous_page()
> >        __alloc_pages_slowpath()
> >         try_to_free_pages()
> >          do_try_to_free_pages(
> >           lru_gen_shrink_node()
> >            shrink_slab()
> >             do_shrink_slab()
> >              zs_shrinker_scan()
> >               zs_compact()
> >
> > This creates unnecessary contention as all those processes
> > compete for access to the same classes. A single compaction
> > process is enough. Moreover contention that is created by
> > multiple compaction processes impact other zsmalloc functions,
> > e.g. zs_malloc(), since zsmalloc uses "global" pool->lock to
> > synchronize access to pool.
> >
> > Introduce pool compaction mutex and permit only one compaction
> > context at a time. This reduces overall pool->lock contention.
>
> That isn't what the patch does!  Perhaps an earlier version used a mutex?

Yup that's from v1. Thanks for catching that.

>
> > /proc/lock-stat after make -j$((`nproc`+1)) linux kernel for
> > &pool->lock#3:
> >
> >                 Base           Patched
> > ------------------------------------------
> > con-bounces     2035730        1540066
> > contentions     2343871        1774348
> > waittime-min    0.10           0.10
> > waittime-max    4004216.24     2745.22
> > waittime-total  101334168.29   67865414.91
> > waittime-avg    43.23          38.25
> > acq-bounces     2895765        2186745
> > acquisitions    6247686        5136943
> > holdtime-min    0.07           0.07
> > holdtime-max    2605507.97     482439.16
> > holdtime-total  9998599.59     5107151.01
> > holdtime-avg    1.60           0.99
> >
> > Test run time:
> > Base
> > 2775.15user 1709.13system 2:13.82elapsed 3350%CPU
> >
> > Patched
> > 2608.25user 1439.03system 2:03.63elapsed 3273%CPU
> >
> > ...
> >
> > @@ -2274,6 +2275,9 @@ unsigned long zs_compact(struct zs_pool *pool)
> >       struct size_class *class;
> >       unsigned long pages_freed = 0;
> >
> > +     if (atomic_xchg(&pool->compaction_in_progress, 1))
> > +             return 0;
> > +
>
> A code comment might be appropriate here.
>
> Is the spin_is_contended() test in __zs_compact() still relevant?

It is, as pool->lock is used to synchronize a lot of operations, not
just compaction.

>
> And....  single-threading the operation seems a pretty sad way of
> addressing a contention issue.  zs_compact() is fairly computationally
> expensive - surely a large machine would like to be able to
> concurrently run many instances of zs_compact()?

Yeah that was my initial thought as well, to attack this problem at a
larger scale and break down the pool->lock into class->lock. AFAICT,
zsmalloc pages/objects cannot change classes during their lifetime, so
this sounds reasonable initially.

The problem is that the pool->lock is used to synchronize ~all
zsmalloc operations:  zs_map_object(), zs_malloc(), zs_free(),
async_free_zspage(), zs_compact(), zs_reclaim_page(),
zs_page_migrate().

For some of those, changing a pool->lock into a class->lock is fairly
straightforward, specifically the ones where we operate on a
page/zspage (e.g. zs_compact()).

The problem is for operations where we operate on an object (e.g.
zs_map_object()). At this point, we only have a handle that we need to
use to find the underlying object. We have no idea what class the
object belongs to, and finding an object given a handle needs
synchronization as the handle<->object mapping can change by
compaction or page migration.

We can store the class somewhere and pay an extra byte per compressed
object (which doesn't sound too bad) -- but even then we'd have
trouble finding the right place to stash it. Perhaps we can enforce
some alignment on the objects and use the lowermost 8 bits of the
handle pointer, which also means we don't need to pay any extra
overhead, but I am not sure if this is practical.

We can also store the class at the first byte of the object, and
guarantee that through migration/compaction the first byte always
remains a valid class. We can then read the class from the handle
locklessly, acquire the class lock, and then make sure the handle is
still the same.

These are all ideas I am throwing in the air, I didn't think about
them thoroughly or try implementing them.

TLDR: I believe it is possible (in theory at least) to break down the
global pool->lock into class locks, which should benefit virtually
~all zsmalloc concurrent operations, not just compaction. It is not
easy though.

As for this patch, I personally do not observe a lot of compaction in
our production environment, and allowing one thread to perform
compaction while others move on with their lives can be better than
having all of them continuously contending for the pool->lock, which
means more contention with ~all zsmalloc operations, not just
concurrent compactors. I can't say for sure that this is an
improvement, but I *believe* it is.

I will shut up now and leave it up to Sergey to decide how to move
forward, sorry for the very long response :)
Sergey Senozhatsky April 18, 2023, 3:05 a.m. UTC | #5
On (23/04/17 17:41), Andrew Morton wrote:
> On Mon, 17 Apr 2023 22:54:20 +0900 Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
> > Introduce pool compaction mutex and permit only one compaction
> > context at a time. This reduces overall pool->lock contention.
> 
> That isn't what the patch does!  Perhaps an earlier version used a mutex?

Oh, yes.

[..]
> > @@ -2274,6 +2275,9 @@ unsigned long zs_compact(struct zs_pool *pool)
> >  	struct size_class *class;
> >  	unsigned long pages_freed = 0;
> >  
> > +	if (atomic_xchg(&pool->compaction_in_progress, 1))
> > +		return 0;
> > +
> 
> A code comment might be appropriate here.

OK.

> Is the spin_is_contended() test in __zs_compact() still relevant?

I'd say yes, pool->lock is "big pool lock", we use it for everything:
zs_map_object() when we read objects, zs_malloc() when we allocate objects,
zs_free() when we free objects, etc.

> And....  single-threading the operation seems a pretty sad way of
> addressing a contention issue.  zs_compact() is fairly computationally
> expensive - surely a large machine would like to be able to
> concurrently run many instances of zs_compact()?

Compaction is effective only when zspool suffers from internal fragmentation.
Concurrent compaction threads iterate size classes in the same order and are
likely to compete for pool->lock to just find out that previous pool->lock
owner has compacted the class already and there isn't much left to do.

As of concurrent compaction on big machines, the thing is - concurrent
compaction doesn't really happen. We always have just one thread compacting
size classes under pool->lock, the remaining compaction threads just spin on
pool->lock. I believe it used to be slightly different in the past when we
had per size-class lock instead of "global" pool->lock. Commit c0547d0b6a4b
("zsmalloc: consolidate zs_pool's migrate_lock and size_class's locks") has
basically made compaction single-threaded.
Sergey Senozhatsky April 18, 2023, 11:24 a.m. UTC | #6
On (23/04/17 19:53), Yosry Ahmed wrote:
> 
> As for this patch, I personally do not observe a lot of compaction in
> our production environment, and allowing one thread to perform
> compaction while others move on with their lives can be better than
> having all of them continuously contending for the pool->lock, which
> means more contention with ~all zsmalloc operations, not just
> concurrent compactors. I can't say for sure that this is an
> improvement, but I *believe* it is.

Looking at one of ChromeOS memory-pressure tests, I see that sometimes
(albeit rarely) we can have up to 9 parallel zspool compaction contexts,
perhaps a little bit too many for a 12 CPUs laptop:

[ 2159.378827] zsmalloc: ctx #1 chrome -> zs_compact()
[ 2159.379002] zsmalloc: ctx #2 Chrome_ChildIOT -> zs_compact()
[ 2159.379120] zsmalloc: ctx #3 chrome -> zs_compact()
[ 2159.379135] zsmalloc: ctx #4 chrome -> zs_compact()
[ 2159.379213] zsmalloc: ctx #5 chrome -> zs_compact()
[ 2159.379271] zsmalloc: ctx #6 chrome -> zs_compact()
[ 2159.379276] zsmalloc: ctx #7 chrome -> zs_compact()
[ 2159.382786] zsmalloc: ctx #8 chrome -> zs_compact()
[ 2159.432153] zsmalloc: ctx #9 kswapd0 -> zs_compact()
Yosry Ahmed April 18, 2023, 7:37 p.m. UTC | #7
On Tue, Apr 18, 2023 at 4:24 AM Sergey Senozhatsky
<senozhatsky@chromium.org> wrote:
>
> On (23/04/17 19:53), Yosry Ahmed wrote:
> >
> > As for this patch, I personally do not observe a lot of compaction in
> > our production environment, and allowing one thread to perform
> > compaction while others move on with their lives can be better than
> > having all of them continuously contending for the pool->lock, which
> > means more contention with ~all zsmalloc operations, not just
> > concurrent compactors. I can't say for sure that this is an
> > improvement, but I *believe* it is.
>
> Looking at one of ChromeOS memory-pressure tests, I see that sometimes
> (albeit rarely) we can have up to 9 parallel zspool compaction contexts,
> perhaps a little bit too many for a 12 CPUs laptop:
>
> [ 2159.378827] zsmalloc: ctx #1 chrome -> zs_compact()
> [ 2159.379002] zsmalloc: ctx #2 Chrome_ChildIOT -> zs_compact()
> [ 2159.379120] zsmalloc: ctx #3 chrome -> zs_compact()
> [ 2159.379135] zsmalloc: ctx #4 chrome -> zs_compact()
> [ 2159.379213] zsmalloc: ctx #5 chrome -> zs_compact()
> [ 2159.379271] zsmalloc: ctx #6 chrome -> zs_compact()
> [ 2159.379276] zsmalloc: ctx #7 chrome -> zs_compact()
> [ 2159.382786] zsmalloc: ctx #8 chrome -> zs_compact()
> [ 2159.432153] zsmalloc: ctx #9 kswapd0 -> zs_compact()


I haven't looked into such pressure tests on our end, the information
I have is based on the amount of CPU cycles we are spending on
zsmalloc compaction in general.
diff mbox series

Patch

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index cc81dfba05a0..dfec2fc6a30f 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -264,6 +264,7 @@  struct zs_pool {
 	struct work_struct free_work;
 #endif
 	spinlock_t lock;
+	atomic_t compaction_in_progress;
 };
 
 struct zspage {
@@ -2274,6 +2275,9 @@  unsigned long zs_compact(struct zs_pool *pool)
 	struct size_class *class;
 	unsigned long pages_freed = 0;
 
+	if (atomic_xchg(&pool->compaction_in_progress, 1))
+		return 0;
+
 	for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
 		class = pool->size_class[i];
 		if (class->index != i)
@@ -2281,6 +2285,7 @@  unsigned long zs_compact(struct zs_pool *pool)
 		pages_freed += __zs_compact(pool, class);
 	}
 	atomic_long_add(pages_freed, &pool->stats.pages_compacted);
+	atomic_set(&pool->compaction_in_progress, 0);
 
 	return pages_freed;
 }
@@ -2388,6 +2393,7 @@  struct zs_pool *zs_create_pool(const char *name)
 
 	init_deferred_free(pool);
 	spin_lock_init(&pool->lock);
+	atomic_set(&pool->compaction_in_progress, 0);
 
 	pool->name = kstrdup(name, GFP_KERNEL);
 	if (!pool->name)