
[0/7] Split list_lru lock into per-cgroup scope

Message ID 20240624175313.47329-1-ryncsn@gmail.com

Message

Kairui Song June 24, 2024, 5:53 p.m. UTC
From: Kairui Song <kasong@tencent.com>

Currently, every list_lru has a per-node lock that protects addition,
deletion, isolation, and reparenting of all list_lru_one instances
belonging to this list_lru on this node. Contention on this lock is
heavy when multiple cgroups modify the same list_lru.

This can be alleviated by splitting the lock into per-cgroup scope.

To achieve this, this series reworks and optimizes the reparenting
process step by step, making it possible to have a stable list_lru_one
and to pin a list_lru_one, and then splits the lock into per-cgroup
scope.
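
As a rough illustration of the structural change (the field layout here
is simplified and illustrative; see PATCH 6/7 for the real one): the
lock moves from the per-node level into each per-cgroup list, so only
threads touching the same cgroup's list contend:

/* Before: one lock per node, shared by every cgroup's lists */
struct list_lru_node {
	spinlock_t		lock;	/* serializes all list_lru_one
					 * instances on this node */
	struct list_lru_one	lru;
	long			nr_items;
};

/* After: each per-cgroup, per-node list carries its own lock */
struct list_lru_one {
	struct list_head	list;
	long			nr_items;
	spinlock_t		lock;	/* contended only within one cgroup */
};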

The result is reduced LOC and better performance: I see a ~25%
improvement for multi-cgroup SWAP over ZRAM and a ~10% improvement for
a multi-cgroup inode / dentry workload, as tested in PATCH 6/7:

memhog SWAP test (shadow nodes):
Before:
real    0m20.328s user    0m4.315s sys     10m23.639s
real    0m20.440s user    0m4.142s sys     10m34.756s
real    0m20.381s user    0m4.164s sys     10m29.035s

After:
real    0m15.156s user    0m4.590s sys     7m34.361s
real    0m15.161s user    0m4.776s sys     7m35.086s
real    0m15.429s user    0m4.734s sys     7m42.919s

File read test (inode / dentry):
Before:
real    0m26.939s user    0m36.322s sys     6m30.248s
real    0m15.111s user    0m33.749s sys     5m4.991s
real    0m16.796s user    0m33.438s sys     5m22.865s
real    0m15.256s user    0m34.060s sys     4m56.870s
real    0m14.826s user    0m33.531s sys     4m55.907s
real    0m15.664s user    0m35.619s sys     6m3.638s
real    0m15.746s user    0m34.066s sys     4m56.519s

After:
real    0m22.166s user    0m35.155s sys     6m21.045s
real    0m13.753s user    0m34.554s sys     4m40.982s
real    0m13.815s user    0m34.693s sys     4m39.605s
real    0m13.495s user    0m34.372s sys     4m40.776s
real    0m13.895s user    0m34.005s sys     4m39.061s
real    0m13.629s user    0m33.476s sys     4m43.626s
real    0m14.001s user    0m33.463s sys     4m41.261s

PATCH 1/7: Fixes a long-standing bug, so shadow nodes will be accounted
    to the right cgroup and put into the right list_lru.
PATCH 2/7 - 4/7: Clean up.
PATCH 5/7: Reworks and optimizes the reparenting process, avoiding
    touching kmemcg_id on reparenting as a first step.
PATCH 6/7: Makes it possible to pin a list_lru_one to prevent racing
    with reparenting, and splits the lock into per-cgroup scope (a
    sketch of the idea follows below).
PATCH 7/7: Simplifies the list_lru walk callback function.
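
The pinning in PATCH 6/7 boils down to a lock-and-revalidate loop. A
minimal sketch of the idea (simplified; lock_list_lru_of_memcg and
lru_is_dying are illustrative names here, not necessarily the final
interface):

static struct list_lru_one *
lock_list_lru_of_memcg(struct list_lru *lru, int nid,
		       struct mem_cgroup *memcg)
{
	struct list_lru_one *l;

	rcu_read_lock();
again:
	l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
	spin_lock(&l->lock);
	/*
	 * Reparenting marks the victim list dead while holding its
	 * lock, so once we hold the lock a live list cannot go away
	 * under us. If we raced with reparenting, retry against the
	 * parent cgroup, which has taken over the items.
	 */
	if (unlikely(lru_is_dying(l))) {
		spin_unlock(&l->lock);
		memcg = parent_mem_cgroup(memcg);
		goto again;
	}
	rcu_read_unlock();
	return l;	/* returned with l->lock held, i.e. pinned */
}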

Kairui Song (7):
  mm/swap, workingset: make anon workingset nodes memcg aware
  mm/list_lru: don't pass unnecessary key parameters
  mm/list_lru: don't export list_lru_add
  mm/list_lru: code clean up for reparenting
  mm/list_lru: simplify reparenting and initial allocation
  mm/list_lru: split the lock to per-cgroup scope
  mm/list_lru: Simplify the list_lru walk callback function

 drivers/android/binder_alloc.c |   6 +-
 drivers/android/binder_alloc.h |   2 +-
 fs/dcache.c                    |   4 +-
 fs/gfs2/quota.c                |   2 +-
 fs/inode.c                     |   5 +-
 fs/nfs/nfs42xattr.c            |   4 +-
 fs/nfsd/filecache.c            |   5 +-
 fs/xfs/xfs_buf.c               |   2 -
 fs/xfs/xfs_qm.c                |   6 +-
 include/linux/list_lru.h       |  26 ++-
 mm/list_lru.c                  | 387 +++++++++++++++++----------------
 mm/memcontrol.c                |  10 +-
 mm/swap_state.c                |   3 +-
 mm/workingset.c                |  20 +-
 mm/zswap.c                     |  12 +-
 15 files changed, 246 insertions(+), 248 deletions(-)

Comments

Andrew Morton June 24, 2024, 9:26 p.m. UTC | #1
On Tue, 25 Jun 2024 01:53:06 +0800 Kairui Song <ryncsn@gmail.com> wrote:

> Currently, every list_lru has a per-node lock that protects addition,
> deletion, isolation, and reparenting of all list_lru_one instances
> belonging to this list_lru on this node. Contention on this lock is
> heavy when multiple cgroups modify the same list_lru.
> 
> This can be alleviated by splitting the lock into per-cgroup scope.

I'm wavering over this.  We're at -rc5 and things generally feel a bit
unstable at present.

The performance numbers are nice for extreme workloads, but can you
suggest how much benefit users will see in more typical workloads?

Anyway, opinions are sought and I'd ask people to please review this
work promptly if they feel it is sufficiently beneficial.

Kairui Song June 25, 2024, 7:47 a.m. UTC | #2
On Tue, Jun 25, 2024 at 5:26 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 25 Jun 2024 01:53:06 +0800 Kairui Song <ryncsn@gmail.com> wrote:
>
> > Currently, every list_lru has a per-node lock that protects addition,
> > deletion, isolation, and reparenting of all list_lru_one instances
> > belonging to this list_lru on this node. Contention on this lock is
> > heavy when multiple cgroups modify the same list_lru.
> >
> > This can be alleviated by splitting the lock into per-cgroup scope.
>
> I'm wavering over this.  We're at -rc5 and things generally feel a bit
> unstable at present.
>
> The performance numbers are nice for extreme workloads, but can you
> suggest how much benefit users will see in more typical workloads?

Hi, the contention issue might be minor when memory stress is low,
but the series is still beneficial in that case, and it also optimizes
cgroup initialization.

The memhog test I provided was run on a 32-core system with 64
cgroups (I forgot to provide this detail, sorry). That is not a very
extreme configuration, considering it's not rare to have thousands of
cgroups on a system nowadays, and having them all share a global lock
is definitely not a good idea.

The issue is barely observable for things like desktop usage though.

>
> Anyway, opinions are sought and I'd ask people to please review this
> work promptly if they feel it is sufficiently beneficial.

More reviews are definitely beneficial.

Shakeel Butt June 25, 2024, 5 p.m. UTC | #3
On Mon, Jun 24, 2024 at 02:26:54PM GMT, Andrew Morton wrote:
> On Tue, 25 Jun 2024 01:53:06 +0800 Kairui Song <ryncsn@gmail.com> wrote:
> 
> > Currently, every list_lru has a per-node lock that protects addition,
> > deletion, isolation, and reparenting of all list_lru_one instances
> > belonging to this list_lru on this node. Contention on this lock is
> > heavy when multiple cgroups modify the same list_lru.
> > 
> > This can be alleviated by splitting the lock into per-cgroup scope.
> 
> I'm wavering over this.  We're at -rc5 and things generally feel a bit
> unstable at present.
> 
> The performance numbers are nice for extreme workloads, but can you
> suggest how much benefit users will see in more typical workloads?
> 
> Anyway, opinions are sought and I'd ask people to please review this
> work promptly if they feel it is sufficiently beneficial.

I will hopefully take a look this week (or next), but I don't see any
issue or concern with this missing the upcoming open window.

Shakeel

Shakeel Butt Aug. 27, 2024, 6:35 p.m. UTC | #4
On Tue, Jun 25, 2024 at 01:53:06AM GMT, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Currently, every list_lru has a per-node lock that protects addition,
> deletion, isolation, and reparenting of all list_lru_one instances
> belonging to this list_lru on this node. Contention on this lock is
> heavy when multiple cgroups modify the same list_lru.
>
> This can be alleviated by splitting the lock into per-cgroup scope.

[...]

Hi Kairui, can you send the v2 of this series (without the first patch)?

thanks,
Shakeel