Message ID | 20240624175313.47329-1-ryncsn@gmail.com (mailing list archive)
Series     | Split list_lru lock into per-cgroup scope
On Tue, 25 Jun 2024 01:53:06 +0800 Kairui Song <ryncsn@gmail.com> wrote:

> Currently, every list_lru has a per-node lock that protects adding,
> deletion, isolation, and reparenting of all list_lru_one instances
> belonging to this list_lru on this node. This lock contention is heavy
> when multiple cgroups modify the same list_lru.
>
> This can be alleviated by splitting the lock into per-cgroup scope.

I'm wavering over this. We're at -rc5 and things generally feel a bit
unstable at present.

The performance numbers are nice for extreme workloads, but can you
suggest how much benefit users will see in more typical workloads?

Anyway, opinions are sought and I'd ask people to please review this
work promptly if they feel it is sufficiently beneficial.
On Tue, Jun 25, 2024 at 5:26 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 25 Jun 2024 01:53:06 +0800 Kairui Song <ryncsn@gmail.com> wrote:
>
> > Currently, every list_lru has a per-node lock that protects adding,
> > deletion, isolation, and reparenting of all list_lru_one instances
> > belonging to this list_lru on this node. This lock contention is heavy
> > when multiple cgroups modify the same list_lru.
> >
> > This can be alleviated by splitting the lock into per-cgroup scope.
>
> I'm wavering over this. We're at -rc5 and things generally feel a bit
> unstable at present.
>
> The performance numbers are nice for extreme workloads, but can you
> suggest how much benefit users will see in more typical workloads?

Hi, the contention issue might be minor when memory stress is low, but
the split is still beneficial, and this series also optimizes cgroup
initialization. The memhog test I provided was run on a 32-core system
with 64 cgroups (I forgot to provide this detail, sorry), which is not a
very extreme configuration, considering it is not rare to have thousands
of cgroups on a system nowadays. Having them all share a single global
lock is definitely not a good idea. The issue is barely observable for
things like desktop usage, though.

> Anyway, opinions are sought and I'd ask people to please review this
> work promptly if they feel it is sufficiently beneficial.

More reviews are definitely beneficial.
On Mon, Jun 24, 2024 at 02:26:54PM GMT, Andrew Morton wrote:
> On Tue, 25 Jun 2024 01:53:06 +0800 Kairui Song <ryncsn@gmail.com> wrote:
>
> > Currently, every list_lru has a per-node lock that protects adding,
> > deletion, isolation, and reparenting of all list_lru_one instances
> > belonging to this list_lru on this node. This lock contention is heavy
> > when multiple cgroups modify the same list_lru.
> >
> > This can be alleviated by splitting the lock into per-cgroup scope.
>
> I'm wavering over this. We're at -rc5 and things generally feel a bit
> unstable at present.
>
> The performance numbers are nice for extreme workloads, but can you
> suggest how much benefit users will see in more typical workloads?
>
> Anyway, opinions are sought and I'd ask people to please review this
> work promptly if they feel it is sufficiently beneficial.

I will take a look this week (or next) hopefully, but I don't see any
issue or concern with this series missing the upcoming merge window.

Shakeel
On Tue, Jun 25, 2024 at 01:53:06AM GMT, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Currently, every list_lru has a per-node lock that protects adding,
> deletion, isolation, and reparenting of all list_lru_one instances
> belonging to this list_lru on this node. This lock contention is heavy
> when multiple cgroups modify the same list_lru.
>
> This can be alleviated by splitting the lock into per-cgroup scope.
>
> [...]

Hi Kairui, can you send the v2 of this series (without the first patch)?

thanks,
Shakeel
From: Kairui Song <kasong@tencent.com>

Currently, every list_lru has a per-node lock that protects adding,
deletion, isolation, and reparenting of all list_lru_one instances
belonging to this list_lru on this node. This lock contention is heavy
when multiple cgroups modify the same list_lru.

This can be alleviated by splitting the lock into per-cgroup scope.

To achieve this, this series reworks and optimizes the reparenting
process step by step, making it possible to have a stable list_lru_one
and to pin a list_lru_one, and then splits the lock into per-cgroup
scope.

The result is reduced LOC and better performance: I see a ~25%
improvement for the multi-cgroup SWAP over ZRAM workload and a ~10%
improvement for the multi-cgroup inode / dentry workload, as tested in
PATCH 6/7:

memhog SWAP test (shadow nodes):
Before:
real 0m20.328s  user 0m4.315s  sys 10m23.639s
real 0m20.440s  user 0m4.142s  sys 10m34.756s
real 0m20.381s  user 0m4.164s  sys 10m29.035s

After:
real 0m15.156s  user 0m4.590s  sys 7m34.361s
real 0m15.161s  user 0m4.776s  sys 7m35.086s
real 0m15.429s  user 0m4.734s  sys 7m42.919s

File read test (inode / dentry):
Before:
real 0m26.939s  user 0m36.322s  sys 6m30.248s
real 0m15.111s  user 0m33.749s  sys 5m4.991s
real 0m16.796s  user 0m33.438s  sys 5m22.865s
real 0m15.256s  user 0m34.060s  sys 4m56.870s
real 0m14.826s  user 0m33.531s  sys 4m55.907s
real 0m15.664s  user 0m35.619s  sys 6m3.638s
real 0m15.746s  user 0m34.066s  sys 4m56.519s

After:
real 0m22.166s  user 0m35.155s  sys 6m21.045s
real 0m13.753s  user 0m34.554s  sys 4m40.982s
real 0m13.815s  user 0m34.693s  sys 4m39.605s
real 0m13.495s  user 0m34.372s  sys 4m40.776s
real 0m13.895s  user 0m34.005s  sys 4m39.061s
real 0m13.629s  user 0m33.476s  sys 4m43.626s
real 0m14.001s  user 0m33.463s  sys 4m41.261s

PATCH 1/7: Fixes a long-existing bug, so shadow nodes are accounted to
the right cgroup and put on the right list_lru.
PATCH 2/7 - 4/7: Clean up.
PATCH 5/7: Reworks and optimizes the reparenting process, avoiding
touching kmemcg_id on reparenting as a first step.
PATCH 6/7: Makes it possible to pin a list_lru_one to prevent racing
with reparenting, and splits the lock into per-cgroup scope.

Kairui Song (7):
  mm/swap, workingset: make anon workingset nodes memcg aware
  mm/list_lru: don't pass unnecessary key parameters
  mm/list_lru: don't export list_lru_add
  mm/list_lru: code clean up for reparenting
  mm/list_lru: simplify reparenting and initial allocation
  mm/list_lru: split the lock to per-cgroup scope
  mm/list_lru: Simplify the list_lru walk callback function

 drivers/android/binder_alloc.c |   6 +-
 drivers/android/binder_alloc.h |   2 +-
 fs/dcache.c                    |   4 +-
 fs/gfs2/quota.c                |   2 +-
 fs/inode.c                     |   5 +-
 fs/nfs/nfs42xattr.c            |   4 +-
 fs/nfsd/filecache.c            |   5 +-
 fs/xfs/xfs_buf.c               |   2 -
 fs/xfs/xfs_qm.c                |   6 +-
 include/linux/list_lru.h       |  26 ++-
 mm/list_lru.c                  | 387 +++++++++++++++++----------------
 mm/memcontrol.c                |  10 +-
 mm/swap_state.c                |   3 +-
 mm/workingset.c                |  20 +-
 mm/zswap.c                     |  12 +-
 15 files changed, 246 insertions(+), 248 deletions(-)
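[Editorial note: below is a minimal userspace sketch of the locking change
the cover letter describes, not the kernel patch itself. Before the split,
every per-cgroup list under a node shares one node-wide lock; after it,
each per-cgroup list carries its own lock. The names (lru_node, lru_one,
add_item_*) are illustrative stand-ins for the kernel's list_lru types,
and pthread mutexes stand in for kernel spinlocks.]

	/*
	 * Illustrative sketch only: compare one node-wide lock shared by
	 * every cgroup's list against one lock embedded in each
	 * per-cgroup list. Compile with -lpthread.
	 */
	#include <pthread.h>

	struct lru_one {
		long nr_items;              /* stand-in for the per-cgroup list */
		pthread_mutex_t lock;       /* "after": per-cgroup lock */
	};

	struct lru_node {
		pthread_mutex_t lock;       /* "before": one lock for all cgroups */
		struct lru_one *per_cgroup; /* array indexed by cgroup id */
		int nr_cgroups;
	};

	/* Before the split: adds from any cgroup serialize on the node lock. */
	void add_item_shared_lock(struct lru_node *node, int cgroup_id)
	{
		pthread_mutex_lock(&node->lock);
		node->per_cgroup[cgroup_id].nr_items++;
		pthread_mutex_unlock(&node->lock);
	}

	/* After the split: adds from different cgroups take different locks. */
	void add_item_split_lock(struct lru_node *node, int cgroup_id)
	{
		struct lru_one *lru = &node->per_cgroup[cgroup_id];

		pthread_mutex_lock(&lru->lock);
		lru->nr_items++;
		pthread_mutex_unlock(&lru->lock);
	}

Even in this toy model the reason most of the series is spent on the
reparenting rework is visible: once each per-cgroup list has its own lock,
anything that moves items between lists (as reparenting does when a cgroup
dies) needs a stable, pinnable list_lru_one to lock against, which is what
the earlier patches in the series provide.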