Message ID | 20210914072938.6440-1-songmuchun@bytedance.com
---|---
Series | Optimize list lru memory consumption
On Tue, Sep 14, 2021 at 03:28:22PM +0800, Muchun Song wrote:
> So we have to convert to the new API for all filesystems, which is done in
> one patch. Some filesystems are easy to convert (just replace
> kmem_cache_alloc() with alloc_inode_sb()), while other filesystems need to
> do more work.

From what I can tell, there are 54 file systems for which it was a
trivial one-line change, and two (f2fs and nfs42) that were a tad bit
more complex.

> In order to make it easy for maintainers of different
> filesystems to review their own maintained part, I split the patch into
> per-filesystem patches in this version. I am not sure if this is a good
> idea, because there are going to be more commits.

What I'd actually suggest is that you combine all of the trivial file
system changes into a single commit, and keep the two more complex
changes for f2fs and nfs42 in separate commits.

Acked-by: Theodore Ts'o <tytso@mit.edu>

... for the ext4 related change.

					- Ted
On Wed, Sep 15, 2021 at 4:23 AM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Tue, Sep 14, 2021 at 03:28:22PM +0800, Muchun Song wrote:
> > So we have to convert to the new API for all filesystems, which is done in
> > one patch. Some filesystems are easy to convert (just replace
> > kmem_cache_alloc() with alloc_inode_sb()), while other filesystems need to
> > do more work.
>
> From what I can tell, there are 54 file systems for which it was a
> trivial one-line change, and two (f2fs and nfs42) that were a tad bit
> more complex.

Definitely right. Thanks for your clarification.

> > In order to make it easy for maintainers of different
> > filesystems to review their own maintained part, I split the patch into
> > per-filesystem patches in this version. I am not sure if this is a good
> > idea, because there are going to be more commits.
>
> What I'd actually suggest is that you combine all of the trivial file
> system changes into a single commit, and keep the two more complex
> changes for f2fs and nfs42 in separate commits.

Got it. Will do in the next version.

> Acked-by: Theodore Ts'o <tytso@mit.edu>

Thanks.

> ... for the ext4 related change.
>
> 					- Ted
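For reference, the "trivial one-line change" counted above is essentially swapping the
allocation call inside each filesystem's ->alloc_inode() implementation. Below is a
minimal sketch for a hypothetical "foofs" (all names here are illustrative, not from the
series, and the exact alloc_inode_sb() signature is whatever patch 9 defines); the f2fs
and nfs42 conversions are larger because they touch more than this single call, as the
diffstat quoted later in the thread shows.

	/*
	 * Sketch of the trivial per-filesystem conversion, assuming the
	 * alloc_inode_sb(sb, cache, gfp) helper introduced by patch 9.
	 * "foofs" and its cache are hypothetical; GFP flags and init
	 * details vary per filesystem.
	 */
	#include <linux/fs.h>
	#include <linux/slab.h>

	struct foofs_inode_info {
		/* filesystem-private fields would live here */
		struct inode vfs_inode;
	};

	static struct kmem_cache *foofs_inode_cachep;

	static struct inode *foofs_alloc_inode(struct super_block *sb)
	{
		struct foofs_inode_info *fi;

		/* was: fi = kmem_cache_alloc(foofs_inode_cachep, GFP_KERNEL); */
		fi = alloc_inode_sb(sb, foofs_inode_cachep, GFP_KERNEL);
		if (!fi)
			return NULL;
		return &fi->vfs_inode;
	}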
On Tue, Sep 14, 2021 at 03:28:22PM +0800, Muchun Song wrote:
> We introduced alloc_inode_sb() in previous version 2, which sets up the
> inode reclaim context properly, to allocate filesystem-specific inodes.
> So we have to convert to the new API for all filesystems, which is done in
> one patch. Some filesystems are easy to convert (just replace
> kmem_cache_alloc() with alloc_inode_sb()), while other filesystems need to
> do more work. In order to make it easy for maintainers of different
> filesystems to review their own maintained part, I split the patch into
> per-filesystem patches in this version. I am not sure if this is a good
> idea, because there are going to be more commits.
>
> On our server, we found a suspected memory leak problem: the kmalloc-32
> cache consumes more than 6GB of memory, while other kmem_caches consume
> less than 2GB.
>
> After in-depth analysis, the memory consumption of the kmalloc-32 slab
> cache turned out to be caused by list_lru_one allocations.
>
> crash> p memcg_nr_cache_ids
> memcg_nr_cache_ids = $2 = 24574
>
> memcg_nr_cache_ids is very large, and the memory consumption of each
> list_lru can be calculated with the following formula.
>
>   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
>
> There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
>
> crash> list super_blocks | wc -l
> 952
>
> Every mount registers 2 list_lrus, one for inodes and another for
> dentries. There are 952 super_blocks, so the total memory is
> 952 * 2 * 3 MB (~5.6GB). But the current number of memory cgroups is
> less than 500, so I guess more than 12286 memory cgroups have been
> created on this machine (I do not know why there are so many cgroups;
> it may be a user bug, or the user really wants to do that). Because
> memcg_nr_cache_ids has not been reduced to a suitable value, a lot of
> memory is wasted. If we want to reduce memcg_nr_cache_ids, we have to
> *reboot* the server. This is not what we want.
>
> In order to reduce memcg_nr_cache_ids, I had posted a patchset [1] to do
> this. But it did not fundamentally solve the problem.
>
> We currently allocate scope for every memcg to be able to be tracked on
> every superblock instantiated in the system, regardless of whether that
> superblock is even accessible to that memcg.
>
> These huge memcg counts come from container hosts where memcgs are
> confined to just a small subset of the total number of superblocks that
> are instantiated at any given point in time.
>
> For these systems with huge container counts, list_lru does not need the
> capability of tracking every memcg on every superblock.
>
> What it comes down to is that a list_lru is only needed for a given memcg
> if that memcg is instantiating and freeing objects on that list_lru.
>
> As Dave said, "Which makes me think we should be moving more towards 'add
> the memcg to the list_lru at the first insert' model rather than
> 'instantiate all at memcg init time just in case'."
>
> This patchset aims to optimize the list_lru memory consumption from
> several aspects.
>
> Patch 1-6 are code simplification.
> Patch 7 converts the array from per-memcg per-node to per-memcg.
> Patch 8 introduces kmem_cache_alloc_lru().
> Patch 9 introduces alloc_inode_sb().
> Patch 10-66 convert all filesystems to alloc_inode_sb() respectively.

There is nowadays also ntfs3. If you do not plan to convert it, please
CC me at least so that I can do it when this series lands.

Argillander

> Patch 70 makes list_lru allocation dynamic.
> Patch 72 uses an xarray to optimize the per-memcg pointer array size.
> Patch 73-76 are code simplification.
>
> I have done a simple test to show the optimization. I create 10k memory
> cgroups and mount 10k filesystems on the system, then use the free command
> to show how much memory the system consumes after this operation (there
> are 2 numa nodes in the system).
>
> +-----------------------+------------------------+
> | condition             | memory consumption     |
> +-----------------------+------------------------+
> | without this patchset | 24464 MB               |
> +-----------------------+------------------------+
> | after patch 7         | 21957 MB               | <--------+
> +-----------------------+------------------------+          |
> | after patch 70        | 6895 MB                |          |
> +-----------------------+------------------------+          |
> | after patch 72        | 4367 MB                |          |
> +-----------------------+------------------------+          |
>                                                             |
> The more the number of nodes, the more obvious the effect---+
>
> BTW, there was a recent discussion [2] on the same issue.
>
> [1] https://lore.kernel.org/linux-fsdevel/20210428094949.43579-1-songmuchun@bytedance.com/
> [2] https://lore.kernel.org/linux-fsdevel/20210405054848.GA1077931@in.ibm.com/
>
> This series not only optimizes the memory usage of list_lru but also
> simplifies the code.
>
> Changelog in v3:
>  - Fix mixing advanced and normal XArray concepts (Thanks to Matthew).
>  - Split one patch into per-filesystem patches.
>
> Changelog in v2:
>  - Update Documentation/filesystems/porting.rst suggested by Dave.
>  - Add a comment above alloc_inode_sb() suggested by Dave.
>  - Rework some patch's commit log.
>  - Add patch 18-21.
>
> Thanks Dave.
>
> Muchun Song (76):
>   mm: list_lru: fix the return value of list_lru_count_one()
>   mm: memcontrol: remove kmemcg_id reparenting
>   mm: memcontrol: remove the kmem states
>   mm: memcontrol: move memcg_online_kmem() to mem_cgroup_css_online()
>   mm: list_lru: remove holding lru lock
>   mm: list_lru: only add memcg-aware lrus to the global lru list
>   mm: list_lru: optimize memory consumption of arrays
>   mm: introduce kmem_cache_alloc_lru
>   fs: introduce alloc_inode_sb() to allocate filesystems specific inode
>   dax: allocate inode by using alloc_inode_sb()
>   9p: allocate inode by using alloc_inode_sb()
>   adfs: allocate inode by using alloc_inode_sb()
>   affs: allocate inode by using alloc_inode_sb()
>   afs: allocate inode by using alloc_inode_sb()
>   befs: allocate inode by using alloc_inode_sb()
>   bfs: allocate inode by using alloc_inode_sb()
>   block: allocate inode by using alloc_inode_sb()
>   btrfs: allocate inode by using alloc_inode_sb()
>   ceph: allocate inode by using alloc_inode_sb()
>   cifs: allocate inode by using alloc_inode_sb()
>   coda: allocate inode by using alloc_inode_sb()
>   ecryptfs: allocate inode by using alloc_inode_sb()
>   efs: allocate inode by using alloc_inode_sb()
>   erofs: allocate inode by using alloc_inode_sb()
>   exfat: allocate inode by using alloc_inode_sb()
>   ext2: allocate inode by using alloc_inode_sb()
>   ext4: allocate inode by using alloc_inode_sb()
>   fat: allocate inode by using alloc_inode_sb()
>   freevxfs: allocate inode by using alloc_inode_sb()
>   fuse: allocate inode by using alloc_inode_sb()
>   gfs2: allocate inode by using alloc_inode_sb()
>   hfs: allocate inode by using alloc_inode_sb()
>   hfsplus: allocate inode by using alloc_inode_sb()
>   hostfs: allocate inode by using alloc_inode_sb()
>   hpfs: allocate inode by using alloc_inode_sb()
>   hugetlbfs: allocate inode by using alloc_inode_sb()
>   isofs: allocate inode by using alloc_inode_sb()
>   jffs2: allocate inode by using alloc_inode_sb()
>   jfs: allocate inode by using alloc_inode_sb()
>   minix: allocate inode by using alloc_inode_sb()
>   nfs: allocate inode by using alloc_inode_sb()
>   nilfs2: allocate inode by using alloc_inode_sb()
>   ntfs: allocate inode by using alloc_inode_sb()
>   ocfs2: allocate inode by using alloc_inode_sb()
>   openpromfs: allocate inode by using alloc_inode_sb()
>   orangefs: allocate inode by using alloc_inode_sb()
>   overlayfs: allocate inode by using alloc_inode_sb()
>   proc: allocate inode by using alloc_inode_sb()
>   qnx4: allocate inode by using alloc_inode_sb()
>   qnx6: allocate inode by using alloc_inode_sb()
>   reiserfs: allocate inode by using alloc_inode_sb()
>   romfs: allocate inode by using alloc_inode_sb()
>   squashfs: allocate inode by using alloc_inode_sb()
>   sysv: allocate inode by using alloc_inode_sb()
>   ubifs: allocate inode by using alloc_inode_sb()
>   udf: allocate inode by using alloc_inode_sb()
>   ufs: allocate inode by using alloc_inode_sb()
>   vboxsf: allocate inode by using alloc_inode_sb()
>   xfs: allocate inode by using alloc_inode_sb()
>   zonefs: allocate inode by using alloc_inode_sb()
>   ipc: allocate inode by using alloc_inode_sb()
>   shmem: allocate inode by using alloc_inode_sb()
>   net: allocate inode by using alloc_inode_sb()
>   rpc: allocate inode by using alloc_inode_sb()
>   f2fs: allocate inode by using alloc_inode_sb()
>   nfs42: use a specific kmem_cache to allocate nfs4_xattr_entry
>   mm: dcache: use kmem_cache_alloc_lru() to allocate dentry
>   xarray: use kmem_cache_alloc_lru to allocate xa_node
>   mm: workingset: use xas_set_lru() to pass shadow_nodes
>   mm: list_lru: allocate list_lru_one only when needed
>   mm: list_lru: rename memcg_drain_all_list_lrus to
>     memcg_reparent_list_lrus
>   mm: list_lru: replace linear array with xarray
>   mm: memcontrol: reuse memory cgroup ID for kmem ID
>   mm: memcontrol: fix cannot alloc the maximum memcg ID
>   mm: list_lru: rename list_lru_per_memcg to list_lru_memcg
>   mm: memcontrol: rename memcg_cache_id to memcg_kmem_id
>
>  Documentation/filesystems/porting.rst | 5 +
>  drivers/dax/super.c | 2 +-
>  fs/9p/vfs_inode.c | 2 +-
>  fs/adfs/super.c | 2 +-
>  fs/affs/super.c | 2 +-
>  fs/afs/super.c | 2 +-
>  fs/befs/linuxvfs.c | 2 +-
>  fs/bfs/inode.c | 2 +-
>  fs/block_dev.c | 2 +-
>  fs/btrfs/inode.c | 2 +-
>  fs/ceph/inode.c | 2 +-
>  fs/cifs/cifsfs.c | 2 +-
>  fs/coda/inode.c | 2 +-
>  fs/dcache.c | 3 +-
>  fs/ecryptfs/super.c | 2 +-
>  fs/efs/super.c | 2 +-
>  fs/erofs/super.c | 2 +-
>  fs/exfat/super.c | 2 +-
>  fs/ext2/super.c | 2 +-
>  fs/ext4/super.c | 2 +-
>  fs/f2fs/super.c | 8 +-
>  fs/fat/inode.c | 2 +-
>  fs/freevxfs/vxfs_super.c | 2 +-
>  fs/fuse/inode.c | 2 +-
>  fs/gfs2/super.c | 2 +-
>  fs/hfs/super.c | 2 +-
>  fs/hfsplus/super.c | 2 +-
>  fs/hostfs/hostfs_kern.c | 2 +-
>  fs/hpfs/super.c | 2 +-
>  fs/hugetlbfs/inode.c | 2 +-
>  fs/inode.c | 2 +-
>  fs/isofs/inode.c | 2 +-
>  fs/jffs2/super.c | 2 +-
>  fs/jfs/super.c | 2 +-
>  fs/minix/inode.c | 2 +-
>  fs/nfs/inode.c | 2 +-
>  fs/nfs/nfs42xattr.c | 95 ++++---
>  fs/nilfs2/super.c | 2 +-
>  fs/ntfs/inode.c | 2 +-
>  fs/ocfs2/dlmfs/dlmfs.c | 2 +-
>  fs/ocfs2/super.c | 2 +-
>  fs/openpromfs/inode.c | 2 +-
>  fs/orangefs/super.c | 2 +-
>  fs/overlayfs/super.c | 2 +-
>  fs/proc/inode.c | 2 +-
>  fs/qnx4/inode.c | 2 +-
>  fs/qnx6/inode.c | 2 +-
>  fs/reiserfs/super.c | 2 +-
>  fs/romfs/super.c | 2 +-
>  fs/squashfs/super.c | 2 +-
>  fs/sysv/inode.c | 2 +-
>  fs/ubifs/super.c | 2 +-
>  fs/udf/super.c | 2 +-
>  fs/ufs/super.c | 2 +-
>  fs/vboxsf/super.c | 2 +-
>  fs/xfs/xfs_icache.c | 2 +-
>  fs/zonefs/super.c | 2 +-
>  include/linux/fs.h | 11 +
>  include/linux/list_lru.h | 16 +-
>  include/linux/memcontrol.h | 49 ++--
>  include/linux/slab.h | 3 +
>  include/linux/swap.h | 5 +-
>  include/linux/xarray.h | 9 +-
>  ipc/mqueue.c | 2 +-
>  lib/xarray.c | 10 +-
>  mm/list_lru.c | 472 ++++++++++++++++------------------
>  mm/memcontrol.c | 190 ++------------
>  mm/shmem.c | 2 +-
>  mm/slab.c | 39 ++-
>  mm/slab.h | 17 +-
>  mm/slob.c | 6 +
>  mm/slub.c | 42 ++-
>  mm/workingset.c | 2 +-
>  net/socket.c | 2 +-
>  net/sunrpc/rpc_pipe.c | 2 +-
>  75 files changed, 498 insertions(+), 598 deletions(-)
>
> --
> 2.11.0
>
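As a sanity check on the formula quoted in the cover letter above, the numbers can be
reproduced with a few lines of plain userspace C. The inputs (4 NUMA nodes,
memcg_nr_cache_ids = 24574, 32-byte kmalloc-32 objects, 952 super_blocks with 2
list_lrus each) are taken from the crash output in the mail; the 32-byte object size is
an approximation, so treat this as a back-of-the-envelope sketch rather than an exact
accounting.

	/*
	 * Back-of-the-envelope check of:
	 *   per list_lru: num_numa_node * memcg_nr_cache_ids * 32 bytes
	 *   total:        super_blocks * 2 list_lrus * per-list_lru size
	 * Prints roughly 3.0 MB per list_lru and 5.6 GB in total,
	 * matching the figures in the cover letter.
	 */
	#include <stdio.h>

	int main(void)
	{
		unsigned long long nodes = 4, ids = 24574, obj_size = 32;
		unsigned long long per_lru = nodes * ids * obj_size;

		unsigned long long super_blocks = 952, lrus_per_sb = 2;
		unsigned long long total = super_blocks * lrus_per_sb * per_lru;

		printf("per list_lru: %llu bytes (~%.1f MB)\n",
		       per_lru, per_lru / (1024.0 * 1024.0));
		printf("total:        %llu bytes (~%.1f GB)\n",
		       total, total / (1024.0 * 1024.0 * 1024.0));
		return 0;
	}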
On Sat, Sep 18, 2021 at 2:56 PM Kari Argillander
<kari.argillander@gmail.com> wrote:
>
> On Tue, Sep 14, 2021 at 03:28:22PM +0800, Muchun Song wrote:
> > We introduced alloc_inode_sb() in previous version 2, which sets up the
> > inode reclaim context properly, to allocate filesystem-specific inodes.
[...]
> > Patch 10-66 convert all filesystems to alloc_inode_sb() respectively.
>
> There is nowadays also ntfs3. If you do not plan to convert it, please
> CC me at least so that I can do it when this series lands.
>
> Argillander

Wow, a new filesystem. I didn't notice it before. I'll cover it in the
next version and Cc you if you can do a review.

Thanks for your reminder.
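For context on how the two helpers discussed throughout this thread fit together:
patch 8 adds kmem_cache_alloc_lru() and patch 9 builds alloc_inode_sb() on top of it.
The sketch below is an assumption about the likely shape of that helper, not a quote
from the series; the point is that the allocation carries the superblock's inode
list_lru, which is what allows a memcg's list_lru_one to be created lazily on first
use instead of being pre-allocated for every memcg on every superblock.

	/*
	 * Sketch (assumption, not the series' actual code): alloc_inode_sb()
	 * is expected to forward to kmem_cache_alloc_lru() with the
	 * superblock's inode LRU, so the slab allocator can set up the
	 * memcg-aware list_lru_one only for memcgs that actually allocate
	 * inodes on this superblock.
	 */
	#include <linux/fs.h>
	#include <linux/slab.h>
	#include <linux/list_lru.h>

	static inline void *alloc_inode_sb_sketch(struct super_block *sb,
						  struct kmem_cache *cache,
						  gfp_t gfp)
	{
		return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
	}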