
High kmalloc-32 slab cache consumption with 10k containers

Message ID 20210405054848.GA1077931@in.ibm.com (mailing list archive)
State New, archived
Series High kmalloc-32 slab cache consumption with 10k containers

Commit Message

Bharata B Rao April 5, 2021, 5:48 a.m. UTC
Hi,

When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
server (160 CPUs, 2 NUMA nodes, 256G memory), memory consumption is seen to
increase considerably (to around 172G) while the containers are running.
Most of it comes from slab (149G), and within slab, the majority comes from
the kmalloc-32 cache (102G).

The major consumer of the kmalloc-32 slab cache turns out to be the list_head
allocations for list_lru_one lists. These lists are created whenever an
FS mount happens. Specifically, two such lists are registered by alloc_super(),
one for the dentry shrinker list and another for the inode shrinker list,
and these lists are created for all possible NUMA nodes and for all given
memcgs (memcg_nr_cache_ids of them, to be precise).
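For context, the two registrations mentioned above are made from alloc_super();
roughly, the relevant part of fs/super.c in this kernel timeframe looks like the
abridged sketch below (shown only for orientation, not part of the patch):

	/* Abridged sketch of alloc_super(): the per-superblock shrinker is
	 * NUMA- and memcg-aware, and one memcg-aware list_lru is set up for
	 * dentries and one for inodes. Each list_lru then carries a
	 * list_lru_one per (possible node, memcg_cache_id) pair. */
	s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
	if (prealloc_shrinker(&s->s_shrink))
		goto fail;
	if (list_lru_init_memcg(&s->s_dentry_lru, &s->s_shrink))
		goto fail;
	if (list_lru_init_memcg(&s->s_inode_lru, &s->s_shrink))
		goto fail;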

If,

A = Nr of allocation requests per mount: 2 (one each for the dentry and inode list)
B = Nr of possible NUMA nodes
C = memcg_nr_cache_ids
D = size of each kmalloc-32 object: 32 bytes,

then for every mount, the amount of memory consumed by kmalloc-32 slab
cache for list_lru creation is A*B*C*D bytes.
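
To make the magnitude concrete with the numbers reported later in this thread
(B = 2 possible nodes, C = 12286 once 10k memcgs exist): each mount costs
2 * 2 * 12286 * 32 bytes, roughly 1.5MB, and with 7 such mounts per container
across 10000 containers this adds up to about 102.5G worth of kmalloc-32
objects.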

The following factors contribute to the excessive allocations:

- Lists are created for possible NUMA nodes.
- memcg_nr_cache_ids grows in bulk (see memcg_alloc_cache_id()) and additional
  list_lrus are created when it grows. Thus we end up creating list_lru_one
  list_heads even for those memcgs which are yet to be created.
  For example, when 10000 memcgs are created, memcg_nr_cache_ids reaches
  a value of 12286.
- When a memcg goes offline, the list elements are drained to the parent
  memcg, but the list_head entry remains.
- The lists are destroyed only when the FS is unmounted. So list_heads
  for non-existing memcgs remain and continue to contribute to the
  kmalloc-32 allocations. This is presumably done for performance
  reasons, as they get reused when new memcgs are created, but they end up
  consuming slab memory until then.
- In the case of containers, a few filesystems get mounted that are specific
  to the container namespace and hence to a particular memcg, but we
  end up creating lists for all the memcgs.
  As an example, if 7 FS mounts are done for every container and
  10k containers are created, we end up creating 2*7*12286 list_lru_one
  lists for each NUMA node. It appears that, in the container case, elements
  will get added to only 2*7=14 of these lists.

One straightforward way to prevent these excessive list_lru_one
allocations is to limit list_lru_one creation to only the
relevant memcg. However, I don't see an easy way to figure out
that relevant memcg from the FS mount path (alloc_super()).

As an alternative approach, I have the hack below that does lazy
list_lru creation: the memcg-specific list is created and initialized
only when there is a request to add an element to that particular
list. Though I am not sure about the full impact of this change
on the owners of the lists, or about its performance impact,
the overall savings look good.

Used memory
		Before		During		After
W/o patch	23G		172G		40G
W/  patch	23G		69G		29G

Slab consumption
		Before		During		After
W/o patch	1.5G		149G		22G
W/  patch	1.5G		45G		10G

Number of kmalloc-32 allocations
		Before		During		After
W/o patch	178176		3442409472	388933632
W/  patch	190464		468992		468992

Any thoughts on other approaches to address this scenario, as well as
specific comments about the approach I have taken, are
appreciated. Meanwhile, the patch looks like below:

From 9444a0c6734c2853057b1f486f85da2c409fdc84 Mon Sep 17 00:00:00 2001
From: Bharata B Rao <bharata@linux.ibm.com>
Date: Wed, 31 Mar 2021 18:21:45 +0530
Subject: [PATCH 1/1] mm: list_lru: Allocate list_lru_one only when required.

Don't pre-allocate list_lru_one list heads for all memcg_cache_ids.
Instead, allocate and initialize them only when required.

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
---
 mm/list_lru.c | 79 +++++++++++++++++++++++++--------------------------
 1 file changed, 38 insertions(+), 41 deletions(-)

Comments

kernel test robot April 5, 2021, 8:22 a.m. UTC | #1
Hi Bharata,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linux/master]
[also build test ERROR on linus/master v5.12-rc6 next-20210401]
[cannot apply to hnaz-linux-mm/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Bharata-B-Rao/High-kmalloc-32-slab-cache-consumption-with-10k-containers/20210405-135124
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 5e46d1b78a03d52306f21f77a4e4a144b6d31486
config: s390-randconfig-r032-20210405 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 2760a808b9916a2839513b7fd7314a464f52481e)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install s390 cross compiling tool for clang build
        # apt-get install binutils-s390x-linux-gnu
        # https://github.com/0day-ci/linux/commit/b7ba3522029771bd74c8b6324de9ce20b5a06593
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Bharata-B-Rao/High-kmalloc-32-slab-cache-consumption-with-10k-containers/20210405-135124
        git checkout b7ba3522029771bd74c8b6324de9ce20b5a06593
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=s390 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from mm/list_lru.c:14:
   In file included from include/linux/memcontrol.h:22:
   In file included from include/linux/writeback.h:14:
   In file included from include/linux/blk-cgroup.h:23:
   In file included from include/linux/blkdev.h:26:
   In file included from include/linux/scatterlist.h:9:
   In file included from arch/s390/include/asm/io.h:80:
   include/asm-generic/io.h:464:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __raw_readb(PCI_IOBASE + addr);
                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:477:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/big_endian.h:36:59: note: expanded from macro '__le16_to_cpu'
   #define __le16_to_cpu(x) __swab16((__force __u16)(__le16)(x))
                                                             ^
   include/uapi/linux/swab.h:102:54: note: expanded from macro '__swab16'
   #define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
                                                        ^
   In file included from mm/list_lru.c:14:
   In file included from include/linux/memcontrol.h:22:
   In file included from include/linux/writeback.h:14:
   In file included from include/linux/blk-cgroup.h:23:
   In file included from include/linux/blkdev.h:26:
   In file included from include/linux/scatterlist.h:9:
   In file included from arch/s390/include/asm/io.h:80:
   include/asm-generic/io.h:490:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/big_endian.h:34:59: note: expanded from macro '__le32_to_cpu'
   #define __le32_to_cpu(x) __swab32((__force __u32)(__le32)(x))
                                                             ^
   include/uapi/linux/swab.h:115:54: note: expanded from macro '__swab32'
   #define __swab32(x) (__u32)__builtin_bswap32((__u32)(x))
                                                        ^
   In file included from mm/list_lru.c:14:
   In file included from include/linux/memcontrol.h:22:
   In file included from include/linux/writeback.h:14:
   In file included from include/linux/blk-cgroup.h:23:
   In file included from include/linux/blkdev.h:26:
   In file included from include/linux/scatterlist.h:9:
   In file included from arch/s390/include/asm/io.h:80:
   include/asm-generic/io.h:501:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writeb(value, PCI_IOBASE + addr);
                               ~~~~~~~~~~ ^
   include/asm-generic/io.h:511:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
   include/asm-generic/io.h:521:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
   include/asm-generic/io.h:609:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsb(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:617:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsw(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:625:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsl(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:634:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesb(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
   include/asm-generic/io.h:643:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesw(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
   include/asm-generic/io.h:652:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesl(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
>> mm/list_lru.c:138:49: error: no member named 'memcg_lrus' in 'struct list_lru_node'
                           memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true);
                                                                  ~~~~  ^
   include/linux/rcupdate.h:562:31: note: expanded from macro 'rcu_dereference_protected'
           __rcu_dereference_protected((p), (c), __rcu)
                                        ^
   include/linux/rcupdate.h:383:12: note: expanded from macro '__rcu_dereference_protected'
           ((typeof(*p) __force __kernel *)(p)); \
                     ^
>> mm/list_lru.c:138:49: error: no member named 'memcg_lrus' in 'struct list_lru_node'
                           memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true);
                                                                  ~~~~  ^
   include/linux/rcupdate.h:562:31: note: expanded from macro 'rcu_dereference_protected'
           __rcu_dereference_protected((p), (c), __rcu)
                                        ^
   include/linux/rcupdate.h:383:35: note: expanded from macro '__rcu_dereference_protected'
           ((typeof(*p) __force __kernel *)(p)); \
                                            ^
>> mm/list_lru.c:138:15: error: assigning to 'struct list_lru_memcg *' from incompatible type 'void'
                           memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true);
                                      ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   12 warnings and 3 errors generated.


vim +138 mm/list_lru.c

   120	
   121	bool list_lru_add(struct list_lru *lru, struct list_head *item)
   122	{
   123		int nid = page_to_nid(virt_to_page(item));
   124		struct list_lru_node *nlru = &lru->node[nid];
   125		struct mem_cgroup *memcg;
   126		struct list_lru_one *l;
   127		struct list_lru_memcg *memcg_lrus;
   128	
   129		spin_lock(&nlru->lock);
   130		if (list_empty(item)) {
   131			l = list_lru_from_kmem(nlru, item, &memcg);
   132			if (!l) {
   133				l = kmalloc(sizeof(struct list_lru_one), GFP_ATOMIC);
   134				if (!l)
   135					goto out;
   136	
   137				init_one_lru(l);
 > 138				memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true);
   139				memcg_lrus->lru[memcg_cache_id(memcg)] = l;
   140			}
   141			list_add_tail(item, &l->list);
   142			/* Set shrinker bit if the first element was added */
   143			if (!l->nr_items++)
   144				memcg_set_shrinker_bit(memcg, nid,
   145						       lru_shrinker_id(lru));
   146			nlru->nr_items++;
   147			spin_unlock(&nlru->lock);
   148			return true;
   149		}
   150	out:
   151		spin_unlock(&nlru->lock);
   152		return false;
   153	}
   154	EXPORT_SYMBOL_GPL(list_lru_add);
   155	
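
The error itself is easy to pin down: struct list_lru_node only has a
memcg_lrus member when CONFIG_MEMCG_KMEM is enabled, and the new allocation
path in list_lru_add() dereferences it unconditionally, so a
!CONFIG_MEMCG_KMEM config such as this s390 randconfig fails to build. A
hedged sketch of one way to confine the lazy allocation to memcg-enabled
builds follows (the helper name list_lru_alloc_one() is made up for
illustration and is not part of the posted patch):

#ifdef CONFIG_MEMCG_KMEM
static struct list_lru_one *list_lru_alloc_one(struct list_lru_node *nlru,
					       struct mem_cgroup *memcg)
{
	struct list_lru_memcg *memcg_lrus;
	struct list_lru_one *l;

	l = kmalloc(sizeof(*l), GFP_ATOMIC);
	if (!l)
		return NULL;

	init_one_lru(l);
	memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true);
	memcg_lrus->lru[memcg_cache_id(memcg)] = l;
	return l;
}
#else
static struct list_lru_one *list_lru_alloc_one(struct list_lru_node *nlru,
					       struct mem_cgroup *memcg)
{
	/* Without CONFIG_MEMCG_KMEM there is only the built-in nlru->lru,
	 * so list_lru_from_kmem() never returns NULL in the first place. */
	return &nlru->lru;
}
#endif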

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Yang Shi April 5, 2021, 6:08 p.m. UTC | #2
On Sun, Apr 4, 2021 at 10:49 PM Bharata B Rao <bharata@linux.ibm.com> wrote:
>
> Hi,
>
> When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> consumption increases quite a lot (around 172G) when the containers are
> running. Most of it comes from slab (149G) and within slab, the majority of
> it comes from kmalloc-32 cache (102G)
>
> The major allocator of kmalloc-32 slab cache happens to be the list_head
> allocations of list_lru_one list. These lists are created whenever a
> FS mount happens. Specially two such lists are registered by alloc_super(),
> one for dentry and another for inode shrinker list. And these lists
> are created for all possible NUMA nodes and for all given memcgs
> (memcg_nr_cache_ids to be particular)
>
> If,
>
> A = Nr allocation request per mount: 2 (one for dentry and inode list)
> B = Nr NUMA possible nodes
> C = memcg_nr_cache_ids
> D = size of each kmalloc-32 object: 32 bytes,
>
> then for every mount, the amount of memory consumed by kmalloc-32 slab
> cache for list_lru creation is A*B*C*D bytes.

Yes, this is exactly what the current implementation does.

>
> Following factors contribute to the excessive allocations:
>
> - Lists are created for possible NUMA nodes.

Yes, because filesystem caches (dentry and inode) are NUMA aware.

> - memcg_nr_cache_ids grows in bulk (see memcg_alloc_cache_id() and additional
>   list_lrus are created when it grows. Thus we end up creating list_lru_one
>   list_heads even for those memcgs which are yet to be created.
>   For example, when 10000 memcgs are created, memcg_nr_cache_ids reach
>   a value of 12286.
> - When a memcg goes offline, the list elements are drained to the parent
>   memcg, but the list_head entry remains.
> - The lists are destroyed only when the FS is unmounted. So list_heads
>   for non-existing memcgs remain and continue to contribute to the
>   kmalloc-32 allocation. This is presumably done for performance
>   reason as they get reused when new memcgs are created, but they end up
>   consuming slab memory until then.

The current implementation has list_lrus attached to the super_block, so
the lists can't be freed until the super block is unmounted.

I'm looking into consolidating list_lrus more closely with memcgs. It
means the list_lrus will have the same life cycle as memcgs rather
than as filesystems. This may improve things somewhat. But I suppose
that, for your use case, the filesystem will be unmounted once the
container exits and the memcgs will get offlined.

> - In case of containers, a few file systems get mounted and are specific
>   to the container namespace and hence to a particular memcg, but we
>   end up creating lists for all the memcgs.

Yes, because the kernel is *NOT* aware of containers.

>   As an example, if 7 FS mounts are done for every container and when
>   10k containers are created, we end up creating 2*7*12286 list_lru_one
>   lists for each NUMA node. It appears that no elements will get added
>   to other than 2*7=14 of them in the case of containers.
>
> One straight forward way to prevent this excessive list_lru_one
> allocations is to limit the list_lru_one creation only to the
> relevant memcg. However I don't see an easy way to figure out
> that relevant memcg from FS mount path (alloc_super())
>
> As an alternative approach, I have this below hack that does lazy
> list_lru creation. The memcg-specific list is created and initialized
> only when there is a request to add an element to that particular
> list. Though I am not sure about the full impact of this change
> on the owners of the lists and also the performance impact of this,
> the overall savings look good.

It is fine to reduce the memory consumption for your use case, but I'm
not sure whether this would incur any noticeable overhead for VFS
operations, since list_lru_add() should be called quite often. Then
again, it only needs to allocate the list once (for each memcg +
filesystem), so the overhead might be fine.

And I'm wondering how much memory can be saved for real-life workloads.
I don't expect most containers to be idle in production environments.

Added some more memcg/list_lru experts to this thread; they may have better ideas.

>
> Used memory
>                 Before          During          After
> W/o patch       23G             172G            40G
> W/  patch       23G             69G             29G
>
> Slab consumption
>                 Before          During          After
> W/o patch       1.5G            149G            22G
> W/  patch       1.5G            45G             10G
>
> Number of kmalloc-32 allocations
>                 Before          During          After
> W/o patch       178176          3442409472      388933632
> W/  patch       190464          468992          468992
>
> Any thoughts on other approaches to address this scenario and
> any specific comments about the approach that I have taken is
> appreciated. Meanwhile the patch looks like below:
>
> From 9444a0c6734c2853057b1f486f85da2c409fdc84 Mon Sep 17 00:00:00 2001
> From: Bharata B Rao <bharata@linux.ibm.com>
> Date: Wed, 31 Mar 2021 18:21:45 +0530
> Subject: [PATCH 1/1] mm: list_lru: Allocate list_lru_one only when required.
>
> Don't pre-allocate list_lru_one list heads for all memcg_cache_ids.
> Instead allocate and initialize it only when required.
>
> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> ---
>  mm/list_lru.c | 79 +++++++++++++++++++++++++--------------------------
>  1 file changed, 38 insertions(+), 41 deletions(-)
>
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 6f067b6b935f..b453fa5008cc 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -112,16 +112,32 @@ list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
>  }
>  #endif /* CONFIG_MEMCG_KMEM */
>
> +static void init_one_lru(struct list_lru_one *l)
> +{
> +       INIT_LIST_HEAD(&l->list);
> +       l->nr_items = 0;
> +}
> +
>  bool list_lru_add(struct list_lru *lru, struct list_head *item)
>  {
>         int nid = page_to_nid(virt_to_page(item));
>         struct list_lru_node *nlru = &lru->node[nid];
>         struct mem_cgroup *memcg;
>         struct list_lru_one *l;
> +       struct list_lru_memcg *memcg_lrus;
>
>         spin_lock(&nlru->lock);
>         if (list_empty(item)) {
>                 l = list_lru_from_kmem(nlru, item, &memcg);
> +               if (!l) {
> +                       l = kmalloc(sizeof(struct list_lru_one), GFP_ATOMIC);
> +                       if (!l)
> +                               goto out;
> +
> +                       init_one_lru(l);
> +                       memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true);
> +                       memcg_lrus->lru[memcg_cache_id(memcg)] = l;
> +               }
>                 list_add_tail(item, &l->list);
>                 /* Set shrinker bit if the first element was added */
>                 if (!l->nr_items++)
> @@ -131,6 +147,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
>                 spin_unlock(&nlru->lock);
>                 return true;
>         }
> +out:
>         spin_unlock(&nlru->lock);
>         return false;
>  }
> @@ -176,11 +193,12 @@ unsigned long list_lru_count_one(struct list_lru *lru,
>  {
>         struct list_lru_node *nlru = &lru->node[nid];
>         struct list_lru_one *l;
> -       unsigned long count;
> +       unsigned long count = 0;
>
>         rcu_read_lock();
>         l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
> -       count = READ_ONCE(l->nr_items);
> +       if (l)
> +               count = READ_ONCE(l->nr_items);
>         rcu_read_unlock();
>
>         return count;
> @@ -207,6 +225,9 @@ __list_lru_walk_one(struct list_lru_node *nlru, int memcg_idx,
>         unsigned long isolated = 0;
>
>         l = list_lru_from_memcg_idx(nlru, memcg_idx);
> +       if (!l)
> +               goto out;
> +
>  restart:
>         list_for_each_safe(item, n, &l->list) {
>                 enum lru_status ret;
> @@ -251,6 +272,7 @@ __list_lru_walk_one(struct list_lru_node *nlru, int memcg_idx,
>                         BUG();
>                 }
>         }
> +out:
>         return isolated;
>  }
>
> @@ -312,12 +334,6 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
>  }
>  EXPORT_SYMBOL_GPL(list_lru_walk_node);
>
> -static void init_one_lru(struct list_lru_one *l)
> -{
> -       INIT_LIST_HEAD(&l->list);
> -       l->nr_items = 0;
> -}
> -
>  #ifdef CONFIG_MEMCG_KMEM
>  static void __memcg_destroy_list_lru_node(struct list_lru_memcg *memcg_lrus,
>                                           int begin, int end)
> @@ -328,41 +344,16 @@ static void __memcg_destroy_list_lru_node(struct list_lru_memcg *memcg_lrus,
>                 kfree(memcg_lrus->lru[i]);
>  }
>
> -static int __memcg_init_list_lru_node(struct list_lru_memcg *memcg_lrus,
> -                                     int begin, int end)
> -{
> -       int i;
> -
> -       for (i = begin; i < end; i++) {
> -               struct list_lru_one *l;
> -
> -               l = kmalloc(sizeof(struct list_lru_one), GFP_KERNEL);
> -               if (!l)
> -                       goto fail;
> -
> -               init_one_lru(l);
> -               memcg_lrus->lru[i] = l;
> -       }
> -       return 0;
> -fail:
> -       __memcg_destroy_list_lru_node(memcg_lrus, begin, i);
> -       return -ENOMEM;
> -}
> -
>  static int memcg_init_list_lru_node(struct list_lru_node *nlru)
>  {
>         struct list_lru_memcg *memcg_lrus;
>         int size = memcg_nr_cache_ids;
>
> -       memcg_lrus = kvmalloc(sizeof(*memcg_lrus) +
> +       memcg_lrus = kvzalloc(sizeof(*memcg_lrus) +
>                               size * sizeof(void *), GFP_KERNEL);
>         if (!memcg_lrus)
>                 return -ENOMEM;
>
> -       if (__memcg_init_list_lru_node(memcg_lrus, 0, size)) {
> -               kvfree(memcg_lrus);
> -               return -ENOMEM;
> -       }
>         RCU_INIT_POINTER(nlru->memcg_lrus, memcg_lrus);
>
>         return 0;
> @@ -389,15 +380,10 @@ static int memcg_update_list_lru_node(struct list_lru_node *nlru,
>
>         old = rcu_dereference_protected(nlru->memcg_lrus,
>                                         lockdep_is_held(&list_lrus_mutex));
> -       new = kvmalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
> +       new = kvzalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
>         if (!new)
>                 return -ENOMEM;
>
> -       if (__memcg_init_list_lru_node(new, old_size, new_size)) {
> -               kvfree(new);
> -               return -ENOMEM;
> -       }
> -
>         memcpy(&new->lru, &old->lru, old_size * sizeof(void *));
>
>         /*
> @@ -526,6 +512,7 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
>         struct list_lru_node *nlru = &lru->node[nid];
>         int dst_idx = dst_memcg->kmemcg_id;
>         struct list_lru_one *src, *dst;
> +       struct list_lru_memcg *memcg_lrus;
>
>         /*
>          * Since list_lru_{add,del} may be called under an IRQ-safe lock,
> @@ -534,7 +521,17 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
>         spin_lock_irq(&nlru->lock);
>
>         src = list_lru_from_memcg_idx(nlru, src_idx);
> +       if (!src)
> +               goto out;
> +
>         dst = list_lru_from_memcg_idx(nlru, dst_idx);
> +       if (!dst) {
> +               /* TODO: Use __GFP_NOFAIL? */
> +               dst = kmalloc(sizeof(struct list_lru_one), GFP_ATOMIC);
> +               init_one_lru(dst);
> +               memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true);
> +               memcg_lrus->lru[dst_idx] = dst;
> +       }
>
>         list_splice_init(&src->list, &dst->list);
>
> @@ -543,7 +540,7 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
>                 memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
>                 src->nr_items = 0;
>         }
> -
> +out:
>         spin_unlock_irq(&nlru->lock);
>  }
>
> --
> 2.26.2
>
>
Roman Gushchin April 5, 2021, 6:38 p.m. UTC | #3
On Mon, Apr 05, 2021 at 11:08:26AM -0700, Yang Shi wrote:
> On Sun, Apr 4, 2021 at 10:49 PM Bharata B Rao <bharata@linux.ibm.com> wrote:
> >
> > Hi,
> >
> > When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> > server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> > consumption increases quite a lot (around 172G) when the containers are
> > running. Most of it comes from slab (149G) and within slab, the majority of
> > it comes from kmalloc-32 cache (102G)
> >
> > The major allocator of kmalloc-32 slab cache happens to be the list_head
> > allocations of list_lru_one list. These lists are created whenever a
> > FS mount happens. Specially two such lists are registered by alloc_super(),
> > one for dentry and another for inode shrinker list. And these lists
> > are created for all possible NUMA nodes and for all given memcgs
> > (memcg_nr_cache_ids to be particular)
> >
> > If,
> >
> > A = Nr allocation request per mount: 2 (one for dentry and inode list)
> > B = Nr NUMA possible nodes
> > C = memcg_nr_cache_ids
> > D = size of each kmalloc-32 object: 32 bytes,
> >
> > then for every mount, the amount of memory consumed by kmalloc-32 slab
> > cache for list_lru creation is A*B*C*D bytes.
> 
> Yes, this is exactly what the current implementation does.
> 
> >
> > Following factors contribute to the excessive allocations:
> >
> > - Lists are created for possible NUMA nodes.
> 
> Yes, because filesystem caches (dentry and inode) are NUMA aware.
> 
> > - memcg_nr_cache_ids grows in bulk (see memcg_alloc_cache_id() and additional
> >   list_lrus are created when it grows. Thus we end up creating list_lru_one
> >   list_heads even for those memcgs which are yet to be created.
> >   For example, when 10000 memcgs are created, memcg_nr_cache_ids reach
> >   a value of 12286.
> > - When a memcg goes offline, the list elements are drained to the parent
> >   memcg, but the list_head entry remains.
> > - The lists are destroyed only when the FS is unmounted. So list_heads
> >   for non-existing memcgs remain and continue to contribute to the
> >   kmalloc-32 allocation. This is presumably done for performance
> >   reason as they get reused when new memcgs are created, but they end up
> >   consuming slab memory until then.
> 
> The current implementation has list_lrus attached with super_block. So
> the list can't be freed until the super block is unmounted.
> 
> I'm looking into consolidating list_lrus more closely with memcgs. It
> means the list_lrus will have the same life cycles as memcgs rather
> than filesystems. This may be able to improve some. But I'm supposed
> the filesystem will be unmounted once the container exits and the
> memcgs will get offlined for your usecase.
> 
> > - In case of containers, a few file systems get mounted and are specific
> >   to the container namespace and hence to a particular memcg, but we
> >   end up creating lists for all the memcgs.
> 
> Yes, because the kernel is *NOT* aware of containers.
> 
> >   As an example, if 7 FS mounts are done for every container and when
> >   10k containers are created, we end up creating 2*7*12286 list_lru_one
> >   lists for each NUMA node. It appears that no elements will get added
> >   to other than 2*7=14 of them in the case of containers.
> >
> > One straight forward way to prevent this excessive list_lru_one
> > allocations is to limit the list_lru_one creation only to the
> > relevant memcg. However I don't see an easy way to figure out
> > that relevant memcg from FS mount path (alloc_super())
> >
> > As an alternative approach, I have this below hack that does lazy
> > list_lru creation. The memcg-specific list is created and initialized
> > only when there is a request to add an element to that particular
> > list. Though I am not sure about the full impact of this change
> > on the owners of the lists and also the performance impact of this,
> > the overall savings look good.
> 
> It is fine to reduce the memory consumption for your usecase, but I'm
> not sure if this would incur any noticeable overhead for vfs
> operations since list_lru_add() should be called quite often, but it
> just needs to allocate the list for once (for each memcg +
> filesystem), so the overhead might be fine.
> 
> And I'm wondering how much memory can be saved for real life workload.
> I don't expect most containers are idle in production environments.
> 
> Added some more memcg/list_lru experts in this loop, they may have better ideas.
> 
> >
> > Used memory
> >                 Before          During          After
> > W/o patch       23G             172G            40G
> > W/  patch       23G             69G             29G
> >
> > Slab consumption
> >                 Before          During          After
> > W/o patch       1.5G            149G            22G
> > W/  patch       1.5G            45G             10G
> >
> > Number of kmalloc-32 allocations
> >                 Before          During          After
> > W/o patch       178176          3442409472      388933632
> > W/  patch       190464          468992          468992
> >
> > Any thoughts on other approaches to address this scenario and
> > any specific comments about the approach that I have taken is
> > appreciated. Meanwhile the patch looks like below:
> >
> > From 9444a0c6734c2853057b1f486f85da2c409fdc84 Mon Sep 17 00:00:00 2001
> > From: Bharata B Rao <bharata@linux.ibm.com>
> > Date: Wed, 31 Mar 2021 18:21:45 +0530
> > Subject: [PATCH 1/1] mm: list_lru: Allocate list_lru_one only when required.
> >
> > Don't pre-allocate list_lru_one list heads for all memcg_cache_ids.
> > Instead allocate and initialize it only when required.
> >
> > Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> > ---
> >  mm/list_lru.c | 79 +++++++++++++++++++++++++--------------------------
> >  1 file changed, 38 insertions(+), 41 deletions(-)
> >
> > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > index 6f067b6b935f..b453fa5008cc 100644
> > --- a/mm/list_lru.c
> > +++ b/mm/list_lru.c
> > @@ -112,16 +112,32 @@ list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
> >  }
> >  #endif /* CONFIG_MEMCG_KMEM */
> >
> > +static void init_one_lru(struct list_lru_one *l)
> > +{
> > +       INIT_LIST_HEAD(&l->list);
> > +       l->nr_items = 0;
> > +}
> > +
> >  bool list_lru_add(struct list_lru *lru, struct list_head *item)
> >  {
> >         int nid = page_to_nid(virt_to_page(item));
> >         struct list_lru_node *nlru = &lru->node[nid];
> >         struct mem_cgroup *memcg;
> >         struct list_lru_one *l;
> > +       struct list_lru_memcg *memcg_lrus;
> >
> >         spin_lock(&nlru->lock);
> >         if (list_empty(item)) {
> >                 l = list_lru_from_kmem(nlru, item, &memcg);
> > +               if (!l) {
> > +                       l = kmalloc(sizeof(struct list_lru_one), GFP_ATOMIC);
> > +                       if (!l)
> > +                               goto out;
> > +
> > +                       init_one_lru(l);
> > +                       memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true);
> > +                       memcg_lrus->lru[memcg_cache_id(memcg)] = l;
> > +               }
> >                 list_add_tail(item, &l->list);
> >                 /* Set shrinker bit if the first element was added */
> >                 if (!l->nr_items++)
> > @@ -131,6 +147,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
> >                 spin_unlock(&nlru->lock);
> >                 return true;
> >         }
> > +out:
> >         spin_unlock(&nlru->lock);
> >         return false;
> >  }
> > @@ -176,11 +193,12 @@ unsigned long list_lru_count_one(struct list_lru *lru,
> >  {
> >         struct list_lru_node *nlru = &lru->node[nid];
> >         struct list_lru_one *l;
> > -       unsigned long count;
> > +       unsigned long count = 0;
> >
> >         rcu_read_lock();
> >         l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
> > -       count = READ_ONCE(l->nr_items);
> > +       if (l)
> > +               count = READ_ONCE(l->nr_items);
> >         rcu_read_unlock();
> >
> >         return count;
> > @@ -207,6 +225,9 @@ __list_lru_walk_one(struct list_lru_node *nlru, int memcg_idx,
> >         unsigned long isolated = 0;
> >
> >         l = list_lru_from_memcg_idx(nlru, memcg_idx);
> > +       if (!l)
> > +               goto out;
> > +
> >  restart:
> >         list_for_each_safe(item, n, &l->list) {
> >                 enum lru_status ret;
> > @@ -251,6 +272,7 @@ __list_lru_walk_one(struct list_lru_node *nlru, int memcg_idx,
> >                         BUG();
> >                 }
> >         }
> > +out:
> >         return isolated;
> >  }
> >
> > @@ -312,12 +334,6 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
> >  }
> >  EXPORT_SYMBOL_GPL(list_lru_walk_node);
> >
> > -static void init_one_lru(struct list_lru_one *l)
> > -{
> > -       INIT_LIST_HEAD(&l->list);
> > -       l->nr_items = 0;
> > -}
> > -
> >  #ifdef CONFIG_MEMCG_KMEM
> >  static void __memcg_destroy_list_lru_node(struct list_lru_memcg *memcg_lrus,
> >                                           int begin, int end)
> > @@ -328,41 +344,16 @@ static void __memcg_destroy_list_lru_node(struct list_lru_memcg *memcg_lrus,
> >                 kfree(memcg_lrus->lru[i]);
> >  }
> >
> > -static int __memcg_init_list_lru_node(struct list_lru_memcg *memcg_lrus,
> > -                                     int begin, int end)
> > -{
> > -       int i;
> > -
> > -       for (i = begin; i < end; i++) {
> > -               struct list_lru_one *l;
> > -
> > -               l = kmalloc(sizeof(struct list_lru_one), GFP_KERNEL);
> > -               if (!l)
> > -                       goto fail;
> > -
> > -               init_one_lru(l);
> > -               memcg_lrus->lru[i] = l;
> > -       }
> > -       return 0;
> > -fail:
> > -       __memcg_destroy_list_lru_node(memcg_lrus, begin, i);
> > -       return -ENOMEM;
> > -}
> > -
> >  static int memcg_init_list_lru_node(struct list_lru_node *nlru)
> >  {
> >         struct list_lru_memcg *memcg_lrus;
> >         int size = memcg_nr_cache_ids;
> >
> > -       memcg_lrus = kvmalloc(sizeof(*memcg_lrus) +
> > +       memcg_lrus = kvzalloc(sizeof(*memcg_lrus) +
> >                               size * sizeof(void *), GFP_KERNEL);
> >         if (!memcg_lrus)
> >                 return -ENOMEM;
> >
> > -       if (__memcg_init_list_lru_node(memcg_lrus, 0, size)) {
> > -               kvfree(memcg_lrus);
> > -               return -ENOMEM;
> > -       }
> >         RCU_INIT_POINTER(nlru->memcg_lrus, memcg_lrus);
> >
> >         return 0;
> > @@ -389,15 +380,10 @@ static int memcg_update_list_lru_node(struct list_lru_node *nlru,
> >
> >         old = rcu_dereference_protected(nlru->memcg_lrus,
> >                                         lockdep_is_held(&list_lrus_mutex));
> > -       new = kvmalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
> > +       new = kvzalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
> >         if (!new)
> >                 return -ENOMEM;
> >
> > -       if (__memcg_init_list_lru_node(new, old_size, new_size)) {
> > -               kvfree(new);
> > -               return -ENOMEM;
> > -       }
> > -
> >         memcpy(&new->lru, &old->lru, old_size * sizeof(void *));
> >
> >         /*
> > @@ -526,6 +512,7 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
> >         struct list_lru_node *nlru = &lru->node[nid];
> >         int dst_idx = dst_memcg->kmemcg_id;
> >         struct list_lru_one *src, *dst;
> > +       struct list_lru_memcg *memcg_lrus;
> >
> >         /*
> >          * Since list_lru_{add,del} may be called under an IRQ-safe lock,
> > @@ -534,7 +521,17 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
> >         spin_lock_irq(&nlru->lock);
> >
> >         src = list_lru_from_memcg_idx(nlru, src_idx);
> > +       if (!src)
> > +               goto out;
> > +
> >         dst = list_lru_from_memcg_idx(nlru, dst_idx);
> > +       if (!dst) {
> > +               /* TODO: Use __GFP_NOFAIL? */
> > +               dst = kmalloc(sizeof(struct list_lru_one), GFP_ATOMIC);
> > +               init_one_lru(dst);
> > +               memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true);
> > +               memcg_lrus->lru[dst_idx] = dst;
> > +       }

Hm, can't we just reuse src as dst in this case?
We don't need src anymore and we're basically allocating dst to move all data from src.
If not, we can allocate all the way up to the root memcg every time to avoid
the !dst case and fiddling with __GFP_NOFAIL.
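
A minimal sketch of that reuse, assuming nlru->lock is held as in
memcg_drain_list_lru_node() (illustrative only; updating the shrinker bit
for the destination memcg is elided):

	src = list_lru_from_memcg_idx(nlru, src_idx);
	if (!src)
		goto out;

	dst = list_lru_from_memcg_idx(nlru, dst_idx);
	if (!dst) {
		/* No dst list was ever allocated: transplant the whole src
		 * list_lru_one into the dst slot instead of allocating a
		 * new one and splicing into it. */
		memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true);
		memcg_lrus->lru[dst_idx] = src;
		memcg_lrus->lru[src_idx] = NULL;
		goto out;
	}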

Otherwise I like the idea and I think it might reduce the memory overhead
especially on (very) big machines.

Thanks!
Bharata B Rao April 6, 2021, 10:05 a.m. UTC | #4
On Mon, Apr 05, 2021 at 11:08:26AM -0700, Yang Shi wrote:
> On Sun, Apr 4, 2021 at 10:49 PM Bharata B Rao <bharata@linux.ibm.com> wrote:
> >
> > Hi,
> >
> > When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> > server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> > consumption increases quite a lot (around 172G) when the containers are
> > running. Most of it comes from slab (149G) and within slab, the majority of
> > it comes from kmalloc-32 cache (102G)
> >
> > The major allocator of kmalloc-32 slab cache happens to be the list_head
> > allocations of list_lru_one list. These lists are created whenever a
> > FS mount happens. Specially two such lists are registered by alloc_super(),
> > one for dentry and another for inode shrinker list. And these lists
> > are created for all possible NUMA nodes and for all given memcgs
> > (memcg_nr_cache_ids to be particular)
> >
> > If,
> >
> > A = Nr allocation request per mount: 2 (one for dentry and inode list)
> > B = Nr NUMA possible nodes
> > C = memcg_nr_cache_ids
> > D = size of each kmalloc-32 object: 32 bytes,
> >
> > then for every mount, the amount of memory consumed by kmalloc-32 slab
> > cache for list_lru creation is A*B*C*D bytes.
> 
> Yes, this is exactly what the current implementation does.
> 
> >
> > Following factors contribute to the excessive allocations:
> >
> > - Lists are created for possible NUMA nodes.
> 
> Yes, because filesystem caches (dentry and inode) are NUMA aware.

True, but creating lists for possible nodes as opposed to online nodes
can hurt platforms where the possible node count is typically higher than
the online one.

> 
> > - memcg_nr_cache_ids grows in bulk (see memcg_alloc_cache_id() and additional
> >   list_lrus are created when it grows. Thus we end up creating list_lru_one
> >   list_heads even for those memcgs which are yet to be created.
> >   For example, when 10000 memcgs are created, memcg_nr_cache_ids reach
> >   a value of 12286.
> > - When a memcg goes offline, the list elements are drained to the parent
> >   memcg, but the list_head entry remains.
> > - The lists are destroyed only when the FS is unmounted. So list_heads
> >   for non-existing memcgs remain and continue to contribute to the
> >   kmalloc-32 allocation. This is presumably done for performance
> >   reason as they get reused when new memcgs are created, but they end up
> >   consuming slab memory until then.
> 
> The current implementation has list_lrus attached with super_block. So
> the list can't be freed until the super block is unmounted.
> 
> I'm looking into consolidating list_lrus more closely with memcgs. It
> means the list_lrus will have the same life cycles as memcgs rather
> than filesystems. This may be able to improve some. But I'm supposed
> the filesystem will be unmounted once the container exits and the
> memcgs will get offlined for your usecase.

Yes, but while the containers are still running, the lists that get
created for non-existing memcgs and non-relevant memcgs are the main
cause of the increased memory consumption.

> 
> > - In case of containers, a few file systems get mounted and are specific
> >   to the container namespace and hence to a particular memcg, but we
> >   end up creating lists for all the memcgs.
> 
> Yes, because the kernel is *NOT* aware of containers.
> 
> >   As an example, if 7 FS mounts are done for every container and when
> >   10k containers are created, we end up creating 2*7*12286 list_lru_one
> >   lists for each NUMA node. It appears that no elements will get added
> >   to other than 2*7=14 of them in the case of containers.
> >
> > One straight forward way to prevent this excessive list_lru_one
> > allocations is to limit the list_lru_one creation only to the
> > relevant memcg. However I don't see an easy way to figure out
> > that relevant memcg from FS mount path (alloc_super())
> >
> > As an alternative approach, I have this below hack that does lazy
> > list_lru creation. The memcg-specific list is created and initialized
> > only when there is a request to add an element to that particular
> > list. Though I am not sure about the full impact of this change
> > on the owners of the lists and also the performance impact of this,
> > the overall savings look good.
> 
> It is fine to reduce the memory consumption for your usecase, but I'm
> not sure if this would incur any noticeable overhead for vfs
> operations since list_lru_add() should be called quite often, but it
> just needs to allocate the list for once (for each memcg +
> filesystem), so the overhead might be fine.

Let me run some benchmarks to measure the overhead. Any particular
benchmark suggestion?
 
> 
> And I'm wondering how much memory can be saved for real life workload.
> I don't expect most containers are idle in production environments.

I don't think the kmalloc-32 slab cache memory consumption from list_lru
would be any different for a real-life workload compared to idle containers.

> 
> Added some more memcg/list_lru experts in this loop, they may have better ideas.

Thanks.

Regards,
Bharata.
Bharata B Rao April 6, 2021, 10:13 a.m. UTC | #5
On Mon, Apr 05, 2021 at 11:38:44AM -0700, Roman Gushchin wrote:
> > > @@ -534,7 +521,17 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
> > >         spin_lock_irq(&nlru->lock);
> > >
> > >         src = list_lru_from_memcg_idx(nlru, src_idx);
> > > +       if (!src)
> > > +               goto out;
> > > +
> > >         dst = list_lru_from_memcg_idx(nlru, dst_idx);
> > > +       if (!dst) {
> > > +               /* TODO: Use __GFP_NOFAIL? */
> > > +               dst = kmalloc(sizeof(struct list_lru_one), GFP_ATOMIC);
> > > +               init_one_lru(dst);
> > > +               memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true);
> > > +               memcg_lrus->lru[dst_idx] = dst;
> > > +       }
> 
> Hm, can't we just reuse src as dst in this case?
> We don't need src anymore and we're basically allocating dst to move all data from src.

Yes, we can do that and it would be much simpler.

> If not, we can allocate up to the root memcg every time to avoid having
> !dst case and fiddle with __GFP_NOFAIL.
> 
> Otherwise I like the idea and I think it might reduce the memory overhead
> especially on (very) big machines.

Yes. However, I will have to check whether the callers of list_lru_add()
are capable of handling the failure that can happen with this approach if
kmalloc fails.

Regards,
Bharata.
Dave Chinner April 6, 2021, 10:28 p.m. UTC | #6
On Mon, Apr 05, 2021 at 11:18:48AM +0530, Bharata B Rao wrote:
> Hi,
> 
> When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> consumption increases quite a lot (around 172G) when the containers are
> running. Most of it comes from slab (149G) and within slab, the majority of
> it comes from kmalloc-32 cache (102G)
> 
> The major allocator of kmalloc-32 slab cache happens to be the list_head
> allocations of list_lru_one list. These lists are created whenever a
> FS mount happens. Specially two such lists are registered by alloc_super(),
> one for dentry and another for inode shrinker list. And these lists
> are created for all possible NUMA nodes and for all given memcgs
> (memcg_nr_cache_ids to be particular)
> 
> If,
> 
> A = Nr allocation request per mount: 2 (one for dentry and inode list)
> B = Nr NUMA possible nodes
> C = memcg_nr_cache_ids
> D = size of each kmalloc-32 object: 32 bytes,
> 
> then for every mount, the amount of memory consumed by kmalloc-32 slab
> cache for list_lru creation is A*B*C*D bytes.
> 
> Following factors contribute to the excessive allocations:
> 
> - Lists are created for possible NUMA nodes.
> - memcg_nr_cache_ids grows in bulk (see memcg_alloc_cache_id() and additional
>   list_lrus are created when it grows. Thus we end up creating list_lru_one
>   list_heads even for those memcgs which are yet to be created.
>   For example, when 10000 memcgs are created, memcg_nr_cache_ids reach
>   a value of 12286.

So, by your numbers, we have 2 * 2 * 12286 * 32 = 1.5MB per mount.

So for that to make up 100GB of RAM, you must have somewhere over
500,000 mounted superblocks on the machine?

That implies 50+ unique mounted superblocks per container, which
seems like an awful lot.

> - When a memcg goes offline, the list elements are drained to the parent
>   memcg, but the list_head entry remains.
> - The lists are destroyed only when the FS is unmounted. So list_heads
>   for non-existing memcgs remain and continue to contribute to the
>   kmalloc-32 allocation. This is presumably done for performance
>   reason as they get reused when new memcgs are created, but they end up
>   consuming slab memory until then.
> - In case of containers, a few file systems get mounted and are specific
>   to the container namespace and hence to a particular memcg, but we
>   end up creating lists for all the memcgs.
>   As an example, if 7 FS mounts are done for every container and when
>   10k containers are created, we end up creating 2*7*12286 list_lru_one
>   lists for each NUMA node. It appears that no elements will get added
>   to other than 2*7=14 of them in the case of containers.

Yeah, at first glance this doesn't strike me as a problem with the
list_lru structure; it smells more like a problem resulting from a
huge number of superblock instantiations on the machine. Which,
probably, mostly have no significant need for anything other than a
single memcg awareness?

Can you post a typical /proc/self/mounts output from one of these
idle/empty containers so we can see exactly how many mounts and
their type are being instantiated in each container?

> One straight forward way to prevent this excessive list_lru_one
> allocations is to limit the list_lru_one creation only to the
> relevant memcg. However I don't see an easy way to figure out
> that relevant memcg from FS mount path (alloc_super())

Superblocks have to support an unknown number of memcgs after they
have been mounted. Bind mounts, child memcgs, etc., all mean that we
can't just have a static, single mount-time memcg instantiation.

> As an alternative approach, I have this below hack that does lazy
> list_lru creation. The memcg-specific list is created and initialized
> only when there is a request to add an element to that particular
> list. Though I am not sure about the full impact of this change
> on the owners of the lists and also the performance impact of this,
> the overall savings look good.

Avoiding memory allocation in list_lru_add() was one of the main
reasons for up-front static allocation of memcg lists. We cannot do
memory allocation while callers are holding multiple spinlocks in
core system algorithms (e.g. dentry_kill -> retain_dentry ->
d_lru_add -> list_lru_add), let alone while holding an internal
spinlock.

Putting a GFP_ATOMIC allocation inside 3-4 nested spinlocks in a
path we know might have memory demand in the *hundreds of GB* range
gets an NACK from me. It's a great idea, but it's just not a
feasible, robust solution as proposed. Work out how to put the
memory allocation outside all the locks (including caller locks) and
it might be ok, but that's messy.

Another approach may be to identify filesystem types that do not
need memcg awareness and feed that into alloc_super() to set/clear
the SHRINKER_MEMCG_AWARE flag. This could be based on fstype - most
virtual filesystems that expose system information do not really
need full memcg awareness because they are generally only visible to
a single memcg instance...
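
One possible shape of that suggestion, purely as an illustrative sketch
inside alloc_super() (the FS_NO_MEMCG_LRU fs_flags bit below is hypothetical
and does not exist in the kernel):

	/* Hypothetical opt-out: let an fstype declare that its caches never
	 * need per-memcg LRUs, e.g. single-instance virtual filesystems. */
	bool memcg_aware = !(type->fs_flags & FS_NO_MEMCG_LRU);

	s->s_shrink.flags = SHRINKER_NUMA_AWARE;
	if (memcg_aware)
		s->s_shrink.flags |= SHRINKER_MEMCG_AWARE;
	if (prealloc_shrinker(&s->s_shrink))
		goto fail;
	/* list_lru_init_memcg() would likewise need to fall back to the
	 * plain list_lru_init() variant when memcg_aware is false. */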

Cheers,

Dave.
Yang Shi April 7, 2021, 1:39 a.m. UTC | #7
On Tue, Apr 6, 2021 at 3:05 AM Bharata B Rao <bharata@linux.ibm.com> wrote:
>
> On Mon, Apr 05, 2021 at 11:08:26AM -0700, Yang Shi wrote:
> > On Sun, Apr 4, 2021 at 10:49 PM Bharata B Rao <bharata@linux.ibm.com> wrote:
> > >
> > > Hi,
> > >
> > > When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> > > server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> > > consumption increases quite a lot (around 172G) when the containers are
> > > running. Most of it comes from slab (149G) and within slab, the majority of
> > > it comes from kmalloc-32 cache (102G)
> > >
> > > The major allocator of kmalloc-32 slab cache happens to be the list_head
> > > allocations of list_lru_one list. These lists are created whenever a
> > > FS mount happens. Specially two such lists are registered by alloc_super(),
> > > one for dentry and another for inode shrinker list. And these lists
> > > are created for all possible NUMA nodes and for all given memcgs
> > > (memcg_nr_cache_ids to be particular)
> > >
> > > If,
> > >
> > > A = Nr allocation request per mount: 2 (one for dentry and inode list)
> > > B = Nr NUMA possible nodes
> > > C = memcg_nr_cache_ids
> > > D = size of each kmalloc-32 object: 32 bytes,
> > >
> > > then for every mount, the amount of memory consumed by kmalloc-32 slab
> > > cache for list_lru creation is A*B*C*D bytes.
> >
> > Yes, this is exactly what the current implementation does.
> >
> > >
> > > Following factors contribute to the excessive allocations:
> > >
> > > - Lists are created for possible NUMA nodes.
> >
> > Yes, because filesystem caches (dentry and inode) are NUMA aware.
>
> True, but creating lists for possible nodes as against online nodes
> can hurt platforms where possible is typically higher than online.

I suppose that is just because memory hotplug doesn't handle memcg list_lru
creation/deletion.

We get a much simpler and less error-prone implementation by wasting some memory.

>
> >
> > > - memcg_nr_cache_ids grows in bulk (see memcg_alloc_cache_id() and additional
> > >   list_lrus are created when it grows. Thus we end up creating list_lru_one
> > >   list_heads even for those memcgs which are yet to be created.
> > >   For example, when 10000 memcgs are created, memcg_nr_cache_ids reach
> > >   a value of 12286.
> > > - When a memcg goes offline, the list elements are drained to the parent
> > >   memcg, but the list_head entry remains.
> > > - The lists are destroyed only when the FS is unmounted. So list_heads
> > >   for non-existing memcgs remain and continue to contribute to the
> > >   kmalloc-32 allocation. This is presumably done for performance
> > >   reason as they get reused when new memcgs are created, but they end up
> > >   consuming slab memory until then.
> >
> > The current implementation has list_lrus attached with super_block. So
> > the list can't be freed until the super block is unmounted.
> >
> > I'm looking into consolidating list_lrus more closely with memcgs. It
> > means the list_lrus will have the same life cycles as memcgs rather
> > than filesystems. This may be able to improve some. But I'm supposed
> > the filesystem will be unmounted once the container exits and the
> > memcgs will get offlined for your usecase.
>
> Yes, but when the containers are still running, the lists that get
> created for non-existing memcgs and non-relavent memcgs are the main
> cause of increased memory consumption.

Since the kernel doesn't know about containers, it simply doesn't
know which memcgs are non-relevant.

>
> >
> > > - In case of containers, a few file systems get mounted and are specific
> > >   to the container namespace and hence to a particular memcg, but we
> > >   end up creating lists for all the memcgs.
> >
> > Yes, because the kernel is *NOT* aware of containers.
> >
> > >   As an example, if 7 FS mounts are done for every container and when
> > >   10k containers are created, we end up creating 2*7*12286 list_lru_one
> > >   lists for each NUMA node. It appears that no elements will get added
> > >   to other than 2*7=14 of them in the case of containers.
> > >
> > > One straight forward way to prevent this excessive list_lru_one
> > > allocations is to limit the list_lru_one creation only to the
> > > relevant memcg. However I don't see an easy way to figure out
> > > that relevant memcg from FS mount path (alloc_super())
> > >
> > > As an alternative approach, I have this below hack that does lazy
> > > list_lru creation. The memcg-specific list is created and initialized
> > > only when there is a request to add an element to that particular
> > > list. Though I am not sure about the full impact of this change
> > > on the owners of the lists and also the performance impact of this,
> > > the overall savings look good.
> >
> > It is fine to reduce the memory consumption for your usecase, but I'm
> > not sure if this would incur any noticeable overhead for vfs
> > operations since list_lru_add() should be called quite often, but it
> > just needs to allocate the list for once (for each memcg +
> > filesystem), so the overhead might be fine.
>
> Let me run some benchmarks to measure the overhead. Any particular
> benchmark suggestion?

Opening/closing files should exercise the list_lru.
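
A crude user-space sketch of such a benchmark (the path and file count are
arbitrary; whether each iteration actually reaches list_lru_add() depends on
dcache behaviour):

/* Create and close many distinct files so their dentries/inodes end up on
 * the superblock's memcg-aware LRU lists. Run from inside a test directory. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char name[64];

	for (int i = 0; i < 100000; i++) {
		snprintf(name, sizeof(name), "lru-bench-%d", i);
		int fd = open(name, O_CREAT | O_RDWR, 0600);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		close(fd);
	}
	/* Cleanup of the created files is left to the caller. */
	return 0;
}
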

>
> >
> > And I'm wondering how much memory can be saved for real life workload.
> > I don't expect most containers are idle in production environments.
>
> I don't think kmalloc-32 slab cache memory consumption from list_lru
> would be any different for real life workload compared to idle containers.

I don't mean the memory consumption itself. I mean that with a real-life
workload the list is typically not empty, so the memory is not allocated
in vain.

>
> >
> > Added some more memcg/list_lru experts in this loop, they may have better ideas.
>
> Thanks.
>
> Regards,
> Bharata.
Bharata B Rao April 7, 2021, 5:05 a.m. UTC | #8
On Wed, Apr 07, 2021 at 08:28:07AM +1000, Dave Chinner wrote:
> On Mon, Apr 05, 2021 at 11:18:48AM +0530, Bharata B Rao wrote:
> > Hi,
> > 
> > When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> > server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> > consumption increases quite a lot (around 172G) when the containers are
> > running. Most of it comes from slab (149G) and within slab, the majority of
> > it comes from kmalloc-32 cache (102G)
> > 
> > The major allocator of kmalloc-32 slab cache happens to be the list_head
> > allocations of list_lru_one list. These lists are created whenever a
> > FS mount happens. Specially two such lists are registered by alloc_super(),
> > one for dentry and another for inode shrinker list. And these lists
> > are created for all possible NUMA nodes and for all given memcgs
> > (memcg_nr_cache_ids to be particular)
> > 
> > If,
> > 
> > A = Nr allocation request per mount: 2 (one for dentry and inode list)
> > B = Nr NUMA possible nodes
> > C = memcg_nr_cache_ids
> > D = size of each kmalloc-32 object: 32 bytes,
> > 
> > then for every mount, the amount of memory consumed by kmalloc-32 slab
> > cache for list_lru creation is A*B*C*D bytes.
> > 
> > Following factors contribute to the excessive allocations:
> > 
> > - Lists are created for possible NUMA nodes.
> > - memcg_nr_cache_ids grows in bulk (see memcg_alloc_cache_id() and additional
> >   list_lrus are created when it grows. Thus we end up creating list_lru_one
> >   list_heads even for those memcgs which are yet to be created.
> >   For example, when 10000 memcgs are created, memcg_nr_cache_ids reach
> >   a value of 12286.
> 
> So, by your numbers, we have 2 * 2 * 12286 * 32 = 1.5MB per mount.
> 
> So for that to make up 100GB of RAM, you must have somewhere over
> 500,000 mounted superblocks on the machine?
> 
> That implies 50+ unique mounted superblocks per container, which
> seems like an awful lot.

Here is how the calculation turns out to be in my setup:

Number of possible NUMA nodes = 2
Number of mounts per container = 7 (Check below to see which are these)
Number of list creation requests per mount = 2
Number of containers = 10000
memcg_nr_cache_ids for 10k containers = 12286
size of kmalloc-32 = 32 bytes

2*7*2*10000*12286*32 = 110082560000 bytes = 102.5G
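
For reference, the same arithmetic as a tiny stand-alone program (the
variable names are just illustrative):

#include <stdio.h>

int main(void)
{
	unsigned long long lists_per_mount = 2;     /* dentry + inode list_lru */
	unsigned long long mounts_per_ct   = 7;
	unsigned long long numa_nodes      = 2;
	unsigned long long containers      = 10000;
	unsigned long long cache_ids       = 12286; /* memcg_nr_cache_ids */
	unsigned long long obj_size        = 32;    /* kmalloc-32 */
	unsigned long long bytes;

	bytes = lists_per_mount * mounts_per_ct * numa_nodes *
		containers * cache_ids * obj_size;
	printf("%llu bytes = %.1f GiB\n", bytes,
	       bytes / (1024.0 * 1024 * 1024));
	return 0;
}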

> 
> > - When a memcg goes offline, the list elements are drained to the parent
> >   memcg, but the list_head entry remains.
> > - The lists are destroyed only when the FS is unmounted. So list_heads
> >   for non-existing memcgs remain and continue to contribute to the
> >   kmalloc-32 allocation. This is presumably done for performance
> >   reason as they get reused when new memcgs are created, but they end up
> >   consuming slab memory until then.
> > - In case of containers, a few file systems get mounted and are specific
> >   to the container namespace and hence to a particular memcg, but we
> >   end up creating lists for all the memcgs.
> >   As an example, if 7 FS mounts are done for every container and when
> >   10k containers are created, we end up creating 2*7*12286 list_lru_one
> >   lists for each NUMA node. It appears that no elements will get added
> >   to other than 2*7=14 of them in the case of containers.
> 
> Yeah, at first glance this doesn't strike me as a problem with the
> list_lru structure, it smells more like a problem resulting from a
> huge number of superblock instantiations on the machine. Which,
> probably, mostly have no significant need for anything other than a
> single memcg awareness?
> 
> Can you post a typical /proc/self/mounts output from one of these
> idle/empty containers so we can see exactly how many mounts and
> their type are being instantiated in each container?

Tracing type->name in alloc_super() lists these 7 types for
every container.

3-2691    [041] ....   222.761041: alloc_super: fstype: mqueue
3-2692    [072] ....   222.812187: alloc_super: fstype: proc
3-2692    [072] ....   222.812261: alloc_super: fstype: tmpfs
3-2692    [072] ....   222.812329: alloc_super: fstype: devpts
3-2692    [072] ....   222.812392: alloc_super: fstype: tmpfs
3-2692    [072] ....   222.813102: alloc_super: fstype: tmpfs
3-2692    [072] ....   222.813159: alloc_super: fstype: tmpfs
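
Output like the above can be obtained with a one-line debug hack near the
top of alloc_super() in fs/super.c, for example (purely illustrative, not
part of any posted patch):

	/* debug only: log the filesystem type of every superblock allocation */
	trace_printk("fstype: %s\n", type->name);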

> 
> > One straight forward way to prevent this excessive list_lru_one
> > allocations is to limit the list_lru_one creation only to the
> > relevant memcg. However I don't see an easy way to figure out
> > that relevant memcg from FS mount path (alloc_super())
> 
> Superblocks have to support an unknown number of memcgs after they
> have been mounted. bind mounts, child memcgs, etc, all mean that we
> can't just have a static, single mount time memcg instantiation.
> 
> > As an alternative approach, I have this below hack that does lazy
> > list_lru creation. The memcg-specific list is created and initialized
> > only when there is a request to add an element to that particular
> > list. Though I am not sure about the full impact of this change
> > on the owners of the lists and also the performance impact of this,
> > the overall savings look good.
> 
> Avoiding memory allocation in list_lru_add() was one of the main
> reasons for up-front static allocation of memcg lists. We cannot do
> memory allocation while callers are holding multiple spinlocks in
> core system algorithms (e.g. dentry_kill -> retain_dentry ->
> d_lru_add -> list_lru_add), let alone while holding an internal
> spinlock.
> 
> Putting a GFP_ATOMIC allocation inside 3-4 nested spinlocks in a
> path we know might have memory demand in the *hundreds of GB* range
> gets an NACK from me. It's a great idea, but it's just not a
> feasible, robust solution as proposed. Work out how to put the
> memory allocation outside all the locks (including caller locks) and
> it might be ok, but that's messy.

Ok, I see the problem, and it looks hard to move the allocations
outside of those locks.

> 
> Another approach may be to identify filesystem types that do not
> need memcg awareness and feed that into alloc_super() to set/clear
> the SHRINKER_MEMCG_AWARE flag. This could be based on fstype - most
> virtual filesystems that expose system information do not really
> need full memcg awareness because they are generally only visible to
> a single memcg instance...

This, however, seems like a feasible approach; let me check on this.

Regards,
Bharata.
Kirill Tkhai April 7, 2021, 10:07 a.m. UTC | #9
On 07.04.2021 08:05, Bharata B Rao wrote:
> On Wed, Apr 07, 2021 at 08:28:07AM +1000, Dave Chinner wrote:
>> On Mon, Apr 05, 2021 at 11:18:48AM +0530, Bharata B Rao wrote:
>>> Hi,
>>>
>>> When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
>>> server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
>>> consumption increases quite a lot (around 172G) when the containers are
>>> running. Most of it comes from slab (149G) and within slab, the majority of
>>> it comes from kmalloc-32 cache (102G)
>>>
>>> The major allocator of kmalloc-32 slab cache happens to be the list_head
>>> allocations of list_lru_one list. These lists are created whenever a
>>> FS mount happens. Specially two such lists are registered by alloc_super(),
>>> one for dentry and another for inode shrinker list. And these lists
>>> are created for all possible NUMA nodes and for all given memcgs
>>> (memcg_nr_cache_ids to be particular)
>>>
>>> If,
>>>
>>> A = Nr allocation request per mount: 2 (one for dentry and inode list)
>>> B = Nr NUMA possible nodes
>>> C = memcg_nr_cache_ids
>>> D = size of each kmalloc-32 object: 32 bytes,
>>>
>>> then for every mount, the amount of memory consumed by kmalloc-32 slab
>>> cache for list_lru creation is A*B*C*D bytes.
>>>
>>> Following factors contribute to the excessive allocations:
>>>
>>> - Lists are created for possible NUMA nodes.
>>> - memcg_nr_cache_ids grows in bulk (see memcg_alloc_cache_id() and additional
>>>   list_lrus are created when it grows. Thus we end up creating list_lru_one
>>>   list_heads even for those memcgs which are yet to be created.
>>>   For example, when 10000 memcgs are created, memcg_nr_cache_ids reach
>>>   a value of 12286.
>>
>> So, by your numbers, we have 2 * 2 * 12286 * 32 = 1.5MB per mount.
>>
>> So for that to make up 100GB of RAM, you must have somewhere over
>> 500,000 mounted superblocks on the machine?
>>
>> That implies 50+ unique mounted superblocks per container, which
>> seems like an awful lot.
> 
> Here is how the calculation turns out to be in my setup:
> 
> Number of possible NUMA nodes = 2
> Number of mounts per container = 7 (Check below to see which are these)
> Number of list creation requests per mount = 2
> Number of containers = 10000
> memcg_nr_cache_ids for 10k containers = 12286

Luckily, we have "+1" in the memcg_nr_cache_ids formula: size = 2 * (id + 1).
If we only multiplied, you would have ended up with memcg_nr_cache_ids=20000.

Maybe we need to change that formula so that memcg_nr_cache_ids grows more accurately
for further growth in the number of containers. Say,

size = id < 2000 ? 2 * (id + 1) : id + 2000

> size of kmalloc-32 = 32 bytes
> 
> 2*7*2*10000*12286*32 = 110082560000 bytes = 102.5G
> 
>>
>>> - When a memcg goes offline, the list elements are drained to the parent
>>>   memcg, but the list_head entry remains.
>>> - The lists are destroyed only when the FS is unmounted. So list_heads
>>>   for non-existing memcgs remain and continue to contribute to the
>>>   kmalloc-32 allocation. This is presumably done for performance
>>>   reason as they get reused when new memcgs are created, but they end up
>>>   consuming slab memory until then.
>>> - In case of containers, a few file systems get mounted and are specific
>>>   to the container namespace and hence to a particular memcg, but we
>>>   end up creating lists for all the memcgs.
>>>   As an example, if 7 FS mounts are done for every container and when
>>>   10k containers are created, we end up creating 2*7*12286 list_lru_one
>>>   lists for each NUMA node. It appears that no elements will get added
>>>   to other than 2*7=14 of them in the case of containers.
>>
>> Yeah, at first glance this doesn't strike me as a problem with the
>> list_lru structure, it smells more like a problem resulting from a
>> huge number of superblock instantiations on the machine. Which,
>> probably, mostly have no significant need for anything other than a
>> single memcg awareness?
>>
>> Can you post a typical /proc/self/mounts output from one of these
>> idle/empty containers so we can see exactly how many mounts and
>> their type are being instantiated in each container?
> 
> Tracing type->name in alloc_super() lists these 7 types for
> every container.
> 
> 3-2691    [041] ....   222.761041: alloc_super: fstype: mqueue
> 3-2692    [072] ....   222.812187: alloc_super: fstype: proc
> 3-2692    [072] ....   222.812261: alloc_super: fstype: tmpfs
> 3-2692    [072] ....   222.812329: alloc_super: fstype: devpts
> 3-2692    [072] ....   222.812392: alloc_super: fstype: tmpfs
> 3-2692    [072] ....   222.813102: alloc_super: fstype: tmpfs
> 3-2692    [072] ....   222.813159: alloc_super: fstype: tmpfs
> 
>>
>>> One straight forward way to prevent this excessive list_lru_one
>>> allocations is to limit the list_lru_one creation only to the
>>> relevant memcg. However I don't see an easy way to figure out
>>> that relevant memcg from FS mount path (alloc_super())
>>
>> Superblocks have to support an unknown number of memcgs after they
>> have been mounted. bind mounts, child memcgs, etc, all mean that we
>> can't just have a static, single mount time memcg instantiation.
>>
>>> As an alternative approach, I have this below hack that does lazy
>>> list_lru creation. The memcg-specific list is created and initialized
>>> only when there is a request to add an element to that particular
>>> list. Though I am not sure about the full impact of this change
>>> on the owners of the lists and also the performance impact of this,
>>> the overall savings look good.
>>
>> Avoiding memory allocation in list_lru_add() was one of the main
>> reasons for up-front static allocation of memcg lists. We cannot do
>> memory allocation while callers are holding multiple spinlocks in
>> core system algorithms (e.g. dentry_kill -> retain_dentry ->
>> d_lru_add -> list_lru_add), let alone while holding an internal
>> spinlock.
>>
>> Putting a GFP_ATOMIC allocation inside 3-4 nested spinlocks in a
>> path we know might have memory demand in the *hundreds of GB* range
>> gets an NACK from me. It's a great idea, but it's just not a
>> feasible, robust solution as proposed. Work out how to put the
>> memory allocation outside all the locks (including caller locks) and
>> it might be ok, but that's messy.
> 
> Ok, I see the problem, and it looks hard to move the allocations
> outside of those locks.
> 
>>
>> Another approach may be to identify filesystem types that do not
>> need memcg awareness and feed that into alloc_super() to set/clear
>> the SHRINKER_MEMCG_AWARE flag. This could be based on fstype - most
>> virtual filesystems that expose system information do not really
>> need full memcg awareness because they are generally only visible to
>> a single memcg instance...
> 
> This, however, seems like a feasible approach; let me check on this.
> 
> Regards,
> Bharata.
>
Bharata B Rao April 7, 2021, 11:47 a.m. UTC | #10
On Wed, Apr 07, 2021 at 01:07:27PM +0300, Kirill Tkhai wrote:
> > Here is how the calculation turns out to be in my setup:
> > 
> > Number of possible NUMA nodes = 2
> > Number of mounts per container = 7 (Check below to see which are these)
> > Number of list creation requests per mount = 2
> > Number of containers = 10000
> > memcg_nr_cache_ids for 10k containers = 12286
> 
> Luckily, we have "+1" in the memcg_nr_cache_ids formula: size = 2 * (id + 1).
> If we only multiplied, you would have ended up with memcg_nr_cache_ids=20000.

Not really, it would grow like this for size = 2 * id

id 0 size 4
id 4 size 8
id 8 size 16
id 16 size 32
id 32 size 64
id 64 size 128
id 128 size 256
id 256 size 512
id 512 size 1024
id 1024 size 2048
id 2048 size 4096
id 4096 size 8192
id 8192 size 16384

Currently (size = 2 * (id + 1)), it grows like this:

id 0 size 4
id 4 size 10
id 10 size 22
id 22 size 46
id 46 size 94
id 94 size 190
id 190 size 382
id 382 size 766
id 766 size 1534
id 1534 size 3070
id 3070 size 6142
id 6142 size 12286

> 
> Maybe we need to change that formula so that memcg_nr_cache_ids grows more accurately
> for further growth in the number of containers. Say,
> 
> size = id < 2000 ? 2 * (id + 1) : id + 2000

For the above, it would only be marginally better like this:

id 0 size 4
id 4 size 10
id 10 size 22
id 22 size 46
id 46 size 94
id 94 size 190
id 190 size 382
id 382 size 766
id 766 size 1534
id 1534 size 3070
id 3070 size 5070
id 5070 size 7070
id 7070 size 9070
id 9070 size 11070

All the above numbers are for 10k memcgs.
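
For completeness, the tables above can be reproduced with a small user-space
sketch (not kernel code; it only models how memcg_alloc_cache_id() bumps
memcg_nr_cache_ids, assuming ids are handed out sequentially so that the id
which triggers a resize equals the current size, starting from size 4):

#include <stdio.h>

int main(void)
{
	int id, size;

	printf("current rule, size = 2 * (id + 1):\n");
	for (id = 0, size = 4; id < 10000; id = size, size = 2 * (id + 1))
		printf("id %d size %d\n", id, size);

	printf("proposed rule, size = id < 2000 ? 2 * (id + 1) : id + 2000:\n");
	for (id = 0, size = 4; id < 10000;
	     id = size, size = id < 2000 ? 2 * (id + 1) : id + 2000)
		printf("id %d size %d\n", id, size);

	return 0;
}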

Regards,
Bharata.
Michal Hocko April 7, 2021, 11:54 a.m. UTC | #11
On Mon 05-04-21 11:18:48, Bharata B Rao wrote:
> Hi,
> 
> When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> consumption increases quite a lot (around 172G) when the containers are
> running. Most of it comes from slab (149G) and within slab, the majority of
> it comes from kmalloc-32 cache (102G)

Is this 10k cgroups a testing environment or does anybody really use that
in production? I would be really curious to hear how that behaves when
those containers are not idle. E.g. global memory reclaim iterating over
10k memcgs will likely be very visible. I do remember playing with
similar setups few years back and the overhead was very high.
Kirill Tkhai April 7, 2021, 12:49 p.m. UTC | #12
On 07.04.2021 14:47, Bharata B Rao wrote:
> On Wed, Apr 07, 2021 at 01:07:27PM +0300, Kirill Tkhai wrote:
>>> Here is how the calculation turns out to be in my setup:
>>>
>>> Number of possible NUMA nodes = 2
>>> Number of mounts per container = 7 (Check below to see which are these)
>>> Number of list creation requests per mount = 2
>>> Number of containers = 10000
>>> memcg_nr_cache_ids for 10k containers = 12286
>>
>> Luckily, we have "+1" in the memcg_nr_cache_ids formula: size = 2 * (id + 1).
>> If we only multiplied, you would have ended up with memcg_nr_cache_ids=20000.
> 
> Not really, it would grow like this for size = 2 * id
> 
> id 0 size 4
> id 4 size 8
> id 8 size 16
> id 16 size 32
> id 32 size 64
> id 64 size 128
> id 128 size 256
> id 256 size 512
> id 512 size 1024
> id 1024 size 2048
> id 2048 size 4096
> id 4096 size 8192
> id 8192 size 16384
> 
> Currently (size = 2 * (id + 1)), it grows like this:
> 
> id 0 size 4
> id 4 size 10
> id 10 size 22
> id 22 size 46
> id 46 size 94
> id 94 size 190
> id 190 size 382
> id 382 size 766
> id 766 size 1534
> id 1534 size 3070
> id 3070 size 6142
> id 6142 size 12286

Oh, thanks, I forgot what a power of two is :)
 
>>
>> Maybe we need to change that formula so that memcg_nr_cache_ids grows more accurately
>> for further growth in the number of containers. Say,
>>
>> size = id < 2000 ? 2 * (id + 1) : id + 2000
> 
> For the above, it would only be marginally better like this:
> 
> id 0 size 4
> id 4 size 10
> id 10 size 22
> id 22 size 46
> id 46 size 94
> id 94 size 190
> id 190 size 382
> id 382 size 766
> id 766 size 1534
> id 1534 size 3070
> id 3070 size 5070
> id 5070 size 7070
> id 7070 size 9070
> id 9070 size 11070
> 
> All the above numbers are for 10k memcgs.

I mean for numbers of containers bigger than your 10000.
Christian Brauner April 7, 2021, 1:32 p.m. UTC | #13
On Wed, Apr 07, 2021 at 01:54:48PM +0200, Michal Hocko wrote:
> On Mon 05-04-21 11:18:48, Bharata B Rao wrote:
> > Hi,
> > 
> > When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> > server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> > consumption increases quite a lot (around 172G) when the containers are
> > running. Most of it comes from slab (149G) and within slab, the majority of
> > it comes from kmalloc-32 cache (102G)
> 
> Is this 10k cgroups a testing environment or does anybody really use that
> in production? I would be really curious to hear how that behaves when
> those containers are not idle. E.g. global memory reclaim iterating over
> 10k memcgs will likely be very visible. I do remember playing with
> similar setups few years back and the overhead was very high.

Ccing Stéphane Graber who has experience/insight about stuff like this.

Christian
Bharata B Rao April 7, 2021, 1:43 p.m. UTC | #14
On Wed, Apr 07, 2021 at 01:54:48PM +0200, Michal Hocko wrote:
> On Mon 05-04-21 11:18:48, Bharata B Rao wrote:
> > Hi,
> > 
> > When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> > server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> > consumption increases quite a lot (around 172G) when the containers are
> > running. Most of it comes from slab (149G) and within slab, the majority of
> > it comes from kmalloc-32 cache (102G)
> 
> Is this 10k cgroups a testing environment or does anybody really use that
> in production? I would be really curious to hear how that behaves when
> those containers are not idle. E.g. global memory reclaim iterating over
> 10k memcgs will likely be very visible. I do remember playing with
> similar setups few years back and the overhead was very high.

These 10k containers are only a test scenario that we are looking at.

Regards,
Bharata.
Michal Hocko April 7, 2021, 1:57 p.m. UTC | #15
On Wed 07-04-21 19:13:42, Bharata B Rao wrote:
> On Wed, Apr 07, 2021 at 01:54:48PM +0200, Michal Hocko wrote:
> > On Mon 05-04-21 11:18:48, Bharata B Rao wrote:
> > > Hi,
> > > 
> > > When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> > > server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> > > consumption increases quite a lot (around 172G) when the containers are
> > > running. Most of it comes from slab (149G) and within slab, the majority of
> > > it comes from kmalloc-32 cache (102G)
> > 
> > Is this 10k cgroups a testing environment or does anybody really use that
> > in production? I would be really curious to hear how that behaves when
> > those containers are not idle. E.g. global memory reclaim iterating over
> > 10k memcgs will likely be very visible. I do remember playing with
> > similar setups few years back and the overhead was very high.
> 
> These 10k containers are only a test scenario that we are looking at.

OK, this is good to know. I would definitely recommend looking at the
runtime aspect of such a large-scale deployment before optimizing for
memory footprint. I do agree that the latter is an interesting topic on
its own, but I do not expect such a deployment on small machines, so the
overhead shouldn't be a showstopper. I would definitely be interested
to hear about the runtime overhead. I do expect some interesting
findings there.

Btw. I do expect that the memory controller will not be the only one
deployed, right?
Christian Brauner April 7, 2021, 1:57 p.m. UTC | #16
On Wed, Apr 07, 2021 at 08:28:07AM +1000, Dave Chinner wrote:
> On Mon, Apr 05, 2021 at 11:18:48AM +0530, Bharata B Rao wrote:
> > Hi,
> > 
> > When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> > server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> > consumption increases quite a lot (around 172G) when the containers are
> > running. Most of it comes from slab (149G) and within slab, the majority of
> > it comes from kmalloc-32 cache (102G)
> > 
> > The major allocator of kmalloc-32 slab cache happens to be the list_head
> > allocations of list_lru_one list. These lists are created whenever a
> > FS mount happens. Specially two such lists are registered by alloc_super(),
> > one for dentry and another for inode shrinker list. And these lists
> > are created for all possible NUMA nodes and for all given memcgs
> > (memcg_nr_cache_ids to be particular)
> > 
> > If,
> > 
> > A = Nr allocation request per mount: 2 (one for dentry and inode list)
> > B = Nr NUMA possible nodes
> > C = memcg_nr_cache_ids
> > D = size of each kmalloc-32 object: 32 bytes,
> > 
> > then for every mount, the amount of memory consumed by kmalloc-32 slab
> > cache for list_lru creation is A*B*C*D bytes.
> > 
> > Following factors contribute to the excessive allocations:
> > 
> > - Lists are created for possible NUMA nodes.
> > - memcg_nr_cache_ids grows in bulk (see memcg_alloc_cache_id() and additional
> >   list_lrus are created when it grows. Thus we end up creating list_lru_one
> >   list_heads even for those memcgs which are yet to be created.
> >   For example, when 10000 memcgs are created, memcg_nr_cache_ids reach
> >   a value of 12286.
> 
> So, by your numbers, we have 2 * 2 * 12286 * 32 = 1.5MB per mount.
> 
> So for that to make up 100GB of RAM, you must have somewhere over
> 500,000 mounted superblocks on the machine?
> 
> That implies 50+ unique mounted superblocks per container, which
> seems like an awful lot.
> 
> > - When a memcg goes offline, the list elements are drained to the parent
> >   memcg, but the list_head entry remains.
> > - The lists are destroyed only when the FS is unmounted. So list_heads
> >   for non-existing memcgs remain and continue to contribute to the
> >   kmalloc-32 allocation. This is presumably done for performance
> >   reason as they get reused when new memcgs are created, but they end up
> >   consuming slab memory until then.
> > - In case of containers, a few file systems get mounted and are specific
> >   to the container namespace and hence to a particular memcg, but we
> >   end up creating lists for all the memcgs.
> >   As an example, if 7 FS mounts are done for every container and when
> >   10k containers are created, we end up creating 2*7*12286 list_lru_one
> >   lists for each NUMA node. It appears that no elements will get added
> >   to other than 2*7=14 of them in the case of containers.
> 
> Yeah, at first glance this doesn't strike me as a problem with the
> list_lru structure, it smells more like a problem resulting from a
> huge number of superblock instantiations on the machine. Which,
> probably, mostly have no significant need for anything other than a
> single memcg awareness?
> 
> Can you post a typical /proc/self/mounts output from one of these
> idle/empty containers so we can see exactly how many mounts and
> their type are being instantiated in each container?

Similar to Michal, I wonder how much of that is really used in production
environments. From our experience it really depends on the type of
container we're talking about.
For a regular app container that essentially serves as an application
isolator, the number of mounts could be fairly limited and essentially
restricted to:

tmpfs
devpts
sysfs
[cgroupfs]
and a few bind-mounts of standard devices such as
/dev/null
/dev/zero
/dev/full
.
.
.
from the host's devtmpfs into the container.

Then there are containers that behave like regular systems and are
managed like regular systems and those might have quite a bit more. For
example, here is the output of a regular unprivileged Fedora 33
container I created out of the box:

[root@f33 ~]# findmnt 
TARGET                                SOURCE                                                                       FSTYPE      OPTIONS
/                                     /dev/mapper/ubuntu--vg-ubuntu--lv[/var/lib/lxd/storage-pools/default/containers/f33/rootfs]
│                                                                                                                  xfs         rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota
├─/run                                tmpfs                                                                        tmpfs       rw,nosuid,nodev,size=3226884k,nr_inodes=819200,mode=755,uid=100000,gid=100000
│ └─/run/user/0                       tmpfs                                                                        tmpfs       rw,nosuid,nodev,relatime,size=1613440k,nr_inodes=403360,mode=700,uid=100000,gid=100000
├─/tmp                                tmpfs                                                                        tmpfs       rw,nosuid,nodev,nr_inodes=409600,uid=100000,gid=100000
├─/dev                                none                                                                         tmpfs       rw,relatime,size=492k,mode=755,uid=100000,gid=100000
│ ├─/dev/shm                          tmpfs                                                                        tmpfs       rw,nosuid,nodev,uid=100000,gid=100000
│ ├─/dev/fuse                         udev[/fuse]                                                                  devtmpfs    rw,nosuid,noexec,relatime,size=8019708k,nr_inodes=2004927,mode=755
│ ├─/dev/net/tun                      udev[/net/tun]                                                               devtmpfs    rw,nosuid,noexec,relatime,size=8019708k,nr_inodes=2004927,mode=755
│ ├─/dev/mqueue                       mqueue                                                                       mqueue      rw,nosuid,nodev,noexec,relatime
│ ├─/dev/lxd                          tmpfs                                                                        tmpfs       rw,relatime,size=100k,mode=755
│ ├─/dev/.lxd-mounts                  tmpfs[/f33]                                                                  tmpfs       rw,relatime,size=100k,mode=711
│ ├─/dev/full                         udev[/full]                                                                  devtmpfs    rw,nosuid,noexec,relatime,size=8019708k,nr_inodes=2004927,mode=755
│ ├─/dev/null                         udev[/null]                                                                  devtmpfs    rw,nosuid,noexec,relatime,size=8019708k,nr_inodes=2004927,mode=755
│ ├─/dev/random                       udev[/random]                                                                devtmpfs    rw,nosuid,noexec,relatime,size=8019708k,nr_inodes=2004927,mode=755
│ ├─/dev/tty                          udev[/tty]                                                                   devtmpfs    rw,nosuid,noexec,relatime,size=8019708k,nr_inodes=2004927,mode=755
│ ├─/dev/urandom                      udev[/urandom]                                                               devtmpfs    rw,nosuid,noexec,relatime,size=8019708k,nr_inodes=2004927,mode=755
│ ├─/dev/zero                         udev[/zero]                                                                  devtmpfs    rw,nosuid,noexec,relatime,size=8019708k,nr_inodes=2004927,mode=755
│ ├─/dev/console                      devpts[/40]                                                                  devpts      rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000
│ ├─/dev/pts                          devpts                                                                       devpts      rw,nosuid,noexec,relatime,gid=100005,mode=620,ptmxmode=666,max=1024
│ └─/dev/ptmx                         devpts[/ptmx]                                                                devpts      rw,nosuid,noexec,relatime,gid=100005,mode=620,ptmxmode=666,max=1024
├─/proc                               proc                                                                         proc        rw,nosuid,nodev,noexec,relatime
│ ├─/proc/sys/fs/binfmt_misc          binfmt_misc                                                                  binfmt_misc rw,nosuid,nodev,noexec,relatime
│ └─/proc/sys/kernel/random/boot_id   none[/.lxc-boot-id]                                                          tmpfs       ro,nosuid,nodev,noexec,relatime,size=492k,mode=755,uid=100000,gid=100000
└─/sys                                sysfs                                                                        sysfs       rw,relatime
  ├─/sys/fs/cgroup                    tmpfs                                                                        tmpfs       ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755,uid=100000,gid=100000
  │ ├─/sys/fs/cgroup/unified          cgroup2                                                                      cgroup2     rw,nosuid,nodev,noexec,relatime
  │ ├─/sys/fs/cgroup/systemd          cgroup                                                                       cgroup      rw,nosuid,nodev,noexec,relatime,xattr,name=systemd
  │ ├─/sys/fs/cgroup/net_cls,net_prio cgroup                                                                       cgroup      rw,nosuid,nodev,noexec,relatime,net_cls,net_prio
  │ ├─/sys/fs/cgroup/hugetlb          cgroup                                                                       cgroup      rw,nosuid,nodev,noexec,relatime,hugetlb
  │ ├─/sys/fs/cgroup/cpu,cpuacct      cgroup                                                                       cgroup      rw,nosuid,nodev,noexec,relatime,cpu,cpuacct
  │ ├─/sys/fs/cgroup/blkio            cgroup                                                                       cgroup      rw,nosuid,nodev,noexec,relatime,blkio
  │ ├─/sys/fs/cgroup/cpuset           cgroup                                                                       cgroup      rw,nosuid,nodev,noexec,relatime,cpuset,clone_children
  │ ├─/sys/fs/cgroup/memory           cgroup                                                                       cgroup      rw,nosuid,nodev,noexec,relatime,memory
  │ ├─/sys/fs/cgroup/devices          cgroup                                                                       cgroup      rw,nosuid,nodev,noexec,relatime,devices
  │ ├─/sys/fs/cgroup/perf_event       cgroup                                                                       cgroup      rw,nosuid,nodev,noexec,relatime,perf_event
  │ ├─/sys/fs/cgroup/freezer          cgroup                                                                       cgroup      rw,nosuid,nodev,noexec,relatime,freezer
  │ ├─/sys/fs/cgroup/pids             cgroup                                                                       cgroup      rw,nosuid,nodev,noexec,relatime,pids
  │ └─/sys/fs/cgroup/rdma             cgroup                                                                       cgroup      rw,nosuid,nodev,noexec,relatime,rdma
  ├─/sys/firmware/efi/efivars         efivarfs                                                                     efivarfs    rw,nosuid,nodev,noexec,relatime
  ├─/sys/fs/fuse/connections          fusectl                                                                      fusectl     rw,nosuid,nodev,noexec,relatime
  ├─/sys/fs/pstore                    pstore                                                                       pstore      rw,nosuid,nodev,noexec,relatime
  ├─/sys/kernel/config                configfs                                                                     configfs    rw,nosuid,nodev,noexec,relatime
  ├─/sys/kernel/debug                 debugfs                                                                      debugfs     rw,nosuid,nodev,noexec,relatime
  ├─/sys/kernel/security              securityfs                                                                   securityfs  rw,nosuid,nodev,noexec,relatime
  ├─/sys/kernel/tracing               tracefs                                                                      tracefs     rw,nosuid,nodev,noexec,relatime

People that use those tend to also run systemd services in there, and
newer systemd has a range of service isolation features that may also
create quite a few mounts. Those will again mostly be pseudo filesystems
(a service might have a private proc, tmp, etc.) and bind-mounts. The
number of actual separate superblocks for "real" filesystems such as xfs
or ext4 per container is usually quite low. (For one, most of them can't
even be mounted in a user namespace.) From experience it's rare to see
workloads that exceed 500 containers (of this type at least) on a single
machine. At least on x86_64 we have not yet had issues with memory
consumption.

We do run stress tests with thousands of such system containers. They
tend to boot busybox, not e.g. Fedora, Debian, or Ubuntu, and that
hasn't pushed us over the edge yet.

> 
> > One straight forward way to prevent this excessive list_lru_one
> > allocations is to limit the list_lru_one creation only to the
> > relevant memcg. However I don't see an easy way to figure out
> > that relevant memcg from FS mount path (alloc_super())
> 
> Superblocks have to support an unknown number of memcgs after they
> have been mounted. bind mounts, child memcgs, etc, all mean that we
> can't just have a static, single mount time memcg instantiation.
> 
> > As an alternative approach, I have this below hack that does lazy
> > list_lru creation. The memcg-specific list is created and initialized
> > only when there is a request to add an element to that particular
> > list. Though I am not sure about the full impact of this change
> > on the owners of the lists and also the performance impact of this,
> > the overall savings look good.
> 
> Avoiding memory allocation in list_lru_add() was one of the main
> reasons for up-front static allocation of memcg lists. We cannot do
> memory allocation while callers are holding multiple spinlocks in
> core system algorithms (e.g. dentry_kill -> retain_dentry ->
> d_lru_add -> list_lru_add), let alone while holding an internal
> spinlock.
> 
> Putting a GFP_ATOMIC allocation inside 3-4 nested spinlocks in a
> path we know might have memory demand in the *hundreds of GB* range
> gets an NACK from me. It's a great idea, but it's just not a
> feasible, robust solution as proposed. Work out how to put the
> memory allocation outside all the locks (including caller locks) and
> it might be ok, but that's messy.
> 
> Another approach may be to identify filesystem types that do not
> need memcg awareness and feed that into alloc_super() to set/clear
> the SHRINKER_MEMCG_AWARE flag. This could be based on fstype - most
> virtual filesystems that expose system information do not really

I think that might already help quite a bit, as those tend to make up
most of the mounts, and even unprivileged containers can create new
instances of such mounts and will do so when they e.g. run systemd and
thus also systemd services.

Christian
Shakeel Butt April 7, 2021, 3:42 p.m. UTC | #17
On Wed, Apr 7, 2021 at 4:55 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 05-04-21 11:18:48, Bharata B Rao wrote:
> > Hi,
> >
> > When running 10000 (more-or-less-empty-)containers on a bare-metal Power9
> > server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> > consumption increases quite a lot (around 172G) when the containers are
> > running. Most of it comes from slab (149G) and within slab, the majority of
> > it comes from kmalloc-32 cache (102G)
>
> Is this 10k cgroups a testing environment or does anybody really use that
> in production? I would be really curious to hear how that behaves when
> those containers are not idle. E.g. global memory reclaim iterating over
> 10k memcgs will likely be very visible. I do remember playing with
> similar setups few years back and the overhead was very high.
> --

I can tell you about our environment. A couple of thousand memcgs (~2k)
are very normal on our machines, as a machine can be running 100+ jobs
(and each job can manage its own sub-memcgs). However, each job can
have a high number of mounts. There is no local disk, and each package
of the job is remotely mounted (it is a bit more complicated than that).

We do have issues with global memory reclaim, but mostly the proactive
reclaim makes global reclaim a tail issue (and at the tail it often
does create havoc).
Bharata B Rao April 15, 2021, 5:23 a.m. UTC | #18
On Wed, Apr 07, 2021 at 08:28:07AM +1000, Dave Chinner wrote:
> 
> Another approach may be to identify filesystem types that do not
> need memcg awareness and feed that into alloc_super() to set/clear
> the SHRINKER_MEMCG_AWARE flag. This could be based on fstype - most
> virtual filesystems that expose system information do not really
> need full memcg awareness because they are generally only visible to
> a single memcg instance...

Would something like below be appropriate?

From f314083ad69fde2a420a1b74febd6d3f7a25085f Mon Sep 17 00:00:00 2001
From: Bharata B Rao <bharata@linux.ibm.com>
Date: Wed, 14 Apr 2021 11:21:24 +0530
Subject: [PATCH 1/1] fs: Let filesystems opt out of memcg awareness

All filesystem mounts are memcg aware by default and hence end
up creating shrinker list_lrus for all the memcgs. Due to the way
memcg_nr_cache_ids grows and the way list_lru heads are allocated
for all memcgs, a huge amount of memory gets consumed by the
kmalloc-32 slab cache when running thousands of containers.

Improve this situation by allowing filesystems to opt out
of memcg awareness. In this patch, tmpfs, proc and ramfs
opt out of memcg awareness. This leads to considerable memory
savings when running 10k containers.

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
---
 fs/proc/root.c             |  1 +
 fs/ramfs/inode.c           |  1 +
 fs/super.c                 | 27 +++++++++++++++++++--------
 include/linux/fs_context.h |  2 ++
 mm/shmem.c                 |  1 +
 5 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/fs/proc/root.c b/fs/proc/root.c
index c7e3b1350ef8..7856bc2ca9f4 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -257,6 +257,7 @@ static int proc_init_fs_context(struct fs_context *fc)
 	fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
 	fc->fs_private = ctx;
 	fc->ops = &proc_fs_context_ops;
+	fc->memcg_optout = true;
 	return 0;
 }
 
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index 9ebd17d7befb..576a88bb7407 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -278,6 +278,7 @@ int ramfs_init_fs_context(struct fs_context *fc)
 	fsi->mount_opts.mode = RAMFS_DEFAULT_MODE;
 	fc->s_fs_info = fsi;
 	fc->ops = &ramfs_context_ops;
+	fc->memcg_optout = true;
 	return 0;
 }
 
diff --git a/fs/super.c b/fs/super.c
index 8c1baca35c16..59aa22c678e6 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -198,7 +198,8 @@ static void destroy_unused_super(struct super_block *s)
  *	returns a pointer new superblock or %NULL if allocation had failed.
  */
 static struct super_block *alloc_super(struct file_system_type *type, int flags,
-				       struct user_namespace *user_ns)
+				       struct user_namespace *user_ns,
+				       bool memcg_optout)
 {
 	struct super_block *s = kzalloc(sizeof(struct super_block),  GFP_USER);
 	static const struct super_operations default_op;
@@ -266,13 +267,22 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	s->s_shrink.scan_objects = super_cache_scan;
 	s->s_shrink.count_objects = super_cache_count;
 	s->s_shrink.batch = 1024;
-	s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
+	s->s_shrink.flags = SHRINKER_NUMA_AWARE;
+	if (!memcg_optout)
+		s->s_shrink.flags |= SHRINKER_MEMCG_AWARE;
 	if (prealloc_shrinker(&s->s_shrink))
 		goto fail;
-	if (list_lru_init_memcg(&s->s_dentry_lru, &s->s_shrink))
-		goto fail;
-	if (list_lru_init_memcg(&s->s_inode_lru, &s->s_shrink))
-		goto fail;
+	if (memcg_optout) {
+		if (list_lru_init(&s->s_dentry_lru))
+			goto fail;
+		if (list_lru_init(&s->s_inode_lru))
+			goto fail;
+	} else {
+		if (list_lru_init_memcg(&s->s_dentry_lru, &s->s_shrink))
+			goto fail;
+		if (list_lru_init_memcg(&s->s_inode_lru, &s->s_shrink))
+			goto fail;
+	}
 	return s;
 
 fail:
@@ -527,7 +537,8 @@ struct super_block *sget_fc(struct fs_context *fc,
 	}
 	if (!s) {
 		spin_unlock(&sb_lock);
-		s = alloc_super(fc->fs_type, fc->sb_flags, user_ns);
+		s = alloc_super(fc->fs_type, fc->sb_flags, user_ns,
+				fc->memcg_optout);
 		if (!s)
 			return ERR_PTR(-ENOMEM);
 		goto retry;
@@ -610,7 +621,7 @@ struct super_block *sget(struct file_system_type *type,
 	}
 	if (!s) {
 		spin_unlock(&sb_lock);
-		s = alloc_super(type, (flags & ~SB_SUBMOUNT), user_ns);
+		s = alloc_super(type, (flags & ~SB_SUBMOUNT), user_ns, false);
 		if (!s)
 			return ERR_PTR(-ENOMEM);
 		goto retry;
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 37e1e8f7f08d..73388c0b6950 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -110,6 +110,8 @@ struct fs_context {
 	bool			need_free:1;	/* Need to call ops->free() */
 	bool			global:1;	/* Goes into &init_user_ns */
 	bool			oldapi:1;	/* Coming from mount(2) */
+	bool			memcg_optout:1;	/* Opt out from per-memcg
+						   lru handling */
 };
 
 struct fs_context_operations {
diff --git a/mm/shmem.c b/mm/shmem.c
index b2db4ed0fbc7..0c9b2af52825 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3915,6 +3915,7 @@ int shmem_init_fs_context(struct fs_context *fc)
 
 	fc->fs_private = ctx;
 	fc->ops = &shmem_fs_context_ops;
+	fc->memcg_optout = true;
 	return 0;
 }
Michal Hocko April 15, 2021, 6:54 a.m. UTC | #19
On Thu 15-04-21 10:53:00, Bharata B Rao wrote:
> On Wed, Apr 07, 2021 at 08:28:07AM +1000, Dave Chinner wrote:
> > 
> > Another approach may be to identify filesystem types that do not
> > need memcg awareness and feed that into alloc_super() to set/clear
> > the SHRINKER_MEMCG_AWARE flag. This could be based on fstype - most
> > virtual filesystems that expose system information do not really
> > need full memcg awareness because they are generally only visible to
> > a single memcg instance...
> 
> Would something like below be appropriate?

No. First of all, you are defining yet another way to say
SHRINKER_MEMCG_AWARE, which is messy. And secondly, why would shmem, proc
and ramfs be so special that it is OK for them to opt out? There is not a
single word about that reasoning in your changelog.

> From f314083ad69fde2a420a1b74febd6d3f7a25085f Mon Sep 17 00:00:00 2001
> From: Bharata B Rao <bharata@linux.ibm.com>
> Date: Wed, 14 Apr 2021 11:21:24 +0530
> Subject: [PATCH 1/1] fs: Let filesystems opt out of memcg awareness
> 
> All filesystem mounts are memcg aware by default and hence end
> up creating shrinker list_lrus for all the memcgs. Due to the way
> memcg_nr_cache_ids grows and the way list_lru heads are allocated
> for all memcgs, a huge amount of memory gets consumed by the
> kmalloc-32 slab cache when running thousands of containers.
> 
> Improve this situation by allowing filesystems to opt out
> of memcg awareness. In this patch, tmpfs, proc and ramfs
> opt out of memcg awareness. This leads to considerable memory
> savings when running 10k containers.
> 
> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> ---
>  fs/proc/root.c             |  1 +
>  fs/ramfs/inode.c           |  1 +
>  fs/super.c                 | 27 +++++++++++++++++++--------
>  include/linux/fs_context.h |  2 ++
>  mm/shmem.c                 |  1 +
>  5 files changed, 24 insertions(+), 8 deletions(-)

[...]
Bharata B Rao April 15, 2021, 7:21 a.m. UTC | #20
On Thu, Apr 15, 2021 at 08:54:43AM +0200, Michal Hocko wrote:
> On Thu 15-04-21 10:53:00, Bharata B Rao wrote:
> > On Wed, Apr 07, 2021 at 08:28:07AM +1000, Dave Chinner wrote:
> > > 
> > > Another approach may be to identify filesystem types that do not
> > > need memcg awareness and feed that into alloc_super() to set/clear
> > > the SHRINKER_MEMCG_AWARE flag. This could be based on fstype - most
> > > virtual filesystems that expose system information do not really
> > > need full memcg awareness because they are generally only visible to
> > > a single memcg instance...
> > 
> > Would something like below be appropriate?
> 
> No. First of all, you are defining yet another way to say
> SHRINKER_MEMCG_AWARE, which is messy.

Ok.

> And secondly, why would shmem, proc
> and ramfs be so special that it is OK for them to opt out? There is not a
> single word about that reasoning in your changelog.

Right, I am only checking whether this is indeed what Dave suggested
(see above). There are a few other things to take care of, which I
shall do if the overall direction of the patch turns out to be
acceptable.

Regards,
Bharata.
Bharata B Rao April 16, 2021, 4:44 a.m. UTC | #21
On Wed, Apr 07, 2021 at 08:28:07AM +1000, Dave Chinner wrote:
> On Mon, Apr 05, 2021 at 11:18:48AM +0530, Bharata B Rao wrote:
> 
> > As an alternative approach, I have this below hack that does lazy
> > list_lru creation. The memcg-specific list is created and initialized
> > only when there is a request to add an element to that particular
> > list. Though I am not sure about the full impact of this change
> > on the owners of the lists and also the performance impact of this,
> > the overall savings look good.
> 
> Avoiding memory allocation in list_lru_add() was one of the main
> reasons for up-front static allocation of memcg lists. We cannot do
> memory allocation while callers are holding multiple spinlocks in
> core system algorithms (e.g. dentry_kill -> retain_dentry ->
> d_lru_add -> list_lru_add), let alone while holding an internal
> spinlock.
> 
> Putting a GFP_ATOMIC allocation inside 3-4 nested spinlocks in a
> path we know might have memory demand in the *hundreds of GB* range
> gets an NACK from me. It's a great idea, but it's just not a

I do understand that GFP_ATOMIC allocations are really not preferable,
but I want to point out that the allocations in the range of hundreds of
GBs get reduced to tens of MBs when we do lazy list_lru head allocations
under GFP_ATOMIC.

As shown earlier, this is what I see in my experimental setup with
10k containers:

Number of kmalloc-32 allocations
		Before		During		After
W/o patch	178176		3442409472	388933632
W/  patch	190464		468992		468992

So 3442409472*32 = 102GB of upfront list_lru creation-time GFP_KERNEL
allocations get reduced to 468992*32 = 14MB of dynamic list_lru
addition-time GFP_ATOMIC allocations.

This really depends on and varies with the type of the container and
the number of mounts it does, but I suspect we are looking
at GFP_ATOMIC allocations in the MB range. Also, the number of
GFP_ATOMIC slab allocation requests matters, I suppose.

There are other users of list_lru, but I was just looking at the
dentry and inode list_lru use case. It appears to me that for both
dentry and inode, we can tolerate failure of list_lru_add()
due to GFP_ATOMIC allocation failure. Failure to add a dentry
or an inode to the lru list means that it won't be retained in
the lru list, but would be freed immediately. Is this understanding
correct?

If so, would that be likely to impact subsequent lookups adversely?
We failed to retain a dentry or inode in the lru list because
we failed to allocate memory, presumably under memory pressure.
Even in such a scenario, is the failure to add a dentry or inode to
the lru list so bad as to be intolerable?

Regards,
Bharata.
Dave Chinner April 19, 2021, 1:23 a.m. UTC | #22
On Fri, Apr 16, 2021 at 10:14:39AM +0530, Bharata B Rao wrote:
> On Wed, Apr 07, 2021 at 08:28:07AM +1000, Dave Chinner wrote:
> > On Mon, Apr 05, 2021 at 11:18:48AM +0530, Bharata B Rao wrote:
> > 
> > > As an alternative approach, I have this below hack that does lazy
> > > list_lru creation. The memcg-specific list is created and initialized
> > > only when there is a request to add an element to that particular
> > > list. Though I am not sure about the full impact of this change
> > > on the owners of the lists and also the performance impact of this,
> > > the overall savings look good.
> > 
> > Avoiding memory allocation in list_lru_add() was one of the main
> > reasons for up-front static allocation of memcg lists. We cannot do
> > memory allocation while callers are holding multiple spinlocks in
> > core system algorithms (e.g. dentry_kill -> retain_dentry ->
> > d_lru_add -> list_lru_add), let alone while holding an internal
> > spinlock.
> > 
> > Putting a GFP_ATOMIC allocation inside 3-4 nested spinlocks in a
> > path we know might have memory demand in the *hundreds of GB* range
> > gets an NACK from me. It's a great idea, but it's just not a
> 
> I do understand that GFP_ATOMIC allocations are really not preferable,
> but I want to point out that the allocations in the range of hundreds of
> GBs get reduced to tens of MBs when we do lazy list_lru head allocations
> under GFP_ATOMIC.

That does not make GFP_ATOMIC allocations safe or desirable. In
general, using GFP_ATOMIC outside of interrupt context indicates
something is being done incorrectly. Especially if it can be
triggered from userspace, which is likely in this particular case...



> As shown earlier, this is what I see in my experimental setup with
> 10k containers:
> 
> Number of kmalloc-32 allocations
> 		Before		During		After
> W/o patch	178176		3442409472	388933632
> W/  patch	190464		468992		468992

So now we have an additional half million GFP_ATOMIC allocations
where we currently have none. That's not an improvement; that rings
loud alarm bells.

> This really depends on and varies with the type of the container and
> the number of mounts it does, but I suspect we are looking
> at GFP_ATOMIC allocations in the MB range. Also, the number of
> GFP_ATOMIC slab allocation requests matters, I suppose.

They are slab allocations, which means every single one of them
could require a new slab backing page (pages!) to be allocated.
Hence the likely memory demand might be a lot higher than the
optimal case you are considering here...

> There are other users of list_lru, but I was just looking at the
> dentry and inode list_lru use case. It appears to me that for both
> dentry and inode, we can tolerate failure of list_lru_add()
> due to GFP_ATOMIC allocation failure. Failure to add a dentry
> or an inode to the lru list means that it won't be retained in
> the lru list, but would be freed immediately. Is this understanding
> correct?

No. Both retain_dentry() and iput_final() would currently leak
objects that fail insertion into the LRU. They don't check for
insertion success at all.

But, really, this is irrelevant - GFP_ATOMIC usage is the problem,
and allowing it to fail doesn't avoid the problems that unbound
GFP_ATOMIC allocation can have on the stability of the rest of the
system when low on memory. Being able to handle a GFP_ATOMIC memory
allocation failure doesn't change the fact that you should not be
doing GFP_ATOMIC allocation in the first place...

Cheers,

Dave.
diff mbox series

Patch

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 6f067b6b935f..b453fa5008cc 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -112,16 +112,32 @@  list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
+static void init_one_lru(struct list_lru_one *l)
+{
+	INIT_LIST_HEAD(&l->list);
+	l->nr_items = 0;
+}
+
 bool list_lru_add(struct list_lru *lru, struct list_head *item)
 {
 	int nid = page_to_nid(virt_to_page(item));
 	struct list_lru_node *nlru = &lru->node[nid];
 	struct mem_cgroup *memcg;
 	struct list_lru_one *l;
+	struct list_lru_memcg *memcg_lrus;
 
 	spin_lock(&nlru->lock);
 	if (list_empty(item)) {
 		l = list_lru_from_kmem(nlru, item, &memcg);
+		if (!l) {
+			l = kmalloc(sizeof(struct list_lru_one), GFP_ATOMIC);
+			if (!l)
+				goto out;
+
+			init_one_lru(l);
+			memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true);
+			memcg_lrus->lru[memcg_cache_id(memcg)] = l;
+		}
 		list_add_tail(item, &l->list);
 		/* Set shrinker bit if the first element was added */
 		if (!l->nr_items++)
@@ -131,6 +147,7 @@  bool list_lru_add(struct list_lru *lru, struct list_head *item)
 		spin_unlock(&nlru->lock);
 		return true;
 	}
+out:
 	spin_unlock(&nlru->lock);
 	return false;
 }
@@ -176,11 +193,12 @@  unsigned long list_lru_count_one(struct list_lru *lru,
 {
 	struct list_lru_node *nlru = &lru->node[nid];
 	struct list_lru_one *l;
-	unsigned long count;
+	unsigned long count = 0;
 
 	rcu_read_lock();
 	l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
-	count = READ_ONCE(l->nr_items);
+	if (l)
+		count = READ_ONCE(l->nr_items);
 	rcu_read_unlock();
 
 	return count;
@@ -207,6 +225,9 @@  __list_lru_walk_one(struct list_lru_node *nlru, int memcg_idx,
 	unsigned long isolated = 0;
 
 	l = list_lru_from_memcg_idx(nlru, memcg_idx);
+	if (!l)
+		goto out;
+
 restart:
 	list_for_each_safe(item, n, &l->list) {
 		enum lru_status ret;
@@ -251,6 +272,7 @@  __list_lru_walk_one(struct list_lru_node *nlru, int memcg_idx,
 			BUG();
 		}
 	}
+out:
 	return isolated;
 }
 
@@ -312,12 +334,6 @@  unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
 }
 EXPORT_SYMBOL_GPL(list_lru_walk_node);
 
-static void init_one_lru(struct list_lru_one *l)
-{
-	INIT_LIST_HEAD(&l->list);
-	l->nr_items = 0;
-}
-
 #ifdef CONFIG_MEMCG_KMEM
 static void __memcg_destroy_list_lru_node(struct list_lru_memcg *memcg_lrus,
 					  int begin, int end)
@@ -328,41 +344,16 @@  static void __memcg_destroy_list_lru_node(struct list_lru_memcg *memcg_lrus,
 		kfree(memcg_lrus->lru[i]);
 }
 
-static int __memcg_init_list_lru_node(struct list_lru_memcg *memcg_lrus,
-				      int begin, int end)
-{
-	int i;
-
-	for (i = begin; i < end; i++) {
-		struct list_lru_one *l;
-
-		l = kmalloc(sizeof(struct list_lru_one), GFP_KERNEL);
-		if (!l)
-			goto fail;
-
-		init_one_lru(l);
-		memcg_lrus->lru[i] = l;
-	}
-	return 0;
-fail:
-	__memcg_destroy_list_lru_node(memcg_lrus, begin, i);
-	return -ENOMEM;
-}
-
 static int memcg_init_list_lru_node(struct list_lru_node *nlru)
 {
 	struct list_lru_memcg *memcg_lrus;
 	int size = memcg_nr_cache_ids;
 
-	memcg_lrus = kvmalloc(sizeof(*memcg_lrus) +
+	memcg_lrus = kvzalloc(sizeof(*memcg_lrus) +
 			      size * sizeof(void *), GFP_KERNEL);
 	if (!memcg_lrus)
 		return -ENOMEM;
 
-	if (__memcg_init_list_lru_node(memcg_lrus, 0, size)) {
-		kvfree(memcg_lrus);
-		return -ENOMEM;
-	}
 	RCU_INIT_POINTER(nlru->memcg_lrus, memcg_lrus);
 
 	return 0;
@@ -389,15 +380,10 @@  static int memcg_update_list_lru_node(struct list_lru_node *nlru,
 
 	old = rcu_dereference_protected(nlru->memcg_lrus,
 					lockdep_is_held(&list_lrus_mutex));
-	new = kvmalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
+	new = kvzalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
 	if (!new)
 		return -ENOMEM;
 
-	if (__memcg_init_list_lru_node(new, old_size, new_size)) {
-		kvfree(new);
-		return -ENOMEM;
-	}
-
 	memcpy(&new->lru, &old->lru, old_size * sizeof(void *));
 
 	/*
@@ -526,6 +512,7 @@  static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
 	struct list_lru_node *nlru = &lru->node[nid];
 	int dst_idx = dst_memcg->kmemcg_id;
 	struct list_lru_one *src, *dst;
+	struct list_lru_memcg *memcg_lrus;
 
 	/*
 	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
@@ -534,7 +521,17 @@  static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
 	spin_lock_irq(&nlru->lock);
 
 	src = list_lru_from_memcg_idx(nlru, src_idx);
+	if (!src)
+		goto out;
+
 	dst = list_lru_from_memcg_idx(nlru, dst_idx);
+	if (!dst) {
+		/* TODO: Use __GFP_NOFAIL? */
+		dst = kmalloc(sizeof(struct list_lru_one), GFP_ATOMIC);
+		init_one_lru(dst);
+		memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true);
+		memcg_lrus->lru[dst_idx] = dst;
+	}
 
 	list_splice_init(&src->list, &dst->list);
 
@@ -543,7 +540,7 @@  static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
 		memcg_set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
 		src->nr_items = 0;
 	}
-
+out:
 	spin_unlock_irq(&nlru->lock);
 }