
[4/5] mm: rework non-root kmem_cache lifecycle management

Message ID 20190417215434.25897-5-guro@fb.com (mailing list archive)
State New, archived
Series mm: reparent slab memory on cgroup removal

Commit Message

Roman Gushchin April 17, 2019, 9:54 p.m. UTC
This commit makes several important changes in the lifecycle
of a non-root kmem_cache, which also affect the lifecycle
of a memory cgroup.

Currently each charged slab page has a page->mem_cgroup pointer
to the memory cgroup and holds a reference to it.
Kmem_caches are held by the cgroup. On offlining, empty kmem_caches
are freed; all others are freed on cgroup release.

So the current scheme can be illustrated as:
page->mem_cgroup->kmem_cache.

To implement the slab memory reparenting we need to invert the scheme
into: page->kmem_cache->mem_cgroup.

Let's make every page hold a reference to the kmem_cache (we
already have a stable pointer), and make each kmem_cache hold a
single reference to the memory cgroup.

To make this possible we need to introduce a new refcounter for
non-root kmem_caches. It's atomic for now, but can easily be
converted to a percpu counter should we see any performance
penalty*. The initial value is set to 1, and it's decremented on
deactivation, so we never shut down an active cache.

To shut down non-active empty kmem_caches, let's reuse the
RCU-delayed work queue infrastructure previously used for
deactivation. After the generalization, it's perfectly suited
for our needs.

Since we can now release a kmem_cache at any moment after
deactivation, let's call sysfs_slab_remove() only from the shutdown
path. This makes the deactivation path simpler.
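
Condensed from the mm/slab.h and mm/slab_common.c hunks below, the
intended refcount lifecycle is roughly:

/*
 * Refcount lifecycle (helper names as introduced below):
 *
 *   init_memcg_params():    refcnt = 1           the "online" reference
 *   memcg_charge_slab():    get_many(1 << order) per charged slab page
 *   memcg_uncharge_slab():  put_many(1 << order) when the page is freed
 *   deactivation work:      put_many(1)          drops the initial reference
 *
 * The cache is shut down only when the counter reaches zero, i.e. when it
 * is both deactivated and empty.
 */
static __always_inline void kmemcg_cache_get_many(struct kmem_cache *s, long n)
{
	atomic_long_add(n, &s->memcg_params.refcnt);
}

static __always_inline void kmemcg_cache_put_many(struct kmem_cache *s, long n)
{
	if (atomic_long_sub_and_test(n, &s->memcg_params.refcnt))
		kmemcg_queue_cache_shutdown(s);	/* RCU-delayed shutdown */
}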

Because we no longer set the page->mem_cgroup pointer for slab
pages, we need to change the way memcg-level stats are accounted
for them. We can't use the mod_lruvec_page_state() helpers anymore,
so switch over to mod_lruvec_state().
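
In practice the lruvec is now looked up via the cache's memcg instead of
via page->mem_cgroup. A hypothetical helper (not in the patch) condensing
the change made in memcg_charge_slab()/memcg_uncharge_slab() below:

static void account_slab_page(struct kmem_cache *s, struct page *page,
			      int order, int sign)
{
	int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
	struct lruvec *lruvec;

	if (is_root_cache(s)) {
		/* root caches have no memcg: account at the node level */
		mod_node_page_state(page_pgdat(page), idx, sign * (1 << order));
		return;
	}

	/* derive the lruvec from the cache's memcg, not from the page */
	lruvec = mem_cgroup_lruvec(page_pgdat(page), s->memcg_params.memcg);
	mod_lruvec_state(lruvec, idx, sign * (1 << order));
}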

* I used the following simple approach to test the performance
(stolen from another patchset by T. Harding):

    time find / -name fname-no-exist
    echo 2 > /proc/sys/vm/drop_caches
    repeat several times

Results (I've chosen best results in several runs):

        orig       patched

real	0m0.712s   0m0.690s
user	0m0.104s   0m0.101s
sys	0m0.346s   0m0.340s

real	0m0.728s   0m0.723s
user	0m0.114s   0m0.115s
sys	0m0.342s   0m0.338s

real	0m0.685s   0m0.767s
user	0m0.118s   0m0.114s
sys	0m0.343s   0m0.336s

So it looks like the difference is not noticeable in this test.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/slab.h |  2 +-
 mm/memcontrol.c      |  9 ++++----
 mm/slab.c            | 15 +-----------
 mm/slab.h            | 54 +++++++++++++++++++++++++++++++++++++++++---
 mm/slab_common.c     | 51 +++++++++++++++++------------------------
 mm/slub.c            | 22 +-----------------
 6 files changed, 79 insertions(+), 74 deletions(-)

Comments

Shakeel Butt April 17, 2019, 11:41 p.m. UTC | #1
On Wed, Apr 17, 2019 at 2:55 PM Roman Gushchin <guroan@gmail.com> wrote:
>
> This commit makes several important changes in the lifecycle
> of a non-root kmem_cache, which also affect the lifecycle
> of a memory cgroup.
>
> Currently each charged slab page has a page->mem_cgroup pointer
> to the memory cgroup and holds a reference to it.
> Kmem_caches are held by the cgroup. On offlining empty kmem_caches
> are freed, all other are freed on cgroup release.

No, they are not freed (i.e. destroyed) on offlining, only
deactivated. All memcg kmem_caches are freed/destroyed on memcg's
css_free.

>
> So the current scheme can be illustrated as:
> page->mem_cgroup->kmem_cache.
>
> To implement the slab memory reparenting we need to invert the scheme
> into: page->kmem_cache->mem_cgroup.
>
> Let's make every page to hold a reference to the kmem_cache (we
> already have a stable pointer), and make kmem_caches to hold a single
> reference to the memory cgroup.

What about memcg_kmem_get_cache()? That function assumes that by
taking a reference on the memcg, its kmem_caches will stay. I think you
need to get a reference on the kmem_cache in memcg_kmem_get_cache()
within the rcu lock where you get the memcg through css_tryget_online().

>
> To make this possible we need to introduce a new refcounter
> for non-root kmem_caches. It's atomic for now, but can be easily
> converted to a percpu counter, had we any performance penalty*.
> The initial value is set to 1, and it's decremented on deactivation,
> so we never shutdown an active cache.
>
> To shutdown non-active empty kmem_caches, let's reuse the
> infrastructure of the RCU-delayed work queue, used previously for
> the deactivation. After the generalization, it's perfectly suited
> for our needs.
>
> Since now we can release a kmem_cache at any moment after the
> deactivation, let's call sysfs_slab_remove() only from the shutdown
> path. It makes deactivation path simpler.
>
> Because we don't set the page->mem_cgroup pointer, we need to change
> the way how memcg-level stats is working for slab pages. We can't use
> mod_lruvec_page_state() helpers anymore, so switch over to
> mod_lruvec_state().
>
> * I used the following simple approach to test the performance
> (stolen from another patchset by T. Harding):
>
>     time find / -name fname-no-exist
>     echo 2 > /proc/sys/vm/drop_caches
>     repeat several times
>
> Results (I've chosen best results in several runs):
>
>         orig       patched
>
> real    0m0.712s   0m0.690s
> user    0m0.104s   0m0.101s
> sys     0m0.346s   0m0.340s
>
> real    0m0.728s   0m0.723s
> user    0m0.114s   0m0.115s
> sys     0m0.342s   0m0.338s
>
> real    0m0.685s   0m0.767s
> user    0m0.118s   0m0.114s
> sys     0m0.343s   0m0.336s
>
> So it looks like the difference is not noticeable in this test.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  include/linux/slab.h |  2 +-
>  mm/memcontrol.c      |  9 ++++----
>  mm/slab.c            | 15 +-----------
>  mm/slab.h            | 54 +++++++++++++++++++++++++++++++++++++++++---
>  mm/slab_common.c     | 51 +++++++++++++++++------------------------
>  mm/slub.c            | 22 +-----------------
>  6 files changed, 79 insertions(+), 74 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 47923c173f30..4daaade76c63 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -152,7 +152,6 @@ int kmem_cache_shrink(struct kmem_cache *);
>
>  void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
>  void memcg_deactivate_kmem_caches(struct mem_cgroup *);
> -void memcg_destroy_kmem_caches(struct mem_cgroup *);
>
>  /*
>   * Please use this macro to create slab caches. Simply specify the
> @@ -641,6 +640,7 @@ struct memcg_cache_params {
>                         struct mem_cgroup *memcg;
>                         struct list_head children_node;
>                         struct list_head kmem_caches_node;
> +                       atomic_long_t refcnt;
>
>                         void (*work_fn)(struct kmem_cache *);
>                         union {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b2c39f187cbb..87c06e342e05 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2719,9 +2719,6 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
>                 cancel_charge(memcg, nr_pages);
>                 return -ENOMEM;
>         }
> -
> -       page->mem_cgroup = memcg;
> -
>         return 0;
>  }
>
> @@ -2744,8 +2741,10 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
>         memcg = get_mem_cgroup_from_current();
>         if (!mem_cgroup_is_root(memcg)) {
>                 ret = __memcg_kmem_charge_memcg(page, gfp, order, memcg);
> -               if (!ret)
> +               if (!ret) {
> +                       page->mem_cgroup = memcg;
>                         __SetPageKmemcg(page);
> +               }
>         }
>         css_put(&memcg->css);
>         return ret;
> @@ -3238,7 +3237,7 @@ static void memcg_free_kmem(struct mem_cgroup *memcg)
>                 memcg_offline_kmem(memcg);
>
>         if (memcg->kmem_state == KMEM_ALLOCATED) {
> -               memcg_destroy_kmem_caches(memcg);
> +               WARN_ON(!list_empty(&memcg->kmem_caches));
>                 static_branch_dec(&memcg_kmem_enabled_key);
>                 WARN_ON(page_counter_read(&memcg->kmem));
>         }
> diff --git a/mm/slab.c b/mm/slab.c
> index 14466a73d057..171b21ca617f 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1389,7 +1389,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
>                                                                 int nodeid)
>  {
>         struct page *page;
> -       int nr_pages;
>
>         flags |= cachep->allocflags;
>
> @@ -1404,12 +1403,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
>                 return NULL;
>         }
>
> -       nr_pages = (1 << cachep->gfporder);
> -       if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
> -               mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, nr_pages);
> -       else
> -               mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, nr_pages);
> -
>         __SetPageSlab(page);
>         /* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
>         if (sk_memalloc_socks() && page_is_pfmemalloc(page))
> @@ -1424,12 +1417,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
>  static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
>  {
>         int order = cachep->gfporder;
> -       unsigned long nr_freed = (1 << order);
> -
> -       if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
> -               mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, -nr_freed);
> -       else
> -               mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, -nr_freed);
>
>         BUG_ON(!PageSlab(page));
>         __ClearPageSlabPfmemalloc(page);
> @@ -1438,7 +1425,7 @@ static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
>         page->mapping = NULL;
>
>         if (current->reclaim_state)
> -               current->reclaim_state->reclaimed_slab += nr_freed;
> +               current->reclaim_state->reclaimed_slab += 1 << order;
>         memcg_uncharge_slab(page, order, cachep);
>         __free_pages(page, order);
>  }
> diff --git a/mm/slab.h b/mm/slab.h
> index 4a261c97c138..1f49945f5c1d 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -173,7 +173,9 @@ void __kmem_cache_release(struct kmem_cache *);
>  int __kmem_cache_shrink(struct kmem_cache *);
>  void __kmemcg_cache_deactivate(struct kmem_cache *s);
>  void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
> +void kmemcg_cache_shutdown(struct kmem_cache *s);
>  void slab_kmem_cache_release(struct kmem_cache *);
> +void kmemcg_queue_cache_shutdown(struct kmem_cache *s);
>
>  struct seq_file;
>  struct file;
> @@ -274,19 +276,65 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
>         return s->memcg_params.root_cache;
>  }
>
> +static __always_inline void kmemcg_cache_get_many(struct kmem_cache *s, long n)
> +{
> +       atomic_long_add(n, &s->memcg_params.refcnt);
> +}
> +
> +static __always_inline void kmemcg_cache_put_many(struct kmem_cache *s, long n)
> +{
> +       if (atomic_long_sub_and_test(n, &s->memcg_params.refcnt))
> +               kmemcg_queue_cache_shutdown(s);
> +}
> +
>  static __always_inline int memcg_charge_slab(struct page *page,
>                                              gfp_t gfp, int order,
>                                              struct kmem_cache *s)
>  {
> -       if (is_root_cache(s))
> +       int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> +               NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
> +       struct mem_cgroup *memcg;
> +       struct lruvec *lruvec;
> +       int ret;
> +
> +       if (is_root_cache(s)) {
> +               mod_node_page_state(page_pgdat(page), idx, 1 << order);
>                 return 0;
> -       return memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg);
> +       }
> +
> +       memcg = s->memcg_params.memcg;
> +       ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
> +       if (!ret) {
> +               lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> +               mod_lruvec_state(lruvec, idx, 1 << order);
> +
> +               /* transfer try_charge() page references to kmem_cache */
> +               kmemcg_cache_get_many(s, 1 << order);
> +               css_put_many(&memcg->css, 1 << order);
> +       }
> +
> +       return 0;
>  }
>
>  static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>                                                 struct kmem_cache *s)
>  {
> -       memcg_kmem_uncharge(page, order);
> +       int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> +               NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
> +       struct mem_cgroup *memcg;
> +       struct lruvec *lruvec;
> +
> +       if (is_root_cache(s)) {
> +               mod_node_page_state(page_pgdat(page), idx, -(1 << order));
> +               return;
> +       }
> +
> +       memcg = s->memcg_params.memcg;
> +       lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> +       mod_lruvec_state(lruvec, idx, -(1 << order));
> +       memcg_kmem_uncharge_memcg(page, order, memcg);
> +
> +       kmemcg_cache_put_many(s, 1 << order);
>  }
>
>  extern void slab_init_memcg_params(struct kmem_cache *);
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 4e5b4292a763..3fdd02979a1c 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -148,6 +148,7 @@ static int init_memcg_params(struct kmem_cache *s,
>                 s->memcg_params.root_cache = root_cache;
>                 INIT_LIST_HEAD(&s->memcg_params.children_node);
>                 INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
> +               atomic_long_set(&s->memcg_params.refcnt, 1);
>                 return 0;
>         }
>
> @@ -225,6 +226,7 @@ void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
>         if (is_root_cache(s)) {
>                 list_add(&s->root_caches_node, &slab_root_caches);
>         } else {
> +               css_get(&memcg->css);
>                 s->memcg_params.memcg = memcg;
>                 list_add(&s->memcg_params.children_node,
>                          &s->memcg_params.root_cache->memcg_params.children);
> @@ -240,6 +242,7 @@ static void memcg_unlink_cache(struct kmem_cache *s)
>         } else {
>                 list_del(&s->memcg_params.children_node);
>                 list_del(&s->memcg_params.kmem_caches_node);
> +               css_put(&s->memcg_params.memcg->css);
>         }
>  }
>  #else
> @@ -703,21 +706,19 @@ static void kmemcg_after_rcu_workfn(struct work_struct *work)
>
>         s->memcg_params.work_fn(s);
>         s->memcg_params.work_fn = NULL;
> +       kmemcg_cache_put_many(s, 1);
>
>         mutex_unlock(&slab_mutex);
>
>         put_online_mems();
>         put_online_cpus();
> -
> -       /* done, put the ref from slab_deactivate_memcg_cache_rcu_sched() */
> -       css_put(&s->memcg_params.memcg->css);
>  }
>
>  /*
>   * We need to grab blocking locks.  Bounce to ->work.  The
>   * work item shares the space with the RCU head and can't be
> - * initialized eariler.
> -*/
> + * initialized earlier.
> + */
>  static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
>  {
>         struct kmem_cache *s = container_of(head, struct kmem_cache,
> @@ -727,6 +728,21 @@ static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
>         queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
>  }
>
> +static void kmemcg_cache_shutdown_after_rcu(struct kmem_cache *s)
> +{
> +       WARN_ON(shutdown_cache(s));
> +}
> +
> +void kmemcg_queue_cache_shutdown(struct kmem_cache *s)
> +{
> +       if (s->memcg_params.root_cache->memcg_params.dying)
> +               return;
> +
> +       WARN_ON(s->memcg_params.work_fn);
> +       s->memcg_params.work_fn = kmemcg_cache_shutdown_after_rcu;
> +       call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
> +}
> +
>  static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
>  {
>         __kmemcg_cache_deactivate_after_rcu(s);
> @@ -739,9 +755,6 @@ static void kmemcg_cache_deactivate(struct kmem_cache *s)
>         if (s->memcg_params.root_cache->memcg_params.dying)
>                 return;
>
> -       /* pin memcg so that @s doesn't get destroyed in the middle */
> -       css_get(&s->memcg_params.memcg->css);
> -
>         WARN_ON_ONCE(s->memcg_params.work_fn);
>         s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu;
>         call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
> @@ -775,28 +788,6 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
>         put_online_cpus();
>  }
>
> -void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
> -{
> -       struct kmem_cache *s, *s2;
> -
> -       get_online_cpus();
> -       get_online_mems();
> -
> -       mutex_lock(&slab_mutex);
> -       list_for_each_entry_safe(s, s2, &memcg->kmem_caches,
> -                                memcg_params.kmem_caches_node) {
> -               /*
> -                * The cgroup is about to be freed and therefore has no charges
> -                * left. Hence, all its caches must be empty by now.
> -                */
> -               BUG_ON(shutdown_cache(s));
> -       }
> -       mutex_unlock(&slab_mutex);
> -
> -       put_online_mems();
> -       put_online_cpus();
> -}
> -
>  static int shutdown_memcg_caches(struct kmem_cache *s)
>  {
>         struct memcg_cache_array *arr;
> diff --git a/mm/slub.c b/mm/slub.c
> index 195f61785c7d..a28b3b3abf29 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1692,11 +1692,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>         if (!page)
>                 return NULL;
>
> -       mod_lruvec_page_state(page,
> -               (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> -               NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
> -               1 << oo_order(oo));
> -
>         inc_slabs_node(s, page_to_nid(page), page->objects);
>
>         return page;
> @@ -1730,11 +1725,6 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
>                         check_object(s, page, p, SLUB_RED_INACTIVE);
>         }
>
> -       mod_lruvec_page_state(page,
> -               (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> -               NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
> -               -pages);
> -
>         __ClearPageSlabPfmemalloc(page);
>         __ClearPageSlab(page);
>
> @@ -4037,18 +4027,8 @@ void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
>  {
>         /*
>          * Called with all the locks held after a sched RCU grace period.
> -        * Even if @s becomes empty after shrinking, we can't know that @s
> -        * doesn't have allocations already in-flight and thus can't
> -        * destroy @s until the associated memcg is released.
> -        *
> -        * However, let's remove the sysfs files for empty caches here.
> -        * Each cache has a lot of interface files which aren't
> -        * particularly useful for empty draining caches; otherwise, we can
> -        * easily end up with millions of unnecessary sysfs files on
> -        * systems which have a lot of memory and transient cgroups.
>          */
> -       if (!__kmem_cache_shrink(s))
> -               sysfs_slab_remove(s);
> +       __kmem_cache_shrink(s);
>  }
>
>  void __kmemcg_cache_deactivate(struct kmem_cache *s)
> --
> 2.20.1
>
Roman Gushchin April 18, 2019, 12:38 a.m. UTC | #2
On Wed, Apr 17, 2019 at 04:41:01PM -0700, Shakeel Butt wrote:
> On Wed, Apr 17, 2019 at 2:55 PM Roman Gushchin <guroan@gmail.com> wrote:
> >
> > This commit makes several important changes in the lifecycle
> > of a non-root kmem_cache, which also affect the lifecycle
> > of a memory cgroup.
> >
> > Currently each charged slab page has a page->mem_cgroup pointer
> > to the memory cgroup and holds a reference to it.
> > Kmem_caches are held by the cgroup. On offlining empty kmem_caches
> > are freed, all other are freed on cgroup release.
> 
> No, they are not freed (i.e. destroyed) on offlining, only
> deactivated. All memcg kmem_caches are freed/destroyed on memcg's
> css_free.

You're right, my bad. I was thinking about the corresponding sysfs entry
when I was writing it. We try to free it from the deactivation path too.

> 
> >
> > So the current scheme can be illustrated as:
> > page->mem_cgroup->kmem_cache.
> >
> > To implement the slab memory reparenting we need to invert the scheme
> > into: page->kmem_cache->mem_cgroup.
> >
> > Let's make every page to hold a reference to the kmem_cache (we
> > already have a stable pointer), and make kmem_caches to hold a single
> > reference to the memory cgroup.
> 
> What about memcg_kmem_get_cache()? That function assumes that by
> taking reference on memcg, it's kmem_caches will stay. I think you
> need to get reference on the kmem_cache in memcg_kmem_get_cache()
> within the rcu lock where you get the memcg through css_tryget_online.

Yeah, a very good question.

I believe it's safe because css_tryget_online() guarantees that
the cgroup is online and won't go offline before css_free() in
slab_post_alloc_hook(). I do initialize kmem_cache's refcount to 1
and drop it on offlining, so it protects the online kmem_cache.

Thank you for looking into the patchset!
Shakeel Butt April 18, 2019, 1:55 a.m. UTC | #3
On Wed, Apr 17, 2019 at 5:39 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Wed, Apr 17, 2019 at 04:41:01PM -0700, Shakeel Butt wrote:
> > On Wed, Apr 17, 2019 at 2:55 PM Roman Gushchin <guroan@gmail.com> wrote:
> > >
> > > This commit makes several important changes in the lifecycle
> > > of a non-root kmem_cache, which also affect the lifecycle
> > > of a memory cgroup.
> > >
> > > Currently each charged slab page has a page->mem_cgroup pointer
> > > to the memory cgroup and holds a reference to it.
> > > Kmem_caches are held by the cgroup. On offlining empty kmem_caches
> > > are freed, all other are freed on cgroup release.
> >
> > No, they are not freed (i.e. destroyed) on offlining, only
> > deactivated. All memcg kmem_caches are freed/destroyed on memcg's
> > css_free.
>
> You're right, my bad. I was thinking about the corresponding sysfs entry
> when was writing it. We try to free it from the deactivation path too.
>
> >
> > >
> > > So the current scheme can be illustrated as:
> > > page->mem_cgroup->kmem_cache.
> > >
> > > To implement the slab memory reparenting we need to invert the scheme
> > > into: page->kmem_cache->mem_cgroup.
> > >
> > > Let's make every page to hold a reference to the kmem_cache (we
> > > already have a stable pointer), and make kmem_caches to hold a single
> > > reference to the memory cgroup.
> >
> > What about memcg_kmem_get_cache()? That function assumes that by
> > taking reference on memcg, it's kmem_caches will stay. I think you
> > need to get reference on the kmem_cache in memcg_kmem_get_cache()
> > within the rcu lock where you get the memcg through css_tryget_online.
>
> Yeah, a very good question.
>
> I believe it's safe because css_tryget_online() guarantees that
> the cgroup is online and won't go offline before css_free() in
> slab_post_alloc_hook(). I do initialize kmem_cache's refcount to 1
> and drop it on offlining, so it protects the online kmem_cache.
>

Let's suppose a thread doing remote charging calls
memcg_kmem_get_cache() and gets an empty kmem_cache of the remote
memcg having refcnt equal to 1. That thread got a reference on the
remote memcg but no reference on the kmem_cache. Let's suppose that
thread got stuck in reclaim and was scheduled away. In the meantime
that remote memcg got offlined and decremented the refcnt of all of
its kmem_caches. The empty kmem_cache which the stuck thread holds a
pointer to can then get deleted, and the thread may end up using an
already destroyed kmem_cache after coming back from reclaim.

I think the above situation is possible unless the thread gets the
reference on the kmem_cache in memcg_kmem_get_cache().

Shakeel
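
For illustration, the suggested fix could look roughly like the sketch
below. It is simplified (the bypass and fallback logic of the real
memcg_kmem_get_cache() is omitted) and reuses the kmemcg_cache_get_many()/
put_many() helpers from this patch; it is not the actual implementation:

struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
{
	struct mem_cgroup *memcg = get_mem_cgroup_from_current();
	struct kmem_cache *memcg_cachep = cachep;
	int kmemcg_id = READ_ONCE(memcg->kmemcg_id);

	if (!mem_cgroup_is_root(memcg) && kmemcg_id >= 0) {
		memcg_cachep = cache_from_memcg_idx(cachep, kmemcg_id);
		if (likely(memcg_cachep))
			/* pin the cache itself, not only the memcg */
			kmemcg_cache_get_many(memcg_cachep, 1);
		else
			memcg_cachep = cachep;
	}
	css_put(&memcg->css);
	return memcg_cachep;
}

void memcg_kmem_put_cache(struct kmem_cache *cachep)
{
	if (!is_root_cache(cachep))
		kmemcg_cache_put_many(cachep, 1);
}
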
Roman Gushchin April 18, 2019, 3:07 a.m. UTC | #4
On Wed, Apr 17, 2019 at 06:55:12PM -0700, Shakeel Butt wrote:
> On Wed, Apr 17, 2019 at 5:39 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Wed, Apr 17, 2019 at 04:41:01PM -0700, Shakeel Butt wrote:
> > > On Wed, Apr 17, 2019 at 2:55 PM Roman Gushchin <guroan@gmail.com> wrote:
> > > >
> > > > This commit makes several important changes in the lifecycle
> > > > of a non-root kmem_cache, which also affect the lifecycle
> > > > of a memory cgroup.
> > > >
> > > > Currently each charged slab page has a page->mem_cgroup pointer
> > > > to the memory cgroup and holds a reference to it.
> > > > Kmem_caches are held by the cgroup. On offlining empty kmem_caches
> > > > are freed, all other are freed on cgroup release.
> > >
> > > No, they are not freed (i.e. destroyed) on offlining, only
> > > deactivated. All memcg kmem_caches are freed/destroyed on memcg's
> > > css_free.
> >
> > You're right, my bad. I was thinking about the corresponding sysfs entry
> > when was writing it. We try to free it from the deactivation path too.
> >
> > >
> > > >
> > > > So the current scheme can be illustrated as:
> > > > page->mem_cgroup->kmem_cache.
> > > >
> > > > To implement the slab memory reparenting we need to invert the scheme
> > > > into: page->kmem_cache->mem_cgroup.
> > > >
> > > > Let's make every page to hold a reference to the kmem_cache (we
> > > > already have a stable pointer), and make kmem_caches to hold a single
> > > > reference to the memory cgroup.
> > >
> > > What about memcg_kmem_get_cache()? That function assumes that by
> > > taking reference on memcg, it's kmem_caches will stay. I think you
> > > need to get reference on the kmem_cache in memcg_kmem_get_cache()
> > > within the rcu lock where you get the memcg through css_tryget_online.
> >
> > Yeah, a very good question.
> >
> > I believe it's safe because css_tryget_online() guarantees that
> > the cgroup is online and won't go offline before css_free() in
> > slab_post_alloc_hook(). I do initialize kmem_cache's refcount to 1
> > and drop it on offlining, so it protects the online kmem_cache.
> >
> 
> Let's suppose a thread doing a remote charging calls
> memcg_kmem_get_cache() and gets an empty kmem_cache of the remote
> memcg having refcnt equal to 1. That thread got a reference on the
> remote memcg but no reference on the kmem_cache. Let's suppose that
> thread got stuck in the reclaim and scheduled away. In the meantime
> that remote memcg got offlined and decremented the refcnt of all of
> its kmem_caches. The empty kmem_cache which the thread stuck in
> reclaim have pointer to can get deleted and may be using an already
> destroyed kmem_cache after coming back from reclaim.
> 
> I think the above situation is possible unless the thread gets the
> reference on the kmem_cache in memcg_kmem_get_cache().

Yes, you're right and I was writing nonsense: css_tryget_online()
can't prevent the cgroup from being offlined.

So, the problem with getting a reference in memcg_kmem_get_cache()
is that it's an atomic operation on the hot path, something I'd like
to avoid.

I can make the refcounter percpu, but it'll add some complexity and size
to the kmem_cache object. Still an option, of course.

I wonder if we can use rcu_read_lock() instead, and bump the refcounter
only if we're going into reclaim.

What do you think?

Thanks!
Christoph Lameter (Ampere) April 18, 2019, 1:34 p.m. UTC | #5
On Wed, 17 Apr 2019, Roman Gushchin wrote:

> Let's make every page to hold a reference to the kmem_cache (we
> already have a stable pointer), and make kmem_caches to hold a single
> reference to the memory cgroup.

Ok you are freeing one word in the page struct that can be used for other
purposes now?
Christoph Lameter (Ampere) April 18, 2019, 1:38 p.m. UTC | #6
On Wed, 17 Apr 2019, Roman Gushchin wrote:

>  static __always_inline int memcg_charge_slab(struct page *page,
>  					     gfp_t gfp, int order,
>  					     struct kmem_cache *s)
>  {
> -	if (is_root_cache(s))
> +	int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> +		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
> +	struct mem_cgroup *memcg;
> +	struct lruvec *lruvec;
> +	int ret;
> +
> +	if (is_root_cache(s)) {
> +		mod_node_page_state(page_pgdat(page), idx, 1 << order);

Hmmm... This is functionality that is not memcg specific being moved into
a memcg function??? Maybe rename the function to indicate that it is not
memcg specific and add the proper #ifdefs?

>  static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>  						struct kmem_cache *s)
>  {
> -	memcg_kmem_uncharge(page, order);
> +	int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> +		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
> +	struct mem_cgroup *memcg;
> +	struct lruvec *lruvec;
> +
> +	if (is_root_cache(s)) {
> +		mod_node_page_state(page_pgdat(page), idx, -(1 << order));
> +		return;
> +	}

And again.
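
One possible shape of that suggestion (hypothetical, not part of this
patch): keep the plain node-counter accounting in a generic helper and
leave only the memcg work behind CONFIG_MEMCG_KMEM, e.g.:

static __always_inline int charge_slab_page(struct page *page, gfp_t gfp,
					    int order, struct kmem_cache *s)
{
	int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;

	if (is_root_cache(s)) {
		/* generic path: no memcg involved, plain node counters */
		mod_node_page_state(page_pgdat(page), idx, 1 << order);
		return 0;
	}
#ifdef CONFIG_MEMCG_KMEM
	/* memcg_charge_slab() would then handle only the non-root case */
	return memcg_charge_slab(page, gfp, order, s);
#else
	return 0;
#endif
}
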
Shakeel Butt April 18, 2019, 2:05 p.m. UTC | #7
On Wed, Apr 17, 2019 at 8:07 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Wed, Apr 17, 2019 at 06:55:12PM -0700, Shakeel Butt wrote:
> > On Wed, Apr 17, 2019 at 5:39 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Wed, Apr 17, 2019 at 04:41:01PM -0700, Shakeel Butt wrote:
> > > > On Wed, Apr 17, 2019 at 2:55 PM Roman Gushchin <guroan@gmail.com> wrote:
> > > > >
> > > > > This commit makes several important changes in the lifecycle
> > > > > of a non-root kmem_cache, which also affect the lifecycle
> > > > > of a memory cgroup.
> > > > >
> > > > > Currently each charged slab page has a page->mem_cgroup pointer
> > > > > to the memory cgroup and holds a reference to it.
> > > > > Kmem_caches are held by the cgroup. On offlining empty kmem_caches
> > > > > are freed, all other are freed on cgroup release.
> > > >
> > > > No, they are not freed (i.e. destroyed) on offlining, only
> > > > deactivated. All memcg kmem_caches are freed/destroyed on memcg's
> > > > css_free.
> > >
> > > You're right, my bad. I was thinking about the corresponding sysfs entry
> > > when was writing it. We try to free it from the deactivation path too.
> > >
> > > >
> > > > >
> > > > > So the current scheme can be illustrated as:
> > > > > page->mem_cgroup->kmem_cache.
> > > > >
> > > > > To implement the slab memory reparenting we need to invert the scheme
> > > > > into: page->kmem_cache->mem_cgroup.
> > > > >
> > > > > Let's make every page to hold a reference to the kmem_cache (we
> > > > > already have a stable pointer), and make kmem_caches to hold a single
> > > > > reference to the memory cgroup.
> > > >
> > > > What about memcg_kmem_get_cache()? That function assumes that by
> > > > taking reference on memcg, it's kmem_caches will stay. I think you
> > > > need to get reference on the kmem_cache in memcg_kmem_get_cache()
> > > > within the rcu lock where you get the memcg through css_tryget_online.
> > >
> > > Yeah, a very good question.
> > >
> > > I believe it's safe because css_tryget_online() guarantees that
> > > the cgroup is online and won't go offline before css_free() in
> > > slab_post_alloc_hook(). I do initialize kmem_cache's refcount to 1
> > > and drop it on offlining, so it protects the online kmem_cache.
> > >
> >
> > Let's suppose a thread doing a remote charging calls
> > memcg_kmem_get_cache() and gets an empty kmem_cache of the remote
> > memcg having refcnt equal to 1. That thread got a reference on the
> > remote memcg but no reference on the kmem_cache. Let's suppose that
> > thread got stuck in the reclaim and scheduled away. In the meantime
> > that remote memcg got offlined and decremented the refcnt of all of
> > its kmem_caches. The empty kmem_cache which the thread stuck in
> > reclaim have pointer to can get deleted and may be using an already
> > destroyed kmem_cache after coming back from reclaim.
> >
> > I think the above situation is possible unless the thread gets the
> > reference on the kmem_cache in memcg_kmem_get_cache().
>
> Yes, you're right and I'm writing a nonsense: css_tryget_online()
> can't prevent the cgroup from being offlined.
>

The reason I knew about that race is that I tried something similar
but for a different use-case:

https://lkml.org/lkml/2018/3/26/472

> So, the problem with getting a reference in memcg_kmem_get_cache()
> is that it's an atomic operation on the hot path, something I'd like
> to avoid.
>
> I can make the refcounter percpu, but it'll add some complexity and size
> to the kmem_cache object. Still an option, of course.
>

I kind of prefer this option.

> I wonder if we can use rcu_read_lock() instead, and bump the refcounter
> only if we're going into reclaim.
>
> What do you think?

Should it be just reclaim or anything that can reschedule the current thread?

I can tell you how we resolved a similar issue for our
eager-kmem_cache-deletion use-case. Our solution (hack) works only for
CONFIG_SLAB (we only use SLAB) and a non-preemptible kernel. The
underlying motivation was to reduce the overhead of the slab reaper
traversing thousands of empty offlined kmem caches. CONFIG_SLAB
disables interrupts before accessing the per-cpu caches and re-enables
the interrupts if it has to fall back to the page allocator. We use
this window to call memcg_kmem_get_cache() and only increment the
refcnt of the kmem_cache if going to the fallback. Thus there is no
need to do an atomic operation on the hot path.

Anyways, I think having a percpu refcounter for each memcg kmem_cache
is not that costly for CONFIG_MEMCG_KMEM users, and to me that seems
like the simplest solution.

Shakeel
Roman Gushchin April 18, 2019, 6:04 p.m. UTC | #8
On Thu, Apr 18, 2019 at 01:34:52PM +0000, Christopher Lameter wrote:
> On Wed, 17 Apr 2019, Roman Gushchin wrote:
> 
> > Let's make every page to hold a reference to the kmem_cache (we
> > already have a stable pointer), and make kmem_caches to hold a single
> > reference to the memory cgroup.
> 
> Ok you are freeing one word in the page struct that can be used for other
> purposes now?
> 

Looks so!
Roman Gushchin April 18, 2019, 6:05 p.m. UTC | #9
On Thu, Apr 18, 2019 at 01:38:44PM +0000, Christopher Lameter wrote:
> On Wed, 17 Apr 2019, Roman Gushchin wrote:
> 
> >  static __always_inline int memcg_charge_slab(struct page *page,
> >  					     gfp_t gfp, int order,
> >  					     struct kmem_cache *s)
> >  {
> > -	if (is_root_cache(s))
> > +	int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> > +		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
> > +	struct mem_cgroup *memcg;
> > +	struct lruvec *lruvec;
> > +	int ret;
> > +
> > +	if (is_root_cache(s)) {
> > +		mod_node_page_state(page_pgdat(page), idx, 1 << order);
> 
> Hmmm... This is functionality that is not memcg specific being moved into
> a memcg function??? Maybe rename the function to indicate that it is not
> memcg specific and add the proper #ifdefs?
> 
> >  static __always_inline void memcg_uncharge_slab(struct page *page, int order,
> >  						struct kmem_cache *s)
> >  {
> > -	memcg_kmem_uncharge(page, order);
> > +	int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> > +		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
> > +	struct mem_cgroup *memcg;
> > +	struct lruvec *lruvec;
> > +
> > +	if (is_root_cache(s)) {
> > +		mod_node_page_state(page_pgdat(page), idx, -(1 << order));
> > +		return;
> > +	}
> 
> And again.
> 

Good point! Will do in v2.

Thanks!
Roman Gushchin April 18, 2019, 6:14 p.m. UTC | #10
On Thu, Apr 18, 2019 at 07:05:24AM -0700, Shakeel Butt wrote:
> On Wed, Apr 17, 2019 at 8:07 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Wed, Apr 17, 2019 at 06:55:12PM -0700, Shakeel Butt wrote:
> > > On Wed, Apr 17, 2019 at 5:39 PM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > On Wed, Apr 17, 2019 at 04:41:01PM -0700, Shakeel Butt wrote:
> > > > > On Wed, Apr 17, 2019 at 2:55 PM Roman Gushchin <guroan@gmail.com> wrote:
> > > > > >
> > > > > > This commit makes several important changes in the lifecycle
> > > > > > of a non-root kmem_cache, which also affect the lifecycle
> > > > > > of a memory cgroup.
> > > > > >
> > > > > > Currently each charged slab page has a page->mem_cgroup pointer
> > > > > > to the memory cgroup and holds a reference to it.
> > > > > > Kmem_caches are held by the cgroup. On offlining empty kmem_caches
> > > > > > are freed, all other are freed on cgroup release.
> > > > >
> > > > > No, they are not freed (i.e. destroyed) on offlining, only
> > > > > deactivated. All memcg kmem_caches are freed/destroyed on memcg's
> > > > > css_free.
> > > >
> > > > You're right, my bad. I was thinking about the corresponding sysfs entry
> > > > when was writing it. We try to free it from the deactivation path too.
> > > >
> > > > >
> > > > > >
> > > > > > So the current scheme can be illustrated as:
> > > > > > page->mem_cgroup->kmem_cache.
> > > > > >
> > > > > > To implement the slab memory reparenting we need to invert the scheme
> > > > > > into: page->kmem_cache->mem_cgroup.
> > > > > >
> > > > > > Let's make every page to hold a reference to the kmem_cache (we
> > > > > > already have a stable pointer), and make kmem_caches to hold a single
> > > > > > reference to the memory cgroup.
> > > > >
> > > > > What about memcg_kmem_get_cache()? That function assumes that by
> > > > > taking reference on memcg, it's kmem_caches will stay. I think you
> > > > > need to get reference on the kmem_cache in memcg_kmem_get_cache()
> > > > > within the rcu lock where you get the memcg through css_tryget_online.
> > > >
> > > > Yeah, a very good question.
> > > >
> > > > I believe it's safe because css_tryget_online() guarantees that
> > > > the cgroup is online and won't go offline before css_free() in
> > > > slab_post_alloc_hook(). I do initialize kmem_cache's refcount to 1
> > > > and drop it on offlining, so it protects the online kmem_cache.
> > > >
> > >
> > > Let's suppose a thread doing a remote charging calls
> > > memcg_kmem_get_cache() and gets an empty kmem_cache of the remote
> > > memcg having refcnt equal to 1. That thread got a reference on the
> > > remote memcg but no reference on the kmem_cache. Let's suppose that
> > > thread got stuck in the reclaim and scheduled away. In the meantime
> > > that remote memcg got offlined and decremented the refcnt of all of
> > > its kmem_caches. The empty kmem_cache which the thread stuck in
> > > reclaim have pointer to can get deleted and may be using an already
> > > destroyed kmem_cache after coming back from reclaim.
> > >
> > > I think the above situation is possible unless the thread gets the
> > > reference on the kmem_cache in memcg_kmem_get_cache().
> >
> > Yes, you're right and I'm writing a nonsense: css_tryget_online()
> > can't prevent the cgroup from being offlined.
> >
> 
> The reason I knew about that race is because I tried something similar
> but for different use-case:
> 
> https://lkml.org/lkml/2018/3/26/472
> 
> > So, the problem with getting a reference in memcg_kmem_get_cache()
> > is that it's an atomic operation on the hot path, something I'd like
> > to avoid.
> >
> > I can make the refcounter percpu, but it'll add some complexity and size
> > to the kmem_cache object. Still an option, of course.
> >
> 
> I kind of prefer this option.
> 
> > I wonder if we can use rcu_read_lock() instead, and bump the refcounter
> > only if we're going into reclaim.
> >
> > What do you think?
> 
> Should it be just reclaim or anything that can reschedule the current thread?
> 
> I can tell how we resolve the similar issue for our
> eager-kmem_cache-deletion use-case. Our solution (hack) works only for
> CONFIG_SLAB (we only use SLAB) and non-preemptible kernel. The
> underlying motivation was to reduce the overhead of slab reaper of
> traversing thousands of empty offlined kmem caches. CONFIG_SLAB
> disables interrupts before accessing the per-cpu caches and reenables
> the interrupts if it has to fallback to the page allocation. We use
> this window to call memcg_kmem_get_cache() and only increment the
> refcnt of kmem_cache if going to the fallback. Thus no need to do
> atomic operation on the hot path.
> 
> Anyways, I think having percpu refcounter for each memcg kmem_cache is
> not that costy for CONFIG_MEMCG_KMEM users and to me that seems like
> the most simple solution.
> 
> Shakeel

Ok, sounds like a percpu refcounter is the best option.
I'll try this approach in v2.

Thanks!
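
For illustration only, a percpu_ref-based refcnt (the direction agreed
on above for v2) could look roughly like this; the field type change,
the release callback and the init helper name are hypothetical and not
part of this patch:

/* memcg_params.refcnt would become a struct percpu_ref in this variant */
static void kmemcg_cache_release(struct percpu_ref *ref)
{
	struct kmem_cache *s = container_of(ref, struct kmem_cache,
					    memcg_params.refcnt);

	/* last reference gone: reuse the RCU-delayed shutdown path */
	kmemcg_queue_cache_shutdown(s);
}

/* at cache creation: one "online" reference, released on offlining */
static int init_memcg_refcnt(struct kmem_cache *s)
{
	return percpu_ref_init(&s->memcg_params.refcnt, kmemcg_cache_release,
			       0, GFP_KERNEL);
}

/* hot path: percpu operations, no shared atomics in the common case */
static inline void kmemcg_cache_get_many(struct kmem_cache *s, long n)
{
	percpu_ref_get_many(&s->memcg_params.refcnt, n);
}

static inline void kmemcg_cache_put_many(struct kmem_cache *s, long n)
{
	percpu_ref_put_many(&s->memcg_params.refcnt, n);
}

/* on memcg offlining: switch to atomic mode and drop the initial reference */
static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
{
	__kmemcg_cache_deactivate_after_rcu(s);
	percpu_ref_kill(&s->memcg_params.refcnt);
}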

Patch

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 47923c173f30..4daaade76c63 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -152,7 +152,6 @@  int kmem_cache_shrink(struct kmem_cache *);
 
 void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
 void memcg_deactivate_kmem_caches(struct mem_cgroup *);
-void memcg_destroy_kmem_caches(struct mem_cgroup *);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -641,6 +640,7 @@  struct memcg_cache_params {
 			struct mem_cgroup *memcg;
 			struct list_head children_node;
 			struct list_head kmem_caches_node;
+			atomic_long_t refcnt;
 
 			void (*work_fn)(struct kmem_cache *);
 			union {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b2c39f187cbb..87c06e342e05 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2719,9 +2719,6 @@  int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
 		cancel_charge(memcg, nr_pages);
 		return -ENOMEM;
 	}
-
-	page->mem_cgroup = memcg;
-
 	return 0;
 }
 
@@ -2744,8 +2741,10 @@  int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
 	memcg = get_mem_cgroup_from_current();
 	if (!mem_cgroup_is_root(memcg)) {
 		ret = __memcg_kmem_charge_memcg(page, gfp, order, memcg);
-		if (!ret)
+		if (!ret) {
+			page->mem_cgroup = memcg;
 			__SetPageKmemcg(page);
+		}
 	}
 	css_put(&memcg->css);
 	return ret;
@@ -3238,7 +3237,7 @@  static void memcg_free_kmem(struct mem_cgroup *memcg)
 		memcg_offline_kmem(memcg);
 
 	if (memcg->kmem_state == KMEM_ALLOCATED) {
-		memcg_destroy_kmem_caches(memcg);
+		WARN_ON(!list_empty(&memcg->kmem_caches));
 		static_branch_dec(&memcg_kmem_enabled_key);
 		WARN_ON(page_counter_read(&memcg->kmem));
 	}
diff --git a/mm/slab.c b/mm/slab.c
index 14466a73d057..171b21ca617f 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1389,7 +1389,6 @@  static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 								int nodeid)
 {
 	struct page *page;
-	int nr_pages;
 
 	flags |= cachep->allocflags;
 
@@ -1404,12 +1403,6 @@  static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 		return NULL;
 	}
 
-	nr_pages = (1 << cachep->gfporder);
-	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
-		mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, nr_pages);
-	else
-		mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, nr_pages);
-
 	__SetPageSlab(page);
 	/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
 	if (sk_memalloc_socks() && page_is_pfmemalloc(page))
@@ -1424,12 +1417,6 @@  static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
 {
 	int order = cachep->gfporder;
-	unsigned long nr_freed = (1 << order);
-
-	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
-		mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, -nr_freed);
-	else
-		mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, -nr_freed);
 
 	BUG_ON(!PageSlab(page));
 	__ClearPageSlabPfmemalloc(page);
@@ -1438,7 +1425,7 @@  static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
 	page->mapping = NULL;
 
 	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += nr_freed;
+		current->reclaim_state->reclaimed_slab += 1 << order;
 	memcg_uncharge_slab(page, order, cachep);
 	__free_pages(page, order);
 }
diff --git a/mm/slab.h b/mm/slab.h
index 4a261c97c138..1f49945f5c1d 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -173,7 +173,9 @@  void __kmem_cache_release(struct kmem_cache *);
 int __kmem_cache_shrink(struct kmem_cache *);
 void __kmemcg_cache_deactivate(struct kmem_cache *s);
 void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
+void kmemcg_cache_shutdown(struct kmem_cache *s);
 void slab_kmem_cache_release(struct kmem_cache *);
+void kmemcg_queue_cache_shutdown(struct kmem_cache *s);
 
 struct seq_file;
 struct file;
@@ -274,19 +276,65 @@  static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s->memcg_params.root_cache;
 }
 
+static __always_inline void kmemcg_cache_get_many(struct kmem_cache *s, long n)
+{
+	atomic_long_add(n, &s->memcg_params.refcnt);
+}
+
+static __always_inline void kmemcg_cache_put_many(struct kmem_cache *s, long n)
+{
+	if (atomic_long_sub_and_test(n, &s->memcg_params.refcnt))
+		kmemcg_queue_cache_shutdown(s);
+}
+
 static __always_inline int memcg_charge_slab(struct page *page,
 					     gfp_t gfp, int order,
 					     struct kmem_cache *s)
 {
-	if (is_root_cache(s))
+	int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
+	int ret;
+
+	if (is_root_cache(s)) {
+		mod_node_page_state(page_pgdat(page), idx, 1 << order);
 		return 0;
-	return memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg);
+	}
+
+	memcg = s->memcg_params.memcg;
+	ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
+	if (!ret) {
+		lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
+		mod_lruvec_state(lruvec, idx, 1 << order);
+
+		/* transfer try_charge() page references to kmem_cache */
+		kmemcg_cache_get_many(s, 1 << order);
+		css_put_many(&memcg->css, 1 << order);
+	}
+
+	return 0;
 }
 
 static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 						struct kmem_cache *s)
 {
-	memcg_kmem_uncharge(page, order);
+	int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
+
+	if (is_root_cache(s)) {
+		mod_node_page_state(page_pgdat(page), idx, -(1 << order));
+		return;
+	}
+
+	memcg = s->memcg_params.memcg;
+	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
+	mod_lruvec_state(lruvec, idx, -(1 << order));
+	memcg_kmem_uncharge_memcg(page, order, memcg);
+
+	kmemcg_cache_put_many(s, 1 << order);
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 4e5b4292a763..3fdd02979a1c 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -148,6 +148,7 @@  static int init_memcg_params(struct kmem_cache *s,
 		s->memcg_params.root_cache = root_cache;
 		INIT_LIST_HEAD(&s->memcg_params.children_node);
 		INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
+		atomic_long_set(&s->memcg_params.refcnt, 1);
 		return 0;
 	}
 
@@ -225,6 +226,7 @@  void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
 	if (is_root_cache(s)) {
 		list_add(&s->root_caches_node, &slab_root_caches);
 	} else {
+		css_get(&memcg->css);
 		s->memcg_params.memcg = memcg;
 		list_add(&s->memcg_params.children_node,
 			 &s->memcg_params.root_cache->memcg_params.children);
@@ -240,6 +242,7 @@  static void memcg_unlink_cache(struct kmem_cache *s)
 	} else {
 		list_del(&s->memcg_params.children_node);
 		list_del(&s->memcg_params.kmem_caches_node);
+		css_put(&s->memcg_params.memcg->css);
 	}
 }
 #else
@@ -703,21 +706,19 @@  static void kmemcg_after_rcu_workfn(struct work_struct *work)
 
 	s->memcg_params.work_fn(s);
 	s->memcg_params.work_fn = NULL;
+	kmemcg_cache_put_many(s, 1);
 
 	mutex_unlock(&slab_mutex);
 
 	put_online_mems();
 	put_online_cpus();
-
-	/* done, put the ref from slab_deactivate_memcg_cache_rcu_sched() */
-	css_put(&s->memcg_params.memcg->css);
 }
 
 /*
  * We need to grab blocking locks.  Bounce to ->work.  The
  * work item shares the space with the RCU head and can't be
- * initialized eariler.
-*/
+ * initialized earlier.
+ */
 static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
 {
 	struct kmem_cache *s = container_of(head, struct kmem_cache,
@@ -727,6 +728,21 @@  static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
 	queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
 }
 
+static void kmemcg_cache_shutdown_after_rcu(struct kmem_cache *s)
+{
+	WARN_ON(shutdown_cache(s));
+}
+
+void kmemcg_queue_cache_shutdown(struct kmem_cache *s)
+{
+	if (s->memcg_params.root_cache->memcg_params.dying)
+		return;
+
+	WARN_ON(s->memcg_params.work_fn);
+	s->memcg_params.work_fn = kmemcg_cache_shutdown_after_rcu;
+	call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
+}
+
 static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
 {
 	__kmemcg_cache_deactivate_after_rcu(s);
@@ -739,9 +755,6 @@  static void kmemcg_cache_deactivate(struct kmem_cache *s)
 	if (s->memcg_params.root_cache->memcg_params.dying)
 		return;
 
-	/* pin memcg so that @s doesn't get destroyed in the middle */
-	css_get(&s->memcg_params.memcg->css);
-
 	WARN_ON_ONCE(s->memcg_params.work_fn);
 	s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu;
 	call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
@@ -775,28 +788,6 @@  void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 	put_online_cpus();
 }
 
-void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
-{
-	struct kmem_cache *s, *s2;
-
-	get_online_cpus();
-	get_online_mems();
-
-	mutex_lock(&slab_mutex);
-	list_for_each_entry_safe(s, s2, &memcg->kmem_caches,
-				 memcg_params.kmem_caches_node) {
-		/*
-		 * The cgroup is about to be freed and therefore has no charges
-		 * left. Hence, all its caches must be empty by now.
-		 */
-		BUG_ON(shutdown_cache(s));
-	}
-	mutex_unlock(&slab_mutex);
-
-	put_online_mems();
-	put_online_cpus();
-}
-
 static int shutdown_memcg_caches(struct kmem_cache *s)
 {
 	struct memcg_cache_array *arr;
diff --git a/mm/slub.c b/mm/slub.c
index 195f61785c7d..a28b3b3abf29 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1692,11 +1692,6 @@  static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	if (!page)
 		return NULL;
 
-	mod_lruvec_page_state(page,
-		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
-		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
-		1 << oo_order(oo));
-
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 
 	return page;
@@ -1730,11 +1725,6 @@  static void __free_slab(struct kmem_cache *s, struct page *page)
 			check_object(s, page, p, SLUB_RED_INACTIVE);
 	}
 
-	mod_lruvec_page_state(page,
-		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
-		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
-		-pages);
-
 	__ClearPageSlabPfmemalloc(page);
 	__ClearPageSlab(page);
 
@@ -4037,18 +4027,8 @@  void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
 {
 	/*
 	 * Called with all the locks held after a sched RCU grace period.
-	 * Even if @s becomes empty after shrinking, we can't know that @s
-	 * doesn't have allocations already in-flight and thus can't
-	 * destroy @s until the associated memcg is released.
-	 *
-	 * However, let's remove the sysfs files for empty caches here.
-	 * Each cache has a lot of interface files which aren't
-	 * particularly useful for empty draining caches; otherwise, we can
-	 * easily end up with millions of unnecessary sysfs files on
-	 * systems which have a lot of memory and transient cgroups.
 	 */
-	if (!__kmem_cache_shrink(s))
-		sysfs_slab_remove(s);
+	__kmem_cache_shrink(s);
 }
 
 void __kmemcg_cache_deactivate(struct kmem_cache *s)