[v2,15/28] mm: memcg/slab: obj_cgroup API

Message ID

20200127173453.2089565-16-guro@fb.com (mailing list archive)

State

New, archived

Headers

Return-Path: <SRS0=/2eA=3Q=kvack.org=owner-linux-mm@kernel.org>
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org AB7E1214D8
Smtp-Origin-Hostprefix: devvm
From: Roman Gushchin <guro@fb.com>
Smtp-Origin-Hostname: devvm2643.prn2.facebook.com
To: <linux-mm@kvack.org>, Andrew Morton <akpm@linux-foundation.org>
CC: Michal Hocko <mhocko@kernel.org>, Johannes Weiner <hannes@cmpxchg.org>,
        Shakeel Butt <shakeelb@google.com>,
        Vladimir Davydov <vdavydov.dev@gmail.com>,
        <linux-kernel@vger.kernel.org>, <kernel-team@fb.com>,
        Bharata B Rao <bharata@linux.ibm.com>,
        Yafang Shao <laoar.shao@gmail.com>, Roman Gushchin <guro@fb.com>
Smtp-Origin-Cluster: prn2c23
Subject: [PATCH v2 15/28] mm: memcg/slab: obj_cgroup API
Date: Mon, 27 Jan 2020 09:34:40 -0800
Message-ID: <20200127173453.2089565-16-guro@fb.com>
In-Reply-To: <20200127173453.2089565-1-guro@fb.com>
References: <20200127173453.2089565-1-guro@fb.com>
MIME-Version: 1.0
Content-Type: text/plain
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

The new cgroup slab memory controller

[v2,00/28] The new cgroup slab memory controller
[v2,01/28] mm: kmem: cleanup (__)memcg_kmem_charge_memcg() arguments
[v2,02/28] mm: kmem: cleanup memcg_kmem_uncharge_memcg() arguments
[v2,03/28] mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page()
[v2,04/28] mm: kmem: switch to nr_pages in (__)memcg_kmem_charge_memcg()
[v2,05/28] mm: memcg/slab: cache page number in memcg_(un)charge_slab()
[v2,06/28] mm: kmem: rename (__)memcg_kmem_(un)charge_memcg() to __memcg_kmem_(un)charge()
[v2,07/28] mm: memcg/slab: introduce mem_cgroup_from_obj()
[v2,08/28] mm: fork: fix kernel_stack memcg stats for various stack implementations
[v2,09/28] mm: memcg/slab: rename __mod_lruvec_slab_state() into __mod_lruvec_obj_state()
[v2,10/28] mm: memcg: introduce mod_lruvec_memcg_state()
[v2,11/28] mm: slub: implement SLUB version of obj_to_index()
[v2,12/28] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
[v2,13/28] mm: vmstat: convert slab vmstat counter to bytes
[v2,14/28] mm: memcontrol: decouple reference counting from page accounting
[v2,15/28] mm: memcg/slab: obj_cgroup API
[v2,16/28] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
[v2,17/28] mm: memcg/slab: save obj_cgroup for non-root slab objects
[v2,18/28] mm: memcg/slab: charge individual slab objects instead of pages
[v2,19/28] mm: memcg/slab: deprecate memory.kmem.slabinfo
[v2,20/28] mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h
[v2,21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
[v2,22/28] mm: memcg/slab: simplify memcg cache creation
[v2,23/28] mm: memcg/slab: deprecate memcg_kmem_get_cache()
[v2,24/28] mm: memcg/slab: deprecate slab_root_caches
[v2,25/28] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
[v2,26/28] tools/cgroup: add slabinfo.py tool
[v2,27/28] tools/cgroup: make slabinfo.py compatible with new slab controller
[v2,28/28] kselftests: cgroup: add kernel memory accounting tests

Commit Message

Roman Gushchin Jan. 27, 2020, 5:34 p.m. UTC

Obj_cgroup API provides an ability to account sub-page sized kernel
objects, which potentially outlive the original memory cgroup.

The top-level API consists of the following functions:
  bool obj_cgroup_tryget(struct obj_cgroup *objcg);
  void obj_cgroup_get(struct obj_cgroup *objcg);
  void obj_cgroup_put(struct obj_cgroup *objcg);

  int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
  void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);

  struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);

Object cgroup is basically a pointer to a memory cgroup with a per-cpu
reference counter. It substitutes a memory cgroup in places where
it's necessary to charge a custom amount of bytes instead of pages.

All charged memory rounded down to pages is charged to the
corresponding memory cgroup using __memcg_kmem_charge().
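
The page/byte split performed by obj_cgroup_charge() later in this patch can be tried out in user space. The following is only a sketch under assumptions: a 4 KiB page is assumed, and SKETCH_PAGE_SHIFT, struct charge_split and split_charge() are hypothetical names that do not appear in the patch. As posted, the code charges a partial page in full and hands the unused tail of that page to the per-cpu byte stock.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB pages; the kernel takes PAGE_SHIFT from the architecture. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/* Hypothetical helper type: result of splitting a byte-sized charge. */
struct charge_split {
	size_t nr_pages;	/* whole pages charged to the memory cgroup */
	size_t stock_bytes;	/* unused tail of the last page, kept as stock */
};

/* Mirrors the arithmetic on the slow path of obj_cgroup_charge(). */
static struct charge_split split_charge(size_t size)
{
	struct charge_split s;
	size_t nr_bytes = size & (SKETCH_PAGE_SIZE - 1);

	s.nr_pages = size >> SKETCH_PAGE_SHIFT;
	s.stock_bytes = 0;
	if (nr_bytes) {
		/* A partially used page is charged in full ... */
		s.nr_pages += 1;
		/* ... and the remainder is returned to the byte stock. */
		s.stock_bytes = SKETCH_PAGE_SIZE - nr_bytes;
	}
	/* Invariant: charged pages cover the request plus the stocked tail. */
	assert(s.nr_pages * SKETCH_PAGE_SIZE == size + s.stock_bytes);
	return s;
}
```

For a 100-byte object this yields one charged page and 3996 bytes of stock, so subsequent small charges on the same CPU can be served without touching the page counter.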

It implements reparenting: on memcg offlining it's getting reattached
to the parent memory cgroup. Each online memory cgroup has an
associated active object cgroup to handle new allocations and the list
of all attached object cgroups. On offlining of a cgroup this list is
reparented and for each object cgroup in the list the memcg pointer is
swapped to the parent memory cgroup. It prevents long-living objects
from pinning the original memory cgroup in the memory.
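
The reparenting step described above can be modelled without any kernel machinery. This is a simplified sketch, not the patch's code: it ignores locking, RCU and reference counting, uses a singly linked list in place of list_head, and every name ending in _model (plus attach_objcg() and reparent_objcgs()) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct memcg_model;

struct objcg_model {
	struct memcg_model *memcg;	/* current owner, swapped on reparenting */
	struct objcg_model *next;	/* singly linked stand-in for list_head */
};

struct memcg_model {
	struct memcg_model *parent;
	struct objcg_model *active;	/* bucket used for new allocations */
	struct objcg_model *objcg_list;	/* every bucket attached to this group */
};

/* Make objcg the active bucket of memcg and put it on the group's list. */
static void attach_objcg(struct memcg_model *memcg, struct objcg_model *objcg)
{
	objcg->memcg = memcg;
	objcg->next = memcg->objcg_list;
	memcg->objcg_list = objcg;
	memcg->active = objcg;
}

/* Offlining: hand every attached bucket over to the parent group. */
static void reparent_objcgs(struct memcg_model *memcg)
{
	struct memcg_model *parent = memcg->parent;
	struct objcg_model *objcg, *tail = NULL;

	assert(parent != NULL);
	memcg->active = NULL;	/* no new allocations through this group */

	for (objcg = memcg->objcg_list; objcg; objcg = objcg->next) {
		objcg->memcg = parent;	/* live objects now point at the parent */
		tail = objcg;
	}
	if (tail) {
		/* Splice the whole list in front of the parent's list. */
		tail->next = parent->objcg_list;
		parent->objcg_list = memcg->objcg_list;
		memcg->objcg_list = NULL;
	}
}
```

The point of the design is visible in the loop: the cost of offlining depends on the number of buckets, not on the number of live objects that reference them.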

The implementation is based on byte-sized per-cpu stocks. A sub-page
sized leftover is stored in an atomic field, which is a part of
obj_cgroup object. So on cgroup offlining the leftover is automatically
reparented.
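
A single-CPU model of the byte stock may help when reading consume_obj_stock(), refill_obj_stock() and drain_obj_stock() in the diff. It is a sketch under assumptions: 4 KiB pages, no interrupts or concurrency, and the memcg page counter folded into the bucket as charged_pages for brevity. All *_model names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1u << SKETCH_PAGE_SHIFT)

struct objcg_model {
	long charged_pages;		/* stand-in for the memcg page counter */
	unsigned int nr_charged_bytes;	/* sub-page leftover (atomic_t in the patch) */
};

struct stock_model {
	struct objcg_model *cached_objcg;
	unsigned int nr_bytes;		/* pre-charged bytes cached on this CPU */
};

/* Return whole pages to the counter and park the remainder on the bucket. */
static void drain_stock_model(struct stock_model *stock)
{
	struct objcg_model *old = stock->cached_objcg;

	if (!old)
		return;
	old->charged_pages -= (long)(stock->nr_bytes >> SKETCH_PAGE_SHIFT);
	old->nr_charged_bytes += stock->nr_bytes & (SKETCH_PAGE_SIZE - 1);
	stock->nr_bytes = 0;
	stock->cached_objcg = NULL;
}

/* Fast path: serve a charge from the cached bytes if possible. */
static bool consume_stock_model(struct stock_model *stock,
				struct objcg_model *objcg,
				unsigned int nr_bytes)
{
	if (stock->cached_objcg == objcg && stock->nr_bytes >= nr_bytes) {
		stock->nr_bytes -= nr_bytes;
		return true;
	}
	return false;
}

/* Give bytes back; drain once more than a page has accumulated. */
static void refill_stock_model(struct stock_model *stock,
			       struct objcg_model *objcg,
			       unsigned int nr_bytes)
{
	if (stock->cached_objcg != objcg) {
		drain_stock_model(stock);
		stock->cached_objcg = objcg;
		/* Pick up the leftover a previous drain parked on the bucket. */
		stock->nr_bytes = objcg->nr_charged_bytes;
		objcg->nr_charged_bytes = 0;
	}
	stock->nr_bytes += nr_bytes;
	if (stock->nr_bytes > SKETCH_PAGE_SIZE)
		drain_stock_model(stock);
}
```

Because the leftover lives in the bucket itself rather than in the memory cgroup, moving the bucket to a parent moves the leftover with it, which is what the paragraph above means by "automatically reparented".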

memcg->objcg is rcu protected.
objcg->memcg is a raw pointer, which is always pointing at a memory
cgroup, but can be atomically swapped to the parent memory cgroup. So
the caller must ensure the lifetime of the cgroup, e.g. grab
rcu_read_lock or css_set_lock.
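
The "raw pointer that can be atomically swapped" can be illustrated with C11 atomics. This is a user-space analogy only: the patch uses READ_ONCE() and xchg(), and the lifetime guarantee comes from rcu_read_lock or css_set_lock, neither of which is modelled here. The *_model names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct memcg_model {
	struct memcg_model *parent;
};

struct objcg_model {
	_Atomic(struct memcg_model *) memcg;	/* never NULL once initialized */
};

/* Analogue of obj_cgroup_memcg(): a single, untorn load of the pointer. */
static struct memcg_model *objcg_memcg_model(struct objcg_model *objcg)
{
	return atomic_load_explicit(&objcg->memcg, memory_order_relaxed);
}

/*
 * Analogue of xchg(&objcg->memcg, parent): readers observe either the
 * old group or its parent, never an intermediate value. Returns the
 * previous owner.
 */
static struct memcg_model *objcg_reparent_model(struct objcg_model *objcg)
{
	struct memcg_model *old = objcg_memcg_model(objcg);

	assert(old != NULL && old->parent != NULL);
	return atomic_exchange(&objcg->memcg, old->parent);
}
```

The atomic load only guarantees a consistent pointer value. It says nothing about whether the group it points to is still alive, which is why the caller has to provide that guarantee separately.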

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |  49 ++++++++
 mm/memcontrol.c            | 222 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 269 insertions(+), 2 deletions(-)

Comments

Johannes Weiner Feb. 3, 2020, 7:31 p.m. UTC | #1

On Mon, Jan 27, 2020 at 09:34:40AM -0800, Roman Gushchin wrote:
> Obj_cgroup API provides an ability to account sub-page sized kernel
> objects, which potentially outlive the original memory cgroup.
> 
> The top-level API consists of the following functions:
>   bool obj_cgroup_tryget(struct obj_cgroup *objcg);
>   void obj_cgroup_get(struct obj_cgroup *objcg);
>   void obj_cgroup_put(struct obj_cgroup *objcg);
> 
>   int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
>   void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
> 
>   struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg);
> 
> Object cgroup is basically a pointer to a memory cgroup with a per-cpu
> reference counter. It substitutes a memory cgroup in places where
> it's necessary to charge a custom amount of bytes instead of pages.
> 
> All charged memory rounded down to pages is charged to the
> corresponding memory cgroup using __memcg_kmem_charge().
> 
> It implements reparenting: on memcg offlining it's getting reattached
> to the parent memory cgroup. Each online memory cgroup has an
> associated active object cgroup to handle new allocations and the list
> of all attached object cgroups. On offlining of a cgroup this list is
> reparented and for each object cgroup in the list the memcg pointer is
> swapped to the parent memory cgroup. It prevents long-living objects
> from pinning the original memory cgroup in the memory.
> 
> The implementation is based on byte-sized per-cpu stocks. A sub-page
> sized leftover is stored in an atomic field, which is a part of
> obj_cgroup object. So on cgroup offlining the leftover is automatically
> reparented.
> 
> memcg->objcg is rcu protected.
> objcg->memcg is a raw pointer, which is always pointing at a memory
> cgroup, but can be atomically swapped to the parent memory cgroup. So
> the caller must ensure the lifetime of the cgroup, e.g. grab
> rcu_read_lock or css_set_lock.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>

> @@ -194,6 +195,22 @@ struct memcg_cgwb_frn {
>  	struct wb_completion done;	/* tracks in-flight foreign writebacks */
>  };
>  
> +/*
> + * Bucket for arbitrarily byte-sized objects charged to a memory
> + * cgroup. The bucket can be reparented in one piece when the cgroup
> + * is destroyed, without having to round up the individual references
> + * of all live memory objects in the wild.
> + */
> +struct obj_cgroup {
> +	struct percpu_ref refcnt;
> +	struct mem_cgroup *memcg;
> +	atomic_t nr_charged_bytes;
> +	union {
> +		struct list_head list;
> +		struct rcu_head rcu;
> +	};
> +};
> +
>  /*
>   * The memory controller data structure. The memory controller controls both
>   * page cache and RSS per cgroup. We would eventually like to provide
> @@ -306,6 +323,8 @@ struct mem_cgroup {
>  	int kmemcg_id;
>  	enum memcg_kmem_state kmem_state;
>  	struct list_head kmem_caches;
> +	struct obj_cgroup __rcu *objcg;
> +	struct list_head objcg_list;

These could use a comment, IMO.

	/*
	 * Active object accounting bucket, as well as
	 * reparented buckets from dead children with
	 * outstanding objects.
	 */

or something like that.

> @@ -257,6 +257,73 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
>  }
>  
>  #ifdef CONFIG_MEMCG_KMEM
> +extern spinlock_t css_set_lock;
> +
> +static void obj_cgroup_release(struct percpu_ref *ref)
> +{
> +	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
> +	unsigned int nr_bytes;
> +	unsigned int nr_pages;
> +	unsigned long flags;
> +
> +	nr_bytes = atomic_read(&objcg->nr_charged_bytes);
> +	WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
> +	nr_pages = nr_bytes >> PAGE_SHIFT;
> +
> +	if (nr_pages) {
> +		rcu_read_lock();
> +		__memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
> +		rcu_read_unlock();
> +	}
> +
> +	spin_lock_irqsave(&css_set_lock, flags);
> +	list_del(&objcg->list);
> +	mem_cgroup_put(obj_cgroup_memcg(objcg));
> +	spin_unlock_irqrestore(&css_set_lock, flags);

Heh, two obj_cgroup_memcg() lookups with different synchronization
rules.

I know that reparenting could happen in between the page uncharge and
the mem_cgroup_put(), and it would still be safe because the counters
are migrated atomically. But it seems needlessly lockless and complex.

Since you have to css_set_lock anyway, wouldn't it be better to do

	spin_lock_irqsave(&css_set_lock, flags);
	memcg = obj_cgroup_memcg(objcg);
	if (nr_pages)
		__memcg_kmem_uncharge(memcg, nr_pages);
	list_del(&objcg->list);
	mem_cgroup_put(memcg);
	spin_unlock_irqrestore(&css_set_lock, flags);

instead?

> +	percpu_ref_exit(ref);
> +	kfree_rcu(objcg, rcu);
> +}
> +
> +static struct obj_cgroup *obj_cgroup_alloc(void)
> +{
> +	struct obj_cgroup *objcg;
> +	int ret;
> +
> +	objcg = kzalloc(sizeof(struct obj_cgroup), GFP_KERNEL);
> +	if (!objcg)
> +		return NULL;
> +
> +	ret = percpu_ref_init(&objcg->refcnt, obj_cgroup_release, 0,
> +			      GFP_KERNEL);
> +	if (ret) {
> +		kfree(objcg);
> +		return NULL;
> +	}
> +	INIT_LIST_HEAD(&objcg->list);
> +	return objcg;
> +}
> +
> +static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> +				  struct mem_cgroup *parent)
> +{
> +	struct obj_cgroup *objcg;
> +
> +	objcg = rcu_replace_pointer(memcg->objcg, NULL, true);

Can this actually race with new charges? By the time we are going
offline, where would they be coming from?

What happens if the charger sees a live memcg, but its memcg->objcg is
cleared? Shouldn't they have the same kind of lifetime, where as long
as the memcg can be charged, so can the objcg? What would happen if
you didn't clear memcg->objcg here?

> +	/* Paired with mem_cgroup_put() in obj_cgroup_release(). */
> +	css_get(&memcg->css);
> +	percpu_ref_kill(&objcg->refcnt);
> +
> +	spin_lock_irq(&css_set_lock);
> +	list_for_each_entry(objcg, &memcg->objcg_list, list) {
> +		css_get(&parent->css);
> +		xchg(&objcg->memcg, parent);
> +		css_put(&memcg->css);
> +	}

I'm having a pretty hard time following this refcounting.

Why does objcg only acquire a css reference on the way out? It should
hold one when objcg->memcg is set up, and put it when that pointer
goes away.

But also, objcg is already on its own memcg->objcg_list from the
start, so on the first reparenting we get a css ref, then move it to
the parent, then obj_cgroup_release() puts one it doesn't have ...?

Argh, help.

> @@ -2978,6 +3070,120 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>  	if (PageKmemcg(page))
>  		__ClearPageKmemcg(page);
>  }
> +
> +static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> +{
> +	struct memcg_stock_pcp *stock;
> +	unsigned long flags;
> +	bool ret = false;
> +
> +	local_irq_save(flags);
> +
> +	stock = this_cpu_ptr(&memcg_stock);
> +	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
> +		stock->nr_bytes -= nr_bytes;
> +		ret = true;
> +	}
> +
> +	local_irq_restore(flags);
> +
> +	return ret;
> +}
> +
> +static void drain_obj_stock(struct memcg_stock_pcp *stock)
> +{
> +	struct obj_cgroup *old = stock->cached_objcg;
> +
> +	if (!old)
> +		return;
> +
> +	if (stock->nr_bytes) {
> +		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
> +		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
> +
> +		if (nr_pages) {
> +			rcu_read_lock();
> +			__memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
> +			rcu_read_unlock();
> +		}
> +
> +		atomic_add(nr_bytes, &old->nr_charged_bytes);
> +		stock->nr_bytes = 0;
> +	}
> +
> +	obj_cgroup_put(old);
> +	stock->cached_objcg = NULL;
> +}
> +
> +static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
> +				     struct mem_cgroup *root_memcg)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	if (stock->cached_objcg) {
> +		memcg = obj_cgroup_memcg(stock->cached_objcg);
> +		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> +{
> +	struct memcg_stock_pcp *stock;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +
> +	stock = this_cpu_ptr(&memcg_stock);
> +	if (stock->cached_objcg != objcg) { /* reset if necessary */
> +		drain_obj_stock(stock);
> +		obj_cgroup_get(objcg);
> +		stock->cached_objcg = objcg;
> +		stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
> +	}
> +	stock->nr_bytes += nr_bytes;
> +
> +	if (stock->nr_bytes > PAGE_SIZE)
> +		drain_obj_stock(stock);
> +
> +	local_irq_restore(flags);
> +}
> +
> +int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> +{
> +	struct mem_cgroup *memcg;
> +	unsigned int nr_pages, nr_bytes;
> +	int ret;
> +
> +	if (consume_obj_stock(objcg, size))
> +		return 0;
> +
> +	rcu_read_lock();
> +	memcg = obj_cgroup_memcg(objcg);
> +	css_get(&memcg->css);
> +	rcu_read_unlock();

I don't quite understand the lifetime rules here. You're holding the
rcu lock, so the memcg object cannot get physically freed while you
are looking it up. But you could be racing with an offlining and see
the stale memcg pointer. Isn't css_get() unsafe? Doesn't this need a
retry loop around css_tryget() similar to get_mem_cgroup_from_mm()?
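
For readers unfamiliar with the pattern being asked about, here is a user-space sketch of a tryget retry loop. It is an illustration of the general technique, not the kernel's css_tryget() or get_mem_cgroup_from_mm(): the reference count is a plain C11 atomic, a count of zero stands for "being torn down", and RCU (which keeps the memory valid during the lookup in the kernel) is not modelled. All *_model names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct memcg_model {
	atomic_long refcnt;		/* 0 means the group is being torn down */
	struct memcg_model *parent;
};

struct objcg_model {
	_Atomic(struct memcg_model *) memcg;
};

/* Take a reference only if the count has not already dropped to zero. */
static bool memcg_tryget_model(struct memcg_model *memcg)
{
	long old = atomic_load(&memcg->refcnt);

	while (old > 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&memcg->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Re-read the owner until a reference can be taken. A failed tryget
 * means the lookup raced with teardown, in which case the pointer is
 * expected to be swapped to the parent and the next iteration succeeds.
 */
static struct memcg_model *get_memcg_from_objcg_model(struct objcg_model *objcg)
{
	struct memcg_model *memcg;

	for (;;) {
		memcg = atomic_load(&objcg->memcg);
		if (memcg_tryget_model(memcg))
			return memcg;
	}
}
```

The difference from an unconditional get is that a reference is never taken on a group whose count has already reached zero, so a stale pointer is detected instead of silently resurrected.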

> +
> +	nr_pages = size >> PAGE_SHIFT;
> +	nr_bytes = size & (PAGE_SIZE - 1);
> +
> +	if (nr_bytes)
> +		nr_pages += 1;
> +
> +	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
> +	if (!ret && nr_bytes)
> +		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
> +
> +	css_put(&memcg->css);
> +	return ret;
> +}
> +
> +void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
> +{
> +	refill_obj_stock(objcg, size);
> +}
> +
>  #endif /* CONFIG_MEMCG_KMEM */
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> @@ -3400,7 +3606,8 @@ static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg)
>  #ifdef CONFIG_MEMCG_KMEM
>  static int memcg_online_kmem(struct mem_cgroup *memcg)
>  {
> -	int memcg_id;
> +	struct obj_cgroup *objcg;
> +	int memcg_id, ret;
>  
>  	if (cgroup_memory_nokmem)
>  		return 0;
> @@ -3412,6 +3619,15 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
>  	if (memcg_id < 0)
>  		return memcg_id;
>  
> +	objcg = obj_cgroup_alloc();
> +	if (!objcg) {
> +		memcg_free_cache_id(memcg_id);
> +		return -ENOMEM;
> +	}
> +	objcg->memcg = memcg;
> +	rcu_assign_pointer(memcg->objcg, objcg);
> +	list_add(&objcg->list, &memcg->objcg_list);

This self-hosting significantly adds to my confusion. It'd be a lot
easier to understand ownership rules and references if this list_add()
was done directly to the parent's list at the time of reparenting, not
here.

If the objcg holds a css reference, right here is where it should be
acquired. Then transferred in reparent and put during release.
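
The ownership rule suggested here (acquire at setup, transfer on reparent, put on release) can be written down as a small user-space model. It is a sketch of the proposal as worded in this reply, not code from the patch: reference counts are plain integers, there is no locking, and all *_model names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct memcg_model {
	struct memcg_model *parent;
	int css_refs;			/* stand-in for the css reference count */
};

struct objcg_model {
	struct memcg_model *memcg;
};

/* Setup: the bucket pins its group from the moment the pointer is set. */
static void objcg_setup_model(struct objcg_model *objcg,
			      struct memcg_model *memcg)
{
	memcg->css_refs++;
	objcg->memcg = memcg;
}

/* Reparent: move the bucket's single reference from child to parent. */
static void objcg_reparent_model(struct objcg_model *objcg)
{
	struct memcg_model *old = objcg->memcg;
	struct memcg_model *parent = old->parent;

	assert(parent != NULL);
	parent->css_refs++;	/* take the new reference first */
	objcg->memcg = parent;
	old->css_refs--;	/* then drop the one held on the child */
}

/* Release: drop the reference on whichever group currently owns the bucket. */
static void objcg_release_model(struct objcg_model *objcg)
{
	objcg->memcg->css_refs--;
	objcg->memcg = NULL;
}
```

With this scheme the bucket holds exactly one reference at every point in its life, so each get has an obvious matching put.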

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 73c2a7d32862..30bbea3f85e2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,6 +23,7 @@ 
 #include <linux/page-flags.h>
 
 struct mem_cgroup;
+struct obj_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
@@ -194,6 +195,22 @@  struct memcg_cgwb_frn {
 	struct wb_completion done;	/* tracks in-flight foreign writebacks */
 };
 
+/*
+ * Bucket for arbitrarily byte-sized objects charged to a memory
+ * cgroup. The bucket can be reparented in one piece when the cgroup
+ * is destroyed, without having to round up the individual references
+ * of all live memory objects in the wild.
+ */
+struct obj_cgroup {
+	struct percpu_ref refcnt;
+	struct mem_cgroup *memcg;
+	atomic_t nr_charged_bytes;
+	union {
+		struct list_head list;
+		struct rcu_head rcu;
+	};
+};
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -306,6 +323,8 @@  struct mem_cgroup {
 	int kmemcg_id;
 	enum memcg_kmem_state kmem_state;
 	struct list_head kmem_caches;
+	struct obj_cgroup __rcu *objcg;
+	struct list_head objcg_list;
 #endif
 
 #ifdef CONFIG_CGROUP_WRITEBACK
@@ -429,6 +448,33 @@  struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline bool obj_cgroup_tryget(struct obj_cgroup *objcg)
+{
+	return percpu_ref_tryget(&objcg->refcnt);
+}
+
+static inline void obj_cgroup_get(struct obj_cgroup *objcg)
+{
+	percpu_ref_get(&objcg->refcnt);
+}
+
+static inline void obj_cgroup_put(struct obj_cgroup *objcg)
+{
+	percpu_ref_put(&objcg->refcnt);
+}
+
+/*
+ * After the initialization objcg->memcg is always pointing at
+ * a valid memcg, but can be atomically swapped to the parent memcg.
+ *
+ * The caller must ensure that the returned memcg won't be released:
+ * e.g. acquire the rcu_read_lock or css_set_lock.
+ */
+static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
+{
+	return READ_ONCE(objcg->memcg);
+}
+
 static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 	if (memcg)
@@ -1395,6 +1441,9 @@  void __memcg_kmem_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages);
 int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_page(struct page *page, int order);
 
+int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
+void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+
 extern struct static_key_false memcg_kmem_enabled_key;
 extern struct workqueue_struct *memcg_kmem_cache_wq;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b86cfdcf2e1d..9aa37bc61db5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -257,6 +257,73 @@  struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
+extern spinlock_t css_set_lock;
+
+static void obj_cgroup_release(struct percpu_ref *ref)
+{
+	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
+	unsigned int nr_bytes;
+	unsigned int nr_pages;
+	unsigned long flags;
+
+	nr_bytes = atomic_read(&objcg->nr_charged_bytes);
+	WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
+	nr_pages = nr_bytes >> PAGE_SHIFT;
+
+	if (nr_pages) {
+		rcu_read_lock();
+		__memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
+		rcu_read_unlock();
+	}
+
+	spin_lock_irqsave(&css_set_lock, flags);
+	list_del(&objcg->list);
+	mem_cgroup_put(obj_cgroup_memcg(objcg));
+	spin_unlock_irqrestore(&css_set_lock, flags);
+
+	percpu_ref_exit(ref);
+	kfree_rcu(objcg, rcu);
+}
+
+static struct obj_cgroup *obj_cgroup_alloc(void)
+{
+	struct obj_cgroup *objcg;
+	int ret;
+
+	objcg = kzalloc(sizeof(struct obj_cgroup), GFP_KERNEL);
+	if (!objcg)
+		return NULL;
+
+	ret = percpu_ref_init(&objcg->refcnt, obj_cgroup_release, 0,
+			      GFP_KERNEL);
+	if (ret) {
+		kfree(objcg);
+		return NULL;
+	}
+	INIT_LIST_HEAD(&objcg->list);
+	return objcg;
+}
+
+static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
+				  struct mem_cgroup *parent)
+{
+	struct obj_cgroup *objcg;
+
+	objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
+	/* Paired with mem_cgroup_put() in obj_cgroup_release(). */
+	css_get(&memcg->css);
+	percpu_ref_kill(&objcg->refcnt);
+
+	spin_lock_irq(&css_set_lock);
+	list_for_each_entry(objcg, &memcg->objcg_list, list) {
+		css_get(&parent->css);
+		xchg(&objcg->memcg, parent);
+		css_put(&memcg->css);
+	}
+	list_splice(&memcg->objcg_list, &parent->objcg_list);
+	spin_unlock_irq(&css_set_lock);
+}
+
 /*
  * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
  * The main reason for not using cgroup id for this:
@@ -2062,6 +2129,12 @@  EXPORT_SYMBOL(unlock_page_memcg);
 struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
+
+#ifdef CONFIG_MEMCG_KMEM
+	struct obj_cgroup *cached_objcg;
+	unsigned int nr_bytes;
+#endif
+
 	struct work_struct work;
 	unsigned long flags;
 #define FLUSHING_CACHED_CHARGE	0
@@ -2069,6 +2142,22 @@  struct memcg_stock_pcp {
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
+#ifdef CONFIG_MEMCG_KMEM
+static void drain_obj_stock(struct memcg_stock_pcp *stock);
+static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
+				     struct mem_cgroup *root_memcg);
+
+#else
+static inline void drain_obj_stock(struct memcg_stock_pcp *stock)
+{
+}
+static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
+				     struct mem_cgroup *root_memcg)
+{
+	return false;
+}
+#endif
+
 /**
  * consume_stock: Try to consume stocked charge on this cpu.
  * @memcg: memcg to consume from.
@@ -2135,6 +2224,7 @@  static void drain_local_stock(struct work_struct *dummy)
 	local_irq_save(flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
+	drain_obj_stock(stock);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
@@ -2194,6 +2284,8 @@  static void drain_all_stock(struct mem_cgroup *root_memcg)
 		if (memcg && stock->nr_pages &&
 		    mem_cgroup_is_descendant(memcg, root_memcg))
 			flush = true;
+		if (obj_stock_flush_required(stock, root_memcg))
+			flush = true;
 		rcu_read_unlock();
 
 		if (flush &&
@@ -2978,6 +3070,120 @@  void __memcg_kmem_uncharge_page(struct page *page, int order)
 	if (PageKmemcg(page))
 		__ClearPageKmemcg(page);
 }
+
+static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+	struct memcg_stock_pcp *stock;
+	unsigned long flags;
+	bool ret = false;
+
+	local_irq_save(flags);
+
+	stock = this_cpu_ptr(&memcg_stock);
+	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
+		stock->nr_bytes -= nr_bytes;
+		ret = true;
+	}
+
+	local_irq_restore(flags);
+
+	return ret;
+}
+
+static void drain_obj_stock(struct memcg_stock_pcp *stock)
+{
+	struct obj_cgroup *old = stock->cached_objcg;
+
+	if (!old)
+		return;
+
+	if (stock->nr_bytes) {
+		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
+		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
+
+		if (nr_pages) {
+			rcu_read_lock();
+			__memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
+			rcu_read_unlock();
+		}
+
+		atomic_add(nr_bytes, &old->nr_charged_bytes);
+		stock->nr_bytes = 0;
+	}
+
+	obj_cgroup_put(old);
+	stock->cached_objcg = NULL;
+}
+
+static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
+				     struct mem_cgroup *root_memcg)
+{
+	struct mem_cgroup *memcg;
+
+	if (stock->cached_objcg) {
+		memcg = obj_cgroup_memcg(stock->cached_objcg);
+		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
+			return true;
+	}
+
+	return false;
+}
+
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+{
+	struct memcg_stock_pcp *stock;
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	stock = this_cpu_ptr(&memcg_stock);
+	if (stock->cached_objcg != objcg) { /* reset if necessary */
+		drain_obj_stock(stock);
+		obj_cgroup_get(objcg);
+		stock->cached_objcg = objcg;
+		stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
+	}
+	stock->nr_bytes += nr_bytes;
+
+	if (stock->nr_bytes > PAGE_SIZE)
+		drain_obj_stock(stock);
+
+	local_irq_restore(flags);
+}
+
+int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
+{
+	struct mem_cgroup *memcg;
+	unsigned int nr_pages, nr_bytes;
+	int ret;
+
+	if (consume_obj_stock(objcg, size))
+		return 0;
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	css_get(&memcg->css);
+	rcu_read_unlock();
+
+	nr_pages = size >> PAGE_SHIFT;
+	nr_bytes = size & (PAGE_SIZE - 1);
+
+	if (nr_bytes)
+		nr_pages += 1;
+
+	ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
+	if (!ret && nr_bytes)
+		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
+
+	css_put(&memcg->css);
+	return ret;
+}
+
+void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
+{
+	refill_obj_stock(objcg, size);
+}
+
 #endif /* CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -3400,7 +3606,8 @@  static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg)
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_online_kmem(struct mem_cgroup *memcg)
 {
-	int memcg_id;
+	struct obj_cgroup *objcg;
+	int memcg_id, ret;
 
 	if (cgroup_memory_nokmem)
 		return 0;
@@ -3412,6 +3619,15 @@  static int memcg_online_kmem(struct mem_cgroup *memcg)
 	if (memcg_id < 0)
 		return memcg_id;
 
+	objcg = obj_cgroup_alloc();
+	if (!objcg) {
+		memcg_free_cache_id(memcg_id);
+		return -ENOMEM;
+	}
+	objcg->memcg = memcg;
+	rcu_assign_pointer(memcg->objcg, objcg);
+	list_add(&objcg->list, &memcg->objcg_list);
+
 	static_branch_inc(&memcg_kmem_enabled_key);
 	/*
 	 * A memory cgroup is considered kmem-online as soon as it gets
@@ -3447,9 +3663,10 @@  static void memcg_offline_kmem(struct mem_cgroup *memcg)
 		parent = root_mem_cgroup;
 
 	/*
-	 * Deactivate and reparent kmem_caches.
+	 * Deactivate and reparent kmem_caches and objcgs.
 	 */
 	memcg_deactivate_kmem_caches(memcg, parent);
+	memcg_reparent_objcgs(memcg, parent);
 
 	kmemcg_id = memcg->kmemcg_id;
 	BUG_ON(kmemcg_id < 0);
@@ -5003,6 +5220,7 @@  static struct mem_cgroup *mem_cgroup_alloc(void)
 	memcg->socket_pressure = jiffies;
 #ifdef CONFIG_MEMCG_KMEM
 	memcg->kmemcg_id = -1;
+	INIT_LIST_HEAD(&memcg->objcg_list);
 #endif
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&memcg->cgwb_list);
