diff mbox series

[v2,21/28] mm: memcg/slab: use a single set of kmem_caches for all memory cgroups

Message ID 20200127173453.2089565-22-guro@fb.com (mailing list archive)
State New, archived
Headers show
Series The new cgroup slab memory controller | expand

Commit Message

Roman Gushchin Jan. 27, 2020, 5:34 p.m. UTC
This is fairly big but mostly red patch, which makes all non-root
slab allocations use a single set of kmem_caches instead of
creating a separate set for each memory cgroup.

Because the number of non-root kmem_caches is now capped by the number
of root kmem_caches, there is no need to shrink or destroy them
prematurely. They can be perfectly destroyed together with their
root counterparts. This allows to dramatically simplify the
management of non-root kmem_caches and delete a ton of code.

This patch performs the following changes:
1) introduces memcg_params.memcg_cache pointer to represent the
   kmem_cache which will be used for all non-root allocations
2) reuses the existing memcg kmem_cache creation mechanism
   to create memcg kmem_cache on the first allocation attempt
3) memcg kmem_caches are named <kmemcache_name>-memcg,
   e.g. dentry-memcg
4) simplifies memcg_kmem_get_cache() to just return memcg kmem_cache
   or schedule it's creation and return the root cache
5) removes almost all non-root kmem_cache management code
   (separate refcounter, reparenting, shrinking, etc)
6) makes slab debugfs to display root_mem_cgroup css id and never
   show :dead and :deact flags in the memcg_slabinfo attribute.

Following patches in the series will simplify the kmem_cache creation.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h |   7 +-
 include/linux/slab.h       |   5 +-
 mm/memcontrol.c            | 172 +++++----------
 mm/slab.c                  |  16 +-
 mm/slab.h                  | 145 ++++---------
 mm/slab_common.c           | 426 ++++---------------------------------
 mm/slub.c                  |  38 +---
 7 files changed, 146 insertions(+), 663 deletions(-)

Comments

Johannes Weiner Feb. 3, 2020, 7:50 p.m. UTC | #1
On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> This is fairly big but mostly red patch, which makes all non-root
> slab allocations use a single set of kmem_caches instead of
> creating a separate set for each memory cgroup.
> 
> Because the number of non-root kmem_caches is now capped by the number
> of root kmem_caches, there is no need to shrink or destroy them
> prematurely. They can be perfectly destroyed together with their
> root counterparts. This allows to dramatically simplify the
> management of non-root kmem_caches and delete a ton of code.

This is definitely going in the right direction. But it doesn't quite
explain why we still need two sets of kmem_caches?

In the old scheme, we had completely separate per-cgroup caches with
separate slab pages. If a cgrouped process wanted to allocate a slab
object, we'd go to the root cache and used the cgroup id to look up
the right cgroup cache. On slab free we'd use page->slab_cache.

Now we have slab pages that have a page->objcg array. Why can't all
allocations go through a single set of kmem caches? If an allocation
is coming from a cgroup and the slab page the allocator wants to use
doesn't have an objcg array yet, we can allocate it on the fly, no?
Roman Gushchin Feb. 3, 2020, 8:58 p.m. UTC | #2
On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > This is fairly big but mostly red patch, which makes all non-root
> > slab allocations use a single set of kmem_caches instead of
> > creating a separate set for each memory cgroup.
> > 
> > Because the number of non-root kmem_caches is now capped by the number
> > of root kmem_caches, there is no need to shrink or destroy them
> > prematurely. They can be perfectly destroyed together with their
> > root counterparts. This allows to dramatically simplify the
> > management of non-root kmem_caches and delete a ton of code.
> 
> This is definitely going in the right direction. But it doesn't quite
> explain why we still need two sets of kmem_caches?
> 
> In the old scheme, we had completely separate per-cgroup caches with
> separate slab pages. If a cgrouped process wanted to allocate a slab
> object, we'd go to the root cache and used the cgroup id to look up
> the right cgroup cache. On slab free we'd use page->slab_cache.
> 
> Now we have slab pages that have a page->objcg array. Why can't all
> allocations go through a single set of kmem caches? If an allocation
> is coming from a cgroup and the slab page the allocator wants to use
> doesn't have an objcg array yet, we can allocate it on the fly, no?

Well, arguably it can be done, but there are few drawbacks:

1) On the release path you'll need to make some extra work even for
   root allocations: calculate the offset only to find the NULL objcg pointer.

2) There will be a memory overhead for root allocations
   (which might or might not be compensated by the increase
   of the slab utilization).

3) I'm working on percpu memory accounting that resembles the same scheme,
   except that obj_cgroups vector is created for the whole percpu block.
   There will be root- and memcg-blocks, and it will be expensive to merge them.
   I kinda like using the same scheme here and there.

Upsides?

1) slab utilization might increase a little bit (but I doubt it will have
   a huge effect, because both merging sets should be relatively big and well
   utilized)
2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
   but there isn't so much code left anyway.


So IMO it's an interesting direction to explore, but not something
that necessarily has to be done in the context of this patchset.
Johannes Weiner Feb. 3, 2020, 10:17 p.m. UTC | #3
On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > This is fairly big but mostly red patch, which makes all non-root
> > > slab allocations use a single set of kmem_caches instead of
> > > creating a separate set for each memory cgroup.
> > > 
> > > Because the number of non-root kmem_caches is now capped by the number
> > > of root kmem_caches, there is no need to shrink or destroy them
> > > prematurely. They can be perfectly destroyed together with their
> > > root counterparts. This allows to dramatically simplify the
> > > management of non-root kmem_caches and delete a ton of code.
> > 
> > This is definitely going in the right direction. But it doesn't quite
> > explain why we still need two sets of kmem_caches?
> > 
> > In the old scheme, we had completely separate per-cgroup caches with
> > separate slab pages. If a cgrouped process wanted to allocate a slab
> > object, we'd go to the root cache and used the cgroup id to look up
> > the right cgroup cache. On slab free we'd use page->slab_cache.
> > 
> > Now we have slab pages that have a page->objcg array. Why can't all
> > allocations go through a single set of kmem caches? If an allocation
> > is coming from a cgroup and the slab page the allocator wants to use
> > doesn't have an objcg array yet, we can allocate it on the fly, no?
> 
> Well, arguably it can be done, but there are few drawbacks:
> 
> 1) On the release path you'll need to make some extra work even for
>    root allocations: calculate the offset only to find the NULL objcg pointer.
> 
> 2) There will be a memory overhead for root allocations
>    (which might or might not be compensated by the increase
>    of the slab utilization).

Those two are only true if there is a wild mix of root and cgroup
allocations inside the same slab, and that doesn't really happen in
practice. Either the machine is dedicated to one workload and cgroups
are only enabled due to e.g. a vendor kernel, or you have cgrouped
systems (like most distro systems now) that cgroup everything.

> 3) I'm working on percpu memory accounting that resembles the same scheme,
>    except that obj_cgroups vector is created for the whole percpu block.
>    There will be root- and memcg-blocks, and it will be expensive to merge them.
>    I kinda like using the same scheme here and there.

It's hard to conclude anything based on this information alone. If
it's truly expensive to merge them, then it warrants the additional
complexity. But I don't understand the desire to share a design for
two systems with sufficiently different constraints.

> Upsides?
> 
> 1) slab utilization might increase a little bit (but I doubt it will have
>    a huge effect, because both merging sets should be relatively big and well
>    utilized)

Right.

> 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
>    but there isn't so much code left anyway.

There is a lot of complexity associated with the cache cloning that
isn't the lines of code, but the lifetime and synchronization rules.

And these two things are the primary aspects that make my head hurt
trying to review this patch series.

> So IMO it's an interesting direction to explore, but not something
> that necessarily has to be done in the context of this patchset.

I disagree. Instead of replacing the old coherent model and its
complexities with a new coherent one, you are mixing the two. And I
can barely understand the end result.

Dynamically cloning entire slab caches for the sole purpose of telling
whether the pages have an obj_cgroup array or not is *completely
insane*. If the controller had followed the obj_cgroup design from the
start, nobody would have ever thought about doing it like this.

From a maintainability POV, we cannot afford merging it in this form.
Roman Gushchin Feb. 3, 2020, 10:38 p.m. UTC | #4
On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > This is fairly big but mostly red patch, which makes all non-root
> > > > slab allocations use a single set of kmem_caches instead of
> > > > creating a separate set for each memory cgroup.
> > > > 
> > > > Because the number of non-root kmem_caches is now capped by the number
> > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > prematurely. They can be perfectly destroyed together with their
> > > > root counterparts. This allows to dramatically simplify the
> > > > management of non-root kmem_caches and delete a ton of code.
> > > 
> > > This is definitely going in the right direction. But it doesn't quite
> > > explain why we still need two sets of kmem_caches?
> > > 
> > > In the old scheme, we had completely separate per-cgroup caches with
> > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > object, we'd go to the root cache and used the cgroup id to look up
> > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > > 
> > > Now we have slab pages that have a page->objcg array. Why can't all
> > > allocations go through a single set of kmem caches? If an allocation
> > > is coming from a cgroup and the slab page the allocator wants to use
> > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> > 
> > Well, arguably it can be done, but there are few drawbacks:
> > 
> > 1) On the release path you'll need to make some extra work even for
> >    root allocations: calculate the offset only to find the NULL objcg pointer.
> > 
> > 2) There will be a memory overhead for root allocations
> >    (which might or might not be compensated by the increase
> >    of the slab utilization).
> 
> Those two are only true if there is a wild mix of root and cgroup
> allocations inside the same slab, and that doesn't really happen in
> practice. Either the machine is dedicated to one workload and cgroups
> are only enabled due to e.g. a vendor kernel, or you have cgrouped
> systems (like most distro systems now) that cgroup everything.
> 
> > 3) I'm working on percpu memory accounting that resembles the same scheme,
> >    except that obj_cgroups vector is created for the whole percpu block.
> >    There will be root- and memcg-blocks, and it will be expensive to merge them.
> >    I kinda like using the same scheme here and there.
> 
> It's hard to conclude anything based on this information alone. If
> it's truly expensive to merge them, then it warrants the additional
> complexity. But I don't understand the desire to share a design for
> two systems with sufficiently different constraints.
> 
> > Upsides?
> > 
> > 1) slab utilization might increase a little bit (but I doubt it will have
> >    a huge effect, because both merging sets should be relatively big and well
> >    utilized)
> 
> Right.
> 
> > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> >    but there isn't so much code left anyway.
> 
> There is a lot of complexity associated with the cache cloning that
> isn't the lines of code, but the lifetime and synchronization rules.
> 
> And these two things are the primary aspects that make my head hurt
> trying to review this patch series.
> 
> > So IMO it's an interesting direction to explore, but not something
> > that necessarily has to be done in the context of this patchset.
> 
> I disagree. Instead of replacing the old coherent model and its
> complexities with a new coherent one, you are mixing the two. And I
> can barely understand the end result.
> 
> Dynamically cloning entire slab caches for the sole purpose of telling
> whether the pages have an obj_cgroup array or not is *completely
> insane*. If the controller had followed the obj_cgroup design from the
> start, nobody would have ever thought about doing it like this.

Having two sets of kmem_caches has nothing to do with the refcounting
and obj_cgroup abstraction.
Please, take a look at the final code.

Thanks!
Roman Gushchin Feb. 4, 2020, 1:15 a.m. UTC | #5
On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > This is fairly big but mostly red patch, which makes all non-root
> > > > slab allocations use a single set of kmem_caches instead of
> > > > creating a separate set for each memory cgroup.
> > > > 
> > > > Because the number of non-root kmem_caches is now capped by the number
> > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > prematurely. They can be perfectly destroyed together with their
> > > > root counterparts. This allows to dramatically simplify the
> > > > management of non-root kmem_caches and delete a ton of code.
> > > 
> > > This is definitely going in the right direction. But it doesn't quite
> > > explain why we still need two sets of kmem_caches?
> > > 
> > > In the old scheme, we had completely separate per-cgroup caches with
> > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > object, we'd go to the root cache and used the cgroup id to look up
> > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > > 
> > > Now we have slab pages that have a page->objcg array. Why can't all
> > > allocations go through a single set of kmem caches? If an allocation
> > > is coming from a cgroup and the slab page the allocator wants to use
> > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> > 
> > Well, arguably it can be done, but there are few drawbacks:
> > 
> > 1) On the release path you'll need to make some extra work even for
> >    root allocations: calculate the offset only to find the NULL objcg pointer.
> > 
> > 2) There will be a memory overhead for root allocations
> >    (which might or might not be compensated by the increase
> >    of the slab utilization).
> 
> Those two are only true if there is a wild mix of root and cgroup
> allocations inside the same slab, and that doesn't really happen in
> practice. Either the machine is dedicated to one workload and cgroups
> are only enabled due to e.g. a vendor kernel, or you have cgrouped
> systems (like most distro systems now) that cgroup everything.

It's actually a questionable statement: we do skip allocations from certain
contexts, and we do merge slab caches.

Most likely it's true for certain slab_caches and not true for others.
Think of kmalloc-* caches.

Also, because obj_cgroup vectors will not be freed without underlying pages,
most likely the percentage of pages with obj_cgroups will grow with uptime.
In other words, memcg allocations will fragment root slab pages.

> 
> > 3) I'm working on percpu memory accounting that resembles the same scheme,
> >    except that obj_cgroups vector is created for the whole percpu block.
> >    There will be root- and memcg-blocks, and it will be expensive to merge them.
> >    I kinda like using the same scheme here and there.
> 
> It's hard to conclude anything based on this information alone. If
> it's truly expensive to merge them, then it warrants the additional
> complexity. But I don't understand the desire to share a design for
> two systems with sufficiently different constraints.
> 
> > Upsides?
> > 
> > 1) slab utilization might increase a little bit (but I doubt it will have
> >    a huge effect, because both merging sets should be relatively big and well
> >    utilized)
> 
> Right.
> 
> > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> >    but there isn't so much code left anyway.
> 
> There is a lot of complexity associated with the cache cloning that
> isn't the lines of code, but the lifetime and synchronization rules.

Quite opposite: the patchset removes all the complexity (or 90% of it),
because it makes the kmem_cache lifetime independent from any cgroup stuff.

Kmem_caches are created on demand on the first request (most likely during
the system start-up), and destroyed together with their root counterparts
(most likely never or on rmmod). First request means globally first request,
not a first request from a given memcg.

Memcg kmem_cache lifecycle has nothing to do with memory/obj_cgroups, and
after creation just matches the lifetime of the root kmem caches.

The only reason to keep the async creation is that some kmem_caches
are created very early in the boot process, long before any cgroup
stuff is initialized.

> 
> And these two things are the primary aspects that make my head hurt
> trying to review this patch series.
> 
> > So IMO it's an interesting direction to explore, but not something
> > that necessarily has to be done in the context of this patchset.
> 
> I disagree. Instead of replacing the old coherent model and its
> complexities with a new coherent one, you are mixing the two. And I
> can barely understand the end result.
> 
> Dynamically cloning entire slab caches for the sole purpose of telling
> whether the pages have an obj_cgroup array or not is *completely
> insane*. If the controller had followed the obj_cgroup design from the
> start, nobody would have ever thought about doing it like this.

It's just not true. The whole point of having root- and memcg sets is
to be able to not look for a NULL pointer in the obj_cgroup vector on
releasing of the root object. In other words, it allows to keep zero
overhead for root allocations. IMHO it's an important thing, and calling
it *completely insane* isn't the best way to communicate.

> 
> From a maintainability POV, we cannot afford merging it in this form.

It sounds strange: the patchset eliminates 90% of the complexity,
but it's unmergeable because there are 10% left.

I agree that it's an arguable question if we can tolerate some
additional overhead on root allocations to eliminate these additional
10%, but I really don't think it's so obvious that even discussing
it is insane.

Btw, there is another good idea to explore (also suggested by Christopher
Lameter): we can put memcg/objcg pointer into the slab page, avoiding
an extra allocation.
Johannes Weiner Feb. 4, 2020, 2:47 a.m. UTC | #6
On Mon, Feb 03, 2020 at 05:15:05PM -0800, Roman Gushchin wrote:
> On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> > On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > > This is fairly big but mostly red patch, which makes all non-root
> > > > > slab allocations use a single set of kmem_caches instead of
> > > > > creating a separate set for each memory cgroup.
> > > > > 
> > > > > Because the number of non-root kmem_caches is now capped by the number
> > > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > > prematurely. They can be perfectly destroyed together with their
> > > > > root counterparts. This allows to dramatically simplify the
> > > > > management of non-root kmem_caches and delete a ton of code.
> > > > 
> > > > This is definitely going in the right direction. But it doesn't quite
> > > > explain why we still need two sets of kmem_caches?
> > > > 
> > > > In the old scheme, we had completely separate per-cgroup caches with
> > > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > > object, we'd go to the root cache and used the cgroup id to look up
> > > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > > > 
> > > > Now we have slab pages that have a page->objcg array. Why can't all
> > > > allocations go through a single set of kmem caches? If an allocation
> > > > is coming from a cgroup and the slab page the allocator wants to use
> > > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> > > 
> > > Well, arguably it can be done, but there are few drawbacks:
> > > 
> > > 1) On the release path you'll need to make some extra work even for
> > >    root allocations: calculate the offset only to find the NULL objcg pointer.
> > > 
> > > 2) There will be a memory overhead for root allocations
> > >    (which might or might not be compensated by the increase
> > >    of the slab utilization).
> > 
> > Those two are only true if there is a wild mix of root and cgroup
> > allocations inside the same slab, and that doesn't really happen in
> > practice. Either the machine is dedicated to one workload and cgroups
> > are only enabled due to e.g. a vendor kernel, or you have cgrouped
> > systems (like most distro systems now) that cgroup everything.
> 
> It's actually a questionable statement: we do skip allocations from certain
> contexts, and we do merge slab caches.
> 
> Most likely it's true for certain slab_caches and not true for others.
> Think of kmalloc-* caches.

With merging it's actually really hard to say how sparse or dense the
resulting objcgroup arrays would be. It could change all the time too.

> Also, because obj_cgroup vectors will not be freed without underlying pages,
> most likely the percentage of pages with obj_cgroups will grow with uptime.
> In other words, memcg allocations will fragment root slab pages.

I understand the first part of this paragraph, but not the second. The
objcgroup vectors will be freed when the slab pages get freed. But the
partially filled slab pages can be reused by any types of allocations,
surely? How would this cause the pages to fragment?

> > > 3) I'm working on percpu memory accounting that resembles the same scheme,
> > >    except that obj_cgroups vector is created for the whole percpu block.
> > >    There will be root- and memcg-blocks, and it will be expensive to merge them.
> > >    I kinda like using the same scheme here and there.
> > 
> > It's hard to conclude anything based on this information alone. If
> > it's truly expensive to merge them, then it warrants the additional
> > complexity. But I don't understand the desire to share a design for
> > two systems with sufficiently different constraints.
> > 
> > > Upsides?
> > > 
> > > 1) slab utilization might increase a little bit (but I doubt it will have
> > >    a huge effect, because both merging sets should be relatively big and well
> > >    utilized)
> > 
> > Right.
> > 
> > > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> > >    but there isn't so much code left anyway.
> > 
> > There is a lot of complexity associated with the cache cloning that
> > isn't the lines of code, but the lifetime and synchronization rules.
> 
> Quite opposite: the patchset removes all the complexity (or 90% of it),
> because it makes the kmem_cache lifetime independent from any cgroup stuff.
> 
> Kmem_caches are created on demand on the first request (most likely during
> the system start-up), and destroyed together with their root counterparts
> (most likely never or on rmmod). First request means globally first request,
> not a first request from a given memcg.
> 
> Memcg kmem_cache lifecycle has nothing to do with memory/obj_cgroups, and
> after creation just matches the lifetime of the root kmem caches.
> 
> The only reason to keep the async creation is that some kmem_caches
> are created very early in the boot process, long before any cgroup
> stuff is initialized.

Yes, it's independent of the obj_cgroup and memcg, and yes it's
simpler after your patches. But I'm not talking about the delta, I'm
trying to understand the end result.

And the truth is there is a decent chunk of code and tentacles spread
throughout the slab/cgroup code to clone, destroy, and handle the
split caches, as well as the branches/indirections on every cgrouped
slab allocation.

Yet there is no good explanation for why things are done this way
anywhere in the changelog, the cover letter, or the code. And it's
hard to get a satisfying answer even to direct questions about it.

Forget about how anything was before your patches and put yourself
into the shoes of somebody who comes at the new code without any
previous knowledge. "It was even worse before" just isn't a satisfying
answer.

> > And these two things are the primary aspects that make my head hurt
> > trying to review this patch series.
> > 
> > > So IMO it's an interesting direction to explore, but not something
> > > that necessarily has to be done in the context of this patchset.
> > 
> > I disagree. Instead of replacing the old coherent model and its
> > complexities with a new coherent one, you are mixing the two. And I
> > can barely understand the end result.
> > 
> > Dynamically cloning entire slab caches for the sole purpose of telling
> > whether the pages have an obj_cgroup array or not is *completely
> > insane*. If the controller had followed the obj_cgroup design from the
> > start, nobody would have ever thought about doing it like this.
> 
> It's just not true. The whole point of having root- and memcg sets is
> to be able to not look for a NULL pointer in the obj_cgroup vector on
> releasing of the root object. In other words, it allows to keep zero
> overhead for root allocations. IMHO it's an important thing, and calling
> it *completely insane* isn't the best way to communicate.

But you're trading it for the indirection of going through a separate
kmem_cache for every single cgroup-accounted allocation. Why is this a
preferable trade-off to make?

I'm asking basic questions about your design choices. It's not okay to
dismiss this with "it's an interesting direction to explore outside
the context this patchset".

> > From a maintainability POV, we cannot afford merging it in this form.
> 
> It sounds strange: the patchset eliminates 90% of the complexity,
> but it's unmergeable because there are 10% left.

No, it's unmergeable if you're unwilling to explain and document your
design choices when somebody who is taking the time and effort to look
at your patches doesn't understand why things are the way they are.

We are talking about 1500 lines of complicated core kernel code. They
*have* to make sense to people other than you if we want to have this
upstream.

> I agree that it's an arguable question if we can tolerate some
> additional overhead on root allocations to eliminate these additional
> 10%, but I really don't think it's so obvious that even discussing
> it is insane.

Well that's exactly my point.

> Btw, there is another good idea to explore (also suggested by Christopher
> Lameter): we can put memcg/objcg pointer into the slab page, avoiding
> an extra allocation.

I agree with this idea, but I do think that's a bit more obviously in
optimization territory. The objcg is much larger than a pointer to it,
and it wouldn't significantly change the alloc/free sequence, right?
Roman Gushchin Feb. 4, 2020, 4:35 a.m. UTC | #7
On Mon, Feb 03, 2020 at 09:47:04PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 05:15:05PM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> > > On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > > > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > > > This is fairly big but mostly red patch, which makes all non-root
> > > > > > slab allocations use a single set of kmem_caches instead of
> > > > > > creating a separate set for each memory cgroup.
> > > > > > 
> > > > > > Because the number of non-root kmem_caches is now capped by the number
> > > > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > > > prematurely. They can be perfectly destroyed together with their
> > > > > > root counterparts. This allows to dramatically simplify the
> > > > > > management of non-root kmem_caches and delete a ton of code.
> > > > > 
> > > > > This is definitely going in the right direction. But it doesn't quite
> > > > > explain why we still need two sets of kmem_caches?
> > > > > 
> > > > > In the old scheme, we had completely separate per-cgroup caches with
> > > > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > > > object, we'd go to the root cache and used the cgroup id to look up
> > > > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > > > > 
> > > > > Now we have slab pages that have a page->objcg array. Why can't all
> > > > > allocations go through a single set of kmem caches? If an allocation
> > > > > is coming from a cgroup and the slab page the allocator wants to use
> > > > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> > > > 
> > > > Well, arguably it can be done, but there are few drawbacks:
> > > > 
> > > > 1) On the release path you'll need to make some extra work even for
> > > >    root allocations: calculate the offset only to find the NULL objcg pointer.
> > > > 
> > > > 2) There will be a memory overhead for root allocations
> > > >    (which might or might not be compensated by the increase
> > > >    of the slab utilization).
> > > 
> > > Those two are only true if there is a wild mix of root and cgroup
> > > allocations inside the same slab, and that doesn't really happen in
> > > practice. Either the machine is dedicated to one workload and cgroups
> > > are only enabled due to e.g. a vendor kernel, or you have cgrouped
> > > systems (like most distro systems now) that cgroup everything.
> > 
> > It's actually a questionable statement: we do skip allocations from certain
> > contexts, and we do merge slab caches.
> > 
> > Most likely it's true for certain slab_caches and not true for others.
> > Think of kmalloc-* caches.
> 
> With merging it's actually really hard to say how sparse or dense the
> resulting objcgroup arrays would be. It could change all the time too.

So here is some actual data from my dev machine. The first column is the number
of pages in the root cache, the second - in the corresponding memcg.

   ext4_groupinfo_4k          1          0
     rpc_inode_cache          1          0
        fuse_request         62          0
          fuse_inode          1       2732
  btrfs_delayed_node       1192          0
btrfs_ordered_extent        129          0
    btrfs_extent_map       8686          0
 btrfs_extent_buffer       2648          0
         btrfs_inode         12       6739
              PINGv6          1         11
               RAWv6          2          5
               UDPv6          1         34
       tw_sock_TCPv6        378          3
  request_sock_TCPv6         24          0
               TCPv6         46         74
  mqueue_inode_cache          1          0
 jbd2_journal_handle          2          0
   jbd2_journal_head          2          0
 jbd2_revoke_table_s          1          0
    ext4_inode_cache          1          3
ext4_allocation_context          1          0
         ext4_io_end          1          0
  ext4_extent_status          5          0
             mbcache          1          0
      dnotify_struct          1          0
  posix_timers_cache         24          0
      xfrm_dst_cache        202          0
                 RAW          3         12
                 UDP          2         24
         tw_sock_TCP         25          0
    request_sock_TCP         24          0
                 TCP          7         24
hugetlbfs_inode_cache          2          0
               dquot          2          0
       eventpoll_pwq          1        119
           dax_cache          1          0
       request_queue          9          0
          blkdev_ioc        241          0
          biovec-max        112          0
          biovec-128          2          0
           biovec-64          6          0
  khugepaged_mm_slot        248          0
 dmaengine-unmap-256          1          0
 dmaengine-unmap-128          1          0
  dmaengine-unmap-16         39          0
    sock_inode_cache          9        219
    skbuff_ext_cache        249          0
 skbuff_fclone_cache         83          0
   skbuff_head_cache        138        141
     file_lock_cache         24          0
       net_namespace          1          5
   shmem_inode_cache         14         56
     task_delay_info         23        165
           taskstats         24          0
      proc_dir_entry         24          0
          pde_opener         16         24
    proc_inode_cache         24       1103
          bdev_cache          4         20
   kernfs_node_cache       1405          0
           mnt_cache         54          0
                filp         53        460
         inode_cache        488       2287
              dentry        367      10576
         names_cache         24          0
        ebitmap_node          2          0
     avc_xperms_data        256          0
      lsm_file_cache         92          0
         buffer_head         24          9
       uts_namespace          1          3
      vm_area_struct         48        810
           mm_struct         19         29
         files_cache         14         26
        signal_cache         28        143
       sighand_cache         45         47
         task_struct         77        430
            cred_jar         29        424
      anon_vma_chain         39        492
            anon_vma         28        467
                 pid         30        369
        Acpi-Operand         56          0
          Acpi-Parse       5587          0
          Acpi-State       4137          0
      Acpi-Namespace          8          0
         numa_policy        137          0
  ftrace_event_field         68          0
      pool_workqueue         25          0
     radix_tree_node       1694       7776
          task_group         21          0
           vmap_area        477          0
     kmalloc-rcl-512        473          0
     kmalloc-rcl-256        605          0
     kmalloc-rcl-192         43         16
     kmalloc-rcl-128          1         47
      kmalloc-rcl-96          3        229
      kmalloc-rcl-64          6        611
          kmalloc-8k         48         24
          kmalloc-4k        372         59
          kmalloc-2k        132         50
          kmalloc-1k        251         82
         kmalloc-512        360        150
         kmalloc-256        237          0
         kmalloc-192        298         24
         kmalloc-128        203         24
          kmalloc-96        112         24
          kmalloc-64        796         24
          kmalloc-32       1188         26
          kmalloc-16        555         25
           kmalloc-8         42         24
     kmem_cache_node         20          0
          kmem_cache         24          0

> 
> > Also, because obj_cgroup vectors will not be freed without underlying pages,
> > most likely the percentage of pages with obj_cgroups will grow with uptime.
> > In other words, memcg allocations will fragment root slab pages.
> 
> I understand the first part of this paragraph, but not the second. The
> objcgroup vectors will be freed when the slab pages get freed. But the
> partially filled slab pages can be reused by any types of allocations,
> surely? How would this cause the pages to fragment?

I mean the following: once you allocate a single accounted object
from the page, obj_cgroup vector is allocated and will be released only
with the slab page. We really really don't want to count how many accounted
objects are on the page and release obj_cgroup vector on reaching 0.
So even if all following allocations are root allocations, the overhead
will not go away with the uptime.

In other words, even a small percentage of accounted objects will
turn the whole cache into "accountable".

> 
> > > > 3) I'm working on percpu memory accounting that resembles the same scheme,
> > > >    except that obj_cgroups vector is created for the whole percpu block.
> > > >    There will be root- and memcg-blocks, and it will be expensive to merge them.
> > > >    I kinda like using the same scheme here and there.
> > > 
> > > It's hard to conclude anything based on this information alone. If
> > > it's truly expensive to merge them, then it warrants the additional
> > > complexity. But I don't understand the desire to share a design for
> > > two systems with sufficiently different constraints.
> > > 
> > > > Upsides?
> > > > 
> > > > 1) slab utilization might increase a little bit (but I doubt it will have
> > > >    a huge effect, because both merging sets should be relatively big and well
> > > >    utilized)
> > > 
> > > Right.
> > > 
> > > > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> > > >    but there isn't so much code left anyway.
> > > 
> > > There is a lot of complexity associated with the cache cloning that
> > > isn't the lines of code, but the lifetime and synchronization rules.
> > 
> > Quite opposite: the patchset removes all the complexity (or 90% of it),
> > because it makes the kmem_cache lifetime independent from any cgroup stuff.
> > 
> > Kmem_caches are created on demand on the first request (most likely during
> > the system start-up), and destroyed together with their root counterparts
> > (most likely never or on rmmod). First request means globally first request,
> > not a first request from a given memcg.
> > 
> > Memcg kmem_cache lifecycle has nothing to do with memory/obj_cgroups, and
> > after creation just matches the lifetime of the root kmem caches.
> > 
> > The only reason to keep the async creation is that some kmem_caches
> > are created very early in the boot process, long before any cgroup
> > stuff is initialized.
> 
> Yes, it's independent of the obj_cgroup and memcg, and yes it's
> simpler after your patches. But I'm not talking about the delta, I'm
> trying to understand the end result.
> 
> And the truth is there is a decent chunk of code and tentacles spread
> throughout the slab/cgroup code to clone, destroy, and handle the
> split caches, as well as the branches/indirections on every cgrouped
> slab allocation.

Did you see the final code? It's fairly simple and there is really not
much of complexity left. If you don't think so, let's go into details,
because otherwise it's hard to say anything.

With a such change which basically removes the current implementation
and replaces it with a new one, it's hard to keep the balance between
making commits self-contained and small, but also showing the whole picture.
I'm fully open to questions and generally want to make it simpler.

I've tried to separate some parts and get them merged before the main
thing, but they haven't been merged yet, so I have to include them
to keep the thing building.

Will a more-detailed design in the cover help?
Will writing a design doc to put into Documentation/ help?
Is it better to rearrange patches in a way to eliminate the current
implementation first and build from scratch?

> 
> Yet there is no good explanation for why things are done this way
> anywhere in the changelog, the cover letter, or the code. And it's
> hard to get a satisfying answer even to direct questions about it.

I do not agree. I try to answer all questions. But I also expect
that my arguments will be listened.
(I didn't answer questions re lifetime of obj_cgroup, but only
because I need some more time to think. If it wasn't clear, I'm sorry.).

> 
> Forget about how anything was before your patches and put yourself
> into the shoes of somebody who comes at the new code without any
> previous knowledge. "It was even worse before" just isn't a satisfying
> answer.

Absolutely agree.

But at the same time "now it's better than before" sounds like a good
validation for a change. The code is never perfect.

But, please, let's don't go into long discussions here and save some time.

> 
> > > And these two things are the primary aspects that make my head hurt
> > > trying to review this patch series.
> > > 
> > > > So IMO it's an interesting direction to explore, but not something
> > > > that necessarily has to be done in the context of this patchset.
> > > 
> > > I disagree. Instead of replacing the old coherent model and its
> > > complexities with a new coherent one, you are mixing the two. And I
> > > can barely understand the end result.
> > > 
> > > Dynamically cloning entire slab caches for the sole purpose of telling
> > > whether the pages have an obj_cgroup array or not is *completely
> > > insane*. If the controller had followed the obj_cgroup design from the
> > > start, nobody would have ever thought about doing it like this.
> > 
> > It's just not true. The whole point of having root- and memcg sets is
> > to be able to not look for a NULL pointer in the obj_cgroup vector on
> > releasing of the root object. In other words, it allows to keep zero
> > overhead for root allocations. IMHO it's an important thing, and calling
> > it *completely insane* isn't the best way to communicate.
> 
> But you're trading it for the indirection of going through a separate
> kmem_cache for every single cgroup-accounted allocation. Why is this a
> preferable trade-off to make?

Because it allows to keep zero memory and cpu overhead for root allocations.
I've no data showing that this overhead is small and acceptable in all cases.
I think keeping zero overhead for root allocations is more important
than having a single set of kmem caches.

> 
> I'm asking basic questions about your design choices. It's not okay to
> dismiss this with "it's an interesting direction to explore outside
> the context this patchset".

I'm not dismissing any questions.
There is a difference between a question and a must-to-follow suggestion,
which has known and ignored trade-offs.

> 
> > > From a maintainability POV, we cannot afford merging it in this form.
> > 
> > It sounds strange: the patchset eliminates 90% of the complexity,
> > but it's unmergeable because there are 10% left.
> 
> No, it's unmergeable if you're unwilling to explain and document your
> design choices when somebody who is taking the time and effort to look
> at your patches doesn't understand why things are the way they are.

I'm not unwilling to explain. Otherwise I just wouldn't post it upstream,
right? And I assume you're spending your time reviewing it not with the goal
to keep the current code intact.

Please, let's keep separate things which are hard to understand and
require an explanation and things which you think are better done differently.

Both are valid and appreciated comments, but mixing them isn't productive.

> 
> We are talking about 1500 lines of complicated core kernel code. They
> *have* to make sense to people other than you if we want to have this
> upstream.

Right.

> 
> > I agree that it's an arguable question if we can tolerate some
> > additional overhead on root allocations to eliminate these additional
> > 10%, but I really don't think it's so obvious that even discussing
> > it is insane.
> 
> Well that's exactly my point.

Ok, what's the acceptable performance penalty?
Is adding 20% on free path is acceptable, for example?
Or adding 3% of slab memory?

> 
> > Btw, there is another good idea to explore (also suggested by Christopher
> > Lameter): we can put memcg/objcg pointer into the slab page, avoiding
> > an extra allocation.
> 
> I agree with this idea, but I do think that's a bit more obviously in
> optimization territory. The objcg is much larger than a pointer to it,
> and it wouldn't significantly change the alloc/free sequence, right?

So the idea is that putting the obj_cgroup pointer nearby will eliminate
some cache misses. But then it's preferable to have two sets, because otherwise
there is a memory overhead from allocating an extra space for the objcg pointer.


Stepping a bit back: the new scheme (new slab controller) adds some cpu operations
on the allocation and release paths. It's unavoidable: more precise
accounting requires more CPU. But IMO it's worth it because it leads
to significant memory savings and reduced memory fragmentation.
Also it reduces the code complexity (which is a bonus but not the primary goal).

I haven't seen so far any workloads where the difference was noticeable,
but it doesn't mean they do not exist. That's why I'm very concerned about
any suggestions which might even in theory increase the cpu overhead.
Keeping it at zero level for root allocations allows do exclude
something from the accounting if the performance penalty is not tolerable.

Thanks!
Johannes Weiner Feb. 4, 2020, 6:41 p.m. UTC | #8
On Mon, Feb 03, 2020 at 08:35:41PM -0800, Roman Gushchin wrote:
> On Mon, Feb 03, 2020 at 09:47:04PM -0500, Johannes Weiner wrote:
> > On Mon, Feb 03, 2020 at 05:15:05PM -0800, Roman Gushchin wrote:
> > > On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> > > > On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > > > > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > > > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > > > > This is fairly big but mostly red patch, which makes all non-root
> > > > > > > slab allocations use a single set of kmem_caches instead of
> > > > > > > creating a separate set for each memory cgroup.
> > > > > > > 
> > > > > > > Because the number of non-root kmem_caches is now capped by the number
> > > > > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > > > > prematurely. They can be perfectly destroyed together with their
> > > > > > > root counterparts. This allows to dramatically simplify the
> > > > > > > management of non-root kmem_caches and delete a ton of code.
> > > > > > 
> > > > > > This is definitely going in the right direction. But it doesn't quite
> > > > > > explain why we still need two sets of kmem_caches?
> > > > > > 
> > > > > > In the old scheme, we had completely separate per-cgroup caches with
> > > > > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > > > > object, we'd go to the root cache and used the cgroup id to look up
> > > > > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > > > > > 
> > > > > > Now we have slab pages that have a page->objcg array. Why can't all
> > > > > > allocations go through a single set of kmem caches? If an allocation
> > > > > > is coming from a cgroup and the slab page the allocator wants to use
> > > > > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> > > > > 
> > > > > Well, arguably it can be done, but there are few drawbacks:
> > > > > 
> > > > > 1) On the release path you'll need to make some extra work even for
> > > > >    root allocations: calculate the offset only to find the NULL objcg pointer.
> > > > > 
> > > > > 2) There will be a memory overhead for root allocations
> > > > >    (which might or might not be compensated by the increase
> > > > >    of the slab utilization).
> > > > 
> > > > Those two are only true if there is a wild mix of root and cgroup
> > > > allocations inside the same slab, and that doesn't really happen in
> > > > practice. Either the machine is dedicated to one workload and cgroups
> > > > are only enabled due to e.g. a vendor kernel, or you have cgrouped
> > > > systems (like most distro systems now) that cgroup everything.
> > > 
> > > It's actually a questionable statement: we do skip allocations from certain
> > > contexts, and we do merge slab caches.
> > > 
> > > Most likely it's true for certain slab_caches and not true for others.
> > > Think of kmalloc-* caches.
> > 
> > With merging it's actually really hard to say how sparse or dense the
> > resulting objcgroup arrays would be. It could change all the time too.
> 
> So here is some actual data from my dev machine. The first column is the number
> of pages in the root cache, the second - in the corresponding memcg.
> 
>    ext4_groupinfo_4k          1          0
>      rpc_inode_cache          1          0
>         fuse_request         62          0
>           fuse_inode          1       2732
>   btrfs_delayed_node       1192          0
> btrfs_ordered_extent        129          0
>     btrfs_extent_map       8686          0
>  btrfs_extent_buffer       2648          0
>          btrfs_inode         12       6739
>               PINGv6          1         11
>                RAWv6          2          5
>                UDPv6          1         34
>        tw_sock_TCPv6        378          3
>   request_sock_TCPv6         24          0
>                TCPv6         46         74
>   mqueue_inode_cache          1          0
>  jbd2_journal_handle          2          0
>    jbd2_journal_head          2          0
>  jbd2_revoke_table_s          1          0
>     ext4_inode_cache          1          3
> ext4_allocation_context          1          0
>          ext4_io_end          1          0
>   ext4_extent_status          5          0
>              mbcache          1          0
>       dnotify_struct          1          0
>   posix_timers_cache         24          0
>       xfrm_dst_cache        202          0
>                  RAW          3         12
>                  UDP          2         24
>          tw_sock_TCP         25          0
>     request_sock_TCP         24          0
>                  TCP          7         24
> hugetlbfs_inode_cache          2          0
>                dquot          2          0
>        eventpoll_pwq          1        119
>            dax_cache          1          0
>        request_queue          9          0
>           blkdev_ioc        241          0
>           biovec-max        112          0
>           biovec-128          2          0
>            biovec-64          6          0
>   khugepaged_mm_slot        248          0
>  dmaengine-unmap-256          1          0
>  dmaengine-unmap-128          1          0
>   dmaengine-unmap-16         39          0
>     sock_inode_cache          9        219
>     skbuff_ext_cache        249          0
>  skbuff_fclone_cache         83          0
>    skbuff_head_cache        138        141
>      file_lock_cache         24          0
>        net_namespace          1          5
>    shmem_inode_cache         14         56
>      task_delay_info         23        165
>            taskstats         24          0
>       proc_dir_entry         24          0
>           pde_opener         16         24
>     proc_inode_cache         24       1103
>           bdev_cache          4         20
>    kernfs_node_cache       1405          0
>            mnt_cache         54          0
>                 filp         53        460
>          inode_cache        488       2287
>               dentry        367      10576
>          names_cache         24          0
>         ebitmap_node          2          0
>      avc_xperms_data        256          0
>       lsm_file_cache         92          0
>          buffer_head         24          9
>        uts_namespace          1          3
>       vm_area_struct         48        810
>            mm_struct         19         29
>          files_cache         14         26
>         signal_cache         28        143
>        sighand_cache         45         47
>          task_struct         77        430
>             cred_jar         29        424
>       anon_vma_chain         39        492
>             anon_vma         28        467
>                  pid         30        369
>         Acpi-Operand         56          0
>           Acpi-Parse       5587          0
>           Acpi-State       4137          0
>       Acpi-Namespace          8          0
>          numa_policy        137          0
>   ftrace_event_field         68          0
>       pool_workqueue         25          0
>      radix_tree_node       1694       7776
>           task_group         21          0
>            vmap_area        477          0
>      kmalloc-rcl-512        473          0
>      kmalloc-rcl-256        605          0
>      kmalloc-rcl-192         43         16
>      kmalloc-rcl-128          1         47
>       kmalloc-rcl-96          3        229
>       kmalloc-rcl-64          6        611
>           kmalloc-8k         48         24
>           kmalloc-4k        372         59
>           kmalloc-2k        132         50
>           kmalloc-1k        251         82
>          kmalloc-512        360        150
>          kmalloc-256        237          0
>          kmalloc-192        298         24
>          kmalloc-128        203         24
>           kmalloc-96        112         24
>           kmalloc-64        796         24
>           kmalloc-32       1188         26
>           kmalloc-16        555         25
>            kmalloc-8         42         24
>      kmem_cache_node         20          0
>           kmem_cache         24          0

That's interesting, thanks. It does look fairly bimodal, except in
some smaller caches. Which does make sense when you think about it: we
focus on accounting consumers that are driven by userspace activity
and big enough to actually matter in terms of cgroup footprint.

> > > Also, because obj_cgroup vectors will not be freed without underlying pages,
> > > most likely the percentage of pages with obj_cgroups will grow with uptime.
> > > In other words, memcg allocations will fragment root slab pages.
> > 
> > I understand the first part of this paragraph, but not the second. The
> > objcgroup vectors will be freed when the slab pages get freed. But the
> > partially filled slab pages can be reused by any types of allocations,
> > surely? How would this cause the pages to fragment?
> 
> I mean the following: once you allocate a single accounted object
> from the page, obj_cgroup vector is allocated and will be released only
> with the slab page. We really really don't want to count how many accounted
> objects are on the page and release obj_cgroup vector on reaching 0.
> So even if all following allocations are root allocations, the overhead
> will not go away with the uptime.
> 
> In other words, even a small percentage of accounted objects will
> turn the whole cache into "accountable".

Correct. The worst case is where we have a large cache that has N
objects per slab, but only ~1/N objects are accounted to a cgroup.

The question is whether this is common or even realistic. When would a
cache be big, but only a small subset of its allocations would be
attributable to specific cgroups?

On the less extreme overlapping cases, yeah there are fragmented
obj_cgroup arrays, but there is also better slab packing. One is an
array of pointers, the other is an array of larger objects. It would
seem slab fragmentation has the potential to waste much more memory?

> > > > > 3) I'm working on percpu memory accounting that resembles the same scheme,
> > > > >    except that obj_cgroups vector is created for the whole percpu block.
> > > > >    There will be root- and memcg-blocks, and it will be expensive to merge them.
> > > > >    I kinda like using the same scheme here and there.
> > > > 
> > > > It's hard to conclude anything based on this information alone. If
> > > > it's truly expensive to merge them, then it warrants the additional
> > > > complexity. But I don't understand the desire to share a design for
> > > > two systems with sufficiently different constraints.
> > > > 
> > > > > Upsides?
> > > > > 
> > > > > 1) slab utilization might increase a little bit (but I doubt it will have
> > > > >    a huge effect, because both merging sets should be relatively big and well
> > > > >    utilized)
> > > > 
> > > > Right.
> > > > 
> > > > > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> > > > >    but there isn't so much code left anyway.
> > > > 
> > > > There is a lot of complexity associated with the cache cloning that
> > > > isn't the lines of code, but the lifetime and synchronization rules.
> > > 
> > > Quite opposite: the patchset removes all the complexity (or 90% of it),
> > > because it makes the kmem_cache lifetime independent from any cgroup stuff.
> > > 
> > > Kmem_caches are created on demand on the first request (most likely during
> > > the system start-up), and destroyed together with their root counterparts
> > > (most likely never or on rmmod). First request means globally first request,
> > > not a first request from a given memcg.
> > > 
> > > Memcg kmem_cache lifecycle has nothing to do with memory/obj_cgroups, and
> > > after creation just matches the lifetime of the root kmem caches.
> > > 
> > > The only reason to keep the async creation is that some kmem_caches
> > > are created very early in the boot process, long before any cgroup
> > > stuff is initialized.
> > 
> > Yes, it's independent of the obj_cgroup and memcg, and yes it's
> > simpler after your patches. But I'm not talking about the delta, I'm
> > trying to understand the end result.
> > 
> > And the truth is there is a decent chunk of code and tentacles spread
> > throughout the slab/cgroup code to clone, destroy, and handle the
> > split caches, as well as the branches/indirections on every cgrouped
> > slab allocation.
> 
> Did you see the final code? It's fairly simple and there is really not
> much of complexity left. If you don't think so, let's go into details,
> because otherwise it's hard to say anything.

I have the patches applied to a local tree and am looking at the final
code. But I can only repeat that "it's not too bad" simply isn't a
good explanation for why the code is the way it is.

> With a such change which basically removes the current implementation
> and replaces it with a new one, it's hard to keep the balance between
> making commits self-contained and small, but also showing the whole picture.
> I'm fully open to questions and generally want to make it simpler.
> 
> I've tried to separate some parts and get them merged before the main
> thing, but they haven't been merged yet, so I have to include them
> to keep the thing building.
> 
> Will a more-detailed design in the cover help?
> Will writing a design doc to put into Documentation/ help?
> Is it better to rearrange patches in a way to eliminate the current
> implementation first and build from scratch?

It would help to have changelogs that actually describe how the new
design is supposed to work, and why you made the decisions you made.

The changelog in this patch here sells the change as a reduction in
complexity, without explaining why it stopped where it stopped. So
naturally, if that's the declared goal, the first question is whether
we can make it simpler.

Both the cover letter and the changelogs should focus less on what was
there and how it was deleted, and more on how the code is supposed to
work after the patches. How the components were designed and how they
all work together.

As I said before, imagine somebody without any historical knowledge
reading the code. They should be able to find out why you chose to
have two sets of kmem caches. There is no explanation for it other
than "there used to be more, so we cut it down to two".

> > > > And these two things are the primary aspects that make my head hurt
> > > > trying to review this patch series.
> > > > 
> > > > > So IMO it's an interesting direction to explore, but not something
> > > > > that necessarily has to be done in the context of this patchset.
> > > > 
> > > > I disagree. Instead of replacing the old coherent model and its
> > > > complexities with a new coherent one, you are mixing the two. And I
> > > > can barely understand the end result.
> > > > 
> > > > Dynamically cloning entire slab caches for the sole purpose of telling
> > > > whether the pages have an obj_cgroup array or not is *completely
> > > > insane*. If the controller had followed the obj_cgroup design from the
> > > > start, nobody would have ever thought about doing it like this.
> > > 
> > > It's just not true. The whole point of having root- and memcg sets is
> > > to be able to not look for a NULL pointer in the obj_cgroup vector on
> > > releasing of the root object. In other words, it allows to keep zero
> > > overhead for root allocations. IMHO it's an important thing, and calling
> > > it *completely insane* isn't the best way to communicate.
> > 
> > But you're trading it for the indirection of going through a separate
> > kmem_cache for every single cgroup-accounted allocation. Why is this a
> > preferable trade-off to make?
> 
> Because it allows to keep zero memory and cpu overhead for root allocations.
> I've no data showing that this overhead is small and acceptable in all cases.
> I think keeping zero overhead for root allocations is more important
> than having a single set of kmem caches.

In the kmem cache breakdown you provided above, there are 35887 pages
allocated to root caches and 37300 pages allocated to cgroup caches.

Why are root allocations supposed to be more important? Aren't some of
the hottest allocations tracked by cgroups? Look at fork():

>       vm_area_struct         48        810
>            mm_struct         19         29
>          files_cache         14         26
>         signal_cache         28        143
>        sighand_cache         45         47
>          task_struct         77        430
>             cred_jar         29        424
>       anon_vma_chain         39        492
>             anon_vma         28        467
>                  pid         30        369

Hard to think of much hotter allocations. They all have to suffer the
additional branch and cache footprint of the auxiliary cgroup caches.

> > > I agree that it's an arguable question if we can tolerate some
> > > additional overhead on root allocations to eliminate these additional
> > > 10%, but I really don't think it's so obvious that even discussing
> > > it is insane.
> > 
> > Well that's exactly my point.
> 
> Ok, what's the acceptable performance penalty?
> Is adding 20% on free path is acceptable, for example?
> Or adding 3% of slab memory?

I find offhand replies like these very jarring.

There is a legitimate design question: Why are you using a separate
set of kmem caches for the cgroup allocations, citing the additional
complexity over having one set? And your reply was mostly handwaving.

So: what's the overhead you're saving by having two sets? What is this
additional stuff buying us?

Pretend the split-cache infra hadn't been there. Would you have added
it? If so, based on what data? Now obviously, you didn't write it - it
was there because that's the way the per-cgroup accounting was done
previously. But you did choose to keep it. And it's a fair question
what (quantifiable) role it plays in your new way of doing things.

> > > Btw, there is another good idea to explore (also suggested by Christopher
> > > Lameter): we can put memcg/objcg pointer into the slab page, avoiding
> > > an extra allocation.
> > 
> > I agree with this idea, but I do think that's a bit more obviously in
> > optimization territory. The objcg is much larger than a pointer to it,
> > and it wouldn't significantly change the alloc/free sequence, right?
> 
> So the idea is that putting the obj_cgroup pointer nearby will eliminate
> some cache misses. But then it's preferable to have two sets, because otherwise
> there is a memory overhead from allocating an extra space for the objcg pointer.

This trade-off is based on two assumptions:

1) Unaccounted allocations are more performance sensitive than
accounted allocations.

2) Fragmented obj_cgroup arrays waste more memory than fragmented
slabs.

You haven't sufficiently shown that either of those are true. (And I
suspect they are both false.)

So my stance is that until you make a more convincing argument for
this, a simpler concept and implementation, as well as balanced CPU
cost for unaccounted and accounted allocations, wins out.

> Stepping a bit back: the new scheme (new slab controller) adds some cpu operations
> on the allocation and release paths. It's unavoidable: more precise
> accounting requires more CPU. But IMO it's worth it because it leads
> to significant memory savings and reduced memory fragmentation.
> Also it reduces the code complexity (which is a bonus but not the primary goal).
> 
> I haven't seen so far any workloads where the difference was noticeable,
> but it doesn't mean they do not exist. That's why I'm very concerned about
> any suggestions which might even in theory increase the cpu overhead.
> Keeping it at zero level for root allocations allows do exclude
> something from the accounting if the performance penalty is not tolerable.

That sounds like a false trade-off to me. We account memory for
functional correctness - consumers that are big enough to
fundamentally alter the picture of cgroup memory footprints, allow
users to disturb other containers, or even cause host-level OOM
situations. Not based on whether they are cheap to track.

In fact, I would make the counter argument. It'd be pretty bad if
everytime we had to make an accounting change to maintain functional
correctness we'd have to worry about a CPU regression that exists in
part because we're trying to keep unaccounted allocations cheaper.
Roman Gushchin Feb. 5, 2020, 3:58 p.m. UTC | #9
On Tue, Feb 04, 2020 at 01:41:59PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 08:35:41PM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 09:47:04PM -0500, Johannes Weiner wrote:
> > > On Mon, Feb 03, 2020 at 05:15:05PM -0800, Roman Gushchin wrote:
> > > > On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> > > > > On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > > > > > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > > > > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > > > > > This is fairly big but mostly red patch, which makes all non-root
> > > > > > > > slab allocations use a single set of kmem_caches instead of
> > > > > > > > creating a separate set for each memory cgroup.
> > > > > > > > 
> > > > > > > > Because the number of non-root kmem_caches is now capped by the number
> > > > > > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > > > > > prematurely. They can be perfectly destroyed together with their
> > > > > > > > root counterparts. This allows to dramatically simplify the
> > > > > > > > management of non-root kmem_caches and delete a ton of code.
> > > > > > > 
> > > > > > > This is definitely going in the right direction. But it doesn't quite
> > > > > > > explain why we still need two sets of kmem_caches?
> > > > > > > 
> > > > > > > In the old scheme, we had completely separate per-cgroup caches with
> > > > > > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > > > > > object, we'd go to the root cache and used the cgroup id to look up
> > > > > > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > > > > > > 
> > > > > > > Now we have slab pages that have a page->objcg array. Why can't all
> > > > > > > allocations go through a single set of kmem caches? If an allocation
> > > > > > > is coming from a cgroup and the slab page the allocator wants to use
> > > > > > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> > > > > > 
> > > > > > Well, arguably it can be done, but there are few drawbacks:
> > > > > > 
> > > > > > 1) On the release path you'll need to make some extra work even for
> > > > > >    root allocations: calculate the offset only to find the NULL objcg pointer.
> > > > > > 
> > > > > > 2) There will be a memory overhead for root allocations
> > > > > >    (which might or might not be compensated by the increase
> > > > > >    of the slab utilization).
> > > > > 
> > > > > Those two are only true if there is a wild mix of root and cgroup
> > > > > allocations inside the same slab, and that doesn't really happen in
> > > > > practice. Either the machine is dedicated to one workload and cgroups
> > > > > are only enabled due to e.g. a vendor kernel, or you have cgrouped
> > > > > systems (like most distro systems now) that cgroup everything.
> > > > 
> > > > It's actually a questionable statement: we do skip allocations from certain
> > > > contexts, and we do merge slab caches.
> > > > 
> > > > Most likely it's true for certain slab_caches and not true for others.
> > > > Think of kmalloc-* caches.
> > > 
> > > With merging it's actually really hard to say how sparse or dense the
> > > resulting objcgroup arrays would be. It could change all the time too.
> > 
> > So here is some actual data from my dev machine. The first column is the number
> > of pages in the root cache, the second - in the corresponding memcg.
> > 
> >    ext4_groupinfo_4k          1          0
> >      rpc_inode_cache          1          0
> >         fuse_request         62          0
> >           fuse_inode          1       2732
> >   btrfs_delayed_node       1192          0
> > btrfs_ordered_extent        129          0
> >     btrfs_extent_map       8686          0
> >  btrfs_extent_buffer       2648          0
> >          btrfs_inode         12       6739
> >               PINGv6          1         11
> >                RAWv6          2          5
> >                UDPv6          1         34
> >        tw_sock_TCPv6        378          3
> >   request_sock_TCPv6         24          0
> >                TCPv6         46         74
> >   mqueue_inode_cache          1          0
> >  jbd2_journal_handle          2          0
> >    jbd2_journal_head          2          0
> >  jbd2_revoke_table_s          1          0
> >     ext4_inode_cache          1          3
> > ext4_allocation_context          1          0
> >          ext4_io_end          1          0
> >   ext4_extent_status          5          0
> >              mbcache          1          0
> >       dnotify_struct          1          0
> >   posix_timers_cache         24          0
> >       xfrm_dst_cache        202          0
> >                  RAW          3         12
> >                  UDP          2         24
> >          tw_sock_TCP         25          0
> >     request_sock_TCP         24          0
> >                  TCP          7         24
> > hugetlbfs_inode_cache          2          0
> >                dquot          2          0
> >        eventpoll_pwq          1        119
> >            dax_cache          1          0
> >        request_queue          9          0
> >           blkdev_ioc        241          0
> >           biovec-max        112          0
> >           biovec-128          2          0
> >            biovec-64          6          0
> >   khugepaged_mm_slot        248          0
> >  dmaengine-unmap-256          1          0
> >  dmaengine-unmap-128          1          0
> >   dmaengine-unmap-16         39          0
> >     sock_inode_cache          9        219
> >     skbuff_ext_cache        249          0
> >  skbuff_fclone_cache         83          0
> >    skbuff_head_cache        138        141
> >      file_lock_cache         24          0
> >        net_namespace          1          5
> >    shmem_inode_cache         14         56
> >      task_delay_info         23        165
> >            taskstats         24          0
> >       proc_dir_entry         24          0
> >           pde_opener         16         24
> >     proc_inode_cache         24       1103
> >           bdev_cache          4         20
> >    kernfs_node_cache       1405          0
> >            mnt_cache         54          0
> >                 filp         53        460
> >          inode_cache        488       2287
> >               dentry        367      10576
> >          names_cache         24          0
> >         ebitmap_node          2          0
> >      avc_xperms_data        256          0
> >       lsm_file_cache         92          0
> >          buffer_head         24          9
> >        uts_namespace          1          3
> >       vm_area_struct         48        810
> >            mm_struct         19         29
> >          files_cache         14         26
> >         signal_cache         28        143
> >        sighand_cache         45         47
> >          task_struct         77        430
> >             cred_jar         29        424
> >       anon_vma_chain         39        492
> >             anon_vma         28        467
> >                  pid         30        369
> >         Acpi-Operand         56          0
> >           Acpi-Parse       5587          0
> >           Acpi-State       4137          0
> >       Acpi-Namespace          8          0
> >          numa_policy        137          0
> >   ftrace_event_field         68          0
> >       pool_workqueue         25          0
> >      radix_tree_node       1694       7776
> >           task_group         21          0
> >            vmap_area        477          0
> >      kmalloc-rcl-512        473          0
> >      kmalloc-rcl-256        605          0
> >      kmalloc-rcl-192         43         16
> >      kmalloc-rcl-128          1         47
> >       kmalloc-rcl-96          3        229
> >       kmalloc-rcl-64          6        611
> >           kmalloc-8k         48         24
> >           kmalloc-4k        372         59
> >           kmalloc-2k        132         50
> >           kmalloc-1k        251         82
> >          kmalloc-512        360        150
> >          kmalloc-256        237          0
> >          kmalloc-192        298         24
> >          kmalloc-128        203         24
> >           kmalloc-96        112         24
> >           kmalloc-64        796         24
> >           kmalloc-32       1188         26
> >           kmalloc-16        555         25
> >            kmalloc-8         42         24
> >      kmem_cache_node         20          0
> >           kmem_cache         24          0
> 
> That's interesting, thanks. It does look fairly bimodal, except in
> some smaller caches. Which does make sense when you think about it: we
> focus on accounting consumers that are driven by userspace activity
> and big enough to actually matter in terms of cgroup footprint.
> 
> > > > Also, because obj_cgroup vectors will not be freed without underlying pages,
> > > > most likely the percentage of pages with obj_cgroups will grow with uptime.
> > > > In other words, memcg allocations will fragment root slab pages.
> > > 
> > > I understand the first part of this paragraph, but not the second. The
> > > objcgroup vectors will be freed when the slab pages get freed. But the
> > > partially filled slab pages can be reused by any types of allocations,
> > > surely? How would this cause the pages to fragment?
> > 
> > I mean the following: once you allocate a single accounted object
> > from the page, obj_cgroup vector is allocated and will be released only
> > with the slab page. We really really don't want to count how many accounted
> > objects are on the page and release obj_cgroup vector on reaching 0.
> > So even if all following allocations are root allocations, the overhead
> > will not go away with the uptime.
> > 
> > In other words, even a small percentage of accounted objects will
> > turn the whole cache into "accountable".
> 
> Correct. The worst case is where we have a large cache that has N
> objects per slab, but only ~1/N objects are accounted to a cgroup.
> 
> The question is whether this is common or even realistic. When would a
> cache be big, but only a small subset of its allocations would be
> attributable to specific cgroups?
> 
> On the less extreme overlapping cases, yeah there are fragmented
> obj_cgroup arrays, but there is also better slab packing. One is an
> array of pointers, the other is an array of larger objects. It would
> seem slab fragmentation has the potential to waste much more memory?
> 
> > > > > > 3) I'm working on percpu memory accounting that resembles the same scheme,
> > > > > >    except that obj_cgroups vector is created for the whole percpu block.
> > > > > >    There will be root- and memcg-blocks, and it will be expensive to merge them.
> > > > > >    I kinda like using the same scheme here and there.
> > > > > 
> > > > > It's hard to conclude anything based on this information alone. If
> > > > > it's truly expensive to merge them, then it warrants the additional
> > > > > complexity. But I don't understand the desire to share a design for
> > > > > two systems with sufficiently different constraints.
> > > > > 
> > > > > > Upsides?
> > > > > > 
> > > > > > 1) slab utilization might increase a little bit (but I doubt it will have
> > > > > >    a huge effect, because both merging sets should be relatively big and well
> > > > > >    utilized)
> > > > > 
> > > > > Right.
> > > > > 
> > > > > > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> > > > > >    but there isn't so much code left anyway.
> > > > > 
> > > > > There is a lot of complexity associated with the cache cloning that
> > > > > isn't the lines of code, but the lifetime and synchronization rules.
> > > > 
> > > > Quite opposite: the patchset removes all the complexity (or 90% of it),
> > > > because it makes the kmem_cache lifetime independent from any cgroup stuff.
> > > > 
> > > > Kmem_caches are created on demand on the first request (most likely during
> > > > the system start-up), and destroyed together with their root counterparts
> > > > (most likely never or on rmmod). First request means globally first request,
> > > > not a first request from a given memcg.
> > > > 
> > > > Memcg kmem_cache lifecycle has nothing to do with memory/obj_cgroups, and
> > > > after creation just matches the lifetime of the root kmem caches.
> > > > 
> > > > The only reason to keep the async creation is that some kmem_caches
> > > > are created very early in the boot process, long before any cgroup
> > > > stuff is initialized.
> > > 
> > > Yes, it's independent of the obj_cgroup and memcg, and yes it's
> > > simpler after your patches. But I'm not talking about the delta, I'm
> > > trying to understand the end result.
> > > 
> > > And the truth is there is a decent chunk of code and tentacles spread
> > > throughout the slab/cgroup code to clone, destroy, and handle the
> > > split caches, as well as the branches/indirections on every cgrouped
> > > slab allocation.
> > 
> > Did you see the final code? It's fairly simple and there is really not
> > much of complexity left. If you don't think so, let's go into details,
> > because otherwise it's hard to say anything.
> 
> I have the patches applied to a local tree and am looking at the final
> code. But I can only repeat that "it's not too bad" simply isn't a
> good explanation for why the code is the way it is.
> 
> > With a such change which basically removes the current implementation
> > and replaces it with a new one, it's hard to keep the balance between
> > making commits self-contained and small, but also showing the whole picture.
> > I'm fully open to questions and generally want to make it simpler.
> > 
> > I've tried to separate some parts and get them merged before the main
> > thing, but they haven't been merged yet, so I have to include them
> > to keep the thing building.
> > 
> > Will a more-detailed design in the cover help?
> > Will writing a design doc to put into Documentation/ help?
> > Is it better to rearrange patches in a way to eliminate the current
> > implementation first and build from scratch?
> 
> It would help to have changelogs that actually describe how the new
> design is supposed to work, and why you made the decisions you made.
> 
> The changelog in this patch here sells the change as a reduction in
> complexity, without explaining why it stopped where it stopped. So
> naturally, if that's the declared goal, the first question is whether
> we can make it simpler.
> 
> Both the cover letter and the changelogs should focus less on what was
> there and how it was deleted, and more on how the code is supposed to
> work after the patches. How the components were designed and how they
> all work together.
> 
> As I said before, imagine somebody without any historical knowledge
> reading the code. They should be able to find out why you chose to
> have two sets of kmem caches. There is no explanation for it other
> than "there used to be more, so we cut it down to two".
> 
> > > > > And these two things are the primary aspects that make my head hurt
> > > > > trying to review this patch series.
> > > > > 
> > > > > > So IMO it's an interesting direction to explore, but not something
> > > > > > that necessarily has to be done in the context of this patchset.
> > > > > 
> > > > > I disagree. Instead of replacing the old coherent model and its
> > > > > complexities with a new coherent one, you are mixing the two. And I
> > > > > can barely understand the end result.
> > > > > 
> > > > > Dynamically cloning entire slab caches for the sole purpose of telling
> > > > > whether the pages have an obj_cgroup array or not is *completely
> > > > > insane*. If the controller had followed the obj_cgroup design from the
> > > > > start, nobody would have ever thought about doing it like this.
> > > > 
> > > > It's just not true. The whole point of having root- and memcg sets is
> > > > to be able to not look for a NULL pointer in the obj_cgroup vector on
> > > > releasing of the root object. In other words, it allows to keep zero
> > > > overhead for root allocations. IMHO it's an important thing, and calling
> > > > it *completely insane* isn't the best way to communicate.
> > > 
> > > But you're trading it for the indirection of going through a separate
> > > kmem_cache for every single cgroup-accounted allocation. Why is this a
> > > preferable trade-off to make?
> > 
> > Because it allows to keep zero memory and cpu overhead for root allocations.
> > I've no data showing that this overhead is small and acceptable in all cases.
> > I think keeping zero overhead for root allocations is more important
> > than having a single set of kmem caches.
> 
> In the kmem cache breakdown you provided above, there are 35887 pages
> allocated to root caches and 37300 pages allocated to cgroup caches.
> 
> Why are root allocations supposed to be more important? Aren't some of
> the hottest allocations tracked by cgroups? Look at fork():
> 
> >       vm_area_struct         48        810
> >            mm_struct         19         29
> >          files_cache         14         26
> >         signal_cache         28        143
> >        sighand_cache         45         47
> >          task_struct         77        430
> >             cred_jar         29        424
> >       anon_vma_chain         39        492
> >             anon_vma         28        467
> >                  pid         30        369
> 
> Hard to think of much hotter allocations. They all have to suffer the
> additional branch and cache footprint of the auxiliary cgroup caches.
> 
> > > > I agree that it's an arguable question if we can tolerate some
> > > > additional overhead on root allocations to eliminate these additional
> > > > 10%, but I really don't think it's so obvious that even discussing
> > > > it is insane.
> > > 
> > > Well that's exactly my point.
> > 
> > Ok, what's the acceptable performance penalty?
> > Is adding 20% on free path is acceptable, for example?
> > Or adding 3% of slab memory?
> 
> I find offhand replies like these very jarring.
> 
> There is a legitimate design question: Why are you using a separate
> set of kmem caches for the cgroup allocations, citing the additional
> complexity over having one set? And your reply was mostly handwaving.

Johannes,

I posted patches and numbers that shows that the patchset improves
a fundamental kernel characteristic (slab utilization) by a meaningful margin.
It has been confirmed by others, who kindly tested it on their machines.

Surprisingly, during this and previous review sessions, I didn't hear
a single good word from you, but a constant stream of blame: I do not answer
questions, I do not write perfect code, I fail to provide satisfying
answers, I'm waving hands, saying insane things etc etc.
Any minimal disagreement with you and you're basically raising the tone.

I find this style of discussions irritating and non-productive.
So I'm taking a break and start working on the next version.
diff mbox series

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f578d1a24280..95d66f46493c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -322,7 +322,6 @@  struct mem_cgroup {
         /* Index in the kmem_cache->memcg_params.memcg_caches array */
 	int kmemcg_id;
 	enum memcg_kmem_state kmem_state;
-	struct list_head kmem_caches;
 	struct obj_cgroup __rcu *objcg;
 	struct list_head objcg_list;
 #endif
@@ -1431,9 +1430,7 @@  static inline void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 }
 #endif
 
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
-					struct obj_cgroup **objcgp);
-void memcg_kmem_put_cache(struct kmem_cache *cachep);
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep);
 
 #ifdef CONFIG_MEMCG_KMEM
 int __memcg_kmem_charge(struct mem_cgroup *memcg, gfp_t gfp,
@@ -1442,6 +1439,8 @@  void __memcg_kmem_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages);
 int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_page(struct page *page, int order);
 
+struct obj_cgroup *get_obj_cgroup_from_current(void);
+
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
 
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 03a389358562..b2dde3f24cfa 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -155,8 +155,7 @@  struct kmem_cache *kmem_cache_create_usercopy(const char *name,
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 
-void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
-void memcg_deactivate_kmem_caches(struct mem_cgroup *, struct mem_cgroup *);
+void memcg_create_kmem_cache(struct kmem_cache *cachep);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -578,8 +577,6 @@  static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 	return __kmalloc_node(size, flags, node);
 }
 
-int memcg_update_all_caches(int num_memcgs);
-
 /**
  * kmalloc_array - allocate memory for an array.
  * @n: number of elements.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ffe7e1e9f3c0..3f92b1c71aed 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -325,7 +325,7 @@  static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
 }
 
 /*
- * This will be the memcg's index in each cache's ->memcg_params.memcg_caches.
+ * This will be used as a shrinker list's index.
  * The main reason for not using cgroup id for this:
  *  this works better in sparse environments, where we have a lot of memcgs,
  *  but only a few kmem-limited. Or also, if we have, for instance, 200
@@ -542,11 +542,7 @@  ino_t page_cgroup_ino(struct page *page)
 	unsigned long ino = 0;
 
 	rcu_read_lock();
-	if (PageSlab(page) && !PageTail(page))
-		memcg = memcg_from_slab_page(page);
-	else
-		memcg = page_memcg_rcu(page);
-
+	memcg = page_memcg_rcu(page);
 	while (memcg && !(memcg->css.flags & CSS_ONLINE))
 		memcg = parent_mem_cgroup(memcg);
 	if (memcg)
@@ -2776,17 +2772,46 @@  struct mem_cgroup *mem_cgroup_from_obj(void *p)
 	page = virt_to_head_page(p);
 
 	/*
-	 * Slab pages don't have page->mem_cgroup set because corresponding
-	 * kmem caches can be reparented during the lifetime. That's why
-	 * memcg_from_slab_page() should be used instead.
+	 * Slab objects are accounted individually, not per-page.
+	 * Memcg membership data for each individual object is saved in
+	 * the page->obj_cgroups.
 	 */
-	if (PageSlab(page))
-		return memcg_from_slab_page(page);
+	if (page_has_obj_cgroups(page)) {
+		struct obj_cgroup *objcg;
+		unsigned int off;
+
+		off = obj_to_index(page->slab_cache, page, p);
+		objcg = page_obj_cgroups(page)[off];
+		return obj_cgroup_memcg(objcg);
+	}
 
-	/* All other pages use page->mem_cgroup */
 	return page_memcg(page);
 }
 
+__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
+{
+	struct obj_cgroup *objcg = NULL;
+	struct mem_cgroup *memcg;
+
+	if (unlikely(!current->mm))
+		return NULL;
+
+	rcu_read_lock();
+	if (unlikely(current->active_memcg))
+		memcg = rcu_dereference(current->active_memcg);
+	else
+		memcg = mem_cgroup_from_task(current);
+
+	if (memcg && memcg != root_mem_cgroup) {
+		objcg = rcu_dereference(memcg->objcg);
+		if (objcg && !obj_cgroup_tryget(objcg))
+			objcg = NULL;
+	}
+	rcu_read_unlock();
+
+	return objcg;
+}
+
 static int memcg_alloc_cache_id(void)
 {
 	int id, size;
@@ -2812,9 +2837,7 @@  static int memcg_alloc_cache_id(void)
 	else if (size > MEMCG_CACHES_MAX_SIZE)
 		size = MEMCG_CACHES_MAX_SIZE;
 
-	err = memcg_update_all_caches(size);
-	if (!err)
-		err = memcg_update_all_list_lrus(size);
+	err = memcg_update_all_list_lrus(size);
 	if (!err)
 		memcg_nr_cache_ids = size;
 
@@ -2833,7 +2856,6 @@  static void memcg_free_cache_id(int id)
 }
 
 struct memcg_kmem_cache_create_work {
-	struct mem_cgroup *memcg;
 	struct kmem_cache *cachep;
 	struct work_struct work;
 };
@@ -2842,31 +2864,24 @@  static void memcg_kmem_cache_create_func(struct work_struct *w)
 {
 	struct memcg_kmem_cache_create_work *cw =
 		container_of(w, struct memcg_kmem_cache_create_work, work);
-	struct mem_cgroup *memcg = cw->memcg;
 	struct kmem_cache *cachep = cw->cachep;
 
-	memcg_create_kmem_cache(memcg, cachep);
+	memcg_create_kmem_cache(cachep);
 
-	css_put(&memcg->css);
 	kfree(cw);
 }
 
 /*
  * Enqueue the creation of a per-memcg kmem_cache.
  */
-static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
-					       struct kmem_cache *cachep)
+static void memcg_schedule_kmem_cache_create(struct kmem_cache *cachep)
 {
 	struct memcg_kmem_cache_create_work *cw;
 
-	if (!css_tryget_online(&memcg->css))
-		return;
-
 	cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
 	if (!cw)
 		return;
 
-	cw->memcg = memcg;
 	cw->cachep = cachep;
 	INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
 
@@ -2874,102 +2889,26 @@  static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
 }
 
 /**
- * memcg_kmem_get_cache: select the correct per-memcg cache for allocation
+ * memcg_kmem_get_cache: select memcg or root cache for allocation
  * @cachep: the original global kmem cache
  *
  * Return the kmem_cache we're supposed to use for a slab allocation.
- * We try to use the current memcg's version of the cache.
  *
  * If the cache does not exist yet, if we are the first user of it, we
  * create it asynchronously in a workqueue and let the current allocation
  * go through with the original cache.
- *
- * This function takes a reference to the cache it returns to assure it
- * won't get destroyed while we are working with it. Once the caller is
- * done with it, memcg_kmem_put_cache() must be called to release the
- * reference.
  */
-struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep,
-					struct obj_cgroup **objcgp)
+struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 {
-	struct mem_cgroup *memcg;
 	struct kmem_cache *memcg_cachep;
-	struct memcg_cache_array *arr;
-	int kmemcg_id;
 
-	VM_BUG_ON(!is_root_cache(cachep));
-
-	if (memcg_kmem_bypass())
+	memcg_cachep = READ_ONCE(cachep->memcg_params.memcg_cache);
+	if (unlikely(!memcg_cachep)) {
+		memcg_schedule_kmem_cache_create(cachep);
 		return cachep;
-
-	rcu_read_lock();
-
-	if (unlikely(current->active_memcg))
-		memcg = current->active_memcg;
-	else
-		memcg = mem_cgroup_from_task(current);
-
-	if (!memcg || memcg == root_mem_cgroup)
-		goto out_unlock;
-
-	kmemcg_id = READ_ONCE(memcg->kmemcg_id);
-	if (kmemcg_id < 0)
-		goto out_unlock;
-
-	arr = rcu_dereference(cachep->memcg_params.memcg_caches);
-
-	/*
-	 * Make sure we will access the up-to-date value. The code updating
-	 * memcg_caches issues a write barrier to match the data dependency
-	 * barrier inside READ_ONCE() (see memcg_create_kmem_cache()).
-	 */
-	memcg_cachep = READ_ONCE(arr->entries[kmemcg_id]);
-
-	/*
-	 * If we are in a safe context (can wait, and not in interrupt
-	 * context), we could be be predictable and return right away.
-	 * This would guarantee that the allocation being performed
-	 * already belongs in the new cache.
-	 *
-	 * However, there are some clashes that can arrive from locking.
-	 * For instance, because we acquire the slab_mutex while doing
-	 * memcg_create_kmem_cache, this means no further allocation
-	 * could happen with the slab_mutex held. So it's better to
-	 * defer everything.
-	 *
-	 * If the memcg is dying or memcg_cache is about to be released,
-	 * don't bother creating new kmem_caches. Because memcg_cachep
-	 * is ZEROed as the fist step of kmem offlining, we don't need
-	 * percpu_ref_tryget_live() here. css_tryget_online() check in
-	 * memcg_schedule_kmem_cache_create() will prevent us from
-	 * creation of a new kmem_cache.
-	 */
-	if (unlikely(!memcg_cachep))
-		memcg_schedule_kmem_cache_create(memcg, cachep);
-	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt)) {
-		struct obj_cgroup *objcg = rcu_dereference(memcg->objcg);
-
-		if (!objcg || !obj_cgroup_tryget(objcg)) {
-			percpu_ref_put(&memcg_cachep->memcg_params.refcnt);
-			goto out_unlock;
-		}
-
-		*objcgp = objcg;
-		cachep = memcg_cachep;
 	}
-out_unlock:
-	rcu_read_unlock();
-	return cachep;
-}
 
-/**
- * memcg_kmem_put_cache: drop reference taken by memcg_kmem_get_cache
- * @cachep: the cache returned by memcg_kmem_get_cache
- */
-void memcg_kmem_put_cache(struct kmem_cache *cachep)
-{
-	if (!is_root_cache(cachep))
-		percpu_ref_put(&cachep->memcg_params.refcnt);
+	return memcg_cachep;
 }
 
 /**
@@ -3641,7 +3580,6 @@  static int memcg_online_kmem(struct mem_cgroup *memcg)
 	 */
 	memcg->kmemcg_id = memcg_id;
 	memcg->kmem_state = KMEM_ONLINE;
-	INIT_LIST_HEAD(&memcg->kmem_caches);
 
 	return 0;
 }
@@ -3654,22 +3592,13 @@  static void memcg_offline_kmem(struct mem_cgroup *memcg)
 
 	if (memcg->kmem_state != KMEM_ONLINE)
 		return;
-	/*
-	 * Clear the online state before clearing memcg_caches array
-	 * entries. The slab_mutex in memcg_deactivate_kmem_caches()
-	 * guarantees that no cache will be created for this cgroup
-	 * after we are done (see memcg_create_kmem_cache()).
-	 */
+
 	memcg->kmem_state = KMEM_ALLOCATED;
 
 	parent = parent_mem_cgroup(memcg);
 	if (!parent)
 		parent = root_mem_cgroup;
 
-	/*
-	 * Deactivate and reparent kmem_caches and objcgs.
-	 */
-	memcg_deactivate_kmem_caches(memcg, parent);
 	memcg_reparent_objcgs(memcg, parent);
 
 	kmemcg_id = memcg->kmemcg_id;
@@ -3704,10 +3633,8 @@  static void memcg_free_kmem(struct mem_cgroup *memcg)
 	if (unlikely(memcg->kmem_state == KMEM_ONLINE))
 		memcg_offline_kmem(memcg);
 
-	if (memcg->kmem_state == KMEM_ALLOCATED) {
-		WARN_ON(!list_empty(&memcg->kmem_caches));
+	if (memcg->kmem_state == KMEM_ALLOCATED)
 		static_branch_dec(&memcg_kmem_enabled_key);
-	}
 }
 #else
 static int memcg_online_kmem(struct mem_cgroup *memcg)
@@ -5283,9 +5210,6 @@  mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 
 	/* The following stuff does not apply to the root */
 	if (!parent) {
-#ifdef CONFIG_MEMCG_KMEM
-		INIT_LIST_HEAD(&memcg->kmem_caches);
-#endif
 		root_mem_cgroup = memcg;
 		return &memcg->css;
 	}
diff --git a/mm/slab.c b/mm/slab.c
index f16e896d5a3d..1f6ce2018993 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1239,7 +1239,7 @@  void __init kmem_cache_init(void)
 				  nr_node_ids * sizeof(struct kmem_cache_node *),
 				  SLAB_HWCACHE_ALIGN, 0, 0);
 	list_add(&kmem_cache->list, &slab_caches);
-	memcg_link_cache(kmem_cache, NULL);
+	memcg_link_cache(kmem_cache);
 	slab_state = PARTIAL;
 
 	/*
@@ -2244,17 +2244,6 @@  int __kmem_cache_shrink(struct kmem_cache *cachep)
 	return (ret ? 1 : 0);
 }
 
-#ifdef CONFIG_MEMCG
-void __kmemcg_cache_deactivate(struct kmem_cache *cachep)
-{
-	__kmem_cache_shrink(cachep);
-}
-
-void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
-{
-}
-#endif
-
 int __kmem_cache_shutdown(struct kmem_cache *cachep)
 {
 	return __kmem_cache_shrink(cachep);
@@ -3862,7 +3851,8 @@  static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
 		return ret;
 
 	lockdep_assert_held(&slab_mutex);
-	for_each_memcg_cache(c, cachep) {
+	c = memcg_cache(cachep);
+	if (c) {
 		/* return value determined by the root cache only */
 		__do_tune_cpucache(c, limit, batchcount, shared, gfp);
 	}
diff --git a/mm/slab.h b/mm/slab.h
index 6585638e5be0..732051d6861d 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -32,66 +32,25 @@  struct kmem_cache {
 
 #else /* !CONFIG_SLOB */
 
-struct memcg_cache_array {
-	struct rcu_head rcu;
-	struct kmem_cache *entries[0];
-};
-
 /*
  * This is the main placeholder for memcg-related information in kmem caches.
- * Both the root cache and the child caches will have it. For the root cache,
- * this will hold a dynamically allocated array large enough to hold
- * information about the currently limited memcgs in the system. To allow the
- * array to be accessed without taking any locks, on relocation we free the old
- * version only after a grace period.
- *
- * Root and child caches hold different metadata.
+ * Both the root cache and the child cache will have it. Some fields are used
+ * in both cases, other are specific to root caches.
  *
  * @root_cache:	Common to root and child caches.  NULL for root, pointer to
  *		the root cache for children.
  *
  * The following fields are specific to root caches.
  *
- * @memcg_caches: kmemcg ID indexed table of child caches.  This table is
- *		used to index child cachces during allocation and cleared
- *		early during shutdown.
- *
- * @root_caches_node: List node for slab_root_caches list.
- *
- * @children:	List of all child caches.  While the child caches are also
- *		reachable through @memcg_caches, a child cache remains on
- *		this list until it is actually destroyed.
- *
- * The following fields are specific to child caches.
- *
- * @memcg:	Pointer to the memcg this cache belongs to.
- *
- * @children_node: List node for @root_cache->children list.
- *
- * @kmem_caches_node: List node for @memcg->kmem_caches list.
+ * @memcg_cache: pointer to memcg kmem cache, used by all non-root memory
+ *		cgroups.
+ * @root_caches_node: list node for slab_root_caches list.
  */
 struct memcg_cache_params {
 	struct kmem_cache *root_cache;
-	union {
-		struct {
-			struct memcg_cache_array __rcu *memcg_caches;
-			struct list_head __root_caches_node;
-			struct list_head children;
-			bool dying;
-		};
-		struct {
-			struct mem_cgroup *memcg;
-			struct list_head children_node;
-			struct list_head kmem_caches_node;
-			struct percpu_ref refcnt;
-
-			void (*work_fn)(struct kmem_cache *);
-			union {
-				struct rcu_head rcu_head;
-				struct work_struct work;
-			};
-		};
-	};
+
+	struct kmem_cache *memcg_cache;
+	struct list_head __root_caches_node;
 };
 #endif /* CONFIG_SLOB */
 
@@ -234,8 +193,6 @@  bool __kmem_cache_empty(struct kmem_cache *);
 int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
 int __kmem_cache_shrink(struct kmem_cache *);
-void __kmemcg_cache_deactivate(struct kmem_cache *s);
-void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
 void slab_kmem_cache_release(struct kmem_cache *);
 void kmem_cache_shrink_all(struct kmem_cache *s);
 
@@ -281,14 +238,6 @@  static inline int cache_vmstat_idx(struct kmem_cache *s)
 extern struct list_head		slab_root_caches;
 #define root_caches_node	memcg_params.__root_caches_node
 
-/*
- * Iterate over all memcg caches of the given root cache. The caller must hold
- * slab_mutex.
- */
-#define for_each_memcg_cache(iter, root) \
-	list_for_each_entry(iter, &(root)->memcg_params.children, \
-			    memcg_params.children_node)
-
 static inline bool is_root_cache(struct kmem_cache *s)
 {
 	return !s->memcg_params.root_cache;
@@ -319,6 +268,13 @@  static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s->memcg_params.root_cache;
 }
 
+static inline struct kmem_cache *memcg_cache(struct kmem_cache *s)
+{
+	if (is_root_cache(s))
+		return s->memcg_params.memcg_cache;
+	return NULL;
+}
+
 static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
 {
 	/*
@@ -331,25 +287,9 @@  static inline struct obj_cgroup **page_obj_cgroups(struct page *page)
 		((unsigned long)page->obj_cgroups & ~0x1UL);
 }
 
-/*
- * Expects a pointer to a slab page. Please note, that PageSlab() check
- * isn't sufficient, as it returns true also for tail compound slab pages,
- * which do not have slab_cache pointer set.
- * So this function assumes that the page can pass PageSlab() && !PageTail()
- * check.
- *
- * The kmem_cache can be reparented asynchronously. The caller must ensure
- * the memcg lifetime, e.g. by taking rcu_read_lock() or cgroup_mutex.
- */
-static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
+static inline bool page_has_obj_cgroups(struct page *page)
 {
-	struct kmem_cache *s;
-
-	s = READ_ONCE(page->slab_cache);
-	if (s && !is_root_cache(s))
-		return READ_ONCE(s->memcg_params.memcg);
-
-	return NULL;
+	return ((unsigned long)page->obj_cgroups & 0x1UL);
 }
 
 static inline int memcg_alloc_page_obj_cgroups(struct page *page, gfp_t gfp,
@@ -385,16 +325,25 @@  static inline struct kmem_cache *memcg_slab_pre_alloc_hook(struct kmem_cache *s,
 						size_t objects, gfp_t flags)
 {
 	struct kmem_cache *cachep;
+	struct obj_cgroup *objcg;
+
+	if (memcg_kmem_bypass())
+		return s;
 
-	cachep = memcg_kmem_get_cache(s, objcgp);
+	cachep = memcg_kmem_get_cache(s);
 	if (is_root_cache(cachep))
 		return s;
 
-	if (obj_cgroup_charge(*objcgp, flags, objects * obj_full_size(s))) {
-		memcg_kmem_put_cache(cachep);
+	objcg = get_obj_cgroup_from_current();
+	if (!objcg)
+		return s;
+
+	if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) {
+		obj_cgroup_put(objcg);
 		cachep = NULL;
 	}
 
+	*objcgp = objcg;
 	return cachep;
 }
 
@@ -431,7 +380,6 @@  static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
 		}
 	}
 	obj_cgroup_put(objcg);
-	memcg_kmem_put_cache(s);
 }
 
 static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
@@ -455,7 +403,7 @@  static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
-extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
+extern void memcg_link_cache(struct kmem_cache *s);
 
 #else /* CONFIG_MEMCG_KMEM */
 
@@ -463,9 +411,6 @@  extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 #define slab_root_caches	slab_caches
 #define root_caches_node	list
 
-#define for_each_memcg_cache(iter, root) \
-	for ((void)(iter), (void)(root); 0; )
-
 static inline bool is_root_cache(struct kmem_cache *s)
 {
 	return true;
@@ -487,7 +432,17 @@  static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s;
 }
 
-static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
+static inline struct kmem_cache *memcg_cache(struct kmem_cache *s)
+{
+	return NULL;
+}
+
+static inline bool page_has_obj_cgroups(struct page *page)
+{
+	return false;
+}
+
+static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr)
 {
 	return NULL;
 }
@@ -524,8 +479,7 @@  static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
 
-static inline void memcg_link_cache(struct kmem_cache *s,
-				    struct mem_cgroup *memcg)
+static inline void memcg_link_cache(struct kmem_cache *s)
 {
 }
 
@@ -547,17 +501,14 @@  static __always_inline int charge_slab_page(struct page *page,
 					    struct kmem_cache *s,
 					    unsigned int objects)
 {
-#ifdef CONFIG_MEMCG_KMEM
 	if (!is_root_cache(s)) {
 		int ret;
 
 		ret = memcg_alloc_page_obj_cgroups(page, gfp, objects);
 		if (ret)
 			return ret;
-
-		percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
 	}
-#endif
+
 	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 			    PAGE_SIZE << order);
 	return 0;
@@ -566,12 +517,9 @@  static __always_inline int charge_slab_page(struct page *page,
 static __always_inline void uncharge_slab_page(struct page *page, int order,
 					       struct kmem_cache *s)
 {
-#ifdef CONFIG_MEMCG_KMEM
-	if (!is_root_cache(s)) {
+	if (!is_root_cache(s))
 		memcg_free_page_obj_cgroups(page);
-		percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
-	}
-#endif
+
 	mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 			    -(PAGE_SIZE << order));
 }
@@ -720,9 +668,6 @@  static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
 void *slab_start(struct seq_file *m, loff_t *pos);
 void *slab_next(struct seq_file *m, void *p, loff_t *pos);
 void slab_stop(struct seq_file *m, void *p);
-void *memcg_slab_start(struct seq_file *m, loff_t *pos);
-void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos);
-void memcg_slab_stop(struct seq_file *m, void *p);
 int memcg_slab_show(struct seq_file *m, void *p);
 
 #if defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 80b4efdb1df3..9ad13c7e28fc 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -131,141 +131,36 @@  int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
 #ifdef CONFIG_MEMCG_KMEM
 
 LIST_HEAD(slab_root_caches);
-static DEFINE_SPINLOCK(memcg_kmem_wq_lock);
-
-static void kmemcg_cache_shutdown(struct percpu_ref *percpu_ref);
 
 void slab_init_memcg_params(struct kmem_cache *s)
 {
 	s->memcg_params.root_cache = NULL;
-	RCU_INIT_POINTER(s->memcg_params.memcg_caches, NULL);
-	INIT_LIST_HEAD(&s->memcg_params.children);
-	s->memcg_params.dying = false;
+	s->memcg_params.memcg_cache = NULL;
 }
 
-static int init_memcg_params(struct kmem_cache *s,
-			     struct kmem_cache *root_cache)
+static void init_memcg_params(struct kmem_cache *s,
+			      struct kmem_cache *root_cache)
 {
-	struct memcg_cache_array *arr;
-
-	if (root_cache) {
-		int ret = percpu_ref_init(&s->memcg_params.refcnt,
-					  kmemcg_cache_shutdown,
-					  0, GFP_KERNEL);
-		if (ret)
-			return ret;
-
+	if (root_cache)
 		s->memcg_params.root_cache = root_cache;
-		INIT_LIST_HEAD(&s->memcg_params.children_node);
-		INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
-		return 0;
-	}
-
-	slab_init_memcg_params(s);
-
-	if (!memcg_nr_cache_ids)
-		return 0;
-
-	arr = kvzalloc(sizeof(struct memcg_cache_array) +
-		       memcg_nr_cache_ids * sizeof(void *),
-		       GFP_KERNEL);
-	if (!arr)
-		return -ENOMEM;
-
-	RCU_INIT_POINTER(s->memcg_params.memcg_caches, arr);
-	return 0;
-}
-
-static void destroy_memcg_params(struct kmem_cache *s)
-{
-	if (is_root_cache(s)) {
-		kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
-	} else {
-		mem_cgroup_put(s->memcg_params.memcg);
-		WRITE_ONCE(s->memcg_params.memcg, NULL);
-		percpu_ref_exit(&s->memcg_params.refcnt);
-	}
-}
-
-static void free_memcg_params(struct rcu_head *rcu)
-{
-	struct memcg_cache_array *old;
-
-	old = container_of(rcu, struct memcg_cache_array, rcu);
-	kvfree(old);
-}
-
-static int update_memcg_params(struct kmem_cache *s, int new_array_size)
-{
-	struct memcg_cache_array *old, *new;
-
-	new = kvzalloc(sizeof(struct memcg_cache_array) +
-		       new_array_size * sizeof(void *), GFP_KERNEL);
-	if (!new)
-		return -ENOMEM;
-
-	old = rcu_dereference_protected(s->memcg_params.memcg_caches,
-					lockdep_is_held(&slab_mutex));
-	if (old)
-		memcpy(new->entries, old->entries,
-		       memcg_nr_cache_ids * sizeof(void *));
-
-	rcu_assign_pointer(s->memcg_params.memcg_caches, new);
-	if (old)
-		call_rcu(&old->rcu, free_memcg_params);
-	return 0;
-}
-
-int memcg_update_all_caches(int num_memcgs)
-{
-	struct kmem_cache *s;
-	int ret = 0;
-
-	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
-		ret = update_memcg_params(s, num_memcgs);
-		/*
-		 * Instead of freeing the memory, we'll just leave the caches
-		 * up to this point in an updated state.
-		 */
-		if (ret)
-			break;
-	}
-	mutex_unlock(&slab_mutex);
-	return ret;
+	else
+		slab_init_memcg_params(s);
 }
 
-void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
+void memcg_link_cache(struct kmem_cache *s)
 {
-	if (is_root_cache(s)) {
+	if (is_root_cache(s))
 		list_add(&s->root_caches_node, &slab_root_caches);
-	} else {
-		css_get(&memcg->css);
-		s->memcg_params.memcg = memcg;
-		list_add(&s->memcg_params.children_node,
-			 &s->memcg_params.root_cache->memcg_params.children);
-		list_add(&s->memcg_params.kmem_caches_node,
-			 &s->memcg_params.memcg->kmem_caches);
-	}
 }
 
 static void memcg_unlink_cache(struct kmem_cache *s)
 {
-	if (is_root_cache(s)) {
+	if (is_root_cache(s))
 		list_del(&s->root_caches_node);
-	} else {
-		list_del(&s->memcg_params.children_node);
-		list_del(&s->memcg_params.kmem_caches_node);
-	}
 }
 #else
-static inline int init_memcg_params(struct kmem_cache *s,
-				    struct kmem_cache *root_cache)
-{
-	return 0;
-}
-
-static inline void destroy_memcg_params(struct kmem_cache *s)
+static inline void init_memcg_params(struct kmem_cache *s,
+				     struct kmem_cache *root_cache)
 {
 }
 
@@ -380,7 +275,7 @@  static struct kmem_cache *create_cache(const char *name,
 		unsigned int object_size, unsigned int align,
 		slab_flags_t flags, unsigned int useroffset,
 		unsigned int usersize, void (*ctor)(void *),
-		struct mem_cgroup *memcg, struct kmem_cache *root_cache)
+		struct kmem_cache *root_cache)
 {
 	struct kmem_cache *s;
 	int err;
@@ -400,24 +295,20 @@  static struct kmem_cache *create_cache(const char *name,
 	s->useroffset = useroffset;
 	s->usersize = usersize;
 
-	err = init_memcg_params(s, root_cache);
-	if (err)
-		goto out_free_cache;
-
+	init_memcg_params(s, root_cache);
 	err = __kmem_cache_create(s, flags);
 	if (err)
 		goto out_free_cache;
 
 	s->refcount = 1;
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s, memcg);
+	memcg_link_cache(s);
 out:
 	if (err)
 		return ERR_PTR(err);
 	return s;
 
 out_free_cache:
-	destroy_memcg_params(s);
 	kmem_cache_free(kmem_cache, s);
 	goto out;
 }
@@ -504,7 +395,7 @@  kmem_cache_create_usercopy(const char *name,
 
 	s = create_cache(cache_name, size,
 			 calculate_alignment(flags, align, size),
-			 flags, useroffset, usersize, ctor, NULL, NULL);
+			 flags, useroffset, usersize, ctor, NULL);
 	if (IS_ERR(s)) {
 		err = PTR_ERR(s);
 		kfree_const(cache_name);
@@ -629,51 +520,27 @@  static int shutdown_cache(struct kmem_cache *s)
 
 #ifdef CONFIG_MEMCG_KMEM
 /*
- * memcg_create_kmem_cache - Create a cache for a memory cgroup.
- * @memcg: The memory cgroup the new cache is for.
+ * memcg_create_kmem_cache - Create a cache for non-root memory cgroups.
  * @root_cache: The parent of the new cache.
  *
  * This function attempts to create a kmem cache that will serve allocation
- * requests going from @memcg to @root_cache. The new cache inherits properties
- * from its parent.
+ * requests going all non-root memory cgroups to @root_cache. The new cache
+ * inherits properties from its parent.
  */
-void memcg_create_kmem_cache(struct mem_cgroup *memcg,
-			     struct kmem_cache *root_cache)
+void memcg_create_kmem_cache(struct kmem_cache *root_cache)
 {
-	static char memcg_name_buf[NAME_MAX + 1]; /* protected by slab_mutex */
-	struct cgroup_subsys_state *css = &memcg->css;
-	struct memcg_cache_array *arr;
 	struct kmem_cache *s = NULL;
 	char *cache_name;
-	int idx;
 
 	get_online_cpus();
 	get_online_mems();
 
 	mutex_lock(&slab_mutex);
 
-	/*
-	 * The memory cgroup could have been offlined while the cache
-	 * creation work was pending.
-	 */
-	if (memcg->kmem_state != KMEM_ONLINE)
+	if (root_cache->memcg_params.memcg_cache)
 		goto out_unlock;
 
-	idx = memcg_cache_id(memcg);
-	arr = rcu_dereference_protected(root_cache->memcg_params.memcg_caches,
-					lockdep_is_held(&slab_mutex));
-
-	/*
-	 * Since per-memcg caches are created asynchronously on first
-	 * allocation (see memcg_kmem_get_cache()), several threads can try to
-	 * create the same cache, but only one of them may succeed.
-	 */
-	if (arr->entries[idx])
-		goto out_unlock;
-
-	cgroup_name(css->cgroup, memcg_name_buf, sizeof(memcg_name_buf));
-	cache_name = kasprintf(GFP_KERNEL, "%s(%llu:%s)", root_cache->name,
-			       css->serial_nr, memcg_name_buf);
+	cache_name = kasprintf(GFP_KERNEL, "%s-memcg", root_cache->name);
 	if (!cache_name)
 		goto out_unlock;
 
@@ -681,7 +548,7 @@  void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 			 root_cache->align,
 			 root_cache->flags & CACHE_CREATE_MASK,
 			 root_cache->useroffset, root_cache->usersize,
-			 root_cache->ctor, memcg, root_cache);
+			 root_cache->ctor, root_cache);
 	/*
 	 * If we could not create a memcg cache, do not complain, because
 	 * that's not critical at all as we can always proceed with the root
@@ -698,7 +565,7 @@  void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	 * initialized.
 	 */
 	smp_wmb();
-	arr->entries[idx] = s;
+	root_cache->memcg_params.memcg_cache = s;
 
 out_unlock:
 	mutex_unlock(&slab_mutex);
@@ -707,197 +574,18 @@  void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	put_online_cpus();
 }
 
-static void kmemcg_workfn(struct work_struct *work)
-{
-	struct kmem_cache *s = container_of(work, struct kmem_cache,
-					    memcg_params.work);
-
-	get_online_cpus();
-	get_online_mems();
-
-	mutex_lock(&slab_mutex);
-	s->memcg_params.work_fn(s);
-	mutex_unlock(&slab_mutex);
-
-	put_online_mems();
-	put_online_cpus();
-}
-
-static void kmemcg_rcufn(struct rcu_head *head)
-{
-	struct kmem_cache *s = container_of(head, struct kmem_cache,
-					    memcg_params.rcu_head);
-
-	/*
-	 * We need to grab blocking locks.  Bounce to ->work.  The
-	 * work item shares the space with the RCU head and can't be
-	 * initialized eariler.
-	 */
-	INIT_WORK(&s->memcg_params.work, kmemcg_workfn);
-	queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
-}
-
-static void kmemcg_cache_shutdown_fn(struct kmem_cache *s)
-{
-	WARN_ON(shutdown_cache(s));
-}
-
-static void kmemcg_cache_shutdown(struct percpu_ref *percpu_ref)
-{
-	struct kmem_cache *s = container_of(percpu_ref, struct kmem_cache,
-					    memcg_params.refcnt);
-	unsigned long flags;
-
-	spin_lock_irqsave(&memcg_kmem_wq_lock, flags);
-	if (s->memcg_params.root_cache->memcg_params.dying)
-		goto unlock;
-
-	s->memcg_params.work_fn = kmemcg_cache_shutdown_fn;
-	INIT_WORK(&s->memcg_params.work, kmemcg_workfn);
-	queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
-
-unlock:
-	spin_unlock_irqrestore(&memcg_kmem_wq_lock, flags);
-}
-
-static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
-{
-	__kmemcg_cache_deactivate_after_rcu(s);
-	percpu_ref_kill(&s->memcg_params.refcnt);
-}
-
-static void kmemcg_cache_deactivate(struct kmem_cache *s)
-{
-	if (WARN_ON_ONCE(is_root_cache(s)))
-		return;
-
-	__kmemcg_cache_deactivate(s);
-	s->flags |= SLAB_DEACTIVATED;
-
-	/*
-	 * memcg_kmem_wq_lock is used to synchronize memcg_params.dying
-	 * flag and make sure that no new kmem_cache deactivation tasks
-	 * are queued (see flush_memcg_workqueue() ).
-	 */
-	spin_lock_irq(&memcg_kmem_wq_lock);
-	if (s->memcg_params.root_cache->memcg_params.dying)
-		goto unlock;
-
-	s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu;
-	call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn);
-unlock:
-	spin_unlock_irq(&memcg_kmem_wq_lock);
-}
-
-void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg,
-				  struct mem_cgroup *parent)
-{
-	int idx;
-	struct memcg_cache_array *arr;
-	struct kmem_cache *s, *c;
-	unsigned int nr_reparented;
-
-	idx = memcg_cache_id(memcg);
-
-	get_online_cpus();
-	get_online_mems();
-
-	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
-		arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
-						lockdep_is_held(&slab_mutex));
-		c = arr->entries[idx];
-		if (!c)
-			continue;
-
-		kmemcg_cache_deactivate(c);
-		arr->entries[idx] = NULL;
-	}
-	nr_reparented = 0;
-	list_for_each_entry(s, &memcg->kmem_caches,
-			    memcg_params.kmem_caches_node) {
-		WRITE_ONCE(s->memcg_params.memcg, parent);
-		css_put(&memcg->css);
-		nr_reparented++;
-	}
-	if (nr_reparented) {
-		list_splice_init(&memcg->kmem_caches,
-				 &parent->kmem_caches);
-		css_get_many(&parent->css, nr_reparented);
-	}
-	mutex_unlock(&slab_mutex);
-
-	put_online_mems();
-	put_online_cpus();
-}
-
 static int shutdown_memcg_caches(struct kmem_cache *s)
 {
-	struct memcg_cache_array *arr;
-	struct kmem_cache *c, *c2;
-	LIST_HEAD(busy);
-	int i;
-
 	BUG_ON(!is_root_cache(s));
 
-	/*
-	 * First, shutdown active caches, i.e. caches that belong to online
-	 * memory cgroups.
-	 */
-	arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
-					lockdep_is_held(&slab_mutex));
-	for_each_memcg_cache_index(i) {
-		c = arr->entries[i];
-		if (!c)
-			continue;
-		if (shutdown_cache(c))
-			/*
-			 * The cache still has objects. Move it to a temporary
-			 * list so as not to try to destroy it for a second
-			 * time while iterating over inactive caches below.
-			 */
-			list_move(&c->memcg_params.children_node, &busy);
-		else
-			/*
-			 * The cache is empty and will be destroyed soon. Clear
-			 * the pointer to it in the memcg_caches array so that
-			 * it will never be accessed even if the root cache
-			 * stays alive.
-			 */
-			arr->entries[i] = NULL;
-	}
-
-	/*
-	 * Second, shutdown all caches left from memory cgroups that are now
-	 * offline.
-	 */
-	list_for_each_entry_safe(c, c2, &s->memcg_params.children,
-				 memcg_params.children_node)
-		shutdown_cache(c);
-
-	list_splice(&busy, &s->memcg_params.children);
+	if (s->memcg_params.memcg_cache)
+		WARN_ON(shutdown_cache(s->memcg_params.memcg_cache));
 
-	/*
-	 * A cache being destroyed must be empty. In particular, this means
-	 * that all per memcg caches attached to it must be empty too.
-	 */
-	if (!list_empty(&s->memcg_params.children))
-		return -EBUSY;
 	return 0;
 }
 
 static void flush_memcg_workqueue(struct kmem_cache *s)
 {
-	spin_lock_irq(&memcg_kmem_wq_lock);
-	s->memcg_params.dying = true;
-	spin_unlock_irq(&memcg_kmem_wq_lock);
-
-	/*
-	 * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make
-	 * sure all registered rcu callbacks have been invoked.
-	 */
-	rcu_barrier();
-
 	/*
 	 * SLAB and SLUB create memcg kmem_caches through workqueue and SLUB
 	 * deactivates the memcg kmem_caches through workqueue. Make sure all
@@ -905,18 +593,6 @@  static void flush_memcg_workqueue(struct kmem_cache *s)
 	 */
 	if (likely(memcg_kmem_cache_wq))
 		flush_workqueue(memcg_kmem_cache_wq);
-
-	/*
-	 * If we're racing with children kmem_cache deactivation, it might
-	 * take another rcu grace period to complete their destruction.
-	 * At this moment the corresponding percpu_ref_kill() call should be
-	 * done, but it might take another rcu grace period to complete
-	 * switching to the atomic mode.
-	 * Please, note that we check without grabbing the slab_mutex. It's safe
-	 * because at this moment the children list can't grow.
-	 */
-	if (!list_empty(&s->memcg_params.children))
-		rcu_barrier();
 }
 #else
 static inline int shutdown_memcg_caches(struct kmem_cache *s)
@@ -932,7 +608,6 @@  static inline void flush_memcg_workqueue(struct kmem_cache *s)
 void slab_kmem_cache_release(struct kmem_cache *s)
 {
 	__kmem_cache_release(s);
-	destroy_memcg_params(s);
 	kfree_const(s->name);
 	kmem_cache_free(kmem_cache, s);
 }
@@ -996,7 +671,7 @@  int kmem_cache_shrink(struct kmem_cache *cachep)
 EXPORT_SYMBOL(kmem_cache_shrink);
 
 /**
- * kmem_cache_shrink_all - shrink a cache and all memcg caches for root cache
+ * kmem_cache_shrink_all - shrink root and memcg caches
  * @s: The cache pointer
  */
 void kmem_cache_shrink_all(struct kmem_cache *s)
@@ -1013,21 +688,11 @@  void kmem_cache_shrink_all(struct kmem_cache *s)
 	kasan_cache_shrink(s);
 	__kmem_cache_shrink(s);
 
-	/*
-	 * We have to take the slab_mutex to protect from the memcg list
-	 * modification.
-	 */
-	mutex_lock(&slab_mutex);
-	for_each_memcg_cache(c, s) {
-		/*
-		 * Don't need to shrink deactivated memcg caches.
-		 */
-		if (s->flags & SLAB_DEACTIVATED)
-			continue;
+	c = memcg_cache(s);
+	if (c) {
 		kasan_cache_shrink(c);
 		__kmem_cache_shrink(c);
 	}
-	mutex_unlock(&slab_mutex);
 	put_online_mems();
 	put_online_cpus();
 }
@@ -1082,7 +747,7 @@  struct kmem_cache *__init create_kmalloc_cache(const char *name,
 
 	create_boot_cache(s, name, size, flags, useroffset, usersize);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s, NULL);
+	memcg_link_cache(s);
 	s->refcount = 1;
 	return s;
 }
@@ -1444,7 +1109,8 @@  memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
 	if (!is_root_cache(s))
 		return;
 
-	for_each_memcg_cache(c, s) {
+	c = memcg_cache(s);
+	if (c) {
 		memset(&sinfo, 0, sizeof(sinfo));
 		get_slabinfo(c, &sinfo);
 
@@ -1574,7 +1240,7 @@  module_init(slab_proc_init);
 
 #if defined(CONFIG_DEBUG_FS) && defined(CONFIG_MEMCG_KMEM)
 /*
- * Display information about kmem caches that have child memcg caches.
+ * Display information about kmem caches that have memcg cache.
  */
 static int memcg_slabinfo_show(struct seq_file *m, void *unused)
 {
@@ -1586,9 +1252,9 @@  static int memcg_slabinfo_show(struct seq_file *m, void *unused)
 	seq_puts(m, " <active_slabs> <num_slabs>\n");
 	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
 		/*
-		 * Skip kmem caches that don't have any memcg children.
+		 * Skip kmem caches that don't have the memcg cache.
 		 */
-		if (list_empty(&s->memcg_params.children))
+		if (!s->memcg_params.memcg_cache)
 			continue;
 
 		memset(&sinfo, 0, sizeof(sinfo));
@@ -1597,23 +1263,13 @@  static int memcg_slabinfo_show(struct seq_file *m, void *unused)
 			   cache_name(s), sinfo.active_objs, sinfo.num_objs,
 			   sinfo.active_slabs, sinfo.num_slabs);
 
-		for_each_memcg_cache(c, s) {
-			struct cgroup_subsys_state *css;
-			char *status = "";
-
-			css = &c->memcg_params.memcg->css;
-			if (!(css->flags & CSS_ONLINE))
-				status = ":dead";
-			else if (c->flags & SLAB_DEACTIVATED)
-				status = ":deact";
-
-			memset(&sinfo, 0, sizeof(sinfo));
-			get_slabinfo(c, &sinfo);
-			seq_printf(m, "%-17s %4d%-6s %6lu %6lu %6lu %6lu\n",
-				   cache_name(c), css->id, status,
-				   sinfo.active_objs, sinfo.num_objs,
-				   sinfo.active_slabs, sinfo.num_slabs);
-		}
+		c = s->memcg_params.memcg_cache;
+		memset(&sinfo, 0, sizeof(sinfo));
+		get_slabinfo(c, &sinfo);
+		seq_printf(m, "%-17s %4d %6lu %6lu %6lu %6lu\n",
+			   cache_name(c), root_mem_cgroup->css.id,
+			   sinfo.active_objs, sinfo.num_objs,
+			   sinfo.active_slabs, sinfo.num_slabs);
 	}
 	mutex_unlock(&slab_mutex);
 	return 0;
diff --git a/mm/slub.c b/mm/slub.c
index 6365e89cd503..5b39b5c005bc 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4055,36 +4055,6 @@  int __kmem_cache_shrink(struct kmem_cache *s)
 	return ret;
 }
 
-#ifdef CONFIG_MEMCG
-void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
-{
-	/*
-	 * Called with all the locks held after a sched RCU grace period.
-	 * Even if @s becomes empty after shrinking, we can't know that @s
-	 * doesn't have allocations already in-flight and thus can't
-	 * destroy @s until the associated memcg is released.
-	 *
-	 * However, let's remove the sysfs files for empty caches here.
-	 * Each cache has a lot of interface files which aren't
-	 * particularly useful for empty draining caches; otherwise, we can
-	 * easily end up with millions of unnecessary sysfs files on
-	 * systems which have a lot of memory and transient cgroups.
-	 */
-	if (!__kmem_cache_shrink(s))
-		sysfs_slab_remove(s);
-}
-
-void __kmemcg_cache_deactivate(struct kmem_cache *s)
-{
-	/*
-	 * Disable empty slabs caching. Used to avoid pinning offline
-	 * memory cgroups by kmem pages that can be freed.
-	 */
-	slub_set_cpu_partial(s, 0);
-	s->min_partial = 0;
-}
-#endif	/* CONFIG_MEMCG */
-
 static int slab_mem_going_offline_callback(void *arg)
 {
 	struct kmem_cache *s;
@@ -4241,7 +4211,7 @@  static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 	}
 	slab_init_memcg_params(s);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s, NULL);
+	memcg_link_cache(s);
 	return s;
 }
 
@@ -4309,7 +4279,8 @@  __kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
 		s->object_size = max(s->object_size, size);
 		s->inuse = max(s->inuse, ALIGN(size, sizeof(void *)));
 
-		for_each_memcg_cache(c, s) {
+		c = memcg_cache(s);
+		if (c) {
 			c->object_size = s->object_size;
 			c->inuse = max(c->inuse, ALIGN(size, sizeof(void *)));
 		}
@@ -5562,7 +5533,8 @@  static ssize_t slab_attr_store(struct kobject *kobj,
 		 * directly either failed or succeeded, in which case we loop
 		 * through the descendants with best-effort propagation.
 		 */
-		for_each_memcg_cache(c, s)
+		c = memcg_cache(s);
+		if (c)
 			attribute->store(c, buf, len);
 		mutex_unlock(&slab_mutex);
 	}