Message ID: 20210330101531.82752-1-songmuchun@bytedance.com
Series: Use obj_cgroup APIs to charge the LRU pages
On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote: > > Since the following patchsets applied. All the kernel memory are charged > with the new APIs of obj_cgroup. > > [v17,00/19] The new cgroup slab memory controller > [v5,0/7] Use obj_cgroup APIs to charge kmem pages > > But user memory allocations (LRU pages) pinning memcgs for a long time - > it exists at a larger scale and is causing recurring problems in the real > world: page cache doesn't get reclaimed for a long time, or is used by the > second, third, fourth, ... instance of the same job that was restarted into > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory, > and make page reclaim very inefficient. > > We can convert LRU pages and most other raw memcg pins to the objcg direction > to fix this problem, and then the LRU pages will not pin the memcgs. > > This patchset aims to make the LRU pages to drop the reference to memory > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number > of the dying cgroups will not increase if we run the following test script. > > ```bash > #!/bin/bash > > cat /proc/cgroups | grep memory > > cd /sys/fs/cgroup/memory > > for i in range{1..500} > do > mkdir test > echo $$ > test/cgroup.procs > sleep 60 & > echo $$ > cgroup.procs > echo `cat test/cgroup.procs` > cgroup.procs > rmdir test > done > > cat /proc/cgroups | grep memory > ``` > > Patch 1 aims to fix page charging in page replacement. > Patch 2-5 are code cleanup and simplification. > Patch 6-15 convert LRU pages pin to the objcg direction. The main concern I have with *just* reparenting LRU pages is that for the long running systems, the root memcg will become a dumping ground. In addition a job running multiple times on a machine will see inconsistent memory usage if it re-accesses the file pages which were reparented to the root memcg. Please note that I do agree with the mentioned problem and we do see this issue in our fleet. Internally we have a "memcg mount option" feature which couples a file system with a memcg and all file pages allocated on that file system will be charged to that memcg. Multiple instances (concurrent or subsequent) of the job will use that file system (with a dedicated memcg) without leaving the zombies behind. I am not pushing for this solution as it comes with its own intricacies (e.g. if memcg coupled with a file system has a limit, the oom behavior would be awkward and therefore internally we don't put a limit on such memcgs). Though I want this to be part of discussion. I think the underlying reasons behind this issue are: 1) Filesystem shared by disjoint jobs. 2) For job dedicated filesystems, the lifetime of the filesystem is different from the lifetime of the job. For now, we have two potential solutions to the zombies-due-to-offlined-LRU-pages i.e. (1) reparent and (2) pairing memcg and filesystem. For reparenting, the cons are inconsistent memory usage and root memcg potentially becoming dumping ground. For pairing, the oom behavior is awkward which is the same for any type of remote charging. I am wondering how we can resolve the cons for each. To resolve the root-memcg-dump issue in the reparenting, maybe we uncharge the page when it reaches root and the next accesser will be charged. For inconsistent memory usage, either users accept the inconsistency or force reclaim before offline which will reduce the benefit of sharing filesystem with subsequent instances of the job. 
For the pairing, maybe we can punt to the user/admin to not set a limit on such memcg to avoid awkward oom situations. Thoughts?
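To make the pinning mechanics behind the zombie problem concrete, here is a simplified sketch of how an LRU page pins its memcg today versus the objcg indirection this series moves to. It is not the literal kernel charge path; the helpers css_get(), get_obj_cgroup_from_current() and the page->memcg_data field follow v5.12-era mm/memcontrol.c and include/linux/memcontrol.h, while the two functions themselves are only illustrative.

```c
/*
 * Sketch only: why a deleted memcg lingers today, and what changes with
 * objcg.  The functions are illustrative, not the actual charge path.
 */
#include <linux/memcontrol.h>
#include <linux/mm_types.h>

static void sketch_commit_charge(struct page *page, struct mem_cgroup *memcg)
{
	/*
	 * Today: every charged LRU page takes a css reference and points
	 * straight at the memcg.  rmdir only offlines the cgroup; the
	 * struct mem_cgroup cannot be freed until the last such page is
	 * uncharged -- that is the dying-memcg pile-up.
	 */
	css_get(&memcg->css);
	page->memcg_data = (unsigned long)memcg;
}

static void sketch_commit_charge_objcg(struct page *page)
{
	/*
	 * Direction of the series: the page pins only a small obj_cgroup
	 * (the real patches also tag the pointer to distinguish it).  On
	 * offline, objcg->memcg is re-pointed at the parent, so the dying
	 * memcg can be freed while its pages remain charged.
	 */
	struct obj_cgroup *objcg = get_obj_cgroup_from_current();

	page->memcg_data = (unsigned long)objcg;
}
```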
On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote: > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote: > > > > Since the following patchsets applied. All the kernel memory are charged > > with the new APIs of obj_cgroup. > > > > [v17,00/19] The new cgroup slab memory controller > > [v5,0/7] Use obj_cgroup APIs to charge kmem pages > > > > But user memory allocations (LRU pages) pinning memcgs for a long time - > > it exists at a larger scale and is causing recurring problems in the real > > world: page cache doesn't get reclaimed for a long time, or is used by the > > second, third, fourth, ... instance of the same job that was restarted into > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory, > > and make page reclaim very inefficient. > > > > We can convert LRU pages and most other raw memcg pins to the objcg direction > > to fix this problem, and then the LRU pages will not pin the memcgs. > > > > This patchset aims to make the LRU pages to drop the reference to memory > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number > > of the dying cgroups will not increase if we run the following test script. > > > > ```bash > > #!/bin/bash > > > > cat /proc/cgroups | grep memory > > > > cd /sys/fs/cgroup/memory > > > > for i in range{1..500} > > do > > mkdir test > > echo $$ > test/cgroup.procs > > sleep 60 & > > echo $$ > cgroup.procs > > echo `cat test/cgroup.procs` > cgroup.procs > > rmdir test > > done > > > > cat /proc/cgroups | grep memory > > ``` > > > > Patch 1 aims to fix page charging in page replacement. > > Patch 2-5 are code cleanup and simplification. > > Patch 6-15 convert LRU pages pin to the objcg direction. > > The main concern I have with *just* reparenting LRU pages is that for > the long running systems, the root memcg will become a dumping ground. > In addition a job running multiple times on a machine will see > inconsistent memory usage if it re-accesses the file pages which were > reparented to the root memcg. I agree, but also the reparenting is not the perfect thing in a combination with any memory protections (e.g. memory.low). Imagine the following configuration: workload.slice - workload_gen_1.service memory.min = 30G - workload_gen_2.service memory.min = 30G - workload_gen_3.service memory.min = 30G ... Parent cgroup and several generations of the child cgroup, protected by a memory.low. Once the memory is getting reparented, it's not protected anymore. I guess we need something smarter: e.g. reassign a page to a different cgroup if the page is activated/rotated and is currently on a dying lru. Also, I'm somewhat concerned about the interaction of the reparenting with the writeback and dirty throttling. How does it work together? Thanks!
On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote: > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote: > > > > Since the following patchsets applied. All the kernel memory are charged > > with the new APIs of obj_cgroup. > > > > [v17,00/19] The new cgroup slab memory controller > > [v5,0/7] Use obj_cgroup APIs to charge kmem pages > > > > But user memory allocations (LRU pages) pinning memcgs for a long time - > > it exists at a larger scale and is causing recurring problems in the real > > world: page cache doesn't get reclaimed for a long time, or is used by the > > second, third, fourth, ... instance of the same job that was restarted into > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory, > > and make page reclaim very inefficient. > > > > We can convert LRU pages and most other raw memcg pins to the objcg direction > > to fix this problem, and then the LRU pages will not pin the memcgs. > > > > This patchset aims to make the LRU pages to drop the reference to memory > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number > > of the dying cgroups will not increase if we run the following test script. > > > > ```bash > > #!/bin/bash > > > > cat /proc/cgroups | grep memory > > > > cd /sys/fs/cgroup/memory > > > > for i in range{1..500} > > do > > mkdir test > > echo $$ > test/cgroup.procs > > sleep 60 & > > echo $$ > cgroup.procs > > echo `cat test/cgroup.procs` > cgroup.procs > > rmdir test > > done > > > > cat /proc/cgroups | grep memory > > ``` > > > > Patch 1 aims to fix page charging in page replacement. > > Patch 2-5 are code cleanup and simplification. > > Patch 6-15 convert LRU pages pin to the objcg direction. > > The main concern I have with *just* reparenting LRU pages is that for > the long running systems, the root memcg will become a dumping ground. > In addition a job running multiple times on a machine will see > inconsistent memory usage if it re-accesses the file pages which were > reparented to the root memcg. I don't understand how Muchun's patches are supposed to *change* the behavior the way you are describing it. This IS today's behavior. We have hierarchical accounting, and a page that belongs to a leaf cgroup will automatically belong to all its parents. Further, if you delete a cgroup today, the abandoned cache will stay physically linked to that cgroup, but that zombie group no longer acts as a control structure: it imposes no limit and no protection; the pages will be reclaimed as if it WERE linked to the parent. For all intents and purposes, when you delete a cgroup today, its remaining pages ARE dumped onto the parent. The only difference is that today they pointlessly pin the leaf cgroup as a holding vessel - which is then round-robin'd from the parent during reclaim in order to pretend that all these child pages actually ARE linked to the parent's LRU list. Remember how we used to have every page physically linked to multiple lrus? The leaf cgroup and the root? All pages always belong to the (virtual) LRU list of all ancestor cgroups. The only thing Muchun changes is that they no longer pin a cgroup that has no semantical meaning anymore (because it's neither visible to the user nor exerts any contol over the pages anymore). Maybe I'm missing something that either you or Roman can explain to me. But this series looks like a (rare) pure win. Whether you like the current semantics is a separate discussion IMO. 
> Please note that I do agree with the mentioned problem and we do see > this issue in our fleet. Internally we have a "memcg mount option" > feature which couples a file system with a memcg and all file pages > allocated on that file system will be charged to that memcg. Multiple > instances (concurrent or subsequent) of the job will use that file > system (with a dedicated memcg) without leaving the zombies behind. I > am not pushing for this solution as it comes with its own intricacies > (e.g. if memcg coupled with a file system has a limit, the oom > behavior would be awkward and therefore internally we don't put a > limit on such memcgs). Though I want this to be part of discussion. Right, you disconnect memory from the tasks that are allocating it, and so you can't assign culpability when you need to. OOM is one thing, but there are also CPU cycles and IO bandwidth consumed during reclaim. > I think the underlying reasons behind this issue are: > > 1) Filesystem shared by disjoint jobs. > 2) For job dedicated filesystems, the lifetime of the filesystem is > different from the lifetime of the job. There is also the case of deleting a cgroup just to recreate it right after for the same job. Many job managers do this on restart right now - like systemd, and what we're using in our fleet. This seems avoidable by recycling a group for another instance of the same job. Sharing is a more difficult discussion. If you access a page that you share with another cgroup, it may or may not be subject to your own or your buddy's memory limits. The only limit it is guaranteed to be subjected to is that of your parent. So One thing I could imagine is, instead of having a separate cgroup outside the hierarchy, we would reparent live pages the second they are accessed from a foreign cgroup. And reparent them until you reach the first common ancestor. This way, when you mount a filesystem shared by two jobs, you can put them into a joint subtree, and the root level of this subtree captures all the memory (as well as the reclaim CPU and IO) used by the two jobs - the private portions and the shared portions - and doesn't make them the liability of jobs in the system that DON'T share the same fs. But again, this is a useful discussion to have, but I don't quite see why it's relevant to Muchun's patches. They're purely an optimization. So I'd like to clear that up first before going further.
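A rough sketch of the reparent-on-foreign-access idea described above, assuming it were hooked into a page cache lookup or fault path. Only parent_mem_cgroup() and page_memcg() are existing helpers; is_ancestor(), the hook itself, and recharge_page_to() are hypothetical names for illustration.

```c
/*
 * Sketch of "reparent live pages the second they are accessed from a
 * foreign cgroup, up to the first common ancestor".  parent_mem_cgroup()
 * and page_memcg() exist; everything else here is hypothetical.
 */
#include <linux/memcontrol.h>

/* Hypothetical: move the page's charge and LRU linkage to @target. */
static void recharge_page_to(struct page *page, struct mem_cgroup *target);

static bool is_ancestor(struct mem_cgroup *ancestor, struct mem_cgroup *memcg)
{
	for (; memcg; memcg = parent_mem_cgroup(memcg))
		if (memcg == ancestor)
			return true;
	return false;
}

/* Hypothetical hook on a foreign access, e.g. from a page cache lookup. */
static void maybe_reparent_on_foreign_access(struct page *page,
					     struct mem_cgroup *accessor)
{
	struct mem_cgroup *owner = page_memcg(page);
	struct mem_cgroup *common = accessor;

	if (!owner || owner == accessor)
		return;

	/* Walk up from the accessor to the first cgroup that also contains
	 * the current owner: the first common ancestor of the two. */
	while (common && !is_ancestor(common, owner))
		common = parent_mem_cgroup(common);

	if (common && common != owner)
		recharge_page_to(page, common);
}
```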
On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote: > On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote: > > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote: > > > > > > Since the following patchsets applied. All the kernel memory are charged > > > with the new APIs of obj_cgroup. > > > > > > [v17,00/19] The new cgroup slab memory controller > > > [v5,0/7] Use obj_cgroup APIs to charge kmem pages > > > > > > But user memory allocations (LRU pages) pinning memcgs for a long time - > > > it exists at a larger scale and is causing recurring problems in the real > > > world: page cache doesn't get reclaimed for a long time, or is used by the > > > second, third, fourth, ... instance of the same job that was restarted into > > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory, > > > and make page reclaim very inefficient. > > > > > > We can convert LRU pages and most other raw memcg pins to the objcg direction > > > to fix this problem, and then the LRU pages will not pin the memcgs. > > > > > > This patchset aims to make the LRU pages to drop the reference to memory > > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number > > > of the dying cgroups will not increase if we run the following test script. > > > > > > ```bash > > > #!/bin/bash > > > > > > cat /proc/cgroups | grep memory > > > > > > cd /sys/fs/cgroup/memory > > > > > > for i in range{1..500} > > > do > > > mkdir test > > > echo $$ > test/cgroup.procs > > > sleep 60 & > > > echo $$ > cgroup.procs > > > echo `cat test/cgroup.procs` > cgroup.procs > > > rmdir test > > > done > > > > > > cat /proc/cgroups | grep memory > > > ``` > > > > > > Patch 1 aims to fix page charging in page replacement. > > > Patch 2-5 are code cleanup and simplification. > > > Patch 6-15 convert LRU pages pin to the objcg direction. > > > > The main concern I have with *just* reparenting LRU pages is that for > > the long running systems, the root memcg will become a dumping ground. > > In addition a job running multiple times on a machine will see > > inconsistent memory usage if it re-accesses the file pages which were > > reparented to the root memcg. > > I agree, but also the reparenting is not the perfect thing in a combination > with any memory protections (e.g. memory.low). > > Imagine the following configuration: > workload.slice > - workload_gen_1.service memory.min = 30G > - workload_gen_2.service memory.min = 30G > - workload_gen_3.service memory.min = 30G > ... > > Parent cgroup and several generations of the child cgroup, protected by a memory.low. > Once the memory is getting reparented, it's not protected anymore. That doesn't sound right. A deleted cgroup today exerts no control over its abandoned pages. css_reset() will blow out any control settings. If you're talking about protection previously inherited by workload.slice, that continues to apply as it always has. None of this is really accidental. Per definition the workload.slice control domain includes workload_gen_1.service. And per definition, the workload_gen_1.service domain ceases to exist when you delete it. There are no (or shouldn't be any!) semantic changes from the physical unlinking from a dead control domain. > Also, I'm somewhat concerned about the interaction of the reparenting > with the writeback and dirty throttling. How does it work together? What interaction specifically? 
When you delete a cgroup that had both the block and the memory controller enabled, the control domain of both goes away and it becomes subject to whatever control domain is above it (if any). A higher control domain in turn takes a recursive view of the subtree, see mem_cgroup_wb_stats(), so when control is exerted, it applies regardless of how and where pages are physically linked in children. It's also already possible to enable e.g. block control only at a very high level and memory control down to a lower level. Per design this code can live with different domain sizes for memory and block.
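As a concrete reading of the "recursive view of the subtree" above, here is a hedged sketch that sums per-cgroup dirty/writeback counts over a memcg subtree. It assumes the v5.12-era helpers mem_cgroup_iter() and memcg_page_state_local(); the real mem_cgroup_wb_stats() does not need such a walk because memcg_page_state() already reports hierarchical totals, so this only spells out what those totals mean.

```c
/*
 * Illustration of the recursive view a higher control domain takes:
 * sum the per-cgroup (local) counts over the root and its descendants.
 */
#include <linux/memcontrol.h>
#include <linux/mmzone.h>

static void subtree_wb_stats(struct mem_cgroup *root,
			     unsigned long *dirty, unsigned long *writeback)
{
	struct mem_cgroup *iter;

	*dirty = *writeback = 0;

	/* mem_cgroup_iter() visits the root and every online descendant. */
	for (iter = mem_cgroup_iter(root, NULL, NULL); iter;
	     iter = mem_cgroup_iter(root, iter, NULL)) {
		*dirty     += memcg_page_state_local(iter, NR_FILE_DIRTY);
		*writeback += memcg_page_state_local(iter, NR_WRITEBACK);
	}
}
```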
On Tue, Mar 30, 2021 at 05:30:10PM -0400, Johannes Weiner wrote: > On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote: > > On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote: > > > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote: > > > > > > > > Since the following patchsets applied. All the kernel memory are charged > > > > with the new APIs of obj_cgroup. > > > > > > > > [v17,00/19] The new cgroup slab memory controller > > > > [v5,0/7] Use obj_cgroup APIs to charge kmem pages > > > > > > > > But user memory allocations (LRU pages) pinning memcgs for a long time - > > > > it exists at a larger scale and is causing recurring problems in the real > > > > world: page cache doesn't get reclaimed for a long time, or is used by the > > > > second, third, fourth, ... instance of the same job that was restarted into > > > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory, > > > > and make page reclaim very inefficient. > > > > > > > > We can convert LRU pages and most other raw memcg pins to the objcg direction > > > > to fix this problem, and then the LRU pages will not pin the memcgs. > > > > > > > > This patchset aims to make the LRU pages to drop the reference to memory > > > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number > > > > of the dying cgroups will not increase if we run the following test script. > > > > > > > > ```bash > > > > #!/bin/bash > > > > > > > > cat /proc/cgroups | grep memory > > > > > > > > cd /sys/fs/cgroup/memory > > > > > > > > for i in range{1..500} > > > > do > > > > mkdir test > > > > echo $$ > test/cgroup.procs > > > > sleep 60 & > > > > echo $$ > cgroup.procs > > > > echo `cat test/cgroup.procs` > cgroup.procs > > > > rmdir test > > > > done > > > > > > > > cat /proc/cgroups | grep memory > > > > ``` > > > > > > > > Patch 1 aims to fix page charging in page replacement. > > > > Patch 2-5 are code cleanup and simplification. > > > > Patch 6-15 convert LRU pages pin to the objcg direction. > > > > > > The main concern I have with *just* reparenting LRU pages is that for > > > the long running systems, the root memcg will become a dumping ground. > > > In addition a job running multiple times on a machine will see > > > inconsistent memory usage if it re-accesses the file pages which were > > > reparented to the root memcg. > > > > I agree, but also the reparenting is not the perfect thing in a combination > > with any memory protections (e.g. memory.low). > > > > Imagine the following configuration: > > workload.slice > > - workload_gen_1.service memory.min = 30G > > - workload_gen_2.service memory.min = 30G > > - workload_gen_3.service memory.min = 30G > > ... > > > > Parent cgroup and several generations of the child cgroup, protected by a memory.low. > > Once the memory is getting reparented, it's not protected anymore. > > That doesn't sound right. > > A deleted cgroup today exerts no control over its abandoned > pages. css_reset() will blow out any control settings. I know. Currently it works in the following way: once cgroup gen_1 is deleted, it's memory is not protected anymore, so eventually it's getting evicted and re-faulted as gen_2 (or gen_N) memory. Muchun's patchset doesn't change this, of course. But long-term we likely wanna re-charge such pages to new cgroups and avoid unnecessary evictions and re-faults. Switching to obj_cgroups doesn't help and likely will complicate this change. So I'm a bit skeptical here. 
Also, in my experience the pagecache is not the main/worst memcg reference holder (writeback structures are way worse). Pages are relatively large (in comparison to some slab objects), and rarely there is only one page pinning a separate memcg. And switching to obj_cgroup doesn't completely eliminate the problem: we just switch from accumulating larger mem_cgroups to accumulating smaller obj_cgroups. With all this said, I'm not necessarily opposing the patchset, but it's necessary to discuss how it fits into the long-term picture. E.g. if we're going to use obj_cgroup API for page-sized objects, shouldn't we split it back into the reparenting and bytes-sized accounting parts, as I initially suggested. And shouldn't we move the reparenting part to the cgroup core level, so we could use it for other controllers (e.g. to fix the writeback problem). > > If you're talking about protection previously inherited by > workload.slice, that continues to apply as it always has. > > None of this is really accidental. Per definition the workload.slice > control domain includes workload_gen_1.service. And per definition, > the workload_gen_1.service domain ceases to exist when you delete it. > > There are no (or shouldn't be any!) semantic changes from the physical > unlinking from a dead control domain. > > > Also, I'm somewhat concerned about the interaction of the reparenting > > with the writeback and dirty throttling. How does it work together? > > What interaction specifically? > > When you delete a cgroup that had both the block and the memory > controller enabled, the control domain of both goes away and it > becomes subject to whatever control domain is above it (if any). > > A higher control domain in turn takes a recursive view of the subtree, > see mem_cgroup_wb_stats(), so when control is exerted, it applies > regardless of how and where pages are physically linked in children. > > It's also already possible to enable e.g. block control only at a very > high level and memory control down to a lower level. Per design this > code can live with different domain sizes for memory and block. I'm totally happy if it's safe, I just don't know this code well enough to be sure without taking a closer look.
On Tue, Mar 30, 2021 at 2:10 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > [...] > > The main concern I have with *just* reparenting LRU pages is that for > > the long running systems, the root memcg will become a dumping ground. > > In addition a job running multiple times on a machine will see > > inconsistent memory usage if it re-accesses the file pages which were > > reparented to the root memcg. > > I don't understand how Muchun's patches are supposed to *change* the > behavior the way you are describing it. This IS today's behavior. > > We have hierarchical accounting, and a page that belongs to a leaf > cgroup will automatically belong to all its parents. > > Further, if you delete a cgroup today, the abandoned cache will stay > physically linked to that cgroup, but that zombie group no longer acts > as a control structure: it imposes no limit and no protection; the > pages will be reclaimed as if it WERE linked to the parent. > > For all intents and purposes, when you delete a cgroup today, its > remaining pages ARE dumped onto the parent. > > The only difference is that today they pointlessly pin the leaf cgroup > as a holding vessel - which is then round-robin'd from the parent > during reclaim in order to pretend that all these child pages actually > ARE linked to the parent's LRU list. > > Remember how we used to have every page physically linked to multiple > lrus? The leaf cgroup and the root? > > All pages always belong to the (virtual) LRU list of all ancestor > cgroups. The only thing Muchun changes is that they no longer pin a > cgroup that has no semantical meaning anymore (because it's neither > visible to the user nor exerts any contol over the pages anymore). > Indeed you are right. Even if the physical representation of the tree has changed, the logical picture remains the same. [Subconsciously I was sad that we will lose the information about the origin memcg of the page for debugging purposes but then I thought if we really need it then we can just add that metadata in the obj_cgroup object. So, never mind.] > Maybe I'm missing something that either you or Roman can explain to > me. But this series looks like a (rare) pure win. > > Whether you like the current semantics is a separate discussion IMO. > > > Please note that I do agree with the mentioned problem and we do see > > this issue in our fleet. Internally we have a "memcg mount option" > > feature which couples a file system with a memcg and all file pages > > allocated on that file system will be charged to that memcg. Multiple > > instances (concurrent or subsequent) of the job will use that file > > system (with a dedicated memcg) without leaving the zombies behind. I > > am not pushing for this solution as it comes with its own intricacies > > (e.g. if memcg coupled with a file system has a limit, the oom > > behavior would be awkward and therefore internally we don't put a > > limit on such memcgs). Though I want this to be part of discussion. > > Right, you disconnect memory from the tasks that are allocating it, > and so you can't assign culpability when you need to. > > OOM is one thing, but there are also CPU cycles and IO bandwidth > consumed during reclaim. > We didn't really have any issue regarding CPU or IO but that might be due to our unique setup (i.e. no local disk). > > I think the underlying reasons behind this issue are: > > > > 1) Filesystem shared by disjoint jobs. > > 2) For job dedicated filesystems, the lifetime of the filesystem is > > different from the lifetime of the job. 
> > There is also the case of deleting a cgroup just to recreate it right > after for the same job. Many job managers do this on restart right now > - like systemd, and what we're using in our fleet. This seems > avoidable by recycling a group for another instance of the same job. I was bundling the scenario you mentioned with (2) i.e. the filesystem persists across multiple subsequent instances of the same job. > > Sharing is a more difficult discussion. If you access a page that you > share with another cgroup, it may or may not be subject to your own or > your buddy's memory limits. The only limit it is guaranteed to be > subjected to is that of your parent. So One thing I could imagine is, > instead of having a separate cgroup outside the hierarchy, we would > reparent live pages the second they are accessed from a foreign > cgroup. And reparent them until you reach the first common ancestor. > > This way, when you mount a filesystem shared by two jobs, you can put > them into a joint subtree, and the root level of this subtree captures > all the memory (as well as the reclaim CPU and IO) used by the two > jobs - the private portions and the shared portions - and doesn't make > them the liability of jobs in the system that DON'T share the same fs. I will give more thought on this idea and see where it goes. > > But again, this is a useful discussion to have, but I don't quite see > why it's relevant to Muchun's patches. They're purely an optimization. > > So I'd like to clear that up first before going further. I think we are on the same page i.e. these patches change the physical representation of the memcg tree but logically it remains the same and fixes the zombie memcg issue.
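On the debugging aside above, one purely hypothetical way to keep the origin information after reparenting would be to stash it in the obj_cgroup itself, e.g.:

```c
/* Hypothetical only -- not part of the series: remember which memcg an
 * obj_cgroup was originally created for, for debugging reparented pages. */
struct obj_cgroup_dbg {
	struct obj_cgroup objcg;	/* the real linkage object */
	u64 origin_cgrp_ino;		/* cgroup_ino() of the original memcg */
};
```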
On Tue, Mar 30, 2021 at 05:10:04PM -0400, Johannes Weiner wrote: > On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote: > > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote: > > > > > > Since the following patchsets applied. All the kernel memory are charged > > > with the new APIs of obj_cgroup. > > > > > > [v17,00/19] The new cgroup slab memory controller > > > [v5,0/7] Use obj_cgroup APIs to charge kmem pages > > > > > > But user memory allocations (LRU pages) pinning memcgs for a long time - > > > it exists at a larger scale and is causing recurring problems in the real > > > world: page cache doesn't get reclaimed for a long time, or is used by the > > > second, third, fourth, ... instance of the same job that was restarted into > > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory, > > > and make page reclaim very inefficient. > > > > > > We can convert LRU pages and most other raw memcg pins to the objcg direction > > > to fix this problem, and then the LRU pages will not pin the memcgs. > > > > > > This patchset aims to make the LRU pages to drop the reference to memory > > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number > > > of the dying cgroups will not increase if we run the following test script. > > > > > > ```bash > > > #!/bin/bash > > > > > > cat /proc/cgroups | grep memory > > > > > > cd /sys/fs/cgroup/memory > > > > > > for i in range{1..500} > > > do > > > mkdir test > > > echo $$ > test/cgroup.procs > > > sleep 60 & > > > echo $$ > cgroup.procs > > > echo `cat test/cgroup.procs` > cgroup.procs > > > rmdir test > > > done > > > > > > cat /proc/cgroups | grep memory > > > ``` > > > > > > Patch 1 aims to fix page charging in page replacement. > > > Patch 2-5 are code cleanup and simplification. > > > Patch 6-15 convert LRU pages pin to the objcg direction. > > > > The main concern I have with *just* reparenting LRU pages is that for > > the long running systems, the root memcg will become a dumping ground. > > In addition a job running multiple times on a machine will see > > inconsistent memory usage if it re-accesses the file pages which were > > reparented to the root memcg. > > I don't understand how Muchun's patches are supposed to *change* the > behavior the way you are describing it. This IS today's behavior. > > We have hierarchical accounting, and a page that belongs to a leaf > cgroup will automatically belong to all its parents. > > Further, if you delete a cgroup today, the abandoned cache will stay > physically linked to that cgroup, but that zombie group no longer acts > as a control structure: it imposes no limit and no protection; the > pages will be reclaimed as if it WERE linked to the parent. > > For all intents and purposes, when you delete a cgroup today, its > remaining pages ARE dumped onto the parent. > > The only difference is that today they pointlessly pin the leaf cgroup > as a holding vessel - which is then round-robin'd from the parent > during reclaim in order to pretend that all these child pages actually > ARE linked to the parent's LRU list. > > Remember how we used to have every page physically linked to multiple > lrus? The leaf cgroup and the root? > > All pages always belong to the (virtual) LRU list of all ancestor > cgroups. The only thing Muchun changes is that they no longer pin a > cgroup that has no semantical meaning anymore (because it's neither > visible to the user nor exerts any contol over the pages anymore). 
> > Maybe I'm missing something that either you or Roman can explain to > me. But this series looks like a (rare) pure win. > > Whether you like the current semantics is a separate discussion IMO. > > > Please note that I do agree with the mentioned problem and we do see > > this issue in our fleet. Internally we have a "memcg mount option" > > feature which couples a file system with a memcg and all file pages > > allocated on that file system will be charged to that memcg. Multiple > > instances (concurrent or subsequent) of the job will use that file > > system (with a dedicated memcg) without leaving the zombies behind. I > > am not pushing for this solution as it comes with its own intricacies > > (e.g. if memcg coupled with a file system has a limit, the oom > > behavior would be awkward and therefore internally we don't put a > > limit on such memcgs). Though I want this to be part of discussion. > > Right, you disconnect memory from the tasks that are allocating it, > and so you can't assign culpability when you need to. > > OOM is one thing, but there are also CPU cycles and IO bandwidth > consumed during reclaim. > > > I think the underlying reasons behind this issue are: > > > > 1) Filesystem shared by disjoint jobs. > > 2) For job dedicated filesystems, the lifetime of the filesystem is > > different from the lifetime of the job. > > There is also the case of deleting a cgroup just to recreate it right > after for the same job. Many job managers do this on restart right now > - like systemd, and what we're using in our fleet. This seems > avoidable by recycling a group for another instance of the same job. > > Sharing is a more difficult discussion. If you access a page that you > share with another cgroup, it may or may not be subject to your own or > your buddy's memory limits. The only limit it is guaranteed to be > subjected to is that of your parent. So One thing I could imagine is, > instead of having a separate cgroup outside the hierarchy, we would > reparent live pages the second they are accessed from a foreign > cgroup. And reparent them until you reach the first common ancestor. > > This way, when you mount a filesystem shared by two jobs, you can put > them into a joint subtree, and the root level of this subtree captures > all the memory (as well as the reclaim CPU and IO) used by the two > jobs - the private portions and the shared portions - and doesn't make > them the liability of jobs in the system that DON'T share the same fs. > > But again, this is a useful discussion to have, but I don't quite see > why it's relevant to Muchun's patches. They're purely an optimization. > > So I'd like to clear that up first before going further. > I suspect a lot of the issue really is the lack of lockstepping between a page (unmapped page cache) and the corresponding memcgroups lifecycle. When we delete a memcgroup, we sort of lose accounting (depending on the inheriting parent) and ideally we want to bring back the accounting when the page is reused in a different cgroup (almost like first touch). I would like to look at the patches and see if they do solve the issue that leads to zombie cgroups hanging around. In my experience, the combination of namespaces and number of cgroups (several of which could be zombies), does not scale well. Balbir Singh.
On Tue, Mar 30, 2021 at 03:05:42PM -0700, Roman Gushchin wrote: > On Tue, Mar 30, 2021 at 05:30:10PM -0400, Johannes Weiner wrote: > > On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote: > > > On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote: > > > > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote: > > > > > > > > > > Since the following patchsets applied. All the kernel memory are charged > > > > > with the new APIs of obj_cgroup. > > > > > > > > > > [v17,00/19] The new cgroup slab memory controller > > > > > [v5,0/7] Use obj_cgroup APIs to charge kmem pages > > > > > > > > > > But user memory allocations (LRU pages) pinning memcgs for a long time - > > > > > it exists at a larger scale and is causing recurring problems in the real > > > > > world: page cache doesn't get reclaimed for a long time, or is used by the > > > > > second, third, fourth, ... instance of the same job that was restarted into > > > > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory, > > > > > and make page reclaim very inefficient. > > > > > > > > > > We can convert LRU pages and most other raw memcg pins to the objcg direction > > > > > to fix this problem, and then the LRU pages will not pin the memcgs. > > > > > > > > > > This patchset aims to make the LRU pages to drop the reference to memory > > > > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number > > > > > of the dying cgroups will not increase if we run the following test script. > > > > > > > > > > ```bash > > > > > #!/bin/bash > > > > > > > > > > cat /proc/cgroups | grep memory > > > > > > > > > > cd /sys/fs/cgroup/memory > > > > > > > > > > for i in range{1..500} > > > > > do > > > > > mkdir test > > > > > echo $$ > test/cgroup.procs > > > > > sleep 60 & > > > > > echo $$ > cgroup.procs > > > > > echo `cat test/cgroup.procs` > cgroup.procs > > > > > rmdir test > > > > > done > > > > > > > > > > cat /proc/cgroups | grep memory > > > > > ``` > > > > > > > > > > Patch 1 aims to fix page charging in page replacement. > > > > > Patch 2-5 are code cleanup and simplification. > > > > > Patch 6-15 convert LRU pages pin to the objcg direction. > > > > > > > > The main concern I have with *just* reparenting LRU pages is that for > > > > the long running systems, the root memcg will become a dumping ground. > > > > In addition a job running multiple times on a machine will see > > > > inconsistent memory usage if it re-accesses the file pages which were > > > > reparented to the root memcg. > > > > > > I agree, but also the reparenting is not the perfect thing in a combination > > > with any memory protections (e.g. memory.low). > > > > > > Imagine the following configuration: > > > workload.slice > > > - workload_gen_1.service memory.min = 30G > > > - workload_gen_2.service memory.min = 30G > > > - workload_gen_3.service memory.min = 30G > > > ... > > > > > > Parent cgroup and several generations of the child cgroup, protected by a memory.low. > > > Once the memory is getting reparented, it's not protected anymore. > > > > That doesn't sound right. > > > > A deleted cgroup today exerts no control over its abandoned > > pages. css_reset() will blow out any control settings. > > I know. Currently it works in the following way: once cgroup gen_1 is deleted, > it's memory is not protected anymore, so eventually it's getting evicted and > re-faulted as gen_2 (or gen_N) memory. Muchun's patchset doesn't change this, > of course. 
But long-term we likely wanna re-charge such pages to new cgroups > and avoid unnecessary evictions and re-faults. Switching to obj_cgroups doesn't > help and likely will complicate this change. So I'm a bit skeptical here. We should be careful with the long-term plans. The zombie issue is a pretty urgent concern that has caused several production emergencies now. It needs a fix sooner rather than later. The long-term plans of how to handle shared/reused data better will require some time to work out. There are MANY open questions around recharging to arbitrary foreign cgroup users. Like how to identify accesses after the page's cgroup has been deleted: Do we need to annotate every page cache lookup? Do we need to inject minor faults to recharge mmapped pages? We can't wait for this discussion to complete. I also think the objcg is helping with that direction rather than getting in the way because: - The old charge moving code we have for LRU pages isn't reusable anyway. It relies on optimistic locking to work, and may leave memory behind in arbitrary and unpredictable ways. After a few cycles, objects tend to be spread all over the place. The objcg provides a new synchronization scheme that will always work because the readside (page/object to memcg lookup) needs to be prepared for the memcg to change and/or die at any time. - There isn't much point in recharging only some of the abandoned memory. We've tried per-allocation class reparenting and it didn't work out too well. Newly accounted allocations crop up faster than we can conjure up tailor-made reparenting schemes for them. The objcg provides a generic reference and reassignment scheme that can be used by all tracked objects. Importantly, once we have a central thing as LRU pages converted, we can require all new allocation tracking to use objcg from the start. > Also, in my experience the pagecache is not the main/worst memcg reference > holder (writeback structures are way worse). Pages are relatively large > (in comparison to some slab objects), and rarely there is only one page pinning > a separate memcg. I've seen that exact thing cause zombies to pile up: one or two pages in the old group, pinned by the next instance of a job. If the job has a sufficiently large working set, this can pin a LOT of dead cgroups. Is it the biggest or most problematic source of buildups? Maybe not. But there is definitely cause to fix it. LRU pages are also the most difficult to convert because of their rich interactions. It's a good test of the api. If it works for those pages, converting everybody else will be much simpler. And again, as the core memory consumer it sets the tone of how memcg rmapping is supposed to work for new and existing object types. This helps align ongoing development. > And switching to obj_cgroup doesn't completely eliminate the > problem: we just switch from accumulating larger mem_cgroups to > accumulating smaller obj_cgroups. In your own words, the discrepancy between tiny objects pinning large memcgs is a problem. objcgs are smaller than most objects, so it's not much different as if an object were simply a few bytes bigger in size. A memcg on the other hand is vastly bigger than most objects. It's also composed of many allocations and so causes more fragmentation. Another issue is that memcgs with abandoned objects need to be visited by the reclaimer on every single reclaim walk, a hot path. The list of objcgs on the other hand is only iterated when the cgroup is deleted, which is not a fast path. 
It's also rare that parents with many dead children are deleted at all (system.slice, workload.slice etc.) So no, I would say for all intents and purposes, it fixes all the problems we're having with zombie memcgs. > With all this said, I'm not necessarily opposing the patchset, but it's > necessary to discuss how it fits into the long-term picture. > E.g. if we're going to use obj_cgroup API for page-sized objects, shouldn't > we split it back into the reparenting and bytes-sized accounting parts, > as I initially suggested. And shouldn't we move the reparenting part to > the cgroup core level, so we could use it for other controllers > (e.g. to fix the writeback problem). Yes, I do think we want to generalize it. But I wouldn't say it's a requirement for these patches, either: - The byte-sized accounting part is one atomic_t. That's 4 bytes pinned unnecessarily - compared to an entire struct memcg right now which includes memory accounting, swap accounting, byte accounting, and a whole lot of other things likely not used by the stale object. - The cgroup generalization is a good idea too, but that doesn't really change the callsites either. Unless you were thinking of renaming, but objcg seems like a good, generic fit for a name to describe the linkage between objects to a cgroup. The memcg member will need to change into something generic (a css_set type mapping), but that can likely be hidden behind page_memcg(), objcg_memcg() and similar helpers. Both of these aspects can be improved incrementally.
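For reference, this is the linkage object being discussed, with the single byte counter mentioned above (fields as in v5.12-era include/linux/memcontrol.h), followed by a sketch of the readside pattern: page_objcg() is what this series would provide for LRU pages, obj_cgroup_memcg() already exists, and the memcg pointer is only stable under RCU.

```c
struct obj_cgroup {
	struct percpu_ref refcnt;	/* pins the objcg itself, not the memcg */
	struct mem_cgroup *memcg;	/* re-pointed at an ancestor on offline */
	atomic_t nr_charged_bytes;	/* byte-sized remainder accounting */
	union {
		struct list_head list;	/* linked into memcg->objcg_list */
		struct rcu_head rcu;
	};
};

/* Readside sketch: the owning memcg can change or die at any time, so it
 * is only dereferenced under RCU (or with an explicit css reference). */
static struct mem_cgroup *sketch_page_memcg(struct page *page)
{
	struct mem_cgroup *memcg;

	rcu_read_lock();
	memcg = obj_cgroup_memcg(page_objcg(page));
	/* Use memcg only inside this RCU section, or take a css ref first. */
	rcu_read_unlock();

	return memcg;
}
```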
On Wed, Mar 31, 2021 at 11:17 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Tue, Mar 30, 2021 at 03:05:42PM -0700, Roman Gushchin wrote: > > On Tue, Mar 30, 2021 at 05:30:10PM -0400, Johannes Weiner wrote: > > > On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote: > > > > On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote: > > > > > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote: > > > > > > > > > > > > Since the following patchsets applied. All the kernel memory are charged > > > > > > with the new APIs of obj_cgroup. > > > > > > > > > > > > [v17,00/19] The new cgroup slab memory controller > > > > > > [v5,0/7] Use obj_cgroup APIs to charge kmem pages > > > > > > > > > > > > But user memory allocations (LRU pages) pinning memcgs for a long time - > > > > > > it exists at a larger scale and is causing recurring problems in the real > > > > > > world: page cache doesn't get reclaimed for a long time, or is used by the > > > > > > second, third, fourth, ... instance of the same job that was restarted into > > > > > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory, > > > > > > and make page reclaim very inefficient. > > > > > > > > > > > > We can convert LRU pages and most other raw memcg pins to the objcg direction > > > > > > to fix this problem, and then the LRU pages will not pin the memcgs. > > > > > > > > > > > > This patchset aims to make the LRU pages to drop the reference to memory > > > > > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number > > > > > > of the dying cgroups will not increase if we run the following test script. > > > > > > > > > > > > ```bash > > > > > > #!/bin/bash > > > > > > > > > > > > cat /proc/cgroups | grep memory > > > > > > > > > > > > cd /sys/fs/cgroup/memory > > > > > > > > > > > > for i in range{1..500} > > > > > > do > > > > > > mkdir test > > > > > > echo $$ > test/cgroup.procs > > > > > > sleep 60 & > > > > > > echo $$ > cgroup.procs > > > > > > echo `cat test/cgroup.procs` > cgroup.procs > > > > > > rmdir test > > > > > > done > > > > > > > > > > > > cat /proc/cgroups | grep memory > > > > > > ``` > > > > > > > > > > > > Patch 1 aims to fix page charging in page replacement. > > > > > > Patch 2-5 are code cleanup and simplification. > > > > > > Patch 6-15 convert LRU pages pin to the objcg direction. > > > > > > > > > > The main concern I have with *just* reparenting LRU pages is that for > > > > > the long running systems, the root memcg will become a dumping ground. > > > > > In addition a job running multiple times on a machine will see > > > > > inconsistent memory usage if it re-accesses the file pages which were > > > > > reparented to the root memcg. > > > > > > > > I agree, but also the reparenting is not the perfect thing in a combination > > > > with any memory protections (e.g. memory.low). > > > > > > > > Imagine the following configuration: > > > > workload.slice > > > > - workload_gen_1.service memory.min = 30G > > > > - workload_gen_2.service memory.min = 30G > > > > - workload_gen_3.service memory.min = 30G > > > > ... > > > > > > > > Parent cgroup and several generations of the child cgroup, protected by a memory.low. > > > > Once the memory is getting reparented, it's not protected anymore. > > > > > > That doesn't sound right. > > > > > > A deleted cgroup today exerts no control over its abandoned > > > pages. css_reset() will blow out any control settings. > > > > I know. 
Currently it works in the following way: once cgroup gen_1 is deleted, > > it's memory is not protected anymore, so eventually it's getting evicted and > > re-faulted as gen_2 (or gen_N) memory. Muchun's patchset doesn't change this, > > of course. But long-term we likely wanna re-charge such pages to new cgroups > > and avoid unnecessary evictions and re-faults. Switching to obj_cgroups doesn't > > help and likely will complicate this change. So I'm a bit skeptical here. > > We should be careful with the long-term plans. > > The zombie issue is a pretty urgent concern that has caused several > production emergencies now. It needs a fix sooner rather than later. Thank you very much for clarifying the problem for me. I do agree with you. This issue should be fixed ASAP. Using objcg is a good choice. Dying objcg should not be a problem. Because the size of objcg is so small compared to memcg. Thanks. > > The long-term plans of how to handle shared/reused data better will > require some time to work out. There are MANY open questions around > recharging to arbitrary foreign cgroup users. Like how to identify > accesses after the page's cgroup has been deleted: Do we need to > annotate every page cache lookup? Do we need to inject minor faults to > recharge mmapped pages? We can't wait for this discussion to complete. > > I also think the objcg is helping with that direction rather than > getting in the way because: > > - The old charge moving code we have for LRU pages isn't reusable > anyway. It relies on optimistic locking to work, and may leave > memory behind in arbitrary and unpredictable ways. After a few > cycles, objects tend to be spread all over the place. > > The objcg provides a new synchronization scheme that will always > work because the readside (page/object to memcg lookup) needs to be > prepared for the memcg to change and/or die at any time. > > - There isn't much point in recharging only some of the abandoned > memory. We've tried per-allocation class reparenting and it didn't > work out too well. Newly accounted allocations crop up faster than > we can conjure up tailor-made reparenting schemes for them. > > The objcg provides a generic reference and reassignment scheme that > can be used by all tracked objects. > > Importantly, once we have a central thing as LRU pages converted, we > can require all new allocation tracking to use objcg from the start. > > > Also, in my experience the pagecache is not the main/worst memcg reference > > holder (writeback structures are way worse). Pages are relatively large > > (in comparison to some slab objects), and rarely there is only one page pinning > > a separate memcg. > > I've seen that exact thing cause zombies to pile up: one or two pages > in the old group, pinned by the next instance of a job. If the job has > a sufficiently large working set, this can pin a LOT of dead > cgroups. Is it the biggest or most problematic source of buildups? > Maybe not. But there is definitely cause to fix it. > > LRU pages are also the most difficult to convert because of their rich > interactions. It's a good test of the api. If it works for those > pages, converting everybody else will be much simpler. > > And again, as the core memory consumer it sets the tone of how memcg > rmapping is supposed to work for new and existing object types. This > helps align ongoing development. 
> > > And switching to obj_cgroup doesn't completely eliminate the > > problem: we just switch from accumulating larger mem_cgroups to > > accumulating smaller obj_cgroups. > > In your own words, the discrepancy between tiny objects pinning large > memcgs is a problem. objcgs are smaller than most objects, so it's not > much different as if an object were simply a few bytes bigger in size. > A memcg on the other hand is vastly bigger than most objects. It's > also composed of many allocations and so causes more fragmentation. > > Another issue is that memcgs with abandoned objects need to be visited > by the reclaimer on every single reclaim walk, a hot path. The list of > objcgs on the other hand is only iterated when the cgroup is deleted, > which is not a fast path. It's also rare that parents with many dead > children are deleted at all (system.slice, workload.slice etc.) > > So no, I would say for all intents and purposes, it fixes all the > problems we're having with zombie memcgs. > > > With all this said, I'm not necessarily opposing the patchset, but it's > > necessary to discuss how it fits into the long-term picture. > > E.g. if we're going to use obj_cgroup API for page-sized objects, shouldn't > > we split it back into the reparenting and bytes-sized accounting parts, > > as I initially suggested. And shouldn't we move the reparenting part to > > the cgroup core level, so we could use it for other controllers > > (e.g. to fix the writeback problem). > > Yes, I do think we want to generalize it. But I wouldn't say it's a > requirement for these patches, either: > > - The byte-sized accounting part is one atomic_t. That's 4 bytes > pinned unnecessarily - compared to an entire struct memcg right now > which includes memory accounting, swap accounting, byte accounting, > and a whole lot of other things likely not used by the stale object. > > - The cgroup generalization is a good idea too, but that doesn't > really change the callsites either. Unless you were thinking of > renaming, but objcg seems like a good, generic fit for a name to > describe the linkage between objects to a cgroup. > > The memcg member will need to change into something generic (a > css_set type mapping), but that can likely be hidden behind > page_memcg(), objcg_memcg() and similar helpers. > > Both of these aspects can be improved incrementally.
On Thu, Apr 1, 2021 at 9:08 AM Muchun Song <songmuchun@bytedance.com> wrote: > [...] > > The zombie issue is a pretty urgent concern that has caused several > > production emergencies now. It needs a fix sooner rather than later. > > Thank you very much for clarifying the problem for me. I do agree > with you. This issue should be fixed ASAP. Using objcg is a good > choice. Dying objcg should not be a problem. Because the size of > objcg is so small compared to memcg. > Just wanted to say out loud that yes, this patchset will reduce the memcg zombie issue, but it is not the final destination. We should continue the discussions on sharing/reusing scenarios. Muchun, can you please also CC Hugh Dickins and Alex Shi in the next version of your patchset?
On Fri, Apr 02, 2021 at 12:07:42AM +0800, Muchun Song wrote: > On Wed, Mar 31, 2021 at 11:17 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Tue, Mar 30, 2021 at 03:05:42PM -0700, Roman Gushchin wrote: > > > On Tue, Mar 30, 2021 at 05:30:10PM -0400, Johannes Weiner wrote: > > > > On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote: > > > > > On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote: > > > > > > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote: > > > > > > > > > > > > > > Since the following patchsets applied. All the kernel memory are charged > > > > > > > with the new APIs of obj_cgroup. > > > > > > > > > > > > > > [v17,00/19] The new cgroup slab memory controller > > > > > > > [v5,0/7] Use obj_cgroup APIs to charge kmem pages > > > > > > > > > > > > > > But user memory allocations (LRU pages) pinning memcgs for a long time - > > > > > > > it exists at a larger scale and is causing recurring problems in the real > > > > > > > world: page cache doesn't get reclaimed for a long time, or is used by the > > > > > > > second, third, fourth, ... instance of the same job that was restarted into > > > > > > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory, > > > > > > > and make page reclaim very inefficient. > > > > > > > > > > > > > > We can convert LRU pages and most other raw memcg pins to the objcg direction > > > > > > > to fix this problem, and then the LRU pages will not pin the memcgs. > > > > > > > > > > > > > > This patchset aims to make the LRU pages to drop the reference to memory > > > > > > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number > > > > > > > of the dying cgroups will not increase if we run the following test script. > > > > > > > > > > > > > > ```bash > > > > > > > #!/bin/bash > > > > > > > > > > > > > > cat /proc/cgroups | grep memory > > > > > > > > > > > > > > cd /sys/fs/cgroup/memory > > > > > > > > > > > > > > for i in range{1..500} > > > > > > > do > > > > > > > mkdir test > > > > > > > echo $$ > test/cgroup.procs > > > > > > > sleep 60 & > > > > > > > echo $$ > cgroup.procs > > > > > > > echo `cat test/cgroup.procs` > cgroup.procs > > > > > > > rmdir test > > > > > > > done > > > > > > > > > > > > > > cat /proc/cgroups | grep memory > > > > > > > ``` > > > > > > > > > > > > > > Patch 1 aims to fix page charging in page replacement. > > > > > > > Patch 2-5 are code cleanup and simplification. > > > > > > > Patch 6-15 convert LRU pages pin to the objcg direction. > > > > > > > > > > > > The main concern I have with *just* reparenting LRU pages is that for > > > > > > the long running systems, the root memcg will become a dumping ground. > > > > > > In addition a job running multiple times on a machine will see > > > > > > inconsistent memory usage if it re-accesses the file pages which were > > > > > > reparented to the root memcg. > > > > > > > > > > I agree, but also the reparenting is not the perfect thing in a combination > > > > > with any memory protections (e.g. memory.low). > > > > > > > > > > Imagine the following configuration: > > > > > workload.slice > > > > > - workload_gen_1.service memory.min = 30G > > > > > - workload_gen_2.service memory.min = 30G > > > > > - workload_gen_3.service memory.min = 30G > > > > > ... > > > > > > > > > > Parent cgroup and several generations of the child cgroup, protected by a memory.low. > > > > > Once the memory is getting reparented, it's not protected anymore. 
> > > > > > > > That doesn't sound right. > > > > > > > > A deleted cgroup today exerts no control over its abandoned > > > > pages. css_reset() will blow out any control settings. > > > > > > I know. Currently it works in the following way: once cgroup gen_1 is deleted, > > > it's memory is not protected anymore, so eventually it's getting evicted and > > > re-faulted as gen_2 (or gen_N) memory. Muchun's patchset doesn't change this, > > > of course. But long-term we likely wanna re-charge such pages to new cgroups > > > and avoid unnecessary evictions and re-faults. Switching to obj_cgroups doesn't > > > help and likely will complicate this change. So I'm a bit skeptical here. > > > > We should be careful with the long-term plans. > > > > The zombie issue is a pretty urgent concern that has caused several > > production emergencies now. It needs a fix sooner rather than later. > > Thank you very much for clarifying the problem for me. I do agree > with you. This issue should be fixed ASAP. Using objcg is a good > choice. Dying objcg should not be a problem. Because the size of > objcg is so small compared to memcg. It would be nice to see some data from real-life workloads showing an improvement in the memory usage/reclaim efficiency and also not showing any performance degradation. I agree that it would be nice to fix this issue, but let's not exaggerate it's urgency: it was here for years and nothing really changed. I'm not proposing to postpone it for years, but there were valid questions raised how it fits into the bigger picture (e.g. other controllers). Indeed we added an accounting of different objects, but it's not directly connected to LRU pages. Thanks!
On Wed, Mar 31, 2021 at 8:17 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Tue, Mar 30, 2021 at 03:05:42PM -0700, Roman Gushchin wrote: > > > On Tue, Mar 30, 2021 at 05:30:10PM -0400, Johannes Weiner wrote: > > > > On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote: > > > > > On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote: > > > > > > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote: > > > > > > [...] > > > > > [...] > > > > I agree, but also the reparenting is not the perfect thing in a combination > > > > with any memory protections (e.g. memory.low). > > > > > > > > Imagine the following configuration: > > > > workload.slice > > > > - workload_gen_1.service memory.min = 30G > > > > - workload_gen_2.service memory.min = 30G > > > > - workload_gen_3.service memory.min = 30G > > > > ... > > > > > > > > Parent cgroup and several generations of the child cgroup, protected by a memory.low. > > > > Once the memory is getting reparented, it's not protected anymore. > > > > > > That doesn't sound right. > > > > > > A deleted cgroup today exerts no control over its abandoned > > > pages. css_reset() will blow out any control settings. > > I know.
Currently it works in the following way: once cgroup gen_1 is deleted, > > it's memory is not protected anymore, so eventually it's getting evicted and > > re-faulted as gen_2 (or gen_N) memory. Muchun's patchset doesn't change this, > > of course. But long-term we likely wanna re-charge such pages to new cgroups > > and avoid unnecessary evictions and re-faults. Switching to obj_cgroups doesn't > > help and likely will complicate this change. So I'm a bit skeptical here. > > We should be careful with the long-term plans. Excuse me for a dumb question. I recall we did reparent LRU pages before (before 4.x kernel). I vaguely recall there were some tricky race conditions during reparenting so we didn't do it anymore once reclaimer could reclaim from offlined memcgs. My memory may be wrong, if it is not so please feel free to correct me. If my memory is true, it means the race conditions are gone? Or the new obj_cgroup APIs made life easier? Thanks, Yang > > The zombie issue is a pretty urgent concern that has caused several > production emergencies now. It needs a fix sooner rather than later. > > The long-term plans of how to handle shared/reused data better will > require some time to work out. There are MANY open questions around > recharging to arbitrary foreign cgroup users. Like how to identify > accesses after the page's cgroup has been deleted: Do we need to > annotate every page cache lookup? Do we need to inject minor faults to > recharge mmapped pages? We can't wait for this discussion to complete. > > I also think the objcg is helping with that direction rather than > getting in the way because: > > - The old charge moving code we have for LRU pages isn't reusable > anyway. It relies on optimistic locking to work, and may leave > memory behind in arbitrary and unpredictable ways. After a few > cycles, objects tend to be spread all over the place. > > The objcg provides a new synchronization scheme that will always > work because the readside (page/object to memcg lookup) needs to be > prepared for the memcg to change and/or die at any time. > > - There isn't much point in recharging only some of the abandoned > memory. We've tried per-allocation class reparenting and it didn't > work out too well. Newly accounted allocations crop up faster than > we can conjure up tailor-made reparenting schemes for them. > > The objcg provides a generic reference and reassignment scheme that > can be used by all tracked objects. > > Importantly, once we have a central thing as LRU pages converted, we > can require all new allocation tracking to use objcg from the start. > > > Also, in my experience the pagecache is not the main/worst memcg reference > > holder (writeback structures are way worse). Pages are relatively large > > (in comparison to some slab objects), and rarely there is only one page pinning > > a separate memcg. > > I've seen that exact thing cause zombies to pile up: one or two pages > in the old group, pinned by the next instance of a job. If the job has > a sufficiently large working set, this can pin a LOT of dead > cgroups. Is it the biggest or most problematic source of buildups? > Maybe not. But there is definitely cause to fix it. > > LRU pages are also the most difficult to convert because of their rich > interactions. It's a good test of the api. If it works for those > pages, converting everybody else will be much simpler. > > And again, as the core memory consumer it sets the tone of how memcg > rmapping is supposed to work for new and existing object types. 
This > helps align ongoing development. > > > And switching to obj_cgroup doesn't completely eliminate the > > problem: we just switch from accumulating larger mem_cgroups to > > accumulating smaller obj_cgroups. > > In your own words, the discrepancy between tiny objects pinning large > memcgs is a problem. objcgs are smaller than most objects, so it's not > much different as if an object were simply a few bytes bigger in size. > A memcg on the other hand is vastly bigger than most objects. It's > also composed of many allocations and so causes more fragmentation. > > Another issue is that memcgs with abandoned objects need to be visited > by the reclaimer on every single reclaim walk, a hot path. The list of > objcgs on the other hand is only iterated when the cgroup is deleted, > which is not a fast path. It's also rare that parents with many dead > children are deleted at all (system.slice, workload.slice etc.) > > So no, I would say for all intents and purposes, it fixes all the > problems we're having with zombie memcgs. > > > With all this said, I'm not necessarily opposing the patchset, but it's > > necessary to discuss how it fits into the long-term picture. > > E.g. if we're going to use obj_cgroup API for page-sized objects, shouldn't > > we split it back into the reparenting and bytes-sized accounting parts, > > as I initially suggested. And shouldn't we move the reparenting part to > > the cgroup core level, so we could use it for other controllers > > (e.g. to fix the writeback problem). > > Yes, I do think we want to generalize it. But I wouldn't say it's a > requirement for these patches, either: > > - The byte-sized accounting part is one atomic_t. That's 4 bytes > pinned unnecessarily - compared to an entire struct memcg right now > which includes memory accounting, swap accounting, byte accounting, > and a whole lot of other things likely not used by the stale object. > > - The cgroup generalization is a good idea too, but that doesn't > really change the callsites either. Unless you were thinking of > renaming, but objcg seems like a good, generic fit for a name to > describe the linkage between objects to a cgroup. > > The memcg member will need to change into something generic (a > css_set type mapping), but that can likely be hidden behind > page_memcg(), objcg_memcg() and similar helpers. > > Both of these aspects can be improved incrementally. >
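Johannes' point above about the read side having to be prepared for the memcg to change and/or die at any time is already visible in the kmem conversion referenced in the cover letter. The sketch below follows roughly the shape of the objcg-to-memcg lookup helper (get_mem_cgroup_from_objcg()) that that series added to mm/memcontrol.c; it is simplified here and the function name is kept only for illustration.

```c
/*
 * Sketch of the objcg -> memcg lookup, roughly following the helper the
 * kmem objcg conversion added to mm/memcontrol.c.  The read side never
 * assumes objcg->memcg is stable: reparenting may redirect it to an
 * ancestor, and the memcg it finds may already be on its way out.
 */
static struct mem_cgroup *memcg_from_objcg(struct obj_cgroup *objcg)
{
	struct mem_cgroup *memcg;

	rcu_read_lock();
retry:
	/* May return the original memcg or, after reparenting, an ancestor. */
	memcg = obj_cgroup_memcg(objcg);

	/*
	 * Raced with that memcg being released?  Re-read the pointer; it
	 * will eventually name a live ancestor (the root never dies).
	 */
	if (unlikely(!css_tryget(&memcg->css)))
		goto retry;
	rcu_read_unlock();

	return memcg;
}
```

Because reparenting only ever redirects objcg->memcg towards a live ancestor, the retry loop terminates quickly; this is the generic synchronization scheme the LRU conversion would reuse.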
On Fri, Apr 2, 2021 at 1:15 AM Shakeel Butt <shakeelb@google.com> wrote: > > On Thu, Apr 1, 2021 at 9:08 AM Muchun Song <songmuchun@bytedance.com> wrote: > > > [...] > > > The zombie issue is a pretty urgent concern that has caused several > > > production emergencies now. It needs a fix sooner rather than later. > > > > Thank you very much for clarifying the problem for me. I do agree > > with you. This issue should be fixed ASAP. Using objcg is a good > > choice. Dying objcg should not be a problem. Because the size of > > objcg is so small compared to memcg. > > > > Just wanted to say out loud that yes this patchset will reduce the > memcg zombie issue but this is not the final destination. We should > continue the discussions on sharing/reusing scenarios. Yeah. Reducing the zombie memcgs is not the final destination, but it is an optimization. OK, the discussions about sharing/reusing are also welcome. > > Muchun, can you please also CC Hugh Dickins and Alex Shi in the next > version of your patchset? No problem. I will cc Alex Shi in the next version.
On Fri, Apr 2, 2021 at 6:55 AM Yang Shi <shy828301@gmail.com> wrote: > > On Wed, Mar 31, 2021 at 8:17 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Tue, Mar 30, 2021 at 03:05:42PM -0700, Roman Gushchin wrote: > > > On Tue, Mar 30, 2021 at 05:30:10PM -0400, Johannes Weiner wrote: > > > > On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote: > > > > > On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote: > > > > > > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote: > > > > > > > [...] > > > > > > [...] > > > > > I agree, but also the reparenting is not the perfect thing in a combination > > > > > with any memory protections (e.g. memory.low). > > > > > > > > > > Imagine the following configuration: > > > > > workload.slice > > > > > - workload_gen_1.service memory.min = 30G > > > > > - workload_gen_2.service memory.min = 30G > > > > > - workload_gen_3.service memory.min = 30G > > > > > ... > > > > > > > > > > Parent cgroup and several generations of the child cgroup, protected by a memory.low. > > > > > Once the memory is getting reparented, it's not protected anymore.
> > > > > > > > That doesn't sound right. > > > > > > > > A deleted cgroup today exerts no control over its abandoned > > > > pages. css_reset() will blow out any control settings. > > > I know. Currently it works in the following way: once cgroup gen_1 is deleted, > > > it's memory is not protected anymore, so eventually it's getting evicted and > > > re-faulted as gen_2 (or gen_N) memory. Muchun's patchset doesn't change this, > > > of course. But long-term we likely wanna re-charge such pages to new cgroups > > > and avoid unnecessary evictions and re-faults. Switching to obj_cgroups doesn't > > > help and likely will complicate this change. So I'm a bit skeptical here. > > > > We should be careful with the long-term plans. > > Excuse me for a dumb question. I recall we did reparent LRU pages > before (before 4.x kernel). I vaguely recall there were some tricky > race conditions during reparenting so we didn't do it anymore once > reclaimer could reclaim from offlined memcgs. My memory may be wrong, > if it is not so please feel free to correct me. If my memory is true, I looked at the historical information via git. Your memory is right. > it means the race conditions are gone? Or the new obj_cgroup APIs made > life easier? I do not know the full history of how many race conditions there were; Johannes should be very familiar with this. From my point of view, the obj_cgroup APIs could make life easier, because we do not need to worry about keeping the page counters and statistics correct when the LRU pages are reparented. The reparenting operation itself can be simple: just splice the child's lruvec lists onto its parent's lruvec lists (we do not need to isolate the pages one by one). Thanks. > > Thanks, > Yang > > [...]
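A hypothetical sketch of the splice operation Muchun describes in the message above; the function name, lock handling and field names here are assumptions for illustration and are not taken from the actual patches. The point it shows is that the child's LRU lists are moved onto the parent's lruvec wholesale, so no page has to be isolated individually; per-node lruvecs, LRU size accounting and races with concurrent isolation are glossed over.

```c
/*
 * Hypothetical sketch of lruvec reparenting by splicing, not the actual
 * series.  Real code must also handle per-node lruvecs, LRU size
 * accounting, lockdep annotations and concurrent page isolation.
 */
static void lruvec_reparent_lists(struct lruvec *child, struct lruvec *parent)
{
	enum lru_list lru;

	spin_lock_irq(&parent->lru_lock);
	spin_lock_nested(&child->lru_lock, SINGLE_DEPTH_NESTING);

	/* Move every LRU list wholesale; no page is isolated individually. */
	for_each_lru(lru)
		list_splice_tail_init(&child->lists[lru], &parent->lists[lru]);

	spin_unlock(&child->lru_lock);
	spin_unlock_irq(&parent->lru_lock);
}
```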
On Thu, Apr 01, 2021 at 10:15:45AM -0700, Shakeel Butt wrote: > On Thu, Apr 1, 2021 at 9:08 AM Muchun Song <songmuchun@bytedance.com> wrote: > > > [...] > > > The zombie issue is a pretty urgent concern that has caused several > > > production emergencies now. It needs a fix sooner rather than later. > > > > Thank you very much for clarifying the problem for me. I do agree > > with you. This issue should be fixed ASAP. Using objcg is a good > > choice. Dying objcg should not be a problem. Because the size of > > objcg is so small compared to memcg. > > > > Just wanted to say out loud that yes this patchset will reduce the > memcg zombie issue but this is not the final destination. We should > continue the discussions on sharing/reusing scenarios. Absolutely. I think it's an important discussion to have. My only concern is that Muchun's patches fix a regression, which admittedly has built up over a few years, but is a regression nonetheless that can leave machines in undesirable states and may require reboots. The sharing and reuse semantics on the other hand have been the same since the beginning of cgroups. Users have had some time to structure their requirements around these semantics :-) If there were a concrete proposal or an RFC on the table for how sharing and reusing could be implemented, and that proposal were in direct conflict with the reparenting patches, I would say let's try to figure out whether the bug could be fixed in a way that is compatible with such an imminent change. But we shouldn't hold up a bug fix to start planning a new feature.