Message ID: 20230506114948.6862-1-chengkaitao@didiglobal.com (mailing list archive)
Series: memcontrol: support cgroup level OOM protection
On Sat 06-05-23 19:49:46, chengkaitao wrote: > Establish a new OOM score algorithm, supports the cgroup level OOM > protection mechanism. When an global/memcg oom event occurs, we treat > all processes in the cgroup as a whole, and OOM killers need to select > the process to kill based on the protection quota of the cgroup Although your patch 1 briefly touches on some advantages of this interface there is a lack of actual usecase. Arguing that oom_score_adj is hard because it needs a parent process is rather weak to be honest. It is just trivial to create a thin wrapper, use systemd to launch important services or simply update the value after the fact. Now oom_score_adj has its own downsides of course (most notably a granularity and a lack of group protection. That being said, make sure you describe your usecase more thoroughly. Please also make sure you describe the intended heuristic of the knob. It is not really clear from the description how this fits hierarchical behavior of cgroups. I would be especially interested in the semantics of non-leaf memcgs protection as they do not have any actual processes to protect. Also there have been concerns mentioned in v2 discussion and it would be really appreciated to summarize how you have dealt with them. Please also note that many people are going to be slow in responding this week because of LSFMM conference (https://events.linuxfoundation.org/lsfmm/)
At 2023-05-07 18:11:58, "Michal Hocko" <mhocko@suse.com> wrote: >On Sat 06-05-23 19:49:46, chengkaitao wrote: >> Establish a new OOM score algorithm, supports the cgroup level OOM >> protection mechanism. When an global/memcg oom event occurs, we treat >> all processes in the cgroup as a whole, and OOM killers need to select >> the process to kill based on the protection quota of the cgroup > >Although your patch 1 briefly touches on some advantages of this >interface there is a lack of actual usecase. Arguing that oom_score_adj >is hard because it needs a parent process is rather weak to be honest. >It is just trivial to create a thin wrapper, use systemd to launch >important services or simply update the value after the fact. Now >oom_score_adj has its own downsides of course (most notably a >granularity and a lack of group protection. > >That being said, make sure you describe your usecase more thoroughly. >Please also make sure you describe the intended heuristic of the knob. >It is not really clear from the description how this fits hierarchical >behavior of cgroups. I would be especially interested in the semantics >of non-leaf memcgs protection as they do not have any actual processes >to protect. > >Also there have been concerns mentioned in v2 discussion and it would be >really appreciated to summarize how you have dealt with them. > >Please also note that many people are going to be slow in responding >this week because of LSFMM conference >(https://events.linuxfoundation.org/lsfmm/) Here is a more detailed comparison and introduction of the old oom_score_adj mechanism and the new oom_protect mechanism, 1. The regulating granularity of oom_protect is smaller than that of oom_score_adj. On a 512G physical machine, the minimum granularity adjusted by oom_score_adj is 512M, and the minimum granularity adjusted by oom_protect is one page (4K). 2. It may be simple to create a lightweight parent process and uniformly set the oom_score_adj of some important processes, but it is not a simple matter to make multi-level settings for tens of thousands of processes on the physical machine through the lightweight parent processes. We may need a huge table to record the value of oom_score_adj maintained by all lightweight parent processes, and the user process limited by the parent process has no ability to change its own oom_score_adj, because it does not know the details of the huge table. The new patch adopts the cgroup mechanism. It does not need any parent process to manage oom_score_adj. the settings between each memcg are independent of each other, making it easier to plan the OOM order of all processes. Due to the unique nature of memory resources, current Service cloud vendors are not oversold in memory planning. I would like to use the new patch to try to achieve the possibility of oversold memory resources. 3. I conducted a test and deployed an excessive number of containers on a physical machine, By setting the oom_score_adj value of all processes in the container to a positive number through dockerinit, even processes that occupy very little memory in the container are easily killed, resulting in a large number of invalid kill behaviors. If dockerinit is also killed unfortunately, it will trigger container self-healing, and the container will rebuild, resulting in more severe memory oscillations. The new patch abandons the behavior of adding an equal amount of oom_score_adj to each process in the container and adopts a shared oom_protect quota for all processes in the container. 
If a process in the container is killed, the remaining processes will receive more oom_protect quota, making it more difficult for them to be killed. In my test case, the new patch reduced the number of invalid kill behaviors by 70%.
4. oom_score_adj is a global configuration that cannot express a kill order which only affects a certain memcg-oom-killer. The oom_protect mechanism, however, is inherited downwards: users can only change the kill order of their own memcg oom, while the kill order of the parent memcg-oom-killer or the global-oom-killer is not affected.
In the final discussion of patch v2, we noted that although the adjustment range of oom_score_adj is [-1000,1000], it essentially only allows two usecases (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably; everything in between is clumsy at best. To solve this problem in the new patch, I introduced a new indicator, oom_kill_inherit, which counts the number of times the local and child cgroups have been selected by the OOM killer of an ancestor cgroup. By observing the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the value of oom_protect to achieve the best.
About the semantics of non-leaf memcg protection: if a non-leaf memcg's oom_protect quota is set, its leaf memcgs will proportionally calculate their new effective oom_protect quota based on the non-leaf memcg's quota.
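For reference, the ~512M figure in point 1 of this reply falls out of the way oom_badness() scales oom_score_adj by totalpages/1000; the small userspace program below only illustrates that arithmetic and is not code from the patch.

#include <stdio.h>

int main(void)
{
	unsigned long long total_bytes = 512ULL << 30;	/* 512G machine */
	unsigned long long page_size = 4096;		/* 4K pages */
	unsigned long long totalpages = total_bytes / page_size;

	/* oom_badness() adds oom_score_adj * totalpages / 1000 to a task's
	 * badness, so one oom_score_adj step covers roughly this much memory: */
	printf("oom_score_adj step: ~%llu MiB\n",
	       totalpages / 1000 * page_size >> 20);
	/* memory.oom.protect is expressed directly in bytes/pages: */
	printf("oom.protect step: %llu KiB\n", page_size >> 10);
	return 0;
}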
On Mon 08-05-23 09:08:25, 程垲涛 Chengkaitao Cheng wrote: > At 2023-05-07 18:11:58, "Michal Hocko" <mhocko@suse.com> wrote: > >On Sat 06-05-23 19:49:46, chengkaitao wrote: > >> Establish a new OOM score algorithm, supports the cgroup level OOM > >> protection mechanism. When an global/memcg oom event occurs, we treat > >> all processes in the cgroup as a whole, and OOM killers need to select > >> the process to kill based on the protection quota of the cgroup > > > >Although your patch 1 briefly touches on some advantages of this > >interface there is a lack of actual usecase. Arguing that oom_score_adj > >is hard because it needs a parent process is rather weak to be honest. > >It is just trivial to create a thin wrapper, use systemd to launch > >important services or simply update the value after the fact. Now > >oom_score_adj has its own downsides of course (most notably a > >granularity and a lack of group protection. > > > >That being said, make sure you describe your usecase more thoroughly. > >Please also make sure you describe the intended heuristic of the knob. > >It is not really clear from the description how this fits hierarchical > >behavior of cgroups. I would be especially interested in the semantics > >of non-leaf memcgs protection as they do not have any actual processes > >to protect. > > > >Also there have been concerns mentioned in v2 discussion and it would be > >really appreciated to summarize how you have dealt with them. > > > >Please also note that many people are going to be slow in responding > >this week because of LSFMM conference > >(https://events.linuxfoundation.org/lsfmm/) > > Here is a more detailed comparison and introduction of the old oom_score_adj > mechanism and the new oom_protect mechanism, > 1. The regulating granularity of oom_protect is smaller than that of oom_score_adj. > On a 512G physical machine, the minimum granularity adjusted by oom_score_adj > is 512M, and the minimum granularity adjusted by oom_protect is one page (4K). > 2. It may be simple to create a lightweight parent process and uniformly set the > oom_score_adj of some important processes, but it is not a simple matter to make > multi-level settings for tens of thousands of processes on the physical machine > through the lightweight parent processes. We may need a huge table to record the > value of oom_score_adj maintained by all lightweight parent processes, and the > user process limited by the parent process has no ability to change its own > oom_score_adj, because it does not know the details of the huge table. The new > patch adopts the cgroup mechanism. It does not need any parent process to manage > oom_score_adj. the settings between each memcg are independent of each other, > making it easier to plan the OOM order of all processes. Due to the unique nature > of memory resources, current Service cloud vendors are not oversold in memory > planning. I would like to use the new patch to try to achieve the possibility of > oversold memory resources. OK, this is more specific about the usecase. Thanks! So essentially what it boils down to is that you are handling many containers (memcgs from our POV) and they have different priorities. You want to overcommit the memory to the extend that global ooms are not an unexpected event. Once that happens the total memory consumption of a specific memcg is less important than its "priority". You define that priority by the excess of the memory usage above a user defined threshold. Correct? 
Your cover letter mentions that then "all processes in the cgroup as a whole". That to me reads as oom.group oom killer policy. But a brief look into the patch suggests you are still looking at specific tasks and this has been a concern in the previous version of the patch because memcg accounting and per-process accounting are detached. > 3. I conducted a test and deployed an excessive number of containers on a physical > machine, By setting the oom_score_adj value of all processes in the container to > a positive number through dockerinit, even processes that occupy very little memory > in the container are easily killed, resulting in a large number of invalid kill behaviors. > If dockerinit is also killed unfortunately, it will trigger container self-healing, and the > container will rebuild, resulting in more severe memory oscillations. The new patch > abandons the behavior of adding an equal amount of oom_score_adj to each process > in the container and adopts a shared oom_protect quota for all processes in the container. > If a process in the container is killed, the remaining other processes will receive more > oom_protect quota, making it more difficult for the remaining processes to be killed. > In my test case, the new patch reduced the number of invalid kill behaviors by 70%. > 4. oom_score_adj is a global configuration that cannot achieve a kill order that only > affects a certain memcg-oom-killer. However, the oom_protect mechanism inherits > downwards, and user can only change the kill order of its own memcg oom, but the > kill order of their parent memcg-oom-killer or global-oom-killer will not be affected Yes oom_score_adj has shortcomings. > In the final discussion of patch v2, we discussed that although the adjustment range > of oom_score_adj is [-1000,1000], but essentially it only allows two usecases > (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is > clumsy at best. In order to solve this problem in the new patch, I introduced a new > indicator oom_kill_inherit, which counts the number of times the local and child > cgroups have been selected by the OOM killer of the ancestor cgroup. By observing > the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the > value of oom_protect to achieve the best. What does the best mean in this context? > about the semantics of non-leaf memcgs protection, > If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally > calculate the new effective oom_protect quota based on non-leaf memcg's quota. So the non-leaf memcg is never used as a target? What if the workload is distributed over several sub-groups? Our current oom.group implementation traverses the tree to find a common ancestor in the oom domain with the oom.group. All that being said and with the usecase described more specifically. I can see that memcg based oom victim selection makes some sense. That menas that it is always a memcg selected and all tasks withing killed. Memcg based protection can be used to evaluate which memcg to choose and the overall scheme should be still manageable. It would indeed resemble memory protection for the regular reclaim. One thing that is still not really clear to me is to how group vs. non-group ooms could be handled gracefully. Right now we can handle that because the oom selection is still process based but with the protection this will become more problematic as explained previously. 
Essentially we would need to enforce the oom selection to be memcg based for all memcgs. Maybe a mount knob? What do you think?
At 2023-05-08 22:18:18, "Michal Hocko" <mhocko@suse.com> wrote: >On Mon 08-05-23 09:08:25, 程垲涛 Chengkaitao Cheng wrote: >> At 2023-05-07 18:11:58, "Michal Hocko" <mhocko@suse.com> wrote: >> >On Sat 06-05-23 19:49:46, chengkaitao wrote: >> > >> >That being said, make sure you describe your usecase more thoroughly. >> >Please also make sure you describe the intended heuristic of the knob. >> >It is not really clear from the description how this fits hierarchical >> >behavior of cgroups. I would be especially interested in the semantics >> >of non-leaf memcgs protection as they do not have any actual processes >> >to protect. >> > >> >Also there have been concerns mentioned in v2 discussion and it would be >> >really appreciated to summarize how you have dealt with them. >> > >> >Please also note that many people are going to be slow in responding >> >this week because of LSFMM conference >> >(https://events.linuxfoundation.org/lsfmm/) >> >> Here is a more detailed comparison and introduction of the old oom_score_adj >> mechanism and the new oom_protect mechanism, >> 1. The regulating granularity of oom_protect is smaller than that of oom_score_adj. >> On a 512G physical machine, the minimum granularity adjusted by oom_score_adj >> is 512M, and the minimum granularity adjusted by oom_protect is one page (4K). >> 2. It may be simple to create a lightweight parent process and uniformly set the >> oom_score_adj of some important processes, but it is not a simple matter to make >> multi-level settings for tens of thousands of processes on the physical machine >> through the lightweight parent processes. We may need a huge table to record the >> value of oom_score_adj maintained by all lightweight parent processes, and the >> user process limited by the parent process has no ability to change its own >> oom_score_adj, because it does not know the details of the huge table. The new >> patch adopts the cgroup mechanism. It does not need any parent process to manage >> oom_score_adj. the settings between each memcg are independent of each other, >> making it easier to plan the OOM order of all processes. Due to the unique nature >> of memory resources, current Service cloud vendors are not oversold in memory >> planning. I would like to use the new patch to try to achieve the possibility of >> oversold memory resources. > >OK, this is more specific about the usecase. Thanks! So essentially what >it boils down to is that you are handling many containers (memcgs from >our POV) and they have different priorities. You want to overcommit the >memory to the extend that global ooms are not an unexpected event. Once >that happens the total memory consumption of a specific memcg is less >important than its "priority". You define that priority by the excess of >the memory usage above a user defined threshold. Correct? It's correct. >Your cover letter mentions that then "all processes in the cgroup as a >whole". That to me reads as oom.group oom killer policy. But a brief >look into the patch suggests you are still looking at specific tasks and >this has been a concern in the previous version of the patch because >memcg accounting and per-process accounting are detached. I think the memcg accounting may be more reasonable, as its memory statistics are more comprehensive, similar to active page cache, which also increases the probability of OOM-kill. In the new patch, all the shared memory will also consume the oom_protect quota of the memcg, and the process's oom_protect quota of the memcg will decrease. >> 3. 
I conducted a test and deployed an excessive number of containers on a physical >> machine, By setting the oom_score_adj value of all processes in the container to >> a positive number through dockerinit, even processes that occupy very little memory >> in the container are easily killed, resulting in a large number of invalid kill behaviors. >> If dockerinit is also killed unfortunately, it will trigger container self-healing, and the >> container will rebuild, resulting in more severe memory oscillations. The new patch >> abandons the behavior of adding an equal amount of oom_score_adj to each process >> in the container and adopts a shared oom_protect quota for all processes in the container. >> If a process in the container is killed, the remaining other processes will receive more >> oom_protect quota, making it more difficult for the remaining processes to be killed. >> In my test case, the new patch reduced the number of invalid kill behaviors by 70%. >> 4. oom_score_adj is a global configuration that cannot achieve a kill order that only >> affects a certain memcg-oom-killer. However, the oom_protect mechanism inherits >> downwards, and user can only change the kill order of its own memcg oom, but the >> kill order of their parent memcg-oom-killer or global-oom-killer will not be affected > >Yes oom_score_adj has shortcomings. > >> In the final discussion of patch v2, we discussed that although the adjustment range >> of oom_score_adj is [-1000,1000], but essentially it only allows two usecases >> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is >> clumsy at best. In order to solve this problem in the new patch, I introduced a new >> indicator oom_kill_inherit, which counts the number of times the local and child >> cgroups have been selected by the OOM killer of the ancestor cgroup. By observing >> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the >> value of oom_protect to achieve the best. > >What does the best mean in this context? I have created a new indicator oom_kill_inherit that maintains a negative correlation with memory.oom.protect, so we have a ruler to measure the optimal value of memory.oom.protect. >> about the semantics of non-leaf memcgs protection, >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally >> calculate the new effective oom_protect quota based on non-leaf memcg's quota. > >So the non-leaf memcg is never used as a target? What if the workload is >distributed over several sub-groups? Our current oom.group >implementation traverses the tree to find a common ancestor in the oom >domain with the oom.group. If the oom_protect quota of the parent non-leaf memcg is less than the sum of sub-groups oom_protect quota, the oom_protect quota of each sub-group will be proportionally reduced If the oom_protect quota of the parent non-leaf memcg is greater than the sum of sub-groups oom_protect quota, the oom_protect quota of each sub-group will be proportionally increased The purpose of doing so is that users can set oom_protect quota according to their own needs, and the system management process can set appropriate oom_protect quota on the parent non-leaf memcg as the final cover, so that the system management process can indirectly manage all user processes. >All that being said and with the usecase described more specifically. I >can see that memcg based oom victim selection makes some sense. That >menas that it is always a memcg selected and all tasks withing killed. 
>Memcg based protection can be used to evaluate which memcg to choose and >the overall scheme should be still manageable. It would indeed resemble >memory protection for the regular reclaim. > >One thing that is still not really clear to me is to how group vs. >non-group ooms could be handled gracefully. Right now we can handle that >because the oom selection is still process based but with the protection >this will become more problematic as explained previously. Essentially >we would need to enforce the oom selection to be memcg based for all >memcgs. Maybe a mount knob? What do you think?

There is a function in the patch that determines whether the oom_protect mechanism is enabled. All memory.oom.protect nodes default to 0, so the function <is_root_oom_protect> returns 0 by default. The oom_protect mechanism only takes effect when "root_mem_cgroup->memory.children_oom_protect_usage" is not 0, and only memcgs with their memory.oom.protect node set are affected.

+bool is_root_oom_protect(void)
+{
+	if (mem_cgroup_disabled())
+		return 0;
+
+	return !!atomic_long_read(&root_mem_cgroup->memory.children_oom_protect_usage);
+}

I don't know if there are any problems with my understanding?
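One possible reading of the proportional adjustment of sub-group quotas described earlier in this reply, as a standalone sketch; effective_oom_protect() and the 300M/100M/500M numbers are invented here for illustration, and the patch's actual arithmetic may differ.

#include <stdio.h>

/*
 * Illustration only: a child's effective quota is its configured
 * memory.oom.protect scaled by parent_effective / sum(children's configured
 * quotas), which covers both the "proportionally reduced" and
 * "proportionally increased" cases described above.
 */
unsigned long long effective_oom_protect(unsigned long long parent_eprot,
		unsigned long long child_prot, unsigned long long children_sum)
{
	if (!children_sum)
		return 0;
	return parent_eprot * child_prot / children_sum;
}

int main(void)
{
	/* parent quota 300M, children configured with 100M and 500M */
	unsigned long long parent = 300, a = 100, b = 500;	/* MiB */

	printf("A: %lluM  B: %lluM\n",
	       effective_oom_protect(parent, a, a + b),		/*  50M */
	       effective_oom_protect(parent, b, a + b));	/* 250M */
	return 0;
}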
[Sorry for a late reply but I was mostly offline last 2 weeks] On Tue 09-05-23 06:50:59, 程垲涛 Chengkaitao Cheng wrote: > At 2023-05-08 22:18:18, "Michal Hocko" <mhocko@suse.com> wrote: [...] > >Your cover letter mentions that then "all processes in the cgroup as a > >whole". That to me reads as oom.group oom killer policy. But a brief > >look into the patch suggests you are still looking at specific tasks and > >this has been a concern in the previous version of the patch because > >memcg accounting and per-process accounting are detached. > > I think the memcg accounting may be more reasonable, as its memory > statistics are more comprehensive, similar to active page cache, which > also increases the probability of OOM-kill. In the new patch, all the > shared memory will also consume the oom_protect quota of the memcg, > and the process's oom_protect quota of the memcg will decrease. I am sorry but I do not follow. Could you elaborate please? Are you arguing for per memcg or per process metrics? [...] > >> In the final discussion of patch v2, we discussed that although the adjustment range > >> of oom_score_adj is [-1000,1000], but essentially it only allows two usecases > >> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is > >> clumsy at best. In order to solve this problem in the new patch, I introduced a new > >> indicator oom_kill_inherit, which counts the number of times the local and child > >> cgroups have been selected by the OOM killer of the ancestor cgroup. By observing > >> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the > >> value of oom_protect to achieve the best. > > > >What does the best mean in this context? > > I have created a new indicator oom_kill_inherit that maintains a negative correlation > with memory.oom.protect, so we have a ruler to measure the optimal value of > memory.oom.protect. An example might help here. > >> about the semantics of non-leaf memcgs protection, > >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally > >> calculate the new effective oom_protect quota based on non-leaf memcg's quota. > > > >So the non-leaf memcg is never used as a target? What if the workload is > >distributed over several sub-groups? Our current oom.group > >implementation traverses the tree to find a common ancestor in the oom > >domain with the oom.group. > > If the oom_protect quota of the parent non-leaf memcg is less than the sum of > sub-groups oom_protect quota, the oom_protect quota of each sub-group will > be proportionally reduced > If the oom_protect quota of the parent non-leaf memcg is greater than the sum > of sub-groups oom_protect quota, the oom_protect quota of each sub-group > will be proportionally increased > The purpose of doing so is that users can set oom_protect quota according to > their own needs, and the system management process can set appropriate > oom_protect quota on the parent non-leaf memcg as the final cover, so that > the system management process can indirectly manage all user processes. I guess that you are trying to say that the oom protection has a standard hierarchical behavior. And that is fine, well, in fact it is mandatory for any control knob to have a sane hierarchical properties. But that doesn't address my above question. Let me try again. When is a non-leaf memcg potentially selected as the oom victim? It doesn't have any tasks directly but it might be a suitable target to kill a multi memcg based workload (e.g. a full container). 
> >All that being said and with the usecase described more specifically. I > >can see that memcg based oom victim selection makes some sense. That > >menas that it is always a memcg selected and all tasks withing killed. > >Memcg based protection can be used to evaluate which memcg to choose and > >the overall scheme should be still manageable. It would indeed resemble > >memory protection for the regular reclaim. > > > >One thing that is still not really clear to me is to how group vs. > >non-group ooms could be handled gracefully. Right now we can handle that > >because the oom selection is still process based but with the protection > >this will become more problematic as explained previously. Essentially > >we would need to enforce the oom selection to be memcg based for all > >memcgs. Maybe a mount knob? What do you think? > > There is a function in the patch to determine whether the oom_protect > mechanism is enabled. All memory.oom.protect nodes default to 0, so the function > <is_root_oom_protect> returns 0 by default. How can an admin determine what is the current oom detection logic?
At 2023-05-22 21:03:50, "Michal Hocko" <mhocko@suse.com> wrote: >[Sorry for a late reply but I was mostly offline last 2 weeks] > >On Tue 09-05-23 06:50:59, 程垲涛 Chengkaitao Cheng wrote: >> At 2023-05-08 22:18:18, "Michal Hocko" <mhocko@suse.com> wrote: >[...] >> >Your cover letter mentions that then "all processes in the cgroup as a >> >whole". That to me reads as oom.group oom killer policy. But a brief >> >look into the patch suggests you are still looking at specific tasks and >> >this has been a concern in the previous version of the patch because >> >memcg accounting and per-process accounting are detached. >> >> I think the memcg accounting may be more reasonable, as its memory >> statistics are more comprehensive, similar to active page cache, which >> also increases the probability of OOM-kill. In the new patch, all the >> shared memory will also consume the oom_protect quota of the memcg, >> and the process's oom_protect quota of the memcg will decrease. > >I am sorry but I do not follow. Could you elaborate please? Are you >arguing for per memcg or per process metrics? You mentioned earlier that 'memcg accounting and per process accounting are detached', and I may have misunderstood your question. I want to express above that memcg accounting is more comprehensive than per process accounting, and using memcg accounting in the OOM-killer mechanism would be more reasonable. >[...] > >> >> In the final discussion of patch v2, we discussed that although the adjustment range >> >> of oom_score_adj is [-1000,1000], but essentially it only allows two usecases >> >> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is >> >> clumsy at best. In order to solve this problem in the new patch, I introduced a new >> >> indicator oom_kill_inherit, which counts the number of times the local and child >> >> cgroups have been selected by the OOM killer of the ancestor cgroup. By observing >> >> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the >> >> value of oom_protect to achieve the best. >> > >> >What does the best mean in this context? >> >> I have created a new indicator oom_kill_inherit that maintains a negative correlation >> with memory.oom.protect, so we have a ruler to measure the optimal value of >> memory.oom.protect. > >An example might help here. In my testing case, by adjusting memory.oom.protect, I was able to significantly reduce the oom_kill_inherit of the corresponding cgroup. In a physical machine with severely oversold memory, I divided all cgroups into three categories and controlled their probability of being selected by the oom-killer to 0%,% 20, and 80%, respectively. >> >> about the semantics of non-leaf memcgs protection, >> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally >> >> calculate the new effective oom_protect quota based on non-leaf memcg's quota. >> > >> >So the non-leaf memcg is never used as a target? What if the workload is >> >distributed over several sub-groups? Our current oom.group >> >implementation traverses the tree to find a common ancestor in the oom >> >domain with the oom.group. 
>> >> If the oom_protect quota of the parent non-leaf memcg is less than the sum of >> sub-groups oom_protect quota, the oom_protect quota of each sub-group will >> be proportionally reduced >> If the oom_protect quota of the parent non-leaf memcg is greater than the sum >> of sub-groups oom_protect quota, the oom_protect quota of each sub-group >> will be proportionally increased >> The purpose of doing so is that users can set oom_protect quota according to >> their own needs, and the system management process can set appropriate >> oom_protect quota on the parent non-leaf memcg as the final cover, so that >> the system management process can indirectly manage all user processes. > >I guess that you are trying to say that the oom protection has a >standard hierarchical behavior. And that is fine, well, in fact it is >mandatory for any control knob to have a sane hierarchical properties. >But that doesn't address my above question. Let me try again. When is a >non-leaf memcg potentially selected as the oom victim? It doesn't have >any tasks directly but it might be a suitable target to kill a multi >memcg based workload (e.g. a full container).

If a non-leaf memcg has higher memory usage and a smaller memory.oom.protect, it has a higher probability of being selected by the killer. If a non-leaf memcg is selected as the oom victim, the OOM killer will keep selecting an appropriate child memcg downwards until a leaf memcg is selected.

>> >All that being said and with the usecase described more specifically. I >> >can see that memcg based oom victim selection makes some sense. That >> >menas that it is always a memcg selected and all tasks withing killed. >> >Memcg based protection can be used to evaluate which memcg to choose and >> >the overall scheme should be still manageable. It would indeed resemble >> >memory protection for the regular reclaim. >> > >> >One thing that is still not really clear to me is to how group vs. >> >non-group ooms could be handled gracefully. Right now we can handle that >> >because the oom selection is still process based but with the protection >> >this will become more problematic as explained previously. Essentially >> >we would need to enforce the oom selection to be memcg based for all >> >memcgs. Maybe a mount knob? What do you think? >> >> There is a function in the patch to determine whether the oom_protect >> mechanism is enabled. All memory.oom.protect nodes default to 0, so the function >> <is_root_oom_protect> returns 0 by default. > >How can an admin determine what is the current oom detection logic?

The memory.oom.protect values are set by the administrators themselves, so they must know what the current OOM policy is. Reading the memory.oom.protect of the first-level cgroup directories and checking whether they are 0 can also tell whether the oom.protect policy is enabled. For a given process, the physical machine administrator, the k8s administrator, the agent administrator, and the container administrator each see a different effective memory.oom.protect, so each of them only needs to pay attention to the memory.oom.protect of their local cgroup directory. As for an administrator wanting to know the OOM detection logic of all other administrators, I don't think there is such a business requirement.
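A sketch of the top-down walk described above, under the assumption that at each level the victim is the child with the largest usage in excess of its effective protection; struct memcg_node, excess() and pick_oom_victim_memcg() are invented for illustration and are not the patch's code.

#include <stddef.h>

struct memcg_node {
	struct memcg_node **children;	/* array of nr_children pointers */
	int nr_children;		/* 0 for a leaf memcg */
	unsigned long long usage;	/* charged memory */
	unsigned long long eprot;	/* effective memory.oom.protect */
};

unsigned long long excess(const struct memcg_node *m)
{
	return m->usage > m->eprot ? m->usage - m->eprot : 0;
}

/* Walk down from the oom domain, always into the child with the largest
 * excess over its protection, until a leaf memcg is reached. */
struct memcg_node *pick_oom_victim_memcg(struct memcg_node *root)
{
	struct memcg_node *cur = root;

	while (cur->nr_children) {
		struct memcg_node *worst = NULL;

		for (int i = 0; i < cur->nr_children; i++) {
			struct memcg_node *c = cur->children[i];

			if (!worst || excess(c) > excess(worst))
				worst = c;
		}
		if (!worst || !excess(worst))
			break;	/* all children protected: stop at this level */
		cur = worst;
	}
	return cur;
}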
On Thu 25-05-23 07:35:41, 程垲涛 Chengkaitao Cheng wrote: > At 2023-05-22 21:03:50, "Michal Hocko" <mhocko@suse.com> wrote: [...] > >> I have created a new indicator oom_kill_inherit that maintains a negative correlation > >> with memory.oom.protect, so we have a ruler to measure the optimal value of > >> memory.oom.protect. > > > >An example might help here. > > In my testing case, by adjusting memory.oom.protect, I was able to significantly > reduce the oom_kill_inherit of the corresponding cgroup. In a physical machine > with severely oversold memory, I divided all cgroups into three categories and > controlled their probability of being selected by the oom-killer to 0%,% 20, > and 80%, respectively. I might be just dense but I am lost. Can we focus on the barebone semantic of the group oom selection and killing first. No magic auto-tuning at this stage please. > >> >> about the semantics of non-leaf memcgs protection, > >> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally > >> >> calculate the new effective oom_protect quota based on non-leaf memcg's quota. > >> > > >> >So the non-leaf memcg is never used as a target? What if the workload is > >> >distributed over several sub-groups? Our current oom.group > >> >implementation traverses the tree to find a common ancestor in the oom > >> >domain with the oom.group. > >> > >> If the oom_protect quota of the parent non-leaf memcg is less than the sum of > >> sub-groups oom_protect quota, the oom_protect quota of each sub-group will > >> be proportionally reduced > >> If the oom_protect quota of the parent non-leaf memcg is greater than the sum > >> of sub-groups oom_protect quota, the oom_protect quota of each sub-group > >> will be proportionally increased > >> The purpose of doing so is that users can set oom_protect quota according to > >> their own needs, and the system management process can set appropriate > >> oom_protect quota on the parent non-leaf memcg as the final cover, so that > >> the system management process can indirectly manage all user processes. > > > >I guess that you are trying to say that the oom protection has a > >standard hierarchical behavior. And that is fine, well, in fact it is > >mandatory for any control knob to have a sane hierarchical properties. > >But that doesn't address my above question. Let me try again. When is a > >non-leaf memcg potentially selected as the oom victim? It doesn't have > >any tasks directly but it might be a suitable target to kill a multi > >memcg based workload (e.g. a full container). > > If nonleaf memcg have the higher memory usage and the smaller > memory.oom.protect, it will have the higher the probability being > selected by the killer. If the non-leaf memcg is selected as the oom > victim, OOM-killer will continue to select the appropriate child > memcg downwards until the leaf memcg is selected. Parent memcg has more or equal memory charged than its child(ren) by definition. Let me try to ask differently. Say you have the following hierarchy root / \ container_A container_B (oom.prot=100M) (oom.prot=200M) (usage=120M) (usage=180M) / | \ A B C / \ C1 C2 container_B is protected so it should be excluded. Correct? So we are at container_A to chose from. There are multiple ways the system and continer admin might want to achieve. 1) system admin might want to shut down the whole container. 2) continer admin might want to shut the whole container down 3) cont. admin might want to shut down a whole sub group (e.g. 
C as it is a self contained workload and killing portion of it will put it into inconsistent state). 4) cont. admin might want to kill the most excess cgroup with tasks (i.e. a leaf memcg). 5) admin might want to kill a process in the most excess memcg. Now we already have oom.group thingy that can drive the group killing policy but it is not really clear how you want to incorporate that to the protection. Again, I think that an oom.protection makes sense but the semantic has to be very carefully thought through because it is quite easy to create corner cases and weird behavior. I also think that oom.group has to be consistent with the protection. > >> >All that being said and with the usecase described more specifically. I > >> >can see that memcg based oom victim selection makes some sense. That > >> >menas that it is always a memcg selected and all tasks withing killed. > >> >Memcg based protection can be used to evaluate which memcg to choose and > >> >the overall scheme should be still manageable. It would indeed resemble > >> >memory protection for the regular reclaim. > >> > > >> >One thing that is still not really clear to me is to how group vs. > >> >non-group ooms could be handled gracefully. Right now we can handle that > >> >because the oom selection is still process based but with the protection > >> >this will become more problematic as explained previously. Essentially > >> >we would need to enforce the oom selection to be memcg based for all > >> >memcgs. Maybe a mount knob? What do you think? > >> > >> There is a function in the patch to determine whether the oom_protect > >> mechanism is enabled. All memory.oom.protect nodes default to 0, so the function > >> <is_root_oom_protect> returns 0 by default. > > > >How can an admin determine what is the current oom detection logic? > > The memory.oom.protect are set by the administrator themselves, and they > must know what the current OOM policy is. Reading the memory.oom.protect > of the first level cgroup directory and observing whether it is 0 can also > determine whether the oom.protect policy is enabled. How do you achieve that from withing a container which doesn't have a full visibility to the cgroup tree?
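Plugging the numbers from the hierarchy above into the excess-above-protection rule discussed in this thread (illustrative only; it does not answer which of policies 1)-5) should then apply inside container_A):

#include <stdio.h>

int main(void)
{
	struct { const char *name; unsigned long long usage, prot; } c[] = {
		{ "container_A", 120, 100 },	/* MiB, from the diagram above */
		{ "container_B", 180, 200 },
	};

	for (int i = 0; i < 2; i++) {
		unsigned long long excess =
			c[i].usage > c[i].prot ? c[i].usage - c[i].prot : 0;

		printf("%s: excess %lluM%s\n", c[i].name, excess,
		       excess ? " (eligible)" : " (protected, skipped)");
	}
	return 0;
}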
At 2023-05-29 22:02:47, "Michal Hocko" <mhocko@suse.com> wrote: >On Thu 25-05-23 07:35:41, 程垲涛 Chengkaitao Cheng wrote: >> At 2023-05-22 21:03:50, "Michal Hocko" <mhocko@suse.com> wrote: >[...] >> >> I have created a new indicator oom_kill_inherit that maintains a negative correlation >> >> with memory.oom.protect, so we have a ruler to measure the optimal value of >> >> memory.oom.protect. >> > >> >An example might help here. >> >> In my testing case, by adjusting memory.oom.protect, I was able to significantly >> reduce the oom_kill_inherit of the corresponding cgroup. In a physical machine >> with severely oversold memory, I divided all cgroups into three categories and >> controlled their probability of being selected by the oom-killer to 0%,% 20, >> and 80%, respectively. > >I might be just dense but I am lost. Can we focus on the barebone >semantic of the group oom selection and killing first. No magic >auto-tuning at this stage please. > >> >> >> about the semantics of non-leaf memcgs protection, >> >> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally >> >> >> calculate the new effective oom_protect quota based on non-leaf memcg's quota. >> >> > >> >> >So the non-leaf memcg is never used as a target? What if the workload is >> >> >distributed over several sub-groups? Our current oom.group >> >> >implementation traverses the tree to find a common ancestor in the oom >> >> >domain with the oom.group. >> >> >> >> If the oom_protect quota of the parent non-leaf memcg is less than the sum of >> >> sub-groups oom_protect quota, the oom_protect quota of each sub-group will >> >> be proportionally reduced >> >> If the oom_protect quota of the parent non-leaf memcg is greater than the sum >> >> of sub-groups oom_protect quota, the oom_protect quota of each sub-group >> >> will be proportionally increased >> >> The purpose of doing so is that users can set oom_protect quota according to >> >> their own needs, and the system management process can set appropriate >> >> oom_protect quota on the parent non-leaf memcg as the final cover, so that >> >> the system management process can indirectly manage all user processes. >> > >> >I guess that you are trying to say that the oom protection has a >> >standard hierarchical behavior. And that is fine, well, in fact it is >> >mandatory for any control knob to have a sane hierarchical properties. >> >But that doesn't address my above question. Let me try again. When is a >> >non-leaf memcg potentially selected as the oom victim? It doesn't have >> >any tasks directly but it might be a suitable target to kill a multi >> >memcg based workload (e.g. a full container). >> >> If nonleaf memcg have the higher memory usage and the smaller >> memory.oom.protect, it will have the higher the probability being >> selected by the killer. If the non-leaf memcg is selected as the oom >> victim, OOM-killer will continue to select the appropriate child >> memcg downwards until the leaf memcg is selected. > >Parent memcg has more or equal memory charged than its child(ren) by >definition. Let me try to ask differently. Say you have the following >hierarchy > > root > / \ > container_A container_B > (oom.prot=100M) (oom.prot=200M) > (usage=120M) (usage=180M) > / | \ > A B C > / \ > C1 C2 > > >container_B is protected so it should be excluded. Correct? So we are at >container_A to chose from. There are multiple ways the system and >continer admin might want to achieve. >1) system admin might want to shut down the whole container. 
>2) continer admin might want to shut the whole container down >3) cont. admin might want to shut down a whole sub group (e.g. C as it > is a self contained workload and killing portion of it will put it into > inconsistent state). >4) cont. admin might want to kill the most excess cgroup with tasks (i.e. a > leaf memcg). >5) admin might want to kill a process in the most excess memcg. > >Now we already have oom.group thingy that can drive the group killing >policy but it is not really clear how you want to incorporate that to >the protection. > >Again, I think that an oom.protection makes sense but the semantic has >to be very carefully thought through because it is quite easy to create >corner cases and weird behavior. I also think that oom.group has to be >consistent with the protection. The barebone semantic of the function implemented by my patch are summarized as follows: Memcg only allows processes in the memcg to be selected by their ancestor's OOM killer when the memory usage exceeds "oom.protect" It should be noted that "oom.protect" and "oom.group" are completely different things, and kneading them together may make the explanation more confusing. >> >> >All that being said and with the usecase described more specifically. I >> >> >can see that memcg based oom victim selection makes some sense. That >> >> >menas that it is always a memcg selected and all tasks withing killed. >> >> >Memcg based protection can be used to evaluate which memcg to choose and >> >> >the overall scheme should be still manageable. It would indeed resemble >> >> >memory protection for the regular reclaim. >> >> > >> >> >One thing that is still not really clear to me is to how group vs. >> >> >non-group ooms could be handled gracefully. Right now we can handle that >> >> >because the oom selection is still process based but with the protection >> >> >this will become more problematic as explained previously. Essentially >> >> >we would need to enforce the oom selection to be memcg based for all >> >> >memcgs. Maybe a mount knob? What do you think? >> >> >> >> There is a function in the patch to determine whether the oom_protect >> >> mechanism is enabled. All memory.oom.protect nodes default to 0, so the function >> >> <is_root_oom_protect> returns 0 by default. >> > >> >How can an admin determine what is the current oom detection logic? >> >> The memory.oom.protect are set by the administrator themselves, and they >> must know what the current OOM policy is. Reading the memory.oom.protect >> of the first level cgroup directory and observing whether it is 0 can also >> determine whether the oom.protect policy is enabled. > >How do you achieve that from withing a container which doesn't have a >full visibility to the cgroup tree? The container does not need to have full visibility to the cgroup tree, function "oom.protect" requires it to only focus on local cgroup and subgroup trees. -- Thanks for your comment! chengkaitao
On Mon, May 8, 2023 at 7:18 AM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 08-05-23 09:08:25, 程垲涛 Chengkaitao Cheng wrote: > > At 2023-05-07 18:11:58, "Michal Hocko" <mhocko@suse.com> wrote: > > >On Sat 06-05-23 19:49:46, chengkaitao wrote: > > >> Establish a new OOM score algorithm, supports the cgroup level OOM > > >> protection mechanism. When an global/memcg oom event occurs, we treat > > >> all processes in the cgroup as a whole, and OOM killers need to select > > >> the process to kill based on the protection quota of the cgroup > > > > > >Although your patch 1 briefly touches on some advantages of this > > >interface there is a lack of actual usecase. Arguing that oom_score_adj > > >is hard because it needs a parent process is rather weak to be honest. > > >It is just trivial to create a thin wrapper, use systemd to launch > > >important services or simply update the value after the fact. Now > > >oom_score_adj has its own downsides of course (most notably a > > >granularity and a lack of group protection. > > > > > >That being said, make sure you describe your usecase more thoroughly. > > >Please also make sure you describe the intended heuristic of the knob. > > >It is not really clear from the description how this fits hierarchical > > >behavior of cgroups. I would be especially interested in the semantics > > >of non-leaf memcgs protection as they do not have any actual processes > > >to protect. > > > > > >Also there have been concerns mentioned in v2 discussion and it would be > > >really appreciated to summarize how you have dealt with them. > > > > > >Please also note that many people are going to be slow in responding > > >this week because of LSFMM conference > > >(https://events.linuxfoundation.org/lsfmm/) > > > > Here is a more detailed comparison and introduction of the old oom_score_adj > > mechanism and the new oom_protect mechanism, > > 1. The regulating granularity of oom_protect is smaller than that of oom_score_adj. > > On a 512G physical machine, the minimum granularity adjusted by oom_score_adj > > is 512M, and the minimum granularity adjusted by oom_protect is one page (4K). > > 2. It may be simple to create a lightweight parent process and uniformly set the > > oom_score_adj of some important processes, but it is not a simple matter to make > > multi-level settings for tens of thousands of processes on the physical machine > > through the lightweight parent processes. We may need a huge table to record the > > value of oom_score_adj maintained by all lightweight parent processes, and the > > user process limited by the parent process has no ability to change its own > > oom_score_adj, because it does not know the details of the huge table. The new > > patch adopts the cgroup mechanism. It does not need any parent process to manage > > oom_score_adj. the settings between each memcg are independent of each other, > > making it easier to plan the OOM order of all processes. Due to the unique nature > > of memory resources, current Service cloud vendors are not oversold in memory > > planning. I would like to use the new patch to try to achieve the possibility of > > oversold memory resources. > > OK, this is more specific about the usecase. Thanks! So essentially what > it boils down to is that you are handling many containers (memcgs from > our POV) and they have different priorities. You want to overcommit the > memory to the extend that global ooms are not an unexpected event. 
Once > that happens the total memory consumption of a specific memcg is less > important than its "priority". You define that priority by the excess of > the memory usage above a user defined threshold. Correct? There has been a parallel discussion in the cover letter thread of v4 [1]. To summarize, at Google, we have been using OOM scores to describe different job priorities in a more explicit way -- regardless of memory usage. It is strictly priority-based OOM killing. Ties are broken based on memory usage. We understand that something like memory.oom.protect has an advantage in the sense that you can skip killing a process if you know that it won't free enough memory anyway, but for an environment where multiple jobs of different priorities are running, we find it crucial to be able to define strict ordering. Some jobs are simply more important than others, regardless of their memory usage. It would be great if we can arrive at an interface that serves this use case as well. Thanks! [1]https://lore.kernel.org/linux-mm/CAJD7tkaQdSTDX0Q7zvvYrA3Y4TcvLdWKnN3yc8VpfWRpUjcYBw@mail.gmail.com/ > > Your cover letter mentions that then "all processes in the cgroup as a > whole". That to me reads as oom.group oom killer policy. But a brief > look into the patch suggests you are still looking at specific tasks and > this has been a concern in the previous version of the patch because > memcg accounting and per-process accounting are detached. > > > 3. I conducted a test and deployed an excessive number of containers on a physical > > machine, By setting the oom_score_adj value of all processes in the container to > > a positive number through dockerinit, even processes that occupy very little memory > > in the container are easily killed, resulting in a large number of invalid kill behaviors. > > If dockerinit is also killed unfortunately, it will trigger container self-healing, and the > > container will rebuild, resulting in more severe memory oscillations. The new patch > > abandons the behavior of adding an equal amount of oom_score_adj to each process > > in the container and adopts a shared oom_protect quota for all processes in the container. > > If a process in the container is killed, the remaining other processes will receive more > > oom_protect quota, making it more difficult for the remaining processes to be killed. > > In my test case, the new patch reduced the number of invalid kill behaviors by 70%. > > 4. oom_score_adj is a global configuration that cannot achieve a kill order that only > > affects a certain memcg-oom-killer. However, the oom_protect mechanism inherits > > downwards, and user can only change the kill order of its own memcg oom, but the > > kill order of their parent memcg-oom-killer or global-oom-killer will not be affected > > Yes oom_score_adj has shortcomings. > > > In the final discussion of patch v2, we discussed that although the adjustment range > > of oom_score_adj is [-1000,1000], but essentially it only allows two usecases > > (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is > > clumsy at best. In order to solve this problem in the new patch, I introduced a new > > indicator oom_kill_inherit, which counts the number of times the local and child > > cgroups have been selected by the OOM killer of the ancestor cgroup. By observing > > the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the > > value of oom_protect to achieve the best. > > What does the best mean in this context? 
> > > about the semantics of non-leaf memcgs protection, > > If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally > > calculate the new effective oom_protect quota based on non-leaf memcg's quota. > > So the non-leaf memcg is never used as a target? What if the workload is > distributed over several sub-groups? Our current oom.group > implementation traverses the tree to find a common ancestor in the oom > domain with the oom.group. > > All that being said and with the usecase described more specifically. I > can see that memcg based oom victim selection makes some sense. That > menas that it is always a memcg selected and all tasks withing killed. > Memcg based protection can be used to evaluate which memcg to choose and > the overall scheme should be still manageable. It would indeed resemble > memory protection for the regular reclaim. > > One thing that is still not really clear to me is to how group vs. > non-group ooms could be handled gracefully. Right now we can handle that > because the oom selection is still process based but with the protection > this will become more problematic as explained previously. Essentially > we would need to enforce the oom selection to be memcg based for all > memcgs. Maybe a mount knob? What do you think? > -- > Michal Hocko > SUSE Labs
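For contrast with the excess-based protection discussed in this thread, a minimal sketch of the strict-priority selection Yosry describes above; the oom_priority field is a hypothetical per-memcg knob, not an upstream interface, and the names and numbers are made up.

#include <stdio.h>

struct oom_candidate {
	const char *name;
	int oom_priority;		/* hypothetical: larger = less important */
	unsigned long long usage;	/* charged memory, used only for ties */
};

/* Return the candidate a strict-priority policy would kill first. */
const struct oom_candidate *
prefer_victim(const struct oom_candidate *a, const struct oom_candidate *b)
{
	if (a->oom_priority != b->oom_priority)
		return a->oom_priority > b->oom_priority ? a : b;
	return a->usage >= b->usage ? a : b;	/* tie broken by usage */
}

int main(void)
{
	struct oom_candidate small_batch = { "batch",  10,   1ULL << 30 };
	struct oom_candidate big_serving = { "serving", 5, 100ULL << 30 };

	/* The small batch job dies first despite using 100x less memory. */
	printf("victim: %s\n", prefer_victim(&small_batch, &big_serving)->name);
	return 0;
}

Under such a rule a small low-priority job is killed before a large high-priority one, which is exactly the ordering that a purely usage/protection based scheme cannot guarantee on its own.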
On Sun 04-06-23 08:05:53, 程垲涛 Chengkaitao Cheng wrote: > At 2023-05-29 22:02:47, "Michal Hocko" <mhocko@suse.com> wrote: > >On Thu 25-05-23 07:35:41, 程垲涛 Chengkaitao Cheng wrote: > >> At 2023-05-22 21:03:50, "Michal Hocko" <mhocko@suse.com> wrote: > >[...] > >> >> I have created a new indicator oom_kill_inherit that maintains a negative correlation > >> >> with memory.oom.protect, so we have a ruler to measure the optimal value of > >> >> memory.oom.protect. > >> > > >> >An example might help here. > >> > >> In my testing case, by adjusting memory.oom.protect, I was able to significantly > >> reduce the oom_kill_inherit of the corresponding cgroup. In a physical machine > >> with severely oversold memory, I divided all cgroups into three categories and > >> controlled their probability of being selected by the oom-killer to 0%,% 20, > >> and 80%, respectively. > > > >I might be just dense but I am lost. Can we focus on the barebone > >semantic of the group oom selection and killing first. No magic > >auto-tuning at this stage please. > > > >> >> >> about the semantics of non-leaf memcgs protection, > >> >> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally > >> >> >> calculate the new effective oom_protect quota based on non-leaf memcg's quota. > >> >> > > >> >> >So the non-leaf memcg is never used as a target? What if the workload is > >> >> >distributed over several sub-groups? Our current oom.group > >> >> >implementation traverses the tree to find a common ancestor in the oom > >> >> >domain with the oom.group. > >> >> > >> >> If the oom_protect quota of the parent non-leaf memcg is less than the sum of > >> >> sub-groups oom_protect quota, the oom_protect quota of each sub-group will > >> >> be proportionally reduced > >> >> If the oom_protect quota of the parent non-leaf memcg is greater than the sum > >> >> of sub-groups oom_protect quota, the oom_protect quota of each sub-group > >> >> will be proportionally increased > >> >> The purpose of doing so is that users can set oom_protect quota according to > >> >> their own needs, and the system management process can set appropriate > >> >> oom_protect quota on the parent non-leaf memcg as the final cover, so that > >> >> the system management process can indirectly manage all user processes. > >> > > >> >I guess that you are trying to say that the oom protection has a > >> >standard hierarchical behavior. And that is fine, well, in fact it is > >> >mandatory for any control knob to have a sane hierarchical properties. > >> >But that doesn't address my above question. Let me try again. When is a > >> >non-leaf memcg potentially selected as the oom victim? It doesn't have > >> >any tasks directly but it might be a suitable target to kill a multi > >> >memcg based workload (e.g. a full container). > >> > >> If nonleaf memcg have the higher memory usage and the smaller > >> memory.oom.protect, it will have the higher the probability being > >> selected by the killer. If the non-leaf memcg is selected as the oom > >> victim, OOM-killer will continue to select the appropriate child > >> memcg downwards until the leaf memcg is selected. > > > >Parent memcg has more or equal memory charged than its child(ren) by > >definition. Let me try to ask differently. Say you have the following > >hierarchy > > > > root > > / \ > > container_A container_B > > (oom.prot=100M) (oom.prot=200M) > > (usage=120M) (usage=180M) > > / | \ > > A B C > > / \ > > C1 C2 > > > > > >container_B is protected so it should be excluded. 
> >Correct? So we are at > >container_A to choose from. There are multiple outcomes the system and > >container admin might want to achieve. > >1) system admin might want to shut down the whole container. > >2) container admin might want to shut the whole container down > >3) cont. admin might want to shut down a whole sub group (e.g. C as it > > is a self contained workload and killing a portion of it will put it into > > an inconsistent state). > >4) cont. admin might want to kill the most excess cgroup with tasks (i.e. a > > leaf memcg). > >5) admin might want to kill a process in the most excess memcg. > > > >Now we already have the oom.group thingy that can drive the group killing > >policy but it is not really clear how you want to incorporate that into > >the protection. > > > >Again, I think that an oom.protection makes sense but the semantic has > >to be very carefully thought through because it is quite easy to create > >corner cases and weird behavior. I also think that oom.group has to be > >consistent with the protection. > > The barebone semantics of the function implemented by my patch are > summarized as follows: > Memcg only allows processes in the memcg to be selected by their > ancestor's OOM killer when the memory usage exceeds "oom.protect" I am sure you would need to break this expectation if there is no such memcg with tasks available. Or do you panic the system in that case for a global oom and retry forever for a memcg oom? > It should be noted that "oom.protect" and "oom.group" are completely > different things, and kneading them together may make the explanation > more confusing. I am not suggesting tying those two together by any means. I am merely saying that those two have to be mutually cooperative and still represent a reasonable semantic. Please have a look at the above example usecases and try to explain how the memory protection fits in here as you have defined and implemented it.
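To make the semantic being debated above easier to follow, here is a minimal userspace sketch in plain C. It is not code from the patch: the proportional scaling of children to the parent's effective protection, the pass-through rule for a parent that sets no quota, and the leaf-level numbers are all assumptions based on the descriptions quoted in this thread. The walk descends toward the child with the largest usage above its effective protection until it reaches a leaf.

    /* Toy model of an oom.protect-style selection; values are in MB. */
    #include <stdio.h>

    struct memcg {
        const char *name;
        long usage;                 /* memory charged to this subtree */
        long protect;               /* configured oom.protect quota */
        long eff_protect;           /* effective protection after scaling */
        struct memcg *child[4];
        int nr_children;
    };

    /* Scale children so their effective protection follows the parent's;
     * a parent without a quota simply passes the configured values through
     * (this pass-through rule is an assumption, not taken from the patch). */
    static void propagate(struct memcg *m)
    {
        long sum = 0;
        for (int i = 0; i < m->nr_children; i++)
            sum += m->child[i]->protect;
        for (int i = 0; i < m->nr_children; i++) {
            struct memcg *c = m->child[i];
            c->eff_protect = (m->eff_protect && sum) ?
                    c->protect * m->eff_protect / sum : c->protect;
            propagate(c);
        }
    }

    /* Descend into the child with the largest usage above its effective
     * protection until a leaf memcg is reached. */
    static struct memcg *pick_victim(struct memcg *m)
    {
        while (m->nr_children) {
            struct memcg *worst = m->child[0];
            for (int i = 1; i < m->nr_children; i++) {
                struct memcg *c = m->child[i];
                if (c->usage - c->eff_protect > worst->usage - worst->eff_protect)
                    worst = c;
            }
            m = worst;
        }
        return m;
    }

    int main(void)
    {
        struct memcg a1 = { "A/1", 70, 20 }, a2 = { "A/2", 50, 80 };
        struct memcg a = { "container_A", 120, 100, 0, { &a1, &a2 }, 2 };
        struct memcg b = { "container_B", 180, 200 };
        struct memcg root = { "root", 300, 0, 0, { &a, &b }, 2 };

        root.eff_protect = root.protect;
        propagate(&root);
        printf("victim memcg: %s\n", pick_victim(&root)->name);
        return 0;
    }

With the numbers from the container_A/container_B example above (the A/1 and A/2 leaves are invented), this walks into container_A, which is 20M over its protection, and then into the most excess leaf below it, so it naturally expresses cases 4) and 5); cases 1) to 3) would still need something like oom.group layered on top.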
On Sun 04-06-23 01:25:42, Yosry Ahmed wrote: [...] > There has been a parallel discussion in the cover letter thread of v4 > [1]. To summarize, at Google, we have been using OOM scores to > describe different job priorities in a more explicit way -- regardless > of memory usage. It is strictly priority-based OOM killing. Ties are > broken based on memory usage. > > We understand that something like memory.oom.protect has an advantage > in the sense that you can skip killing a process if you know that it > won't free enough memory anyway, but for an environment where multiple > jobs of different priorities are running, we find it crucial to be > able to define strict ordering. Some jobs are simply more important > than others, regardless of their memory usage. I do remember that discussion. I am not a great fan of simple priority based interfaces TBH. It sounds like an easy interface but it hits complications as soon as you try to define a proper/sensible hierarchical semantic. I can see how they might work on leaf memcgs with statically assigned priorities but that sounds like a very narrow usecase IMHO. I do not think we can afford a plethora of different OOM selection algorithms implemented in the kernel. Therefore we should really make any control interface as extensible and as much in line with the existing interfaces as possible. That is why I am really open to the oom protection concept, which fits reasonably well with the reclaim protection scheme. After all, the oom killer is just a very aggressive method of memory reclaim. On the other hand I can see a need for customizable OOM victim selection functionality. We've been through that discussion on several other occasions and the best thing we could come up with was allowing BPF to be plugged into the victim selection process and bypassing the system default method. No code has ever materialized from those discussions though. Maybe this is the time to revive that idea? > It would be great if we can arrive at an interface that serves this > use case as well. > > Thanks! > > [1]https://lore.kernel.org/linux-mm/CAJD7tkaQdSTDX0Q7zvvYrA3Y4TcvLdWKnN3yc8VpfWRpUjcYBw@mail.gmail.com/
+David Rientjes On Tue, Jun 13, 2023 at 1:27 AM Michal Hocko <mhocko@suse.com> wrote: > > On Sun 04-06-23 01:25:42, Yosry Ahmed wrote: > [...] > > There has been a parallel discussion in the cover letter thread of v4 > > [1]. To summarize, at Google, we have been using OOM scores to > > describe different job priorities in a more explicit way -- regardless > > of memory usage. It is strictly priority-based OOM killing. Ties are > > broken based on memory usage. > > > > We understand that something like memory.oom.protect has an advantage > > in the sense that you can skip killing a process if you know that it > > won't free enough memory anyway, but for an environment where multiple > > jobs of different priorities are running, we find it crucial to be > > able to define strict ordering. Some jobs are simply more important > > than others, regardless of their memory usage. > > I do remember that discussion. I am not a great fan of simple priority > based interfaces TBH. It sounds as an easy interface but it hits > complications as soon as you try to define a proper/sensible > hierarchical semantic. I can see how they might work on leaf memcgs with > statically assigned priorities but that sounds like a very narrow > usecase IMHO. Do you mind elaborating the problem with the hierarchical semantics? The way it works with our internal implementation is (imo) sensible and straightforward from a hierarchy POV. Starting at the OOM memcg (which can be root), we recursively compare the OOM scores of the children memcgs and pick the one with the lowest score, until we arrive at a leaf memcg. Within that leaf, we also define per-process scores, but these are less important to us. I am not sure I understand why this is not sensible from a hierarchy POV, or why it is a very narrow use case. Note that all this is optional: by default all memcgs are given the same score, and ties are broken based on per-memcg (or per-process) usage. > > I do not think we can effort a plethora of different OOM selection > algorithms implemented in the kernel. Therefore we should really > consider a control interface to be as much extensible and in line > with the existing interfaces as much as possible. That is why I am > really open to the oom protection concept which fits reasonably well > to the reclaim protection scheme. After all oom killer is just a very > aggressive method of the memory reclaim. > > On the other hand I can see a need to customizable OOM victim selection > functionality. We've been through that discussion on several other > occasions and the best thing we could come up with was to allow to plug > BPF into the victim selection process and allow to bypass the system > default method. No code has ever materialized from those discussions > though. Maybe this is the time to revive that idea again? That definitely sounds interesting, and it was brought up before. It does sound like BPF (or a different customization framework) can be the answer here. Interested to hear what others think as well. > > > It would be great if we can arrive at an interface that serves this > > use case as well. > > > > Thanks! > > > > [1]https://lore.kernel.org/linux-mm/CAJD7tkaQdSTDX0Q7zvvYrA3Y4TcvLdWKnN3yc8VpfWRpUjcYBw@mail.gmail.com/ > -- > Michal Hocko > SUSE Labs
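Purely to put the two models side by side, the descent Yosry describes might look like the sketch below (plain C, not Google's code; the scores, names and the usage tie-break detail are illustrative assumptions). At each level the child with the lowest score is picked, ties go to the bigger consumer, and the walk stops at a leaf memcg, inside which per-process scores would then apply.

    /* Toy model of strictly priority-based selection: lower score dies first. */
    #include <stdio.h>

    struct memcg {
        const char *name;
        int score;                  /* lower = less important = killed first */
        long usage;
        struct memcg *child[4];
        int nr_children;
    };

    static struct memcg *pick_victim(struct memcg *m)
    {
        while (m->nr_children) {
            struct memcg *best = m->child[0];
            for (int i = 1; i < m->nr_children; i++) {
                struct memcg *c = m->child[i];
                if (c->score < best->score ||
                    (c->score == best->score && c->usage > best->usage))
                    best = c;
            }
            m = best;
        }
        return m;
    }

    int main(void)
    {
        struct memcg batch = { "batch", 100, 300 };
        struct memcg serving = { "serving", 9000, 500 };
        struct memcg cont_a = { "cont_A", 4000, 800, { &batch, &serving }, 2 };
        struct memcg cont_b = { "cont_B", 5000, 900 };
        struct memcg root = { "root", 0, 1700, { &cont_a, &cont_b }, 2 };

        /* picks cont_A (4000 < 5000), then batch (100 < 9000) inside it */
        printf("victim memcg: %s\n", pick_victim(&root)->name);
        return 0;
    }

Note that memory usage only matters as a tie-break here, which is exactly the property the protection-based scheme does not offer and the priority-based scheme cannot avoid.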
Hello, On Tue, Jun 13, 2023 at 10:27:32AM +0200, Michal Hocko wrote: > On the other hand I can see a need to customizable OOM victim selection > functionality. We've been through that discussion on several other > occasions and the best thing we could come up with was to allow to plug > BPF into the victim selection process and allow to bypass the system > default method. No code has ever materialized from those discussions > though. Maybe this is the time to revive that idea again? Yeah, my 5 cents - trying to define a rigid interface for something complex and flexible is a fool's errand. Johannes knows a lot better than me but we (meta) are handling most OOMs with oomd, which gives more than sufficient policy flexibility. That said, handling OOMs in userspace requires wholesale configuration changes (e.g. working IO isolation), and being able to steer kernel OOM kills can be useful for many folks. I haven't thought too much about it but the problem seems pretty well suited to offloading policy decisions to a BPF program. Thanks.
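For readers who have not met a userspace OOM agent before, the shape Tejun is referring to is roughly the loop below. This is a deliberately tiny stand-in, not oomd: the cgroup names and the avg10 threshold are made-up placeholders, it only touches cgroup v2 files that exist today (memory.pressure and, on recent kernels, cgroup.kill), and a real agent is far more careful about what it kills and when.

    /* Minimal userspace "OOM agent": watch PSI, kill a cgroup that stalls. */
    #include <stdio.h>
    #include <unistd.h>

    #define CGROOT "/sys/fs/cgroup/"
    static const char *candidates[] = { "batch-a", "batch-b" };  /* hypothetical cgroups */

    /* Read the "full avg10=" figure from a cgroup's memory.pressure file. */
    static double full_avg10(const char *name)
    {
        char path[256], line[256];
        double v = 0.0;
        snprintf(path, sizeof(path), CGROOT "%s/memory.pressure", name);
        FILE *f = fopen(path, "r");
        if (!f)
            return 0.0;
        while (fgets(line, sizeof(line), f))
            if (sscanf(line, "full avg10=%lf", &v) == 1)
                break;
        fclose(f);
        return v;
    }

    int main(void)
    {
        for (;;) {
            for (unsigned i = 0; i < sizeof(candidates) / sizeof(*candidates); i++) {
                if (full_avg10(candidates[i]) < 20.0)   /* arbitrary threshold */
                    continue;
                char path[256];
                snprintf(path, sizeof(path), CGROOT "%s/cgroup.kill", candidates[i]);
                FILE *f = fopen(path, "w");             /* cgroup.kill: v5.14+ only */
                if (f) {
                    fputs("1", f);
                    fclose(f);
                }
            }
            sleep(1);
        }
    }

The point being made is that once the surrounding configuration (IO isolation in particular) is in place, a policy like this can be arbitrarily clever in userspace, whereas steering the kernel's own OOM killer still needs some in-kernel hook.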
On Tue 13-06-23 01:36:51, Yosry Ahmed wrote: > +David Rientjes > > On Tue, Jun 13, 2023 at 1:27 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Sun 04-06-23 01:25:42, Yosry Ahmed wrote: > > [...] > > > There has been a parallel discussion in the cover letter thread of v4 > > > [1]. To summarize, at Google, we have been using OOM scores to > > > describe different job priorities in a more explicit way -- regardless > > > of memory usage. It is strictly priority-based OOM killing. Ties are > > > broken based on memory usage. > > > > > > We understand that something like memory.oom.protect has an advantage > > > in the sense that you can skip killing a process if you know that it > > > won't free enough memory anyway, but for an environment where multiple > > > jobs of different priorities are running, we find it crucial to be > > > able to define strict ordering. Some jobs are simply more important > > > than others, regardless of their memory usage. > > > > I do remember that discussion. I am not a great fan of simple priority > > based interfaces TBH. It sounds as an easy interface but it hits > > complications as soon as you try to define a proper/sensible > > hierarchical semantic. I can see how they might work on leaf memcgs with > > statically assigned priorities but that sounds like a very narrow > > usecase IMHO. > > Do you mind elaborating the problem with the hierarchical semantics? Well, let me be more specific. If you have a simple hierarchical numeric enforcement (assume a higher priority is more likely to be chosen and the effective priority to be max(self, max(parents))) then the semantic itself is straightforward. I am not really sure about the practical manageability though. I have a hard time imagining priority assignment on something like a shared workload with a more complex hierarchy. For example: root / | \ cont_A cont_B cont_C each container running its workload with its own hierarchy structures that might be rather dynamic over their lifetime. In order to have a predictable OOM behavior you need to watch and reassign priorities all the time, no? > The way it works with our internal implementation is (imo) sensible > and straightforward from a hierarchy POV. Starting at the OOM memcg > (which can be root), we recursively compare the OOM scores of the > children memcgs and pick the one with the lowest score, until we > arrive at a leaf memcg. This approach places a strong requirement on the memcg hierarchy organization. Siblings have to be directly comparable because you cut off many potential sub-trees this way (e.g. is it easy to tell whether you want to rule out all system or user slices?). I can imagine usecases where this could work reasonably well, e.g. a set of workers of different priorities, all of them running under a shared memcg parent. But more involved hierarchies seem more complex because you always have to keep in mind how the hierarchy is organized to get to your desired victim.
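The hierarchical rule Michal assumes here is simple to state in code; his concern is not the rule but keeping the assigned numbers meaningful as workloads come and go. A minimal sketch of the max(self, max(parents)) behaviour, with made-up names and numbers:

    /* Effective OOM priority: a memcg can never be less exposed than its ancestors. */
    #include <stdio.h>

    struct memcg {
        const char *name;
        int prio;                   /* higher = more likely to be picked */
        struct memcg *parent;
    };

    static int effective_prio(struct memcg *m)
    {
        int eff = m->prio;
        for (struct memcg *p = m->parent; p; p = p->parent)
            if (p->prio > eff)
                eff = p->prio;
        return eff;
    }

    int main(void)
    {
        struct memcg root   = { "root", 0, NULL };
        struct memcg cont_a = { "cont_A", 70, &root };
        struct memcg job    = { "cont_A/job", 10, &cont_a };

        /* the job asked for 10 but inherits cont_A's 70 */
        printf("%s -> effective %d\n", job.name, effective_prio(&job));
        return 0;
    }

cont_A/job cannot make itself safer than its container, which is the sane hierarchical property, but it also means every decision ultimately hangs on how the admin keeps cont_A, cont_B and cont_C ranked against each other, which is the manageability concern raised above.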
On Tue, Jun 13, 2023 at 5:06 AM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 13-06-23 01:36:51, Yosry Ahmed wrote: > > +David Rientjes > > > > On Tue, Jun 13, 2023 at 1:27 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Sun 04-06-23 01:25:42, Yosry Ahmed wrote: > > > [...] > > > > There has been a parallel discussion in the cover letter thread of v4 > > > > [1]. To summarize, at Google, we have been using OOM scores to > > > > describe different job priorities in a more explicit way -- regardless > > > > of memory usage. It is strictly priority-based OOM killing. Ties are > > > > broken based on memory usage. > > > > > > > > We understand that something like memory.oom.protect has an advantage > > > > in the sense that you can skip killing a process if you know that it > > > > won't free enough memory anyway, but for an environment where multiple > > > > jobs of different priorities are running, we find it crucial to be > > > > able to define strict ordering. Some jobs are simply more important > > > > than others, regardless of their memory usage. > > > > > > I do remember that discussion. I am not a great fan of simple priority > > > based interfaces TBH. It sounds as an easy interface but it hits > > > complications as soon as you try to define a proper/sensible > > > hierarchical semantic. I can see how they might work on leaf memcgs with > > > statically assigned priorities but that sounds like a very narrow > > > usecase IMHO. > > > > Do you mind elaborating the problem with the hierarchical semantics? > > Well, let me be more specific. If you have a simple hierarchical numeric > enforcement (assume higher priority more likely to be chosen and the > effective priority to be max(self, max(parents)) then the semantic > itslef is straightforward. > > I am not really sure about the practical manageability though. I have > hard time to imagine priority assignment on something like a shared > workload with a more complex hierarchy. For example: > root > / | \ > cont_A cont_B cont_C > > each container running its workload with own hierarchy structures that > might be rather dynamic during the lifetime. In order to have a > predictable OOM behavior you need to watch and reassign priorities all > the time, no? In our case we don't really manage the entire hierarchy in a centralized fashion. Each container gets a score based on their relative priority, and each container is free to set scores within its subcontainers if needed. Isn't this what the hierarchy is all about? Each parent only cares about its direct children. On the system level, we care about the priority ordering of containers. Ordering within containers can be deferred to containers. > > > The way it works with our internal implementation is (imo) sensible > > and straightforward from a hierarchy POV. Starting at the OOM memcg > > (which can be root), we recursively compare the OOM scores of the > > children memcgs and pick the one with the lowest score, until we > > arrive at a leaf memcg. > > This approach has a strong requirement on the memcg hierarchy > organization. Siblings have to be directly comparable because you cut > off many potential sub-trees this way (e.g. is it easy to tell > whether you want to rule out all system or user slices?). > > I can imagine usecases where this could work reasonably well e.g. a set > of workers of a different priority all of them running under a shared > memcg parent. 
> But more more involved hierarchies seem more complex > because you always keep in mind how the hierarchy is organize to get to > your desired victim. I guess the main point is what I mentioned above: you don't need to manage the entire tree; containers can manage their subtrees. The most important thing is to provide the kernel with priority ordering among containers, and optionally priority ordering within a container (disregarding other containers). > > -- > Michal Hocko > SUSE Labs
On Tue 13-06-23 13:24:24, Yosry Ahmed wrote: > On Tue, Jun 13, 2023 at 5:06 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Tue 13-06-23 01:36:51, Yosry Ahmed wrote: > > > +David Rientjes > > > > > > On Tue, Jun 13, 2023 at 1:27 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Sun 04-06-23 01:25:42, Yosry Ahmed wrote: > > > > [...] > > > > > There has been a parallel discussion in the cover letter thread of v4 > > > > > [1]. To summarize, at Google, we have been using OOM scores to > > > > > describe different job priorities in a more explicit way -- regardless > > > > > of memory usage. It is strictly priority-based OOM killing. Ties are > > > > > broken based on memory usage. > > > > > > > > > > We understand that something like memory.oom.protect has an advantage > > > > > in the sense that you can skip killing a process if you know that it > > > > > won't free enough memory anyway, but for an environment where multiple > > > > > jobs of different priorities are running, we find it crucial to be > > > > > able to define strict ordering. Some jobs are simply more important > > > > > than others, regardless of their memory usage. > > > > > > > > I do remember that discussion. I am not a great fan of simple priority > > > > based interfaces TBH. It sounds as an easy interface but it hits > > > > complications as soon as you try to define a proper/sensible > > > > hierarchical semantic. I can see how they might work on leaf memcgs with > > > > statically assigned priorities but that sounds like a very narrow > > > > usecase IMHO. > > > > > > Do you mind elaborating the problem with the hierarchical semantics? > > > > Well, let me be more specific. If you have a simple hierarchical numeric > > enforcement (assume higher priority more likely to be chosen and the > > effective priority to be max(self, max(parents)) then the semantic > > itslef is straightforward. > > > > I am not really sure about the practical manageability though. I have > > hard time to imagine priority assignment on something like a shared > > workload with a more complex hierarchy. For example: > > root > > / | \ > > cont_A cont_B cont_C > > > > each container running its workload with own hierarchy structures that > > might be rather dynamic during the lifetime. In order to have a > > predictable OOM behavior you need to watch and reassign priorities all > > the time, no? > > In our case we don't really manage the entire hierarchy in a > centralized fashion. Each container gets a score based on their > relative priority, and each container is free to set scores within its > subcontainers if needed. Isn't this what the hierarchy is all about? > Each parent only cares about its direct children. On the system level, > we care about the priority ordering of containers. Ordering within > containers can be deferred to containers. This really depends on the workload. This might be working for your setup but as I've said above, many workloads would be struggling with re-prioritizing as soon as a new workload is started and oom priorities would need to be reorganized as a result. The setup is just too static to be generally useful IMHO. You can avoid that by essentially making mid-layers no priority and only rely on leaf memcgs when this would become more flexible. This is something even more complicated with the top-down approach. That being said, I can see workloads which could benefit from a priority (essentially user spaced controlled oom pre-selection) based policy. 
But there are many other policies like that which would be usecase specific and not generic enough, so I do not think this is worth a generic interface; such policies would fall under BPF or similar based mechanisms.
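To illustrate why no single in-kernel heuristic satisfies both camps, here is a small userspace model (plain C, every name in it invented, nothing here is kernel code or a proposed interface) of what a pluggable policy buys: the same candidates, two scoring callbacks, two different victims. A real version of this would sit behind a BPF hook or a similar attachment point, which is exactly the part that has never materialized.

    /* Same candidates, two policies, two different victims. */
    #include <stdio.h>

    struct candidate {
        const char *name;
        long usage;
        int expendability;          /* made-up per-cgroup knob, larger = kill sooner */
    };

    /* A "policy" is just a scoring callback: the highest score is killed. */
    typedef long (*oom_score_fn)(const struct candidate *c);

    static long default_score(const struct candidate *c)
    {
        return c->usage;                                      /* biggest consumer dies */
    }

    static long priority_score(const struct candidate *c)
    {
        return (long)c->expendability * 1000000 + c->usage;   /* strict priority, usage tie-break */
    }

    static const struct candidate *pick(const struct candidate *c, int n, oom_score_fn score)
    {
        const struct candidate *best = &c[0];
        for (int i = 1; i < n; i++)
            if (score(&c[i]) > score(best))
                best = &c[i];
        return best;
    }

    int main(void)
    {
        struct candidate cgs[] = {
            { "latency-critical", 900, 1 },
            { "batch",            300, 9 },
        };
        printf("default policy kills:  %s\n", pick(cgs, 2, default_score)->name);
        printf("priority policy kills: %s\n", pick(cgs, 2, priority_score)->name);
        return 0;
    }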
On Thu, Jun 15, 2023 at 3:39 AM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 13-06-23 13:24:24, Yosry Ahmed wrote: > > On Tue, Jun 13, 2023 at 5:06 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Tue 13-06-23 01:36:51, Yosry Ahmed wrote: > > > > +David Rientjes > > > > > > > > On Tue, Jun 13, 2023 at 1:27 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Sun 04-06-23 01:25:42, Yosry Ahmed wrote: > > > > > [...] > > > > > > There has been a parallel discussion in the cover letter thread of v4 > > > > > > [1]. To summarize, at Google, we have been using OOM scores to > > > > > > describe different job priorities in a more explicit way -- regardless > > > > > > of memory usage. It is strictly priority-based OOM killing. Ties are > > > > > > broken based on memory usage. > > > > > > > > > > > > We understand that something like memory.oom.protect has an advantage > > > > > > in the sense that you can skip killing a process if you know that it > > > > > > won't free enough memory anyway, but for an environment where multiple > > > > > > jobs of different priorities are running, we find it crucial to be > > > > > > able to define strict ordering. Some jobs are simply more important > > > > > > than others, regardless of their memory usage. > > > > > > > > > > I do remember that discussion. I am not a great fan of simple priority > > > > > based interfaces TBH. It sounds as an easy interface but it hits > > > > > complications as soon as you try to define a proper/sensible > > > > > hierarchical semantic. I can see how they might work on leaf memcgs with > > > > > statically assigned priorities but that sounds like a very narrow > > > > > usecase IMHO. > > > > > > > > Do you mind elaborating the problem with the hierarchical semantics? > > > > > > Well, let me be more specific. If you have a simple hierarchical numeric > > > enforcement (assume higher priority more likely to be chosen and the > > > effective priority to be max(self, max(parents)) then the semantic > > > itslef is straightforward. > > > > > > I am not really sure about the practical manageability though. I have > > > hard time to imagine priority assignment on something like a shared > > > workload with a more complex hierarchy. For example: > > > root > > > / | \ > > > cont_A cont_B cont_C > > > > > > each container running its workload with own hierarchy structures that > > > might be rather dynamic during the lifetime. In order to have a > > > predictable OOM behavior you need to watch and reassign priorities all > > > the time, no? > > > > In our case we don't really manage the entire hierarchy in a > > centralized fashion. Each container gets a score based on their > > relative priority, and each container is free to set scores within its > > subcontainers if needed. Isn't this what the hierarchy is all about? > > Each parent only cares about its direct children. On the system level, > > we care about the priority ordering of containers. Ordering within > > containers can be deferred to containers. > > This really depends on the workload. This might be working for your > setup but as I've said above, many workloads would be struggling with > re-prioritizing as soon as a new workload is started and oom priorities > would need to be reorganized as a result. The setup is just too static > to be generally useful IMHO. > You can avoid that by essentially making mid-layers no priority and only > rely on leaf memcgs when this would become more flexible. 
> This is > something even more complicated with the top-down approach. I agree that other setups may find it more difficult if one entity needs to manage the entire tree, although if the score range is large enough, I don't really think it's that static. When a new workload is started you decide what its priority is compared to the existing workloads and set its score as such. We use a range of scores from 0 to 10,000 (and it can easily be larger), so it's easy to assign new scores without reorganizing the existing scores. > > That being said, I can see workloads which could benefit from a > priority (essentially user spaced controlled oom pre-selection) based > policy. But there are many other policies like that that would be > usecase specific and not generic enough so I do not think this is worth > a generic interface and would fall into BPF or alike based policies. That's reasonable. I can't speak for other folks. Perhaps no single policy will be generic enough, and we should focus on enabling customized policies. Perhaps other userspace OOM agents can benefit from this as well. > > -- > Michal Hocko > SUSE Labs