| Message ID | 20200511225516.2431921-4-kuba@kernel.org (mailing list archive) |
|---|---|
| State | New, archived |
| Series | memcg: Slow down swap allocation as the available space gets depleted |
On Mon 11-05-20 15:55:16, Jakub Kicinski wrote: > Add a memory.swap.high knob, which can be used to protect the system > from SWAP exhaustion. The mechanism used for penelizing is similar > to memory.high penalty (sleep on return to user space), but with > a less steep slope. > > That is not to say that the knob itself is equivalent to memory.high. > The objective is more to protect the system from potentially buggy > tasks consuming a lot of swap and impacting other tasks, or even > bringing the whole system to stand still with complete SWAP > exhaustion. Hopefully without the need to find per-task hard > limits. > > Slowing misbehaving tasks down gradually allows user space oom > killers or other protection mechanisms to react. oomd and earlyoom > already do killing based on swap exhaustion, and memory.swap.high > protection will help implement such userspace oom policies more > reliably. Thanks for adding more information about the usecase and motivation. > > Use one counter for number of pages allocated under pressure > to save struct task space and avoid two separate hierarchy > walks on the hot path. > > Use swap.high when deciding if swap is full. Please be more specific why. > Perform reclaim and count memory over high events. Please expand on this and explain how this is working and why the semantic is subtly different from MEMCG_HIGH. I suspect the reason is that there is no reclaim for the swap so you are only emitting an event on the memcg which is actually throttled. This is in line with memory.high but the difference is that we do reclaim each memcg subtree in the high limit excess. That means that the counter tells us how many times the specific memcg was in excess which would be impossible with your implementation. I would also suggest to explain or ideally even separate the swap penalty scaling logic to a seprate patch. What kind of data it is based on? > Signed-off-by: Jakub Kicinski <kuba@kernel.org> > -- > v2: > - add docs, > - improve commit message. > --- > Documentation/admin-guide/cgroup-v2.rst | 16 +++++ > include/linux/memcontrol.h | 4 ++ > mm/memcontrol.c | 94 +++++++++++++++++++++++-- > 3 files changed, 107 insertions(+), 7 deletions(-) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index 5f12f203822e..c60226daa193 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1374,6 +1374,22 @@ PAGE_SIZE multiple when read back. > The total amount of swap currently being used by the cgroup > and its descendants. > > + memory.swap.high > + A read-write single value file which exists on non-root > + cgroups. The default is "max". > + > + Swap usage throttle limit. If a cgroup's swap usage exceeds > + this limit, all its further allocations will be throttled to > + allow userspace to implement custom out-of-memory procedures. > + > + This limit marks a point of no return for the cgroup. It is NOT > + designed to manage the amount of swapping a workload does > + during regular operation. Compare to memory.swap.max, which > + prohibits swapping past a set amount, but lets the cgroup > + continue unimpeded as long as other memory can be reclaimed. > + > + Healthy workloads are not expected to reach this limit. > + > memory.swap.max > A read-write single value file which exists on non-root > cgroups. The default is "max". 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index b478a4e83297..882bda952a5c 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -45,6 +45,7 @@ enum memcg_memory_event { > MEMCG_MAX, > MEMCG_OOM, > MEMCG_OOM_KILL, > + MEMCG_SWAP_HIGH, > MEMCG_SWAP_MAX, > MEMCG_SWAP_FAIL, > MEMCG_NR_MEMORY_EVENTS, > @@ -212,6 +213,9 @@ struct mem_cgroup { > /* Upper bound of normal memory consumption range */ > unsigned long high; > > + /* Upper bound of swap consumption range */ > + unsigned long swap_high; > + > /* Range enforcement for interrupt charges */ > struct work_struct high_work; > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 66dd87bb9e0f..a3d13b30e3d6 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2353,12 +2353,34 @@ static u64 mem_find_max_overage(struct mem_cgroup *memcg) > return max_overage; > } > > +static u64 swap_find_max_overage(struct mem_cgroup *memcg) > +{ > + u64 overage, max_overage = 0; > + struct mem_cgroup *max_cg; > + > + do { > + overage = calculate_overage(page_counter_read(&memcg->swap), > + READ_ONCE(memcg->swap_high)); > + if (overage > max_overage) { > + max_overage = overage; > + max_cg = memcg; > + } > + } while ((memcg = parent_mem_cgroup(memcg)) && > + !mem_cgroup_is_root(memcg)); > + > + if (max_overage) > + memcg_memory_event(max_cg, MEMCG_SWAP_HIGH); > + > + return max_overage; > +} > + > /* > * Get the number of jiffies that we should penalise a mischievous cgroup which > * is exceeding its memory.high by checking both it and its ancestors. > */ > static unsigned long calculate_high_delay(struct mem_cgroup *memcg, > unsigned int nr_pages, > + unsigned char cost_shift, > u64 max_overage) > { > unsigned long penalty_jiffies; > @@ -2366,6 +2388,9 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg, > if (!max_overage) > return 0; > > + if (cost_shift) > + max_overage >>= cost_shift; > + > /* > * We use overage compared to memory.high to calculate the number of > * jiffies to sleep (penalty_jiffies). Ideally this value should be > @@ -2411,9 +2436,16 @@ void mem_cgroup_handle_over_high(void) > * memory.high is breached and reclaim is unable to keep up. Throttle > * allocators proactively to slow down excessive growth. > */ > - penalty_jiffies = calculate_high_delay(memcg, nr_pages, > + penalty_jiffies = calculate_high_delay(memcg, nr_pages, 0, > mem_find_max_overage(memcg)); > > + /* > + * Make the swap curve more gradual, swap can be considered "cheaper", > + * and is allocated in larger chunks. We want the delays to be gradual. > + */ > + penalty_jiffies += calculate_high_delay(memcg, nr_pages, 2, > + swap_find_max_overage(memcg)); > + > /* > * Clamp the max delay per usermode return so as to still keep the > * application moving forwards and also permit diagnostics, albeit > @@ -2604,12 +2636,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, > * reclaim, the cost of mismatch is negligible. 
> */ > do { > - if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->high)) { > - /* Don't bother a random interrupted task */ > - if (in_interrupt()) { > + bool mem_high, swap_high; > + > + mem_high = page_counter_read(&memcg->memory) > > + READ_ONCE(memcg->high); > + swap_high = page_counter_read(&memcg->swap) > > + READ_ONCE(memcg->swap_high); > + > + /* Don't bother a random interrupted task */ > + if (in_interrupt()) { > + if (mem_high) { > schedule_work(&memcg->high_work); > break; > } > + continue; > + } > + > + if (mem_high || swap_high) { > current->memcg_nr_pages_over_high += batch; > set_notify_resume(current); > break; > @@ -5076,6 +5119,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) > > WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX); > memcg->soft_limit = PAGE_COUNTER_MAX; > + WRITE_ONCE(memcg->swap_high, PAGE_COUNTER_MAX); > if (parent) { > memcg->swappiness = mem_cgroup_swappiness(parent); > memcg->oom_kill_disable = parent->oom_kill_disable; > @@ -5229,6 +5273,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css) > page_counter_set_low(&memcg->memory, 0); > WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX); > memcg->soft_limit = PAGE_COUNTER_MAX; > + WRITE_ONCE(memcg->swap_high, PAGE_COUNTER_MAX); > memcg_wb_domain_size_changed(memcg); > } > > @@ -7136,10 +7181,13 @@ bool mem_cgroup_swap_full(struct page *page) > if (!memcg) > return false; > > - for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) > - if (page_counter_read(&memcg->swap) * 2 >= > - READ_ONCE(memcg->swap.max)) > + for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) { > + unsigned long usage = page_counter_read(&memcg->swap); > + > + if (usage * 2 >= READ_ONCE(memcg->swap_high) || > + usage * 2 >= READ_ONCE(memcg->swap.max)) > return true; > + } > > return false; > } > @@ -7169,6 +7217,30 @@ static u64 swap_current_read(struct cgroup_subsys_state *css, > return (u64)page_counter_read(&memcg->swap) * PAGE_SIZE; > } > > +static int swap_high_show(struct seq_file *m, void *v) > +{ > + unsigned long high = READ_ONCE(mem_cgroup_from_seq(m)->swap_high); > + > + return seq_puts_memcg_tunable(m, high); > +} > + > +static ssize_t swap_high_write(struct kernfs_open_file *of, > + char *buf, size_t nbytes, loff_t off) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > + unsigned long high; > + int err; > + > + buf = strstrip(buf); > + err = page_counter_memparse(buf, "max", &high); > + if (err) > + return err; > + > + WRITE_ONCE(memcg->swap_high, high); > + > + return nbytes; > +} > + > static int swap_max_show(struct seq_file *m, void *v) > { > return seq_puts_memcg_tunable(m, > @@ -7196,6 +7268,8 @@ static int swap_events_show(struct seq_file *m, void *v) > { > struct mem_cgroup *memcg = mem_cgroup_from_seq(m); > > + seq_printf(m, "high %lu\n", > + atomic_long_read(&memcg->memory_events[MEMCG_SWAP_HIGH])); > seq_printf(m, "max %lu\n", > atomic_long_read(&memcg->memory_events[MEMCG_SWAP_MAX])); > seq_printf(m, "fail %lu\n", > @@ -7210,6 +7284,12 @@ static struct cftype swap_files[] = { > .flags = CFTYPE_NOT_ON_ROOT, > .read_u64 = swap_current_read, > }, > + { > + .name = "swap.high", > + .flags = CFTYPE_NOT_ON_ROOT, > + .seq_show = swap_high_show, > + .write = swap_high_write, > + }, > { > .name = "swap.max", > .flags = CFTYPE_NOT_ON_ROOT, > -- > 2.25.4
On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote: > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote: > > Use swap.high when deciding if swap is full. > > Please be more specific why. How about: Use swap.high when deciding if swap is full to influence ongoing swap reclaim in a best effort manner. > > Perform reclaim and count memory over high events. > > Please expand on this and explain how this is working and why the > semantic is subtly different from MEMCG_HIGH. I suspect the reason > is that there is no reclaim for the swap so you are only emitting an > event on the memcg which is actually throttled. This is in line with > memory.high but the difference is that we do reclaim each memcg subtree > in the high limit excess. That means that the counter tells us how many > times the specific memcg was in excess which would be impossible with > your implementation. Right, with memory all cgroups over high get penalized with the extra reclaim work. For swap we just have the delay, so the event is associated with the worst offender; anything lower didn't really matter. But it's easy enough to change if you prefer. Otherwise I'll just add this to the commit message: Count swap over high events. Note that unlike memory over high events we only count them for the worst offender. This is because the delay penalties for both swap and memory over high are not cumulative, i.e. we use the max delay. > I would also suggest to explain or ideally even separate the swap > penalty scaling logic to a seprate patch. What kind of data it is based > on? It's a hard thing to get production data for since, as we mentioned, we don't expect the limit to be hit. It was more of a process of experimentation and finding a gradual slope that "felt right"... Is there a more scientific process we can follow here? We want the delay to be small for the first few pages and then grow to make sure we stop the task from going too much over high. The square function works pretty well IMHO.
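The shape of that "square function" is easier to see with concrete numbers. The following is a minimal, self-contained userspace model of the delay curve under discussion, not the kernel code itself: the precision/scaling shifts (20 and 14), HZ = 1000, and the 1 GiB high limit are assumptions chosen to mirror the existing memory.high throttling of this era, and cost_shift = 2 stands for the gentler swap curve from the patch.

```c
/*
 * Editorial sketch, not kernel code: a userspace model of the quadratic
 * overage -> delay curve discussed above.  The shift constants are assumed
 * to match the memory.high throttling code of this era; HZ = 1000 and the
 * 1 GiB high limit are made-up values for illustration.
 */
#include <stdio.h>
#include <stdint.h>

#define HZ                          1000
#define MEMCG_CHARGE_BATCH          32
#define MEMCG_DELAY_PRECISION_SHIFT 20
#define MEMCG_DELAY_SCALING_SHIFT   14

/* Fixed-point overage: (usage - high) / high, scaled by 2^20. */
static uint64_t overage(uint64_t usage, uint64_t high)
{
	if (usage <= high)
		return 0;
	return ((usage - high) << MEMCG_DELAY_PRECISION_SHIFT) / high;
}

/* Quadratic penalty in jiffies; cost_shift = 2 models the gentler swap curve. */
static uint64_t delay(uint64_t usage, uint64_t high,
		      unsigned int nr_pages, unsigned int cost_shift)
{
	uint64_t ov = overage(usage, high) >> cost_shift;
	uint64_t j = ov * ov * HZ;

	j >>= MEMCG_DELAY_PRECISION_SHIFT;
	j >>= MEMCG_DELAY_SCALING_SHIFT;
	return j * nr_pages / MEMCG_CHARGE_BATCH;
}

int main(void)
{
	uint64_t high = 256 * 1024;	/* 1 GiB worth of 4k pages */

	for (int pct = 5; pct <= 40; pct += 5) {
		uint64_t usage = high + high * pct / 100;

		printf("%3d%% over: mem curve %4llu jiffies, swap curve %4llu jiffies\n",
		       pct,
		       (unsigned long long)delay(usage, high, 32, 0),
		       (unsigned long long)delay(usage, high, 32, 2));
	}
	return 0;
}
```

With these numbers, a 10% excess over the high limit maps to roughly 640 jiffies on the memory curve and about 40 on the swap curve, i.e. the cost_shift of 2 flattens the quadratic by a factor of 16; the kernel additionally clamps the total delay per return to user space.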
On Tue 12-05-20 10:55:36, Jakub Kicinski wrote: > On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote: > > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote: > > > Use swap.high when deciding if swap is full. > > > > Please be more specific why. > > How about: > > Use swap.high when deciding if swap is full to influence ongoing > swap reclaim in a best effort manner. This is still way too vague. The crux is why should we treat hard and high swap limit the same for mem_cgroup_swap_full purpose. Please note that I am not saying this is wrong. I am asking for a more detailed explanation mostly because I would bet that somebody stumbles over this sooner or later. mem_cgroup_swap_full is an odd predicate. It doesn't really want to tell that the swap is really full. I haven't studied the original intention but it is more in line of mem_cgroup_swap_under_pressure based on the current usage to (attempt) scale swap cache size. > > > Perform reclaim and count memory over high events. > > > > Please expand on this and explain how this is working and why the > > semantic is subtly different from MEMCG_HIGH. I suspect the reason > > is that there is no reclaim for the swap so you are only emitting an > > event on the memcg which is actually throttled. This is in line with > > memory.high but the difference is that we do reclaim each memcg subtree > > in the high limit excess. That means that the counter tells us how many > > times the specific memcg was in excess which would be impossible with > > your implementation. > > Right, with memory all cgroups over high get penalized with the extra > reclaim work. For swap we just have the delay, so the event is > associated with the worst offender, anything lower didn't really matter. > > But it's easy enough to change if you prefer. Otherwise I'll just add > this to the commit message: > > Count swap over high events. Note that unlike memory over high events > we only count them for the worst offender. This is because the > delay penalties for both swap and memory over high are not cumulative, > i.e. we use the max delay. Well, memory high penalty is in fact cumulative, because the reclaim would happen for each memcg subtree up the hierarchy. Sure the additional throttling is not cumulative but that is not really that important because the exact amount of throttling is an implementation detail. The swap high is an odd one here because we do not reclaim swap so the cumulative effect of that is 0 and there is only the additional throttling happening. I suspect that your current implementation is exposing an internal implementation to the userspace but considering how the current memory high event is documented high The number of times processes of the cgroup are throttled and routed to perform direct memory reclaim because the high memory boundary was exceeded. For a cgroup whose memory usage is capped by the high limit rather than global memory pressure, this event's occurrences are expected. it talks about throttling rather than excess (like max) so I am not really sure. I believe that it would be much better if both events were more explicit about counting an excess and a throttling is just a side effect of that situation. I do not expect that we will have any form of the swap reclaim anytime soon (if ever) but I fail to see why to creat a small little trap like this now. > > I would also suggest to explain or ideally even separate the swap > > penalty scaling logic to a seprate patch. What kind of data it is based > > on? 
> > It's a hard thing to get production data for since, as we mentioned we > don't expect the limit to be hit. It was more of a process of > experimentation and finding a gradual slope that "felt right"... > > Is there a more scientific process we can follow here? We want the > delay to be small at first for a first few pages and then grow to make > sure we stop the task from going too much over high. The square > function works pretty well IMHO. If there is no data showing this to be an improvement, then I would just not add an additional scaling factor. Why? Mostly because once we have it there, it would be extremely hard to change. MM is full of these little heuristics that are copied over because nobody dares to touch them. If a different scaling is really needed it can always be added later with some data to back that.
On Mon 11-05-20 15:55:16, Jakub Kicinski wrote: > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index 5f12f203822e..c60226daa193 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1374,6 +1374,22 @@ PAGE_SIZE multiple when read back. > The total amount of swap currently being used by the cgroup > and its descendants. > > + memory.swap.high > + A read-write single value file which exists on non-root > + cgroups. The default is "max". > + > + Swap usage throttle limit. If a cgroup's swap usage exceeds > + this limit, all its further allocations will be throttled to > + allow userspace to implement custom out-of-memory procedures. > + > + This limit marks a point of no return for the cgroup. It is NOT > + designed to manage the amount of swapping a workload does > + during regular operation. Compare to memory.swap.max, which > + prohibits swapping past a set amount, but lets the cgroup > + continue unimpeded as long as other memory can be reclaimed. > + > + Healthy workloads are not expected to reach this limit. > + Btw. I forgot to mention that before but you should also add a documentation for the swap high event to this file.
On Wed, 13 May 2020 10:32:49 +0200 Michal Hocko wrote: > On Tue 12-05-20 10:55:36, Jakub Kicinski wrote: > > On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote: > > > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote: > > > > Use swap.high when deciding if swap is full. > > > > > > Please be more specific why. > > > > How about: > > > > Use swap.high when deciding if swap is full to influence ongoing > > swap reclaim in a best effort manner. > > This is still way too vague. The crux is why should we treat hard and > high swap limit the same for mem_cgroup_swap_full purpose. Please > note that I am not saying this is wrong. I am asking for a more > detailed explanation mostly because I would bet that somebody > stumbles over this sooner or later. Stumbles in what way? Isn't it expected for the kernel to take reasonable precautions to avoid hitting limits? We expect the application which breaches swap.high to get terminated by user space OOM, kernel best be careful about approaching that limit, no? > mem_cgroup_swap_full is an odd predicate. It doesn't really want to > tell that the swap is really full. I haven't studied the original > intention but it is more in line of mem_cgroup_swap_under_pressure > based on the current usage to (attempt) scale swap cache size. Perhaps Johannes has some experience here? The 50% means full heuristic predates git :( > > > > Perform reclaim and count memory over high events. > > > > > > Please expand on this and explain how this is working and why the > > > semantic is subtly different from MEMCG_HIGH. I suspect the reason > > > is that there is no reclaim for the swap so you are only emitting > > > an event on the memcg which is actually throttled. This is in > > > line with memory.high but the difference is that we do reclaim > > > each memcg subtree in the high limit excess. That means that the > > > counter tells us how many times the specific memcg was in excess > > > which would be impossible with your implementation. > > > > Right, with memory all cgroups over high get penalized with the > > extra reclaim work. For swap we just have the delay, so the event is > > associated with the worst offender, anything lower didn't really > > matter. > > > > But it's easy enough to change if you prefer. Otherwise I'll just > > add this to the commit message: > > > > Count swap over high events. Note that unlike memory over high > > events we only count them for the worst offender. This is because > > the delay penalties for both swap and memory over high are not > > cumulative, i.e. we use the max delay. > > Well, memory high penalty is in fact cumulative, because the reclaim > would happen for each memcg subtree up the hierarchy. Sure the > additional throttling is not cumulative but that is not really that > important because the exact amount of throttling is an implementation > detail. The swap high is an odd one here because we do not reclaim > swap so the cumulative effect of that is 0 and there is only the > additional throttling happening. I suspect that your current > implementation is exposing an internal implementation to the > userspace but considering how the current memory high event is > documented high > The number of times processes of the cgroup are > throttled and routed to perform direct memory reclaim > because the high memory boundary was exceeded. For a > cgroup whose memory usage is capped by the high limit > rather than global memory pressure, this event's > occurrences are expected. 
> > it talks about throttling rather than excess (like max) so I am not > really sure. I believe that it would be much better if both events > were more explicit about counting an excess and a throttling is just > a side effect of that situation. > > I do not expect that we will have any form of the swap reclaim anytime > soon (if ever) but I fail to see why to creat a small little trap like > this now. Right, let me adjust then. > > > I would also suggest to explain or ideally even separate the swap > > > penalty scaling logic to a seprate patch. What kind of data it is > > > based on? > > > > It's a hard thing to get production data for since, as we mentioned > > we don't expect the limit to be hit. It was more of a process of > > experimentation and finding a gradual slope that "felt right"... > > > > Is there a more scientific process we can follow here? We want the > > delay to be small at first for a first few pages and then grow to > > make sure we stop the task from going too much over high. The square > > function works pretty well IMHO. > > If there is no data to showing this to be an improvement then I would > just not add an additional scaling factor. Why? Mostly because once we > have it there it would be extremely hard to change. MM is full of > these little heuristics that are copied over because nobody dares to > touch them. If a different scaling is really needed it can always be > added later with some data to back that. Oh, I misunderstood the question, you were asking about the scaling factor... The allocation of swap is in larger batches, according to my tests, example below (AR = after reclaim; swap overage changes after memory reclaim; each line reads: sleep (mem pages over_high / swap pages over at call -> swap pages over_high AR) [mem overage AR | swap overage AR] mem sleep + swap sleep):

[ 73.360533] sleep (32/10->67) [-35|13379] 0+253
[ 73.631291] sleep (32/ 3->54) [-18|13430] 0+205
[ 73.851629] sleep (32/22->35) [-20|13443] 0+133
[ 74.021396] sleep (32/ 3->60) [-29|13500] 0+230
[ 74.263355] sleep (32/28->79) [-44|13551] 0+306
[ 74.585689] sleep (32/29->91) [-17|13627] 0+355
[ 74.958675] sleep (32/27->79) [-31|13679] 0+311
[ 75.293021] sleep (32/29->86) [ -9|13750] 0+344
[ 75.654218] sleep (32/22->72) [-24|13800] 0+290
[ 75.962467] sleep (32/22->73) [-39|13865] 0+296

That's for a process slowly leaking memory. Swap gets over the high by about 2.5x MEMCG_CHARGE_BATCH on average. Hence to keep the same slope I was trying to scale it back.
On Wed 13-05-20 11:36:23, Jakub Kicinski wrote: > On Wed, 13 May 2020 10:32:49 +0200 Michal Hocko wrote: > > On Tue 12-05-20 10:55:36, Jakub Kicinski wrote: > > > On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote: > > > > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote: > > > > > Use swap.high when deciding if swap is full. > > > > > > > > Please be more specific why. > > > > > > How about: > > > > > > Use swap.high when deciding if swap is full to influence ongoing > > > swap reclaim in a best effort manner. > > > > This is still way too vague. The crux is why should we treat hard and > > high swap limit the same for mem_cgroup_swap_full purpose. Please > > note that I am not saying this is wrong. I am asking for a more > > detailed explanation mostly because I would bet that somebody > > stumbles over this sooner or later. > > Stumbles in what way? Reading the code and trying to understand why this particular decision has been made. Because it might be surprising that the hard and high limits are treated same here. > Isn't it expected for the kernel to take reasonable precautions to > avoid hitting limits? Isn't the throttling itself the precautious? How does the swap cache and its control via mem_cgroup_swap_full interact here. See? This is what I am asking to have explained in the changelog. [...] > > > > I would also suggest to explain or ideally even separate the swap > > > > penalty scaling logic to a seprate patch. What kind of data it is > > > > based on? > > > > > > It's a hard thing to get production data for since, as we mentioned > > > we don't expect the limit to be hit. It was more of a process of > > > experimentation and finding a gradual slope that "felt right"... > > > > > > Is there a more scientific process we can follow here? We want the > > > delay to be small at first for a first few pages and then grow to > > > make sure we stop the task from going too much over high. The square > > > function works pretty well IMHO. > > > > If there is no data to showing this to be an improvement then I would > > just not add an additional scaling factor. Why? Mostly because once we > > have it there it would be extremely hard to change. MM is full of > > these little heuristics that are copied over because nobody dares to > > touch them. If a different scaling is really needed it can always be > > added later with some data to back that. > > Oh, I misunderstood the question, you were asking about the scaling > factor.. The allocation of swap is in larger batches, according to > my tests, example below (AR - after reclaim, swap overage changes > after memory reclaim). > mem overage AR > swap pages over_high AR | swap overage AR > swap pages over at call. \ | | . mem sleep > mem pages over_high. \ \ | | / . swap sleep > v v v v v v v > [ 73.360533] sleep (32/10->67) [-35|13379] 0+253 > [ 73.631291] sleep (32/ 3->54) [-18|13430] 0+205 > [ 73.851629] sleep (32/22->35) [-20|13443] 0+133 > [ 74.021396] sleep (32/ 3->60) [-29|13500] 0+230 > [ 74.263355] sleep (32/28->79) [-44|13551] 0+306 > [ 74.585689] sleep (32/29->91) [-17|13627] 0+355 > [ 74.958675] sleep (32/27->79) [-31|13679] 0+311 > [ 75.293021] sleep (32/29->86) [ -9|13750] 0+344 > [ 75.654218] sleep (32/22->72) [-24|13800] 0+290 > [ 75.962467] sleep (32/22->73) [-39|13865] 0+296 > > That's for a process slowly leaking memory. Swap gets over the high by > about 2.5x MEMCG_CHARGE_BATCH on average. Hence to keep the same slope > I was trying to scale it back. 
> > But you make a fair point, someone more knowledgeable can add the > heuristic later if it's really needed. Or just make it a separate patch with all that information. This would allow anybody touching that code in the future to understand the initial motivation. I am still not sure this scaling is a good fit in general (e.g. how does it work with THP swapping?), but that can be discussed separately.
On Thu, May 14, 2020 at 09:42:46AM +0200, Michal Hocko wrote: > On Wed 13-05-20 11:36:23, Jakub Kicinski wrote: > > On Wed, 13 May 2020 10:32:49 +0200 Michal Hocko wrote: > > > On Tue 12-05-20 10:55:36, Jakub Kicinski wrote: > > > > On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote: > > > > > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote: > > > > > > Use swap.high when deciding if swap is full. > > > > > > > > > > Please be more specific why. > > > > > > > > How about: > > > > > > > > Use swap.high when deciding if swap is full to influence ongoing > > > > swap reclaim in a best effort manner. > > > > > > This is still way too vague. The crux is why should we treat hard and > > > high swap limit the same for mem_cgroup_swap_full purpose. Please > > > note that I am not saying this is wrong. I am asking for a more > > > detailed explanation mostly because I would bet that somebody > > > stumbles over this sooner or later. > > > > Stumbles in what way? > > Reading the code and trying to understand why this particular decision > has been made. Because it might be surprising that the hard and high > limits are treated same here. I don't quite understand the controversy. The idea behind "swap full" is that as long as the workload has plenty of swap space available and it's not changing its memory contents, it makes sense to generously hold on to copies of data in the swap device, even after the swapin. A later reclaim cycle can drop the page without any IO. Trading disk space for IO. But the only two ways to reclaim a swap slot is when they're faulted in and the references go away, or by scanning the virtual address space like swapoff does - which is very expensive (one could argue it's too expensive even for swapoff, it's often more practical to just reboot). So at some point in the fill level, we have to start freeing up swap slots on fault/swapin. Otherwise we could eventually run out of swap slots while they're filled with copies of data that is also in RAM. We don't want to OOM a workload because its available swap space is filled with redundant cache. That applies to physical swap limits, swap.max, and naturally also to swap.high which is a limit to implement userspace OOM for swap space exhaustion. > > Isn't it expected for the kernel to take reasonable precautions to > > avoid hitting limits? > > Isn't the throttling itself the precautious? How does the swap cache > and its control via mem_cgroup_swap_full interact here. See? This is > what I am asking to have explained in the changelog. It sounds like we need better documentation of what vm_swap_full() and friends are there for. It should have been obvious why swap.high - a limit on available swap space - hooks into it.
On Thu 14-05-20 16:21:30, Johannes Weiner wrote: > On Thu, May 14, 2020 at 09:42:46AM +0200, Michal Hocko wrote: > > On Wed 13-05-20 11:36:23, Jakub Kicinski wrote: > > > On Wed, 13 May 2020 10:32:49 +0200 Michal Hocko wrote: > > > > On Tue 12-05-20 10:55:36, Jakub Kicinski wrote: > > > > > On Tue, 12 May 2020 09:26:34 +0200 Michal Hocko wrote: > > > > > > On Mon 11-05-20 15:55:16, Jakub Kicinski wrote: > > > > > > > Use swap.high when deciding if swap is full. > > > > > > > > > > > > Please be more specific why. > > > > > > > > > > How about: > > > > > > > > > > Use swap.high when deciding if swap is full to influence ongoing > > > > > swap reclaim in a best effort manner. > > > > > > > > This is still way too vague. The crux is why should we treat hard and > > > > high swap limit the same for mem_cgroup_swap_full purpose. Please > > > > note that I am not saying this is wrong. I am asking for a more > > > > detailed explanation mostly because I would bet that somebody > > > > stumbles over this sooner or later. > > > > > > Stumbles in what way? > > > > Reading the code and trying to understand why this particular decision > > has been made. Because it might be surprising that the hard and high > > limits are treated same here. > > I don't quite understand the controversy. I do not think there is any controversy. All I am asking for is a clarification because this is non-intuitive. > The idea behind "swap full" is that as long as the workload has plenty > of swap space available and it's not changing its memory contents, it > makes sense to generously hold on to copies of data in the swap > device, even after the swapin. A later reclaim cycle can drop the page > without any IO. Trading disk space for IO. > > But the only two ways to reclaim a swap slot is when they're faulted > in and the references go away, or by scanning the virtual address space > like swapoff does - which is very expensive (one could argue it's too > expensive even for swapoff, it's often more practical to just reboot). > > So at some point in the fill level, we have to start freeing up swap > slots on fault/swapin. Otherwise we could eventually run out of swap > slots while they're filled with copies of data that is also in RAM. > > We don't want to OOM a workload because its available swap space is > filled with redundant cache. Thanks this is a useful summary. > That applies to physical swap limits, swap.max, and naturally also to > swap.high which is a limit to implement userspace OOM for swap space > exhaustion. > > > > Isn't it expected for the kernel to take reasonable precautions to > > > avoid hitting limits? > > > > Isn't the throttling itself the precautious? How does the swap cache > > and its control via mem_cgroup_swap_full interact here. See? This is > > what I am asking to have explained in the changelog. > > It sounds like we need better documentation of what vm_swap_full() and > friends are there for. It should have been obvious why swap.high - a > limit on available swap space - hooks into it. Agreed. The primary source for a confusion is the naming here. Because vm_swap_full doesn't really try to tell that the swap is full. It merely tries to tell that it is getting full and so duplicated data should be dropped.
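To illustrate the point that mem_cgroup_swap_full() is a "getting full" heuristic rather than a statement that swap is exhausted, here is a minimal userspace model of the hierarchical check the patch touches. The 50% threshold and the comparison against both swap.high and swap.max follow the patch hunk; the struct layout and the example numbers are invented for illustration.

```c
/* Minimal userspace model of the patch's mem_cgroup_swap_full() walk. */
#include <stdbool.h>
#include <stdio.h>

struct cg {
	struct cg *parent;		/* NULL stands in for the root cgroup */
	unsigned long swap_pages;	/* current swap usage, in pages */
	unsigned long swap_high;	/* memory.swap.high, in pages */
	unsigned long swap_max;		/* memory.swap.max, in pages */
};

/*
 * "Getting full": usage has crossed half of either swap limit anywhere
 * up the hierarchy, mirroring the usage * 2 >= limit checks in the patch.
 */
static bool swap_getting_full(const struct cg *cg)
{
	for (; cg; cg = cg->parent) {
		unsigned long usage = cg->swap_pages;

		if (usage * 2 >= cg->swap_high || usage * 2 >= cg->swap_max)
			return true;
	}
	return false;
}

int main(void)
{
	struct cg parent = { .swap_pages = 100, .swap_high = 400, .swap_max = 1000 };
	struct cg child  = { .parent = &parent, .swap_pages = 90,
			     .swap_high = 160, .swap_max = 1000 };

	/* child is at 90/160 of its swap.high, i.e. past 50%: "getting full" */
	printf("child:  %s\n", swap_getting_full(&child) ? "getting full" : "ok");
	/* parent on its own is only at 100/400: fine */
	printf("parent: %s\n", swap_getting_full(&parent) ? "getting full" : "ok");
	return 0;
}
```

Crossing half of either limit anywhere up the tree is enough for swapin to start dropping the duplicate on-disk copy, which is the "trade disk space for IO until space gets tight" behaviour described above.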
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 5f12f203822e..c60226daa193 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1374,6 +1374,22 @@ PAGE_SIZE multiple when read back. The total amount of swap currently being used by the cgroup and its descendants. + memory.swap.high + A read-write single value file which exists on non-root + cgroups. The default is "max". + + Swap usage throttle limit. If a cgroup's swap usage exceeds + this limit, all its further allocations will be throttled to + allow userspace to implement custom out-of-memory procedures. + + This limit marks a point of no return for the cgroup. It is NOT + designed to manage the amount of swapping a workload does + during regular operation. Compare to memory.swap.max, which + prohibits swapping past a set amount, but lets the cgroup + continue unimpeded as long as other memory can be reclaimed. + + Healthy workloads are not expected to reach this limit. + memory.swap.max A read-write single value file which exists on non-root cgroups. The default is "max". diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b478a4e83297..882bda952a5c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -45,6 +45,7 @@ enum memcg_memory_event { MEMCG_MAX, MEMCG_OOM, MEMCG_OOM_KILL, + MEMCG_SWAP_HIGH, MEMCG_SWAP_MAX, MEMCG_SWAP_FAIL, MEMCG_NR_MEMORY_EVENTS, @@ -212,6 +213,9 @@ struct mem_cgroup { /* Upper bound of normal memory consumption range */ unsigned long high; + /* Upper bound of swap consumption range */ + unsigned long swap_high; + /* Range enforcement for interrupt charges */ struct work_struct high_work; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 66dd87bb9e0f..a3d13b30e3d6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2353,12 +2353,34 @@ static u64 mem_find_max_overage(struct mem_cgroup *memcg) return max_overage; } +static u64 swap_find_max_overage(struct mem_cgroup *memcg) +{ + u64 overage, max_overage = 0; + struct mem_cgroup *max_cg; + + do { + overage = calculate_overage(page_counter_read(&memcg->swap), + READ_ONCE(memcg->swap_high)); + if (overage > max_overage) { + max_overage = overage; + max_cg = memcg; + } + } while ((memcg = parent_mem_cgroup(memcg)) && + !mem_cgroup_is_root(memcg)); + + if (max_overage) + memcg_memory_event(max_cg, MEMCG_SWAP_HIGH); + + return max_overage; +} + /* * Get the number of jiffies that we should penalise a mischievous cgroup which * is exceeding its memory.high by checking both it and its ancestors. */ static unsigned long calculate_high_delay(struct mem_cgroup *memcg, unsigned int nr_pages, + unsigned char cost_shift, u64 max_overage) { unsigned long penalty_jiffies; @@ -2366,6 +2388,9 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg, if (!max_overage) return 0; + if (cost_shift) + max_overage >>= cost_shift; + /* * We use overage compared to memory.high to calculate the number of * jiffies to sleep (penalty_jiffies). Ideally this value should be @@ -2411,9 +2436,16 @@ void mem_cgroup_handle_over_high(void) * memory.high is breached and reclaim is unable to keep up. Throttle * allocators proactively to slow down excessive growth. */ - penalty_jiffies = calculate_high_delay(memcg, nr_pages, + penalty_jiffies = calculate_high_delay(memcg, nr_pages, 0, mem_find_max_overage(memcg)); + /* + * Make the swap curve more gradual, swap can be considered "cheaper", + * and is allocated in larger chunks. 
We want the delays to be gradual. + */ + penalty_jiffies += calculate_high_delay(memcg, nr_pages, 2, + swap_find_max_overage(memcg)); + /* * Clamp the max delay per usermode return so as to still keep the * application moving forwards and also permit diagnostics, albeit @@ -2604,12 +2636,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, * reclaim, the cost of mismatch is negligible. */ do { - if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->high)) { - /* Don't bother a random interrupted task */ - if (in_interrupt()) { + bool mem_high, swap_high; + + mem_high = page_counter_read(&memcg->memory) > + READ_ONCE(memcg->high); + swap_high = page_counter_read(&memcg->swap) > + READ_ONCE(memcg->swap_high); + + /* Don't bother a random interrupted task */ + if (in_interrupt()) { + if (mem_high) { schedule_work(&memcg->high_work); break; } + continue; + } + + if (mem_high || swap_high) { current->memcg_nr_pages_over_high += batch; set_notify_resume(current); break; @@ -5076,6 +5119,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX); memcg->soft_limit = PAGE_COUNTER_MAX; + WRITE_ONCE(memcg->swap_high, PAGE_COUNTER_MAX); if (parent) { memcg->swappiness = mem_cgroup_swappiness(parent); memcg->oom_kill_disable = parent->oom_kill_disable; @@ -5229,6 +5273,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css) page_counter_set_low(&memcg->memory, 0); WRITE_ONCE(memcg->high, PAGE_COUNTER_MAX); memcg->soft_limit = PAGE_COUNTER_MAX; + WRITE_ONCE(memcg->swap_high, PAGE_COUNTER_MAX); memcg_wb_domain_size_changed(memcg); } @@ -7136,10 +7181,13 @@ bool mem_cgroup_swap_full(struct page *page) if (!memcg) return false; - for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) - if (page_counter_read(&memcg->swap) * 2 >= - READ_ONCE(memcg->swap.max)) + for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) { + unsigned long usage = page_counter_read(&memcg->swap); + + if (usage * 2 >= READ_ONCE(memcg->swap_high) || + usage * 2 >= READ_ONCE(memcg->swap.max)) return true; + } return false; } @@ -7169,6 +7217,30 @@ static u64 swap_current_read(struct cgroup_subsys_state *css, return (u64)page_counter_read(&memcg->swap) * PAGE_SIZE; } +static int swap_high_show(struct seq_file *m, void *v) +{ + unsigned long high = READ_ONCE(mem_cgroup_from_seq(m)->swap_high); + + return seq_puts_memcg_tunable(m, high); +} + +static ssize_t swap_high_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + unsigned long high; + int err; + + buf = strstrip(buf); + err = page_counter_memparse(buf, "max", &high); + if (err) + return err; + + WRITE_ONCE(memcg->swap_high, high); + + return nbytes; +} + static int swap_max_show(struct seq_file *m, void *v) { return seq_puts_memcg_tunable(m, @@ -7196,6 +7268,8 @@ static int swap_events_show(struct seq_file *m, void *v) { struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + seq_printf(m, "high %lu\n", + atomic_long_read(&memcg->memory_events[MEMCG_SWAP_HIGH])); seq_printf(m, "max %lu\n", atomic_long_read(&memcg->memory_events[MEMCG_SWAP_MAX])); seq_printf(m, "fail %lu\n", @@ -7210,6 +7284,12 @@ static struct cftype swap_files[] = { .flags = CFTYPE_NOT_ON_ROOT, .read_u64 = swap_current_read, }, + { + .name = "swap.high", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = swap_high_show, + .write = swap_high_write, + }, { .name = "swap.max", .flags = CFTYPE_NOT_ON_ROOT,
Add a memory.swap.high knob, which can be used to protect the system from swap exhaustion. The mechanism used for penalizing is similar to the memory.high penalty (sleep on return to user space), but with a less steep slope.

That is not to say that the knob itself is equivalent to memory.high. The objective is more to protect the system from potentially buggy tasks consuming a lot of swap and impacting other tasks, or even bringing the whole system to a standstill with complete swap exhaustion, hopefully without the need to find per-task hard limits.

Slowing misbehaving tasks down gradually allows user space oom killers or other protection mechanisms to react. oomd and earlyoom already do killing based on swap exhaustion, and memory.swap.high protection will help implement such userspace oom policies more reliably.

Use one counter for the number of pages allocated under pressure to save struct task space and avoid two separate hierarchy walks on the hot path.

Use swap.high when deciding if swap is full. Perform reclaim and count memory over high events.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
--
v2:
 - add docs,
 - improve commit message.
---
 Documentation/admin-guide/cgroup-v2.rst | 16 +++++
 include/linux/memcontrol.h              |  4 ++
 mm/memcontrol.c                         | 94 +++++++++++++++++++++++--
 3 files changed, 107 insertions(+), 7 deletions(-)
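Finally, a sketch of how the new interface might be driven from userspace once the patch is applied. The file names memory.swap.high and memory.swap.events come from the patch itself; the cgroup path, the 1 GiB threshold, and the minimal error handling are illustrative assumptions.

```c
/*
 * Hedged example of using the new knob from userspace.  The cgroup path
 * "/sys/fs/cgroup/workload" and the 1 GiB figure are made up; the file
 * names come from the patch.
 */
#include <stdio.h>
#include <stdlib.h>

static void write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	const char *cg = "/sys/fs/cgroup/workload";	/* hypothetical cgroup */
	char path[256], line[256];
	FILE *f;

	/* Throttle the group once it uses more than 1 GiB of swap. */
	snprintf(path, sizeof(path), "%s/memory.swap.high", cg);
	write_file(path, "1073741824");

	/* Later: see how often the boundary was crossed. */
	snprintf(path, sizeof(path), "%s/memory.swap.events", cg);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);	/* expect a "high <count>" line among others */
	fclose(f);
	return 0;
}
```

A userspace OOM handler in the oomd/earlyoom mould would watch the events file and kill or restart the workload once the high count starts climbing, while the throttling added by the patch keeps the offender slow enough for that policy to act.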