Message ID | 156431697805.3170.6377599347542228221.stgit@buzz
---|---
State | New, archived
Series | [RFC] mm/memcontrol: reclaim severe usage over high limit in get_user_pages loop
On Sun 28-07-19 15:29:38, Konstantin Khlebnikov wrote:
> The high memory limit in a memory cgroup allows batching memory reclaim
> and deferring it until return to userland. This moves it out of any
> locks.
>
> A fixed gap between the high and max limits works pretty well (we are
> using 64 * NR_CPUS pages) except for cases when one syscall allocates
> tons of memory. This affects all other tasks in the cgroup because they
> might hit the max memory limit in inconvenient places and/or under hot
> locks.
>
> For example, mmap with MAP_POPULATE or MAP_LOCKED might allocate a lot
> of pages and push memory cgroup usage far beyond the high memory limit.
>
> This patch uses the halfway point between the high and max limits as a
> threshold: above it, mem_cgroup_handle_over_high() starts memory
> reclaim when called with only_severe = true; otherwise reclaim is
> deferred until return to userland. If the high limit isn't set, nothing
> changes.
>
> Now a long-running get_user_pages will periodically reclaim cgroup
> memory. Other possible targets are the generic file read/write iter
> loops.

I do see how gup can lead to a large excess over the high limit, but
could you be more specific about why that is a problem? We should be
reclaiming a similar number of pages cumulatively.
On 29.07.2019 12:17, Michal Hocko wrote:
> On Sun 28-07-19 15:29:38, Konstantin Khlebnikov wrote:
>> [...]
>
> I do see how gup can lead to a large excess over the high limit, but
> could you be more specific about why that is a problem? We should be
> reclaiming a similar number of pages cumulatively.

A large gup might push usage close to the limit and keep it there for
some time. As a result, concurrent allocations will enter direct
reclaim right at charging much more frequently.

Right now the deferred reclaim after passing the high limit works like
a distributed memcg kswapd which reclaims memory in the "background"
and prevents completely synchronous direct reclaim.

Maybe somebody has plans for a real kswapd for memcg?

I've put mem_cgroup_handle_over_high in gup next to cond_resched() and
later that gave me the idea that this is a good place for running any
deferred work, like a bottom half for tasks. Right now this happens
only at switching into userspace.
On Mon 29-07-19 12:40:29, Konstantin Khlebnikov wrote:
> On 29.07.2019 12:17, Michal Hocko wrote:
> [...]
> > I do see how gup can lead to a large excess over the high limit, but
> > could you be more specific about why that is a problem? We should be
> > reclaiming a similar number of pages cumulatively.
>
> A large gup might push usage close to the limit and keep it there for
> some time. As a result, concurrent allocations will enter direct
> reclaim right at charging much more frequently.

Yes, this is indeed possible. On the other hand, even reclaim from the
charge path doesn't really prevent that from happening, because the
context might get preempted or blocked on locks. So I guess we need
more detailed information about an actual, visible real-world problem
here.

> Right now the deferred reclaim after passing the high limit works like
> a distributed memcg kswapd which reclaims memory in the "background"
> and prevents completely synchronous direct reclaim.
>
> Maybe somebody has plans for a real kswapd for memcg?

I am not aware of any. The primary problem back then was that we simply
cannot have a kernel thread per memcg because that doesn't scale. Using
kthreads and a dynamic pool of threads tends to be quite tricky - e.g.
proper accounting, and scaling again.

> I've put mem_cgroup_handle_over_high in gup next to cond_resched() and
> later that gave me the idea that this is a good place for running any
> deferred work, like a bottom half for tasks. Right now this happens
> only at switching into userspace.

I am not against pushing high memory reclaim into the charge path in
principle. I just want to hear how big of a problem this really is in
practice. If this is mostly a theoretical problem that might hit then I
would rather stick with the existing code.
On 29.07.2019 13:33, Michal Hocko wrote:
> On Mon 29-07-19 12:40:29, Konstantin Khlebnikov wrote:
> [...]
>> Right now the deferred reclaim after passing the high limit works like
>> a distributed memcg kswapd which reclaims memory in the "background"
>> and prevents completely synchronous direct reclaim.
>>
>> Maybe somebody has plans for a real kswapd for memcg?
>
> I am not aware of any. The primary problem back then was that we simply
> cannot have a kernel thread per memcg because that doesn't scale. Using
> kthreads and a dynamic pool of threads tends to be quite tricky - e.g.
> proper accounting, and scaling again.

Yep, for containers proper accounting is important, especially cpu
usage. We're using manual kswapd-style reclaim from userspace via
MADV_STOCKPILE within containers where memory allocation latency is
critical. This patch is about the less extreme cases, which would be
nice to handle automatically, without custom tuning.

>> I've put mem_cgroup_handle_over_high in gup next to cond_resched() and
>> later that gave me the idea that this is a good place for running any
>> deferred work, like a bottom half for tasks. Right now this happens
>> only at switching into userspace.
>
> I am not against pushing high memory reclaim into the charge path in
> principle. I just want to hear how big of a problem this really is in
> practice. If this is mostly a theoretical problem that might hit then I
> would rather stick with the existing code.

Besides latency, which might not be so important for everybody, I see
these problems:

The first is fairness within a cgroup: the task that generates the
allocation flow isn't throttled after passing the high limit, as the
documentation says it should be. It will feel memory pressure only
after hitting the max limit, while other tasks with smaller allocations
will go into direct reclaim right away.

The second is accumulating too much deferred reclaim: after a large
gup, the task might call direct reclaim with a target amount much
larger than the gap between the high and max limits, or even larger
than the max limit itself.
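[To put rough numbers on that second problem, here is a minimal
standalone sketch; the CPU count and mapping size are illustrative
assumptions, not values from the patch or the thread:]

#include <stdio.h>

int main(void)
{
	/* illustrative assumptions: 16 CPUs, 4K pages, 4 GiB MAP_POPULATE */
	const long nr_cpus = 16;
	const long gap_pages = 64 * nr_cpus;		/* usual high->max gap */
	const long deferred_pages = (4L << 30) / 4096;	/* pages charged by gup */

	/* pre-patch, all of this reclaim is deferred to the userland return */
	printf("gap: %ld pages, deferred: %ld pages (%ldx the gap)\n",
	       gap_pages, deferred_pages, deferred_pages / gap_pages);
	return 0;
}

[This prints a deferred target of 1048576 pages against a gap of 1024
pages, i.e. a single populating syscall can accumulate three orders of
magnitude more deferred reclaim than the gap is sized for.]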
On Sun, Jul 28, 2019 at 03:29:38PM +0300, Konstantin Khlebnikov wrote:
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -847,8 +847,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> 			ret = -ERESTARTSYS;
> 			goto out;
> 		}
> -		cond_resched();
>
> +		/* Reclaim memory over high limit before stocking too much */
> +		mem_cgroup_handle_over_high(true);

I'd rather this remained part of the try_charge() call. The code
comment in try_charge says this:

	 * We can perform reclaim here if __GFP_RECLAIM but let's
	 * always punt for simplicity and so that GFP_KERNEL can
	 * consistently be used during reclaim.

The simplicity argument doesn't hold true anymore once we have to add
manual calls into allocation sites. We should instead fix try_charge()
to do synchronous reclaim for __GFP_RECLAIM and only punt to the
userspace return path when actually needed.
On Mon, Jul 29, 2019 at 3:33 AM Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 29-07-19 12:40:29, Konstantin Khlebnikov wrote:
> [...]
> > Right now the deferred reclaim after passing the high limit works like
> > a distributed memcg kswapd which reclaims memory in the "background"
> > and prevents completely synchronous direct reclaim.
> >
> > Maybe somebody has plans for a real kswapd for memcg?
>
> I am not aware of any. The primary problem back then was that we simply
> cannot have a kernel thread per memcg because that doesn't scale. Using
> kthreads and a dynamic pool of threads tends to be quite tricky - e.g.
> proper accounting, and scaling again.

We did discuss this topic at last year's LSF/MM, please see:
https://lwn.net/Articles/753162/.

We (Alibaba) have had the memcg kswapd thing working in our production
environment for a while, and it worked well for our 11.11 shopping
festival flood. I did plan to post the patches upstream, but I was
distracted by something else for a long time. I have since redesigned
it and already have some preliminary patches working; if you are
interested, I would like to post the patches soon to gather some
comments early.

However, some of the issues mentioned by Michal do still exist, i.e.
accounting. I have not thought too much about accounting yet. I recall
Johannes mentioned they were working on accounting for kswapd back
then. But I'm not sure if they are still working on that, or whether he
just meant some throttling solved by commit
2cf855837b89d92996cf264713f3bed2bf9b0b4f ("memcontrol: schedule
throttling if we are congested"). I vaguely recall, though, that
accounting did not sound very critical.

I don't worry too much about scale since the scale issue is not unique
to background reclaim; direct reclaim may run into the same problem. If
you run into extreme memory pressure, a lot of memcgs run into direct
reclaim and global reclaim is also running at the same time - your
machine is definitely already unusable.

And, in our use case, memcg background reclaim is mainly used by some
latency-sensitive memcgs (running latency-sensitive applications) to
try to minimize direct reclaim; other memcgs had better be throttled by
direct reclaim if they consume too much memory.

Regards,
Yang
On Mon 29-07-19 10:28:43, Yang Shi wrote:
[...]
> I don't worry too much about scale since the scale issue is not unique
> to background reclaim; direct reclaim may run into the same problem.

Just to clarify: by scaling problem I mean a 1:1 kswapd thread to
memcg. You can have thousands of memcgs and I do not think we really
want to create one kswapd for each. Once we have a kswapd thread pool,
we get into tricky land where determinism/fairness would be non-trivial
to achieve. Direct reclaim, on the other hand, is bound by the workload
itself.
On Mon 29-07-19 11:49:52, Johannes Weiner wrote:
> On Sun, Jul 28, 2019 at 03:29:38PM +0300, Konstantin Khlebnikov wrote:
> > [...]
>
> I'd rather this remained part of the try_charge() call. The code
> comment in try_charge says this:
>
> 	 * We can perform reclaim here if __GFP_RECLAIM but let's
> 	 * always punt for simplicity and so that GFP_KERNEL can
> 	 * consistently be used during reclaim.
>
> The simplicity argument doesn't hold true anymore once we have to add
> manual calls into allocation sites. We should instead fix try_charge()
> to do synchronous reclaim for __GFP_RECLAIM and only punt to the
> userspace return path when actually needed.

Agreed. If we want to do direct reclaim on a high limit breach, then it
should go into try_charge the same way we do hard limit reclaim there.
I am not yet sure about how/whether to scale the excess. The only
reason to move reclaim to the return-to-userspace path was GFP_NOWAIT
charges. As you say, maybe we should start by always performing the
reclaim for sleepable contexts first and only defer for non-sleeping
requests.
On Mon, Jul 29, 2019 at 11:48 AM Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 29-07-19 10:28:43, Yang Shi wrote:
> [...]
> > I don't worry too much about scale since the scale issue is not unique
> > to background reclaim; direct reclaim may run into the same problem.
>
> Just to clarify: by scaling problem I mean a 1:1 kswapd thread to
> memcg. You can have thousands of memcgs and I do not think we really
> want to create one kswapd for each. Once we have a kswapd thread pool,
> we get into tricky land where determinism/fairness would be
> non-trivial to achieve. Direct reclaim, on the other hand, is bound by
> the workload itself.

Yes, I agree a thread pool would introduce more latency than a
dedicated kswapd thread. But it looks not that bad in our tests: when
memory allocation is fast, even a dedicated kswapd thread can't catch
up. So such background reclaim is best effort, not guaranteed.

I don't quite get what you mean about fairness. Do you mean they may
spend excessive cpu time and starve other processes? I think this could
be mitigated by properly organizing and configuring the groups. But I
agree this is tricky. Typically, processes are placed into different
cgroups according to their importance and priority. For example, in our
cluster, system processes go to a system group, then latency-sensitive
jobs and batch jobs (which are usually second-class citizens) go to
different groups. The memcg kswapd would be enabled for the
latency-sensitive groups only. The memcg kswapd threads would have the
same priority as the global kswapd.
On Thu 01-08-19 14:00:51, Yang Shi wrote:
> On Mon, Jul 29, 2019 at 11:48 AM Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> Yes, I agree a thread pool would introduce more latency than a
> dedicated kswapd thread. [...]
>
> I don't quite get what you mean about fairness. Do you mean they may
> spend excessive cpu time and starve other processes? I think this
> could be mitigated by properly organizing and configuring the groups.
> But I agree this is tricky.

No, I meant that the cost of reclaiming a unit of charges (e.g.
SWAP_CLUSTER_MAX) is not constant and depends on the state of the
memory on the LRUs. Therefore any thread pool mechanism would lead to
unfair reclaim and non-deterministic behavior.

I can imagine a middle ground where background reclaim would have to be
an opt-in feature and a dedicated kernel thread would be assigned to
the particular memcg (hierarchy).
On Mon 29-07-19 20:55:09, Michal Hocko wrote:
> On Mon 29-07-19 11:49:52, Johannes Weiner wrote:
> [...]
> > The simplicity argument doesn't hold true anymore once we have to add
> > manual calls into allocation sites. We should instead fix try_charge()
> > to do synchronous reclaim for __GFP_RECLAIM and only punt to the
> > userspace return path when actually needed.
>
> Agreed. If we want to do direct reclaim on a high limit breach, then it
> should go into try_charge the same way we do hard limit reclaim there.
> I am not yet sure about how/whether to scale the excess. The only
> reason to move reclaim to the return-to-userspace path was GFP_NOWAIT
> charges. As you say, maybe we should start by always performing the
> reclaim for sleepable contexts first and only defer for non-sleeping
> requests.

In other words, something like the patch below (completely untested).
Could you give it a try, Konstantin?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ba9138a4a1de..53a35c526e43 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2429,8 +2429,12 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			schedule_work(&memcg->high_work);
 			break;
 		}
-		current->memcg_nr_pages_over_high += batch;
-		set_notify_resume(current);
+		if (gfpflags_allow_blocking(gfp_mask)) {
+			reclaim_high(memcg, nr_pages, GFP_KERNEL);
+		} else {
+			current->memcg_nr_pages_over_high += batch;
+			set_notify_resume(current);
+		}
 		break;
 	}
 } while ((memcg = parent_mem_cgroup(memcg)));
On 02.08.2019 12:40, Michal Hocko wrote:
> On Mon 29-07-19 20:55:09, Michal Hocko wrote:
> [...]
>
> In other words, something like the patch below (completely untested).
> Could you give it a try, Konstantin?

This should work, but it also eliminates all the benefits of deferred
reclaim: bigger batching and running without any locks. After that, the
gap between high and max will serve merely as a reserve for atomic
allocations.

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ba9138a4a1de..53a35c526e43 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2429,8 +2429,12 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  			schedule_work(&memcg->high_work);
>  			break;
>  		}
> -		current->memcg_nr_pages_over_high += batch;
> -		set_notify_resume(current);
> +		if (gfpflags_allow_blocking(gfp_mask)) {
> +			reclaim_high(memcg, nr_pages, GFP_KERNEL);
> +		} else {
> +			current->memcg_nr_pages_over_high += batch;
> +			set_notify_resume(current);
> +		}
>  		break;
>  	}
>  } while ((memcg = parent_mem_cgroup(memcg)));
On Fri 02-08-19 13:01:07, Konstantin Khlebnikov wrote:
> On 02.08.2019 12:40, Michal Hocko wrote:
> [...]
> > In other words, something like the patch below (completely untested).
> > Could you give it a try, Konstantin?
>
> This should work, but it also eliminates all the benefits of deferred
> reclaim: bigger batching and running without any locks.

Yes, but we already have to deal with that for hard limit reclaim.
Also, I would like to see actual data to back any more complex
solution. We should definitely start simple.

> After that, the gap between high and max will serve merely as a
> reserve for atomic allocations.
>
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ba9138a4a1de..53a35c526e43 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2429,8 +2429,12 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  			schedule_work(&memcg->high_work);
> >  			break;
> >  		}
> > -		current->memcg_nr_pages_over_high += batch;
> > -		set_notify_resume(current);
> > +		if (gfpflags_allow_blocking(gfp_mask)) {
> > +			reclaim_high(memcg, nr_pages, GFP_KERNEL);

Oops, this should be s@GFP_KERNEL@gfp_mask@

> > +		} else {
> > +			current->memcg_nr_pages_over_high += batch;
> > +			set_notify_resume(current);
> > +		}
> >  		break;
> >  	}
> >  } while ((memcg = parent_mem_cgroup(memcg)));
On Fri, Aug 2, 2019 at 2:35 AM Michal Hocko <mhocko@kernel.org> wrote:
> On Thu 01-08-19 14:00:51, Yang Shi wrote:
> [...]
> > I don't quite get what you mean about fairness. Do you mean they may
> > spend excessive cpu time and starve other processes? I think this
> > could be mitigated by properly organizing and configuring the groups.
> > But I agree this is tricky.
>
> No, I meant that the cost of reclaiming a unit of charges (e.g.
> SWAP_CLUSTER_MAX) is not constant and depends on the state of the
> memory on the LRUs. Therefore any thread pool mechanism would lead to
> unfair reclaim and non-deterministic behavior.

Yes, the cost depends on the state of the pages, but I still don't
quite understand what "unfair" refers to in this context. Do you mean
some cgroups may reclaim much more than others? Or that the work may
take too long, so it can't serve other cgroups in time?

> I can imagine a middle ground where background reclaim would have to
> be an opt-in feature and a dedicated kernel thread would be assigned
> to the particular memcg (hierarchy).

Yes, it is opt-in by defining a proper "watermark". As long as the
watermark is defined in (0, 100), the "kswapd" work is queued once
usage is greater than the watermark, and it exits once usage drops back
under the watermark. If the watermark is 0, it will never queue any
background reclaim work.

We did use a dedicated kernel thread for each cgroup, but it turned out
to be tricky and error-prone to manage the kernel threads; a workqueue
is much simpler and less error-prone.
On Fri 02-08-19 11:56:28, Yang Shi wrote:
> [...]
> Yes, the cost depends on the state of the pages, but I still don't
> quite understand what "unfair" refers to in this context. Do you mean
> some cgroups may reclaim much more than others? Or that the work may
> take too long, so it can't serve other cgroups in time?

Exactly.
On Mon, Aug 5, 2019 at 7:32 AM Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 02-08-19 11:56:28, Yang Shi wrote:
> [...]
> > Yes, the cost depends on the state of the pages, but I still don't
> > quite understand what "unfair" refers to in this context. Do you mean
> > some cgroups may reclaim much more than others? Or that the work may
> > take too long, so it can't serve other cgroups in time?
>
> Exactly.

How about allowing users to implement their own user-space kswapd, via
a memcg interface similar to MADV_PAGEOUT? Users could register for the
MEMCG_HIGH notification (it needs some modification) and, on receiving
the notification, the uswapd (user's kswapd) would trigger reclaim
through memory.pageout (or memory.try_to_free_pages).

One can argue: why not just use MADV_PAGEOUT? In a real workload, a job
can be a combination of different sub-jobs and most probably does not
know the importance of the memory layout of the tasks of the sub-jobs.
So a memcg-level interface makes more sense there.

Shakeel
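[For illustration, a rough user-space sketch of such a uswapd loop.
Note that memory.pageout and the MEMCG_HIGH-triggered poll semantics
are hypothetical interfaces from the proposal above, not existing
kernel ABI, and the cgroup path and reclaim size are made up:]

#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* cgroup v2 event files raise POLLPRI when their content changes */
	int ev = open("/sys/fs/cgroup/job/memory.events", O_RDONLY);
	struct pollfd pfd = { .fd = ev, .events = POLLPRI };
	char buf[256];

	while (ev >= 0 && poll(&pfd, 1, -1) > 0) {
		ssize_t n = pread(ev, buf, sizeof(buf) - 1, 0);
		if (n <= 0)
			break;
		buf[n] = '\0';
		if (!strstr(buf, "high"))	/* crude "high breached" check */
			continue;
		/* hypothetical knob: ask the kernel to reclaim 64M from memcg */
		int out = open("/sys/fs/cgroup/job/memory.pageout", O_WRONLY);
		if (out >= 0) {
			write(out, "67108864", 8);
			close(out);
		}
	}
	return 0;
}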
On Mon, Aug 5, 2019 at 7:32 AM Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 02-08-19 11:56:28, Yang Shi wrote:
> [...]
> > Yes, the cost depends on the state of the pages, but I still don't
> > quite understand what "unfair" refers to in this context. Do you mean
> > some cgroups may reclaim much more than others? Or that the work may
> > take too long, so it can't serve other cgroups in time?
>
> Exactly.

Actually, I'm not very concerned by this. In our design each memcg has
its own dedicated work item (memcg->wmark_work), so the reclaim work
for different memcgs can run in parallel: they are *different* work
items, although they run the same function. And we could queue them to
a dedicated unbound workqueue, which may have a maximum of 512 active
works, or scale with the number of cpus. Although the system may have
thousands of online memcgs, I suppose it should be rare for all of them
to trigger reclaim at the same time.
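[A minimal sketch of the design Yang Shi describes. wmark_work,
wmark_high, wmark_low and memcg_wmark_wq are assumed fields/objects
from the out-of-tree patches, not mainline API; the reclaim loop is a
simplification:]

/* Runs on a dedicated unbound workqueue; one work item per memcg. */
static void wmark_work_func(struct work_struct *work)
{
	struct mem_cgroup *memcg = container_of(work, struct mem_cgroup,
						wmark_work);

	/* reclaim until usage drops back under the low watermark */
	while (page_counter_read(&memcg->memory) > memcg->wmark_low)
		if (!try_to_free_mem_cgroup_pages(memcg, SWAP_CLUSTER_MAX,
						  GFP_KERNEL, true))
			break;
}

/* Called from the charge path; queue_work() is a no-op if already queued. */
static void memcg_maybe_kick_wmark_reclaim(struct mem_cgroup *memcg)
{
	if (memcg->wmark_high &&
	    page_counter_read(&memcg->memory) > memcg->wmark_high)
		queue_work(memcg_wmark_wq, &memcg->wmark_work);
}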
On Mon 05-08-19 20:28:40, Yang Shi wrote:
> On Mon, Aug 5, 2019 at 7:32 AM Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> Actually, I'm not very concerned by this. In our design each memcg has
> its own dedicated work item (memcg->wmark_work), so the reclaim work
> for different memcgs can run in parallel [...] Although the system may
> have thousands of online memcgs, I suppose it should be rare for all
> of them to trigger reclaim at the same time.

I do believe that it might work for your particular use case, but I am
afraid it is not robust enough for the upstream kernel. As I've said, I
am open to discussing an opt-in per-memcg pro-active reclaim (a kernel
thread that belongs to the memcg), but it has to be a dedicated worker
bound by all the cgroup resource restrictions.
On Fri 02-08-19 13:44:38, Michal Hocko wrote:
[...]
> > > -		current->memcg_nr_pages_over_high += batch;
> > > -		set_notify_resume(current);
> > > +		if (gfpflags_allow_blocking(gfp_mask)) {
> > > +			reclaim_high(memcg, nr_pages, GFP_KERNEL);
>
> Oops, this should be s@GFP_KERNEL@gfp_mask@
>
> > > +		} else {
> > > +			current->memcg_nr_pages_over_high += batch;
> > > +			set_notify_resume(current);
> > > +		}
> > > 		break;
> > > 	}
> > > } while ((memcg = parent_mem_cgroup(memcg)));

Should I send an official patch for this?
On 8/6/19 10:07 AM, Michal Hocko wrote:
> On Fri 02-08-19 13:44:38, Michal Hocko wrote:
> [...]
>
> Should I send an official patch for this?

I prefer to keep it as is while we have no better solution.
On Tue 06-08-19 10:19:49, Konstantin Khlebnikov wrote:
> On 8/6/19 10:07 AM, Michal Hocko wrote:
> [...]
> > Should I send an official patch for this?
>
> I prefer to keep it as is while we have no better solution.

Fine with me.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 44c41462be33..eca2bf9560f2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -512,7 +512,7 @@ unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
 	return mz->lru_zone_size[zone_idx][lru];
 }
 
-void mem_cgroup_handle_over_high(void);
+void mem_cgroup_handle_over_high(bool only_severe);
 
 unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg);
 
@@ -969,7 +969,7 @@ static inline void unlock_page_memcg(struct page *page)
 {
 }
 
-static inline void mem_cgroup_handle_over_high(void)
+static inline void mem_cgroup_handle_over_high(bool only_severe)
 {
 }
 
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 36fb3bbed6b2..8845fb65353f 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -194,7 +194,7 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
 	}
 #endif
 
-	mem_cgroup_handle_over_high();
+	mem_cgroup_handle_over_high(false);
 	blkcg_maybe_throttle_current();
 }
 
diff --git a/mm/gup.c b/mm/gup.c
index 98f13ab37bac..42b93fffe824 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -847,8 +847,11 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			ret = -ERESTARTSYS;
 			goto out;
 		}
-		cond_resched();
 
+		/* Reclaim memory over high limit before stocking too much */
+		mem_cgroup_handle_over_high(true);
+
+		cond_resched();
 		page = follow_page_mask(vma, start, foll_flags, &ctx);
 		if (!page) {
 			ret = faultin_page(tsk, vma, start, &foll_flags,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cdbb7a84cb6e..15fa664ce98c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2317,11 +2317,16 @@ static void high_work_func(struct work_struct *work)
 	reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
 }
 
+#define MEMCG_SEVERE_OVER_HIGH	(1 << 31)
+
 /*
  * Scheduled by try_charge() to be executed from the userland return path
  * and reclaims memory over the high limit.
+ *
+ * Long allocation loops should call periodically with only_severe = true
+ * to reclaim memory if usage already over halfway to the max limit.
  */
-void mem_cgroup_handle_over_high(void)
+void mem_cgroup_handle_over_high(bool only_severe)
 {
 	unsigned int nr_pages = current->memcg_nr_pages_over_high;
 	struct mem_cgroup *memcg;
@@ -2329,6 +2334,11 @@ void mem_cgroup_handle_over_high(bool only_severe)
 	if (likely(!nr_pages))
 		return;
 
+	if (nr_pages & MEMCG_SEVERE_OVER_HIGH)
+		nr_pages -= MEMCG_SEVERE_OVER_HIGH;
+	else if (only_severe)
+		return;
+
 	memcg = get_mem_cgroup_from_mm(current->mm);
 	reclaim_high(memcg, nr_pages, GFP_KERNEL);
 	css_put(&memcg->css);
@@ -2493,6 +2503,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			schedule_work(&memcg->high_work);
 			break;
 		}
+		/* Mark as severe if over halfway to the max limit */
+		if (page_counter_read(&memcg->memory) >
+		    (memcg->high >> 1) + (memcg->memory.max >> 1))
+			current->memcg_nr_pages_over_high |=
+						MEMCG_SEVERE_OVER_HIGH;
 		current->memcg_nr_pages_over_high += batch;
 		set_notify_resume(current);
 		break;
The high memory limit in a memory cgroup allows batching memory reclaim
and deferring it until return to userland. This moves it out of any
locks.

A fixed gap between the high and max limits works pretty well (we are
using 64 * NR_CPUS pages) except for cases when one syscall allocates
tons of memory. This affects all other tasks in the cgroup because they
might hit the max memory limit in inconvenient places and/or under hot
locks.

For example, mmap with MAP_POPULATE or MAP_LOCKED might allocate a lot
of pages and push memory cgroup usage far beyond the high memory limit.

This patch uses the halfway point between the high and max limits as a
threshold: above it, mem_cgroup_handle_over_high() starts memory
reclaim when called with only_severe = true; otherwise reclaim is
deferred until return to userland. If the high limit isn't set, nothing
changes.

Now a long-running get_user_pages will periodically reclaim cgroup
memory. Other possible targets are the generic file read/write iter
loops.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 include/linux/memcontrol.h |    4 ++--
 include/linux/tracehook.h  |    2 +-
 mm/gup.c                   |    5 ++++-
 mm/memcontrol.c            |   17 ++++++++++++++++-
 4 files changed, 23 insertions(+), 5 deletions(-)
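[As a concrete example of the severity threshold (numbers are
illustrative, not from the patch): with high = 512 MiB and max = 1 GiB,
charges are marked severe once usage exceeds 768 MiB, the midpoint the
patch computes as (memcg->high >> 1) + (memcg->memory.max >> 1):]

#include <stdio.h>

int main(void)
{
	/* illustrative limits in 4K pages: high = 512 MiB, max = 1 GiB */
	const unsigned long high = 131072, max = 262144;
	const unsigned long severe = (high >> 1) + (max >> 1);

	printf("severe above %lu pages (%lu MiB)\n",
	       severe, severe * 4096 >> 20);	/* -> 196608 pages, 768 MiB */
	return 0;
}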