Message ID | 20200502135910.7255-2-laoar.shao@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mm: memcontrol: memory.{low,min} reclaim fix & cleanup | expand |
Yafang Shao writes: >A cgroup can have both memory protection and a memory limit to isolate >it from its siblings in both directions - for example, to prevent it >from being shrunk below 2G under high pressure from outside, but also >from growing beyond 4G under low pressure. > >Commit 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") >implemented proportional scan pressure so that multiple siblings in >excess of their protection settings don't get reclaimed equally but >instead in accordance to their unprotected portion. > >During limit reclaim, this proportionality shouldn't apply of course: >there is no competition, all pressure is from within the cgroup and >should be applied as such. Reclaim should operate at full efficiency. > >However, mem_cgroup_protected() never expected anybody to look at the >effective protection values when it indicated that the cgroup is above >its protection. As a result, a query during limit reclaim may return >stale protection values that were calculated by a previous reclaim cycle >in which the cgroup did have siblings. > >When this happens, reclaim is unnecessarily hesitant and potentially >slow to meet the desired limit. In theory this could lead to premature >OOM kills, although it's not obvious this has occurred in practice. > >[hannes@cmpxchg.org: changelog] >[mhocko@kernel.org: rework code comment] >[chris@chrisdown.name: retitle] >Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") >Signed-off-by: Yafang Shao <laoar.shao@gmail.com> >Acked-by: Roman Gushchin <guro@fb.com> >Cc: Michal Hocko <mhocko@kernel.org> >Cc: Johannes Weiner <hannes@cmpxchg.org> >Cc: Chris Down <chris@chrisdown.name> Thanks! Acked-by: Chris Down <chris@chrisdown.name>
On Sat, May 02, 2020 at 09:59:09AM -0400, Yafang Shao wrote: > A cgroup can have both memory protection and a memory limit to isolate > it from its siblings in both directions - for example, to prevent it > from being shrunk below 2G under high pressure from outside, but also > from growing beyond 4G under low pressure. > > Commit 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") > implemented proportional scan pressure so that multiple siblings in > excess of their protection settings don't get reclaimed equally but > instead in accordance to their unprotected portion. > > During limit reclaim, this proportionality shouldn't apply of course: > there is no competition, all pressure is from within the cgroup and > should be applied as such. Reclaim should operate at full efficiency. > > However, mem_cgroup_protected() never expected anybody to look at the > effective protection values when it indicated that the cgroup is above > its protection. As a result, a query during limit reclaim may return > stale protection values that were calculated by a previous reclaim cycle > in which the cgroup did have siblings. > > When this happens, reclaim is unnecessarily hesitant and potentially > slow to meet the desired limit. In theory this could lead to premature > OOM kills, although it's not obvious this has occurred in practice. > > [hannes@cmpxchg.org: changelog] > [mhocko@kernel.org: rework code comment] > [chris@chrisdown.name: retitle] > Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") > Signed-off-by: Yafang Shao <laoar.shao@gmail.com> > Acked-by: Roman Gushchin <guro@fb.com> > Cc: Michal Hocko <mhocko@kernel.org> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Chris Down <chris@chrisdown.name> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
On Sat 02-05-20 09:59:09, Yafang Shao wrote: > A cgroup can have both memory protection and a memory limit to isolate > it from its siblings in both directions - for example, to prevent it > from being shrunk below 2G under high pressure from outside, but also > from growing beyond 4G under low pressure. > > Commit 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") > implemented proportional scan pressure so that multiple siblings in > excess of their protection settings don't get reclaimed equally but > instead in accordance to their unprotected portion. > > During limit reclaim, this proportionality shouldn't apply of course: > there is no competition, all pressure is from within the cgroup and > should be applied as such. Reclaim should operate at full efficiency. > > However, mem_cgroup_protected() never expected anybody to look at the > effective protection values when it indicated that the cgroup is above > its protection. As a result, a query during limit reclaim may return > stale protection values that were calculated by a previous reclaim cycle > in which the cgroup did have siblings. > > When this happens, reclaim is unnecessarily hesitant and potentially > slow to meet the desired limit. In theory this could lead to premature > OOM kills, although it's not obvious this has occurred in practice. > > [hannes@cmpxchg.org: changelog] > [mhocko@kernel.org: rework code comment] > [chris@chrisdown.name: retitle] > Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") > Signed-off-by: Yafang Shao <laoar.shao@gmail.com> > Acked-by: Roman Gushchin <guro@fb.com> > Cc: Michal Hocko <mhocko@kernel.org> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Chris Down <chris@chrisdown.name> I have only now processed my inbox to this email. Please consider the changelog part which explains the fix I have posted earlier this morning http://lkml.kernel.org/r/20200504072342.GD22838@dhcp22.suse.cz Other than that Acked-by: Michal Hocko <mhocko@suse.com>
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index d275c72c4f8e..c07548ce26cb 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -344,12 +344,49 @@ static inline bool mem_cgroup_disabled(void) return !cgroup_subsys_enabled(memory_cgrp_subsys); } -static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg, +static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root, + struct mem_cgroup *memcg, bool in_low_reclaim) { if (mem_cgroup_disabled()) return 0; + /* + * There is no reclaim protection applied to a targeted reclaim. + * We are special casing this specific case here because + * mem_cgroup_protected calculation is not robust enough to keep + * the protection invariant for calculated effective values for + * parallel reclaimers with different reclaim target. This is + * especially a problem for tail memcgs (as they have pages on LRU) + * which would want to have effective values 0 for targeted reclaim + * but a different value for external reclaim. + * + * Example + * Let's have global and A's reclaim in parallel: + * | + * A (low=2G, usage = 3G, max = 3G, children_low_usage = 1.5G) + * |\ + * | C (low = 1G, usage = 2.5G) + * B (low = 1G, usage = 0.5G) + * + * For the global reclaim + * A.elow = A.low + * B.elow = min(B.usage, B.low) because children_low_usage <= A.elow + * C.elow = min(C.usage, C.low) + * + * With the effective values resetting we have A reclaim + * A.elow = 0 + * B.elow = B.low + * C.elow = C.low + * + * If the global reclaim races with A's reclaim then + * B.elow = C.elow = 0 because children_low_usage > A.elow) + * is possible and reclaiming B would be violating the protection. + * + */ + if (root == memcg) + return 0; + if (in_low_reclaim) return READ_ONCE(memcg->memory.emin); @@ -835,7 +872,8 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, { } -static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg, +static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root, + struct mem_cgroup *memcg, bool in_low_reclaim) { return 0; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5beea03dd58a..1206682edc1a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6388,6 +6388,14 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, if (!root) root = root_mem_cgroup; + + /* + * Effective values of the reclaim targets are ignored so they + * can be stale. Have a look at mem_cgroup_protection for more + * details. + * TODO: calculation should be more robust so that we do not need + * that special casing. + */ if (memcg == root) return MEMCG_PROT_NONE; diff --git a/mm/vmscan.c b/mm/vmscan.c index b06868fc4926..4d3027ac131c 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2346,7 +2346,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, unsigned long protection; lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx); - protection = mem_cgroup_protection(memcg, + protection = mem_cgroup_protection(sc->target_mem_cgroup, + memcg, sc->memcg_low_reclaim); if (protection) {