Message ID | e269f5df3af1157232b01a9b0dae3edf4880d786.1613584277.git.tim.c.chen@linux.intel.com |
---|---|
State | New, archived |
Series | Soft limit memory management bug fixes |
On Wed, Feb 17, 2021 at 12:41:36PM -0800, Tim Chen wrote:
> On a per node basis, the mem cgroup soft limit tree on each node tracks
> how much a cgroup has exceeded its soft limit memory limit and sorts
> the cgroup by its excess usage. On page release, the trees are not
> updated right away, until we have gathered a batch of pages belonging to
> the same cgroup. This reduces the frequency of updating the soft limit tree
> and locking of the tree and associated cgroup.
>
> However, the batch of pages could contain pages from multiple nodes but
> only the soft limit tree from one node would get updated. Change the
> logic so that we update the tree in batch of pages, with each batch of
> pages all in the same mem cgroup and memory node. An update is issued for
> the batch of pages of a node collected till now whenever we encounter
> a page belonging to a different node. Note that this batching for
> the same node logic is only relevant for v1 cgroup that has a memory
> soft limit.
>
> Reviewed-by: Ying Huang <ying.huang@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  mm/memcontrol.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d72449eeb85a..8bddee75f5cb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6804,6 +6804,7 @@ struct uncharge_gather {
>  	unsigned long pgpgout;
>  	unsigned long nr_kmem;
>  	struct page *dummy_page;
> +	int nid;
>  };
>
>  static inline void uncharge_gather_clear(struct uncharge_gather *ug)
> @@ -6849,7 +6850,13 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
>  	 * exclusive access to the page.
>  	 */
>
> -	if (ug->memcg != page_memcg(page)) {
> +	if (ug->memcg != page_memcg(page) ||
> +	    /*
> +	     * Update soft limit tree used in v1 cgroup in page batch for
> +	     * the same node. Relevant only to v1 cgroup with a soft limit.
> +	     */
> +	    (ug->dummy_page && ug->nid != page_to_nid(page) &&
> +	     ug->memcg->soft_limit != PAGE_COUNTER_MAX)) {

Sorry, I used weird phrasing in my last email.

Can you please preface the checks you're adding with a
!cgroup_subsys_on_dfl(memory_cgrp_subsys) to static branch for
cgroup1? The uncharge path is pretty hot, and this would avoid the
runtime overhead on cgroup2 at least, which doesn't have the SL.

Also, do we need the ug->dummy_page check? It's only NULL on the first
loop - where ug->memcg is NULL as well and the branch is taken anyway.

The soft limit check is also slightly cheaper than the nid check, as
page_to_nid() might be out-of-line, so we should do it first. This?

	/*
	 * Batch-uncharge all pages of the same memcg.
	 *
	 * Unless we're looking at a cgroup1 with a softlimit
	 * set: the soft limit trees are maintained per-node
	 * and updated on uncharge (via dummy_page), so keep
	 * batches confined to a single node as well.
	 */
	if (ug->memcg != page_memcg(page) ||
	    (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
	     ug->memcg->soft_limit != PAGE_COUNTER_MAX &&
	     ug->nid != page_to_nid(page)))
On Wed 17-02-21 12:41:36, Tim Chen wrote:
> On a per node basis, the mem cgroup soft limit tree on each node tracks
> how much a cgroup has exceeded its soft limit memory limit and sorts
> the cgroup by its excess usage. On page release, the trees are not
> updated right away, until we have gathered a batch of pages belonging to
> the same cgroup. This reduces the frequency of updating the soft limit tree
> and locking of the tree and associated cgroup.
>
> However, the batch of pages could contain pages from multiple nodes but
> only the soft limit tree from one node would get updated. Change the
> logic so that we update the tree in batch of pages, with each batch of
> pages all in the same mem cgroup and memory node. An update is issued for
> the batch of pages of a node collected till now whenever we encounter
> a page belonging to a different node. Note that this batching for
> the same node logic is only relevant for v1 cgroup that has a memory
> soft limit.

Let me paste the discussion related to this patch from other reply:

> >> For patch 3 regarding the uncharge_batch, it
> >> is more of an observation that we should uncharge in batch of same node
> >> and not prompted by actual workload.
> >> Thinking more about this, the worst that could happen
> >> is we could have some entries in the soft limit tree that overestimate
> >> the memory used. The worst that could happen is a soft page reclaim
> >> on that cgroup. The overhead from extra memcg event update could
> >> be more than a soft page reclaim pass. So let's drop patch 3
> >> for now.
> >
> > I would still prefer to handle that in the soft limit reclaim path and
> > check each memcg for the soft limit reclaim excess before the reclaim.
>
> Something like this?
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8bddee75f5cb..b50cae3b2a1a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3472,6 +3472,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>  		if (!mz)
>  			break;
>
> +		/*
> +		 * Soft limit tree is updated based on memcg events sampling.
> +		 * We could have missed some updates on page uncharge and
> +		 * the cgroup is below soft limit. Skip useless soft reclaim.
> +		 */
> +		if (!soft_limit_excess(mz->memcg))
> +			continue;
> +
>  		nr_scanned = 0;
>  		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat,

Yes I meant something like this but then I have looked more closely and
this shouldn't be needed afterall. __mem_cgroup_largest_soft_limit_node
already does all the work

	if (!soft_limit_excess(mz->memcg) ||
	    !css_tryget(&mz->memcg->css))
		goto retry;

so this shouldn't really happen.
On 2/19/21 1:16 AM, Michal Hocko wrote:
>>
>> Something like this?
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 8bddee75f5cb..b50cae3b2a1a 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3472,6 +3472,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>>  		if (!mz)
>>  			break;
>>
>> +		/*
>> +		 * Soft limit tree is updated based on memcg events sampling.
>> +		 * We could have missed some updates on page uncharge and
>> +		 * the cgroup is below soft limit. Skip useless soft reclaim.
>> +		 */
>> +		if (!soft_limit_excess(mz->memcg))
>> +			continue;
>> +
>>  		nr_scanned = 0;
>>  		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat,
>
> Yes I meant something like this but then I have looked more closely and
> this shouldn't be needed afterall. __mem_cgroup_largest_soft_limit_node
> already does all the work
> 	if (!soft_limit_excess(mz->memcg) ||
> 	    !css_tryget(&mz->memcg->css))
> 		goto retry;
> so this shouldn't really happen.
>

Ah, that's true. The added check for soft_limit_excess is not needed.

Do you think it is still a good idea to add patch 3 to restrict the
uncharge update in page batch of the same node and cgroup?

I am okay with dropping patch 3 and let the inaccuracies in the ordering
of soft limit tree be cleared out by an occasional soft reclaim. These
inaccuracies will still be there even with patch 3 fix due to the memcg
event sampling. Patch 3 does help to keep the soft reclaim tree ordering
more up to date.

Thanks.

Tim
On Fri 19-02-21 11:28:47, Tim Chen wrote:
>
> On 2/19/21 1:16 AM, Michal Hocko wrote:
>
> >>
> >> Something like this?
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 8bddee75f5cb..b50cae3b2a1a 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -3472,6 +3472,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> >>  		if (!mz)
> >>  			break;
> >>
> >> +		/*
> >> +		 * Soft limit tree is updated based on memcg events sampling.
> >> +		 * We could have missed some updates on page uncharge and
> >> +		 * the cgroup is below soft limit. Skip useless soft reclaim.
> >> +		 */
> >> +		if (!soft_limit_excess(mz->memcg))
> >> +			continue;
> >> +
> >>  		nr_scanned = 0;
> >>  		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat,
> >
> > Yes I meant something like this but then I have looked more closely and
> > this shouldn't be needed afterall. __mem_cgroup_largest_soft_limit_node
> > already does all the work
> > 	if (!soft_limit_excess(mz->memcg) ||
> > 	    !css_tryget(&mz->memcg->css))
> > 		goto retry;
> > so this shouldn't really happen.
> >
>
> Ah, that's true. The added check for soft_limit_excess is not needed.
>
> Do you think it is still a good idea to add patch 3 to
> restrict the uncharge update in page batch of the same node and cgroup?

I would rather drop it. The less the soft limit reclaim code is spread
around the better.
On 2/22/21 12:41 AM, Michal Hocko wrote:
>>
>> Ah, that's true. The added check for soft_limit_excess is not needed.
>>
>> Do you think it is still a good idea to add patch 3 to
>> restrict the uncharge update in page batch of the same node and cgroup?
>
> I would rather drop it. The less the soft limit reclaim code is spread
> around the better.
>

Let's drop patch 3 then. I find patch 2 is the most critical one in this
series. Without that patch, some cgroups exceed their soft limit very
badly.

Tim
On 2/17/21 9:56 PM, Johannes Weiner wrote:
>>  static inline void uncharge_gather_clear(struct uncharge_gather *ug)
>> @@ -6849,7 +6850,13 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
>>  	 * exclusive access to the page.
>>  	 */
>>
>> -	if (ug->memcg != page_memcg(page)) {
>> +	if (ug->memcg != page_memcg(page) ||
>> +	    /*
>> +	     * Update soft limit tree used in v1 cgroup in page batch for
>> +	     * the same node. Relevant only to v1 cgroup with a soft limit.
>> +	     */
>> +	    (ug->dummy_page && ug->nid != page_to_nid(page) &&
>> +	     ug->memcg->soft_limit != PAGE_COUNTER_MAX)) {
>
> Sorry, I used weird phrasing in my last email.
>
> Can you please preface the checks you're adding with a
> !cgroup_subsys_on_dfl(memory_cgrp_subsys) to static branch for
> cgroup1? The uncharge path is pretty hot, and this would avoid the
> runtime overhead on cgroup2 at least, which doesn't have the SL.
>
> Also, do we need the ug->dummy_page check? It's only NULL on the first
> loop - where ug->memcg is NULL as well and the branch is taken anyway.
>
> The soft limit check is also slightly cheaper than the nid check, as
> page_to_nid() might be out-of-line, so we should do it first. This?
>
> 	/*
> 	 * Batch-uncharge all pages of the same memcg.
> 	 *
> 	 * Unless we're looking at a cgroup1 with a softlimit
> 	 * set: the soft limit trees are maintained per-node
> 	 * and updated on uncharge (via dummy_page), so keep
> 	 * batches confined to a single node as well.
> 	 */
> 	if (ug->memcg != page_memcg(page) ||
> 	    (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
> 	     ug->memcg->soft_limit != PAGE_COUNTER_MAX &&
> 	     ug->nid != page_to_nid(page)))
>

Johannes,

Thanks for your feedback. Since Michal has concerns about the overhead
this patch could incur, I think we'll hold the patch for now. If later
on Michal thinks that this patch is a good idea, I'll incorporate these
changes you suggested.

Tim
On Mon, Feb 22, 2021 at 10:38:27AM -0800, Tim Chen wrote:
> Johannes,
>
> Thanks for your feedback. Since Michal has concerns about the overhead
> this patch could incur, I think we'll hold the patch for now. If later
> on Michal think that this patch is a good idea, I'll incorporate these
> changes you suggested.

That works for me. Thanks!
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d72449eeb85a..8bddee75f5cb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6804,6 +6804,7 @@ struct uncharge_gather {
 	unsigned long pgpgout;
 	unsigned long nr_kmem;
 	struct page *dummy_page;
+	int nid;
 };
 
 static inline void uncharge_gather_clear(struct uncharge_gather *ug)
@@ -6849,7 +6850,13 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
 	 * exclusive access to the page.
 	 */
 
-	if (ug->memcg != page_memcg(page)) {
+	if (ug->memcg != page_memcg(page) ||
+	    /*
+	     * Update soft limit tree used in v1 cgroup in page batch for
+	     * the same node. Relevant only to v1 cgroup with a soft limit.
+	     */
+	    (ug->dummy_page && ug->nid != page_to_nid(page) &&
+	     ug->memcg->soft_limit != PAGE_COUNTER_MAX)) {
 		if (ug->memcg) {
 			uncharge_batch(ug);
 			uncharge_gather_clear(ug);
@@ -6869,6 +6876,7 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
 
 	ug->pgpgout++;
 	ug->dummy_page = page;
+	ug->nid = page_to_nid(page);
 	page->memcg_data = 0;
 	css_put(&ug->memcg->css);
 }