[v4] mm: thp: remove the defer list related code since this will not happen
diff mbox series

Message ID 20200117233836.3434-1-richardw.yang@linux.intel.com
State New
Headers show
Series
  • [v4] mm: thp: remove the defer list related code since this will not happen
Related show

Commit Message

Wei Yang Jan. 17, 2020, 11:38 p.m. UTC
If compound is true, this means it is a PMD mapped THP. Which implies
the page is not linked to any defer list. So the first code chunk will
not be executed.

Also with this reason, it would not be proper to add this page to a
defer list. So the second code chunk is not correct.

Based on this, we should remove the defer list related code.

Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")

Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>    [5.4+]

---
v4:
  * finally we identified the related code is not necessary and not
    correct, just remove it
  * thanks to Kirill T first spot some problem
v3:
  * remove all review/ack tag since rewrite the changelog
  * use deferred_split_huge_page as the example of race
  * add cc stable 5.4+ tag as suggested by David Rientjes

v2:
  * move check on compound outside suggested by Alexander
  * an example of the race condition, suggested by Michal
---
 mm/memcontrol.c | 18 ------------------
 1 file changed, 18 deletions(-)

Comments

Yang Shi Jan. 18, 2020, 12:57 a.m. UTC | #1
On 1/17/20 3:38 PM, Wei Yang wrote:
> If compound is true, this means it is a PMD mapped THP. Which implies
> the page is not linked to any defer list. So the first code chunk will
> not be executed.
>
> Also with this reason, it would not be proper to add this page to a
> defer list. So the second code chunk is not correct.
>
> Based on this, we should remove the defer list related code.
>
> Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
>
> Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
> Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: <stable@vger.kernel.org>    [5.4+]
>
> ---
> v4:
>    * finally we identified the related code is not necessary and not
>      correct, just remove it
>    * thanks to Kirill T first spot some problem

Thanks for debugging and figuring this out. Acked-by: Yang Shi 
<yang.shi@linux.alibaba.com>

> v3:
>    * remove all review/ack tag since rewrite the changelog
>    * use deferred_split_huge_page as the example of race
>    * add cc stable 5.4+ tag as suggested by David Rientjes
>
> v2:
>    * move check on compound outside suggested by Alexander
>    * an example of the race condition, suggested by Michal
> ---
>   mm/memcontrol.c | 18 ------------------
>   1 file changed, 18 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6c83cf4ed970..27c231bf4565 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5340,14 +5340,6 @@ static int mem_cgroup_move_account(struct page *page,
>   		__mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages);
>   	}
>   
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	if (compound && !list_empty(page_deferred_list(page))) {
> -		spin_lock(&from->deferred_split_queue.split_queue_lock);
> -		list_del_init(page_deferred_list(page));
> -		from->deferred_split_queue.split_queue_len--;
> -		spin_unlock(&from->deferred_split_queue.split_queue_lock);
> -	}
> -#endif
>   	/*
>   	 * It is safe to change page->mem_cgroup here because the page
>   	 * is referenced, charged, and isolated - we can't race with
> @@ -5357,16 +5349,6 @@ static int mem_cgroup_move_account(struct page *page,
>   	/* caller should have done css_get */
>   	page->mem_cgroup = to;
>   
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	if (compound && list_empty(page_deferred_list(page))) {
> -		spin_lock(&to->deferred_split_queue.split_queue_lock);
> -		list_add_tail(page_deferred_list(page),
> -			      &to->deferred_split_queue.split_queue);
> -		to->deferred_split_queue.split_queue_len++;
> -		spin_unlock(&to->deferred_split_queue.split_queue_lock);
> -	}
> -#endif
> -
>   	spin_unlock_irqrestore(&from->move_lock, flags);
>   
>   	ret = 0;
Yang Shi Jan. 18, 2020, 5:30 a.m. UTC | #2
On 1/17/20 4:57 PM, Yang Shi wrote:
>
>
> On 1/17/20 3:38 PM, Wei Yang wrote:
>> If compound is true, this means it is a PMD mapped THP. Which implies
>> the page is not linked to any defer list. So the first code chunk will
>> not be executed.
>>
>> Also with this reason, it would not be proper to add this page to a
>> defer list. So the second code chunk is not correct.
>>
>> Based on this, we should remove the defer list related code.
>>
>> Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg 
>> aware")
>>
>> Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
>> Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: <stable@vger.kernel.org>    [5.4+]
>>
>> ---
>> v4:
>>    * finally we identified the related code is not necessary and not
>>      correct, just remove it
>>    * thanks to Kirill T first spot some problem
>
> Thanks for debugging and figuring this out. Acked-by: Yang Shi 
> <yang.shi@linux.alibaba.com>

BTW, the patch itself is fine, but the subject looks really confusing. 
It sounds like we would remove all deferred list code. I'd suggest 
rephrase it to:

mm: thp: don't need care deferred split queue in memcg charge move path

>
>> v3:
>>    * remove all review/ack tag since rewrite the changelog
>>    * use deferred_split_huge_page as the example of race
>>    * add cc stable 5.4+ tag as suggested by David Rientjes
>>
>> v2:
>>    * move check on compound outside suggested by Alexander
>>    * an example of the race condition, suggested by Michal
>> ---
>>   mm/memcontrol.c | 18 ------------------
>>   1 file changed, 18 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 6c83cf4ed970..27c231bf4565 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -5340,14 +5340,6 @@ static int mem_cgroup_move_account(struct page 
>> *page,
>>           __mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages);
>>       }
>>   -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -    if (compound && !list_empty(page_deferred_list(page))) {
>> - spin_lock(&from->deferred_split_queue.split_queue_lock);
>> -        list_del_init(page_deferred_list(page));
>> -        from->deferred_split_queue.split_queue_len--;
>> - spin_unlock(&from->deferred_split_queue.split_queue_lock);
>> -    }
>> -#endif
>>       /*
>>        * It is safe to change page->mem_cgroup here because the page
>>        * is referenced, charged, and isolated - we can't race with
>> @@ -5357,16 +5349,6 @@ static int mem_cgroup_move_account(struct page 
>> *page,
>>       /* caller should have done css_get */
>>       page->mem_cgroup = to;
>>   -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -    if (compound && list_empty(page_deferred_list(page))) {
>> - spin_lock(&to->deferred_split_queue.split_queue_lock);
>> -        list_add_tail(page_deferred_list(page),
>> - &to->deferred_split_queue.split_queue);
>> -        to->deferred_split_queue.split_queue_len++;
>> - spin_unlock(&to->deferred_split_queue.split_queue_lock);
>> -    }
>> -#endif
>> -
>>       spin_unlock_irqrestore(&from->move_lock, flags);
>>         ret = 0;
>
Andrew Morton Jan. 18, 2020, 10:54 p.m. UTC | #3
On Sat, 18 Jan 2020 07:38:36 +0800 Wei Yang <richardw.yang@linux.intel.com> wrote:

> If compound is true, this means it is a PMD mapped THP. Which implies
> the page is not linked to any defer list. So the first code chunk will
> not be executed.
> 
> Also with this reason, it would not be proper to add this page to a
> defer list. So the second code chunk is not correct.
> 
> Based on this, we should remove the defer list related code.
> 
> Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
> 
> Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
> Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: <stable@vger.kernel.org>    [5.4+]

This patch is identical to "mm: thp: grab the lock before manipulating
defer list", which is rather confusing.  Please let people know when
this sort of thing is done.

The earlier changelog mentioned a possible race condition.  This
changelog does not.  In fact this changelog fails to provide any
description of any userspace-visible runtime effects of the bug. 
Please send along such a description for inclusion, as always.
David Rientjes Jan. 18, 2020, 11:36 p.m. UTC | #4
On Sat, 18 Jan 2020, Andrew Morton wrote:

> On Sat, 18 Jan 2020 07:38:36 +0800 Wei Yang <richardw.yang@linux.intel.com> wrote:
> 
> > If compound is true, this means it is a PMD mapped THP. Which implies
> > the page is not linked to any defer list. So the first code chunk will
> > not be executed.
> > 
> > Also with this reason, it would not be proper to add this page to a
> > defer list. So the second code chunk is not correct.
> > 
> > Based on this, we should remove the defer list related code.
> > 
> > Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
> > 
> > Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
> > Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: <stable@vger.kernel.org>    [5.4+]
> 
> This patch is identical to "mm: thp: grab the lock before manipulating
> defer list", which is rather confusing.  Please let people know when
> this sort of thing is done.
> 
> The earlier changelog mentioned a possible race condition.  This
> changelog does not.  In fact this changelog fails to provide any
> description of any userspace-visible runtime effects of the bug. 
> Please send along such a description for inclusion, as always.
> 

The locking concern that Wei was originally looking at is no longer an 
issue because we determined that the code in question could simply be 
removed.

I think the following can be added to the changelog:

----->o-----

When migrating memcg charges of thp memory, there are two possibilities:

 (1) The underlying compound page is mapped by a pmd and thus does is not 
     on a deferred split queue (it's mapped), or

 (2) The compound page is not mapped by a pmd and is awaiting split on a
     deferred split queue.

The current charge migration implementation does *not* migrate charges for 
thp memory on the deferred split queue, it only migrates charges for pages 
that are mapped by a pmd.

Thus, to migrate charges, the underlying compound page cannot be on a 
deferred split queue; no list manipulation needs to be done in 
mem_cgroup_move_account().

With the current code, the underlying compound page is moved to the 
deferred split queue of the memcg its memory is not charged to, so 
susbequent reclaim will consider these pages for the wrong memcg.  Remove 
the deferred split queue handling in mem_cgroup_move_account() entirely.

----->o-----

Acked-by: David Rientjes <rientjes@google.com>
Wei Yang Jan. 19, 2020, 2:24 a.m. UTC | #5
On Sat, Jan 18, 2020 at 03:36:06PM -0800, David Rientjes wrote:
>On Sat, 18 Jan 2020, Andrew Morton wrote:
>
>> On Sat, 18 Jan 2020 07:38:36 +0800 Wei Yang <richardw.yang@linux.intel.com> wrote:
>> 
>> > If compound is true, this means it is a PMD mapped THP. Which implies
>> > the page is not linked to any defer list. So the first code chunk will
>> > not be executed.
>> > 
>> > Also with this reason, it would not be proper to add this page to a
>> > defer list. So the second code chunk is not correct.
>> > 
>> > Based on this, we should remove the defer list related code.
>> > 
>> > Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
>> > 
>> > Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
>> > Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> > Cc: <stable@vger.kernel.org>    [5.4+]
>> 
>> This patch is identical to "mm: thp: grab the lock before manipulating
>> defer list", which is rather confusing.  Please let people know when
>> this sort of thing is done.
>> 
>> The earlier changelog mentioned a possible race condition.  This
>> changelog does not.  In fact this changelog fails to provide any
>> description of any userspace-visible runtime effects of the bug. 
>> Please send along such a description for inclusion, as always.
>> 
>
>The locking concern that Wei was originally looking at is no longer an 
>issue because we determined that the code in question could simply be 
>removed.
>
>I think the following can be added to the changelog:
>
>----->o-----
>
>When migrating memcg charges of thp memory, there are two possibilities:
>
> (1) The underlying compound page is mapped by a pmd and thus does is not 
>     on a deferred split queue (it's mapped), or
>
> (2) The compound page is not mapped by a pmd and is awaiting split on a
>     deferred split queue.
>
>The current charge migration implementation does *not* migrate charges for 
>thp memory on the deferred split queue, it only migrates charges for pages 
>that are mapped by a pmd.
>
>Thus, to migrate charges, the underlying compound page cannot be on a 
>deferred split queue; no list manipulation needs to be done in 
>mem_cgroup_move_account().
>
>With the current code, the underlying compound page is moved to the 
>deferred split queue of the memcg its memory is not charged to, so 
>susbequent reclaim will consider these pages for the wrong memcg.  Remove 
>the deferred split queue handling in mem_cgroup_move_account() entirely.
>
>----->o-----
>
>Acked-by: David Rientjes <rientjes@google.com>

Hi David,

The changlog looks awesome to me. Thanks ~

Hi Andrew

I see you queue this in you tree, do I need to rewrite a patch with better
changelog?
Michal Hocko Jan. 20, 2020, 7:22 a.m. UTC | #6
On Sat 18-01-20 15:36:06, David Rientjes wrote:
> On Sat, 18 Jan 2020, Andrew Morton wrote:
> 
> > On Sat, 18 Jan 2020 07:38:36 +0800 Wei Yang <richardw.yang@linux.intel.com> wrote:
> > 
> > > If compound is true, this means it is a PMD mapped THP. Which implies
> > > the page is not linked to any defer list. So the first code chunk will
> > > not be executed.
> > > 
> > > Also with this reason, it would not be proper to add this page to a
> > > defer list. So the second code chunk is not correct.
> > > 
> > > Based on this, we should remove the defer list related code.
> > > 
> > > Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
> > > 
> > > Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
> > > Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > Cc: <stable@vger.kernel.org>    [5.4+]
> > 
> > This patch is identical to "mm: thp: grab the lock before manipulating
> > defer list", which is rather confusing.  Please let people know when
> > this sort of thing is done.
> > 
> > The earlier changelog mentioned a possible race condition.  This
> > changelog does not.  In fact this changelog fails to provide any
> > description of any userspace-visible runtime effects of the bug. 
> > Please send along such a description for inclusion, as always.
> > 
> 
> The locking concern that Wei was originally looking at is no longer an 
> issue because we determined that the code in question could simply be 
> removed.
> 
> I think the following can be added to the changelog:
> 
> ----->o-----
> 
> When migrating memcg charges of thp memory, there are two possibilities:
> 
>  (1) The underlying compound page is mapped by a pmd and thus does is not 
>      on a deferred split queue (it's mapped), or
> 
>  (2) The compound page is not mapped by a pmd and is awaiting split on a
>      deferred split queue.
> 
> The current charge migration implementation does *not* migrate charges for 
> thp memory on the deferred split queue, it only migrates charges for pages 
> that are mapped by a pmd.
> 
> Thus, to migrate charges, the underlying compound page cannot be on a 
> deferred split queue; no list manipulation needs to be done in 
> mem_cgroup_move_account().
> 
> With the current code, the underlying compound page is moved to the 
> deferred split queue of the memcg its memory is not charged to, so 
> susbequent reclaim will consider these pages for the wrong memcg.  Remove 
> the deferred split queue handling in mem_cgroup_move_account() entirely.

I believe this still doesn't describe the underlying problem to the full
extent. What happens with the page on the deferred list when it
shouldn't be there in fact? Unless I am missing something deferred_split_scan
will simply split that huge page. Which is a bit unfortunate but nothing
really critical. This should be mentioned in the changelog.

With that clarified, feel free to add

Acked-by: Michal Hocko <mhocko@suse.com>
Wei Yang Jan. 20, 2020, 8:17 a.m. UTC | #7
On Mon, Jan 20, 2020 at 08:22:37AM +0100, Michal Hocko wrote:
>On Sat 18-01-20 15:36:06, David Rientjes wrote:
>> On Sat, 18 Jan 2020, Andrew Morton wrote:
>> 
>> > On Sat, 18 Jan 2020 07:38:36 +0800 Wei Yang <richardw.yang@linux.intel.com> wrote:
>> > 
>> > > If compound is true, this means it is a PMD mapped THP. Which implies
>> > > the page is not linked to any defer list. So the first code chunk will
>> > > not be executed.
>> > > 
>> > > Also with this reason, it would not be proper to add this page to a
>> > > defer list. So the second code chunk is not correct.
>> > > 
>> > > Based on this, we should remove the defer list related code.
>> > > 
>> > > Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
>> > > 
>> > > Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
>> > > Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> > > Cc: <stable@vger.kernel.org>    [5.4+]
>> > 
>> > This patch is identical to "mm: thp: grab the lock before manipulating
>> > defer list", which is rather confusing.  Please let people know when
>> > this sort of thing is done.
>> > 
>> > The earlier changelog mentioned a possible race condition.  This
>> > changelog does not.  In fact this changelog fails to provide any
>> > description of any userspace-visible runtime effects of the bug. 
>> > Please send along such a description for inclusion, as always.
>> > 
>> 
>> The locking concern that Wei was originally looking at is no longer an 
>> issue because we determined that the code in question could simply be 
>> removed.
>> 
>> I think the following can be added to the changelog:
>> 
>> ----->o-----
>> 
>> When migrating memcg charges of thp memory, there are two possibilities:
>> 
>>  (1) The underlying compound page is mapped by a pmd and thus does is not 
>>      on a deferred split queue (it's mapped), or
>> 
>>  (2) The compound page is not mapped by a pmd and is awaiting split on a
>>      deferred split queue.
>> 
>> The current charge migration implementation does *not* migrate charges for 
>> thp memory on the deferred split queue, it only migrates charges for pages 
>> that are mapped by a pmd.
>> 
>> Thus, to migrate charges, the underlying compound page cannot be on a 
>> deferred split queue; no list manipulation needs to be done in 
>> mem_cgroup_move_account().
>> 
>> With the current code, the underlying compound page is moved to the 
>> deferred split queue of the memcg its memory is not charged to, so 
>> susbequent reclaim will consider these pages for the wrong memcg.  Remove 
>> the deferred split queue handling in mem_cgroup_move_account() entirely.
>
>I believe this still doesn't describe the underlying problem to the full
>extent. What happens with the page on the deferred list when it
>shouldn't be there in fact? Unless I am missing something deferred_split_scan
>will simply split that huge page. Which is a bit unfortunate but nothing
>really critical. This should be mentioned in the changelog.
>

Per my understanding, if we do the split when it is not necessary, we
probably have a lower performance due to tlb miss. For others, I don't see the
impact.

>With that clarified, feel free to add
>
>Acked-by: Michal Hocko <mhocko@suse.com>
>
>-- 
>Michal Hocko
>SUSE Labs
David Rientjes Jan. 20, 2020, 9:10 p.m. UTC | #8
On Mon, 20 Jan 2020, Michal Hocko wrote:

> > When migrating memcg charges of thp memory, there are two possibilities:
> > 
> >  (1) The underlying compound page is mapped by a pmd and thus does is not 
> >      on a deferred split queue (it's mapped), or
> > 
> >  (2) The compound page is not mapped by a pmd and is awaiting split on a
> >      deferred split queue.
> > 
> > The current charge migration implementation does *not* migrate charges for 
> > thp memory on the deferred split queue, it only migrates charges for pages 
> > that are mapped by a pmd.
> > 
> > Thus, to migrate charges, the underlying compound page cannot be on a 
> > deferred split queue; no list manipulation needs to be done in 
> > mem_cgroup_move_account().
> > 
> > With the current code, the underlying compound page is moved to the 
> > deferred split queue of the memcg its memory is not charged to, so 
> > susbequent reclaim will consider these pages for the wrong memcg.  Remove 
> > the deferred split queue handling in mem_cgroup_move_account() entirely.
> 
> I believe this still doesn't describe the underlying problem to the full
> extent. What happens with the page on the deferred list when it
> shouldn't be there in fact? Unless I am missing something deferred_split_scan
> will simply split that huge page. Which is a bit unfortunate but nothing
> really critical. This should be mentioned in the changelog.
> 

Are you referring to a compound page on the deferred split queue before a 
task is moved?  I'm not sure this is within the scope of Wei's patch.. 
this is simply preventing a page from being moved to the deferred split
queue of a memcg that it is not charged to.  Is there a concern about why 
this code can be removed or a suggestion on something else it should be 
doing instead?
Michal Hocko Jan. 20, 2020, 9:27 p.m. UTC | #9
On Mon 20-01-20 13:10:56, David Rientjes wrote:
> On Mon, 20 Jan 2020, Michal Hocko wrote:
> 
> > > When migrating memcg charges of thp memory, there are two possibilities:
> > > 
> > >  (1) The underlying compound page is mapped by a pmd and thus does is not 
> > >      on a deferred split queue (it's mapped), or
> > > 
> > >  (2) The compound page is not mapped by a pmd and is awaiting split on a
> > >      deferred split queue.
> > > 
> > > The current charge migration implementation does *not* migrate charges for 
> > > thp memory on the deferred split queue, it only migrates charges for pages 
> > > that are mapped by a pmd.
> > > 
> > > Thus, to migrate charges, the underlying compound page cannot be on a 
> > > deferred split queue; no list manipulation needs to be done in 
> > > mem_cgroup_move_account().
> > > 
> > > With the current code, the underlying compound page is moved to the 
> > > deferred split queue of the memcg its memory is not charged to, so 
> > > susbequent reclaim will consider these pages for the wrong memcg.  Remove 
> > > the deferred split queue handling in mem_cgroup_move_account() entirely.
> > 
> > I believe this still doesn't describe the underlying problem to the full
> > extent. What happens with the page on the deferred list when it
> > shouldn't be there in fact? Unless I am missing something deferred_split_scan
> > will simply split that huge page. Which is a bit unfortunate but nothing
> > really critical. This should be mentioned in the changelog.
> > 
> 
> Are you referring to a compound page on the deferred split queue before a 
> task is moved?  I'm not sure this is within the scope of Wei's patch.. 
> this is simply preventing a page from being moved to the deferred split
> queue of a memcg that it is not charged to.  Is there a concern about why 
> this code can be removed or a suggestion on something else it should be 
> doing instead?

No, I do not have any concern about the patch itslef. It is that the
changelog doesn't decribe the user visible effect. All I am saying is
that the current code splits THPs of moved pages under memory pressure
even if that is not needed. And that is a clear bug.
David Rientjes Jan. 21, 2020, 11:08 p.m. UTC | #10
On Mon, 20 Jan 2020, Michal Hocko wrote:

> > > > When migrating memcg charges of thp memory, there are two possibilities:
> > > > 
> > > >  (1) The underlying compound page is mapped by a pmd and thus does is not 
> > > >      on a deferred split queue (it's mapped), or
> > > > 
> > > >  (2) The compound page is not mapped by a pmd and is awaiting split on a
> > > >      deferred split queue.
> > > > 
> > > > The current charge migration implementation does *not* migrate charges for 
> > > > thp memory on the deferred split queue, it only migrates charges for pages 
> > > > that are mapped by a pmd.
> > > > 
> > > > Thus, to migrate charges, the underlying compound page cannot be on a 
> > > > deferred split queue; no list manipulation needs to be done in 
> > > > mem_cgroup_move_account().
> > > > 
> > > > With the current code, the underlying compound page is moved to the 
> > > > deferred split queue of the memcg its memory is not charged to, so 
> > > > susbequent reclaim will consider these pages for the wrong memcg.  Remove 
> > > > the deferred split queue handling in mem_cgroup_move_account() entirely.
> > > 
> > > I believe this still doesn't describe the underlying problem to the full
> > > extent. What happens with the page on the deferred list when it
> > > shouldn't be there in fact? Unless I am missing something deferred_split_scan
> > > will simply split that huge page. Which is a bit unfortunate but nothing
> > > really critical. This should be mentioned in the changelog.
> > > 
> > 
> > Are you referring to a compound page on the deferred split queue before a 
> > task is moved?  I'm not sure this is within the scope of Wei's patch.. 
> > this is simply preventing a page from being moved to the deferred split
> > queue of a memcg that it is not charged to.  Is there a concern about why 
> > this code can be removed or a suggestion on something else it should be 
> > doing instead?
> 
> No, I do not have any concern about the patch itslef. It is that the
> changelog doesn't decribe the user visible effect. All I am saying is
> that the current code splits THPs of moved pages under memory pressure
> even if that is not needed. And that is a clear bug.

Ah, gotcha.  I tried to do this in the final paragraph of my amedment to 
Wei's patch and why it's important that this is marked as stable.

The current code in 5.4 from commit 87eaceb3faa59 places any migrated 
compound page onto the deferred split queue of the destination memcg 
regardless of whether it has a mapping pmd 
(list_empty(page_deferred_list()) was already false) or it does not have a 
mapping pmd (but is now on the wrong queue).  For the latter, 
can_split_huge_page() can help for the actual split but not for the 
removal of the page that is now erroneously on the queue.  For the former, 
memcg reclaim would not see the pages that it should split under memcg 
pressure so we'll see the same memcg oom conditions we saw before the 
deferred split shrinker became SHRINKER_MEMCG_AWARE: unnecessary ooms.
Michal Hocko Jan. 22, 2020, 8:14 a.m. UTC | #11
On Tue 21-01-20 15:08:39, David Rientjes wrote:
> On Mon, 20 Jan 2020, Michal Hocko wrote:
> 
> > > > > When migrating memcg charges of thp memory, there are two possibilities:
> > > > > 
> > > > >  (1) The underlying compound page is mapped by a pmd and thus does is not 
> > > > >      on a deferred split queue (it's mapped), or
> > > > > 
> > > > >  (2) The compound page is not mapped by a pmd and is awaiting split on a
> > > > >      deferred split queue.
> > > > > 
> > > > > The current charge migration implementation does *not* migrate charges for 
> > > > > thp memory on the deferred split queue, it only migrates charges for pages 
> > > > > that are mapped by a pmd.
> > > > > 
> > > > > Thus, to migrate charges, the underlying compound page cannot be on a 
> > > > > deferred split queue; no list manipulation needs to be done in 
> > > > > mem_cgroup_move_account().
> > > > > 
> > > > > With the current code, the underlying compound page is moved to the 
> > > > > deferred split queue of the memcg its memory is not charged to, so 
> > > > > susbequent reclaim will consider these pages for the wrong memcg.  Remove 
> > > > > the deferred split queue handling in mem_cgroup_move_account() entirely.
> > > > 
> > > > I believe this still doesn't describe the underlying problem to the full
> > > > extent. What happens with the page on the deferred list when it
> > > > shouldn't be there in fact? Unless I am missing something deferred_split_scan
> > > > will simply split that huge page. Which is a bit unfortunate but nothing
> > > > really critical. This should be mentioned in the changelog.
> > > > 
> > > 
> > > Are you referring to a compound page on the deferred split queue before a 
> > > task is moved?  I'm not sure this is within the scope of Wei's patch.. 
> > > this is simply preventing a page from being moved to the deferred split
> > > queue of a memcg that it is not charged to.  Is there a concern about why 
> > > this code can be removed or a suggestion on something else it should be 
> > > doing instead?
> > 
> > No, I do not have any concern about the patch itslef. It is that the
> > changelog doesn't decribe the user visible effect. All I am saying is
> > that the current code splits THPs of moved pages under memory pressure
> > even if that is not needed. And that is a clear bug.
> 
> Ah, gotcha.  I tried to do this in the final paragraph of my amedment to 
> Wei's patch and why it's important that this is marked as stable.

I considered "susbequent reclaim will consider these pages for the wrong
memcg." quite unclear TBH.
 
> The current code in 5.4 from commit 87eaceb3faa59 places any migrated 
> compound page onto the deferred split queue of the destination memcg 
> regardless of whether it has a mapping pmd 
> (list_empty(page_deferred_list()) was already false) or it does not have a 
> mapping pmd (but is now on the wrong queue).  For the latter, 
> can_split_huge_page() can help for the actual split but not for the 
> removal of the page that is now erroneously on the queue.

Does that mean that those fully mapped THPs are not going to be split?

> For the former, 
> memcg reclaim would not see the pages that it should split under memcg 
> pressure so we'll see the same memcg oom conditions we saw before the 
> deferred split shrinker became SHRINKER_MEMCG_AWARE: unnecessary ooms.

OK, this is yet another user visibile effect and it would be better to
mention it explicitly in the changelog.
David Rientjes Jan. 22, 2020, 11:39 p.m. UTC | #12
On Wed, 22 Jan 2020, Michal Hocko wrote:

> > The current code in 5.4 from commit 87eaceb3faa59 places any migrated 
> > compound page onto the deferred split queue of the destination memcg 
> > regardless of whether it has a mapping pmd 
> > (list_empty(page_deferred_list()) was already false) or it does not have a 
> > mapping pmd (but is now on the wrong queue).  For the latter, 
> > can_split_huge_page() can help for the actual split but not for the 
> > removal of the page that is now erroneously on the queue.
> 
> Does that mean that those fully mapped THPs are not going to be split?
> 

It believe it should but deferred_split_scan() also won't take it off the 
wrong split queue so it will remain there and any other checks for 
page_deferred_list(page) will succeed.

> > For the former, 
> > memcg reclaim would not see the pages that it should split under memcg 
> > pressure so we'll see the same memcg oom conditions we saw before the 
> > deferred split shrinker became SHRINKER_MEMCG_AWARE: unnecessary ooms.
> 
> OK, this is yet another user visibile effect and it would be better to
> mention it explicitly in the changelog. 
> 

Ok, feel free to add to the last paragraph:

Memcg reclaim would not see the compound pages that it should split under 
memcg pressure so we'll otherwise see the same memcg oom conditions we saw 
before the deferred split shrinker became SHRINKER_MEMCG_AWARE: 
unnecessary ooms.

Patch
diff mbox series

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6c83cf4ed970..27c231bf4565 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5340,14 +5340,6 @@  static int mem_cgroup_move_account(struct page *page,
 		__mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages);
 	}
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (compound && !list_empty(page_deferred_list(page))) {
-		spin_lock(&from->deferred_split_queue.split_queue_lock);
-		list_del_init(page_deferred_list(page));
-		from->deferred_split_queue.split_queue_len--;
-		spin_unlock(&from->deferred_split_queue.split_queue_lock);
-	}
-#endif
 	/*
 	 * It is safe to change page->mem_cgroup here because the page
 	 * is referenced, charged, and isolated - we can't race with
@@ -5357,16 +5349,6 @@  static int mem_cgroup_move_account(struct page *page,
 	/* caller should have done css_get */
 	page->mem_cgroup = to;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (compound && list_empty(page_deferred_list(page))) {
-		spin_lock(&to->deferred_split_queue.split_queue_lock);
-		list_add_tail(page_deferred_list(page),
-			      &to->deferred_split_queue.split_queue);
-		to->deferred_split_queue.split_queue_len++;
-		spin_unlock(&to->deferred_split_queue.split_queue_lock);
-	}
-#endif
-
 	spin_unlock_irqrestore(&from->move_lock, flags);
 
 	ret = 0;