
mm: khugepaged: fix potential page state corruption

Message ID 1584573582-116702-1-git-send-email-yang.shi@linux.alibaba.com (mailing list archive)
State New, archived
Series mm: khugepaged: fix potential page state corruption

Commit Message

Yang Shi March 18, 2020, 11:19 p.m. UTC
When khugepaged collapses anonymous pages, the base pages are freed
via pagevec or free_page_and_swap_cache().  But the anonymous page may
first be added back to the LRU, which might result in the below race:

	CPU A				CPU B
khugepaged:
  unlock page
  putback_lru_page
    add to lru
				page reclaim:
				  isolate this page
				  try_to_unmap
  page_remove_rmap <-- corrupt _mapcount

It looks like nothing would prevent the pages from being isolated by the reclaimer.

The other problem is that the page's active or unevictable flag might
still be set when freeing the page via free_page_and_swap_cache().
putback_lru_page() would not clear those two flags if the pages are
released via pagevec; it sounds like nothing prevents active or
unevictable pages from being isolated.

However, I didn't actually run into these problems; they are theoretical,
found by visual inspection.

And it also seems unnecessary to add the pages back to the LRU again, since
they are about to be freed when reaching this point.  So, clear the active
and unevictable flags, unlock, and drop the refcount from isolation
instead of calling putback_lru_page(), as the page cache collapse path does.

Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/khugepaged.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

Comments

Kirill A. Shutemov March 18, 2020, 11:40 p.m. UTC | #1
On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
> When khugepaged collapses anonymous pages, the base pages would be freed
> via pagevec or free_page_and_swap_cache().  But, the anonymous page may
> be added back to LRU, then it might result in the below race:
> 
> 	CPU A				CPU B
> khugepaged:
>   unlock page
>   putback_lru_page
>     add to lru
> 				page reclaim:
> 				  isolate this page
> 				  try_to_unmap
>   page_remove_rmap <-- corrupt _mapcount
> 
> It looks nothing would prevent the pages from isolating by reclaimer.
> 
> The other problem is the page's active or unevictable flag might be
> still set when freeing the page via free_page_and_swap_cache().  The
> putback_lru_page() would not clear those two flags if the pages are
> released via pagevec, it sounds nothing prevents from isolating active
> or unevictable pages.
> 
> However I didn't really run into these problems, just in theory by visual
> inspection.
> 
> And, it also seems unnecessary to have the pages add back to LRU again since
> they are about to be freed when reaching this point.  So, clearing active
> and unevictable flags, unlocking and dropping refcount from isolate
> instead of calling putback_lru_page() as what page cache collapse does.
> 
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> ---
>  mm/khugepaged.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b679908..f42fa4e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -673,7 +673,6 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>  			src_page = pte_page(pteval);
>  			copy_user_highpage(page, src_page, address, vma);
>  			VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
> -			release_pte_page(src_page);
>  			/*
>  			 * ptl mostly unnecessary, but preempt has to
>  			 * be disabled to update the per-cpu stats
> @@ -687,6 +686,15 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>  			pte_clear(vma->vm_mm, address, _pte);
>  			page_remove_rmap(src_page, false);
>  			spin_unlock(ptl);
> +
> +			dec_node_page_state(src_page,
> +				NR_ISOLATED_ANON + page_is_file_cache(src_page));
> +			ClearPageActive(src_page);
> +			ClearPageUnevictable(src_page);
> +			unlock_page(src_page);
> +			/* Drop refcount from isolate */
> +			put_page(src_page);
> +
>  			free_page_and_swap_cache(src_page);
>  		}
>  	}

I will look at this closer tomorrow, but it looks like an easier fix is to
move release_pte_page() under the ptl that we take just after it.
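For illustration, a rough sketch of that alternative (hypothetical and untested -- it keeps release_pte_page(), but calls it before dropping the ptl in __collapse_huge_page_copy()):

			pte_clear(vma->vm_mm, address, _pte);
			page_remove_rmap(src_page, false);
			/*
			 * Put the page back on the LRU and drop the
			 * isolation refcount while still holding ptl.
			 */
			release_pte_page(src_page);
			spin_unlock(ptl);
			free_page_and_swap_cache(src_page);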
Yang Shi March 18, 2020, 11:47 p.m. UTC | #2
On 3/18/20 4:40 PM, Kirill A. Shutemov wrote:
> On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
>> When khugepaged collapses anonymous pages, the base pages would be freed
>> via pagevec or free_page_and_swap_cache().  But, the anonymous page may
>> be added back to LRU, then it might result in the below race:
>>
>> 	CPU A				CPU B
>> khugepaged:
>>    unlock page
>>    putback_lru_page
>>      add to lru
>> 				page reclaim:
>> 				  isolate this page
>> 				  try_to_unmap
>>    page_remove_rmap <-- corrupt _mapcount
>>
>> It looks nothing would prevent the pages from isolating by reclaimer.
>>
>> The other problem is the page's active or unevictable flag might be
>> still set when freeing the page via free_page_and_swap_cache().  The
>> putback_lru_page() would not clear those two flags if the pages are
>> released via pagevec, it sounds nothing prevents from isolating active
>> or unevictable pages.
>>
>> However I didn't really run into these problems, just in theory by visual
>> inspection.
>>
>> And, it also seems unnecessary to have the pages add back to LRU again since
>> they are about to be freed when reaching this point.  So, clearing active
>> and unevictable flags, unlocking and dropping refcount from isolate
>> instead of calling putback_lru_page() as what page cache collapse does.
>>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>> ---
>>   mm/khugepaged.c | 10 +++++++++-
>>   1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index b679908..f42fa4e 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -673,7 +673,6 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>>   			src_page = pte_page(pteval);
>>   			copy_user_highpage(page, src_page, address, vma);
>>   			VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
>> -			release_pte_page(src_page);
>>   			/*
>>   			 * ptl mostly unnecessary, but preempt has to
>>   			 * be disabled to update the per-cpu stats
>> @@ -687,6 +686,15 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>>   			pte_clear(vma->vm_mm, address, _pte);
>>   			page_remove_rmap(src_page, false);
>>   			spin_unlock(ptl);
>> +
>> +			dec_node_page_state(src_page,
>> +				NR_ISOLATED_ANON + page_is_file_cache(src_page));
>> +			ClearPageActive(src_page);
>> +			ClearPageUnevictable(src_page);
>> +			unlock_page(src_page);
>> +			/* Drop refcount from isolate */
>> +			put_page(src_page);
>> +
>>   			free_page_and_swap_cache(src_page);
>>   		}
>>   	}
> I will look at this closer tomorrow, but looks like an easier fix is to
> move release_pte_page() under ptl that we take just after it.

Thanks, Kirill. However, it looks like putting the page back on the LRU is
not necessary. And it also sounds not very efficient to do this under the
ptl IMHO, since it may spin waiting for the lru lock.

>
Kirill A. Shutemov March 19, 2020, 12:12 a.m. UTC | #3
On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
> When khugepaged collapses anonymous pages, the base pages would be freed
> via pagevec or free_page_and_swap_cache().  But, the anonymous page may
> be added back to LRU, then it might result in the below race:
> 
> 	CPU A				CPU B
> khugepaged:
>   unlock page
>   putback_lru_page
>     add to lru
> 				page reclaim:
> 				  isolate this page
> 				  try_to_unmap
>   page_remove_rmap <-- corrupt _mapcount
> 
> It looks nothing would prevent the pages from isolating by reclaimer.

Hm. Why should it?

try_to_unmap() doesn't exclude parallel page unmapping. _mapcount is
protected by the ptl. And this particular _mapcount pin is not reachable
for reclaim, as it's not part of the usual page table tree. Basically,
try_to_unmap() will never succeed until we give up the _mapcount on the
khugepaged side.

I don't see the issue right away.

> The other problem is the page's active or unevictable flag might be
> still set when freeing the page via free_page_and_swap_cache().

So what?

> The putback_lru_page() would not clear those two flags if the pages are
> released via pagevec, it sounds nothing prevents from isolating active
> or unevictable pages.

Again, why should it? vmscan is equipped to deal with this.

> However I didn't really run into these problems, just in theory by visual
> inspection.
> 
> And, it also seems unnecessary to have the pages add back to LRU again since
> they are about to be freed when reaching this point.  So, clearing active
> and unevictable flags, unlocking and dropping refcount from isolate
> instead of calling putback_lru_page() as what page cache collapse does.

Hm? But we do call putback_lru_page() on the way out. I do not follow.

> 
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> ---
>  mm/khugepaged.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b679908..f42fa4e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -673,7 +673,6 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>  			src_page = pte_page(pteval);
>  			copy_user_highpage(page, src_page, address, vma);
>  			VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
> -			release_pte_page(src_page);
>  			/*
>  			 * ptl mostly unnecessary, but preempt has to
>  			 * be disabled to update the per-cpu stats
> @@ -687,6 +686,15 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>  			pte_clear(vma->vm_mm, address, _pte);
>  			page_remove_rmap(src_page, false);
>  			spin_unlock(ptl);
> +
> +			dec_node_page_state(src_page,
> +				NR_ISOLATED_ANON + page_is_file_cache(src_page));
> +			ClearPageActive(src_page);
> +			ClearPageUnevictable(src_page);
> +			unlock_page(src_page);
> +			/* Drop refcount from isolate */
> +			put_page(src_page);
> +
>  			free_page_and_swap_cache(src_page);
>  		}
>  	}
> -- 
> 1.8.3.1
> 
>
Yang Shi March 19, 2020, 12:55 a.m. UTC | #4
On 3/18/20 5:12 PM, Kirill A. Shutemov wrote:
> On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
>> When khugepaged collapses anonymous pages, the base pages would be freed
>> via pagevec or free_page_and_swap_cache().  But, the anonymous page may
>> be added back to LRU, then it might result in the below race:
>>
>> 	CPU A				CPU B
>> khugepaged:
>>    unlock page
>>    putback_lru_page
>>      add to lru
>> 				page reclaim:
>> 				  isolate this page
>> 				  try_to_unmap
>>    page_remove_rmap <-- corrupt _mapcount
>>
>> It looks nothing would prevent the pages from isolating by reclaimer.
> Hm. Why should it?
>
> try_to_unmap() doesn't exclude parallel page unmapping. _mapcount is
> protected by ptl. And this particular _mapcount pin is reachable for
> reclaim as it's not part of usual page table tree. Basically
> try_to_unmap() will never succeeds until we give up the _mapcount on
> khugepaged side.

I don't quite get it. What does "not part of usual page table tree" mean?

What about try_to_unmap() acquiring the ptl before khugepaged does?

>
> I don't see the issue right away.
>
>> The other problem is the page's active or unevictable flag might be
>> still set when freeing the page via free_page_and_swap_cache().
> So what?

The flags may leak to the page free path, and then the kernel may complain
if DEBUG_VM is set.

>
>> The putback_lru_page() would not clear those two flags if the pages are
>> released via pagevec, it sounds nothing prevents from isolating active
>> or unevictable pages.
> Again, why should it? vmscan is equipped to deal with this.

I don't mean vmscan; I mean khugepaged may isolate active and
unevictable pages since it simply walks the page table.

>
>> However I didn't really run into these problems, just in theory by visual
>> inspection.
>>
>> And, it also seems unnecessary to have the pages add back to LRU again since
>> they are about to be freed when reaching this point.  So, clearing active
>> and unevictable flags, unlocking and dropping refcount from isolate
>> instead of calling putback_lru_page() as what page cache collapse does.
> Hm? But we do call putback_lru_page() on the way out. I do not follow.

It just calls putback_lru_page() on the error path, not the success path.
Putting pages back on the LRU on the error path definitely makes sense,
but here it is the success path.

>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>> ---
>>   mm/khugepaged.c | 10 +++++++++-
>>   1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index b679908..f42fa4e 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -673,7 +673,6 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>>   			src_page = pte_page(pteval);
>>   			copy_user_highpage(page, src_page, address, vma);
>>   			VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
>> -			release_pte_page(src_page);
>>   			/*
>>   			 * ptl mostly unnecessary, but preempt has to
>>   			 * be disabled to update the per-cpu stats
>> @@ -687,6 +686,15 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>>   			pte_clear(vma->vm_mm, address, _pte);
>>   			page_remove_rmap(src_page, false);
>>   			spin_unlock(ptl);
>> +
>> +			dec_node_page_state(src_page,
>> +				NR_ISOLATED_ANON + page_is_file_cache(src_page));
>> +			ClearPageActive(src_page);
>> +			ClearPageUnevictable(src_page);
>> +			unlock_page(src_page);
>> +			/* Drop refcount from isolate */
>> +			put_page(src_page);
>> +
>>   			free_page_and_swap_cache(src_page);
>>   		}
>>   	}
>> -- 
>> 1.8.3.1
>>
>>
Yang Shi March 19, 2020, 5:39 a.m. UTC | #5
On 3/18/20 5:55 PM, Yang Shi wrote:
>
>
> On 3/18/20 5:12 PM, Kirill A. Shutemov wrote:
>> On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
>>> When khugepaged collapses anonymous pages, the base pages would be 
>>> freed
>>> via pagevec or free_page_and_swap_cache().  But, the anonymous page may
>>> be added back to LRU, then it might result in the below race:
>>>
>>>     CPU A                CPU B
>>> khugepaged:
>>>    unlock page
>>>    putback_lru_page
>>>      add to lru
>>>                 page reclaim:
>>>                   isolate this page
>>>                   try_to_unmap
>>>    page_remove_rmap <-- corrupt _mapcount
>>>
>>> It looks nothing would prevent the pages from isolating by reclaimer.
>> Hm. Why should it?
>>
>> try_to_unmap() doesn't exclude parallel page unmapping. _mapcount is
>> protected by ptl. And this particular _mapcount pin is reachable for
>> reclaim as it's not part of usual page table tree. Basically
>> try_to_unmap() will never succeeds until we give up the _mapcount on
>> khugepaged side.
>
> I don't quite get. What does "not part of usual page table tree" means?
>
> How's about try_to_unmap() acquires ptl before khugepaged?
>
>>
>> I don't see the issue right away.
>>
>>> The other problem is the page's active or unevictable flag might be
>>> still set when freeing the page via free_page_and_swap_cache().
>> So what?
>
> The flags may leak to page free path then kernel may complain if 
> DEBUG_VM is set.
>
>>
>>> The putback_lru_page() would not clear those two flags if the pages are
>>> released via pagevec, it sounds nothing prevents from isolating active

Sorry, this is a typo. If the page is freed via pagevec, the active and
unevictable flags would get cleared before freeing by page_off_lru().

But if the page is freed by free_page_and_swap_cache(), these two flags
are not cleared. It seems this path is rarely hit, though; the pages are
freed via pagevec in most cases.

>>> or unevictable pages.
>> Again, why should it? vmscan is equipped to deal with this.
>
> I don't mean vmscan, I mean khugepaged may isolate active and 
> unevictable pages since it just simply walks page table.
>
>>
>>> However I didn't really run into these problems, just in theory by 
>>> visual
>>> inspection.
>>>
>>> And, it also seems unnecessary to have the pages add back to LRU 
>>> again since
>>> they are about to be freed when reaching this point.  So, clearing 
>>> active
>>> and unevictable flags, unlocking and dropping refcount from isolate
>>> instead of calling putback_lru_page() as what page cache collapse does.
>> Hm? But we do call putback_lru_page() on the way out. I do not follow.
>
> It just calls putback_lru_page() at error path, not success path. 
> Putting pages back to lru on error path definitely makes sense. Here 
> it is the success path.
>
>>
>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> Cc: Hugh Dickins <hughd@google.com>
>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>>> ---
>>>   mm/khugepaged.c | 10 +++++++++-
>>>   1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index b679908..f42fa4e 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -673,7 +673,6 @@ static void __collapse_huge_page_copy(pte_t 
>>> *pte, struct page *page,
>>>               src_page = pte_page(pteval);
>>>               copy_user_highpage(page, src_page, address, vma);
>>>               VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
>>> -            release_pte_page(src_page);
>>>               /*
>>>                * ptl mostly unnecessary, but preempt has to
>>>                * be disabled to update the per-cpu stats
>>> @@ -687,6 +686,15 @@ static void __collapse_huge_page_copy(pte_t 
>>> *pte, struct page *page,
>>>               pte_clear(vma->vm_mm, address, _pte);
>>>               page_remove_rmap(src_page, false);
>>>               spin_unlock(ptl);
>>> +
>>> +            dec_node_page_state(src_page,
>>> +                NR_ISOLATED_ANON + page_is_file_cache(src_page));
>>> +            ClearPageActive(src_page);
>>> +            ClearPageUnevictable(src_page);
>>> +            unlock_page(src_page);
>>> +            /* Drop refcount from isolate */
>>> +            put_page(src_page);
>>> +
>>>               free_page_and_swap_cache(src_page);
>>>           }
>>>       }
>>> -- 
>>> 1.8.3.1
>>>
>>>
>
Kirill A. Shutemov March 19, 2020, 10:49 a.m. UTC | #6
On Wed, Mar 18, 2020 at 10:39:21PM -0700, Yang Shi wrote:
> 
> 
> On 3/18/20 5:55 PM, Yang Shi wrote:
> > 
> > 
> > On 3/18/20 5:12 PM, Kirill A. Shutemov wrote:
> > > On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
> > > > When khugepaged collapses anonymous pages, the base pages would
> > > > be freed
> > > > via pagevec or free_page_and_swap_cache().  But, the anonymous page may
> > > > be added back to LRU, then it might result in the below race:
> > > > 
> > > >     CPU A                CPU B
> > > > khugepaged:
> > > >    unlock page
> > > >    putback_lru_page
> > > >      add to lru
> > > >                 page reclaim:
> > > >                   isolate this page
> > > >                   try_to_unmap
> > > >    page_remove_rmap <-- corrupt _mapcount
> > > > 
> > > > It looks nothing would prevent the pages from isolating by reclaimer.
> > > Hm. Why should it?
> > > 
> > > try_to_unmap() doesn't exclude parallel page unmapping. _mapcount is
> > > protected by ptl. And this particular _mapcount pin is reachable for
> > > reclaim as it's not part of usual page table tree. Basically
> > > try_to_unmap() will never succeeds until we give up the _mapcount on
> > > khugepaged side.
> > 
> > I don't quite get. What does "not part of usual page table tree" means?
> > 
> > How's about try_to_unmap() acquires ptl before khugepaged?

The page table we are dealing with was detached from the process' page
table tree: see pmdp_collapse_flush(). try_to_unmap() will not see the
pte.

try_to_unmap() can only reach the ptl if split ptl is disabled
(mm->page_table_lock is used), but it still will not be able to reach the pte.
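For context, a simplified sketch of the relevant steps in collapse_huge_page() (mm/khugepaged.c around this kernel version; abridged, error handling and mmu notifier calls omitted):

	anon_vma_lock_write(vma->anon_vma);
	...
	pmd_ptl = pmd_lock(mm, pmd);
	/*
	 * Detach the page table from the process' page table tree and
	 * flush the TLB. After this, rmap walkers such as try_to_unmap()
	 * can no longer reach these ptes.
	 */
	_pmd = pmdp_collapse_flush(vma, address, pmd);
	spin_unlock(pmd_ptl);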

> > > 
> > > I don't see the issue right away.
> > > 
> > > > The other problem is the page's active or unevictable flag might be
> > > > still set when freeing the page via free_page_and_swap_cache().
> > > So what?
> > 
> > The flags may leak to page free path then kernel may complain if
> > DEBUG_VM is set.

Could you elaborate on what codepath you are talking about?

> > > > The putback_lru_page() would not clear those two flags if the pages are
> > > > released via pagevec, it sounds nothing prevents from isolating active
> 
> Sorry, this is a typo. If the page is freed via pagevec, active and
> unevictable flag would get cleared before freeing by page_off_lru().
> 
> But, if the page is freed by free_page_and_swap_cache(), these two flags are
> not cleared. But, it seems this path is hit rare, the pages are freed by
> pagevec for the most cases.
> 
> > > > or unevictable pages.
> > > Again, why should it? vmscan is equipped to deal with this.
> > 
> > I don't mean vmscan, I mean khugepaged may isolate active and
> > unevictable pages since it just simply walks page table.

Why is that wrong? lru_cache_add() only complains if both flags are set,
which shouldn't happen.

> > > > However I didn't really run into these problems, just in theory
> > > > by visual
> > > > inspection.
> > > > 
> > > > And, it also seems unnecessary to have the pages add back to LRU
> > > > again since
> > > > they are about to be freed when reaching this point.  So,
> > > > clearing active
> > > > and unevictable flags, unlocking and dropping refcount from isolate
> > > > instead of calling putback_lru_page() as what page cache collapse does.
> > > Hm? But we do call putback_lru_page() on the way out. I do not follow.
> > 
> > It just calls putback_lru_page() at error path, not success path.
> > Putting pages back to lru on error path definitely makes sense. Here it
> > is the success path.

I agree that putting the page on the LRU just before freeing it is
suboptimal, but I don't see it as a critical issue.
Yang Shi March 19, 2020, 4:57 p.m. UTC | #7
On 3/19/20 3:49 AM, Kirill A. Shutemov wrote:
> On Wed, Mar 18, 2020 at 10:39:21PM -0700, Yang Shi wrote:
>>
>> On 3/18/20 5:55 PM, Yang Shi wrote:
>>>
>>> On 3/18/20 5:12 PM, Kirill A. Shutemov wrote:
>>>> On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
>>>>> When khugepaged collapses anonymous pages, the base pages would
>>>>> be freed
>>>>> via pagevec or free_page_and_swap_cache().  But, the anonymous page may
>>>>> be added back to LRU, then it might result in the below race:
>>>>>
>>>>>      CPU A                CPU B
>>>>> khugepaged:
>>>>>     unlock page
>>>>>     putback_lru_page
>>>>>       add to lru
>>>>>                  page reclaim:
>>>>>                    isolate this page
>>>>>                    try_to_unmap
>>>>>     page_remove_rmap <-- corrupt _mapcount
>>>>>
>>>>> It looks nothing would prevent the pages from isolating by reclaimer.
>>>> Hm. Why should it?
>>>>
>>>> try_to_unmap() doesn't exclude parallel page unmapping. _mapcount is
>>>> protected by ptl. And this particular _mapcount pin is reachable for
>>>> reclaim as it's not part of usual page table tree. Basically
>>>> try_to_unmap() will never succeeds until we give up the _mapcount on
>>>> khugepaged side.
>>> I don't quite get. What does "not part of usual page table tree" means?
>>>
>>> How's about try_to_unmap() acquires ptl before khugepaged?
> The page table we are dealing with was detached from the process' page
> table tree: see pmdp_collapse_flush(). try_to_unmap() will not see the
> pte.
>
> try_to_unmap() can only reach the ptl if split ptl is disabled
> (mm->page_table_lock is used), but it still will not be able to reach pte.

Aha, got it. Thanks for explaining. I definitely missed this point. Yes,
pmdp_collapse_flush() would clear the pmd, so others won't see the
page table.

However, it looks like vmscan would not stop at try_to_unmap() at all;
try_to_unmap() would just return true since pmd_present() should return
false in pvmw. Then it would go all the way down to __remove_mapping(),
but freezing the page would fail since try_to_unmap() doesn't actually
drop the refcount from the pte map.

It would not result in any critical problem AFAICT, but it is suboptimal
and may cause some unnecessary I/O due to swap.

>
>>>> I don't see the issue right away.
>>>>
>>>>> The other problem is the page's active or unevictable flag might be
>>>>> still set when freeing the page via free_page_and_swap_cache().
>>>> So what?
>>> The flags may leak to page free path then kernel may complain if
>>> DEBUG_VM is set.
> Could you elaborate on what codepath you are talking about?

__put_page ->
    __put_single_page ->
        free_unref_page ->
            free_unref_page_prepare ->
                free_pcp_prepare ->
                    free_pages_prepare ->
                        free_pages_check

This check would just be run when DEBUG_VM is enabled.
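For reference, a sketch of the flag set that check looks at (based on include/linux/page-flags.h around v5.6; the exact set may differ between versions):

	#define PAGE_FLAGS_CHECK_AT_FREE				\
		(1UL << PG_lru		| 1UL << PG_locked	|	\
		 1UL << PG_private	| 1UL << PG_private_2	|	\
		 1UL << PG_writeback	| 1UL << PG_reserved	|	\
		 1UL << PG_slab		| 1UL << PG_active	|	\
		 1UL << PG_unevictable	| __PG_MLOCKED)

PG_active and PG_unevictable are in the set, so freeing a page with either flag still set would trigger a bad-page report when DEBUG_VM is enabled.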

>
>>>>> The putback_lru_page() would not clear those two flags if the pages are
>>>>> released via pagevec, it sounds nothing prevents from isolating active
>> Sorry, this is a typo. If the page is freed via pagevec, active and
>> unevictable flag would get cleared before freeing by page_off_lru().
>>
>> But, if the page is freed by free_page_and_swap_cache(), these two flags are
>> not cleared. But, it seems this path is hit rare, the pages are freed by
>> pagevec for the most cases.
>>
>>>>> or unevictable pages.
>>>> Again, why should it? vmscan is equipped to deal with this.
>>> I don't mean vmscan, I mean khugepaged may isolate active and
>>> unevictable pages since it just simply walks page table.
> Why it is wrong? lru_cache_add() only complains if both flags set, it
> shouldn't happen.

Nothing wrong about isolating active or unevictable pages. I just mean it
seems possible the active or unevictable flag may still be set if the page
is freed via the free_page_and_swap_cache() path.

>
>>>>> However I didn't really run into these problems, just in theory
>>>>> by visual
>>>>> inspection.
>>>>>
>>>>> And, it also seems unnecessary to have the pages add back to LRU
>>>>> again since
>>>>> they are about to be freed when reaching this point.  So,
>>>>> clearing active
>>>>> and unevictable flags, unlocking and dropping refcount from isolate
>>>>> instead of calling putback_lru_page() as what page cache collapse does.
>>>> Hm? But we do call putback_lru_page() on the way out. I do not follow.
>>> It just calls putback_lru_page() at error path, not success path.
>>> Putting pages back to lru on error path definitely makes sense. Here it
>>> is the success path.
> I agree that putting the apage on LRU just before free the page is
> suboptimal, but I don't see it as a critical issue.

Yes, given the code analysis above, I agree. If you think the patch is
a fine micro-optimization, I would like to re-submit it with a rectified
commit log. Thank you for your time.

>
>
Yang Shi March 19, 2020, 5:22 p.m. UTC | #8
On 3/19/20 9:57 AM, Yang Shi wrote:
>
>
> On 3/19/20 3:49 AM, Kirill A. Shutemov wrote:
>> On Wed, Mar 18, 2020 at 10:39:21PM -0700, Yang Shi wrote:
>>>
>>> On 3/18/20 5:55 PM, Yang Shi wrote:
>>>>
>>>> On 3/18/20 5:12 PM, Kirill A. Shutemov wrote:
>>>>> On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
>>>>>> When khugepaged collapses anonymous pages, the base pages would
>>>>>> be freed
>>>>>> via pagevec or free_page_and_swap_cache().  But, the anonymous 
>>>>>> page may
>>>>>> be added back to LRU, then it might result in the below race:
>>>>>>
>>>>>>      CPU A                CPU B
>>>>>> khugepaged:
>>>>>>     unlock page
>>>>>>     putback_lru_page
>>>>>>       add to lru
>>>>>>                  page reclaim:
>>>>>>                    isolate this page
>>>>>>                    try_to_unmap
>>>>>>     page_remove_rmap <-- corrupt _mapcount
>>>>>>
>>>>>> It looks nothing would prevent the pages from isolating by 
>>>>>> reclaimer.
>>>>> Hm. Why should it?
>>>>>
>>>>> try_to_unmap() doesn't exclude parallel page unmapping. _mapcount is
>>>>> protected by ptl. And this particular _mapcount pin is reachable for
>>>>> reclaim as it's not part of usual page table tree. Basically
>>>>> try_to_unmap() will never succeeds until we give up the _mapcount on
>>>>> khugepaged side.
>>>> I don't quite get. What does "not part of usual page table tree" 
>>>> means?
>>>>
>>>> How's about try_to_unmap() acquires ptl before khugepaged?
>> The page table we are dealing with was detached from the process' page
>> table tree: see pmdp_collapse_flush(). try_to_unmap() will not see the
>> pte.
>>
>> try_to_unmap() can only reach the ptl if split ptl is disabled
>> (mm->page_table_lock is used), but it still will not be able to reach 
>> pte.
>
> Aha, got it. Thanks for explaining. I definitely missed this point. 
> Yes, pmdp_collapse_flush() would clear the pmd, then others won't see 
> the page table.
>
> However, it looks the vmscan would not stop at try_to_unmap() at all, 
> try_to_unmap() would just return true since pmd_present() should 
> return false in pvmw. Then it would go all the way down to 
> __remove_mapping(), but freezing the page would fail since 
> try_to_unmap() doesn't actually drop the refcount from the pte map.
>
> It would not result in any critical problem AFAICT, but suboptimal and 
> it may causes some unnecessary I/O due to swap.

To correct myself: it would not reach __remove_mapping(), since the
refcount check in pageout() would fail.

>
>>
>>>>> I don't see the issue right away.
>>>>>
>>>>>> The other problem is the page's active or unevictable flag might be
>>>>>> still set when freeing the page via free_page_and_swap_cache().
>>>>> So what?
>>>> The flags may leak to page free path then kernel may complain if
>>>> DEBUG_VM is set.
>> Could you elaborate on what codepath you are talking about?
>
> __put_page ->
>     __put_single_page ->
>         free_unref_page ->
>             put_unref_page_prepare ->
>                 free_pcp_prepare ->
>                     free_pages_prepare ->
>                         free_pages_check
>
> This check would just be run when DEBUG_VM is enabled.
>
>>
>>>>>> The putback_lru_page() would not clear those two flags if the 
>>>>>> pages are
>>>>>> released via pagevec, it sounds nothing prevents from isolating 
>>>>>> active
>>> Sorry, this is a typo. If the page is freed via pagevec, active and
>>> unevictable flag would get cleared before freeing by page_off_lru().
>>>
>>> But, if the page is freed by free_page_and_swap_cache(), these two 
>>> flags are
>>> not cleared. But, it seems this path is hit rare, the pages are 
>>> freed by
>>> pagevec for the most cases.
>>>
>>>>>> or unevictable pages.
>>>>> Again, why should it? vmscan is equipped to deal with this.
>>>> I don't mean vmscan, I mean khugepaged may isolate active and
>>>> unevictable pages since it just simply walks page table.
>> Why it is wrong? lru_cache_add() only complains if both flags set, it
>> shouldn't happen.
>
> Noting wrong about isolating active or unevictable pages. I just mean 
> it seems possible active or unevictable flag may be there if the page 
> is freed via free_page_add_swap_cache() path.
>
>>
>>>>>> However I didn't really run into these problems, just in theory
>>>>>> by visual
>>>>>> inspection.
>>>>>>
>>>>>> And, it also seems unnecessary to have the pages add back to LRU
>>>>>> again since
>>>>>> they are about to be freed when reaching this point. So,
>>>>>> clearing active
>>>>>> and unevictable flags, unlocking and dropping refcount from isolate
>>>>>> instead of calling putback_lru_page() as what page cache collapse 
>>>>>> does.
>>>>> Hm? But we do call putback_lru_page() on the way out. I do not 
>>>>> follow.
>>>> It just calls putback_lru_page() at error path, not success path.
>>>> Putting pages back to lru on error path definitely makes sense. 
>>>> Here it
>>>> is the success path.
>> I agree that putting the apage on LRU just before free the page is
>> suboptimal, but I don't see it as a critical issue.
>
> Yes, given the code analysis above, I agree. If you thought the patch 
> is a fine micro-optimization, I would like to re-submit it with 
> rectified commit log. Thank you for your time.
>
>>
>>
>
Kirill A. Shutemov March 20, 2020, 11:45 a.m. UTC | #9
On Thu, Mar 19, 2020 at 09:57:47AM -0700, Yang Shi wrote:
> 
> 
> On 3/19/20 3:49 AM, Kirill A. Shutemov wrote:
> > On Wed, Mar 18, 2020 at 10:39:21PM -0700, Yang Shi wrote:
> > > 
> > > On 3/18/20 5:55 PM, Yang Shi wrote:
> > > > 
> > > > On 3/18/20 5:12 PM, Kirill A. Shutemov wrote:
> > > > > On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
> > > > > > When khugepaged collapses anonymous pages, the base pages would
> > > > > > be freed
> > > > > > via pagevec or free_page_and_swap_cache().  But, the anonymous page may
> > > > > > be added back to LRU, then it might result in the below race:
> > > > > > 
> > > > > >      CPU A                CPU B
> > > > > > khugepaged:
> > > > > >     unlock page
> > > > > >     putback_lru_page
> > > > > >       add to lru
> > > > > >                  page reclaim:
> > > > > >                    isolate this page
> > > > > >                    try_to_unmap
> > > > > >     page_remove_rmap <-- corrupt _mapcount
> > > > > > 
> > > > > > It looks nothing would prevent the pages from isolating by reclaimer.
> > > > > Hm. Why should it?
> > > > > 
> > > > > try_to_unmap() doesn't exclude parallel page unmapping. _mapcount is
> > > > > protected by ptl. And this particular _mapcount pin is reachable for
> > > > > reclaim as it's not part of usual page table tree. Basically
> > > > > try_to_unmap() will never succeeds until we give up the _mapcount on
> > > > > khugepaged side.
> > > > I don't quite get. What does "not part of usual page table tree" means?
> > > > 
> > > > How's about try_to_unmap() acquires ptl before khugepaged?
> > The page table we are dealing with was detached from the process' page
> > table tree: see pmdp_collapse_flush(). try_to_unmap() will not see the
> > pte.
> > 
> > try_to_unmap() can only reach the ptl if split ptl is disabled
> > (mm->page_table_lock is used), but it still will not be able to reach pte.
> 
> Aha, got it. Thanks for explaining. I definitely missed this point. Yes,
> pmdp_collapse_flush() would clear the pmd, then others won't see the page
> table.
> 
> However, it looks the vmscan would not stop at try_to_unmap() at all,
> try_to_unmap() would just return true since pmd_present() should return
> false in pvmw. Then it would go all the way down to __remove_mapping(), but
> freezing the page would fail since try_to_unmap() doesn't actually drop the
> refcount from the pte map.

No. try_to_unmap() checks mapcount at the end and only returns true if
it's zero.
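For reference, the tail of try_to_unmap() (mm/rmap.c around this kernel version; sketch, walk setup elided):

	bool try_to_unmap(struct page *page, enum ttu_flags flags)
	{
		...
		rmap_walk(page, &rwc);

		return !page_mapcount(page) ? true : false;
	}

So while khugepaged holds the _mapcount pin via the detached page table, try_to_unmap() reports failure.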

> It would not result in any critical problem AFAICT, but suboptimal and it
> may causes some unnecessary I/O due to swap.
> 
> > 
> > > > > I don't see the issue right away.
> > > > > 
> > > > > > The other problem is the page's active or unevictable flag might be
> > > > > > still set when freeing the page via free_page_and_swap_cache().
> > > > > So what?
> > > > The flags may leak to page free path then kernel may complain if
> > > > DEBUG_VM is set.
> > Could you elaborate on what codepath you are talking about?
> 
> __put_page ->
>     __put_single_page ->
>         free_unref_page ->
>             put_unref_page_prepare ->
>                 free_pcp_prepare ->
>                     free_pages_prepare ->
>                         free_pages_check
> 
> This check would just be run when DEBUG_VM is enabled.

I'm not 100% sure, but I believe these flags will get cleared on adding
onto the lru:

  release_pte_page()
    putback_lru_page()
      lru_cache_add()
        __lru_cache_add()
          __pagevec_lru_add()
            __pagevec_lru_add_fn()
Yang Shi March 20, 2020, 4:34 p.m. UTC | #10
On 3/20/20 4:45 AM, Kirill A. Shutemov wrote:
> On Thu, Mar 19, 2020 at 09:57:47AM -0700, Yang Shi wrote:
>>
>> On 3/19/20 3:49 AM, Kirill A. Shutemov wrote:
>>> On Wed, Mar 18, 2020 at 10:39:21PM -0700, Yang Shi wrote:
>>>> On 3/18/20 5:55 PM, Yang Shi wrote:
>>>>> On 3/18/20 5:12 PM, Kirill A. Shutemov wrote:
>>>>>> On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
>>>>>>> When khugepaged collapses anonymous pages, the base pages would
>>>>>>> be freed
>>>>>>> via pagevec or free_page_and_swap_cache().  But, the anonymous page may
>>>>>>> be added back to LRU, then it might result in the below race:
>>>>>>>
>>>>>>>       CPU A                CPU B
>>>>>>> khugepaged:
>>>>>>>      unlock page
>>>>>>>      putback_lru_page
>>>>>>>        add to lru
>>>>>>>                   page reclaim:
>>>>>>>                     isolate this page
>>>>>>>                     try_to_unmap
>>>>>>>      page_remove_rmap <-- corrupt _mapcount
>>>>>>>
>>>>>>> It looks nothing would prevent the pages from isolating by reclaimer.
>>>>>> Hm. Why should it?
>>>>>>
>>>>>> try_to_unmap() doesn't exclude parallel page unmapping. _mapcount is
>>>>>> protected by ptl. And this particular _mapcount pin is reachable for
>>>>>> reclaim as it's not part of usual page table tree. Basically
>>>>>> try_to_unmap() will never succeeds until we give up the _mapcount on
>>>>>> khugepaged side.
>>>>> I don't quite get. What does "not part of usual page table tree" means?
>>>>>
>>>>> How's about try_to_unmap() acquires ptl before khugepaged?
>>> The page table we are dealing with was detached from the process' page
>>> table tree: see pmdp_collapse_flush(). try_to_unmap() will not see the
>>> pte.
>>>
>>> try_to_unmap() can only reach the ptl if split ptl is disabled
>>> (mm->page_table_lock is used), but it still will not be able to reach pte.
>> Aha, got it. Thanks for explaining. I definitely missed this point. Yes,
>> pmdp_collapse_flush() would clear the pmd, then others won't see the page
>> table.
>>
>> However, it looks the vmscan would not stop at try_to_unmap() at all,
>> try_to_unmap() would just return true since pmd_present() should return
>> false in pvmw. Then it would go all the way down to __remove_mapping(), but
>> freezing the page would fail since try_to_unmap() doesn't actually drop the
>> refcount from the pte map.
> No. try_to_unmap() checks mapcount at the end and only returns true if
> it's zero.

Aha, yes, you are right. It does check mapcount. It would not go that far.

>
>> It would not result in any critical problem AFAICT, but suboptimal and it
>> may causes some unnecessary I/O due to swap.
>>
>>>>>> I don't see the issue right away.
>>>>>>
>>>>>>> The other problem is the page's active or unevictable flag might be
>>>>>>> still set when freeing the page via free_page_and_swap_cache().
>>>>>> So what?
>>>>> The flags may leak to page free path then kernel may complain if
>>>>> DEBUG_VM is set.
>>> Could you elaborate on what codepath you are talking about?
>> __put_page ->
>>      __put_single_page ->
>>          free_unref_page ->
>>              put_unref_page_prepare ->
>>                  free_pcp_prepare ->
>>                      free_pages_prepare ->
>>                          free_pages_check
>>
>> This check would just be run when DEBUG_VM is enabled.
> I'm not 100% sure, but I belive these flags will ge cleared on adding into
> lru:
>
>    release_pte_page()
>      putback_lru_page()
>        lru_cache_add()
>         __lru_cache_add()
>           __pagevec_lru_add()
> 	   __pagevec_lru_add_fn()
> 	     __pagevec_lru_add_fn()

No, adding onto the lru would not clear the flags. But I finally found
where they get cleared. They get cleared by:

__put_single_page() ->
    __page_cache_release() ->
        if page is on the lru:
            del_page_from_lru_list(page, lruvec, page_off_lru(page));

page_off_lru() would clear the active and unevictable flags.
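For reference, page_off_lru() (include/linux/mm_inline.h around this kernel version; sketch):

	static __always_inline enum lru_list page_off_lru(struct page *page)
	{
		enum lru_list lru;

		if (PageUnevictable(page)) {
			__ClearPageUnevictable(page);
			lru = LRU_UNEVICTABLE;
		} else {
			lru = page_lru_base_type(page);
			if (PageActive(page)) {
				__ClearPageActive(page);
				lru += LRU_ACTIVE;
			}
		}
		return lru;
	}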

The name (__page_cache_release) is a little bit confusing; I misunderstood
it as doing something only for the page cache.

So it looks like the code does depend on putting the page back on the LRU
to release it. Nothing is wrong, it is just a little bit unproductive IMHO.
Sorry for the rash patch. And thank you for your time again.

>
Yang Shi March 24, 2020, 5:17 p.m. UTC | #11
On 3/19/20 3:49 AM, Kirill A. Shutemov wrote:
> On Wed, Mar 18, 2020 at 10:39:21PM -0700, Yang Shi wrote:
>>
>> On 3/18/20 5:55 PM, Yang Shi wrote:
>>>
>>> On 3/18/20 5:12 PM, Kirill A. Shutemov wrote:
>>>> On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
>>>>> When khugepaged collapses anonymous pages, the base pages would
>>>>> be freed
>>>>> via pagevec or free_page_and_swap_cache().  But, the anonymous page may
>>>>> be added back to LRU, then it might result in the below race:
>>>>>
>>>>>      CPU A                CPU B
>>>>> khugepaged:
>>>>>     unlock page
>>>>>     putback_lru_page
>>>>>       add to lru
>>>>>                  page reclaim:
>>>>>                    isolate this page
>>>>>                    try_to_unmap
>>>>>     page_remove_rmap <-- corrupt _mapcount
>>>>>
>>>>> It looks nothing would prevent the pages from isolating by reclaimer.
>>>> Hm. Why should it?
>>>>
>>>> try_to_unmap() doesn't exclude parallel page unmapping. _mapcount is
>>>> protected by ptl. And this particular _mapcount pin is reachable for
>>>> reclaim as it's not part of usual page table tree. Basically
>>>> try_to_unmap() will never succeeds until we give up the _mapcount on
>>>> khugepaged side.
>>> I don't quite get. What does "not part of usual page table tree" means?
>>>
>>> How's about try_to_unmap() acquires ptl before khugepaged?
> The page table we are dealing with was detached from the process' page
> table tree: see pmdp_collapse_flush(). try_to_unmap() will not see the
> pte.

A follow-up question here. pmdp_collapse_flush() clears the pmd entry and
does a TLB shootdown on x86. I suppose the main purpose is to serialize
against fast gup, since fast gup doesn't acquire any lock (mmap_sem, ptl,
etc.) but disables interrupts, so the TLB shootdown IPI would get blocked.
This guarantees synchronization on x86, but it looks like not all
architectures do the TLB shootdown or implement it via IPI, so how do they
serialize with fast gup?

In addition, it looks like acquiring the pmd lock is not necessary.
Beforehand, both the write mmap_sem and the write anon_vma lock are
acquired, which serializes page faults and rmap walks, so it looks like
fast gup is the only thing that could run concurrently, but fast gup
doesn't acquire the ptl at all. It seems the pmd_lock/unlock could be
removed.

>
> try_to_unmap() can only reach the ptl if split ptl is disabled
> (mm->page_table_lock is used), but it still will not be able to reach pte.
>
>>>> I don't see the issue right away.
>>>>
>>>>> The other problem is the page's active or unevictable flag might be
>>>>> still set when freeing the page via free_page_and_swap_cache().
>>>> So what?
>>> The flags may leak to page free path then kernel may complain if
>>> DEBUG_VM is set.
> Could you elaborate on what codepath you are talking about?
>
>>>>> The putback_lru_page() would not clear those two flags if the pages are
>>>>> released via pagevec, it sounds nothing prevents from isolating active
>> Sorry, this is a typo. If the page is freed via pagevec, active and
>> unevictable flag would get cleared before freeing by page_off_lru().
>>
>> But, if the page is freed by free_page_and_swap_cache(), these two flags are
>> not cleared. But, it seems this path is hit rare, the pages are freed by
>> pagevec for the most cases.
>>
>>>>> or unevictable pages.
>>>> Again, why should it? vmscan is equipped to deal with this.
>>> I don't mean vmscan, I mean khugepaged may isolate active and
>>> unevictable pages since it just simply walks page table.
> Why it is wrong? lru_cache_add() only complains if both flags set, it
> shouldn't happen.
>
>>>>> However I didn't really run into these problems, just in theory
>>>>> by visual
>>>>> inspection.
>>>>>
>>>>> And, it also seems unnecessary to have the pages add back to LRU
>>>>> again since
>>>>> they are about to be freed when reaching this point.  So,
>>>>> clearing active
>>>>> and unevictable flags, unlocking and dropping refcount from isolate
>>>>> instead of calling putback_lru_page() as what page cache collapse does.
>>>> Hm? But we do call putback_lru_page() on the way out. I do not follow.
>>> It just calls putback_lru_page() at error path, not success path.
>>> Putting pages back to lru on error path definitely makes sense. Here it
>>> is the success path.
> I agree that putting the apage on LRU just before free the page is
> suboptimal, but I don't see it as a critical issue.
>
>
Kirill A. Shutemov March 25, 2020, 11:26 a.m. UTC | #12
On Tue, Mar 24, 2020 at 10:17:13AM -0700, Yang Shi wrote:
> 
> 
> On 3/19/20 3:49 AM, Kirill A. Shutemov wrote:
> > On Wed, Mar 18, 2020 at 10:39:21PM -0700, Yang Shi wrote:
> > > 
> > > On 3/18/20 5:55 PM, Yang Shi wrote:
> > > > 
> > > > On 3/18/20 5:12 PM, Kirill A. Shutemov wrote:
> > > > > On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
> > > > > > When khugepaged collapses anonymous pages, the base pages would
> > > > > > be freed
> > > > > > via pagevec or free_page_and_swap_cache().  But, the anonymous page may
> > > > > > be added back to LRU, then it might result in the below race:
> > > > > > 
> > > > > >      CPU A                CPU B
> > > > > > khugepaged:
> > > > > >     unlock page
> > > > > >     putback_lru_page
> > > > > >       add to lru
> > > > > >                  page reclaim:
> > > > > >                    isolate this page
> > > > > >                    try_to_unmap
> > > > > >     page_remove_rmap <-- corrupt _mapcount
> > > > > > 
> > > > > > It looks nothing would prevent the pages from isolating by reclaimer.
> > > > > Hm. Why should it?
> > > > > 
> > > > > try_to_unmap() doesn't exclude parallel page unmapping. _mapcount is
> > > > > protected by ptl. And this particular _mapcount pin is reachable for
> > > > > reclaim as it's not part of usual page table tree. Basically
> > > > > try_to_unmap() will never succeeds until we give up the _mapcount on
> > > > > khugepaged side.
> > > > I don't quite get. What does "not part of usual page table tree" means?
> > > > 
> > > > How's about try_to_unmap() acquires ptl before khugepaged?
> > The page table we are dealing with was detached from the process' page
> > table tree: see pmdp_collapse_flush(). try_to_unmap() will not see the
> > pte.
> 
> A follow-up question here. pmdp_collapse_flush() clears pmd entry and does
> TLB shootdown on x86. I'm supposed the main purpose is to serialize fast gup
> since it doesn't acquire any lock (mmap_sem, ptl ,etc), but disable
> interrupt so the TLB shootdown IPI would get blocked. This could guarantee
> synchronization on x86, but it looks not all architectures do TLB shootdown
> or implement it via IPI, so how they could serialize with fast gup?

The main purpose of pmdp_collapse_flush() is to block access to pages
under collapse, including access via GUP (and its variants).

It's up to the architecture to implement it correctly, including TLB flush
vs. GUP_fast serialization. The generic way works fine for most
architectures. Notable exceptions are Power and S390.
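For reference, the generic implementation (mm/pgtable-generic.c around this kernel version; sketch -- Power and S390 provide their own):

	pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
				  pmd_t *pmdp)
	{
		pmd_t pmd;

		VM_BUG_ON(address & ~HPAGE_PMD_MASK);
		VM_BUG_ON(pmd_trans_huge(*pmdp));
		pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);

		/* collapse entails shooting down ptes not pmd */
		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
		return pmd;
	}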

> In addition it looks acquiring pmd lock is not necessary. Before both write
> mmap_sem and write anon_vma lock are acquired which could serialize page
> fault and rmap walk, so it looks fast gup is the only one which could run
> concurrently, but fast gup doesn't acquire ptl at all. It seems the
> pmd_lock/unlock could be removed.

This is likely true. And we have a comment there. But taking an uncontended
lock is cheap, so why not.
Yang Shi March 25, 2020, 6:42 p.m. UTC | #13
On 3/25/20 4:26 AM, Kirill A. Shutemov wrote:
> On Tue, Mar 24, 2020 at 10:17:13AM -0700, Yang Shi wrote:
>>
>> On 3/19/20 3:49 AM, Kirill A. Shutemov wrote:
>>> On Wed, Mar 18, 2020 at 10:39:21PM -0700, Yang Shi wrote:
>>>> On 3/18/20 5:55 PM, Yang Shi wrote:
>>>>> On 3/18/20 5:12 PM, Kirill A. Shutemov wrote:
>>>>>> On Thu, Mar 19, 2020 at 07:19:42AM +0800, Yang Shi wrote:
>>>>>>> When khugepaged collapses anonymous pages, the base pages would
>>>>>>> be freed
>>>>>>> via pagevec or free_page_and_swap_cache().  But, the anonymous page may
>>>>>>> be added back to LRU, then it might result in the below race:
>>>>>>>
>>>>>>>       CPU A                CPU B
>>>>>>> khugepaged:
>>>>>>>      unlock page
>>>>>>>      putback_lru_page
>>>>>>>        add to lru
>>>>>>>                   page reclaim:
>>>>>>>                     isolate this page
>>>>>>>                     try_to_unmap
>>>>>>>      page_remove_rmap <-- corrupt _mapcount
>>>>>>>
>>>>>>> It looks nothing would prevent the pages from isolating by reclaimer.
>>>>>> Hm. Why should it?
>>>>>>
>>>>>> try_to_unmap() doesn't exclude parallel page unmapping. _mapcount is
>>>>>> protected by ptl. And this particular _mapcount pin is reachable for
>>>>>> reclaim as it's not part of usual page table tree. Basically
>>>>>> try_to_unmap() will never succeeds until we give up the _mapcount on
>>>>>> khugepaged side.
>>>>> I don't quite get. What does "not part of usual page table tree" means?
>>>>>
>>>>> How's about try_to_unmap() acquires ptl before khugepaged?
>>> The page table we are dealing with was detached from the process' page
>>> table tree: see pmdp_collapse_flush(). try_to_unmap() will not see the
>>> pte.
>> A follow-up question here. pmdp_collapse_flush() clears pmd entry and does
>> TLB shootdown on x86. I'm supposed the main purpose is to serialize fast gup
>> since it doesn't acquire any lock (mmap_sem, ptl ,etc), but disable
>> interrupt so the TLB shootdown IPI would get blocked. This could guarantee
>> synchronization on x86, but it looks not all architectures do TLB shootdown
>> or implement it via IPI, so how they could serialize with fast gup?
> The main purpose of pmdp_collapse_flush() is to block access to pages
> under collapse, including access via GUP (and its variants).
>
> It's up to architecture to implement it correctly, including TLB flush vs.
> GUP_fast serialization. Genetic way works fine for most architectures.
> Notable exceptions are Power and S390.

Thanks. I was wondering how Power and S390 serialize it. It looks like they
didn't deal with it at all.

>> In addition it looks acquiring pmd lock is not necessary. Before both write
>> mmap_sem and write anon_vma lock are acquired which could serialize page
>> fault and rmap walk, so it looks fast gup is the only one which could run
>> concurrently, but fast gup doesn't acquire ptl at all. It seems the
>> pmd_lock/unlock could be removed.
> This is likely true. And we have a comment there. But taking uncontended
> lock is check, so why not.

Yes, it sounds harmless.

>

Patch

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b679908..f42fa4e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -673,7 +673,6 @@  static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			src_page = pte_page(pteval);
 			copy_user_highpage(page, src_page, address, vma);
 			VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
-			release_pte_page(src_page);
 			/*
 			 * ptl mostly unnecessary, but preempt has to
 			 * be disabled to update the per-cpu stats
@@ -687,6 +686,15 @@  static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			pte_clear(vma->vm_mm, address, _pte);
 			page_remove_rmap(src_page, false);
 			spin_unlock(ptl);
+
+			dec_node_page_state(src_page,
+				NR_ISOLATED_ANON + page_is_file_cache(src_page));
+			ClearPageActive(src_page);
+			ClearPageUnevictable(src_page);
+			unlock_page(src_page);
+			/* Drop refcount from isolate */
+			put_page(src_page);
+
 			free_page_and_swap_cache(src_page);
 		}
 	}