diff mbox series

[RFC,linux-next] mm: khugepaged: remove error message when checking external pins

Message ID 1589317383-9595-1-git-send-email-yang.shi@linux.alibaba.com (mailing list archive)
State New, archived
Headers show
Series [RFC,linux-next] mm: khugepaged: remove error message when checking external pins | expand

Commit Message

Yang Shi May 12, 2020, 9:03 p.m. UTC
When running khugepaged with higher frequency (for example, set
scan_sleep_millisecs to 0), the below error message was reported:

khugepaged: expected_refcount (1024) > refcount (512)
page:ffffd75784258000 count:511 mapcount:1 mapping:ffff968de06c7421 index:0x7fa288600
 compound_mapcount: 0
 flags: 0x17fffc00009003c(uptodate|dirty|lru|active|head|swapbacked)
 raw: 017fffc00009003c ffffd7578ba70788 ffffd7578bdb5148 ffff968de06c7421
 raw: 00000007fa288600 0000000000000000 000001ff00000000 ffff968e5e7d6000
 page dumped because: Unexpected refcount
 page->mem_cgroup:ffff968e5e7d6000

This is introduced by allowing collapsing fork shared and PTE-mapped
THPs.  The check may run into the below race:

Assuming parent process forked child process, then they do

	CPU A		CPU B			CPU C
	-----		-----			-----
Parent			Child			khugepaged

MADV_DONTNEED
  split huge pmd
  Double mapped
			MADV_DONTNEED
			  zap_huge_pmd
			    remove_page_rmap
			      Clear double map
						khugepaged_scan_pmd(parent)
						  check mapcount and refcount
						  --> total_mapcount > refcount
			      dec mapcount

The issue can be reproduced by the below test program.

---8<---
void main()
{
        void *addr;
        int ret;
        pid_t pid;

        addr = memalign(ALIGN, 2 * 1024 * 1024);
        if (!addr) {
                printf("malloc failed\n");
                return;
        }

        ret = madvise(addr, 2 * 1024 * 1024, MADV_HUGEPAGE);
        if (ret < 0) {
                printf("madvise failed\n");
                return;
        }

        memset(addr, 0xdeadbeef, 2 * 1024 * 1024);

        pid = fork();

        if (pid == 0) {
                /* Child process */
                ret = madvise(addr + (2 * 1024 * 1024) - 4096, 4096, MADV_DONTNEED);
                if (ret < 0) {
                        printf("madvise failed in child\n");
                        return;
                }
                sleep(120);
        } else if (pid > 0) {
                sleep(5);
                /* Parent process */
                ret = madvise(addr, 2 * 1024 * 1024, MADV_DONTNEED);
                if (ret < 0) {
                        printf("madvise failed in parent\n");
                        return;
                }
        } else {
                printf("fork failed\n");
                return;
        }

        sleep(120);
}
---8<---

So, total_mapcount > refcount seems not unexpected due to the inherent
race.  Removed the error message even though it is protected by
CONFIG_VM_DEBUG since we have to live with the race and AFAIK some
distros may have CONFIG_VM_DEBUG enabled dy default.

Since such case is ephemeral we could always try collapse the area again
later, so it sounds not harmful.  But, it might report false positive if
the page has excessive GUP pins (i.e. 512), however it might be not that
bad since the same check will be done later.  I didn't figure out a
simple way to prevent the false positive.

Added some notes to elaborate the race and the consequence as well.

Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/khugepaged.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

Comments

Kirill A . Shutemov May 18, 2020, 10:19 a.m. UTC | #1
On Wed, May 13, 2020 at 05:03:03AM +0800, Yang Shi wrote:
> When running khugepaged with higher frequency (for example, set
> scan_sleep_millisecs to 0), the below error message was reported:
> 
> khugepaged: expected_refcount (1024) > refcount (512)
> page:ffffd75784258000 count:511 mapcount:1 mapping:ffff968de06c7421 index:0x7fa288600
>  compound_mapcount: 0
>  flags: 0x17fffc00009003c(uptodate|dirty|lru|active|head|swapbacked)
>  raw: 017fffc00009003c ffffd7578ba70788 ffffd7578bdb5148 ffff968de06c7421
>  raw: 00000007fa288600 0000000000000000 000001ff00000000 ffff968e5e7d6000
>  page dumped because: Unexpected refcount
>  page->mem_cgroup:ffff968e5e7d6000
> 
> This is introduced by allowing collapsing fork shared and PTE-mapped
> THPs.  The check may run into the below race:
> 
> Assuming parent process forked child process, then they do
> 
> 	CPU A		CPU B			CPU C
> 	-----		-----			-----
> Parent			Child			khugepaged
> 
> MADV_DONTNEED
>   split huge pmd
>   Double mapped
> 			MADV_DONTNEED
> 			  zap_huge_pmd
> 			    remove_page_rmap
> 			      Clear double map
> 						khugepaged_scan_pmd(parent)
> 						  check mapcount and refcount
> 						  --> total_mapcount > refcount
> 			      dec mapcount
> 
> The issue can be reproduced by the below test program.

Good catch! Thanks. And the fix looks reasnable.

We might want to have a similar debug check in near !is_refcount_suitable()
case in __collapse_huge_page_isolate(). The function is called with
anon_vma lock taken on write and it should prevent the false-positive.

Anyway:

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

> 
> ---8<---
> void main()
> {
>         void *addr;
>         int ret;
>         pid_t pid;
> 
>         addr = memalign(ALIGN, 2 * 1024 * 1024);
>         if (!addr) {
>                 printf("malloc failed\n");
>                 return;
>         }
> 
>         ret = madvise(addr, 2 * 1024 * 1024, MADV_HUGEPAGE);
>         if (ret < 0) {
>                 printf("madvise failed\n");
>                 return;
>         }
> 
>         memset(addr, 0xdeadbeef, 2 * 1024 * 1024);
> 
>         pid = fork();
> 
>         if (pid == 0) {
>                 /* Child process */
>                 ret = madvise(addr + (2 * 1024 * 1024) - 4096, 4096, MADV_DONTNEED);
>                 if (ret < 0) {
>                         printf("madvise failed in child\n");
>                         return;
>                 }
>                 sleep(120);
>         } else if (pid > 0) {
>                 sleep(5);
>                 /* Parent process */
>                 ret = madvise(addr, 2 * 1024 * 1024, MADV_DONTNEED);
>                 if (ret < 0) {
>                         printf("madvise failed in parent\n");
>                         return;
>                 }
>         } else {
>                 printf("fork failed\n");
>                 return;
>         }
> 
>         sleep(120);
> }
> ---8<---
> 
> So, total_mapcount > refcount seems not unexpected due to the inherent
> race.  Removed the error message even though it is protected by
> CONFIG_VM_DEBUG since we have to live with the race and AFAIK some
> distros may have CONFIG_VM_DEBUG enabled dy default.
> 
> Since such case is ephemeral we could always try collapse the area again
> later, so it sounds not harmful.  But, it might report false positive if
> the page has excessive GUP pins (i.e. 512), however it might be not that
> bad since the same check will be done later.  I didn't figure out a
> simple way to prevent the false positive.
> 
> Added some notes to elaborate the race and the consequence as well.
> 
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> ---
>  mm/khugepaged.c | 24 +++++++++++++++++-------
>  1 file changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 1fdd677..048f5d4 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -602,12 +602,6 @@ static bool is_refcount_suitable(struct page *page)
>  	if (PageSwapCache(page))
>  		expected_refcount += compound_nr(page);
>  
> -	if (IS_ENABLED(CONFIG_DEBUG_VM) && expected_refcount > refcount) {
> -		pr_err("expected_refcount (%d) > refcount (%d)\n",
> -				expected_refcount, refcount);
> -		dump_page(page, "Unexpected refcount");
> -	}
> -
>  	return page_count(page) == expected_refcount;
>  }
>  
> @@ -1341,7 +1335,23 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  			goto out_unmap;
>  		}
>  
> -		/* Check if the page has any GUP (or other external) pins */
> +		/*
> +		 * Check if the page has any GUP (or other external) pins.
> +		 *
> +		 * Here the check is racy it may see totmal_mapcount > refcount
> +		 * in some cases.
> +		 * For example, one process with one forked child process.
> +		 * The parent has the PMD split due to MADV_DONTNEED, then
> +		 * the child is trying unmap the whole PMD, but khugepaged
> +		 * may be scanning the parent between the child has
> +		 * PageDoubleMap flag cleared and dec the mapcount.  So
> +		 * khugepaged may see total_mapcount > refcount.
> +		 *
> +		 * But such case is ephemeral we could always retry collapse
> +		 * later.  However it may report false positive if the page
> +		 * has excessive GUP pins (i.e. 512).  Anyway the same check
> +		 * will be done again later the risk seems low.
> +		 */
>  		if (!is_refcount_suitable(page)) {
>  			result = SCAN_PAGE_COUNT;
>  			goto out_unmap;
> -- 
> 1.8.3.1
> 
>
Yang Shi May 19, 2020, 2:04 a.m. UTC | #2
On 5/18/20 3:19 AM, Kirill A. Shutemov wrote:
> On Wed, May 13, 2020 at 05:03:03AM +0800, Yang Shi wrote:
>> When running khugepaged with higher frequency (for example, set
>> scan_sleep_millisecs to 0), the below error message was reported:
>>
>> khugepaged: expected_refcount (1024) > refcount (512)
>> page:ffffd75784258000 count:511 mapcount:1 mapping:ffff968de06c7421 index:0x7fa288600
>>   compound_mapcount: 0
>>   flags: 0x17fffc00009003c(uptodate|dirty|lru|active|head|swapbacked)
>>   raw: 017fffc00009003c ffffd7578ba70788 ffffd7578bdb5148 ffff968de06c7421
>>   raw: 00000007fa288600 0000000000000000 000001ff00000000 ffff968e5e7d6000
>>   page dumped because: Unexpected refcount
>>   page->mem_cgroup:ffff968e5e7d6000
>>
>> This is introduced by allowing collapsing fork shared and PTE-mapped
>> THPs.  The check may run into the below race:
>>
>> Assuming parent process forked child process, then they do
>>
>> 	CPU A		CPU B			CPU C
>> 	-----		-----			-----
>> Parent			Child			khugepaged
>>
>> MADV_DONTNEED
>>    split huge pmd
>>    Double mapped
>> 			MADV_DONTNEED
>> 			  zap_huge_pmd
>> 			    remove_page_rmap
>> 			      Clear double map
>> 						khugepaged_scan_pmd(parent)
>> 						  check mapcount and refcount
>> 						  --> total_mapcount > refcount
>> 			      dec mapcount
>>
>> The issue can be reproduced by the below test program.
> Good catch! Thanks. And the fix looks reasnable.
>
> We might want to have a similar debug check in near !is_refcount_suitable()
> case in __collapse_huge_page_isolate(). The function is called with
> anon_vma lock taken on write and it should prevent the false-positive.

However it seems MADV_DONTNEED path doesn't take anon_vma lock.

>
> Anyway:
>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Thanks.

>
>> ---8<---
>> void main()
>> {
>>          void *addr;
>>          int ret;
>>          pid_t pid;
>>
>>          addr = memalign(ALIGN, 2 * 1024 * 1024);
>>          if (!addr) {
>>                  printf("malloc failed\n");
>>                  return;
>>          }
>>
>>          ret = madvise(addr, 2 * 1024 * 1024, MADV_HUGEPAGE);
>>          if (ret < 0) {
>>                  printf("madvise failed\n");
>>                  return;
>>          }
>>
>>          memset(addr, 0xdeadbeef, 2 * 1024 * 1024);
>>
>>          pid = fork();
>>
>>          if (pid == 0) {
>>                  /* Child process */
>>                  ret = madvise(addr + (2 * 1024 * 1024) - 4096, 4096, MADV_DONTNEED);
>>                  if (ret < 0) {
>>                          printf("madvise failed in child\n");
>>                          return;
>>                  }
>>                  sleep(120);
>>          } else if (pid > 0) {
>>                  sleep(5);
>>                  /* Parent process */
>>                  ret = madvise(addr, 2 * 1024 * 1024, MADV_DONTNEED);
>>                  if (ret < 0) {
>>                          printf("madvise failed in parent\n");
>>                          return;
>>                  }
>>          } else {
>>                  printf("fork failed\n");
>>                  return;
>>          }
>>
>>          sleep(120);
>> }
>> ---8<---
>>
>> So, total_mapcount > refcount seems not unexpected due to the inherent
>> race.  Removed the error message even though it is protected by
>> CONFIG_VM_DEBUG since we have to live with the race and AFAIK some
>> distros may have CONFIG_VM_DEBUG enabled dy default.
>>
>> Since such case is ephemeral we could always try collapse the area again
>> later, so it sounds not harmful.  But, it might report false positive if
>> the page has excessive GUP pins (i.e. 512), however it might be not that
>> bad since the same check will be done later.  I didn't figure out a
>> simple way to prevent the false positive.
>>
>> Added some notes to elaborate the race and the consequence as well.
>>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>> ---
>>   mm/khugepaged.c | 24 +++++++++++++++++-------
>>   1 file changed, 17 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 1fdd677..048f5d4 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -602,12 +602,6 @@ static bool is_refcount_suitable(struct page *page)
>>   	if (PageSwapCache(page))
>>   		expected_refcount += compound_nr(page);
>>   
>> -	if (IS_ENABLED(CONFIG_DEBUG_VM) && expected_refcount > refcount) {
>> -		pr_err("expected_refcount (%d) > refcount (%d)\n",
>> -				expected_refcount, refcount);
>> -		dump_page(page, "Unexpected refcount");
>> -	}
>> -
>>   	return page_count(page) == expected_refcount;
>>   }
>>   
>> @@ -1341,7 +1335,23 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>   			goto out_unmap;
>>   		}
>>   
>> -		/* Check if the page has any GUP (or other external) pins */
>> +		/*
>> +		 * Check if the page has any GUP (or other external) pins.
>> +		 *
>> +		 * Here the check is racy it may see totmal_mapcount > refcount
>> +		 * in some cases.
>> +		 * For example, one process with one forked child process.
>> +		 * The parent has the PMD split due to MADV_DONTNEED, then
>> +		 * the child is trying unmap the whole PMD, but khugepaged
>> +		 * may be scanning the parent between the child has
>> +		 * PageDoubleMap flag cleared and dec the mapcount.  So
>> +		 * khugepaged may see total_mapcount > refcount.
>> +		 *
>> +		 * But such case is ephemeral we could always retry collapse
>> +		 * later.  However it may report false positive if the page
>> +		 * has excessive GUP pins (i.e. 512).  Anyway the same check
>> +		 * will be done again later the risk seems low.
>> +		 */
>>   		if (!is_refcount_suitable(page)) {
>>   			result = SCAN_PAGE_COUNT;
>>   			goto out_unmap;
>> -- 
>> 1.8.3.1
>>
>>
Kirill A . Shutemov May 19, 2020, 2:17 p.m. UTC | #3
On Mon, May 18, 2020 at 07:04:40PM -0700, Yang Shi wrote:
> 
> 
> On 5/18/20 3:19 AM, Kirill A. Shutemov wrote:
> > On Wed, May 13, 2020 at 05:03:03AM +0800, Yang Shi wrote:
> > > When running khugepaged with higher frequency (for example, set
> > > scan_sleep_millisecs to 0), the below error message was reported:
> > > 
> > > khugepaged: expected_refcount (1024) > refcount (512)
> > > page:ffffd75784258000 count:511 mapcount:1 mapping:ffff968de06c7421 index:0x7fa288600
> > >   compound_mapcount: 0
> > >   flags: 0x17fffc00009003c(uptodate|dirty|lru|active|head|swapbacked)
> > >   raw: 017fffc00009003c ffffd7578ba70788 ffffd7578bdb5148 ffff968de06c7421
> > >   raw: 00000007fa288600 0000000000000000 000001ff00000000 ffff968e5e7d6000
> > >   page dumped because: Unexpected refcount
> > >   page->mem_cgroup:ffff968e5e7d6000
> > > 
> > > This is introduced by allowing collapsing fork shared and PTE-mapped
> > > THPs.  The check may run into the below race:
> > > 
> > > Assuming parent process forked child process, then they do
> > > 
> > > 	CPU A		CPU B			CPU C
> > > 	-----		-----			-----
> > > Parent			Child			khugepaged
> > > 
> > > MADV_DONTNEED
> > >    split huge pmd
> > >    Double mapped
> > > 			MADV_DONTNEED
> > > 			  zap_huge_pmd
> > > 			    remove_page_rmap
> > > 			      Clear double map
> > > 						khugepaged_scan_pmd(parent)
> > > 						  check mapcount and refcount
> > > 						  --> total_mapcount > refcount
> > > 			      dec mapcount
> > > 
> > > The issue can be reproduced by the below test program.
> > Good catch! Thanks. And the fix looks reasnable.
> > 
> > We might want to have a similar debug check in near !is_refcount_suitable()
> > case in __collapse_huge_page_isolate(). The function is called with
> > anon_vma lock taken on write and it should prevent the false-positive.
> 
> However it seems MADV_DONTNEED path doesn't take anon_vma lock.

That's true. I missed that.

> > 
> > Anyway:
> > 
> > Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> 
> Thanks.
> 
> > 
> > > ---8<---
> > > void main()
> > > {
> > >          void *addr;
> > >          int ret;
> > >          pid_t pid;
> > > 
> > >          addr = memalign(ALIGN, 2 * 1024 * 1024);
> > >          if (!addr) {
> > >                  printf("malloc failed\n");
> > >                  return;
> > >          }
> > > 
> > >          ret = madvise(addr, 2 * 1024 * 1024, MADV_HUGEPAGE);
> > >          if (ret < 0) {
> > >                  printf("madvise failed\n");
> > >                  return;
> > >          }
> > > 
> > >          memset(addr, 0xdeadbeef, 2 * 1024 * 1024);
> > > 
> > >          pid = fork();
> > > 
> > >          if (pid == 0) {
> > >                  /* Child process */
> > >                  ret = madvise(addr + (2 * 1024 * 1024) - 4096, 4096, MADV_DONTNEED);
> > >                  if (ret < 0) {
> > >                          printf("madvise failed in child\n");
> > >                          return;
> > >                  }
> > >                  sleep(120);
> > >          } else if (pid > 0) {
> > >                  sleep(5);
> > >                  /* Parent process */
> > >                  ret = madvise(addr, 2 * 1024 * 1024, MADV_DONTNEED);
> > >                  if (ret < 0) {
> > >                          printf("madvise failed in parent\n");
> > >                          return;
> > >                  }
> > >          } else {
> > >                  printf("fork failed\n");
> > >                  return;
> > >          }
> > > 
> > >          sleep(120);
> > > }
> > > ---8<---
> > > 
> > > So, total_mapcount > refcount seems not unexpected due to the inherent
> > > race.  Removed the error message even though it is protected by
> > > CONFIG_VM_DEBUG since we have to live with the race and AFAIK some
> > > distros may have CONFIG_VM_DEBUG enabled dy default.
> > > 
> > > Since such case is ephemeral we could always try collapse the area again
> > > later, so it sounds not harmful.  But, it might report false positive if
> > > the page has excessive GUP pins (i.e. 512), however it might be not that
> > > bad since the same check will be done later.  I didn't figure out a
> > > simple way to prevent the false positive.
> > > 
> > > Added some notes to elaborate the race and the consequence as well.
> > > 
> > > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> > > ---
> > >   mm/khugepaged.c | 24 +++++++++++++++++-------
> > >   1 file changed, 17 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 1fdd677..048f5d4 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -602,12 +602,6 @@ static bool is_refcount_suitable(struct page *page)
> > >   	if (PageSwapCache(page))
> > >   		expected_refcount += compound_nr(page);
> > > -	if (IS_ENABLED(CONFIG_DEBUG_VM) && expected_refcount > refcount) {
> > > -		pr_err("expected_refcount (%d) > refcount (%d)\n",
> > > -				expected_refcount, refcount);
> > > -		dump_page(page, "Unexpected refcount");
> > > -	}
> > > -
> > >   	return page_count(page) == expected_refcount;
> > >   }
> > > @@ -1341,7 +1335,23 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > >   			goto out_unmap;
> > >   		}
> > > -		/* Check if the page has any GUP (or other external) pins */
> > > +		/*
> > > +		 * Check if the page has any GUP (or other external) pins.
> > > +		 *
> > > +		 * Here the check is racy it may see totmal_mapcount > refcount
> > > +		 * in some cases.
> > > +		 * For example, one process with one forked child process.
> > > +		 * The parent has the PMD split due to MADV_DONTNEED, then
> > > +		 * the child is trying unmap the whole PMD, but khugepaged
> > > +		 * may be scanning the parent between the child has
> > > +		 * PageDoubleMap flag cleared and dec the mapcount.  So
> > > +		 * khugepaged may see total_mapcount > refcount.
> > > +		 *
> > > +		 * But such case is ephemeral we could always retry collapse
> > > +		 * later.  However it may report false positive if the page
> > > +		 * has excessive GUP pins (i.e. 512).  Anyway the same check
> > > +		 * will be done again later the risk seems low.
> > > +		 */
> > >   		if (!is_refcount_suitable(page)) {
> > >   			result = SCAN_PAGE_COUNT;
> > >   			goto out_unmap;
> > > -- 
> > > 1.8.3.1
> > > 
> > > 
>
Yang Shi May 21, 2020, 4:48 p.m. UTC | #4
On 5/21/20 7:56 AM, Qian Cai wrote:
> On Wed, May 13, 2020 at 05:03:03AM +0800, Yang Shi wrote:
> []
>>   mm/khugepaged.c | 24 +++++++++++++++++-------
>>   1 file changed, 17 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 1fdd677..048f5d4 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -602,12 +602,6 @@ static bool is_refcount_suitable(struct page *page)
>>   	if (PageSwapCache(page))
>>   		expected_refcount += compound_nr(page);
>>   
>> -	if (IS_ENABLED(CONFIG_DEBUG_VM) && expected_refcount > refcount) {
>> -		pr_err("expected_refcount (%d) > refcount (%d)\n",
>> -				expected_refcount, refcount);
>> -		dump_page(page, "Unexpected refcount");
>> -	}
>> -
>>   	return page_count(page) == expected_refcount;
>>   }
> mm/khugepaged.c: In function 'is_refcount_suitable':
> mm/khugepaged.c:575:25: warning: variable 'refcount set but not used [-Wunused-but-set-variable]
>    int expected_refcount, refcount;
>                           ^~~~~~~~
>
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -572,9 +572,8 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
>   
>   static bool is_refcount_suitable(struct page *page)
>   {
> -	int expected_refcount, refcount;
> +	int expected_refcount;
>   
> -	refcount = page_count(page);
>   	expected_refcount = total_mapcount(page);
>   	if (PageSwapCache(page))
>   		expected_refcount += compound_nr(page);

Thanks for catching this.
diff mbox series

Patch

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1fdd677..048f5d4 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -602,12 +602,6 @@  static bool is_refcount_suitable(struct page *page)
 	if (PageSwapCache(page))
 		expected_refcount += compound_nr(page);
 
-	if (IS_ENABLED(CONFIG_DEBUG_VM) && expected_refcount > refcount) {
-		pr_err("expected_refcount (%d) > refcount (%d)\n",
-				expected_refcount, refcount);
-		dump_page(page, "Unexpected refcount");
-	}
-
 	return page_count(page) == expected_refcount;
 }
 
@@ -1341,7 +1335,23 @@  static int khugepaged_scan_pmd(struct mm_struct *mm,
 			goto out_unmap;
 		}
 
-		/* Check if the page has any GUP (or other external) pins */
+		/*
+		 * Check if the page has any GUP (or other external) pins.
+		 *
+		 * Here the check is racy it may see totmal_mapcount > refcount
+		 * in some cases.
+		 * For example, one process with one forked child process.
+		 * The parent has the PMD split due to MADV_DONTNEED, then
+		 * the child is trying unmap the whole PMD, but khugepaged
+		 * may be scanning the parent between the child has
+		 * PageDoubleMap flag cleared and dec the mapcount.  So
+		 * khugepaged may see total_mapcount > refcount.
+		 *
+		 * But such case is ephemeral we could always retry collapse
+		 * later.  However it may report false positive if the page
+		 * has excessive GUP pins (i.e. 512).  Anyway the same check
+		 * will be done again later the risk seems low.
+		 */
 		if (!is_refcount_suitable(page)) {
 			result = SCAN_PAGE_COUNT;
 			goto out_unmap;