
[v4] mm/page_vma_mapped.c: Explicitly compare pfn for normal, hugetlbfs and THP page

Message ID 1578737885-8890-1-git-send-email-lixinhai.lxh@gmail.com (mailing list archive)
State New, archived
Series [v4] mm/page_vma_mapped.c: Explicitly compare pfn for normal, hugetlbfs and THP page

Commit Message

Li Xinhai Jan. 11, 2020, 10:18 a.m. UTC
When check_pte() runs, the pfn of a normal, hugetlbfs or THP page needs
to be compared against the pfn taken from the PTE. The current
implementation, in pfn_in_hpage, applies the comparison as
- normal 4K page: page_pfn <= pfn < page_pfn + 1
- hugetlbfs page:  page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
- THP page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
For a hugetlbfs page, it should instead be
page_pfn == pfn

Now, rename pfn_in_hpage to pfn_is_match to highlight that the comparison
is not only for THP, and compare each of these cases explicitly.

Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/page_vma_mapped.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

Comments

Andrew Morton Jan. 13, 2020, 10:38 p.m. UTC | #1
On Sat, 11 Jan 2020 10:18:05 +0000 Li Xinhai <lixinhai.lxh@gmail.com> wrote:

> When check_pte, pfn of normal, hugetlbfs and THP page need be compared.
> The current implementation apply comparison as
> - normal 4K page: page_pfn <= pfn < page_pfn + 1
> - hugetlbfs page:  page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> - THP page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> in pfn_in_hpage. For hugetlbfs page, it should be
> page_pfn == pfn

Does this have any runtime effects?

> Now, change pfn_in_hpage to pfn_is_match to highlight that comparison
> is not only for THP and explicitly compare for these cases.
Michal Hocko Jan. 14, 2020, 9:25 a.m. UTC | #2
On Sat 11-01-20 10:18:05, Li Xinhai wrote:
> When check_pte, pfn of normal, hugetlbfs and THP page need be compared.
> The current implementation apply comparison as
> - normal 4K page: page_pfn <= pfn < page_pfn + 1
> - hugetlbfs page:  page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> - THP page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> in pfn_in_hpage. For hugetlbfs page, it should be
> page_pfn == pfn
> 
> Now, change pfn_in_hpage to pfn_is_match to highlight that comparison
> is not only for THP and explicitly compare for these cases.

Why is this important to do? I have asked before, and Mike had the same
feeling, that the patch is missing any real justification. Why do we need
this change? It is great that you have dropped the VM_BUG_ON, btw.

> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/page_vma_mapped.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index eff4b45..719c352 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -52,12 +52,16 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
>  	return true;
>  }
>  
> -static inline bool pfn_in_hpage(struct page *hpage, unsigned long pfn)
> +static inline bool pfn_is_match(struct page *page, unsigned long pfn)
>  {
> -	unsigned long hpage_pfn = page_to_pfn(hpage);
> +	unsigned long page_pfn = page_to_pfn(page);
> +
> +	/* normal page and hugetlbfs page */
> +	if (!PageTransCompound(page) || PageHuge(page))
> +		return page_pfn == pfn;
>  
>  	/* THP can be referenced by any subpage */
> -	return pfn >= hpage_pfn && pfn - hpage_pfn < hpage_nr_pages(hpage);
> +	return pfn >= page_pfn && pfn - page_pfn < hpage_nr_pages(page);
>  }
>  
>  /**
> @@ -108,7 +112,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>  		pfn = pte_pfn(*pvmw->pte);
>  	}
>  
> -	return pfn_in_hpage(pvmw->page, pfn);
> +	return pfn_is_match(pvmw->page, pfn);
>  }
>  
>  /**
> -- 
> 1.8.3.1
Li Xinhai Jan. 14, 2020, 1:47 p.m. UTC | #3
On 2020-01-14 at 17:25 Michal Hocko wrote:
>On Sat 11-01-20 10:18:05, Li Xinhai wrote:
>> When check_pte, pfn of normal, hugetlbfs and THP page need be compared.
>> The current implementation apply comparison as
>> - normal 4K page: page_pfn <= pfn < page_pfn + 1
>> - hugetlbfs page:  page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
>> - THP page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
>> in pfn_in_hpage. For hugetlbfs page, it should be
>> page_pfn == pfn
>>
>> Now, change pfn_in_hpage to pfn_is_match to highlight that comparison
>> is not only for THP and explicitly compare for these cases.
>
>Why is this important to do. I have asked and Mike had the same feeling
>that the patch is missing any real justification. Why do we need this
>change? It is great that you have dropped VM_BUG_ON btw.
> 
I think it is important to make the code clear. As said, comparing a hugetlbfs page
against the range page_pfn <= pfn < page_pfn + HPAGE_PMD_NR is confusing.

>> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> ---
>>  mm/page_vma_mapped.c | 12 ++++++++----
>>  1 file changed, 8 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>> index eff4b45..719c352 100644
>> --- a/mm/page_vma_mapped.c
>> +++ b/mm/page_vma_mapped.c
>> @@ -52,12 +52,16 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
>>  return true;
>>  }
>> 
>> -static inline bool pfn_in_hpage(struct page *hpage, unsigned long pfn)
>> +static inline bool pfn_is_match(struct page *page, unsigned long pfn)
>>  {
>> -	unsigned long hpage_pfn = page_to_pfn(hpage);
>> +	unsigned long page_pfn = page_to_pfn(page);
>> +
>> +	/* normal page and hugetlbfs page */
>> +	if (!PageTransCompound(page) || PageHuge(page))
>> +	return page_pfn == pfn;
>> 
>>  /* THP can be referenced by any subpage */
>> -	return pfn >= hpage_pfn && pfn - hpage_pfn < hpage_nr_pages(hpage);
>> +	return pfn >= page_pfn && pfn - page_pfn < hpage_nr_pages(page);
>>  }
>> 
>>  /**
>> @@ -108,7 +112,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>>  pfn = pte_pfn(*pvmw->pte);
>>  }
>> 
>> -	return pfn_in_hpage(pvmw->page, pfn);
>> +	return pfn_is_match(pvmw->page, pfn);
>>  }
>> 
>>  /**
>> --
>> 1.8.3.1
>
>--
>Michal Hocko
>SUSE Labs
Li Xinhai Jan. 14, 2020, 2:24 p.m. UTC | #4
On 2020-01-14 at 06:38 Andrew Morton wrote:
>On Sat, 11 Jan 2020 10:18:05 +0000 Li Xinhai <lixinhai.lxh@gmail.com> wrote:
>
>> When check_pte, pfn of normal, hugetlbfs and THP page need be compared.
>> The current implementation apply comparison as
>> - normal 4K page: page_pfn <= pfn < page_pfn + 1
>> - hugetlbfs page:  page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
>> - THP page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
>> in pfn_in_hpage. For hugetlbfs page, it should be
>> page_pfn == pfn
>
>Does this have any runtime effects? 
No impact on current behavior; it just makes the code clearer.

>
>> Now, change pfn_in_hpage to pfn_is_match to highlight that comparison
>> is not only for THP and explicitly compare for these cases.
Michal Hocko Jan. 15, 2020, 9:31 a.m. UTC | #5
On Tue 14-01-20 21:47:51, Li Xinhai wrote:
> On 2020-01-14 at 17:25 Michal Hocko wrote:
> >On Sat 11-01-20 10:18:05, Li Xinhai wrote:
> >> When check_pte, pfn of normal, hugetlbfs and THP page need be compared.
> >> The current implementation apply comparison as
> >> - normal 4K page: page_pfn <= pfn < page_pfn + 1
> >> - hugetlbfs page:  page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> >> - THP page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> >> in pfn_in_hpage. For hugetlbfs page, it should be
> >> page_pfn == pfn
> >>
> >> Now, change pfn_in_hpage to pfn_is_match to highlight that comparison
> >> is not only for THP and explicitly compare for these cases.
> >
> >Why is this important to do. I have asked and Mike had the same feeling
> >that the patch is missing any real justification. Why do we need this
> >change? It is great that you have dropped VM_BUG_ON btw.
> > 
> I think it is important to make the code clear, as said, comparing hugetlbfs page
> in range page_pfn <= pfn < page_pfn + HPAGE_PMD_NR is confusion.

I am sorry but I do not really see a big confusion here. It is probably
a matter of taste. If others consider this an improvement then I will
not stand in the way, but I consider the justification insufficient for
merging.
Mike Kravetz Jan. 16, 2020, 12:40 a.m. UTC | #6
On 1/15/20 1:31 AM, Michal Hocko wrote:
> On Tue 14-01-20 21:47:51, Li Xinhai wrote:
>> On 2020-01-14 at 17:25 Michal Hocko wrote:
>>> On Sat 11-01-20 10:18:05, Li Xinhai wrote:
>>>> When check_pte, pfn of normal, hugetlbfs and THP page need be compared.
>>>> The current implementation apply comparison as
>>>> - normal 4K page: page_pfn <= pfn < page_pfn + 1
>>>> - hugetlbfs page:  page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
>>>> - THP page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
>>>> in pfn_in_hpage. For hugetlbfs page, it should be
>>>> page_pfn == pfn
>>>>
>>>> Now, change pfn_in_hpage to pfn_is_match to highlight that comparison
>>>> is not only for THP and explicitly compare for these cases.
>>>
>>> Why is this important to do. I have asked and Mike had the same feeling
>>> that the patch is missing any real justification. Why do we need this
>>> change? It is great that you have dropped VM_BUG_ON btw.
>>>
>> I think it is important to make the code clear, as said, comparing hugetlbfs page
>> in range page_pfn <= pfn < page_pfn + HPAGE_PMD_NR is confusion.
> 
> I am sorry but I do not really see a big confusion here. It is probably
> a matter of taste. If others consider this an improvement then I will
> not stand in the way but I consider the justification insuficient for
> merging.

Perhaps I am confused, but the patch does update the check done for
hugetlb pages.  IIUC, previously there was no distinction between THP
pages and hugetlb pages in the check.  It is valid to pass in a sub-
page of a THP page, but not that of a hugetlb page.

I do not believe it is possible for existing code to pass in a sub-page
of a hugetlb page.  And, no one has ever seen this as an issue.  This
is why I was questioning the need for the patch.

With all that said, the new code/check is better (more complete) than
the original.  It may not do anything for the current code base, but
it 'could' catch potential errors in future code.  Because of this, I
do consider the code an improvement.

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Michal Hocko Jan. 16, 2020, 10:29 a.m. UTC | #7
On Wed 15-01-20 16:40:26, Mike Kravetz wrote:
> On 1/15/20 1:31 AM, Michal Hocko wrote:
> > On Tue 14-01-20 21:47:51, Li Xinhai wrote:
> >> On 2020-01-14 at 17:25 Michal Hocko wrote:
> >>> On Sat 11-01-20 10:18:05, Li Xinhai wrote:
> >>>> When check_pte, pfn of normal, hugetlbfs and THP page need be compared.
> >>>> The current implementation apply comparison as
> >>>> - normal 4K page: page_pfn <= pfn < page_pfn + 1
> >>>> - hugetlbfs page:  page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> >>>> - THP page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> >>>> in pfn_in_hpage. For hugetlbfs page, it should be
> >>>> page_pfn == pfn
> >>>>
> >>>> Now, change pfn_in_hpage to pfn_is_match to highlight that comparison
> >>>> is not only for THP and explicitly compare for these cases.
> >>>
> >>> Why is this important to do. I have asked and Mike had the same feeling
> >>> that the patch is missing any real justification. Why do we need this
> >>> change? It is great that you have dropped VM_BUG_ON btw.
> >>>
> >> I think it is important to make the code clear, as said, comparing hugetlbfs page
> >> in range page_pfn <= pfn < page_pfn + HPAGE_PMD_NR is confusion.
> > 
> > I am sorry but I do not really see a big confusion here. It is probably
> > a matter of taste. If others consider this an improvement then I will
> > not stand in the way but I consider the justification insuficient for
> > merging.
> 
> Perhaps I am confused, but the patch does update the check done for
> hugetlb pages.  IIUC, previously there was no distinction between THP
> pages and hugetlb pages in the check.  It is valid to pass in a sub-
> page of a THP page, but not that of a hugetlb page.
> 
> I do not believe it is possible for existing code to pass in a sub-page
> of a hugetlb page.  And, no one has ever seen this as an issue.  This
> is why I was questioning the need for the patch.

Exactly, and that is the reason I fail to see the point.

> With all that said, the new code/check is better (more complete) than
> the original.  It may not do anything for the current code base, but
> it 'could' catch potential errors in future code.  Because of this, I
> do consider the code an improvement.

It adds a branch for something that doesn't happen and also to a code
path which is quite internal to the MM AFAICS. That being said, if you
believe this is an improvement then I will not stand in the way. But
there are so many other places which could add checks that are not
exercised...

> Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Yang Shi Sept. 2, 2020, 10:26 p.m. UTC | #8
On Thu, Jan 16, 2020 at 2:29 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 15-01-20 16:40:26, Mike Kravetz wrote:
> > On 1/15/20 1:31 AM, Michal Hocko wrote:
> > > On Tue 14-01-20 21:47:51, Li Xinhai wrote:
> > >> On 2020-01-14 at 17:25 Michal Hocko wrote:
> > >>> On Sat 11-01-20 10:18:05, Li Xinhai wrote:
> > >>>> When check_pte, pfn of normal, hugetlbfs and THP page need be compared.
> > >>>> The current implementation apply comparison as
> > >>>> - normal 4K page: page_pfn <= pfn < page_pfn + 1
> > >>>> - hugetlbfs page:  page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> > >>>> - THP page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> > >>>> in pfn_in_hpage. For hugetlbfs page, it should be
> > >>>> page_pfn == pfn
> > >>>>
> > >>>> Now, change pfn_in_hpage to pfn_is_match to highlight that comparison
> > >>>> is not only for THP and explicitly compare for these cases.
> > >>>
> > >>> Why is this important to do. I have asked and Mike had the same feeling
> > >>> that the patch is missing any real justification. Why do we need this
> > >>> change? It is great that you have dropped VM_BUG_ON btw.
> > >>>
> > >> I think it is important to make the code clear, as said, comparing hugetlbfs page
> > >> in range page_pfn <= pfn < page_pfn + HPAGE_PMD_NR is confusion.
> > >
> > > I am sorry but I do not really see a big confusion here. It is probably
> > > a matter of taste. If others consider this an improvement then I will
> > > not stand in the way but I consider the justification insuficient for
> > > merging.
> >
> > Perhaps I am confused, but the patch does update the check done for
> > hugetlb pages.  IIUC, previously there was no distinction between THP
> > pages and hugetlb pages in the check.  It is valid to pass in a sub-
> > page of a THP page, but not that of a hugetlb page.
> >
> > I do not believe it is possible for existing code to pass in a sub-page
> > of a hugetlb page.  And, no one has ever seen this as an issue.  This
> > is why I was questioning the need for the patch.
>
> Exactly and that is the reason I fail the see a point.
>
> > With all that said, the new code/check is better (more complete) than
> > the original.  It may not do anything for the current code base, but
> > it 'could' catch potential errors in future code.  Because of this, I
> > do consider the code an improvement.
>
> It adds a branch for something that doesn't happen and also to a code
> path which is quite internal to the MM AFAICS. That being said, if you
> believe this is an improvement then I will not stand in the way. But
> there are so many other places which could add checks that are not
> exercised...

I just saw a bad page bug on one of our hosts running a 4.14 kernel:

backtrace:
:BUG: Bad page map in process python3.6  pte:1036e24025 pmd:1dd743d067
:page:fffff62840db8900 count:105 mapcount:-35 mapping:ffff97d432e97ad0 index:0x1
:flags: 0xbfffe000001006c(referenced|uptodate|lru|active|mappedtodisk)
:raw: 0bfffe000001006c ffff97d432e97ad0 0000000000000001 00000069ffffffdc
:raw: dead000000000100 dead000000000200 0000000000000000 ffff97c58fc0e000
:page dumped because: bad pte
:page->mem_cgroup:ffff97c58fc0e000
:addr:00007f2ddffcc000 vm_flags:00000075 anon_vma:          (null)
mapping:ffff97d432e97ad0 index:7f
:file:libc-2.17.so fault:xfs_filemap_fault [xfs] mmap:xfs_file_mmap
[xfs] readpage:xfs_vm_readpage [xfs]
:do_exit+0x563/0xbb0
:? vfs_write+0x162/0x1a0
:do_group_exit+0x3a/0xa0
:file:libc-2.17.so fault:xfs_filemap_fault [xfs] mmap:xfs_file_mmap
[xfs] readpage:xfs_vm_readpage [xfs]
:SyS_exit_group+0x10/0x10
:do_syscall_64+0x60/0x110
:entry_SYSCALL_64_after_hwframe+0x3d/0xa2
:RIP: 0033:0x7fb2504091d9
:RSP: 002b:00007ffcf035db88 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
:RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb2504091d9
:RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
:RBP: 00007fb250706838 R08: 000000000000003c R09: 00000000000000e7
:R10: ffffffffffffff70 R11: 0000000000000246 R12: 00007fb250706838
:R13: 00007fb25070be80 R14: 0000000000000000 R15: 0000000000000000
:CPU: 42 PID: 454567 Comm: python3.6 Tainted: G    B   W
4.14.98-t5.el7.twitter.x86_64 #1

I have only seen it once, and it didn't happen on the newer kernel (maybe
it just hasn't happened yet). I can't tell why the mapcount could reach -35,
since every page_remove_rmap() is protected by the page table lock. Then I
looked into the pvmw code, and suspected the PTE might have changed between
reading it and acquiring the ptl.

With the old check, the pfn passed as long as "page_pfn <= pfn < page_pfn
+ 1" held, but for the base page case that means we might be dec'ing the
mapcount of another page.

Then I came across this commit. I guess this should be able to solve the
problem, but unfortunately the problem is so rare that I basically can't
prove this commit would solve it.

If you agree with my analysis, we may consider backporting this to the stable tree.


> > Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
>
> --
> Michal Hocko
> SUSE Labs
>
Michal Hocko Sept. 3, 2020, 8:02 a.m. UTC | #9
On Wed 02-09-20 15:26:39, Yang Shi wrote:
> On Thu, Jan 16, 2020 at 2:29 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Wed 15-01-20 16:40:26, Mike Kravetz wrote:
> > > On 1/15/20 1:31 AM, Michal Hocko wrote:
> > > > On Tue 14-01-20 21:47:51, Li Xinhai wrote:
> > > >> On 2020-01-14 at 17:25 Michal Hocko wrote:
> > > >>> On Sat 11-01-20 10:18:05, Li Xinhai wrote:
> > > >>>> When check_pte, pfn of normal, hugetlbfs and THP page need be compared.
> > > >>>> The current implementation apply comparison as
> > > >>>> - normal 4K page: page_pfn <= pfn < page_pfn + 1
> > > >>>> - hugetlbfs page:  page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> > > >>>> - THP page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> > > >>>> in pfn_in_hpage. For hugetlbfs page, it should be
> > > >>>> page_pfn == pfn
> > > >>>>
> > > >>>> Now, change pfn_in_hpage to pfn_is_match to highlight that comparison
> > > >>>> is not only for THP and explicitly compare for these cases.
> > > >>>
> > > >>> Why is this important to do. I have asked and Mike had the same feeling
> > > >>> that the patch is missing any real justification. Why do we need this
> > > >>> change? It is great that you have dropped VM_BUG_ON btw.
> > > >>>
> > > >> I think it is important to make the code clear, as said, comparing hugetlbfs page
> > > >> in range page_pfn <= pfn < page_pfn + HPAGE_PMD_NR is confusion.
> > > >
> > > > I am sorry but I do not really see a big confusion here. It is probably
> > > > a matter of taste. If others consider this an improvement then I will
> > > > not stand in the way but I consider the justification insuficient for
> > > > merging.
> > >
> > > Perhaps I am confused, but the patch does update the check done for
> > > hugetlb pages.  IIUC, previously there was no distinction between THP
> > > pages and hugetlb pages in the check.  It is valid to pass in a sub-
> > > page of a THP page, but not that of a hugetlb page.
> > >
> > > I do not believe it is possible for existing code to pass in a sub-page
> > > of a hugetlb page.  And, no one has ever seen this as an issue.  This
> > > is why I was questioning the need for the patch.
> >
> > Exactly and that is the reason I fail the see a point.
> >
> > > With all that said, the new code/check is better (more complete) than
> > > the original.  It may not do anything for the current code base, but
> > > it 'could' catch potential errors in future code.  Because of this, I
> > > do consider the code an improvement.
> >
> > It adds a branch for something that doesn't happen and also to a code
> > path which is quite internal to the MM AFAICS. That being said, if you
> > believe this is an improvement then I will not stand in the way. But
> > there are so many other places which could add checks that are not
> > exercised...
> 
> I just saw a bad page bug on one our host with 4.14 kernel:
> 
> backtrace:
> :BUG: Bad page map in process python3.6  pte:1036e24025 pmd:1dd743d067
> :page:fffff62840db8900 count:105 mapcount:-35 mapping:ffff97d432e97ad0 index:0x1
> :flags: 0xbfffe000001006c(referenced|uptodate|lru|active|mappedtodisk)
> :raw: 0bfffe000001006c ffff97d432e97ad0 0000000000000001 00000069ffffffdc
> :raw: dead000000000100 dead000000000200 0000000000000000 ffff97c58fc0e000
> :page dumped because: bad pte
> :page->mem_cgroup:ffff97c58fc0e000
> :addr:00007f2ddffcc000 vm_flags:00000075 anon_vma:          (null)
> mapping:ffff97d432e97ad0 index:7f
> :file:libc-2.17.so fault:xfs_filemap_fault [xfs] mmap:xfs_file_mmap
> [xfs] readpage:xfs_vm_readpage [xfs]
> :do_exit+0x563/0xbb0
> :? vfs_write+0x162/0x1a0
> :do_group_exit+0x3a/0xa0
> :file:libc-2.17.so fault:xfs_filemap_fault [xfs] mmap:xfs_file_mmap
> [xfs] readpage:xfs_vm_readpage [xfs]
> :SyS_exit_group+0x10/0x10
> :do_syscall_64+0x60/0x110
> :entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> :RIP: 0033:0x7fb2504091d9
> :RSP: 002b:00007ffcf035db88 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> :RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb2504091d9
> :RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> :RBP: 00007fb250706838 R08: 000000000000003c R09: 00000000000000e7
> :R10: ffffffffffffff70 R11: 0000000000000246 R12: 00007fb250706838
> :R13: 00007fb25070be80 R14: 0000000000000000 R15: 0000000000000000
> :CPU: 42 PID: 454567 Comm: python3.6 Tainted: G    B   W
> 4.14.98-t5.el7.twitter.x86_64 #1
> 
> I just saw it once and it didn't happen on the newer kernel (maybe
> just not happen yet). I can't tell why the mapcount could reach -35
> since all page_remove_rmap() is protected by page table lock. Then I
> looked into pvmw code, and suspected the PTE might be changed after
> acquiring ptl.
> 
> With the old check it was fine as long as "page_pfn <= pfn < page_pfn
> + 1", but for the base page case, it means we might be dec'ing
> mapcount for another page.
> 
> Then I came across this commit. I guess this should be able to solve
> the problem, but unfortunately the problem is very rare so basically I
> can't prove this commit could solve it.
> 
> If you agree my analysis, we may consider to backport this to stable tree.

I am not sure I follow. Your page looks like a normal page cache page, and
the patch you are referring to only clarifies hugetlb pages. I do not
remember the details, but it shouldn't have any functional effect. Are those
even used in your environment?
Yang Shi Sept. 3, 2020, 4:08 p.m. UTC | #10
On Thu, Sep 3, 2020 at 1:02 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 02-09-20 15:26:39, Yang Shi wrote:
> > On Thu, Jan 16, 2020 at 2:29 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Wed 15-01-20 16:40:26, Mike Kravetz wrote:
> > > > On 1/15/20 1:31 AM, Michal Hocko wrote:
> > > > > On Tue 14-01-20 21:47:51, Li Xinhai wrote:
> > > > >> On 2020-01-14 at 17:25 Michal Hocko wrote:
> > > > >>> On Sat 11-01-20 10:18:05, Li Xinhai wrote:
> > > > >>>> When check_pte, pfn of normal, hugetlbfs and THP page need be compared.
> > > > >>>> The current implementation apply comparison as
> > > > >>>> - normal 4K page: page_pfn <= pfn < page_pfn + 1
> > > > >>>> - hugetlbfs page:  page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> > > > >>>> - THP page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> > > > >>>> in pfn_in_hpage. For hugetlbfs page, it should be
> > > > >>>> page_pfn == pfn
> > > > >>>>
> > > > >>>> Now, change pfn_in_hpage to pfn_is_match to highlight that comparison
> > > > >>>> is not only for THP and explicitly compare for these cases.
> > > > >>>
> > > > >>> Why is this important to do. I have asked and Mike had the same feeling
> > > > >>> that the patch is missing any real justification. Why do we need this
> > > > >>> change? It is great that you have dropped VM_BUG_ON btw.
> > > > >>>
> > > > >> I think it is important to make the code clear, as said, comparing hugetlbfs page
> > > > >> in range page_pfn <= pfn < page_pfn + HPAGE_PMD_NR is confusion.
> > > > >
> > > > > I am sorry but I do not really see a big confusion here. It is probably
> > > > > a matter of taste. If others consider this an improvement then I will
> > > > > not stand in the way but I consider the justification insuficient for
> > > > > merging.
> > > >
> > > > Perhaps I am confused, but the patch does update the check done for
> > > > hugetlb pages.  IIUC, previously there was no distinction between THP
> > > > pages and hugetlb pages in the check.  It is valid to pass in a sub-
> > > > page of a THP page, but not that of a hugetlb page.
> > > >
> > > > I do not believe it is possible for existing code to pass in a sub-page
> > > > of a hugetlb page.  And, no one has ever seen this as an issue.  This
> > > > is why I was questioning the need for the patch.
> > >
> > > Exactly and that is the reason I fail the see a point.
> > >
> > > > With all that said, the new code/check is better (more complete) than
> > > > the original.  It may not do anything for the current code base, but
> > > > it 'could' catch potential errors in future code.  Because of this, I
> > > > do consider the code an improvement.
> > >
> > > It adds a branch for something that doesn't happen and also to a code
> > > path which is quite internal to the MM AFAICS. That being said, if you
> > > believe this is an improvement then I will not stand in the way. But
> > > there are so many other places which could add checks that are not
> > > exercised...
> >
> > I just saw a bad page bug on one our host with 4.14 kernel:
> >
> > backtrace:
> > :BUG: Bad page map in process python3.6  pte:1036e24025 pmd:1dd743d067
> > :page:fffff62840db8900 count:105 mapcount:-35 mapping:ffff97d432e97ad0 index:0x1
> > :flags: 0xbfffe000001006c(referenced|uptodate|lru|active|mappedtodisk)
> > :raw: 0bfffe000001006c ffff97d432e97ad0 0000000000000001 00000069ffffffdc
> > :raw: dead000000000100 dead000000000200 0000000000000000 ffff97c58fc0e000
> > :page dumped because: bad pte
> > :page->mem_cgroup:ffff97c58fc0e000
> > :addr:00007f2ddffcc000 vm_flags:00000075 anon_vma:          (null)
> > mapping:ffff97d432e97ad0 index:7f
> > :file:libc-2.17.so fault:xfs_filemap_fault [xfs] mmap:xfs_file_mmap
> > [xfs] readpage:xfs_vm_readpage [xfs]
> > :do_exit+0x563/0xbb0
> > :? vfs_write+0x162/0x1a0
> > :do_group_exit+0x3a/0xa0
> > :file:libc-2.17.so fault:xfs_filemap_fault [xfs] mmap:xfs_file_mmap
> > [xfs] readpage:xfs_vm_readpage [xfs]
> > :SyS_exit_group+0x10/0x10
> > :do_syscall_64+0x60/0x110
> > :entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > :RIP: 0033:0x7fb2504091d9
> > :RSP: 002b:00007ffcf035db88 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> > :RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb2504091d9
> > :RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > :RBP: 00007fb250706838 R08: 000000000000003c R09: 00000000000000e7
> > :R10: ffffffffffffff70 R11: 0000000000000246 R12: 00007fb250706838
> > :R13: 00007fb25070be80 R14: 0000000000000000 R15: 0000000000000000
> > :CPU: 42 PID: 454567 Comm: python3.6 Tainted: G    B   W
> > 4.14.98-t5.el7.twitter.x86_64 #1
> >
> > I just saw it once and it didn't happen on the newer kernel (maybe
> > just not happen yet). I can't tell why the mapcount could reach -35
> > since all page_remove_rmap() is protected by page table lock. Then I
> > looked into pvmw code, and suspected the PTE might be changed after
> > acquiring ptl.
> >
> > With the old check it was fine as long as "page_pfn <= pfn < page_pfn
> > + 1", but for the base page case, it means we might be dec'ing
> > mapcount for another page.
> >
> > Then I came across this commit. I guess this should be able to solve
> > the problem, but unfortunately the problem is very rare so basically I
> > can't prove this commit could solve it.
> >
> > If you agree my analysis, we may consider to backport this to stable tree.
>
> I am not sure I follow. Your page looks like a normal page cache and the
> patch you are referring to is only clarifying hugetlb pages. I do not
> remember details but it shouldn't have any functional effect. Are those
> even used in your environemnt?

The code does not only touch hugetlbfs pages; please see the code snippet below:

+       /* normal page and hugetlbfs page */
+       if (!PageTransCompound(page) || PageHuge(page))
+               return page_pfn == pfn;

The potential problem, per my understanding, is that pvmw does:

1. read the PTE
2. lock the ptl
3. check the pfn

Between steps #1 and #2 the PTE might be changed. This is not surprising;
we typically have a pte_same() check in the page fault path for exactly
this, after acquiring the ptl. For the pvmw path, though, a full
pte_same() check might be unnecessary, since we just care whether the pfn
is intact or not.

But before the patch, as long as "old_pfn <= new_pfn < old_pfn + 512" was
true, the check would pass for both a normal base page and a hugetlbfs page
even though the pfn had changed; that commit tightened the check. For the
normal base page it must now be "old_pfn == new_pfn".

> --
> Michal Hocko
> SUSE Labs
Yang Shi Sept. 4, 2020, 1:22 a.m. UTC | #11
On Thu, Sep 3, 2020 at 9:08 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Sep 3, 2020 at 1:02 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 02-09-20 15:26:39, Yang Shi wrote:
> > > On Thu, Jan 16, 2020 at 2:29 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Wed 15-01-20 16:40:26, Mike Kravetz wrote:
> > > > > On 1/15/20 1:31 AM, Michal Hocko wrote:
> > > > > > On Tue 14-01-20 21:47:51, Li Xinhai wrote:
> > > > > >> On 2020-01-14 at 17:25 Michal Hocko wrote:
> > > > > >>> On Sat 11-01-20 10:18:05, Li Xinhai wrote:
> > > > > >>>> When check_pte, pfn of normal, hugetlbfs and THP page need be compared.
> > > > > >>>> The current implementation apply comparison as
> > > > > >>>> - normal 4K page: page_pfn <= pfn < page_pfn + 1
> > > > > >>>> - hugetlbfs page:  page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> > > > > >>>> - THP page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> > > > > >>>> in pfn_in_hpage. For hugetlbfs page, it should be
> > > > > >>>> page_pfn == pfn
> > > > > >>>>
> > > > > >>>> Now, change pfn_in_hpage to pfn_is_match to highlight that comparison
> > > > > >>>> is not only for THP and explicitly compare for these cases.
> > > > > >>>
> > > > > >>> Why is this important to do. I have asked and Mike had the same feeling
> > > > > >>> that the patch is missing any real justification. Why do we need this
> > > > > >>> change? It is great that you have dropped VM_BUG_ON btw.
> > > > > >>>
> > > > > >> I think it is important to make the code clear, as said, comparing hugetlbfs page
> > > > > >> in range page_pfn <= pfn < page_pfn + HPAGE_PMD_NR is confusion.
> > > > > >
> > > > > > I am sorry but I do not really see a big confusion here. It is probably
> > > > > > a matter of taste. If others consider this an improvement then I will
> > > > > > not stand in the way but I consider the justification insufficient for
> > > > > > merging.
> > > > >
> > > > > Perhaps I am confused, but the patch does update the check done for
> > > > > hugetlb pages.  IIUC, previously there was no distinction between THP
> > > > > pages and hugetlb pages in the check.  It is valid to pass in a sub-
> > > > > page of a THP page, but not that of a hugetlb page.
> > > > >
> > > > > I do not believe it is possible for existing code to pass in a sub-page
> > > > > of a hugetlb page.  And, no one has ever seen this as an issue.  This
> > > > > is why I was questioning the need for the patch.
> > > >
> > > > Exactly and that is the reason I fail the see a point.
> > > >
> > > > > With all that said, the new code/check is better (more complete) than
> > > > > the original.  It may not do anything for the current code base, but
> > > > > it 'could' catch potential errors in future code.  Because of this, I
> > > > > do consider the code an improvement.
> > > >
> > > > It adds a branch for something that doesn't happen and also to a code
> > > > path which is quite internal to the MM AFAICS. That being said, if you
> > > > believe this is an improvement then I will not stand in the way. But
> > > > there are so many other places which could add checks that are not
> > > > exercised...
> > >
> > > I just saw a bad page bug on one our host with 4.14 kernel:
> > >
> > > backtrace:
> > > :BUG: Bad page map in process python3.6  pte:1036e24025 pmd:1dd743d067
> > > :page:fffff62840db8900 count:105 mapcount:-35 mapping:ffff97d432e97ad0 index:0x1
> > > :flags: 0xbfffe000001006c(referenced|uptodate|lru|active|mappedtodisk)
> > > :raw: 0bfffe000001006c ffff97d432e97ad0 0000000000000001 00000069ffffffdc
> > > :raw: dead000000000100 dead000000000200 0000000000000000 ffff97c58fc0e000
> > > :page dumped because: bad pte
> > > :page->mem_cgroup:ffff97c58fc0e000
> > > :addr:00007f2ddffcc000 vm_flags:00000075 anon_vma:          (null)
> > > mapping:ffff97d432e97ad0 index:7f
> > > :file:libc-2.17.so fault:xfs_filemap_fault [xfs] mmap:xfs_file_mmap
> > > [xfs] readpage:xfs_vm_readpage [xfs]
> > > :do_exit+0x563/0xbb0
> > > :? vfs_write+0x162/0x1a0
> > > :do_group_exit+0x3a/0xa0
> > > :file:libc-2.17.so fault:xfs_filemap_fault [xfs] mmap:xfs_file_mmap
> > > [xfs] readpage:xfs_vm_readpage [xfs]
> > > :SyS_exit_group+0x10/0x10
> > > :do_syscall_64+0x60/0x110
> > > :entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > > :RIP: 0033:0x7fb2504091d9
> > > :RSP: 002b:00007ffcf035db88 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> > > :RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb2504091d9
> > > :RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > > :RBP: 00007fb250706838 R08: 000000000000003c R09: 00000000000000e7
> > > :R10: ffffffffffffff70 R11: 0000000000000246 R12: 00007fb250706838
> > > :R13: 00007fb25070be80 R14: 0000000000000000 R15: 0000000000000000
> > > :CPU: 42 PID: 454567 Comm: python3.6 Tainted: G    B   W
> > > 4.14.98-t5.el7.twitter.x86_64 #1
> > >
> > > I just saw it once and it didn't happen on a newer kernel (maybe it
> > > just hasn't happened yet). I can't tell why the mapcount could reach
> > > -35 since all page_remove_rmap() calls are protected by the page
> > > table lock. Then I looked into the pvmw code, and suspected the PTE
> > > might have changed between reading it and acquiring the ptl.
> > >
> > > With the old check it was fine as long as "page_pfn <= pfn < page_pfn
> > > + 1", but for the base page case, it means we might be dec'ing
> > > mapcount for another page.
> > >
> > > Then I came across this commit. I guess this should be able to solve
> > > the problem, but unfortunately the problem is very rare so basically I
> > > can't prove this commit could solve it.
> > >
> > > If you agree my analysis, we may consider to backport this to stable tree.
> >
> > I am not sure I follow. Your page looks like a normal page cache page,
> > and the patch you are referring to only clarifies the hugetlb page
> > check. I do not remember the details but it shouldn't have any
> > functional effect. Are hugetlb pages even used in your environment?
>
> The code does not only touch hugetlbfs pages; please see the code snippet below:
>
> +       /* normal page and hugetlbfs page */
> +       if (!PageTransCompound(page) || PageHuge(page))
> +               return page_pfn == pfn;
>
> The potential problem, per my understanding, is that pvmw does:
>
> 1. read pte
> 2. lock ptl
> 3. check pfn
>
> Between steps #1 and #2 the PTE might be changed; that is why we
> typically have a pte_same() check in the page fault path after
> acquiring the ptl. For the pvmw path, though, a full pte_same() check
> might be unnecessary, since we only care whether the pfn is intact.
>
> But before the patch as long as "old_pfn < new_pfn < old_pfn + 512" is

I just realized I misread the old implementation; it looks like the old
implementation already did the right pfn check for base pages, and this
commit just made it more readable. Sorry for the confusion.

> true the check would pass for both normal base page and hugetlbfs page
> even though the new pfn is changed, that commit tightened the check.
> For the normal base page it must be "old_pfn == new_pfn".
>
> > --
> > Michal Hocko
> > SUSE Labs
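As an editorial aside for readers of the thread: the semantics of the new check can be modeled outside the kernel. The sketch below is a hypothetical user-space model — `struct page_model`, `pfn_is_match_model()` and its `is_compound`/`is_hugetlb` fields are stand-ins for the kernel's `struct page`, `PageTransCompound()` and `PageHuge()` — not the kernel code itself:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * User-space model of the logic in pfn_is_match(): a page is reduced to
 * its head pfn, its size in base pages, and two flags mirroring
 * PageTransCompound() and PageHuge().
 */
struct page_model {
	unsigned long head_pfn;  /* pfn of the head (or only) page */
	unsigned long nr_pages;  /* 1 for a base page, 512 for a 2MB page */
	bool is_compound;        /* models PageTransCompound() */
	bool is_hugetlb;         /* models PageHuge() */
};

static bool pfn_is_match_model(const struct page_model *page, unsigned long pfn)
{
	/* normal page and hugetlbfs page: the pfn must match exactly */
	if (!page->is_compound || page->is_hugetlb)
		return page->head_pfn == pfn;

	/* THP can be referenced by any subpage */
	return pfn >= page->head_pfn && pfn - page->head_pfn < page->nr_pages;
}
```

In this model, a THP at pfn 2048 matches any pfn in [2048, 2560), while a hugetlbfs page matches only its head pfn exactly — the tightening the patch makes explicit.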

Patch

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index eff4b45..719c352 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -52,12 +52,16 @@  static bool map_pte(struct page_vma_mapped_walk *pvmw)
 	return true;
 }
 
-static inline bool pfn_in_hpage(struct page *hpage, unsigned long pfn)
+static inline bool pfn_is_match(struct page *page, unsigned long pfn)
 {
-	unsigned long hpage_pfn = page_to_pfn(hpage);
+	unsigned long page_pfn = page_to_pfn(page);
+
+	/* normal page and hugetlbfs page */
+	if (!PageTransCompound(page) || PageHuge(page))
+		return page_pfn == pfn;
 
 	/* THP can be referenced by any subpage */
-	return pfn >= hpage_pfn && pfn - hpage_pfn < hpage_nr_pages(hpage);
+	return pfn >= page_pfn && pfn - page_pfn < hpage_nr_pages(page);
 }
 
 /**
@@ -108,7 +112,7 @@  static bool check_pte(struct page_vma_mapped_walk *pvmw)
 		pfn = pte_pfn(*pvmw->pte);
 	}
 
-	return pfn_in_hpage(pvmw->page, pfn);
+	return pfn_is_match(pvmw->page, pfn);
 }
 
 /**
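The distinction Yang Shi draws in the thread — pvmw only needs the pfn to be intact, not a full pte_same() match — can also be illustrated with a small user-space model. `pte_model`, `pte_same_model()` and `pfn_intact()` below are hypothetical stand-ins, not kernel APIs:

```c
#include <assert.h>
#include <stdbool.h>

/* A PTE reduced to the two things the discussion cares about. */
typedef struct {
	unsigned long pfn;    /* page frame the PTE points at */
	unsigned long flags;  /* protection/dirty/accessed bits */
} pte_model;

/* Full comparison, in the spirit of pte_same() in the fault path. */
static bool pte_same_model(pte_model a, pte_model b)
{
	return a.pfn == b.pfn && a.flags == b.flags;
}

/* The weaker check pvmw needs: only the target page must be unchanged. */
static bool pfn_intact(pte_model before, pte_model after)
{
	return before.pfn == after.pfn;
}
```

If only the flags change between the unlocked read and the locked recheck (say, a write-protect), `pfn_intact()` still passes while the full comparison fails; only when the PTE is repointed at a different pfn do both fail, and that is the case that matters for the rmap walk.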