mm, page_alloc: check pfn is valid before moving to freelist

Message ID fb3c8c008994b2ed96f74b6b9698ff998b689bd2.1649794059.git.quic_sudaraja@quicinc.com (mailing list archive)

Commit Message

Sudarshan Rajagopalan April 12, 2022, 8:16 p.m. UTC
Check whether the pfn is valid before moving it to the freelist.

There are possible scenarios where a pageblock can contain a partial physical
hole and a partial region of System RAM. This happens when a base address in
the RAM partition table is not aligned to the pageblock size.

Example:

Say we have these first two entries in the RAM partition table -

Base Addr: 0x0000000080000000 Length: 0x0000000058000000
Base Addr: 0x00000000E3930000 Length: 0x0000000020000000
...

Physical hole: 0xD8000000 - 0xE3930000

On a system with a 4K page size, and hence a 4MB pageblock size, the
base address 0xE3930000 is not aligned to the 4MB pageblock size.

Now we will have a pageblock which contains a partial physical hole and a
partial region of System RAM -

Pageblock [0xE3800000 - 0xE3C00000] -
	0xE3800000 - 0xE3930000 -- physical hole
	0xE3930000 - 0xE3C00000 -- System RAM

Now, during __alloc_pages, say we get a valid page with PFN 0xE3B00 from
__rmqueue_fallback. We then try to put the other pages from the same pageblock
into the freelist as well by calling steal_suitable_fallback().

We then search for freepages from the start of the pageblock due to the code below -

move_freepages_block(zone, page, migratetype, ...)
{
	pfn = page_to_pfn(page);
	start_pfn = pfn & ~(pageblock_nr_pages - 1);
	end_pfn = start_pfn + pageblock_nr_pages - 1;
	...
}

With a pageblock which has a partial physical hole at the beginning, we will
run into PFNs from the physical hole whose struct pages are not initialized and
are invalid, and the system crashes as we operate on an invalid struct page to
find out whether the page is in Buddy or on the LRU.

[  107.629453][ T9688] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  107.639214][ T9688] Mem abort info:
[  107.642829][ T9688]   ESR = 0x96000006
[  107.646696][ T9688]   EC = 0x25: DABT (current EL), IL = 32 bits
[  107.652878][ T9688]   SET = 0, FnV = 0
[  107.656751][ T9688]   EA = 0, S1PTW = 0
[  107.660705][ T9688]   FSC = 0x06: level 2 translation fault
[  107.666455][ T9688] Data abort info:
[  107.670151][ T9688]   ISV = 0, ISS = 0x00000006
[  107.674827][ T9688]   CM = 0, WnR = 0
[  107.678615][ T9688] user pgtable: 4k pages, 39-bit VAs, pgdp=000000098a237000
[  107.685970][ T9688] [0000000000000000] pgd=0800000987170003, p4d=0800000987170003, pud=0800000987170003, pmd=0000000000000000
[  107.697582][ T9688] Internal error: Oops: 96000006 [#1] PREEMPT SMP

[  108.209839][ T9688] pc : move_freepages_block+0x174/0x27c
[  108.215407][ T9688] lr : steal_suitable_fallback+0x20c/0x398

[  108.305908][ T9688] Call trace:
[  108.309151][ T9688]  move_freepages_block+0x174/0x27c        [PageLRU]
[  108.314359][ T9688]  steal_suitable_fallback+0x20c/0x398
[  108.319826][ T9688]  rmqueue_bulk+0x250/0x934
[  108.324325][ T9688]  rmqueue_pcplist+0x178/0x2ac
[  108.329086][ T9688]  rmqueue+0x5c/0xc10
[  108.333048][ T9688]  get_page_from_freelist+0x19c/0x430
[  108.338430][ T9688]  __alloc_pages+0x134/0x424
[  108.343017][ T9688]  page_cache_ra_unbounded+0x120/0x324
[  108.348494][ T9688]  do_sync_mmap_readahead+0x1b0/0x234
[  108.353878][ T9688]  filemap_fault+0xe0/0x4c8
[  108.358375][ T9688]  do_fault+0x168/0x6cc
[  108.362518][ T9688]  handle_mm_fault+0x5c4/0x848
[  108.367280][ T9688]  do_page_fault+0x3fc/0x5d0
[  108.371867][ T9688]  do_translation_fault+0x6c/0x1b0
[  108.376985][ T9688]  do_mem_abort+0x68/0x10c
[  108.381389][ T9688]  el0_ia+0x50/0xbc
[  108.385175][ T9688]  el0t_32_sync_handler+0x88/0xbc
[  108.390208][ T9688]  el0t_32_sync+0x1b8/0x1bc

Hence, avoid operating on invalid pages within the same pageblock by checking
whether the pfn is valid.

Signed-off-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com>
Fixes: 4c7b9896621be ("mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE")
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
---
 mm/page_alloc.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Andrew Morton April 12, 2022, 8:59 p.m. UTC | #1
On Tue, 12 Apr 2022 13:16:23 -0700 Sudarshan Rajagopalan <quic_sudaraja@quicinc.com> wrote:

> Check if pfn is valid before or not before moving it to freelist.
> 
> There are possible scenario where a pageblock can have partial physical
> hole and partial part of System RAM. This happens when base address in RAM
> partition table is not aligned to pageblock size.
> 
> ...
>
> 
> Signed-off-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com>
> Fixes: 4c7b9896621be ("mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE")

I made that 859a85ddf90e714092dea71a0e54c7b9896621be and added
cc:stable.  I'll await reviewer input before proceeding further.

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2521,6 +2521,11 @@ static int move_freepages(struct zone *zone,
>  	int pages_moved = 0;
>  
>  	for (pfn = start_pfn; pfn <= end_pfn;) {
> +		if (!pfn_valid(pfn)) {

Readers will wonder how we can encounter an invalid pfn here.  A small
comment might help clue them in.

> +			pfn++;
> +			continue;
> +		}
> +
>  		page = pfn_to_page(pfn);
>  		if (!PageBuddy(page)) {
>  			/*
David Rientjes April 12, 2022, 9:05 p.m. UTC | #2
On Tue, 12 Apr 2022, Andrew Morton wrote:

> On Tue, 12 Apr 2022 13:16:23 -0700 Sudarshan Rajagopalan <quic_sudaraja@quicinc.com> wrote:
> 
> > Check if pfn is valid before or not before moving it to freelist.
> > 
> > There are possible scenario where a pageblock can have partial physical
> > hole and partial part of System RAM. This happens when base address in RAM
> > partition table is not aligned to pageblock size.
> > 
> > ...
> >
> > 
> > Signed-off-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com>
> > Fixes: 4c7b9896621be ("mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE")
> 
> I made that 859a85ddf90e714092dea71a0e54c7b9896621be and added
> cc:stable.  I'll await reviewer input before proceeding further.
> 

Acked-by: David Rientjes <rientjes@google.com>

> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2521,6 +2521,11 @@ static int move_freepages(struct zone *zone,
> >  	int pages_moved = 0;
> >  
> >  	for (pfn = start_pfn; pfn <= end_pfn;) {
> > +		if (!pfn_valid(pfn)) {
> 
> Readers will wonder how we can encounter an invalid pfn here.  A small
> comment might help clue them in.
> 

Sudarshan can correct me if I'm wrong, but this has to do with the 
pageblock alignment of the caller that assumes all pages in the range has 
an underlying struct page that we can access but that fails to hold true 
when we have a memory hole.  A comment would definitely help:

	/* Pageblock alignment may cause us to try to access into a hole */

> > +			pfn++;
> > +			continue;
> > +		}
> > +
> >  		page = pfn_to_page(pfn);
> >  		if (!PageBuddy(page)) {
> >  			/*
> 
> 
>
Mike Rapoport April 13, 2022, 8:48 p.m. UTC | #3
On Tue, Apr 12, 2022 at 01:16:23PM -0700, Sudarshan Rajagopalan wrote:
> Check if pfn is valid before or not before moving it to freelist.
> 
> There are possible scenario where a pageblock can have partial physical
> hole and partial part of System RAM. This happens when base address in RAM
> partition table is not aligned to pageblock size.
> 
> Example:
> 
> Say we have this first two entries in RAM partition table -
> 
> Base Addr: 0x0000000080000000 Length: 0x0000000058000000
> Base Addr: 0x00000000E3930000 Length: 0x0000000020000000

I wonder what was done to memory DIMMs to get such an interesting
physical memory layout...

> ...
> 
> Physical hole: 0xD8000000 - 0xE3930000
> 
> On system having 4K as page size and hence pageblock size being 4MB, the
> base address 0xE3930000 is not aligned to 4MB pageblock size.
> 
> Now we will have pageblock which has partial physical hole and partial part
> of System RAM -
> 
> Pageblock [0xE3800000 - 0xE3C00000] -
> 	0xE3800000 - 0xE3930000 -- physical hole
> 	0xE3930000 - 0xE3C00000 -- System RAM
> 
> Now doing __alloc_pages say we get a valid page with PFN 0xE3B00 from
> __rmqueue_fallback, we try to put other pages from the same pageblock as well
> into freelist by calling steal_suitable_fallback().
> 
> We then search for freepages from start of the pageblock due to below code -
> 
>  move_freepages_block(zone, page, migratetype, ...)
> {
>     pfn = page_to_pfn(page);
>     start_pfn = pfn & ~(pageblock_nr_pages - 1);
>     end_pfn = start_pfn + pageblock_nr_pages - 1;
> ...
> }
> 
> With the pageblock which has partial physical hole at the beginning, we will
> run into PFNs from the physical hole whose struct page is not initialized and
> is invalid, and system would crash as we operate on invalid struct page to find
> out of page is in Buddy or LRU or not

struct page must be initialized and valid even for holes in the physical
memory. When a pageblock spans both existing memory and a hole, the struct
pages for the "hole" part should be marked as PG_Reserved. 
 
If you see that struct pages for memory holes exist but are invalid, we should
solve the underlying issue that causes the wrong struct page contents.

> [  107.629453][ T9688] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
> [  107.639214][ T9688] Mem abort info:
> [  107.642829][ T9688]   ESR = 0x96000006
> [  107.646696][ T9688]   EC = 0x25: DABT (current EL), IL = 32 bits
> [  107.652878][ T9688]   SET = 0, FnV = 0
> [  107.656751][ T9688]   EA = 0, S1PTW = 0
> [  107.660705][ T9688]   FSC = 0x06: level 2 translation fault
> [  107.666455][ T9688] Data abort info:
> [  107.670151][ T9688]   ISV = 0, ISS = 0x00000006
> [  107.674827][ T9688]   CM = 0, WnR = 0
> [  107.678615][ T9688] user pgtable: 4k pages, 39-bit VAs, pgdp=000000098a237000
> [  107.685970][ T9688] [0000000000000000] pgd=0800000987170003, p4d=0800000987170003, pud=0800000987170003, pmd=0000000000000000
> [  107.697582][ T9688] Internal error: Oops: 96000006 [#1] PREEMPT SMP
> 
> [  108.209839][ T9688] pc : move_freepages_block+0x174/0x27c

can you post faddr2line for this address?

> [  108.215407][ T9688] lr : steal_suitable_fallback+0x20c/0x398
> 
> [  108.305908][ T9688] Call trace:
> [  108.309151][ T9688]  move_freepages_block+0x174/0x27c        [PageLRU]
> [  108.314359][ T9688]  steal_suitable_fallback+0x20c/0x398
> [  108.319826][ T9688]  rmqueue_bulk+0x250/0x934
> [  108.324325][ T9688]  rmqueue_pcplist+0x178/0x2ac
> [  108.329086][ T9688]  rmqueue+0x5c/0xc10
> [  108.333048][ T9688]  get_page_from_freelist+0x19c/0x430
> [  108.338430][ T9688]  __alloc_pages+0x134/0x424
> [  108.343017][ T9688]  page_cache_ra_unbounded+0x120/0x324
> [  108.348494][ T9688]  do_sync_mmap_readahead+0x1b0/0x234
> [  108.353878][ T9688]  filemap_fault+0xe0/0x4c8
> [  108.358375][ T9688]  do_fault+0x168/0x6cc
> [  108.362518][ T9688]  handle_mm_fault+0x5c4/0x848
> [  108.367280][ T9688]  do_page_fault+0x3fc/0x5d0
> [  108.371867][ T9688]  do_translation_fault+0x6c/0x1b0
> [  108.376985][ T9688]  do_mem_abort+0x68/0x10c
> [  108.381389][ T9688]  el0_ia+0x50/0xbc
> [  108.385175][ T9688]  el0t_32_sync_handler+0x88/0xbc
> [  108.390208][ T9688]  el0t_32_sync+0x1b8/0x1bc
> 
> Hence, avoid operating on invalid pages within the same pageblock by checking
> if pfn is valid or not.

> Signed-off-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com>
> Fixes: 4c7b9896621be ("mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE")
> Cc: Mike Rapoport <rppt@linux.ibm.com>

For now the patch looks like a band-aid for a more fundamental bug, so

NAKED-by: Mike Rapoport <rppt@linux.ibm.com>


> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/page_alloc.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6e5b448..e87aa053 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2521,6 +2521,11 @@ static int move_freepages(struct zone *zone,
>  	int pages_moved = 0;
>  
>  	for (pfn = start_pfn; pfn <= end_pfn;) {
> +		if (!pfn_valid(pfn)) {
> +			pfn++;
> +			continue;
> +		}
> +
>  		page = pfn_to_page(pfn);
>  		if (!PageBuddy(page)) {
>  			/*
> -- 
> 2.7.4
>
Mike Rapoport April 13, 2022, 8:55 p.m. UTC | #4
On Tue, Apr 12, 2022 at 02:05:51PM -0700, David Rientjes wrote:
> On Tue, 12 Apr 2022, Andrew Morton wrote:
> 
> > On Tue, 12 Apr 2022 13:16:23 -0700 Sudarshan Rajagopalan <quic_sudaraja@quicinc.com> wrote:
> > 
> > > Check if pfn is valid before or not before moving it to freelist.
> > > 
> > > There are possible scenario where a pageblock can have partial physical
> > > hole and partial part of System RAM. This happens when base address in RAM
> > > partition table is not aligned to pageblock size.
> > > 
> > > ...
> > >
> > > 
> > > Signed-off-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com>
> > > Fixes: 4c7b9896621be ("mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE")
> > 
> > I made that 859a85ddf90e714092dea71a0e54c7b9896621be and added
> > cc:stable.  I'll await reviewer input before proceeding further.
> > 
> 
> Acked-by: David Rientjes <rientjes@google.com>
> 
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2521,6 +2521,11 @@ static int move_freepages(struct zone *zone,
> > >  	int pages_moved = 0;
> > >  
> > >  	for (pfn = start_pfn; pfn <= end_pfn;) {
> > > +		if (!pfn_valid(pfn)) {
> > 
> > Readers will wonder how we can encounter an invalid pfn here.  A small
> > comment might help clue them in.
> > 
> 
> Sudarshan can correct me if I'm wrong, but this has to do with the 
> pageblock alignment of the caller that assumes all pages in the range has 
> an underlying struct page that we can access but that fails to hold true 
> when we have a memory hole.  A comment would definitely help:

We do have a struct page for every page in a pageblock even if there is a
hole in the physical memory. If this is not the case, there is a more
fundamental bug that should be fixed.
 
> 	/* Pageblock alignment may cause us to try to access into a hole */
> 
> > > +			pfn++;
> > > +			continue;
> > > +		}
> > > +
> > >  		page = pfn_to_page(pfn);
> > >  		if (!PageBuddy(page)) {
> > >  			/*
> > 
> > 
> >
David Hildenbrand April 14, 2022, 2:02 p.m. UTC | #5
On 13.04.22 22:55, Mike Rapoport wrote:
> On Tue, Apr 12, 2022 at 02:05:51PM -0700, David Rientjes wrote:
>> On Tue, 12 Apr 2022, Andrew Morton wrote:
>>
>>> On Tue, 12 Apr 2022 13:16:23 -0700 Sudarshan Rajagopalan <quic_sudaraja@quicinc.com> wrote:
>>>
>>>> Check if pfn is valid before or not before moving it to freelist.
>>>>
>>>> There are possible scenario where a pageblock can have partial physical
>>>> hole and partial part of System RAM. This happens when base address in RAM
>>>> partition table is not aligned to pageblock size.
>>>>
>>>> ...
>>>>
>>>>
>>>> Signed-off-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com>
>>>> Fixes: 4c7b9896621be ("mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE")
>>>
>>> I made that 859a85ddf90e714092dea71a0e54c7b9896621be and added
>>> cc:stable.  I'll await reviewer input before proceeding further.
>>>
>>
>> Acked-by: David Rientjes <rientjes@google.com>
>>
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -2521,6 +2521,11 @@ static int move_freepages(struct zone *zone,
>>>>  	int pages_moved = 0;
>>>>  
>>>>  	for (pfn = start_pfn; pfn <= end_pfn;) {
>>>> +		if (!pfn_valid(pfn)) {
>>>
>>> Readers will wonder how we can encounter an invalid pfn here.  A small
>>> comment might help clue them in.
>>>
>>
>> Sudarshan can correct me if I'm wrong, but this has to do with the 
>> pageblock alignment of the caller that assumes all pages in the range has 
>> an underlying struct page that we can access but that fails to hold true 
>> when we have a memory hole.  A comment would definitely help:
> 
> We do have a struct page for every page in a pageblock even if there is a
> hole in the physical memory. If this is not the case, there is more
> fundamental bug that should be fixed.

Also, I dislike handling such a corner case in a way that affects all
other sane cases. move_freepages() is also used for page isolation.

I agree that this should be fixed differently, if possible.
Sudarshan Rajagopalan April 14, 2022, 9 p.m. UTC | #6
On 4/14/2022 2:18 AM, Mike Rapoport wrote:
> On Tue, Apr 12, 2022 at 01:16:23PM -0700, Sudarshan Rajagopalan wrote:
>> Check if pfn is valid before or not before moving it to freelist.
>>
>> There are possible scenario where a pageblock can have partial physical
>> hole and partial part of System RAM. This happens when base address in RAM
>> partition table is not aligned to pageblock size.
>>
>> Example:
>>
>> Say we have this first two entries in RAM partition table -
>>
>> Base Addr: 0x0000000080000000 Length: 0x0000000058000000
>> Base Addr: 0x00000000E3930000 Length: 0x0000000020000000
> I wonder what was done to memory DIMMs to get such an interesting
> physical memory layout...

We have a feature where we carve out some portion of memory in the RAM
partition table, hence we see such base addresses here.

>
>> ...
>>
>> Physical hole: 0xD8000000 - 0xE3930000
>>
>> On system having 4K as page size and hence pageblock size being 4MB, the
>> base address 0xE3930000 is not aligned to 4MB pageblock size.
>>
>> Now we will have pageblock which has partial physical hole and partial part
>> of System RAM -
>>
>> Pageblock [0xE3800000 - 0xE3C00000] -
>> 	0xE3800000 - 0xE3930000 -- physical hole
>> 	0xE3930000 - 0xE3C00000 -- System RAM
>>
>> Now doing __alloc_pages say we get a valid page with PFN 0xE3B00 from
>> __rmqueue_fallback, we try to put other pages from the same pageblock as well
>> into freelist by calling steal_suitable_fallback().
>>
>> We then search for freepages from start of the pageblock due to below code -
>>
>>   move_freepages_block(zone, page, migratetype, ...)
>> {
>>      pfn = page_to_pfn(page);
>>      start_pfn = pfn & ~(pageblock_nr_pages - 1);
>>      end_pfn = start_pfn + pageblock_nr_pages - 1;
>> ...
>> }
>>
>> With the pageblock which has partial physical hole at the beginning, we will
>> run into PFNs from the physical hole whose struct page is not initialized and
>> is invalid, and system would crash as we operate on invalid struct page to find
>> out of page is in Buddy or LRU or not
> struct page must be initialized and valid even for holes in the physical
> memory. When a pageblock spans both existing memory and a hole, the struct
> pages for the "hole" part should be marked as PG_Reserved.
>   
> If you see that struct pages for memory holes exist but invalid, we should
> solve the underlying issue that causes wrong struct pages contents.

We are using the 5.15 kernel on an arm64 platform. For the pages belonging to
the physical hole, I don't see that the pages are being initialized.

Looking into the memmap_init code, we call init_unavailable_range() to
initialize the pages that belong to holes in the zone. But again, we only
do this for PFNs that are valid, per the code snippet below -

init_unavailable_range() {

6667     for (pfn = spfn; pfn < epfn; pfn++) {
6668         if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
6669             pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
6670                 + pageblock_nr_pages - 1;
6671             continue;
6672         }

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/mm/page_alloc.c?h=v5.15.34#n6668

With the arm64-specific definition of pfn_valid(), a PFN which isn't present 
in the RAM partition table (i.e. belongs to a physical hole) has pfn_valid() 
return FALSE. Hence we don't initialize any pages that belong to the 
physical hole here.

Or am I missing something in the kernel that initializes pages belonging to 
physical holes too? If so, could you point me to it?

I see that in later kernel versions, the arm64-specific definition of 
pfn_valid() is removed by Anshuman's patch. With that, PFNs in a hole have 
pfn_valid() return TRUE, and we would then initialize pages in holes as 
well. But this patch was reverted by Will Deacon on the 5.15 kernel.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/arch/arm64/mm?h=v5.17.3&id=3de360c3fdb34fbdbaf6da3af94367d3fded95d3

>> [  107.629453][ T9688] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
>> [  107.639214][ T9688] Mem abort info:
>> [  107.642829][ T9688]   ESR = 0x96000006
>> [  107.646696][ T9688]   EC = 0x25: DABT (current EL), IL = 32 bits
>> [  107.652878][ T9688]   SET = 0, FnV = 0
>> [  107.656751][ T9688]   EA = 0, S1PTW = 0
>> [  107.660705][ T9688]   FSC = 0x06: level 2 translation fault
>> [  107.666455][ T9688] Data abort info:
>> [  107.670151][ T9688]   ISV = 0, ISS = 0x00000006
>> [  107.674827][ T9688]   CM = 0, WnR = 0
>> [  107.678615][ T9688] user pgtable: 4k pages, 39-bit VAs, pgdp=000000098a237000
>> [  107.685970][ T9688] [0000000000000000] pgd=0800000987170003, p4d=0800000987170003, pud=0800000987170003, pmd=0000000000000000
>> [  107.697582][ T9688] Internal error: Oops: 96000006 [#1] PREEMPT SMP
>>
>> [  108.209839][ T9688] pc : move_freepages_block+0x174/0x27c
> can you post fadd2line for this address?

faddr2line didn't work quite well. I used aarch64-linux-android-addr2line 
on the address (move_freepages_block+0x174) and it points to 
arch_test_bit() at include/asm-generic/bitops/non-atomic.h:118.

On T32, using the stacktrace, it points to PageLRU() in the code below, under 
move_freepages() -

move_freepages() {

2520             if (num_movable &&
2521                     (PageLRU(page) || __PageMovable(page)))


https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/mm/page_alloc.c?h=v5.15.34#n2521

The struct page contents of the page are invalid, including the 
page->lru where the system crashes doing PageLRU(page).

   (struct page)0xFFFFFFFE018E0000 = (
     flags = 0,
     lru = (next = 0x0010000000000021, prev = 0x0042000000000004),
     mapping = 0x0,
     index = 549755813904,
     private = 0,
     pp_magic = 4503599627370529,
     pp = 0x0042000000000004,
     _pp_mapping_pad = 0,
     dma_addr = 549755813904,
     dma_addr_upper = 0,
     pp_frag_count = (counter = 0),
     slab_list = (next = 0x0010000000000021, prev = 0x0042000000000004),
     next = 0x0010000000000021,

>
>> [  108.215407][ T9688] lr : steal_suitable_fallback+0x20c/0x398
>>
>> [  108.305908][ T9688] Call trace:
>> [  108.309151][ T9688]  move_freepages_block+0x174/0x27c        [PageLRU]
>> [  108.314359][ T9688]  steal_suitable_fallback+0x20c/0x398
>> [  108.319826][ T9688]  rmqueue_bulk+0x250/0x934
>> [  108.324325][ T9688]  rmqueue_pcplist+0x178/0x2ac
>> [  108.329086][ T9688]  rmqueue+0x5c/0xc10
>> [  108.333048][ T9688]  get_page_from_freelist+0x19c/0x430
>> [  108.338430][ T9688]  __alloc_pages+0x134/0x424
>> [  108.343017][ T9688]  page_cache_ra_unbounded+0x120/0x324
>> [  108.348494][ T9688]  do_sync_mmap_readahead+0x1b0/0x234
>> [  108.353878][ T9688]  filemap_fault+0xe0/0x4c8
>> [  108.358375][ T9688]  do_fault+0x168/0x6cc
>> [  108.362518][ T9688]  handle_mm_fault+0x5c4/0x848
>> [  108.367280][ T9688]  do_page_fault+0x3fc/0x5d0
>> [  108.371867][ T9688]  do_translation_fault+0x6c/0x1b0
>> [  108.376985][ T9688]  do_mem_abort+0x68/0x10c
>> [  108.381389][ T9688]  el0_ia+0x50/0xbc
>> [  108.385175][ T9688]  el0t_32_sync_handler+0x88/0xbc
>> [  108.390208][ T9688]  el0t_32_sync+0x1b8/0x1bc
>>
>> Hence, avoid operating on invalid pages within the same pageblock by checking
>> if pfn is valid or not.
>> Signed-off-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com>
>> Fixes: 4c7b9896621be ("mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE")
>> Cc: Mike Rapoport <rppt@linux.ibm.com>
> For now the patch looks like a band-aid for more fundamental bug, so
>
> NAKED-by: Mike Rapoport <rppt@linux.ibm.com>
>
This patch may look like a workaround, but yes, I think there's a 
fundamental problem where the kernel treats a pageblock which has partial 
holes and partial System RAM as a valid pageblock, which occurs when base 
addresses in the RAM partition table are not aligned to the pageblock size.

This fundamental problem needs to be fixed, and I'm looking for your 
suggestions.

>> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
>> Cc: Suren Baghdasaryan <surenb@google.com>
>> ---
>>   mm/page_alloc.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6e5b448..e87aa053 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2521,6 +2521,11 @@ static int move_freepages(struct zone *zone,
>>   	int pages_moved = 0;
>>   
>>   	for (pfn = start_pfn; pfn <= end_pfn;) {
>> +		if (!pfn_valid(pfn)) {
>> +			pfn++;
>> +			continue;
>> +		}
>> +
>>   		page = pfn_to_page(pfn);
>>   		if (!PageBuddy(page)) {
>>   			/*
>> -- 
>> 2.7.4
>>
Mike Rapoport April 18, 2022, 7:24 a.m. UTC | #7
On Fri, Apr 15, 2022 at 02:30:52AM +0530, Sudarshan Rajagopalan wrote:
> 
> On 4/14/2022 2:18 AM, Mike Rapoport wrote:
> > On Tue, Apr 12, 2022 at 01:16:23PM -0700, Sudarshan Rajagopalan wrote:
> > > Check if pfn is valid before or not before moving it to freelist.
> > > 
> > > There are possible scenario where a pageblock can have partial physical
> > > hole and partial part of System RAM. This happens when base address in RAM
> > > partition table is not aligned to pageblock size.
> > > 
> > > Example:
> > > 
> > > Say we have this first two entries in RAM partition table -
> > > 
> > > Base Addr: 0x0000000080000000 Length: 0x0000000058000000
> > > Base Addr: 0x00000000E3930000 Length: 0x0000000020000000
> > I wonder what was done to memory DIMMs to get such an interesting
> > physical memory layout...
> 
> We have a feature where we carve out some portion of memory in RAM partition
> table, hence we see such base addresses here.

Cannot the firmware align that portion at some sensible boundary?
Or at least report the carved out range as "reserved" (and maybe NOMAP)
rather than as hole?

> > > Physical hole: 0xD8000000 - 0xE3930000
> > > 
> > > With the pageblock which has partial physical hole at the beginning, we will
> > > run into PFNs from the physical hole whose struct page is not initialized and
> > > is invalid, and system would crash as we operate on invalid struct page to find
> > > out of page is in Buddy or LRU or not
> >
> > struct page must be initialized and valid even for holes in the physical
> > memory. When a pageblock spans both existing memory and a hole, the struct
> > pages for the "hole" part should be marked as PG_Reserved.
> > If you see that struct pages for memory holes exist but invalid, we should
> > solve the underlying issue that causes wrong struct pages contents.
> 
> We are using 5.15 kernel, arm64 platform. For the pages belonging to the
> physical hole, I don't see that pages are being initialized.
> 
> Looking into memmap_init code, we call init_unavailable_range() to
> initialize the pages that belong to holes in the zone. But again we only do
> this for PFNs that are valid according to below code snippet -
> 
> init_unavailable_range() {
> 
> 6667     for (pfn = spfn; pfn < epfn; pfn++) {
> 6668         if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
> 6669             pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
> 6670                 + pageblock_nr_pages - 1;
> 6671             continue;
> 6672         }
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/mm/page_alloc.c?h=v5.15.34#n6668
> 
> With arm64 specific definition of pfn_valid(), a PFN which isn't present in
> RAM partition table (i.e. belongs to physical hole), pfn_valid will return
> FALSE. Hence we don't initialize any pages that belongs to physical hole
> here.
> 
> Or am I missing anything in kernel that initializes pages belonging to
> physical holes too? If so could you point me to that?

I agree with your analysis for 5.15, you just didn't mention that the
problem happens with 5.15.
 
> I see that in next kernel versions, we are removing arm64 specific
> definition of pfn_valid by Anshuman. Doing so, PFNs in hole would have
> pfn_valid return TRUE and we would then initialize pages in holes as well.

That said, your patch will not fix anything in the current kernel because
the issue should not happen there, right?

> But this patch was reverted by Will Deacon on 5.15 kernel.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/arch/arm64/mm?h=v5.17.3&id=3de360c3fdb34fbdbaf6da3af94367d3fded95d3

The reason for the revert was fixed by the commit a9c38c5d267c
("dma-mapping: remove bogus test for pfn_valid from dma_map_resource").

...

> > > Hence, avoid operating on invalid pages within the same pageblock by checking
> > > if pfn is valid or not.
> > > Signed-off-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com>
> > > Fixes: 4c7b9896621be ("mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE")
> > > Cc: Mike Rapoport <rppt@linux.ibm.com>
> > For now the patch looks like a band-aid for more fundamental bug, so
> > 
> > NAKED-by: Mike Rapoport <rppt@linux.ibm.com>
> > 
> This patch may look like work around solution but yes I think there's a
> fundamental problem where kernel takes a pageblock which has partial holes
> and partial System RAM as valid pageblock, which occurs when Base Address in
> RAM partition table are not aligned to pageblock size.
> 
> This fundamental problem needs to be fixed, and looking for your
> suggestions.

I'd suggest backporting a9c38c5d267c ("dma-mapping: remove bogus test for
pfn_valid from dma_map_resource") and 3de360c3fdb3 ("arm64/mm: drop
HAVE_ARCH_PFN_VALID") to 5.15.
Sudarshan Rajagopalan April 18, 2022, 10:32 p.m. UTC | #8
On 4/18/2022 12:24 AM, Mike Rapoport wrote:
> On Fri, Apr 15, 2022 at 02:30:52AM +0530, Sudarshan Rajagopalan wrote:
>> On 4/14/2022 2:18 AM, Mike Rapoport wrote:
>>> On Tue, Apr 12, 2022 at 01:16:23PM -0700, Sudarshan Rajagopalan wrote:
>>>> Check if pfn is valid before or not before moving it to freelist.
>>>>
>>>> There are possible scenario where a pageblock can have partial physical
>>>> hole and partial part of System RAM. This happens when base address in RAM
>>>> partition table is not aligned to pageblock size.
>>>>
>>>> Example:
>>>>
>>>> Say we have this first two entries in RAM partition table -
>>>>
>>>> Base Addr: 0x0000000080000000 Length: 0x0000000058000000
>>>> Base Addr: 0x00000000E3930000 Length: 0x0000000020000000
>>> I wonder what was done to memory DIMMs to get such an interesting
>>> physical memory layout...
>> We have a feature where we carve out some portion of memory in RAM partition
>> table, hence we see such base addresses here.
> Cannot the firmware align that portion at some sensible boundary?
> Or at least report the carved out range as "reserved" (and maybe NOMAP)
> rather than as hole?

We can have the firmware or ABL align the address to the next pageblock-size 
boundary. This would simply mean losing a few MBs of memory to the 
alignment. The same goes for marking the ranges as "reserved" with "nomap".

>>>> Physical hole: 0xD8000000 - 0xE3930000
>>>>
>>>> With the pageblock which has partial physical hole at the beginning, we will
>>>> run into PFNs from the physical hole whose struct page is not initialized and
>>>> is invalid, and system would crash as we operate on invalid struct page to find
>>>> out of page is in Buddy or LRU or not
>>> struct page must be initialized and valid even for holes in the physical
>>> memory. When a pageblock spans both existing memory and a hole, the struct
>>> pages for the "hole" part should be marked as PG_Reserved.
>>> If you see that struct pages for memory holes exist but invalid, we should
>>> solve the underlying issue that causes wrong struct pages contents.
>> We are using 5.15 kernel, arm64 platform. For the pages belonging to the
>> physical hole, I don't see that pages are being initialized.
>>
>> Looking into the memmap_init code, we call init_unavailable_range() to
>> initialize the pages that belong to holes in the zone. But again, we only do
>> this for PFNs that are valid, according to the code snippet below -
>>
>> init_unavailable_range() {
>>
>> 6667     for (pfn = spfn; pfn < epfn; pfn++) {
>> 6668         if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
>> 6669             pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
>> 6670                 + pageblock_nr_pages - 1;
>> 6671             continue;
>> 6672         }
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/mm/page_alloc.c?h=v5.15.34#n6668
>>
>> With the arm64-specific definition of pfn_valid(), for a PFN which isn't
>> present in the RAM partition table (i.e. belongs to a physical hole),
>> pfn_valid() will return FALSE. Hence we don't initialize any pages that
>> belong to physical holes here.
>>
>> Or am I missing anything in kernel that initializes pages belonging to
>> physical holes too? If so could you point me to that?
> I agree with your analysis for 5.15, you just didn't mention that the
> problem happens with 5.15.
>   
>> I see that in later kernel versions, the arm64-specific definition of
>> pfn_valid() was removed by Anshuman. With that, PFNs in holes would have
>> pfn_valid() return TRUE and we would then initialize pages in holes as well.
> That said, your patch will not fix anything in the current kernel because
> the issue should not happen there, right?

Yes, the issue seems to be fixed in the latest kernel version with the 
patches that drop the arm64 pfn_valid(). But the core issue is present on 
previous kernel versions with the scenario explained. Is there any 
procedure to have this fixed on the 5.15 kernel?

>
>> But this patch was reverted by Will Deacon on 5.15 kernel.
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/arch/arm64/mm?h=v5.17.3&id=3de360c3fdb34fbdbaf6da3af94367d3fded95d3
> The reason for the revert was fixed by the commit a9c38c5d267c
> ("dma-mapping: remove bogus test for pfn_valid from dma_map_resource").
>
> ...
>
>>>> Hence, avoid operating on invalid pages within the same pageblock by checking
>>>> if pfn is valid or not.
>>>> Signed-off-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com>
>>>> Fixes: 4c7b9896621be ("mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE")
>>>> Cc: Mike Rapoport <rppt@linux.ibm.com>
>>> For now the patch looks like a band-aid for more fundamental bug, so
>>>
>>> NAKED-by: Mike Rapoport <rppt@linux.ibm.com>
>>>
>> This patch may look like a workaround, but yes, I think there's a
>> fundamental problem where the kernel treats a pageblock which has partial
>> holes and partial System RAM as a valid pageblock; this occurs when base
>> addresses in the RAM partition table are not aligned to the pageblock size.
>>
>> This fundamental problem needs to be fixed, and looking for your
>> suggestions.
> I'd suggest backporting a9c38c5d267c ("dma-mapping: remove bogus test for
> pfn_valid from dma_map_resource") and 3de360c3fdb3 ("arm64/mm: drop
> HAVE_ARCH_PFN_VALID") to 5.15.
The issue is not seen with these patches backported. I'm not sure of the 
procedure for sending patches to the 5.15 kernel, but can we have them 
backported to 5.15?
Mike Rapoport April 19, 2022, 6:45 a.m. UTC | #9
On Mon, Apr 18, 2022 at 03:32:21PM -0700, Sudarshan Rajagopalan wrote:
> On 4/18/2022 12:24 AM, Mike Rapoport wrote:
> > On Fri, Apr 15, 2022 at 02:30:52AM +0530, Sudarshan Rajagopalan wrote:
> > > On 4/14/2022 2:18 AM, Mike Rapoport wrote:
> > >
> > > We have a feature where we carve out some portion of memory in RAM partition
> > > table, hence we see such base addresses here.
> > >
> > Cannot the firmware align that portion at some sensible boundary?
> > Or at least report the carved out range as "reserved" (and maybe NOMAP)
> > rather than as hole?
> 
> We can have the firmware or ABL align the address to the next pageblock
> size boundary. This would simply mean losing a few MBs of memory to
> alignment. The same applies to marking them as "reserved" with "nomap".

Reserved and nomap do not have to be aligned and there will be a valid
struct page for such regions.

Still, the kernel should be able to cope with firmware oddities so a fix
for 5.15 is still needed.
 
> > That said, your patch will not fix anything in the current kernel because
> > the issue should not happen there, right?
> 
> Yes, the issue seems to be fixed in the latest kernel version with the
> patches that drop the arm64 pfn_valid(). But the core issue is present on
> previous kernel versions with the scenario explained. Is there any
> procedure to have this fixed on the 5.15 kernel?
>
> > I'd suggest backporting a9c38c5d267c ("dma-mapping: remove bogus test for
> > pfn_valid from dma_map_resource") and 3de360c3fdb3 ("arm64/mm: drop
> > HAVE_ARCH_PFN_VALID") to 5.15.
>
> The issue is not seen with these patches backported. I'm not sure of the
> procedure for sending patches to the 5.15 kernel, but can we have them
> backported to 5.15?

Please look at Documentation/process/stable-kernel-rules.rst for
explanation how to send patches to stable kernels.

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e5b448..e87aa053 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2521,6 +2521,11 @@  static int move_freepages(struct zone *zone,
 	int pages_moved = 0;
 
 	for (pfn = start_pfn; pfn <= end_pfn;) {
+		if (!pfn_valid(pfn)) {
+			pfn++;
+			continue;
+		}
+
 		page = pfn_to_page(pfn);
 		if (!PageBuddy(page)) {
 			/*