
[RFC,v2] mm: initialize struct pages reserved by ZONE_DEVICE driver.

Message ID 20190906081027.15477-1-t-fukasawa@vx.jp.nec.com (mailing list archive)
State New, archived

Commit Message

Toshiki Fukasawa Sept. 6, 2019, 8:09 a.m. UTC
A kernel panic is observed while reading
/proc/kpage{cgroup,count,flags} for the first few pfns allocated by
a pmem namespace:

BUG: unable to handle page fault for address: fffffffffffffffe
[  114.495280] #PF: supervisor read access in kernel mode
[  114.495738] #PF: error_code(0x0000) - not-present page
[  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
[  114.496713] Oops: 0000 [#1] SMP PTI
[  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1
[  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
[  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
[  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 41 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
[  114.500788] RSP: 0018:ffffa5e601a0fe60 EFLAGS: 00010202
[  114.501373] RAX: fffffffffffffffe RBX: ffffffffffffffff RCX: 0000000000000000
[  114.502009] RDX: 0000000000000001 RSI: 00007ffca13a7310 RDI: ffffd07489000000
[  114.502637] RBP: ffffd07489000000 R08: 0000000000000001 R09: 0000000000000000
[  114.503270] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000240000
[  114.503896] R13: 0000000000080000 R14: 00007ffca13a7310 R15: ffffa5e601a0ff08
[  114.504530] FS:  00007f0266c7f540(0000) GS:ffff962dbbac0000(0000) knlGS:0000000000000000
[  114.505245] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  114.505754] CR2: fffffffffffffffe CR3: 000000023a204000 CR4: 00000000000006e0
[  114.506401] Call Trace:
[  114.506660]  kpageflags_read+0xb1/0x130
[  114.507051]  proc_reg_read+0x39/0x60
[  114.507387]  vfs_read+0x8a/0x140
[  114.507686]  ksys_pread64+0x61/0xa0
[  114.508021]  do_syscall_64+0x5f/0x1a0
[  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  114.508844] RIP: 0033:0x7f0266ba426b

The first few pages of a ZONE_DEVICE range, i.e. the pfns from
(altmap->base_pfn) to (altmap->base_pfn + altmap->reserve), are
skipped by struct page initialization. pfn walkers such as
/proc/kpage{cgroup,count,flags} cannot handle these uninitialized
struct pages, which causes the panic.

In a previous discussion, Dan was concerned that the struct page
area for some of the pages indicated by vmem_altmap->reserve may not
be allocated. (See https://lore.kernel.org/lkml/CAPcyv4i5FjTOnPbXNcTzvt+e6RQYow0JRQwSFuxaa62LSuvzHQ@mail.gmail.com/)
However, arch_add_memory(), called by devm_memremap_pages(), allocates
the struct page area for pages covering the addresses in the range
(res.start) to (res.start + resource_size(res)), which includes the
pages indicated by vmem_altmap->reserve. If I read the code correctly,
the area is allocated as requested, at least on x86_64. Also,
memmap_init_zone() initializes struct pages over the same range,
so I think these struct pages should be initialized.

Signed-off-by: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com>
Cc: stable@vger.kernel.org
---
Changes since rev 1:
 Instead of avoiding uninitialized pages on the pfn walker side,
 we initialize struct pages.

 mm/page_alloc.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)
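As a rough illustration of the change, the clamping that memmap_init_zone() applies to a ZONE_DEVICE pfn range can be modeled in userspace as below. Struct and field names mirror struct vmem_altmap, but this is a simplified sketch of the logic, not the kernel implementation:

```c
#include <stdbool.h>

/* Userspace model of the early-init pfn clamping in memmap_init_zone()
 * for ZONE_DEVICE. Field names mirror struct vmem_altmap; the logic is
 * a simplified sketch, not the kernel code. */
struct altmap_model {
	unsigned long base_pfn;	/* first pfn of the device range      */
	unsigned long reserve;	/* head pages reserved for the driver */
	unsigned long free;	/* pages set aside to back the memmap */
};

static unsigned long altmap_offset(const struct altmap_model *a)
{
	return a->reserve + a->free;	/* like vmem_altmap_offset() */
}

/* First pfn whose struct page is initialized early. Before this patch
 * the driver-reserved head pages were skipped; with the patch applied
 * they are initialized like the rest of the early range. */
static unsigned long first_init_pfn(const struct altmap_model *a,
				    unsigned long start_pfn, bool patched)
{
	if (!patched && start_pfn == a->base_pfn)
		return start_pfn + a->reserve;
	return start_pfn;
}
```

The patch removes exactly the `start_pfn += altmap->reserve` skip modeled above, so the reserved head pages get an initialized memmap too.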

Comments

David Hildenbrand Sept. 6, 2019, 8:45 a.m. UTC | #1
On 06.09.19 10:09, Toshiki Fukasawa wrote:
> A kernel panic is observed during reading
> /proc/kpage{cgroup,count,flags} for first few pfns allocated by
> pmem namespace:
> 
> [... oops trace snipped ...]
> 
> The first few pages of ZONE_DEVICE expressed as the range
> (altmap->base_pfn) to (altmap->base_pfn + altmap->reserve) are
> skipped by struct page initialization. Some pfn walkers like
> /proc/kpage{cgroup, count, flags} can't handle these uninitialized
> struct pages, which causes the error.
> 
> In previous discussion, Dan seemed to have concern that the struct
> page area of some pages indicated by vmem_altmap->reserve may not
> be allocated. (See https://lore.kernel.org/lkml/CAPcyv4i5FjTOnPbXNcTzvt+e6RQYow0JRQwSFuxaa62LSuvzHQ@mail.gmail.com/)
> However, arch_add_memory() called by devm_memremap_pages() allocates
> struct page area for pages containing addresses in the range
> (res.start) to (res.start + resource_size(res)), which include the
> pages indicated by vmem_altmap->reserve. If I read correctly, it is
> allocated as requested at least on x86_64. Also, memmap_init_zone()
> initializes struct pages in the same range.
> So I think the struct pages should be initialized.

For !ZONE_DEVICE memory, the memmap is valid with SECTION_IS_ONLINE -
for the whole section. For ZONE_DEVICE memory we have no such
indication. In any section that is !SECTION_IS_ONLINE and
SECTION_MARKED_PRESENT, we could have any subsections initialized.

The only indication I am aware of is pfn_zone_device_reserved() - which
seems to check exactly what you are trying to skip here.

Can't you somehow use pfn_zone_device_reserved() ? Or if you considered
that already, why did you decide against it?
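For readers following along, the distinction drawn here can be sketched with a small userspace model. Flag names mirror include/linux/mmzone.h, but the bit values are illustrative and this is not kernel code:

```c
#include <stdbool.h>

/* Illustrative model of the mem_section state flags discussed above.
 * Names mirror include/linux/mmzone.h; values are for this sketch only. */
#define SECTION_MARKED_PRESENT	(1UL << 0)
#define SECTION_IS_ONLINE	(1UL << 2)

/* For ordinary memory, online implies the whole section's memmap is
 * initialized. */
static bool memmap_known_valid(unsigned long flags)
{
	return (flags & SECTION_MARKED_PRESENT) && (flags & SECTION_IS_ONLINE);
}

/* A section that is present but not online may be ZONE_DEVICE, and the
 * flags alone do not say whether its memmap is initialized. */
static bool maybe_zone_device(unsigned long flags)
{
	return (flags & SECTION_MARKED_PRESENT) && !(flags & SECTION_IS_ONLINE);
}
```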

> Signed-off-by: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com>
> Cc: stable@vger.kernel.org
> ---
> Changes since rev 1:
>  Instead of avoiding uninitialized pages on the pfn walker side,
>  we initialize struct pages.
> 
> mm/page_alloc.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9c91949..6d180ae 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5846,8 +5846,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>  
>  #ifdef CONFIG_ZONE_DEVICE
>  	/*
> -	 * Honor reservation requested by the driver for this ZONE_DEVICE
> -	 * memory. We limit the total number of pages to initialize to just
> +	 * We limit the total number of pages to initialize to just
>  	 * those that might contain the memory mapping. We will defer the
>  	 * ZONE_DEVICE page initialization until after we have released
>  	 * the hotplug lock.
> @@ -5856,8 +5855,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>  		if (!altmap)
>  			return;
>  
> -		if (start_pfn == altmap->base_pfn)
> -			start_pfn += altmap->reserve;
>  		end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
>  	}
>  #endif
>
Toshiki Fukasawa Sept. 6, 2019, 10:02 a.m. UTC | #2
Thank you for your feedback.

On 2019/09/06 17:45, David Hildenbrand wrote:
> On 06.09.19 10:09, Toshiki Fukasawa wrote:
>> [... quoted commit message and oops trace snipped ...]
> 
> For !ZONE_DEVICE memory, the memmap is valid with SECTION_IS_ONLINE -
> for the whole section. For ZONE_DEVICE memory we have no such
> indication. In any section that is !SECTION_IS_ONLINE and
> SECTION_MARKED_PRESENT, we could have any subsections initialized.
> The only indication I am aware of is pfn_zone_device_reserved() - which
> seems to check exactly what you are trying to skip here.
> 
> Can't you somehow use pfn_zone_device_reserved() ? Or if you considered
> that already, why did you decide against it?

No, with the current approach that function is no longer needed.
We changed the approach because otherwise every pfn walker would
have to be aware of the uninitialized struct pages.

As for SECTION_IS_ONLINE, I'm not sure now.
I will look into it next week.

Thanks,
Toshiki Fukasawa

David Hildenbrand Sept. 6, 2019, 10:35 a.m. UTC | #3
On 06.09.19 12:02, Toshiki Fukasawa wrote:
> Thank you for your feedback.
> 
> On 2019/09/06 17:45, David Hildenbrand wrote:
>> On 06.09.19 10:09, Toshiki Fukasawa wrote:
>>> [... quoted commit message and oops trace snipped ...]
>>
>> For !ZONE_DEVICE memory, the memmap is valid with SECTION_IS_ONLINE -
>> for the whole section. For ZONE_DEVICE memory we have no such
>> indication. In any section that is !SECTION_IS_ONLINE and
>> SECTION_MARKED_PRESENT, we could have any subsections initialized.
>> The only indication I am aware of is pfn_zone_device_reserved() - which
>> seems to check exactly what you are trying to skip here.
>>
>> Can't you somehow use pfn_zone_device_reserved() ? Or if you considered
>> that already, why did you decide against it?
> 
> No, in current approach this function is no longer needed.
> The reason why we change the approach is that all pfn walkers
> have to be aware of the uninitialized struct pages.

We should use the same strategy for all pfn walkers then (effectively
get rid of pfn_zone_device_reserved() if that's what we want).

> 
> As for SECTION_IS_ONLINE, I'm not sure now.
> I will look into it next week.

SECTION_IS_ONLINE does currently not apply to ZONE_DEVICE and due to
sub-section support for ZONE_DEVICE, it cannot easily be reused.
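To see why a per-section online bit is too coarse here, consider sub-section granularity on x86_64 (128 MiB sections, 2 MiB subsections). The struct below is a hypothetical userspace sketch, loosely modeled on the kernel's subsection bitmap, not an actual kernel type:

```c
#include <stdbool.h>

/* Hypothetical per-subsection validity map, loosely modeled on the
 * kernel's subsection bitmap. On x86_64 a 128 MiB section holds 64
 * subsections of 2 MiB each, so one 64-bit word suffices here. */
#define SUBSECTIONS_PER_SECTION	64

struct subsection_map {
	unsigned long bitmap;	/* bit i set => subsection i has a valid memmap */
};

static bool subsection_valid(const struct subsection_map *m, unsigned int idx)
{
	return (m->bitmap >> idx) & 1UL;
}

/* A single section-wide online bit cannot represent a device that
 * covers, say, only the upper half of the section. */
static bool whole_section_valid(const struct subsection_map *m)
{
	return m->bitmap == ~0UL;	/* assumes 64-bit unsigned long */
}
```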

Toshiki Fukasawa Sept. 9, 2019, 5:48 a.m. UTC | #4
On 2019/09/06 19:35, David Hildenbrand wrote:
> On 06.09.19 12:02, Toshiki Fukasawa wrote:
>> Thank you for your feedback.
>>
>> On 2019/09/06 17:45, David Hildenbrand wrote:
>>> On 06.09.19 10:09, Toshiki Fukasawa wrote:
>>>> [... quoted commit message and oops trace snipped ...]
>>>
>>> For !ZONE_DEVICE memory, the memmap is valid with SECTION_IS_ONLINE -
>>> for the whole section. For ZONE_DEVICE memory we have no such
>>> indication. In any section that is !SECTION_IS_ONLINE and
>>> SECTION_MARKED_PRESENT, we could have any subsections initialized.
>>> The only indication I am aware of is pfn_zone_device_reserved() - which
>>> seems to check exactly what you are trying to skip here.
>>>
>>> Can't you somehow use pfn_zone_device_reserved() ? Or if you considered
>>> that already, why did you decide against it?
>>
>> No, in current approach this function is no longer needed.
>> The reason why we change the approach is that all pfn walkers
>> have to be aware of the uninitialized struct pages.
> 
> We should use the same strategy for all pfn walkers then (effectively
> get rid of pfn_zone_device_reserved() if that's what we want).

True, but this patch replaces the earlier "/proc/kpageflags: do not use
uninitialized struct pages" patch. If we initialize the struct pages,
no pfn walker will need to be aware of them.

> 
>>
>> As for SECTION_IS_ONLINE, I'm not sure now.
>> I will look into it next week.
> 
> SECTION_IS_ONLINE does currently not apply to ZONE_DEVICE and due to
> sub-section support for ZONE_DEVICE, it cannot easily be reused.
> 
It seems that SECTION_IS_ONLINE and SECTION_MARKED_PRESENT can be used to
distinguish uninitialized struct pages if we can apply them to ZONE_DEVICE,
but that is no longer necessary with this approach.

Thanks,
Toshiki Fukasawa
David Hildenbrand Sept. 9, 2019, 7:46 a.m. UTC | #5
On 09.09.19 07:48, Toshiki Fukasawa wrote:
> On 2019/09/06 19:35, David Hildenbrand wrote:
>> On 06.09.19 12:02, Toshiki Fukasawa wrote:
>>> Thank you for your feedback.
>>>
>>> On 2019/09/06 17:45, David Hildenbrand wrote:
>>>> On 06.09.19 10:09, Toshiki Fukasawa wrote:
>>>>> [... quoted commit message and oops trace snipped ...]
>>>>
>>>> For !ZONE_DEVICE memory, the memmap is valid with SECTION_IS_ONLINE -
>>>> for the whole section. For ZONE_DEVICE memory we have no such
>>>> indication. In any section that is !SECTION_IS_ONLINE and
>>>> SECTION_MARKED_PRESENT, we could have any subsections initialized.
>>>> The only indication I am aware of is pfn_zone_device_reserved() - which
>>>> seems to check exactly what you are trying to skip here.
>>>>
>>>> Can't you somehow use pfn_zone_device_reserved() ? Or if you considered
>>>> that already, why did you decide against it?
>>>
>>> No, in current approach this function is no longer needed.
>>> The reason why we change the approach is that all pfn walkers
>>> have to be aware of the uninitialized struct pages.
>>
>> We should use the same strategy for all pfn walkers then (effectively
>> get rid of pfn_zone_device_reserved() if that's what we want).
> 
> True, but this patch replaces "/proc/kpageflags: do not use uninitialized
> struct pages". If we initialize the uninitialized struct pages, no pfn walker
> will need to be aware of them.

So the function should go.

> 
>>
>>>
>>> As for SECTION_IS_ONLINE, I'm not sure now.
>>> I will look into it next week.
>>
>> SECTION_IS_ONLINE does currently not apply to ZONE_DEVICE and due to
>> sub-section support for ZONE_DEVICE, it cannot easily be reused.
>>
> It seems that SECTION_IS_ONLINE and SECTION_MARKED_PRESENT can be used to
> distinguish uninitialized struct pages if we can apply them to ZONE_DEVICE,
> but that is no longer necessary with this approach.

Let's take a step back here to understand the issues I am aware of. I
think we should solve this for good now:

A PFN walker takes a look at a random PFN at a random point in time. It
finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
options are:

1. It is buddy memory (add_memory()) that has not been online yet. The
memmap contains garbage. Don't access.

2. It is ZONE_DEVICE memory with a valid memmap. Access it.

3. It is ZONE_DEVICE memory with an invalid memmap, because the section
is only partially present: E.g., device starts at offset 64MB within a
section or the device ends at offset 64MB within a section. Don't access it.

4. It is ZONE_DEVICE memory with an invalid memmap, because the memmap
was not initialized yet. memmap_init_zone_device() did not yet succeed
after dropping the mem_hotplug lock in mm/memremap.c. Don't access it.

5. It is reserved ZONE_DEVICE memory ("pages mapped, but reserved for
driver") with an invalid memmap. Don't access it.

I can see that your patch tries to make #5 vanish by initializing the
memmap, fair enough. #3 and #4 can't be detected. The PFN walker could
still stumble over uninitialized memmaps.
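The five cases above can be summarized in a small decision table. The enum below is purely illustrative: the kernel has no state that distinguishes cases 3-5, which is exactly the gap being discussed.

```c
#include <stdbool.h>

/* Illustrative classification of a pfn in a present-but-offline
 * section, matching cases 1-5 above. The enum is hypothetical; no
 * kernel flag currently distinguishes these states, which is the
 * problem for pfn walkers. */
enum memmap_state {
	MEMMAP_BUDDY_NOT_ONLINE,	/* 1: garbage, don't access             */
	MEMMAP_ZONE_DEVICE_VALID,	/* 2: valid, safe to access             */
	MEMMAP_PARTIAL_SECTION,		/* 3: device covers a subsection only   */
	MEMMAP_INIT_PENDING,		/* 4: memmap_init_zone_device() pending */
	MEMMAP_DRIVER_RESERVED,		/* 5: altmap->reserve head pages        */
};

static bool memmap_safe_to_access(enum memmap_state s)
{
	return s == MEMMAP_ZONE_DEVICE_VALID;	/* only case 2 is safe */
}
```

The patch under review makes case 5 behave like case 2; cases 3 and 4 remain indistinguishable to a walker.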
David Hildenbrand Sept. 9, 2019, 8:10 a.m. UTC | #6
On 09.09.19 09:46, David Hildenbrand wrote:
> On 09.09.19 07:48, Toshiki Fukasawa wrote:
>> On 2019/09/06 19:35, David Hildenbrand wrote:
>>> On 06.09.19 12:02, Toshiki Fukasawa wrote:
>>>> Thank you for your feedback.
>>>>
>>>> On 2019/09/06 17:45, David Hildenbrand wrote:
>>>>> On 06.09.19 10:09, Toshiki Fukasawa wrote:
>>>>>> A kernel panic is observed during reading
>>>>>> /proc/kpage{cgroup,count,flags} for first few pfns allocated by
>>>>>> pmem namespace:
>>>>>>
>>>>>> BUG: unable to handle page fault for address: fffffffffffffffe
>>>>>> [  114.495280] #PF: supervisor read access in kernel mode
>>>>>> [  114.495738] #PF: error_code(0x0000) - not-present page
>>>>>> [  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
>>>>>> [  114.496713] Oops: 0000 [#1] SMP PTI
>>>>>> [  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1
>>>>>> [  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
>>>>>> [  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
>>>>>> [  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 41 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
>>>>>> [  114.500788] RSP: 0018:ffffa5e601a0fe60 EFLAGS: 00010202
>>>>>> [  114.501373] RAX: fffffffffffffffe RBX: ffffffffffffffff RCX: 0000000000000000
>>>>>> [  114.502009] RDX: 0000000000000001 RSI: 00007ffca13a7310 RDI: ffffd07489000000
>>>>>> [  114.502637] RBP: ffffd07489000000 R08: 0000000000000001 R09: 0000000000000000
>>>>>> [  114.503270] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000240000
>>>>>> [  114.503896] R13: 0000000000080000 R14: 00007ffca13a7310 R15: ffffa5e601a0ff08
>>>>>> [  114.504530] FS:  00007f0266c7f540(0000) GS:ffff962dbbac0000(0000) knlGS:0000000000000000
>>>>>> [  114.505245] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [  114.505754] CR2: fffffffffffffffe CR3: 000000023a204000 CR4: 00000000000006e0
>>>>>> [  114.506401] Call Trace:
>>>>>> [  114.506660]  kpageflags_read+0xb1/0x130
>>>>>> [  114.507051]  proc_reg_read+0x39/0x60
>>>>>> [  114.507387]  vfs_read+0x8a/0x140
>>>>>> [  114.507686]  ksys_pread64+0x61/0xa0
>>>>>> [  114.508021]  do_syscall_64+0x5f/0x1a0
>>>>>> [  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>>> [  114.508844] RIP: 0033:0x7f0266ba426b
>>>>>>
>>>>>> The first few pages of ZONE_DEVICE expressed as the range
>>>>>> (altmap->base_pfn) to (altmap->base_pfn + altmap->reserve) are
>>>>>> skipped by struct page initialization. Some pfn walkers like
>>>>>> /proc/kpage{cgroup, count, flags} can't handle these uninitialized
>>>>>> struct pages, which causes the error.
>>>>>>
>>>>>> In previous discussion, Dan seemed to have concern that the struct
>>>>>> page area of some pages indicated by vmem_altmap->reserve may not
>>>>>> be allocated. (See https://lore.kernel.org/lkml/CAPcyv4i5FjTOnPbXNcTzvt+e6RQYow0JRQwSFuxaa62LSuvzHQ@mail.gmail.com/)
>>>>>> However, arch_add_memory() called by devm_memremap_pages() allocates
>>>>>> struct page area for pages containing addresses in the range
>>>>>> (res.start) to (res.start + resource_size(res)), which include the
>>>>>> pages indicated by vmem_altmap->reserve. If I read correctly, it is
>>>>>> allocated as requested at least on x86_64. Also, memmap_init_zone()
>>>>>> initializes struct pages in the same range.
>>>>>> So I think the struct pages should be initialized.
>>>>>
>>>>> For !ZONE_DEVICE memory, the memmap is valid with SECTION_IS_ONLINE -
>>>>> for the whole section. For ZONE_DEVICE memory we have no such
>>>>> indication. In any section that is !SECTION_IS_ONLINE and
>>>>> SECTION_MARKED_PRESENT, we could have any subsections initialized.
>>>>> The only indication I am aware of is pfn_zone_device_reserved() - which
>>>>> seems to check exactly what you are trying to skip here.
>>>>>
>>>>> Can't you somehow use pfn_zone_device_reserved() ? Or if you considered
>>>>> that already, why did you decide against it?
>>>>
>>>> No, in the current approach this function is no longer needed.
>>>> The reason why we changed the approach is that otherwise all pfn
>>>> walkers would have to be aware of the uninitialized struct pages.
>>>
>>> We should use the same strategy for all pfn walkers then (effectively
>>> get rid of pfn_zone_device_reserved() if that's what we want).
>>
>> True, but this patch replaces "/proc/kpageflags: do not use uninitialized
>> struct pages". If we initialize the uninitialized struct pages, no pfn walker
>> will need to be aware of them.
> 
> So the function should go.
> 
>>
>>>
>>>>
>>>> As for SECTION_IS_ONLINE, I'm not sure now.
>>>> I will look into it next week.
>>>
>>> SECTION_IS_ONLINE does currently not apply to ZONE_DEVICE and due to
>>> sub-section support for ZONE_DEVICE, it cannot easily be reused.
>>>
>> It seems that SECTION_IS_ONLINE and SECTION_MARKED_PRESENT can be used to
>> distinguish uninitialized struct pages if we can apply them to ZONE_DEVICE,
>> but that is no longer necessary with this approach.
> 
> Let's take a step back here to understand the issues I am aware of. I
> think we should solve this for good now:
> 
> A PFN walker takes a look at a random PFN at a random point in time. It
> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
> options are:
> 
> 1. It is buddy memory (add_memory()) that has not been online yet. The
> memmap contains garbage. Don't access.
> 
> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
> 
> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
> is only partially present: E.g., device starts at offset 64MB within a
> section or the device ends at offset 64MB within a section. Don't access it.
> 
> 4. It is ZONE_DEVICE memory with an invalid memmap, because the memmap
> was not initialized yet. memmap_init_zone_device() did not yet succeed
> after dropping the mem_hotplug lock in mm/memremap.c. Don't access it.
> 
> 5. It is reserved ZONE_DEVICE memory ("pages mapped, but reserved for
> driver") with an invalid memmap. Don't access it.
> 
> I can see that your patch tries to make #5 vanish by initializing the
> memmap, fair enough. #3 and #4 can't be detected. The PFN walker could
> still stumble over uninitialized memmaps.
> 

FWIW, I think having something like pfn_zone_device(), implemented
similarly to pfn_zone_device_reserved(), could be one solution to
most issues.
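To make the walker's decision concrete, here is a rough user-space model of the guard being discussed — the section flags mirror the kernel's names, but `pfn_zone_device()`, the flag layout, and the section geometry are all hypothetical simplifications, not kernel API:

```c
/* User-space model (not kernel code) of the decision a PFN walker has to
 * make for the five cases listed above. The real kernel tracks section
 * state in mem_section->section_mem_map; this is a simplified stand-in. */
#include <stdbool.h>
#include <stdint.h>

#define SECTION_MARKED_PRESENT  (1u << 0)
#define SECTION_IS_ONLINE       (1u << 1)

#define NR_MODEL_SECTIONS  4
#define PFN_SECTION_SHIFT  15   /* 128MB sections with 4K pages */

static unsigned int section_flags[NR_MODEL_SECTIONS];
/* Hypothetical per-section "this is valid ZONE_DEVICE memory" bit. */
static bool section_is_zone_device[NR_MODEL_SECTIONS];

static bool pfn_zone_device(uint64_t pfn)
{
    return section_is_zone_device[pfn >> PFN_SECTION_SHIFT];
}

/* A walker may only touch the memmap if the section is present and
 * either online (the pfn_to_online_page() case) or known to be fully
 * initialized ZONE_DEVICE memory (case #2 above). Everything else
 * (cases #1, #3, #4, #5) must be skipped. */
static bool pfn_walker_may_access(uint64_t pfn)
{
    unsigned int flags = section_flags[pfn >> PFN_SECTION_SHIFT];

    if (!(flags & SECTION_MARKED_PRESENT))
        return false;
    if (flags & SECTION_IS_ONLINE)
        return true;
    return pfn_zone_device(pfn);
}
```

A walker loop would call `pfn_walker_may_access()` before dereferencing `pfn_to_page()`, which is exactly the check /proc/kpageflags was missing in the reported panic.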
Dan Williams Sept. 9, 2019, 11:53 a.m. UTC | #7
On Mon, Sep 9, 2019 at 1:11 AM David Hildenbrand <david@redhat.com> wrote:
[..]
> >> It seems that SECTION_IS_ONLINE and SECTION_MARKED_PRESENT can be used to
> >> distinguish uninitialized struct pages if we can apply them to ZONE_DEVICE,
> >> but that is no longer necessary with this approach.
> >
> > Let's take a step back here to understand the issues I am aware of. I
> > think we should solve this for good now:
> >
> > A PFN walker takes a look at a random PFN at a random point in time. It
> > finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
> > options are:
> >
> > 1. It is buddy memory (add_memory()) that has not been online yet. The
> > memmap contains garbage. Don't access.
> >
> > 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
> >
> > 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
> > is only partially present: E.g., device starts at offset 64MB within a
> > section or the device ends at offset 64MB within a section. Don't access it.
> >
> > 4. It is ZONE_DEVICE memory with an invalid memmap, because the memmap
> > was not initialized yet. memmap_init_zone_device() did not yet succeed
> > after dropping the mem_hotplug lock in mm/memremap.c. Don't access it.
> >
> > 5. It is reserved ZONE_DEVICE memory ("pages mapped, but reserved for
> > driver") with an invalid memmap. Don't access it.
> >
> > I can see that your patch tries to make #5 vanish by initializing the
> > memmap, fair enough. #3 and #4 can't be detected. The PFN walker could
> > still stumble over uninitialized memmaps.
> >
>
> FWIW, I think having something like pfn_zone_device(), implemented
> similarly to pfn_zone_device_reserved(), could be one solution to
> most issues.

I've been thinking of a replacement for PTE_DEVMAP with section-level
or sub-section-level flags. The section-level flag would still require
a call to get_dev_pagemap() to validate the pfn in the sub-section
case, which seems to be entirely too much overhead. If ZONE_DEVICE is
to be a first-class citizen in pfn walkers I think it would be worth
the cost to double the size of subsection_map and to identify whether
a sub-section is ZONE_DEVICE, or not.

Thoughts?
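A user-space model of what "doubling the size of subsection_map" could look like — a second bitmap alongside the presence bitmap, so each sub-section can be classified as ZONE_DEVICE without calling get_dev_pagemap(). The struct layout and names are hypothetical; in the kernel a section has 64 sub-sections of 2MB each on x86_64:

```c
/* Model of a per-section pair of bitmaps: one bit per sub-section for
 * "present" (today's subsection_map) and one for "is ZONE_DEVICE"
 * (the proposed addition). Simplified, not kernel code. */
#include <stdbool.h>
#include <stdint.h>

#define SUBSECTIONS_PER_SECTION 64

struct model_section {
    uint64_t subsection_map;  /* sub-section is present */
    uint64_t device_map;      /* sub-section is ZONE_DEVICE (the new map) */
};

static bool subsection_present(const struct model_section *ms, unsigned int idx)
{
    return ms->subsection_map & (1ull << idx);
}

static bool subsection_is_zone_device(const struct model_section *ms,
                                      unsigned int idx)
{
    return ms->device_map & (1ull << idx);
}

/* Mark sub-sections [start, start + count) present and ZONE_DEVICE,
 * e.g. for a device that begins at a 64MB offset into the section. */
static void model_add_device_range(struct model_section *ms,
                                   unsigned int start, unsigned int count)
{
    /* count >= 64 would overflow the shift; treat it as "whole section". */
    uint64_t mask = (count >= 64) ? ~0ull : (((1ull << count) - 1) << start);

    ms->subsection_map |= mask;
    ms->device_map |= mask;
}
```

A pfn walker could then answer "is this sub-section ZONE_DEVICE?" with one bit test instead of a get_dev_pagemap() lookup.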
David Hildenbrand Sept. 9, 2019, 12:05 p.m. UTC | #8
On 09.09.19 13:53, Dan Williams wrote:
> On Mon, Sep 9, 2019 at 1:11 AM David Hildenbrand <david@redhat.com> wrote:
> [..]
>>>> It seems that SECTION_IS_ONLINE and SECTION_MARKED_PRESENT can be used to
>>>> distinguish uninitialized struct pages if we can apply them to ZONE_DEVICE,
>>>> but that is no longer necessary with this approach.
>>>
>>> Let's take a step back here to understand the issues I am aware of. I
>>> think we should solve this for good now:
>>>
>>> A PFN walker takes a look at a random PFN at a random point in time. It
>>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
>>> options are:
>>>
>>> 1. It is buddy memory (add_memory()) that has not been online yet. The
>>> memmap contains garbage. Don't access.
>>>
>>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>>>
>>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
>>> is only partially present: E.g., device starts at offset 64MB within a
>>> section or the device ends at offset 64MB within a section. Don't access it.
>>>
>>> 4. It is ZONE_DEVICE memory with an invalid memmap, because the memmap
>>> was not initialized yet. memmap_init_zone_device() did not yet succeed
>>> after dropping the mem_hotplug lock in mm/memremap.c. Don't access it.
>>>
>>> 5. It is reserved ZONE_DEVICE memory ("pages mapped, but reserved for
>>> driver") with an invalid memmap. Don't access it.
>>>
>>> I can see that your patch tries to make #5 vanish by initializing the
>>> memmap, fair enough. #3 and #4 can't be detected. The PFN walker could
>>> still stumble over uninitialized memmaps.
>>>
>>
>> FWIW, I think having something like pfn_zone_device(), implemented
>> similarly to pfn_zone_device_reserved(), could be one solution to
>> most issues.
> 
> I've been thinking of a replacement for PTE_DEVMAP with section-level
> or sub-section-level flags. The section-level flag would still require
> a call to get_dev_pagemap() to validate the pfn in the sub-section
> case, which seems to be entirely too much overhead. If ZONE_DEVICE is
> to be a first-class citizen in pfn walkers I think it would be worth
> the cost to double the size of subsection_map and to identify whether
> a sub-section is ZONE_DEVICE, or not.
> 
> Thoughts?
> 

I thought about this last week and came up with something like

1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE

2. Make pfn_to_online_page() also check that it's not ZONE_DEVICE.
Online pfns are limited to !ZONE_DEVICE.

3. Extend subsection_map to an additional active_map

4. Set SECTION_IS_ACTIVE *iff* the whole active_map is set. This keeps
most accesses of pfn_to_online_page() fast. If !SECTION_IS_ACTIVE, check
the active_map.

5. Set sub-sections active/inactive in
move_pfn_range_to_zone()/remove_pfn_range_from_zone() - see "[PATCH v4
0/8] mm/memory_hotplug: Shrink zones before removing memory" for the
latter.

6. Set boot memory properly active (this is a tricky bit :/ ).

However, it turned out too complex for my taste (and limited time to
spend on this), so I abandoned that idea for now. If somebody wants to
pick that up, fine.
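For anyone who wants to pick this up, a user-space sketch of steps 3-5 — the fast-path flag plus per-sub-section active_map. All names and the single-section scope are hypothetical simplifications:

```c
/* Model of the proposed active_map scheme: a section flag answers the
 * common fully-active case with one test, and the per-sub-section bitmap
 * is consulted only for partially active sections. Not kernel code. */
#include <stdbool.h>
#include <stdint.h>

#define SECTION_IS_ACTIVE       (1u << 0)
#define SUBSECTIONS_PER_SECTION 64

struct model_section {
    unsigned int flags;
    uint64_t active_map;   /* one bit per sub-section */
};

/* Step 5: activate a sub-section range (as move_pfn_range_to_zone()
 * would), and per step 4 set SECTION_IS_ACTIVE iff the whole map is set. */
static void model_activate(struct model_section *ms, uint64_t mask)
{
    ms->active_map |= mask;
    if (ms->active_map == ~0ull)
        ms->flags |= SECTION_IS_ACTIVE;
}

/* The pfn_to_online_page()-style check: fast path first. */
static bool model_pfn_active(const struct model_section *ms,
                             unsigned int subsec)
{
    if (ms->flags & SECTION_IS_ACTIVE)   /* fully active: one test */
        return true;
    return ms->active_map & (1ull << subsec);
}
```

The point of step 4 is visible here: once a section is fully active, walkers never touch the bitmap again.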
Dan Williams Sept. 10, 2019, 9:21 a.m. UTC | #9
On Mon, Sep 9, 2019 at 5:06 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 09.09.19 13:53, Dan Williams wrote:
> > On Mon, Sep 9, 2019 at 1:11 AM David Hildenbrand <david@redhat.com> wrote:
> > [..]
> >>>> It seems that SECTION_IS_ONLINE and SECTION_MARKED_PRESENT can be used to
> >>>> distinguish uninitialized struct pages if we can apply them to ZONE_DEVICE,
> >>>> but that is no longer necessary with this approach.
> >>>
> >>> Let's take a step back here to understand the issues I am aware of. I
> >>> think we should solve this for good now:
> >>>
> >>> A PFN walker takes a look at a random PFN at a random point in time. It
> >>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
> >>> options are:
> >>>
> >>> 1. It is buddy memory (add_memory()) that has not been online yet. The
> >>> memmap contains garbage. Don't access.
> >>>
> >>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
> >>>
> >>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
> >>> is only partially present: E.g., device starts at offset 64MB within a
> >>> section or the device ends at offset 64MB within a section. Don't access it.
> >>>
> >>> 4. It is ZONE_DEVICE memory with an invalid memmap, because the memmap
> >>> was not initialized yet. memmap_init_zone_device() did not yet succeed
> >>> after dropping the mem_hotplug lock in mm/memremap.c. Don't access it.
> >>>
> >>> 5. It is reserved ZONE_DEVICE memory ("pages mapped, but reserved for
> >>> driver") with an invalid memmap. Don't access it.
> >>>
> >>> I can see that your patch tries to make #5 vanish by initializing the
> >>> memmap, fair enough. #3 and #4 can't be detected. The PFN walker could
> >>> still stumble over uninitialized memmaps.
> >>>
> >>
> >> FWIW, I think having something like pfn_zone_device(), implemented
> >> similarly to pfn_zone_device_reserved(), could be one solution to
> >> most issues.
> >
> > I've been thinking of a replacement for PTE_DEVMAP with section-level
> > or sub-section-level flags. The section-level flag would still require
> > a call to get_dev_pagemap() to validate the pfn in the sub-section
> > case, which seems to be entirely too much overhead. If ZONE_DEVICE is
> > to be a first-class citizen in pfn walkers I think it would be worth
> > the cost to double the size of subsection_map and to identify whether
> > a sub-section is ZONE_DEVICE, or not.
> >
> > Thoughts?
> >
>
> I thought about this last week and came up with something like
>
> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
>
> 2. Make pfn_to_online_page() also check that it's not ZONE_DEVICE.
> Online pfns are limited to !ZONE_DEVICE.
>
> 3. Extend subsection_map to an additional active_map
>
> 4. Set SECTION_IS_ACTIVE *iff* the whole active_map is set. This keeps
> most accesses of pfn_to_online_page() fast. If !SECTION_IS_ACTIVE, check
> the active_map.
>
> 5. Set sub-sections active/inactive in
> move_pfn_range_to_zone()/remove_pfn_range_from_zone() - see "[PATCH v4
> 0/8] mm/memory_hotplug: Shrink zones before removing memory" for the
> latter.
>
> 6. Set boot memory properly active (this is a tricky bit :/ ).
>
> However, it turned out too complex for my taste (and limited time to
> spend on this), so I abandoned that idea for now. If somebody wants to
> pick that up, fine.
>

That seems to solve the pfn walk case but it would not address the
need for PTE_DEVMAP or speed up the other places that want an
efficient way to determine if it's worthwhile to call
get_dev_pagemap().
David Hildenbrand Sept. 10, 2019, 10:10 a.m. UTC | #10
On 10.09.19 11:21, Dan Williams wrote:
> On Mon, Sep 9, 2019 at 5:06 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 09.09.19 13:53, Dan Williams wrote:
>>> On Mon, Sep 9, 2019 at 1:11 AM David Hildenbrand <david@redhat.com> wrote:
>>> [..]
>>>>>> It seems that SECTION_IS_ONLINE and SECTION_MARKED_PRESENT can be used to
>>>>>> distinguish uninitialized struct pages if we can apply them to ZONE_DEVICE,
>>>>>> but that is no longer necessary with this approach.
>>>>>
>>>>> Let's take a step back here to understand the issues I am aware of. I
>>>>> think we should solve this for good now:
>>>>>
>>>>> A PFN walker takes a look at a random PFN at a random point in time. It
>>>>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
>>>>> options are:
>>>>>
>>>>> 1. It is buddy memory (add_memory()) that has not been online yet. The
>>>>> memmap contains garbage. Don't access.
>>>>>
>>>>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>>>>>
>>>>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
>>>>> is only partially present: E.g., device starts at offset 64MB within a
>>>>> section or the device ends at offset 64MB within a section. Don't access it.
>>>>>
>>>>> 4. It is ZONE_DEVICE memory with an invalid memmap, because the memmap
>>>>> was not initialized yet. memmap_init_zone_device() did not yet succeed
>>>>> after dropping the mem_hotplug lock in mm/memremap.c. Don't access it.
>>>>>
>>>>> 5. It is reserved ZONE_DEVICE memory ("pages mapped, but reserved for
>>>>> driver") with an invalid memmap. Don't access it.
>>>>>
>>>>> I can see that your patch tries to make #5 vanish by initializing the
>>>>> memmap, fair enough. #3 and #4 can't be detected. The PFN walker could
>>>>> still stumble over uninitialized memmaps.
>>>>>
>>>>
>>>> FWIW, I think having something like pfn_zone_device(), implemented
>>>> similarly to pfn_zone_device_reserved(), could be one solution to
>>>> most issues.
>>>
>>> I've been thinking of a replacement for PTE_DEVMAP with section-level
>>> or sub-section-level flags. The section-level flag would still require
>>> a call to get_dev_pagemap() to validate the pfn in the sub-section
>>> case, which seems to be entirely too much overhead. If ZONE_DEVICE is
>>> to be a first-class citizen in pfn walkers I think it would be worth
>>> the cost to double the size of subsection_map and to identify whether
>>> a sub-section is ZONE_DEVICE, or not.
>>>
>>> Thoughts?
>>>
>>
>> I thought about this last week and came up with something like
>>
>> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
>>
>> 2. Make pfn_to_online_page() also check that it's not ZONE_DEVICE.
>> Online pfns are limited to !ZONE_DEVICE.
>>
>> 3. Extend subsection_map to an additional active_map
>>
>> 4. Set SECTION_IS_ACTIVE *iff* the whole active_map is set. This keeps
>> most accesses of pfn_to_online_page() fast. If !SECTION_IS_ACTIVE, check
>> the active_map.
>>
>> 5. Set sub-sections active/inactive in
>> move_pfn_range_to_zone()/remove_pfn_range_from_zone() - see "[PATCH v4
>> 0/8] mm/memory_hotplug: Shrink zones before removing memory" for the
>> latter.
>>
>> 6. Set boot memory properly active (this is a tricky bit :/ ).
>>
>> However, it turned out too complex for my taste (and limited time to
>> spend on this), so I abandoned that idea for now. If somebody wants to
>> pick that up, fine.
>>
> 
> That seems to solve the pfn walk case but it would not address the
> need for PTE_DEVMAP or speed up the other places that want an
> efficient way to determine if it's worthwhile to call
> get_dev_pagemap().
> 

Then I probably didn't get what your suggestion would look like :)

Would you suggest - instead of reusing SECTION_IS_ONLINE - something
like SECTION_IS_DEVMAP / SECTION_IS_DEVICE, to decide whether to call
get_dev_pagemap()?

How do we deal with subsections? I would like to avoid storing subsection
information that is only useful for ZONE_DEVICE for all kinds of sections
(or could we then simply not store the information per subsection, but
instead check the device as initially suggested by me?)
Michal Hocko Sept. 10, 2019, 2:01 p.m. UTC | #11
On Fri 06-09-19 08:09:52, Toshiki Fukasawa wrote:
[...]
> @@ -5856,8 +5855,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>  		if (!altmap)
>  			return;
>  
> -		if (start_pfn == altmap->base_pfn)
> -			start_pfn += altmap->reserve;
>  		end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);

Who is actually setting reserve? This is something really impossible
to grep for in the kernel, and git grep on altmap->reserve doesn't show
anything AFAICS.

Btw. irrespective to this issue all three callers should be using
pfn_to_online_page rather than pfn_to_page AFAICS. It doesn't really
make sense to collect data for offline pfn ranges. They might be
uninitialized even without zone device.
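For context on the hunk quoted above, here is a user-space model of the initialization skip that the patch removes — `memmap_init_zone()` starting at `base_pfn + reserve` is what leaves the first few struct pages untouched. The struct layout and function shape are simplified stand-ins, not the kernel's:

```c
/* Model of the skip removed by the patch: with skip_reserved the first
 * 'reserve' struct pages of the altmap range are never initialized,
 * which is what PFN walkers like /proc/kpageflags then trip over. */
#include <stdbool.h>
#include <stdint.h>

struct model_page { unsigned long flags; };
struct model_altmap { uint64_t base_pfn; uint64_t reserve; };

#define INITIALIZED 0x1ul

static void model_memmap_init(struct model_page *memmap,
                              uint64_t start_pfn, uint64_t end_pfn,
                              const struct model_altmap *altmap,
                              bool skip_reserved)
{
    /* The removed hunk: skip the driver-reserved head of the range. */
    if (skip_reserved && altmap && start_pfn == altmap->base_pfn)
        start_pfn += altmap->reserve;

    for (uint64_t pfn = start_pfn; pfn < end_pfn; pfn++)
        memmap[pfn].flags = INITIALIZED;
}
```

With `skip_reserved` false (the patched behaviour), the whole range is initialized and the walkers see valid struct pages.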
Dan Williams Sept. 10, 2019, 2:53 p.m. UTC | #12
On Tue, Sep 10, 2019 at 7:01 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 06-09-19 08:09:52, Toshiki Fukasawa wrote:
> [...]
> > @@ -5856,8 +5855,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> >               if (!altmap)
> >                       return;
> >
> > -             if (start_pfn == altmap->base_pfn)
> > -                     start_pfn += altmap->reserve;
> >               end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
>
> Who is actually setting reserve? This is something really impossible
> to grep for in the kernel, and git grep on altmap->reserve doesn't show
> anything AFAICS.

Yes, it's difficult to grep, here is the use in the nvdimm case:

    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvdimm/pfn_devs.c#n600

>
> Btw. irrespective to this issue all three callers should be using
> pfn_to_online_page rather than pfn_to_page AFAICS. It doesn't really
> make sense to collect data for offline pfn ranges. They might be
> uninitialized even without zone device.

Agree.
Michal Hocko Sept. 10, 2019, 5:35 p.m. UTC | #13
On Tue 10-09-19 07:53:17, Dan Williams wrote:
> On Tue, Sep 10, 2019 at 7:01 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Fri 06-09-19 08:09:52, Toshiki Fukasawa wrote:
> > [...]
> > > @@ -5856,8 +5855,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> > >               if (!altmap)
> > >                       return;
> > >
> > > -             if (start_pfn == altmap->base_pfn)
> > > -                     start_pfn += altmap->reserve;
> > >               end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
> >
> > Who is actually setting reserve? This is something really impossible
> > to grep for in the kernel, and git grep on altmap->reserve doesn't show
> > anything AFAICS.
> 
> Yes, it's difficult to grep, here is the use in the nvdimm case:
> 
>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvdimm/pfn_devs.c#n600

Thanks! I am still not sure what the proper fix is but this is a useful
pointer.
Toshiki Fukasawa Sept. 17, 2019, 2:34 a.m. UTC | #14
On 2019/09/09 16:46, David Hildenbrand wrote:
> Let's take a step back here to understand the issues I am aware of. I
> think we should solve this for good now:
> 
> A PFN walker takes a look at a random PFN at a random point in time. It
> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
> options are:
> 
> 1. It is buddy memory (add_memory()) that has not been online yet. The
> memmap contains garbage. Don't access.
> 
> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
> 
> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
> is only partially present: E.g., device starts at offset 64MB within a
> section or the device ends at offset 64MB within a section. Don't access it.

I don't agree with case #3. In that case, the struct page area is not
allocated on ZONE_DEVICE, but on system memory. So I think we can access
the struct pages. What do you mean by "invalid memmap"?

> 
> 4. It is ZONE_DEVICE memory with an invalid memmap, because the memmap
> was not initialized yet. memmap_init_zone_device() did not yet succeed
> after dropping the mem_hotplug lock in mm/memremap.c. Don't access it.
> 
> 5. It is reserved ZONE_DEVICE memory ("pages mapped, but reserved for
> driver") with an invalid memmap. Don't access it.
> 
> I can see that your patch tries to make #5 vanish by initializing the
> memmap, fair enough. #3 and #4 can't be detected. The PFN walker could
> still stumble over uninitialized memmaps.
>
David Hildenbrand Sept. 17, 2019, 7:13 a.m. UTC | #15
On 17.09.19 04:34, Toshiki Fukasawa wrote:
> On 2019/09/09 16:46, David Hildenbrand wrote:
>> Let's take a step back here to understand the issues I am aware of. I
>> think we should solve this for good now:
>>
>> A PFN walker takes a look at a random PFN at a random point in time. It
>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
>> options are:
>>
>> 1. It is buddy memory (add_memory()) that has not been online yet. The
>> memmap contains garbage. Don't access.
>>
>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>>
>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
>> is only partially present: E.g., device starts at offset 64MB within a
>> section or the device ends at offset 64MB within a section. Don't access it.
> 
> I don't agree with case #3. In the case, struct page area is not allocated on
> ZONE_DEVICE, but is allocated on system memory. So I think we can access the
> struct pages. What do you mean "invalid memmap"?
No, that's not the case. There is no memory, especially not system
memory. We only allow partially present sections (sub-section memory
hotplug) for ZONE_DEVICE.

invalid memmap == memmap was not initialized == struct pages contains
garbage. There is a memmap, but accessing it (e.g., pfn_to_nid()) will
trigger a BUG.
Toshiki Fukasawa Sept. 17, 2019, 9:32 a.m. UTC | #16
On 2019/09/17 16:13, David Hildenbrand wrote:
> On 17.09.19 04:34, Toshiki Fukasawa wrote:
>> On 2019/09/09 16:46, David Hildenbrand wrote:
>>> Let's take a step back here to understand the issues I am aware of. I
>>> think we should solve this for good now:
>>>
>>> A PFN walker takes a look at a random PFN at a random point in time. It
>>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
>>> options are:
>>>
>>> 1. It is buddy memory (add_memory()) that has not been online yet. The
>>> memmap contains garbage. Don't access.
>>>
>>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>>>
>>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
>>> is only partially present: E.g., device starts at offset 64MB within a
>>> section or the device ends at offset 64MB within a section. Don't access it.
>>
>> I don't agree with case #3. In the case, struct page area is not allocated on
>> ZONE_DEVICE, but is allocated on system memory. So I think we can access the
>> struct pages. What do you mean "invalid memmap"?
> No, that's not the case. There is no memory, especially not system
> memory. We only allow partially present sections (sub-section memory
> hotplug) for ZONE_DEVICE.

Let me clarify my understanding. If I read correctly, the struct pages for
sections (including partially present sections) on ZONE_DEVICE are allocated
by vmemmap_populate(). And all the struct pages except those in the range
(altmap->base_pfn) to (altmap->base_pfn + altmap->reserve) are initialized
by memmap_init_zone() and memmap_init_zone_device().

Do struct pages for partially present sections go through a different process?

Thanks,
Toshiki Fukasawa
> 
> invalid memmap == memmap was not initialized == struct pages contains
> garbage. There is a memmap, but accessing it (e.g., pfn_to_nid()) will
> trigger a BUG.
>
David Hildenbrand Sept. 17, 2019, 10:20 a.m. UTC | #17
On 17.09.19 11:32, Toshiki Fukasawa wrote:
> On 2019/09/17 16:13, David Hildenbrand wrote:
>> On 17.09.19 04:34, Toshiki Fukasawa wrote:
>>> On 2019/09/09 16:46, David Hildenbrand wrote:
>>>> Let's take a step back here to understand the issues I am aware of. I
>>>> think we should solve this for good now:
>>>>
>>>> A PFN walker takes a look at a random PFN at a random point in time. It
>>>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
>>>> options are:
>>>>
>>>> 1. It is buddy memory (add_memory()) that has not been online yet. The
>>>> memmap contains garbage. Don't access.
>>>>
>>>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>>>>
>>>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
>>>> is only partially present: E.g., device starts at offset 64MB within a
>>>> section or the device ends at offset 64MB within a section. Don't access it.
>>>
>>> I don't agree with case #3. In the case, struct page area is not allocated on
>>> ZONE_DEVICE, but is allocated on system memory. So I think we can access the
>>> struct pages. What do you mean "invalid memmap"?
>> No, that's not the case. There is no memory, especially not system
>> memory. We only allow partially present sections (sub-section memory
>> hotplug) for ZONE_DEVICE.
> 
> Let me clear my thoughts. If I read correctly, the struct pages for sections
> (including partially present sections) on ZONE_DEVICE are allocated by
> vmemmap_populate(). And all the struct pages except (altmap->base_pfn) to
> (altmap->base_pfn + altmap->reserve) are initialized by memmap_init_zone()
> and memmap_init_zone_device().
> 
> Do struct pages for partially present sections go through a different process?

No. However, the memmap is initialized via move_pfn_range_to_zone(). So
partially present sections will have partially uninitialized memmaps.

But I get your point. I just saw that pfn_valid() does take care of the
subsection map via pfn_section_valid(). - pfn_present() does not, which
is weird, but ok.

So I agree, case #3 might have a partially uninitialized memmap, but
we can test via pfn_valid() if the memory is at least valid. So there is
some way to test. Then we're back to the race between adding the memory
and initializing the memmap.

Thanks!

> 
> Thanks,
> Toshiki Fukasawa
>>
Waiman Long Sept. 17, 2019, 3:49 p.m. UTC | #18
On 9/17/19 3:13 AM, David Hildenbrand wrote:
> On 17.09.19 04:34, Toshiki Fukasawa wrote:
>> On 2019/09/09 16:46, David Hildenbrand wrote:
>>> Let's take a step back here to understand the issues I am aware of. I
>>> think we should solve this for good now:
>>>
>>> A PFN walker takes a look at a random PFN at a random point in time. It
>>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
>>> options are:
>>>
>>> 1. It is buddy memory (add_memory()) that has not been online yet. The
>>> memmap contains garbage. Don't access.
>>>
>>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>>>
>>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
>>> is only partially present: E.g., device starts at offset 64MB within a
>>> section or the device ends at offset 64MB within a section. Don't access it.
>> I don't agree with case #3. In the case, struct page area is not allocated on
>> ZONE_DEVICE, but is allocated on system memory. So I think we can access the
>> struct pages. What do you mean "invalid memmap"?
> No, that's not the case. There is no memory, especially not system
> memory. We only allow partially present sections (sub-section memory
> hotplug) for ZONE_DEVICE.
>
> invalid memmap == memmap was not initialized == struct pages contains
> garbage. There is a memmap, but accessing it (e.g., pfn_to_nid()) will
> trigger a BUG.
>
As long as the page structures exist, they should be initialized to some
known state. We could set PagePoison for those invalid memmaps. It is the
garbage in those page structures that can cause problems if a struct page
walker scans those pages and tries to make sense of it.

Cheers,
Longman
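A user-space model of the poisoning idea — pre-fill not-yet-initialized page structures with a recognizable pattern so a walker can detect them instead of reading garbage. The kernel's page_init_poison() memsets the struct page range with 0xaa bytes (PAGE_POISON_PATTERN); the struct layout and the walker-side check below are simplified, hypothetical stand-ins:

```c
/* Model of page_init_poison()-style poisoning and a walker-side
 * detection check. Not kernel code; layout is simplified. */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_POISON_PATTERN 0xaa

struct model_page {
    unsigned long flags;
    unsigned long private;
};

/* Fill n page structures with the poison pattern, as would be done for
 * ranges whose memmap has not been (or will not be) initialized. */
static void model_page_init_poison(struct model_page *pages, size_t n)
{
    memset(pages, PAGE_POISON_PATTERN, n * sizeof(*pages));
}

/* Walker-side check: an all-poison struct page means "don't access". */
static bool model_page_is_poisoned(const struct model_page *page)
{
    const unsigned char *b = (const unsigned char *)page;

    for (size_t i = 0; i < sizeof(*page); i++)
        if (b[i] != PAGE_POISON_PATTERN)
            return false;
    return true;
}
```

Poisoning turns silent garbage into a detectable state, at the cost of writing the whole range up front, which is the overhead concern raised below.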
Qian Cai Sept. 17, 2019, 4:21 p.m. UTC | #19
On Tue, 2019-09-17 at 11:49 -0400, Waiman Long wrote:
> On 9/17/19 3:13 AM, David Hildenbrand wrote:
> > On 17.09.19 04:34, Toshiki Fukasawa wrote:
> > > On 2019/09/09 16:46, David Hildenbrand wrote:
> > > > Let's take a step back here to understand the issues I am aware of. I
> > > > think we should solve this for good now:
> > > > 
> > > > A PFN walker takes a look at a random PFN at a random point in time. It
> > > > finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
> > > > options are:
> > > > 
> > > > 1. It is buddy memory (add_memory()) that has not been online yet. The
> > > > memmap contains garbage. Don't access.
> > > > 
> > > > 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
> > > > 
> > > > 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
> > > > is only partially present: E.g., device starts at offset 64MB within a
> > > > section or the device ends at offset 64MB within a section. Don't access it.
> > > 
> > > I don't agree with case #3. In the case, struct page area is not allocated on
> > > ZONE_DEVICE, but is allocated on system memory. So I think we can access the
> > > struct pages. What do you mean "invalid memmap"?
> > 
> > No, that's not the case. There is no memory, especially not system
> > memory. We only allow partially present sections (sub-section memory
> > hotplug) for ZONE_DEVICE.
> > 
> > invalid memmap == memmap was not initialized == struct pages contains
> > garbage. There is a memmap, but accessing it (e.g., pfn_to_nid()) will
> > trigger a BUG.
> > 
> 
> As long as the page structures exist, they should be initialized to some
> known state. We could set PagePoison for those invalid memmap. It is the

Sounds like you want to run page_init_poison() by default.


> garbage in those page structures that can cause problems if a
> struct page walker scans those pages and tries to make sense of it.
Waiman Long Sept. 17, 2019, 5:04 p.m. UTC | #20
On 9/17/19 12:21 PM, Qian Cai wrote:
> On Tue, 2019-09-17 at 11:49 -0400, Waiman Long wrote:
>> On 9/17/19 3:13 AM, David Hildenbrand wrote:
>>> On 17.09.19 04:34, Toshiki Fukasawa wrote:
>>>> On 2019/09/09 16:46, David Hildenbrand wrote:
>>>>> Let's take a step back here to understand the issues I am aware of. I
>>>>> think we should solve this for good now:
>>>>>
>>>>> A PFN walker takes a look at a random PFN at a random point in time. It
>>>>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
>>>>> options are:
>>>>>
>>>>> 1. It is buddy memory (add_memory()) that has not been online yet. The
>>>>> memmap contains garbage. Don't access.
>>>>>
>>>>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>>>>>
>>>>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
>>>>> is only partially present: E.g., device starts at offset 64MB within a
>>>>> section or the device ends at offset 64MB within a section. Don't access it.
>>>> I don't agree with case #3. In the case, struct page area is not allocated on
>>>> ZONE_DEVICE, but is allocated on system memory. So I think we can access the
>>>> struct pages. What do you mean "invalid memmap"?
>>> No, that's not the case. There is no memory, especially not system
>>> memory. We only allow partially present sections (sub-section memory
>>> hotplug) for ZONE_DEVICE.
>>>
>>> invalid memmap == memmap was not initialized == struct pages contains
>>> garbage. There is a memmap, but accessing it (e.g., pfn_to_nid()) will
>>> trigger a BUG.
>>>
>> As long as the page structures exist, they should be initialized to some
>> known state. We could set PagePoison for those invalid memmap. It is the
> Sounds like you want to run page_init_poison() by default.

Yes, for those pages that are not otherwise initialized. I don't want to
run page_init_poison() for the whole ZONE_DEVICE memory range as it can
take a while if we are talking about TBs of persistent memory. Also, most
of the pages will be reinitialized anyway in the init process, so it is
mostly wasted effort. However, for those reserved pages that are not
being exported to the memory management layer, having them initialized
to a known state will cause fewer problems down the road.

Cheers,
Longman
David Hildenbrand Sept. 17, 2019, 8:23 p.m. UTC | #21
On 17.09.19 19:04, Waiman Long wrote:
> On 9/17/19 12:21 PM, Qian Cai wrote:
>> On Tue, 2019-09-17 at 11:49 -0400, Waiman Long wrote:
>>> On 9/17/19 3:13 AM, David Hildenbrand wrote:
>>>> On 17.09.19 04:34, Toshiki Fukasawa wrote:
>>>>> On 2019/09/09 16:46, David Hildenbrand wrote:
>>>>>> Let's take a step back here to understand the issues I am aware of. I
>>>>>> think we should solve this for good now:
>>>>>>
>>>>>> A PFN walker takes a look at a random PFN at a random point in time. It
>>>>>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
>>>>>> options are:
>>>>>>
>>>>>> 1. It is buddy memory (add_memory()) that has not been online yet. The
>>>>>> memmap contains garbage. Don't access.
>>>>>>
>>>>>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>>>>>>
>>>>>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
>>>>>> is only partially present: E.g., device starts at offset 64MB within a
>>>>>> section or the device ends at offset 64MB within a section. Don't access it.
>>>>> I don't agree with case #3. In the case, struct page area is not allocated on
>>>>> ZONE_DEVICE, but is allocated on system memory. So I think we can access the
>>>>> struct pages. What do you mean "invalid memmap"?
>>>> No, that's not the case. There is no memory, especially not system
>>>> memory. We only allow partially present sections (sub-section memory
>>>> hotplug) for ZONE_DEVICE.
>>>>
>>>> invalid memmap == memmap was not initialized == struct pages contains
>>>> garbage. There is a memmap, but accessing it (e.g., pfn_to_nid()) will
>>>> trigger a BUG.
>>>>
>>> As long as the page structures exist, they should be initialized to some
>>> known state. We could set PagePoison for those invalid memmap. It is the
>> Sounds like you want to run page_init_poison() by default.
> 
> Yes for those pages that are not initialized otherwise. I don't want to
> run page_init_poison() for the whole ZONE_DEVICE memory range as it can
> take a while if we are talking about TBs of persistent memory. Also most
> of the pages will be reinitialized anyway in the init process. So it is
> mostly a wasted effort. However, for those reserved pages that are not
> being exported to the memory management layer, having them initialized
> to a known state will cause less problem down the road.
> 

No hacks please. There has to be a proper way to identify if a memmap
was initialized or not. Fake-initializing a memmap is *not* the answer.
Toshiki Fukasawa Sept. 18, 2019, 2:16 a.m. UTC | #22
On 2019/09/17 19:20, David Hildenbrand wrote:
> On 17.09.19 11:32, Toshiki Fukasawa wrote:
>> On 2019/09/17 16:13, David Hildenbrand wrote:
>>> On 17.09.19 04:34, Toshiki Fukasawa wrote:
>>>> On 2019/09/09 16:46, David Hildenbrand wrote:
>>>>> Let's take a step back here to understand the issues I am aware of. I
>>>>> think we should solve this for good now:
>>>>>
>>>>> A PFN walker takes a look at a random PFN at a random point in time. It
>>>>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
>>>>> options are:
>>>>>
>>>>> 1. It is buddy memory (add_memory()) that has not been online yet. The
>>>>> memmap contains garbage. Don't access.
>>>>>
>>>>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>>>>>
>>>>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
>>>>> is only partially present: E.g., device starts at offset 64MB within a
>>>>> section or the device ends at offset 64MB within a section. Don't access it.
>>>>
>>>> I don't agree with case #3. In the case, struct page area is not allocated on
>>>> ZONE_DEVICE, but is allocated on system memory. So I think we can access the
>>>> struct pages. What do you mean "invalid memmap"?
>>> No, that's not the case. There is no memory, especially not system
>>> memory. We only allow partially present sections (sub-section memory
>>> hotplug) for ZONE_DEVICE.
>>
>> Let me clear my thoughts. If I read correctly, the struct pages for sections
>> (including partially present sections) on ZONE_DEVICE are allocated by
>> vmemmap_populate(). And all the struct pages except (altmap->base_pfn) to
>> (altmap->base_pfn + altmap->reserve) are initialized by memmap_init_zone()
>> and memmap_init_zone_device().
>>
>> Do struct pages for partially present sections go through a different process?
> 
> No. However, the memmap is initialized via move_pfn_range_to_zone(). So
> partially present sections will have partially uninitialized memmaps.
> 
Thank you for the explanation.
To my understanding, depending on the architecture, it is possible
that the struct pages for an entire section are allocated, but only
the pages in the zone are initialized.

Thanks,
Toshiki Fukasawa
Toshiki Fukasawa Sept. 18, 2019, 2:28 a.m. UTC | #23
On 2019/09/18 2:04, Waiman Long wrote:
> On 9/17/19 12:21 PM, Qian Cai wrote:
>> On Tue, 2019-09-17 at 11:49 -0400, Waiman Long wrote:
>>> On 9/17/19 3:13 AM, David Hildenbrand wrote:
>>>> On 17.09.19 04:34, Toshiki Fukasawa wrote:
>>>>> On 2019/09/09 16:46, David Hildenbrand wrote:
>>>>>> Let's take a step back here to understand the issues I am aware of. I
>>>>>> think we should solve this for good now:
>>>>>>
>>>>>> A PFN walker takes a look at a random PFN at a random point in time. It
>>>>>> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
>>>>>> options are:
>>>>>>
>>>>>> 1. It is buddy memory (add_memory()) that has not been online yet. The
>>>>>> memmap contains garbage. Don't access.
>>>>>>
>>>>>> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
>>>>>>
>>>>>> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
>>>>>> is only partially present: E.g., device starts at offset 64MB within a
>>>>>> section or the device ends at offset 64MB within a section. Don't access it.
>>>>> I don't agree with case #3. In the case, struct page area is not allocated on
>>>>> ZONE_DEVICE, but is allocated on system memory. So I think we can access the
>>>>> struct pages. What do you mean "invalid memmap"?
>>>> No, that's not the case. There is no memory, especially not system
>>>> memory. We only allow partially present sections (sub-section memory
>>>> hotplug) for ZONE_DEVICE.
>>>>
>>>> invalid memmap == memmap was not initialized == struct pages contains
>>>> garbage. There is a memmap, but accessing it (e.g., pfn_to_nid()) will
>>>> trigger a BUG.
>>>>
>>> As long as the page structures exist, they should be initialized to some
>>> known state. We could set PagePoison for those invalid memmap. It is the
>> Sounds like you want to run page_init_poison() by default.
> 
> Yes for those pages that are not initialized otherwise. I don't want to
> run page_init_poison() for the whole ZONE_DEVICE memory range as it can
> take a while if we are talking about TBs of persistent memory. Also most
> of the pages will be reinitialized anyway in the init process. So it is
> mostly a wasted effort. However, for those reserved pages that are not
> being exported to the memory management layer, having them initialized
> to a known state will cause less problem down the road.
> 
The same can be said about altmap->reserve. I think those pages should be
initialized since their struct pages exist.

Thanks,
Toshiki Fukasawa
David Hildenbrand Sept. 18, 2019, 7:30 a.m. UTC | #24
On 09.09.19 09:46, David Hildenbrand wrote:
> On 09.09.19 07:48, Toshiki Fukasawa wrote:
>> On 2019/09/06 19:35, David Hildenbrand wrote:
>>> On 06.09.19 12:02, Toshiki Fukasawa wrote:
>>>> Thank you for your feedback.
>>>>
>>>> On 2019/09/06 17:45, David Hildenbrand wrote:
>>>>> On 06.09.19 10:09, Toshiki Fukasawa wrote:
>>>>>> A kernel panic is observed during reading
>>>>>> /proc/kpage{cgroup,count,flags} for first few pfns allocated by
>>>>>> pmem namespace:
>>>>>>
>>>>>> BUG: unable to handle page fault for address: fffffffffffffffe
>>>>>> [  114.495280] #PF: supervisor read access in kernel mode
>>>>>> [  114.495738] #PF: error_code(0x0000) - not-present page
>>>>>> [  114.496203] PGD 17120e067 P4D 17120e067 PUD 171210067 PMD 0
>>>>>> [  114.496713] Oops: 0000 [#1] SMP PTI
>>>>>> [  114.497037] CPU: 9 PID: 1202 Comm: page-types Not tainted 5.3.0-rc1
>>>>>> [  114.497621] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
>>>>>> [  114.498706] RIP: 0010:stable_page_flags+0x27/0x3f0
>>>>>> [  114.499142] Code: 82 66 90 66 66 66 66 90 48 85 ff 0f 84 d1 03 00 00 41 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 02 0f 84 57 03 00 00 45 31 e4 48 8b 55 08 48 89 ef
>>>>>> [  114.500788] RSP: 0018:ffffa5e601a0fe60 EFLAGS: 00010202
>>>>>> [  114.501373] RAX: fffffffffffffffe RBX: ffffffffffffffff RCX: 0000000000000000
>>>>>> [  114.502009] RDX: 0000000000000001 RSI: 00007ffca13a7310 RDI: ffffd07489000000
>>>>>> [  114.502637] RBP: ffffd07489000000 R08: 0000000000000001 R09: 0000000000000000
>>>>>> [  114.503270] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000240000
>>>>>> [  114.503896] R13: 0000000000080000 R14: 00007ffca13a7310 R15: ffffa5e601a0ff08
>>>>>> [  114.504530] FS:  00007f0266c7f540(0000) GS:ffff962dbbac0000(0000) knlGS:0000000000000000
>>>>>> [  114.505245] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>> [  114.505754] CR2: fffffffffffffffe CR3: 000000023a204000 CR4: 00000000000006e0
>>>>>> [  114.506401] Call Trace:
>>>>>> [  114.506660]  kpageflags_read+0xb1/0x130
>>>>>> [  114.507051]  proc_reg_read+0x39/0x60
>>>>>> [  114.507387]  vfs_read+0x8a/0x140
>>>>>> [  114.507686]  ksys_pread64+0x61/0xa0
>>>>>> [  114.508021]  do_syscall_64+0x5f/0x1a0
>>>>>> [  114.508372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>>> [  114.508844] RIP: 0033:0x7f0266ba426b
>>>>>>
>>>>>> The first few pages of ZONE_DEVICE expressed as the range
>>>>>> (altmap->base_pfn) to (altmap->base_pfn + altmap->reserve) are
>>>>>> skipped by struct page initialization. Some pfn walkers like
>>>>>> /proc/kpage{cgroup, count, flags} can't handle these uninitialized
>>>>>> struct pages, which causes the error.
>>>>>>
>>>>>> In previous discussion, Dan seemed to have concern that the struct
>>>>>> page area of some pages indicated by vmem_altmap->reserve may not
>>>>>> be allocated. (See https://lore.kernel.org/lkml/CAPcyv4i5FjTOnPbXNcTzvt+e6RQYow0JRQwSFuxaa62LSuvzHQ@mail.gmail.com/)
>>>>>> However, arch_add_memory() called by devm_memremap_pages() allocates
>>>>>> struct page area for pages containing addresses in the range
>>>>>> (res.start) to (res.start + resource_size(res)), which include the
>>>>>> pages indicated by vmem_altmap->reserve. If I read correctly, it is
>>>>>> allocated as requested at least on x86_64. Also, memmap_init_zone()
>>>>>> initializes struct pages in the same range.
>>>>>> So I think the struct pages should be initialized.>
>>>>>
>>>>> For !ZONE_DEVICE memory, the memmap is valid with SECTION_IS_ONLINE -
>>>>> for the whole section. For ZONE_DEVICE memory we have no such
>>>>> indication. In any section that is !SECTION_IS_ONLINE and
>>>>> SECTION_MARKED_PRESENT, we could have any subsections initialized. >
>>>>> The only indication I am aware of is pfn_zone_device_reserved() - which
>>>>> seems to check exactly what you are trying to skip here.
>>>>>
>>>>> Can't you somehow use pfn_zone_device_reserved() ? Or if you considered
>>>>> that already, why did you decide against it?
>>>>
>>>> No, in current approach this function is no longer needed.
>>>> The reason why we change the approach is that all pfn walkers
>>>> have to be aware of the uninitialized struct pages.
>>>
>>> We should use the same strategy for all pfn walkers then (effectively
>>> get rid of pfn_zone_device_reserved() if that's what we want).
>>
>> True, but this patch replaces "/proc/kpageflags: do not use uninitialized
>> struct pages". If we initialize the uninitialized struct pages, no pfn walker
>> will need to be aware of them.
> 
> So the function should go.
> 
>>
>>>
>>>>
>>>> As for SECTION_IS_ONLINE, I'm not sure now.
>>>> I will look into it next week.
>>>
>>> SECTION_IS_ONLINE does currently not apply to ZONE_DEVICE and due to
>>> sub-section support for ZONE_DEVICE, it cannot easily be reused.
>>>
>> It seems that SECTION_IS_ONLINE and SECTION_MARKED_PRESENT can be used to
>> distinguish uninitialized struct pages if we can apply them to ZONE_DEVICE,
>> but that is no longer necessary with this approach.
> 
> Let's take a step back here to understand the issues I am aware of. I
> think we should solve this for good now:
> 
> A PFN walker takes a look at a random PFN at a random point in time. It
> finds a PFN with SECTION_MARKED_PRESENT && !SECTION_IS_ONLINE. The
> options are:
> 
> 1. It is buddy memory (add_memory()) that has not been online yet. The
> memmap contains garbage. Don't access.
> 
> 2. It is ZONE_DEVICE memory with a valid memmap. Access it.
> 
> 3. It is ZONE_DEVICE memory with an invalid memmap, because the section
> is only partially present: E.g., device starts at offset 64MB within a
> section or the device ends at offset 64MB within a section. Don't access it.

As Toshiki correctly objected, this case might in fact not be
relevant/possible. While pfn_present() does not take care of subsections
(maybe we should fix that), pfn_valid() will take care of subsections -
at least the !CONFIG_HAVE_ARCH_PFN_VALID implementation.

So as long as we check pfn_valid(), we won't be accessing parts of the
memmap we are not supposed to access.

That leaves us with #1, #4, and #5 vs. #2.

> 
> 4. It is ZONE_DEVICE memory with an invalid memmap, because the memmap
> was not initialized yet. memmap_init_zone_device() did not yet succeed
> after dropping the mem_hotplug lock in mm/memremap.c. Don't access it.
> 
> 5. It is reserved ZONE_DEVICE memory ("pages mapped, but reserved for
> driver") with an invalid memmap. Don't access it.
> 
> I can see that your patch tries to make #5 vanish by initializing the
> memmap, fair enough. #3 and #4 can't be detected. The PFN walker could
> still stumble over uninitialized memmaps.
>

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9c91949..6d180ae 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5846,8 +5846,7 @@  void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 
 #ifdef CONFIG_ZONE_DEVICE
 	/*
-	 * Honor reservation requested by the driver for this ZONE_DEVICE
-	 * memory. We limit the total number of pages to initialize to just
+	 * We limit the total number of pages to initialize to just
 	 * those that might contain the memory mapping. We will defer the
 	 * ZONE_DEVICE page initialization until after we have released
 	 * the hotplug lock.
@@ -5856,8 +5855,6 @@  void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		if (!altmap)
 			return;
 
-		if (start_pfn == altmap->base_pfn)
-			start_pfn += altmap->reserve;
 		end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
 	}
 #endif