[v5,01/10] mm/memunmap: Use the correct start and end pfn when removing pages from zone
diff mbox series

Message ID 20191001144011.3801-2-david@redhat.com
State New
Headers show
Series
  • mm/memory_hotplug: Shrink zones before removing memory
Related show

Commit Message

David Hildenbrand Oct. 1, 2019, 2:40 p.m. UTC
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>

With altmap, all the resource pfns are not initialized. While initializing
pfn, altmap reserve space is skipped. Hence when removing pfn from zone
skip pfns that were never initialized.

Update memunmap_pages to calculate start and end pfn based on altmap
values. This fixes a kernel crash that is observed when destroying
a namespace.

[   81.356173] kernel BUG at include/linux/mm.h:1107!
cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
    pc: c0000000004b9728: memunmap_pages+0x238/0x340
    lr: c0000000004b9724: memunmap_pages+0x234/0x340
...
    pid   = 3669, comm = ndctl
kernel BUG at include/linux/mm.h:1107!
[c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
[c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
[c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
[c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
[c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
[c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
[c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
[c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
[c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
[c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
[c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[ move all pfn-realted declarations into a single line ]
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memremap.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

Comments

David Hildenbrand Oct. 1, 2019, 2:57 p.m. UTC | #1
On 01.10.19 16:40, David Hildenbrand wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
> 
> With altmap, all the resource pfns are not initialized. While initializing
> pfn, altmap reserve space is skipped. Hence when removing pfn from zone
> skip pfns that were never initialized.
> 
> Update memunmap_pages to calculate start and end pfn based on altmap
> values. This fixes a kernel crash that is observed when destroying
> a namespace.
> 
> [   81.356173] kernel BUG at include/linux/mm.h:1107!
> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>     pc: c0000000004b9728: memunmap_pages+0x238/0x340
>     lr: c0000000004b9724: memunmap_pages+0x234/0x340
> ...
>     pid   = 3669, comm = ndctl
> kernel BUG at include/linux/mm.h:1107!
> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
> 
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> [ move all pfn-realted declarations into a single line ]
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memremap.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 557e53c6fb46..026788b2ac69 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>  void memunmap_pages(struct dev_pagemap *pgmap)
>  {
>  	struct resource *res = &pgmap->res;
> -	unsigned long pfn;
> +	unsigned long pfn, nr_pages, start_pfn, end_pfn;
>  	int nid;
>  
>  	dev_pagemap_kill(pgmap);
> @@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>  		put_page(pfn_to_page(pfn));
>  	dev_pagemap_cleanup(pgmap);
>  
> +	start_pfn = pfn_first(pgmap);
> +	end_pfn = pfn_end(pgmap);
> +	nr_pages = end_pfn - start_pfn;
> +
>  	/* pages are dead and unused, undo the arch mapping */
> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
> +	nid = page_to_nid(pfn_to_page(start_pfn));
>  
>  	mem_hotplug_begin();
>  	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> -		pfn = PHYS_PFN(res->start);
> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
> -				 PHYS_PFN(resource_size(res)), NULL);
> +		__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
> +			       nr_pages, NULL);
>  	} else {
>  		arch_remove_memory(nid, res->start, resource_size(res),
>  				pgmap_altmap(pgmap));
> 

Aneesh, I was wondering why the use of "res->start" is correct (and we
shouldn't also witch to start_pfn/nr_pages here. It would be good if Dan
could review.
David Hildenbrand Oct. 1, 2019, 3:03 p.m. UTC | #2
On 01.10.19 16:57, David Hildenbrand wrote:
> On 01.10.19 16:40, David Hildenbrand wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>
>> With altmap, all the resource pfns are not initialized. While initializing
>> pfn, altmap reserve space is skipped. Hence when removing pfn from zone
>> skip pfns that were never initialized.
>>
>> Update memunmap_pages to calculate start and end pfn based on altmap
>> values. This fixes a kernel crash that is observed when destroying
>> a namespace.
>>
>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>     pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>     lr: c0000000004b9724: memunmap_pages+0x234/0x340
>> ...
>>     pid   = 3669, comm = ndctl
>> kernel BUG at include/linux/mm.h:1107!
>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>> Cc: Logan Gunthorpe <logang@deltatee.com>
>> Cc: Ira Weiny <ira.weiny@intel.com>
>> Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> [ move all pfn-realted declarations into a single line ]
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  mm/memremap.c | 13 ++++++++-----
>>  1 file changed, 8 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 557e53c6fb46..026788b2ac69 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>  void memunmap_pages(struct dev_pagemap *pgmap)
>>  {
>>  	struct resource *res = &pgmap->res;
>> -	unsigned long pfn;
>> +	unsigned long pfn, nr_pages, start_pfn, end_pfn;
>>  	int nid;
>>  
>>  	dev_pagemap_kill(pgmap);
>> @@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>  		put_page(pfn_to_page(pfn));
>>  	dev_pagemap_cleanup(pgmap);
>>  
>> +	start_pfn = pfn_first(pgmap);
>> +	end_pfn = pfn_end(pgmap);
>> +	nr_pages = end_pfn - start_pfn;
>> +
>>  	/* pages are dead and unused, undo the arch mapping */
>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>> +	nid = page_to_nid(pfn_to_page(start_pfn));
>>  
>>  	mem_hotplug_begin();
>>  	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>> -		pfn = PHYS_PFN(res->start);
>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>> -				 PHYS_PFN(resource_size(res)), NULL);
>> +		__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
>> +			       nr_pages, NULL);
>>  	} else {
>>  		arch_remove_memory(nid, res->start, resource_size(res),
>>  				pgmap_altmap(pgmap));
>>
> 
> Aneesh, I was wondering why the use of "res->start" is correct (and we
> shouldn't also witch to start_pfn/nr_pages here. It would be good if Dan
> could review.
> 

To be more precise, I wonder if it should actually be

__remove_pages(page_zone(pfn_to_page(start_pfn)), res->start,
               resource_size(res))

IOW, keep calling __remove_pages() with the same parameters but read
nid/zone from the offset one.

Hope some memunmap_pages() expert can clarify.
Aneesh Kumar K.V Oct. 3, 2019, 4:48 p.m. UTC | #3
On 10/1/19 8:33 PM, David Hildenbrand wrote:
> On 01.10.19 16:57, David Hildenbrand wrote:
>> On 01.10.19 16:40, David Hildenbrand wrote:
>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>>
>>> With altmap, all the resource pfns are not initialized. While initializing
>>> pfn, altmap reserve space is skipped. Hence when removing pfn from zone
>>> skip pfns that were never initialized.
>>>
>>> Update memunmap_pages to calculate start and end pfn based on altmap
>>> values. This fixes a kernel crash that is observed when destroying
>>> a namespace.
>>>
>>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>>      pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>>      lr: c0000000004b9724: memunmap_pages+0x234/0x340
>>> ...
>>>      pid   = 3669, comm = ndctl
>>> kernel BUG at include/linux/mm.h:1107!
>>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>>> Cc: Logan Gunthorpe <logang@deltatee.com>
>>> Cc: Ira Weiny <ira.weiny@intel.com>
>>> Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> [ move all pfn-realted declarations into a single line ]
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   mm/memremap.c | 13 ++++++++-----
>>>   1 file changed, 8 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>> index 557e53c6fb46..026788b2ac69 100644
>>> --- a/mm/memremap.c
>>> +++ b/mm/memremap.c
>>> @@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>>   void memunmap_pages(struct dev_pagemap *pgmap)
>>>   {
>>>   	struct resource *res = &pgmap->res;
>>> -	unsigned long pfn;
>>> +	unsigned long pfn, nr_pages, start_pfn, end_pfn;
>>>   	int nid;
>>>   
>>>   	dev_pagemap_kill(pgmap);
>>> @@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>>   		put_page(pfn_to_page(pfn));
>>>   	dev_pagemap_cleanup(pgmap);
>>>   
>>> +	start_pfn = pfn_first(pgmap);
>>> +	end_pfn = pfn_end(pgmap);
>>> +	nr_pages = end_pfn - start_pfn;
>>> +
>>>   	/* pages are dead and unused, undo the arch mapping */
>>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>>> +	nid = page_to_nid(pfn_to_page(start_pfn));
>>>   
>>>   	mem_hotplug_begin();
>>>   	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>>> -		pfn = PHYS_PFN(res->start);
>>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>>> -				 PHYS_PFN(resource_size(res)), NULL);
>>> +		__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
>>> +			       nr_pages, NULL);
>>>   	} else {
>>>   		arch_remove_memory(nid, res->start, resource_size(res),
>>>   				pgmap_altmap(pgmap));
>>>
>>
>> Aneesh, I was wondering why the use of "res->start" is correct (and we
>> shouldn't also witch to start_pfn/nr_pages here. It would be good if Dan
>> could review.
>>
> 
> To be more precise, I wonder if it should actually be
> 
> __remove_pages(page_zone(pfn_to_page(start_pfn)), res->start,
>                 resource_size(res))
> 

yes, that would be make it much clear.

But for MEMORY_DEVICE_PRIVATE start_pfn and pfn should be same?


> IOW, keep calling __remove_pages() with the same parameters but read
> nid/zone from the offset one.
> 
> Hope some memunmap_pages() expert can clarify.
> 

-aneesh
David Hildenbrand Oct. 4, 2019, 9 a.m. UTC | #4
On 03.10.19 18:48, Aneesh Kumar K.V wrote:
> On 10/1/19 8:33 PM, David Hildenbrand wrote:
>> On 01.10.19 16:57, David Hildenbrand wrote:
>>> On 01.10.19 16:40, David Hildenbrand wrote:
>>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>>>
>>>> With altmap, all the resource pfns are not initialized. While initializing
>>>> pfn, altmap reserve space is skipped. Hence when removing pfn from zone
>>>> skip pfns that were never initialized.
>>>>
>>>> Update memunmap_pages to calculate start and end pfn based on altmap
>>>> values. This fixes a kernel crash that is observed when destroying
>>>> a namespace.
>>>>
>>>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>>>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>>>      pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>>>      lr: c0000000004b9724: memunmap_pages+0x234/0x340
>>>> ...
>>>>      pid   = 3669, comm = ndctl
>>>> kernel BUG at include/linux/mm.h:1107!
>>>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>>>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>>>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>>>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>>>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>>>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>>>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>>>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>>>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>>>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>>>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>>>
>>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>>>> Cc: Logan Gunthorpe <logang@deltatee.com>
>>>> Cc: Ira Weiny <ira.weiny@intel.com>
>>>> Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>> [ move all pfn-realted declarations into a single line ]
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>   mm/memremap.c | 13 ++++++++-----
>>>>   1 file changed, 8 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>> index 557e53c6fb46..026788b2ac69 100644
>>>> --- a/mm/memremap.c
>>>> +++ b/mm/memremap.c
>>>> @@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>>>   void memunmap_pages(struct dev_pagemap *pgmap)
>>>>   {
>>>>   	struct resource *res = &pgmap->res;
>>>> -	unsigned long pfn;
>>>> +	unsigned long pfn, nr_pages, start_pfn, end_pfn;
>>>>   	int nid;
>>>>   
>>>>   	dev_pagemap_kill(pgmap);
>>>> @@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>>>   		put_page(pfn_to_page(pfn));
>>>>   	dev_pagemap_cleanup(pgmap);
>>>>   
>>>> +	start_pfn = pfn_first(pgmap);
>>>> +	end_pfn = pfn_end(pgmap);
>>>> +	nr_pages = end_pfn - start_pfn;
>>>> +
>>>>   	/* pages are dead and unused, undo the arch mapping */
>>>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>>>> +	nid = page_to_nid(pfn_to_page(start_pfn));
>>>>   
>>>>   	mem_hotplug_begin();
>>>>   	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>>>> -		pfn = PHYS_PFN(res->start);
>>>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>>>> -				 PHYS_PFN(resource_size(res)), NULL);
>>>> +		__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
>>>> +			       nr_pages, NULL);
>>>>   	} else {
>>>>   		arch_remove_memory(nid, res->start, resource_size(res),
>>>>   				pgmap_altmap(pgmap));
>>>>
>>>
>>> Aneesh, I was wondering why the use of "res->start" is correct (and we
>>> shouldn't also witch to start_pfn/nr_pages here. It would be good if Dan
>>> could review.
>>>
>>
>> To be more precise, I wonder if it should actually be
>>
>> __remove_pages(page_zone(pfn_to_page(start_pfn)), res->start,
>>                 resource_size(res))
>>
> 
> yes, that would be make it much clear.
> 
> But for MEMORY_DEVICE_PRIVATE start_pfn and pfn should be same?

Okay, let's recap. We should call add_pages()/__remove_pages()
and arch_add_memory()/arch_remove_memory() with the exact same ranges.

So with PHYS_PFN(res->start) and PHYS_PFN(resource_size(res)

Now, only a subset of the pages gets actually initialized,
meaning the NID and the ZONE we read could be stale.
That, we have to fix.

What about something like this (am I missing something?):

From d77b5c50f86570819a437517a897cc40ed29eefb Mon Sep 17 00:00:00 2001
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Date: Fri, 27 Sep 2019 16:02:24 +0530
Subject: [PATCH] mm/memunmap: Don't access uninitialized memmap in
 memunmap_pages()

With an altmap, the memmap falling into the reserved altmap space are
not initialized and, therefore, contain a garbage NID and a garbage
zone. Make sure to read the NID/zone from a memmap that was initialzed.

This fixes a kernel crash that is observed when destroying a namespace:

[   81.356173] kernel BUG at include/linux/mm.h:1107!
cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
    pc: c0000000004b9728: memunmap_pages+0x238/0x340
    lr: c0000000004b9724: memunmap_pages+0x234/0x340
...
    pid   = 3669, comm = ndctl
kernel BUG at include/linux/mm.h:1107!
[c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
[c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
[c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
[c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
[c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
[c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
[c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
[c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
[c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
[c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
[c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[ minimze code changes, rephrase description ]
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memremap.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 557e53c6fb46..8b11c0da345c 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -123,6 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
 void memunmap_pages(struct dev_pagemap *pgmap)
 {
 	struct resource *res = &pgmap->res;
+	struct page *first_page;
 	unsigned long pfn;
 	int nid;
 
@@ -131,14 +132,16 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 		put_page(pfn_to_page(pfn));
 	dev_pagemap_cleanup(pgmap);
 
+	/* make sure to access a memmap that was actually initialized */
+	first_page = pfn_to_page(pfn_first(pgmap));
+
 	/* pages are dead and unused, undo the arch mapping */
-	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
+	nid = page_to_nid(first_page);
 
 	mem_hotplug_begin();
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-		pfn = PHYS_PFN(res->start);
-		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
-				 PHYS_PFN(resource_size(res)), NULL);
+		__remove_pages(page_zone(first_page), res->start,
+			       resource_size(res), NULL);
 	} else {
 		arch_remove_memory(nid, res->start, resource_size(res),
 				pgmap_altmap(pgmap));
David Hildenbrand Oct. 4, 2019, 9:03 a.m. UTC | #5
On 04.10.19 11:00, David Hildenbrand wrote:
> On 03.10.19 18:48, Aneesh Kumar K.V wrote:
>> On 10/1/19 8:33 PM, David Hildenbrand wrote:
>>> On 01.10.19 16:57, David Hildenbrand wrote:
>>>> On 01.10.19 16:40, David Hildenbrand wrote:
>>>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>>>>
>>>>> With altmap, all the resource pfns are not initialized. While initializing
>>>>> pfn, altmap reserve space is skipped. Hence when removing pfn from zone
>>>>> skip pfns that were never initialized.
>>>>>
>>>>> Update memunmap_pages to calculate start and end pfn based on altmap
>>>>> values. This fixes a kernel crash that is observed when destroying
>>>>> a namespace.
>>>>>
>>>>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>>>>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>>>>      pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>>>>      lr: c0000000004b9724: memunmap_pages+0x234/0x340
>>>>> ...
>>>>>      pid   = 3669, comm = ndctl
>>>>> kernel BUG at include/linux/mm.h:1107!
>>>>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>>>>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>>>>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>>>>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>>>>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>>>>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>>>>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>>>>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>>>>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>>>>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>>>>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>>>>
>>>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>>>>> Cc: Logan Gunthorpe <logang@deltatee.com>
>>>>> Cc: Ira Weiny <ira.weiny@intel.com>
>>>>> Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>> [ move all pfn-realted declarations into a single line ]
>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>> ---
>>>>>   mm/memremap.c | 13 ++++++++-----
>>>>>   1 file changed, 8 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>>> index 557e53c6fb46..026788b2ac69 100644
>>>>> --- a/mm/memremap.c
>>>>> +++ b/mm/memremap.c
>>>>> @@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>>>>   void memunmap_pages(struct dev_pagemap *pgmap)
>>>>>   {
>>>>>   	struct resource *res = &pgmap->res;
>>>>> -	unsigned long pfn;
>>>>> +	unsigned long pfn, nr_pages, start_pfn, end_pfn;
>>>>>   	int nid;
>>>>>   
>>>>>   	dev_pagemap_kill(pgmap);
>>>>> @@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>>>>   		put_page(pfn_to_page(pfn));
>>>>>   	dev_pagemap_cleanup(pgmap);
>>>>>   
>>>>> +	start_pfn = pfn_first(pgmap);
>>>>> +	end_pfn = pfn_end(pgmap);
>>>>> +	nr_pages = end_pfn - start_pfn;
>>>>> +
>>>>>   	/* pages are dead and unused, undo the arch mapping */
>>>>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>>>>> +	nid = page_to_nid(pfn_to_page(start_pfn));
>>>>>   
>>>>>   	mem_hotplug_begin();
>>>>>   	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>>>>> -		pfn = PHYS_PFN(res->start);
>>>>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>>>>> -				 PHYS_PFN(resource_size(res)), NULL);
>>>>> +		__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
>>>>> +			       nr_pages, NULL);
>>>>>   	} else {
>>>>>   		arch_remove_memory(nid, res->start, resource_size(res),
>>>>>   				pgmap_altmap(pgmap));
>>>>>
>>>>
>>>> Aneesh, I was wondering why the use of "res->start" is correct (and we
>>>> shouldn't also witch to start_pfn/nr_pages here. It would be good if Dan
>>>> could review.
>>>>
>>>
>>> To be more precise, I wonder if it should actually be
>>>
>>> __remove_pages(page_zone(pfn_to_page(start_pfn)), res->start,
>>>                 resource_size(res))
>>>
>>
>> yes, that would be make it much clear.
>>
>> But for MEMORY_DEVICE_PRIVATE start_pfn and pfn should be same?
> 
> Okay, let's recap. We should call add_pages()/__remove_pages()
> and arch_add_memory()/arch_remove_memory() with the exact same ranges.
> 
> So with PHYS_PFN(res->start) and PHYS_PFN(resource_size(res)
> 
> Now, only a subset of the pages gets actually initialized,
> meaning the NID and the ZONE we read could be stale.
> That, we have to fix.
> 
> What about something like this (am I missing something?):
> 
> From d77b5c50f86570819a437517a897cc40ed29eefb Mon Sep 17 00:00:00 2001
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
> Date: Fri, 27 Sep 2019 16:02:24 +0530
> Subject: [PATCH] mm/memunmap: Don't access uninitialized memmap in
>  memunmap_pages()
> 
> With an altmap, the memmap falling into the reserved altmap space are
> not initialized and, therefore, contain a garbage NID and a garbage
> zone. Make sure to read the NID/zone from a memmap that was initialzed.
> 
> This fixes a kernel crash that is observed when destroying a namespace:
> 
> [   81.356173] kernel BUG at include/linux/mm.h:1107!
> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>     pc: c0000000004b9728: memunmap_pages+0x238/0x340
>     lr: c0000000004b9724: memunmap_pages+0x234/0x340
> ...
>     pid   = 3669, comm = ndctl
> kernel BUG at include/linux/mm.h:1107!
> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
> 
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> [ minimze code changes, rephrase description ]
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memremap.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 557e53c6fb46..8b11c0da345c 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -123,6 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>  void memunmap_pages(struct dev_pagemap *pgmap)
>  {
>  	struct resource *res = &pgmap->res;
> +	struct page *first_page;
>  	unsigned long pfn;
>  	int nid;
>  
> @@ -131,14 +132,16 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>  		put_page(pfn_to_page(pfn));
>  	dev_pagemap_cleanup(pgmap);
>  
> +	/* make sure to access a memmap that was actually initialized */
> +	first_page = pfn_to_page(pfn_first(pgmap));
> +
>  	/* pages are dead and unused, undo the arch mapping */
> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
> +	nid = page_to_nid(first_page);
>  
>  	mem_hotplug_begin();
>  	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> -		pfn = PHYS_PFN(res->start);
> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
> -				 PHYS_PFN(resource_size(res)), NULL);
> +		__remove_pages(page_zone(first_page), res->start,
> +			       resource_size(res), NULL);

Keeping the PHYS_PFN() calls of course ...

>  	} else {
>  		arch_remove_memory(nid, res->start, resource_size(res),
>  				pgmap_altmap(pgmap));
>
David Hildenbrand Oct. 4, 2019, 9:33 a.m. UTC | #6
On 04.10.19 11:03, David Hildenbrand wrote:
> On 04.10.19 11:00, David Hildenbrand wrote:
>> On 03.10.19 18:48, Aneesh Kumar K.V wrote:
>>> On 10/1/19 8:33 PM, David Hildenbrand wrote:
>>>> On 01.10.19 16:57, David Hildenbrand wrote:
>>>>> On 01.10.19 16:40, David Hildenbrand wrote:
>>>>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>>>>>
>>>>>> With altmap, all the resource pfns are not initialized. While initializing
>>>>>> pfn, altmap reserve space is skipped. Hence when removing pfn from zone
>>>>>> skip pfns that were never initialized.
>>>>>>
>>>>>> Update memunmap_pages to calculate start and end pfn based on altmap
>>>>>> values. This fixes a kernel crash that is observed when destroying
>>>>>> a namespace.
>>>>>>
>>>>>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>>>>>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>>>>>      pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>>>>>      lr: c0000000004b9724: memunmap_pages+0x234/0x340
>>>>>> ...
>>>>>>      pid   = 3669, comm = ndctl
>>>>>> kernel BUG at include/linux/mm.h:1107!
>>>>>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>>>>>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>>>>>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>>>>>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>>>>>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>>>>>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>>>>>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>>>>>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>>>>>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>>>>>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>>>>>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>>>>>
>>>>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>>>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>>>>>> Cc: Logan Gunthorpe <logang@deltatee.com>
>>>>>> Cc: Ira Weiny <ira.weiny@intel.com>
>>>>>> Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>> [ move all pfn-realted declarations into a single line ]
>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>> ---
>>>>>>   mm/memremap.c | 13 ++++++++-----
>>>>>>   1 file changed, 8 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>>>> index 557e53c6fb46..026788b2ac69 100644
>>>>>> --- a/mm/memremap.c
>>>>>> +++ b/mm/memremap.c
>>>>>> @@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>>>>>   void memunmap_pages(struct dev_pagemap *pgmap)
>>>>>>   {
>>>>>>   	struct resource *res = &pgmap->res;
>>>>>> -	unsigned long pfn;
>>>>>> +	unsigned long pfn, nr_pages, start_pfn, end_pfn;
>>>>>>   	int nid;
>>>>>>   
>>>>>>   	dev_pagemap_kill(pgmap);
>>>>>> @@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>>>>>   		put_page(pfn_to_page(pfn));
>>>>>>   	dev_pagemap_cleanup(pgmap);
>>>>>>   
>>>>>> +	start_pfn = pfn_first(pgmap);
>>>>>> +	end_pfn = pfn_end(pgmap);
>>>>>> +	nr_pages = end_pfn - start_pfn;
>>>>>> +
>>>>>>   	/* pages are dead and unused, undo the arch mapping */
>>>>>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>>>>>> +	nid = page_to_nid(pfn_to_page(start_pfn));
>>>>>>   
>>>>>>   	mem_hotplug_begin();
>>>>>>   	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>>>>>> -		pfn = PHYS_PFN(res->start);
>>>>>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>>>>>> -				 PHYS_PFN(resource_size(res)), NULL);
>>>>>> +		__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
>>>>>> +			       nr_pages, NULL);
>>>>>>   	} else {
>>>>>>   		arch_remove_memory(nid, res->start, resource_size(res),
>>>>>>   				pgmap_altmap(pgmap));
>>>>>>
>>>>>
>>>>> Aneesh, I was wondering why the use of "res->start" is correct (and we
>>>>> shouldn't also witch to start_pfn/nr_pages here. It would be good if Dan
>>>>> could review.
>>>>>
>>>>
>>>> To be more precise, I wonder if it should actually be
>>>>
>>>> __remove_pages(page_zone(pfn_to_page(start_pfn)), res->start,
>>>>                 resource_size(res))
>>>>
>>>
>>> yes, that would be make it much clear.
>>>
>>> But for MEMORY_DEVICE_PRIVATE start_pfn and pfn should be same?
>>
>> Okay, let's recap. We should call add_pages()/__remove_pages()
>> and arch_add_memory()/arch_remove_memory() with the exact same ranges.
>>
>> So with PHYS_PFN(res->start) and PHYS_PFN(resource_size(res)
>>
>> Now, only a subset of the pages gets actually initialized,
>> meaning the NID and the ZONE we read could be stale.
>> That, we have to fix.
>>
>> What about something like this (am I missing something?):
>>
>> From d77b5c50f86570819a437517a897cc40ed29eefb Mon Sep 17 00:00:00 2001
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>> Date: Fri, 27 Sep 2019 16:02:24 +0530
>> Subject: [PATCH] mm/memunmap: Don't access uninitialized memmap in
>>  memunmap_pages()
>>
>> With an altmap, the memmap falling into the reserved altmap space are
>> not initialized and, therefore, contain a garbage NID and a garbage
>> zone. Make sure to read the NID/zone from a memmap that was initialzed.
>>
>> This fixes a kernel crash that is observed when destroying a namespace:
>>
>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>     pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>     lr: c0000000004b9724: memunmap_pages+0x234/0x340
>> ...
>>     pid   = 3669, comm = ndctl
>> kernel BUG at include/linux/mm.h:1107!
>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>> Cc: Logan Gunthorpe <logang@deltatee.com>
>> Cc: Ira Weiny <ira.weiny@intel.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> [ minimze code changes, rephrase description ]
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  mm/memremap.c | 11 +++++++----
>>  1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 557e53c6fb46..8b11c0da345c 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -123,6 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>  void memunmap_pages(struct dev_pagemap *pgmap)
>>  {
>>  	struct resource *res = &pgmap->res;
>> +	struct page *first_page;
>>  	unsigned long pfn;
>>  	int nid;
>>  
>> @@ -131,14 +132,16 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>  		put_page(pfn_to_page(pfn));
>>  	dev_pagemap_cleanup(pgmap);
>>  
>> +	/* make sure to access a memmap that was actually initialized */
>> +	first_page = pfn_to_page(pfn_first(pgmap));
>> +
>>  	/* pages are dead and unused, undo the arch mapping */
>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>> +	nid = page_to_nid(first_page);
>>  
>>  	mem_hotplug_begin();
>>  	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>> -		pfn = PHYS_PFN(res->start);
>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>> -				 PHYS_PFN(resource_size(res)), NULL);
>> +		__remove_pages(page_zone(first_page), res->start,
>> +			       resource_size(res), NULL);
> 
> Keeping the PHYS_PFN() calls of course ...
> 
>>  	} else {
>>  		arch_remove_memory(nid, res->start, resource_size(res),
>>  				pgmap_altmap(pgmap));
>>
> 
> 

The current state (with the modified patch) can be found at:
	https://github.com/davidhildenbrand/linux/tree/zone_offline

@Aneesh, It would be great if you could test the namespace removal
thingy and tell me if we are still missing something :)
Aneesh Kumar K.V Oct. 5, 2019, 6:13 a.m. UTC | #7
On 10/4/19 2:33 PM, David Hildenbrand wrote:
> On 04.10.19 11:00, David Hildenbrand wrote:
>> On 03.10.19 18:48, Aneesh Kumar K.V wrote:
>>> On 10/1/19 8:33 PM, David Hildenbrand wrote:
>>>> On 01.10.19 16:57, David Hildenbrand wrote:
>>>>> On 01.10.19 16:40, David Hildenbrand wrote:
>>>>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>>>>>
>>>>>> With altmap, all the resource pfns are not initialized. While initializing
>>>>>> pfn, altmap reserve space is skipped. Hence when removing pfn from zone
>>>>>> skip pfns that were never initialized.
>>>>>>
>>>>>> Update memunmap_pages to calculate start and end pfn based on altmap
>>>>>> values. This fixes a kernel crash that is observed when destroying
>>>>>> a namespace.
>>>>>>
>>>>>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>>>>>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>>>>>       pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>>>>>       lr: c0000000004b9724: memunmap_pages+0x234/0x340
>>>>>> ...
>>>>>>       pid   = 3669, comm = ndctl
>>>>>> kernel BUG at include/linux/mm.h:1107!
>>>>>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>>>>>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>>>>>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>>>>>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>>>>>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>>>>>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>>>>>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>>>>>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>>>>>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>>>>>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>>>>>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>>>>>
>>>>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>>>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>>>>>> Cc: Logan Gunthorpe <logang@deltatee.com>
>>>>>> Cc: Ira Weiny <ira.weiny@intel.com>
>>>>>> Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>> [ move all pfn-realted declarations into a single line ]
>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>> ---
>>>>>>    mm/memremap.c | 13 ++++++++-----
>>>>>>    1 file changed, 8 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>>>> index 557e53c6fb46..026788b2ac69 100644
>>>>>> --- a/mm/memremap.c
>>>>>> +++ b/mm/memremap.c
>>>>>> @@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>>>>>    void memunmap_pages(struct dev_pagemap *pgmap)
>>>>>>    {
>>>>>>    	struct resource *res = &pgmap->res;
>>>>>> -	unsigned long pfn;
>>>>>> +	unsigned long pfn, nr_pages, start_pfn, end_pfn;
>>>>>>    	int nid;
>>>>>>    
>>>>>>    	dev_pagemap_kill(pgmap);
>>>>>> @@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>>>>>    		put_page(pfn_to_page(pfn));
>>>>>>    	dev_pagemap_cleanup(pgmap);
>>>>>>    
>>>>>> +	start_pfn = pfn_first(pgmap);
>>>>>> +	end_pfn = pfn_end(pgmap);
>>>>>> +	nr_pages = end_pfn - start_pfn;
>>>>>> +
>>>>>>    	/* pages are dead and unused, undo the arch mapping */
>>>>>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>>>>>> +	nid = page_to_nid(pfn_to_page(start_pfn));
>>>>>>    
>>>>>>    	mem_hotplug_begin();
>>>>>>    	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>>>>>> -		pfn = PHYS_PFN(res->start);
>>>>>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>>>>>> -				 PHYS_PFN(resource_size(res)), NULL);
>>>>>> +		__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
>>>>>> +			       nr_pages, NULL);
>>>>>>    	} else {
>>>>>>    		arch_remove_memory(nid, res->start, resource_size(res),
>>>>>>    				pgmap_altmap(pgmap));
>>>>>>
>>>>>
>>>>> Aneesh, I was wondering why the use of "res->start" is correct (and we
>>>>> shouldn't also witch to start_pfn/nr_pages here. It would be good if Dan
>>>>> could review.
>>>>>
>>>>
>>>> To be more precise, I wonder if it should actually be
>>>>
>>>> __remove_pages(page_zone(pfn_to_page(start_pfn)), res->start,
>>>>                  resource_size(res))
>>>>
>>>
>>> yes, that would be make it much clear.
>>>
>>> But for MEMORY_DEVICE_PRIVATE start_pfn and pfn should be same?
>>
>> Okay, let's recap. We should call add_pages()/__remove_pages()
>> and arch_add_memory()/arch_remove_memory() with the exact same ranges.
>>
>> So with PHYS_PFN(res->start) and PHYS_PFN(resource_size(res)
>>
>> Now, only a subset of the pages gets actually initialized,
>> meaning the NID and the ZONE we read could be stale.
>> That, we have to fix.
>>
>> What about something like this (am I missing something?):
>>
>>  From d77b5c50f86570819a437517a897cc40ed29eefb Mon Sep 17 00:00:00 2001
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>> Date: Fri, 27 Sep 2019 16:02:24 +0530
>> Subject: [PATCH] mm/memunmap: Don't access uninitialized memmap in
>>   memunmap_pages()
>>
>> With an altmap, the memmap falling into the reserved altmap space are
>> not initialized and, therefore, contain a garbage NID and a garbage
>> zone. Make sure to read the NID/zone from a memmap that was initialzed.
>>
>> This fixes a kernel crash that is observed when destroying a namespace:
>>
>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>      pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>      lr: c0000000004b9724: memunmap_pages+0x234/0x340
>> ...
>>      pid   = 3669, comm = ndctl
>> kernel BUG at include/linux/mm.h:1107!
>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>> Cc: Logan Gunthorpe <logang@deltatee.com>
>> Cc: Ira Weiny <ira.weiny@intel.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> [ minimze code changes, rephrase description ]
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/memremap.c | 11 +++++++----
>>   1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 557e53c6fb46..8b11c0da345c 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -123,6 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>   void memunmap_pages(struct dev_pagemap *pgmap)
>>   {
>>   	struct resource *res = &pgmap->res;
>> +	struct page *first_page;
>>   	unsigned long pfn;
>>   	int nid;
>>   
>> @@ -131,14 +132,16 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>   		put_page(pfn_to_page(pfn));
>>   	dev_pagemap_cleanup(pgmap);
>>   
>> +	/* make sure to access a memmap that was actually initialized */
>> +	first_page = pfn_to_page(pfn_first(pgmap));
>> +
>>   	/* pages are dead and unused, undo the arch mapping */
>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>> +	nid = page_to_nid(first_page);
>>   
>>   	mem_hotplug_begin();
>>   	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>> -		pfn = PHYS_PFN(res->start);
>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>> -				 PHYS_PFN(resource_size(res)), NULL);
>> +		__remove_pages(page_zone(first_page), res->start,
>> +			       resource_size(res), NULL);
> 
> Keeping the PHYS_PFN() calls of course ...
> 


That is not different from what I posted right?  For MEMORY_DEVICE_PRIVATE


start_pfn  = PHYS_PFN(rest->start)


>>   	} else {
>>   		arch_remove_memory(nid, res->start, resource_size(res),
>>   				pgmap_altmap(pgmap));
>>
> 
> 

-aneesh
David Hildenbrand Oct. 6, 2019, 8:13 a.m. UTC | #8
On 05.10.19 08:13, Aneesh Kumar K.V wrote:
> On 10/4/19 2:33 PM, David Hildenbrand wrote:
>> On 04.10.19 11:00, David Hildenbrand wrote:
>>> On 03.10.19 18:48, Aneesh Kumar K.V wrote:
>>>> On 10/1/19 8:33 PM, David Hildenbrand wrote:
>>>>> On 01.10.19 16:57, David Hildenbrand wrote:
>>>>>> On 01.10.19 16:40, David Hildenbrand wrote:
>>>>>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>>>>>>
>>>>>>> With altmap, all the resource pfns are not initialized. While initializing
>>>>>>> pfn, altmap reserve space is skipped. Hence when removing pfn from zone
>>>>>>> skip pfns that were never initialized.
>>>>>>>
>>>>>>> Update memunmap_pages to calculate start and end pfn based on altmap
>>>>>>> values. This fixes a kernel crash that is observed when destroying
>>>>>>> a namespace.
>>>>>>>
>>>>>>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>>>>>>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>>>>>>       pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>>>>>>       lr: c0000000004b9724: memunmap_pages+0x234/0x340
>>>>>>> ...
>>>>>>>       pid   = 3669, comm = ndctl
>>>>>>> kernel BUG at include/linux/mm.h:1107!
>>>>>>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>>>>>>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>>>>>>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>>>>>>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>>>>>>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>>>>>>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>>>>>>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>>>>>>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>>>>>>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>>>>>>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>>>>>>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>>>>>>
>>>>>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>>>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>>>>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>>>>>>> Cc: Logan Gunthorpe <logang@deltatee.com>
>>>>>>> Cc: Ira Weiny <ira.weiny@intel.com>
>>>>>>> Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
>>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>>> [ move all pfn-realted declarations into a single line ]
>>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>>> ---
>>>>>>>    mm/memremap.c | 13 ++++++++-----
>>>>>>>    1 file changed, 8 insertions(+), 5 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>>>>> index 557e53c6fb46..026788b2ac69 100644
>>>>>>> --- a/mm/memremap.c
>>>>>>> +++ b/mm/memremap.c
>>>>>>> @@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>>>>>>    void memunmap_pages(struct dev_pagemap *pgmap)
>>>>>>>    {
>>>>>>>    	struct resource *res = &pgmap->res;
>>>>>>> -	unsigned long pfn;
>>>>>>> +	unsigned long pfn, nr_pages, start_pfn, end_pfn;
>>>>>>>    	int nid;
>>>>>>>    
>>>>>>>    	dev_pagemap_kill(pgmap);
>>>>>>> @@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>>>>>>    		put_page(pfn_to_page(pfn));
>>>>>>>    	dev_pagemap_cleanup(pgmap);
>>>>>>>    
>>>>>>> +	start_pfn = pfn_first(pgmap);
>>>>>>> +	end_pfn = pfn_end(pgmap);
>>>>>>> +	nr_pages = end_pfn - start_pfn;
>>>>>>> +
>>>>>>>    	/* pages are dead and unused, undo the arch mapping */
>>>>>>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>>>>>>> +	nid = page_to_nid(pfn_to_page(start_pfn));
>>>>>>>    
>>>>>>>    	mem_hotplug_begin();
>>>>>>>    	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>>>>>>> -		pfn = PHYS_PFN(res->start);
>>>>>>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>>>>>>> -				 PHYS_PFN(resource_size(res)), NULL);
>>>>>>> +		__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
>>>>>>> +			       nr_pages, NULL);
>>>>>>>    	} else {
>>>>>>>    		arch_remove_memory(nid, res->start, resource_size(res),
>>>>>>>    				pgmap_altmap(pgmap));
>>>>>>>
>>>>>>
>>>>>> Aneesh, I was wondering why the use of "res->start" is correct (and we
>>>>>> shouldn't also witch to start_pfn/nr_pages here. It would be good if Dan
>>>>>> could review.
>>>>>>
>>>>>
>>>>> To be more precise, I wonder if it should actually be
>>>>>
>>>>> __remove_pages(page_zone(pfn_to_page(start_pfn)), res->start,
>>>>>                  resource_size(res))
>>>>>
>>>>
>>>> yes, that would be make it much clear.
>>>>
>>>> But for MEMORY_DEVICE_PRIVATE start_pfn and pfn should be same?
>>>
>>> Okay, let's recap. We should call add_pages()/__remove_pages()
>>> and arch_add_memory()/arch_remove_memory() with the exact same ranges.
>>>
>>> So with PHYS_PFN(res->start) and PHYS_PFN(resource_size(res)
>>>
>>> Now, only a subset of the pages gets actually initialized,
>>> meaning the NID and the ZONE we read could be stale.
>>> That, we have to fix.
>>>
>>> What about something like this (am I missing something?):
>>>
>>>  From d77b5c50f86570819a437517a897cc40ed29eefb Mon Sep 17 00:00:00 2001
>>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>>> Date: Fri, 27 Sep 2019 16:02:24 +0530
>>> Subject: [PATCH] mm/memunmap: Don't access uninitialized memmap in
>>>   memunmap_pages()
>>>
>>> With an altmap, the memmap falling into the reserved altmap space are
>>> not initialized and, therefore, contain a garbage NID and a garbage
>>> zone. Make sure to read the NID/zone from a memmap that was initialzed.
>>>
>>> This fixes a kernel crash that is observed when destroying a namespace:
>>>
>>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>>> cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
>>>      pc: c0000000004b9728: memunmap_pages+0x238/0x340
>>>      lr: c0000000004b9724: memunmap_pages+0x234/0x340
>>> ...
>>>      pid   = 3669, comm = ndctl
>>> kernel BUG at include/linux/mm.h:1107!
>>> [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
>>> [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
>>> [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
>>> [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
>>> [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
>>> [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
>>> [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
>>> [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
>>> [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
>>> [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
>>> [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
>>>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Jason Gunthorpe <jgg@ziepe.ca>
>>> Cc: Logan Gunthorpe <logang@deltatee.com>
>>> Cc: Ira Weiny <ira.weiny@intel.com>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> [ minimze code changes, rephrase description ]
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   mm/memremap.c | 11 +++++++----
>>>   1 file changed, 7 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>> index 557e53c6fb46..8b11c0da345c 100644
>>> --- a/mm/memremap.c
>>> +++ b/mm/memremap.c
>>> @@ -123,6 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>>>   void memunmap_pages(struct dev_pagemap *pgmap)
>>>   {
>>>   	struct resource *res = &pgmap->res;
>>> +	struct page *first_page;
>>>   	unsigned long pfn;
>>>   	int nid;
>>>   
>>> @@ -131,14 +132,16 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>>   		put_page(pfn_to_page(pfn));
>>>   	dev_pagemap_cleanup(pgmap);
>>>   
>>> +	/* make sure to access a memmap that was actually initialized */
>>> +	first_page = pfn_to_page(pfn_first(pgmap));
>>> +
>>>   	/* pages are dead and unused, undo the arch mapping */
>>> -	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>>> +	nid = page_to_nid(first_page);
>>>   
>>>   	mem_hotplug_begin();
>>>   	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>>> -		pfn = PHYS_PFN(res->start);
>>> -		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>>> -				 PHYS_PFN(resource_size(res)), NULL);
>>> +		__remove_pages(page_zone(first_page), res->start,
>>> +			       resource_size(res), NULL);
>>
>> Keeping the PHYS_PFN() calls of course ...
>>
> 
> 
> That is not different from what I posted right?  For MEMORY_DEVICE_PRIVATE

I'll go with this modified patch for now, which does not change the ranges
passed to  __remove_pages(), because that looked like a unrelated change to me.
__remove_ages() has to be called with the same range as add_pages().


commit f66c2fa09293ae0f99afc4363146941579152409 (HEAD)
Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Date:   Fri Sep 27 16:02:24 2019 +0530

    mm/memunmap: Don't access uninitialized memmap in memunmap_pages()
    
    With an altmap, the memmap falling into the reserved altmap space are
    not initialized and, therefore, contain a garbage NID and a garbage
    zone. Make sure to read the NID/zone from a memmap that was initialzed.
    
    This fixes a kernel crash that is observed when destroying a namespace:
    
    [   81.356173] kernel BUG at include/linux/mm.h:1107!
    cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
        pc: c0000000004b9728: memunmap_pages+0x238/0x340
        lr: c0000000004b9724: memunmap_pages+0x234/0x340
    ...
        pid   = 3669, comm = ndctl
    kernel BUG at include/linux/mm.h:1107!
    [c000000274087ba0] c0000000009e3500 devm_action_release+0x30/0x50
    [c000000274087bc0] c0000000009e4758 release_nodes+0x268/0x2d0
    [c000000274087c30] c0000000009dd144 device_release_driver_internal+0x174/0x240
    [c000000274087c70] c0000000009d9dfc unbind_store+0x13c/0x190
    [c000000274087cb0] c0000000009d8a24 drv_attr_store+0x44/0x60
    [c000000274087cd0] c0000000005a7470 sysfs_kf_write+0x70/0xa0
    [c000000274087d10] c0000000005a5cac kernfs_fop_write+0x1ac/0x290
    [c000000274087d60] c0000000004be45c __vfs_write+0x3c/0x70
    [c000000274087d80] c0000000004c26e4 vfs_write+0xe4/0x200
    [c000000274087dd0] c0000000004c2a6c ksys_write+0x7c/0x140
    [c000000274087e20] c00000000000bbd0 system_call+0x5c/0x68
    
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Logan Gunthorpe <logang@deltatee.com>
    Cc: Ira Weiny <ira.weiny@intel.com>
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    [ minimze code changes, rephrase description ]
    Signed-off-by: David Hildenbrand <david@redhat.com>

diff --git a/mm/memremap.c b/mm/memremap.c
index 557e53c6fb46..8c2fb44c3b4d 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -123,6 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
 void memunmap_pages(struct dev_pagemap *pgmap)
 {
        struct resource *res = &pgmap->res;
+       struct page *first_page;
        unsigned long pfn;
        int nid;
 
@@ -131,14 +132,16 @@ void memunmap_pages(struct dev_pagemap *pgmap)
                put_page(pfn_to_page(pfn));
        dev_pagemap_cleanup(pgmap);
 
+       /* make sure to access a memmap that was actually initialized */
+       first_page = pfn_to_page(pfn_first(pgmap));
+
        /* pages are dead and unused, undo the arch mapping */
-       nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
+       nid = page_to_nid(first_page);
 
        mem_hotplug_begin();
        if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-               pfn = PHYS_PFN(res->start);
-               __remove_pages(page_zone(pfn_to_page(pfn)), pfn,
-                                PHYS_PFN(resource_size(res)), NULL);
+               __remove_pages(page_zone(first_page), PHYS_PFN(res->start),
+                              PHYS_PFN(resource_size(res)), NULL);
        } else {
                arch_remove_memory(nid, res->start, resource_size(res),
                                pgmap_altmap(pgmap));



As Andrew wants to give this some testing, I'll send out a new version
shortly, then we can discuss there.

Patch
diff mbox series

diff --git a/mm/memremap.c b/mm/memremap.c
index 557e53c6fb46..026788b2ac69 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -123,7 +123,7 @@  static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
 void memunmap_pages(struct dev_pagemap *pgmap)
 {
 	struct resource *res = &pgmap->res;
-	unsigned long pfn;
+	unsigned long pfn, nr_pages, start_pfn, end_pfn;
 	int nid;
 
 	dev_pagemap_kill(pgmap);
@@ -131,14 +131,17 @@  void memunmap_pages(struct dev_pagemap *pgmap)
 		put_page(pfn_to_page(pfn));
 	dev_pagemap_cleanup(pgmap);
 
+	start_pfn = pfn_first(pgmap);
+	end_pfn = pfn_end(pgmap);
+	nr_pages = end_pfn - start_pfn;
+
 	/* pages are dead and unused, undo the arch mapping */
-	nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
+	nid = page_to_nid(pfn_to_page(start_pfn));
 
 	mem_hotplug_begin();
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-		pfn = PHYS_PFN(res->start);
-		__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
-				 PHYS_PFN(resource_size(res)), NULL);
+		__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
+			       nr_pages, NULL);
 	} else {
 		arch_remove_memory(nid, res->start, resource_size(res),
 				pgmap_altmap(pgmap));