
mm/sparse: Fix kernel crash with pfn_section_valid check

Message ID 20200325031914.107660-1-aneesh.kumar@linux.ibm.com (mailing list archive)
State New, archived
Series: mm/sparse: Fix kernel crash with pfn_section_valid check

Commit Message

Aneesh Kumar K.V March 25, 2020, 3:19 a.m. UTC
Fixes the below crash

BUG: Kernel NULL pointer dereference on read at 0x00000000
Faulting instruction address: 0xc000000000c3447c
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
...
NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
LR [c000000000088354] vmemmap_free+0x144/0x320
Call Trace:
 section_deactivate+0x220/0x240
 __remove_pages+0x118/0x170
 arch_remove_memory+0x3c/0x150
 memunmap_pages+0x1cc/0x2f0
 devm_action_release+0x30/0x50
 release_nodes+0x2f8/0x3e0
 device_release_driver_internal+0x168/0x270
 unbind_store+0x130/0x170
 drv_attr_store+0x44/0x60
 sysfs_kf_write+0x68/0x80
 kernfs_fop_write+0x100/0x290
 __vfs_write+0x3c/0x70
 vfs_write+0xcc/0x240
 ksys_write+0x7c/0x140
 system_call+0x5c/0x68

With commit d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case"),
section_mem_map is set to NULL after depopulate_section_memmap(). This
was done so that pfn_to_page() can work correctly with a kernel config that
disables SPARSEMEM_VMEMMAP. With that config, pfn_to_page() does

	__section_mem_map_addr(__sec) + __pfn;
where

static inline struct page *__section_mem_map_addr(struct mem_section *section)
{
	unsigned long map = section->section_mem_map;
	map &= SECTION_MAP_MASK;
	return (struct page *)map;
}

Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is used to
check pfn validity (pfn_valid()). Since section_deactivate() releases
mem_section->usage if a section is fully deactivated, a pfn_valid() check after
a subsection deactivation causes a kernel crash.

static inline int pfn_valid(unsigned long pfn)
{
...
	return early_section(ms) || pfn_section_valid(ms, pfn);
}

where

static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
{
	int idx = subsection_map_index(pfn);

	return test_bit(idx, ms->usage->subsection_map);
}

Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed.
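
Clearing the flag makes valid_section() return false, so pfn_valid() bails
out before it can dereference the freed ms->usage. For reference,
valid_section() from include/linux/mmzone.h:

static inline int valid_section(struct mem_section *section)
{
	return (section && (section->section_mem_map & SECTION_HAS_MEM_MAP));
}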

Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
Cc: Baoquan He <bhe@redhat.com>
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/sparse.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Aneesh Kumar K.V March 25, 2020, 6:50 a.m. UTC | #1
Hi Andrew,

Can this be merged for v5.6? Without this, most persistent memory
commands (ndctl) will result in a kernel crash on v5.6.

-aneesh

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> Fixes the below crash
>
> BUG: Kernel NULL pointer dereference on read at 0x00000000
> Faulting instruction address: 0xc000000000c3447c
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
> ...
> NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
> LR [c000000000088354] vmemmap_free+0x144/0x320
> Call Trace:
>  section_deactivate+0x220/0x240
>  __remove_pages+0x118/0x170
>  arch_remove_memory+0x3c/0x150
>  memunmap_pages+0x1cc/0x2f0
>  devm_action_release+0x30/0x50
>  release_nodes+0x2f8/0x3e0
>  device_release_driver_internal+0x168/0x270
>  unbind_store+0x130/0x170
>  drv_attr_store+0x44/0x60
>  sysfs_kf_write+0x68/0x80
>  kernfs_fop_write+0x100/0x290
>  __vfs_write+0x3c/0x70
>  vfs_write+0xcc/0x240
>  ksys_write+0x7c/0x140
>  system_call+0x5c/0x68
>
> With commit d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case"),
> section_mem_map is set to NULL after depopulate_section_memmap(). This
> was done so that pfn_to_page() can work correctly with a kernel config that
> disables SPARSEMEM_VMEMMAP. With that config, pfn_to_page() does
>
> 	__section_mem_map_addr(__sec) + __pfn;
> where
>
> static inline struct page *__section_mem_map_addr(struct mem_section *section)
> {
> 	unsigned long map = section->section_mem_map;
> 	map &= SECTION_MAP_MASK;
> 	return (struct page *)map;
> }
>
> Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is used to
> check pfn validity (pfn_valid()). Since section_deactivate() releases
> mem_section->usage if a section is fully deactivated, a pfn_valid() check after
> a subsection deactivation causes a kernel crash.
>
> static inline int pfn_valid(unsigned long pfn)
> {
> ...
> 	return early_section(ms) || pfn_section_valid(ms, pfn);
> }
>
> where
>
> static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
> {
> 	int idx = subsection_map_index(pfn);
>
> 	return test_bit(idx, ms->usage->subsection_map);
> }
>
> Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed.
>
> Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
> Cc: Baoquan He <bhe@redhat.com>
> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  mm/sparse.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/mm/sparse.c b/mm/sparse.c
> index aadb7298dcef..3012d1f3771a 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -781,6 +781,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>  			ms->usage = NULL;
>  		}
>  		memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> +		/* Mark the section invalid */
> +		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
>  	}
>  
>  	if (section_is_early && memmap)
> -- 
> 2.25.1
Baoquan He March 25, 2020, 7:06 a.m. UTC | #2
On 03/25/20 at 08:49am, Aneesh Kumar K.V wrote:
> Fixes the below crash
> 
> BUG: Kernel NULL pointer dereference on read at 0x00000000
> Faulting instruction address: 0xc000000000c3447c
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
> ...
> NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
> LR [c000000000088354] vmemmap_free+0x144/0x320
> Call Trace:
>  section_deactivate+0x220/0x240
>  __remove_pages+0x118/0x170
>  arch_remove_memory+0x3c/0x150
>  memunmap_pages+0x1cc/0x2f0
>  devm_action_release+0x30/0x50
>  release_nodes+0x2f8/0x3e0
>  device_release_driver_internal+0x168/0x270
>  unbind_store+0x130/0x170
>  drv_attr_store+0x44/0x60
>  sysfs_kf_write+0x68/0x80
>  kernfs_fop_write+0x100/0x290
>  __vfs_write+0x3c/0x70
>  vfs_write+0xcc/0x240
>  ksys_write+0x7c/0x140
>  system_call+0x5c/0x68
> 
> With commit d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case"),
> section_mem_map is set to NULL after depopulate_section_memmap(). This
> was done so that pfn_to_page() can work correctly with a kernel config that
> disables SPARSEMEM_VMEMMAP. With that config, pfn_to_page() does
> 
> 	__section_mem_map_addr(__sec) + __pfn;
> where
> 
> static inline struct page *__section_mem_map_addr(struct mem_section *section)
> {
> 	unsigned long map = section->section_mem_map;
> 	map &= SECTION_MAP_MASK;
> 	return (struct page *)map;
> }
> 
> Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is used to
> check pfn validity (pfn_valid()). Since section_deactivate() releases
> mem_section->usage if a section is fully deactivated, a pfn_valid() check after
> a subsection deactivation causes a kernel crash.
> 
> static inline int pfn_valid(unsigned long pfn)
> {
> ...
> 	return early_section(ms) || pfn_section_valid(ms, pfn);
> }
> 
> where
> 
> static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
> {
> 	int idx = subsection_map_index(pfn);
> 
> 	return test_bit(idx, ms->usage->subsection_map);
> }
> 
> Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed.
> 
> Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
> Cc: Baoquan He <bhe@redhat.com>
> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

Maybe add Sachin's Tested-by; Sachin has tested and confirmed that this
fix works.

> ---
>  mm/sparse.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index aadb7298dcef..3012d1f3771a 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -781,6 +781,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>  			ms->usage = NULL;
>  		}
>  		memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> +		/* Mark the section invalid */
> +		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;

Not sure if we should add a check in valid_section() or pfn_valid(),
e.g. check the validity of ms->usage too. Otherwise, this fix looks good to
me.

Reviewed-by: Baoquan He <bhe@redhat.com>
Baoquan He March 25, 2020, 7:37 a.m. UTC | #3
On 03/25/20 at 03:06pm, Baoquan He wrote:
> On 03/25/20 at 08:49am, Aneesh Kumar K.V wrote:

> >  mm/sparse.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/mm/sparse.c b/mm/sparse.c
> > index aadb7298dcef..3012d1f3771a 100644
> > --- a/mm/sparse.c
> > +++ b/mm/sparse.c
> > @@ -781,6 +781,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
> >  			ms->usage = NULL;
> >  		}
> >  		memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> > +		/* Mark the section invalid */
> > +		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
> 
> Not sure if we should add a check in valid_section() or pfn_valid(),
> e.g. check the validity of ms->usage too. Otherwise, this fix looks good to
> me.

With SPARSEMEM_VMEMMAP enabled, we should do a validation check on ms->usage
before checking whether any subsection is valid, since we now have a case
in which ms->usage is released while callers still try to check it.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f0a2c184eb9a..d79bd938852e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1306,6 +1306,8 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
 {
 	int idx = subsection_map_index(pfn);
 
+	if (!ms->usage)
+		return 0;
 	return test_bit(idx, ms->usage->subsection_map);
 }
 #else
Aneesh Kumar K.V March 25, 2020, 8:12 a.m. UTC | #4
On 3/25/20 1:07 PM, Baoquan He wrote:
> On 03/25/20 at 03:06pm, Baoquan He wrote:
>> On 03/25/20 at 08:49am, Aneesh Kumar K.V wrote:
> 
>>>   mm/sparse.c | 2 ++
>>>   1 file changed, 2 insertions(+)
>>>
>>> diff --git a/mm/sparse.c b/mm/sparse.c
>>> index aadb7298dcef..3012d1f3771a 100644
>>> --- a/mm/sparse.c
>>> +++ b/mm/sparse.c
>>> @@ -781,6 +781,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>>>   			ms->usage = NULL;
>>>   		}
>>>   		memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
>>> +		/* Mark the section invalid */
>>> +		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
>>
>> Not sure if we should add a check in valid_section() or pfn_valid(),
>> e.g. check the validity of ms->usage too. Otherwise, this fix looks good to
>> me.
> 
> With SPARSEMEM_VMEMMAP enabled, we should do a validation check on ms->usage
> before checking whether any subsection is valid, since we now have a case
> in which ms->usage is released while callers still try to check it.
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index f0a2c184eb9a..d79bd938852e 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1306,6 +1306,8 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
>   {
>   	int idx = subsection_map_index(pfn);
>   
> +	if (!ms->usage)
> +		return 0;
>   	return test_bit(idx, ms->usage->subsection_map);
>   }
>   #else
> 

We always check that the section is valid before we check pfn_section_valid().

static inline int pfn_valid(unsigned long pfn)
{
	struct mem_section *ms;

	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
		return 0;
	ms = __nr_to_section(pfn_to_section_nr(pfn));
	if (!valid_section(ms))
		return 0;
	/*
	 * Traditionally early sections always returned pfn_valid() for
	 * the entire section-sized span.
	 */
	return early_section(ms) || pfn_section_valid(ms, pfn);
}


IMHO, adding that if (!ms->usage) check is redundant.

-aneesh
Baoquan He March 25, 2020, 8:36 a.m. UTC | #5
On 03/25/20 at 01:42pm, Aneesh Kumar K.V wrote:
> On 3/25/20 1:07 PM, Baoquan He wrote:
> > On 03/25/20 at 03:06pm, Baoquan He wrote:
> > > On 03/25/20 at 08:49am, Aneesh Kumar K.V wrote:
> > 
> > > >   mm/sparse.c | 2 ++
> > > >   1 file changed, 2 insertions(+)
> > > > 
> > > > diff --git a/mm/sparse.c b/mm/sparse.c
> > > > index aadb7298dcef..3012d1f3771a 100644
> > > > --- a/mm/sparse.c
> > > > +++ b/mm/sparse.c
> > > > @@ -781,6 +781,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
> > > >   			ms->usage = NULL;
> > > >   		}
> > > >   		memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> > > > +		/* Mark the section invalid */
> > > > +		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
> > > 
> > > Not sure if we should add a check in valid_section() or pfn_valid(),
> > > e.g. check the validity of ms->usage too. Otherwise, this fix looks good to
> > > me.
> > 
> > With SPARSEMEM_VMEMMAP enabled, we should do a validation check on ms->usage
> > before checking whether any subsection is valid, since we now have a case
> > in which ms->usage is released while callers still try to check it.
> > 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index f0a2c184eb9a..d79bd938852e 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -1306,6 +1306,8 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
> >   {
> >   	int idx = subsection_map_index(pfn);
> > +	if (!ms->usage)
> > +		return 0;
> >   	return test_bit(idx, ms->usage->subsection_map);
> >   }
> >   #else
> > 
> 
> We always check that the section is valid before we check pfn_section_valid().
> 
> static inline int pfn_valid(unsigned long pfn)
> {
> 	struct mem_section *ms;
> 
> 	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
> 		return 0;
> 	ms = __nr_to_section(pfn_to_section_nr(pfn));
> 	if (!valid_section(ms))
> 		return 0;
> 	/*
> 	 * Traditionally early sections always returned pfn_valid() for
> 	 * the entire section-sized span.
> 	 */
> 	return early_section(ms) || pfn_section_valid(ms, pfn);
> }
> 
> 
> IMHO, adding that if (!ms->usage) check is redundant.

Yeah, I tend to agree. This can only happen in the small window
between releasing ms->usage and resetting ms->section_mem_map when
removing a section. I just thought adding this check would harden things
even though we have your fix, because we only check ms->section_mem_map
in valid_section(). Anyway, your fix looks good to me; let's see if other
people have any comments.
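
For reference, a rough sketch of the resulting ordering in
section_deactivate(), paraphrased from the hunk above (error handling and
other details elided; see mm/sparse.c for the exact code):

	/* last subsection of the section going away */
	kfree(ms->usage);
	ms->usage = NULL;
	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
	/* the window: usage is gone, but the section still looks valid ... */
	ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;	/* ... the fix closes it */

	if (section_is_early && memmap)
		free_map_bootmem(memmap);
	else
		depopulate_section_memmap(pfn, nr_pages, altmap);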

Thanks
Baoquan
Andrew Morton March 26, 2020, 12:38 a.m. UTC | #6
On Wed, 25 Mar 2020 08:49:14 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:

> Fixes the below crash

(cc's added)

> BUG: Kernel NULL pointer dereference on read at 0x00000000
> Faulting instruction address: 0xc000000000c3447c
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
> ...
> NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
> LR [c000000000088354] vmemmap_free+0x144/0x320
> Call Trace:
>  section_deactivate+0x220/0x240
>  __remove_pages+0x118/0x170
>  arch_remove_memory+0x3c/0x150
>  memunmap_pages+0x1cc/0x2f0
>  devm_action_release+0x30/0x50
>  release_nodes+0x2f8/0x3e0
>  device_release_driver_internal+0x168/0x270
>  unbind_store+0x130/0x170
>  drv_attr_store+0x44/0x60
>  sysfs_kf_write+0x68/0x80
>  kernfs_fop_write+0x100/0x290
>  __vfs_write+0x3c/0x70
>  vfs_write+0xcc/0x240
>  ksys_write+0x7c/0x140
>  system_call+0x5c/0x68
> 
> With commit d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case"),
> section_mem_map is set to NULL after depopulate_section_memmap(). This
> was done so that pfn_to_page() can work correctly with a kernel config that
> disables SPARSEMEM_VMEMMAP. With that config, pfn_to_page() does
> 
> 	__section_mem_map_addr(__sec) + __pfn;
> where
> 
> static inline struct page *__section_mem_map_addr(struct mem_section *section)
> {
> 	unsigned long map = section->section_mem_map;
> 	map &= SECTION_MAP_MASK;
> 	return (struct page *)map;
> }
> 
> Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is used to
> check pfn validity (pfn_valid()). Since section_deactivate() releases
> mem_section->usage if a section is fully deactivated, a pfn_valid() check after
> a subsection deactivation causes a kernel crash.
> 
> static inline int pfn_valid(unsigned long pfn)
> {
> ...
> 	return early_section(ms) || pfn_section_valid(ms, pfn);
> }
> 
> where
> 
> static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
> {
> 	int idx = subsection_map_index(pfn);
> 
> 	return test_bit(idx, ms->usage->subsection_map);
> }
> 
> Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed.
> 
> Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")

d41e2f3bd546 had cc:stable, so I shall add cc:stable to this one as well.

> Cc: Baoquan He <bhe@redhat.com>
> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  mm/sparse.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index aadb7298dcef..3012d1f3771a 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -781,6 +781,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>  			ms->usage = NULL;
>  		}
>  		memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> +		/* Mark the section invalid */
> +		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
>  	}
>  
>  	if (section_is_early && memmap)
Michal Hocko March 26, 2020, 9:40 a.m. UTC | #7
On Wed 25-03-20 08:49:14, Aneesh Kumar K.V wrote:
> Fixes the below crash
> 
> BUG: Kernel NULL pointer dereference on read at 0x00000000
> Faulting instruction address: 0xc000000000c3447c
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
> ...
> NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
> LR [c000000000088354] vmemmap_free+0x144/0x320
> Call Trace:
>  section_deactivate+0x220/0x240

It would be great to match this to the specific source code.

>  __remove_pages+0x118/0x170
>  arch_remove_memory+0x3c/0x150
>  memunmap_pages+0x1cc/0x2f0
>  devm_action_release+0x30/0x50
>  release_nodes+0x2f8/0x3e0
>  device_release_driver_internal+0x168/0x270
>  unbind_store+0x130/0x170
>  drv_attr_store+0x44/0x60
>  sysfs_kf_write+0x68/0x80
>  kernfs_fop_write+0x100/0x290
>  __vfs_write+0x3c/0x70
>  vfs_write+0xcc/0x240
>  ksys_write+0x7c/0x140
>  system_call+0x5c/0x68
> 
> With commit d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case"),
> section_mem_map is set to NULL after depopulate_section_memmap(). This
> was done so that pfn_to_page() can work correctly with a kernel config that
> disables SPARSEMEM_VMEMMAP. With that config, pfn_to_page() does
> 
> 	__section_mem_map_addr(__sec) + __pfn;
> where
> 
> static inline struct page *__section_mem_map_addr(struct mem_section *section)
> {
> 	unsigned long map = section->section_mem_map;
> 	map &= SECTION_MAP_MASK;
> 	return (struct page *)map;
> }
> 
> Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is used to
> check pfn validity (pfn_valid()). Since section_deactivate() releases
> mem_section->usage if a section is fully deactivated, a pfn_valid() check after
> a subsection deactivation causes a kernel crash.
> 
> static inline int pfn_valid(unsigned long pfn)
> {
> ...
> 	return early_section(ms) || pfn_section_valid(ms, pfn);
> }
> 
> where
> 
> static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
> {

> 	int idx = subsection_map_index(pfn);
> 
> 	return test_bit(idx, ms->usage->subsection_map);
> }
> 
> Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed.

I am sorry, I hadn't noticed that during the review of the commit
mentioned above. This is all subtle as hell, I have to say.

Why do we have to free usage before deactivating the section memmap? Now that
we have a late section_mem_map reset, shouldn't we tear down the usage in
the same branch?

> Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
> Cc: Baoquan He <bhe@redhat.com>
> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  mm/sparse.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index aadb7298dcef..3012d1f3771a 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -781,6 +781,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>  			ms->usage = NULL;
>  		}
>  		memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> +		/* Mark the section invalid */
> +		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;

Btw. this comment is not really helping at all.
		/*
		 * section->usage is gone and VMEMMAP's pfn_valid depends
		 * on it (see pfn_section_valid)
		 */
>  	}
>  
>  	if (section_is_early && memmap)
> -- 
> 2.25.1
>
Aneesh Kumar K.V March 26, 2020, 9:56 a.m. UTC | #8
On 3/26/20 3:10 PM, Michal Hocko wrote:
> On Wed 25-03-20 08:49:14, Aneesh Kumar K.V wrote:
>> Fixes the below crash
>>
>> BUG: Kernel NULL pointer dereference on read at 0x00000000
>> Faulting instruction address: 0xc000000000c3447c
>> Oops: Kernel access of bad area, sig: 11 [#1]
>> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
>> CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
>> ...
>> NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
>> LR [c000000000088354] vmemmap_free+0x144/0x320
>> Call Trace:
>>   section_deactivate+0x220/0x240
> 
> It would be great to match this to the specific source code.

The crash is due to a NULL pointer dereference in

test_bit(idx, ms->usage->subsection_map);

because ms->usage is NULL, as explained in the later part of the commit
message.
> 
>>   __remove_pages+0x118/0x170
>>   arch_remove_memory+0x3c/0x150
>>   memunmap_pages+0x1cc/0x2f0
>>   devm_action_release+0x30/0x50
>>   release_nodes+0x2f8/0x3e0
>>   device_release_driver_internal+0x168/0x270
>>   unbind_store+0x130/0x170
>>   drv_attr_store+0x44/0x60
>>   sysfs_kf_write+0x68/0x80
>>   kernfs_fop_write+0x100/0x290
>>   __vfs_write+0x3c/0x70
>>   vfs_write+0xcc/0x240
>>   ksys_write+0x7c/0x140
>>   system_call+0x5c/0x68
>>
>> With commit d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case"),
>> section_mem_map is set to NULL after depopulate_section_memmap(). This
>> was done so that pfn_to_page() can work correctly with a kernel config that
>> disables SPARSEMEM_VMEMMAP. With that config, pfn_to_page() does
>>
>> 	__section_mem_map_addr(__sec) + __pfn;
>> where
>>
>> static inline struct page *__section_mem_map_addr(struct mem_section *section)
>> {
>> 	unsigned long map = section->section_mem_map;
>> 	map &= SECTION_MAP_MASK;
>> 	return (struct page *)map;
>> }
>>
>> Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is used to
>> check pfn validity (pfn_valid()). Since section_deactivate() releases
>> mem_section->usage if a section is fully deactivated, a pfn_valid() check after
>> a subsection deactivation causes a kernel crash.
>>
>> static inline int pfn_valid(unsigned long pfn)
>> {
>> ...
>> 	return early_section(ms) || pfn_section_valid(ms, pfn);
>> }
>>
>> where
>>
>> static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
>> {
> 
>> 	int idx = subsection_map_index(pfn);
>>
>> 	return test_bit(idx, ms->usage->subsection_map);
>> }
>>
>> Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed.
> 
> I am sorry, I hadn't noticed that during the review of the commit
> mentioned above. This is all subtle as hell, I have to say.
> 
> Why do we have to free usage before deactivating the section memmap? Now that
> we have a late section_mem_map reset, shouldn't we tear down the usage in
> the same branch?
> 

We still need to make the section invalid before we call into
depopulate_section_memmap(), because architectures like powerpc can share
a vmemmap area across sections (16MB mapping of the vmemmap area) and we use
vmemmap_populated() to make that decision.
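
That is also why the oops lands in vmemmap_populated(): it uses pfn_valid()
on the struct pages backed by the shared vmemmap block, and pfn_valid()
dereferences the freed ms->usage. A rough sketch of the powerpc helper,
paraphrased from arch/powerpc/mm/init_64.c around v5.6 (illustrative; the
exact code may differ):

/* Is any struct page in this section-shared vmemmap block still in use? */
static int __meminit vmemmap_populated(unsigned long start, int page_size)
{
	unsigned long end = start + page_size;
	start = (unsigned long)(pfn_to_page(vmemmap_section_start(start)));

	for (; start < end; start += (PAGE_SIZE << get_order(sizeof(struct page))))
		if (pfn_valid(page_to_pfn((struct page *)start)))
			return 1;	/* some pfn in the block is still valid */

	return 0;
}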



>> Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
>> Cc: Baoquan He <bhe@redhat.com>
>> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   mm/sparse.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/mm/sparse.c b/mm/sparse.c
>> index aadb7298dcef..3012d1f3771a 100644
>> --- a/mm/sparse.c
>> +++ b/mm/sparse.c
>> @@ -781,6 +781,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>>   			ms->usage = NULL;
>>   		}
>>   		memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
>> +		/* Mark the section invalid */
>> +		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
> 
> Btw. this comment is not really helping at all.

That is marking the section invalid so that

static inline int valid_section(struct mem_section *section)
{
	return (section && (section->section_mem_map & SECTION_HAS_MEM_MAP));
}


returns false.

> 		/*
> 		 * section->usage is gone and VMEMMAP's pfn_valid depends
> 		 * on it (see pfn_section_valid)
> 		 */
>>   	}
>>   
>>   	if (section_is_early && memmap)
>> -- 
>> 2.25.1
>>
>
Michal Hocko March 26, 2020, 10:16 a.m. UTC | #9
On Thu 26-03-20 15:26:22, Aneesh Kumar K.V wrote:
> On 3/26/20 3:10 PM, Michal Hocko wrote:
> > On Wed 25-03-20 08:49:14, Aneesh Kumar K.V wrote:
> > > Fixes the below crash
> > > 
> > > BUG: Kernel NULL pointer dereference on read at 0x00000000
> > > Faulting instruction address: 0xc000000000c3447c
> > > Oops: Kernel access of bad area, sig: 11 [#1]
> > > LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> > > CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
> > > ...
> > > NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
> > > LR [c000000000088354] vmemmap_free+0x144/0x320
> > > Call Trace:
> > >   section_deactivate+0x220/0x240
> > 
> > It would be great to match this to the specific source code.
> 
> The crash is due to a NULL pointer dereference in
> 
> test_bit(idx, ms->usage->subsection_map); because ms->usage is NULL.

It would be nice to call that out here as well

[...]
> > Why do we have to free usage before deactivating the section memmap? Now that
> > we have a late section_mem_map reset, shouldn't we tear down the usage in
> > the same branch?
> > 
> 
> We still need to make the section invalid before we call into
> depopulate_section_memmap(), because architectures like powerpc can share
> a vmemmap area across sections (16MB mapping of the vmemmap area) and we use
> vmemmap_populated() to make that decision.

This should be noted in a comment as well.

> > > Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
> > > Cc: Baoquan He <bhe@redhat.com>
> > > Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
> > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > > ---
> > >   mm/sparse.c | 2 ++
> > >   1 file changed, 2 insertions(+)
> > > 
> > > diff --git a/mm/sparse.c b/mm/sparse.c
> > > index aadb7298dcef..3012d1f3771a 100644
> > > --- a/mm/sparse.c
> > > +++ b/mm/sparse.c
> > > @@ -781,6 +781,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
> > >   			ms->usage = NULL;
> > >   		}
> > >   		memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> > > +		/* Mark the section invalid */
> > > +		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
> > 
> > Btw. this comment is not really helping at all.
> 
> That is marking the section invalid so that
> 
> static inline int valid_section(struct mem_section *section)
> {
> 	return (section && (section->section_mem_map & SECTION_HAS_MEM_MAP));
> }
> 
> 
> returns false.

Yes, that is obvious once you are clear where to look. I was really
hoping for a comment that would simply point you in the right
direction without chasing SECTION_HAS_MEM_MAP usage. This code is
subtle, and useful comments, even when they state something that is
obvious to you _right_now_, can be really helpful.

Thanks!
Michal Hocko March 26, 2020, 10:50 a.m. UTC | #10
On Thu 26-03-20 11:16:33, Michal Hocko wrote:
> On Thu 26-03-20 15:26:22, Aneesh Kumar K.V wrote:
> > On 3/26/20 3:10 PM, Michal Hocko wrote:
> > > On Wed 25-03-20 08:49:14, Aneesh Kumar K.V wrote:
> > > > Fixes the below crash
> > > > 
> > > > BUG: Kernel NULL pointer dereference on read at 0x00000000
> > > > Faulting instruction address: 0xc000000000c3447c
> > > > Oops: Kernel access of bad area, sig: 11 [#1]
> > > > LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> > > > CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
> > > > ...
> > > > NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
> > > > LR [c000000000088354] vmemmap_free+0x144/0x320
> > > > Call Trace:
> > > >   section_deactivate+0x220/0x240
> > > 
> > > It would be great to match this to the specific source code.
> > 
> > The crash is due to a NULL pointer dereference in
> > 
> > test_bit(idx, ms->usage->subsection_map); because ms->usage is NULL.
> 
> It would be nice to call that out here as well
> 
> [...]
> > > Why do we have to free usage before deactivating the section memmap? Now that
> > > we have a late section_mem_map reset, shouldn't we tear down the usage in
> > > the same branch?
> > > 
> > 
> > We still need to make the section invalid before we call into
> > depopulate_section_memmap(), because architectures like powerpc can share
> > a vmemmap area across sections (16MB mapping of the vmemmap area) and we use
> > vmemmap_populated() to make that decision.
> 
> This should be noted in a comment as well.
> 
> > > > Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
> > > > Cc: Baoquan He <bhe@redhat.com>
> > > > Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
> > > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > > > ---
> > > >   mm/sparse.c | 2 ++
> > > >   1 file changed, 2 insertions(+)
> > > > 
> > > > diff --git a/mm/sparse.c b/mm/sparse.c
> > > > index aadb7298dcef..3012d1f3771a 100644
> > > > --- a/mm/sparse.c
> > > > +++ b/mm/sparse.c
> > > > @@ -781,6 +781,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
> > > >   			ms->usage = NULL;
> > > >   		}
> > > >   		memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> > > > +		/* Mark the section invalid */
> > > > +		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
> > > 
> > > Btw. this comment is not really helping at all.
> > 
> > That is marking the section invalid so that
> > 
> > static inline int valid_section(struct mem_section *section)
> > {
> > 	return (section && (section->section_mem_map & SECTION_HAS_MEM_MAP));
> > }
> > 
> > 
> > returns false.
> 
> Yes, that is obvious once you are clear where to look. I was really
> hoping for a comment that would simply point you in the right
> direction without chasing SECTION_HAS_MEM_MAP usage. This code is
> subtle, and useful comments, even when they state something that is
> obvious to you _right_now_, can be really helpful.

Btw. forgot to add. With the improved comment feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>
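
An improved comment along those lines, pulling together the points raised
in this thread, might read something like (illustrative sketch only):

		/*
		 * Mark the section invalid so that valid_section()
		 * returns false. This is needed because:
		 *  - ms->usage is gone, and pfn_valid() in the VMEMMAP
		 *    case depends on it (see pfn_section_valid()), and
		 *  - architectures like powerpc rely on pfn_valid() via
		 *    vmemmap_populated() to decide whether a vmemmap
		 *    block shared across sections is still in use.
		 */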

Patch

diff --git a/mm/sparse.c b/mm/sparse.c
index aadb7298dcef..3012d1f3771a 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -781,6 +781,8 @@  static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 			ms->usage = NULL;
 		}
 		memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
+		/* Mark the section invalid */
+		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
 	}
 
 	if (section_is_early && memmap)