diff mbox series

[v1] mm/memory_hotplug: Don't free usage map when removing a re-added early section

Message ID 20191217104637.5509-1-david@redhat.com (mailing list archive)
State New, archived
Headers show
Series [v1] mm/memory_hotplug: Don't free usage map when removing a re-added early section | expand

Commit Message

David Hildenbrand Dec. 17, 2019, 10:46 a.m. UTC
When we remove an early section, we don't free the usage map, as the
usage maps of other sections are placed into the same page. Once the
section is removed, it is no longer an early section (especially, the
memmap is freed). When we re-add that section, the usage map is reused,
however, it is no longer an early section. When removing that section
again, we try to kfree() a usage map that was allocated during early
boot - bad.

Let's check against PageReserved() to see if we are dealing with an usage
map that was allocated during boot. We could also check against
!(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved()
is cleaner.

Can be triggered using memtrace under ppc64/powernv:

$ mount -t debugfs none /sys/kernel/debug/
$ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
$ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
[   12.093442] ------------[ cut here ]------------
[   12.093469] kernel BUG at mm/slub.c:3969!
[   12.093656] Oops: Exception in kernel mode, sig: 5 [#1]
[   12.093961] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
[   12.094320] Modules linked in:
[   12.094615] CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
[   12.095066] NIP:  c000000000396b38 LR: c000000000385848 CTR: c000000000143d30
[   12.095427] REGS: c000000073077680 TRAP: 0700   Not tainted  (5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0)
[   12.095886] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28004828  XER: 20000000
[   12.096395] CFAR: c000000000396b9c IRQMASK: 0
[   12.096395] GPR00: c000000000385848 c000000073077910 c00000000110f300 c00000007ffffc00
[   12.096395] GPR04: 0000000000000000 ffffffffffffffff 0000000000000000 0000000000000000
[   12.096395] GPR08: 0000000000000000 0000000000000001 0000000000000000 ffffffffffffffc8
[   12.096395] GPR12: 0000000000004000 c0000000012d0000 0000000000001000 c000000000d33c78
[   12.096395] GPR16: 0000000000000000 c0000000011bfeb0 ffffffffffffe000 c0000000000b6370
[   12.096395] GPR20: ffffffffe0000000 c0000000011411c0 0000000000006000 c0000000000b6390
[   12.096395] GPR24: 0000000010000000 0000000000000040 0000000000000000 0000000000000000
[   12.096395] GPR28: c000000000385848 c00c0000001fffc0 0000000000004000 0000000000000000
[   12.099882] NIP [c000000000396b38] kfree+0x338/0x3b0
[   12.100135] LR [c000000000385848] section_deactivate+0x138/0x200
[   12.100508] Call Trace:
[   12.100927] [c000000073077910] [c0000000010599a8] 0xc0000000010599a8 (unreliable)
[   12.101338] [c000000073077960] [c000000000385848] section_deactivate+0x138/0x200
[   12.101696] [c000000073077a10] [c00000000039b9f4] __remove_pages+0x114/0x150
[   12.102025] [c000000073077a60] [c00000000006793c] arch_remove_memory+0x3c/0x160
[   12.102381] [c000000073077ae0] [c00000000039c154] try_remove_memory+0x114/0x1a0
[   12.102715] [c000000073077b90] [c00000000039c020] __remove_memory+0x20/0x40
[   12.103041] [c000000073077bb0] [c0000000000b6714] memtrace_enable_set+0x254/0x850
[   12.103402] [c000000073077cb0] [c0000000004197e8] simple_attr_write+0x138/0x160
[   12.103751] [c000000073077d10] [c000000000558c9c] full_proxy_write+0x8c/0x110
[   12.104100] [c000000073077d60] [c0000000003d02a8] __vfs_write+0x38/0x70
[   12.104409] [c000000073077d80] [c0000000003d3c5c] vfs_write+0x11c/0x2a0
[   12.104711] [c000000073077dd0] [c0000000003d4054] ksys_write+0x84/0x140
[   12.105011] [c000000073077e20] [c00000000000b594] system_call+0x5c/0x68
[   12.105357] Instruction dump:
[   12.105606] e93d0000 75290001 41820090 8bfd0051 38a0ffff 7ca5f830 7bff0020 7ca507b4
[   12.105993] e95d0000 39200000 754a0001 4182005c <0b090000> 893d0007 3d42000b 38800006
[   12.106583] ---[ end trace 4b053cbd84e0db62 ]---

The first invocation will offline+remove memory blocks. The second
invocation will first add+online them again, in order to offline+remove
them again (usually we are lucky and the exact same memory blocks will
get "reallocated").

Tested on powernv with boot memory: The usage map will not get freed.
Tested on x86-64 with DIMMs: The usage map will get freed.

Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/sparse.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

Dan Williams Dec. 18, 2019, 3:21 a.m. UTC | #1
On Tue, Dec 17, 2019 at 2:47 AM David Hildenbrand <david@redhat.com> wrote:
>
> When we remove an early section, we don't free the usage map, as the
> usage maps of other sections are placed into the same page. Once the
> section is removed, it is no longer an early section (especially, the
> memmap is freed). When we re-add that section, the usage map is reused,
> however, it is no longer an early section. When removing that section
> again, we try to kfree() a usage map that was allocated during early
> boot - bad.
>
> Let's check against PageReserved() to see if we are dealing with an usage
> map that was allocated during boot. We could also check against
> !(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved()
> is cleaner.
>
> Can be triggered using memtrace under ppc64/powernv:
>
> $ mount -t debugfs none /sys/kernel/debug/
> $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
> $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
> [   12.093442] ------------[ cut here ]------------
> [   12.093469] kernel BUG at mm/slub.c:3969!
> [   12.093656] Oops: Exception in kernel mode, sig: 5 [#1]
> [   12.093961] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
> [   12.094320] Modules linked in:
> [   12.094615] CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
> [   12.095066] NIP:  c000000000396b38 LR: c000000000385848 CTR: c000000000143d30
> [   12.095427] REGS: c000000073077680 TRAP: 0700   Not tainted  (5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0)
> [   12.095886] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28004828  XER: 20000000
> [   12.096395] CFAR: c000000000396b9c IRQMASK: 0
> [   12.096395] GPR00: c000000000385848 c000000073077910 c00000000110f300 c00000007ffffc00
> [   12.096395] GPR04: 0000000000000000 ffffffffffffffff 0000000000000000 0000000000000000
> [   12.096395] GPR08: 0000000000000000 0000000000000001 0000000000000000 ffffffffffffffc8
> [   12.096395] GPR12: 0000000000004000 c0000000012d0000 0000000000001000 c000000000d33c78
> [   12.096395] GPR16: 0000000000000000 c0000000011bfeb0 ffffffffffffe000 c0000000000b6370
> [   12.096395] GPR20: ffffffffe0000000 c0000000011411c0 0000000000006000 c0000000000b6390
> [   12.096395] GPR24: 0000000010000000 0000000000000040 0000000000000000 0000000000000000
> [   12.096395] GPR28: c000000000385848 c00c0000001fffc0 0000000000004000 0000000000000000
> [   12.099882] NIP [c000000000396b38] kfree+0x338/0x3b0
> [   12.100135] LR [c000000000385848] section_deactivate+0x138/0x200
> [   12.100508] Call Trace:
> [   12.100927] [c000000073077910] [c0000000010599a8] 0xc0000000010599a8 (unreliable)
> [   12.101338] [c000000073077960] [c000000000385848] section_deactivate+0x138/0x200
> [   12.101696] [c000000073077a10] [c00000000039b9f4] __remove_pages+0x114/0x150
> [   12.102025] [c000000073077a60] [c00000000006793c] arch_remove_memory+0x3c/0x160
> [   12.102381] [c000000073077ae0] [c00000000039c154] try_remove_memory+0x114/0x1a0
> [   12.102715] [c000000073077b90] [c00000000039c020] __remove_memory+0x20/0x40
> [   12.103041] [c000000073077bb0] [c0000000000b6714] memtrace_enable_set+0x254/0x850
> [   12.103402] [c000000073077cb0] [c0000000004197e8] simple_attr_write+0x138/0x160
> [   12.103751] [c000000073077d10] [c000000000558c9c] full_proxy_write+0x8c/0x110
> [   12.104100] [c000000073077d60] [c0000000003d02a8] __vfs_write+0x38/0x70
> [   12.104409] [c000000073077d80] [c0000000003d3c5c] vfs_write+0x11c/0x2a0
> [   12.104711] [c000000073077dd0] [c0000000003d4054] ksys_write+0x84/0x140
> [   12.105011] [c000000073077e20] [c00000000000b594] system_call+0x5c/0x68
> [   12.105357] Instruction dump:
> [   12.105606] e93d0000 75290001 41820090 8bfd0051 38a0ffff 7ca5f830 7bff0020 7ca507b4
> [   12.105993] e95d0000 39200000 754a0001 4182005c <0b090000> 893d0007 3d42000b 38800006
> [   12.106583] ---[ end trace 4b053cbd84e0db62 ]---
>
> The first invocation will offline+remove memory blocks. The second
> invocation will first add+online them again, in order to offline+remove
> them again (usually we are lucky and the exact same memory blocks will
> get "reallocated").
>
> Tested on powernv with boot memory: The usage map will not get freed.
> Tested on x86-64 with DIMMs: The usage map will get freed.
>
> Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/sparse.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/mm/sparse.c b/mm/sparse.c
> index b20ab7cdac86..3822ecbd8a1f 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -777,7 +777,14 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>         if (bitmap_empty(subsection_map, SUBSECTIONS_PER_SECTION)) {
>                 unsigned long section_nr = pfn_to_section_nr(pfn);
>
> -               if (!section_is_early) {

With this change this function no longer needs a local
section_is_early variable...

> +               /*
> +                * When removing an early section, the usage map is kept (as the
> +                * usage maps of other sections fall into the same page). It
> +                * will be re-used when re-adding the section - which is then no
> +                * longer an early section. If the usage map is PageReserved, it
> +                * was allocated during boot.
> +                */
> +               if (!PageReserved(virt_to_page(ms->usage))) {

This looks good. However, if this was wrong then I wonder why you are
not seeing problems with free_map_bootmem()?
David Hildenbrand Dec. 18, 2019, 8:53 a.m. UTC | #2
On 18.12.19 04:21, Dan Williams wrote:
> On Tue, Dec 17, 2019 at 2:47 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> When we remove an early section, we don't free the usage map, as the
>> usage maps of other sections are placed into the same page. Once the
>> section is removed, it is no longer an early section (especially, the
>> memmap is freed). When we re-add that section, the usage map is reused,
>> however, it is no longer an early section. When removing that section
>> again, we try to kfree() a usage map that was allocated during early
>> boot - bad.
>>
>> Let's check against PageReserved() to see if we are dealing with an usage
>> map that was allocated during boot. We could also check against
>> !(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved()
>> is cleaner.
>>
>> Can be triggered using memtrace under ppc64/powernv:
>>
>> $ mount -t debugfs none /sys/kernel/debug/
>> $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [   12.093442] ------------[ cut here ]------------
>> [   12.093469] kernel BUG at mm/slub.c:3969!
>> [   12.093656] Oops: Exception in kernel mode, sig: 5 [#1]
>> [   12.093961] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
>> [   12.094320] Modules linked in:
>> [   12.094615] CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
>> [   12.095066] NIP:  c000000000396b38 LR: c000000000385848 CTR: c000000000143d30
>> [   12.095427] REGS: c000000073077680 TRAP: 0700   Not tainted  (5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0)
>> [   12.095886] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28004828  XER: 20000000
>> [   12.096395] CFAR: c000000000396b9c IRQMASK: 0
>> [   12.096395] GPR00: c000000000385848 c000000073077910 c00000000110f300 c00000007ffffc00
>> [   12.096395] GPR04: 0000000000000000 ffffffffffffffff 0000000000000000 0000000000000000
>> [   12.096395] GPR08: 0000000000000000 0000000000000001 0000000000000000 ffffffffffffffc8
>> [   12.096395] GPR12: 0000000000004000 c0000000012d0000 0000000000001000 c000000000d33c78
>> [   12.096395] GPR16: 0000000000000000 c0000000011bfeb0 ffffffffffffe000 c0000000000b6370
>> [   12.096395] GPR20: ffffffffe0000000 c0000000011411c0 0000000000006000 c0000000000b6390
>> [   12.096395] GPR24: 0000000010000000 0000000000000040 0000000000000000 0000000000000000
>> [   12.096395] GPR28: c000000000385848 c00c0000001fffc0 0000000000004000 0000000000000000
>> [   12.099882] NIP [c000000000396b38] kfree+0x338/0x3b0
>> [   12.100135] LR [c000000000385848] section_deactivate+0x138/0x200
>> [   12.100508] Call Trace:
>> [   12.100927] [c000000073077910] [c0000000010599a8] 0xc0000000010599a8 (unreliable)
>> [   12.101338] [c000000073077960] [c000000000385848] section_deactivate+0x138/0x200
>> [   12.101696] [c000000073077a10] [c00000000039b9f4] __remove_pages+0x114/0x150
>> [   12.102025] [c000000073077a60] [c00000000006793c] arch_remove_memory+0x3c/0x160
>> [   12.102381] [c000000073077ae0] [c00000000039c154] try_remove_memory+0x114/0x1a0
>> [   12.102715] [c000000073077b90] [c00000000039c020] __remove_memory+0x20/0x40
>> [   12.103041] [c000000073077bb0] [c0000000000b6714] memtrace_enable_set+0x254/0x850
>> [   12.103402] [c000000073077cb0] [c0000000004197e8] simple_attr_write+0x138/0x160
>> [   12.103751] [c000000073077d10] [c000000000558c9c] full_proxy_write+0x8c/0x110
>> [   12.104100] [c000000073077d60] [c0000000003d02a8] __vfs_write+0x38/0x70
>> [   12.104409] [c000000073077d80] [c0000000003d3c5c] vfs_write+0x11c/0x2a0
>> [   12.104711] [c000000073077dd0] [c0000000003d4054] ksys_write+0x84/0x140
>> [   12.105011] [c000000073077e20] [c00000000000b594] system_call+0x5c/0x68
>> [   12.105357] Instruction dump:
>> [   12.105606] e93d0000 75290001 41820090 8bfd0051 38a0ffff 7ca5f830 7bff0020 7ca507b4
>> [   12.105993] e95d0000 39200000 754a0001 4182005c <0b090000> 893d0007 3d42000b 38800006
>> [   12.106583] ---[ end trace 4b053cbd84e0db62 ]---
>>
>> The first invocation will offline+remove memory blocks. The second
>> invocation will first add+online them again, in order to offline+remove
>> them again (usually we are lucky and the exact same memory blocks will
>> get "reallocated").
>>
>> Tested on powernv with boot memory: The usage map will not get freed.
>> Tested on x86-64 with DIMMs: The usage map will get freed.
>>
>> Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  mm/sparse.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/sparse.c b/mm/sparse.c
>> index b20ab7cdac86..3822ecbd8a1f 100644
>> --- a/mm/sparse.c
>> +++ b/mm/sparse.c
>> @@ -777,7 +777,14 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>>         if (bitmap_empty(subsection_map, SUBSECTIONS_PER_SECTION)) {
>>                 unsigned long section_nr = pfn_to_section_nr(pfn);
>>
>> -               if (!section_is_early) {
> 
> With this change this function no longer needs a local
> section_is_early variable...

Right. I'll send a cleanup patch and we can decide if we want to squash
them in Andrews tree.

> 
>> +               /*
>> +                * When removing an early section, the usage map is kept (as the
>> +                * usage maps of other sections fall into the same page). It
>> +                * will be re-used when re-adding the section - which is then no
>> +                * longer an early section. If the usage map is PageReserved, it
>> +                * was allocated during boot.
>> +                */
>> +               if (!PageReserved(virt_to_page(ms->usage))) {
> 
> This looks good. However, if this was wrong then I wonder why you are
> not seeing problems with free_map_bootmem()?

AFAIKS, in contrast to the usage map, the memmap will get freed. So when
you readd memory, you actually allocate a new one and mark the section
!early.

I was thinking about replacing the section flag by a simple
PageReserved(virt_to_page(memmap)) check. Might be a little bit slower.
Maybe something for the new year :)

Thanks!
David Hildenbrand Dec. 18, 2019, 12:06 p.m. UTC | #3
On 18.12.19 09:53, David Hildenbrand wrote:
> On 18.12.19 04:21, Dan Williams wrote:
>> On Tue, Dec 17, 2019 at 2:47 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> When we remove an early section, we don't free the usage map, as the
>>> usage maps of other sections are placed into the same page. Once the
>>> section is removed, it is no longer an early section (especially, the
>>> memmap is freed). When we re-add that section, the usage map is reused,
>>> however, it is no longer an early section. When removing that section
>>> again, we try to kfree() a usage map that was allocated during early
>>> boot - bad.
>>>
>>> Let's check against PageReserved() to see if we are dealing with an usage
>>> map that was allocated during boot. We could also check against
>>> !(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved()
>>> is cleaner.
>>>
>>> Can be triggered using memtrace under ppc64/powernv:
>>>
>>> $ mount -t debugfs none /sys/kernel/debug/
>>> $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
>>> $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
>>> [   12.093442] ------------[ cut here ]------------
>>> [   12.093469] kernel BUG at mm/slub.c:3969!
>>> [   12.093656] Oops: Exception in kernel mode, sig: 5 [#1]
>>> [   12.093961] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
>>> [   12.094320] Modules linked in:
>>> [   12.094615] CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
>>> [   12.095066] NIP:  c000000000396b38 LR: c000000000385848 CTR: c000000000143d30
>>> [   12.095427] REGS: c000000073077680 TRAP: 0700   Not tainted  (5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0)
>>> [   12.095886] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28004828  XER: 20000000
>>> [   12.096395] CFAR: c000000000396b9c IRQMASK: 0
>>> [   12.096395] GPR00: c000000000385848 c000000073077910 c00000000110f300 c00000007ffffc00
>>> [   12.096395] GPR04: 0000000000000000 ffffffffffffffff 0000000000000000 0000000000000000
>>> [   12.096395] GPR08: 0000000000000000 0000000000000001 0000000000000000 ffffffffffffffc8
>>> [   12.096395] GPR12: 0000000000004000 c0000000012d0000 0000000000001000 c000000000d33c78
>>> [   12.096395] GPR16: 0000000000000000 c0000000011bfeb0 ffffffffffffe000 c0000000000b6370
>>> [   12.096395] GPR20: ffffffffe0000000 c0000000011411c0 0000000000006000 c0000000000b6390
>>> [   12.096395] GPR24: 0000000010000000 0000000000000040 0000000000000000 0000000000000000
>>> [   12.096395] GPR28: c000000000385848 c00c0000001fffc0 0000000000004000 0000000000000000
>>> [   12.099882] NIP [c000000000396b38] kfree+0x338/0x3b0
>>> [   12.100135] LR [c000000000385848] section_deactivate+0x138/0x200
>>> [   12.100508] Call Trace:
>>> [   12.100927] [c000000073077910] [c0000000010599a8] 0xc0000000010599a8 (unreliable)
>>> [   12.101338] [c000000073077960] [c000000000385848] section_deactivate+0x138/0x200
>>> [   12.101696] [c000000073077a10] [c00000000039b9f4] __remove_pages+0x114/0x150
>>> [   12.102025] [c000000073077a60] [c00000000006793c] arch_remove_memory+0x3c/0x160
>>> [   12.102381] [c000000073077ae0] [c00000000039c154] try_remove_memory+0x114/0x1a0
>>> [   12.102715] [c000000073077b90] [c00000000039c020] __remove_memory+0x20/0x40
>>> [   12.103041] [c000000073077bb0] [c0000000000b6714] memtrace_enable_set+0x254/0x850
>>> [   12.103402] [c000000073077cb0] [c0000000004197e8] simple_attr_write+0x138/0x160
>>> [   12.103751] [c000000073077d10] [c000000000558c9c] full_proxy_write+0x8c/0x110
>>> [   12.104100] [c000000073077d60] [c0000000003d02a8] __vfs_write+0x38/0x70
>>> [   12.104409] [c000000073077d80] [c0000000003d3c5c] vfs_write+0x11c/0x2a0
>>> [   12.104711] [c000000073077dd0] [c0000000003d4054] ksys_write+0x84/0x140
>>> [   12.105011] [c000000073077e20] [c00000000000b594] system_call+0x5c/0x68
>>> [   12.105357] Instruction dump:
>>> [   12.105606] e93d0000 75290001 41820090 8bfd0051 38a0ffff 7ca5f830 7bff0020 7ca507b4
>>> [   12.105993] e95d0000 39200000 754a0001 4182005c <0b090000> 893d0007 3d42000b 38800006
>>> [   12.106583] ---[ end trace 4b053cbd84e0db62 ]---
>>>
>>> The first invocation will offline+remove memory blocks. The second
>>> invocation will first add+online them again, in order to offline+remove
>>> them again (usually we are lucky and the exact same memory blocks will
>>> get "reallocated").
>>>
>>> Tested on powernv with boot memory: The usage map will not get freed.
>>> Tested on x86-64 with DIMMs: The usage map will get freed.
>>>
>>> Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> Cc: Oscar Salvador <osalvador@suse.de>
>>> Cc: Michal Hocko <mhocko@kernel.org>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>  mm/sparse.c | 9 ++++++++-
>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/sparse.c b/mm/sparse.c
>>> index b20ab7cdac86..3822ecbd8a1f 100644
>>> --- a/mm/sparse.c
>>> +++ b/mm/sparse.c
>>> @@ -777,7 +777,14 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>>>         if (bitmap_empty(subsection_map, SUBSECTIONS_PER_SECTION)) {
>>>                 unsigned long section_nr = pfn_to_section_nr(pfn);
>>>
>>> -               if (!section_is_early) {
>>
>> With this change this function no longer needs a local
>> section_is_early variable...
> 
> Right. I'll send a cleanup patch and we can decide if we want to squash
> them in Andrews tree.

Ah, just realized that we cannot do that.

ms->section_mem_map = sparse_encode_mem_map(NULL, section_nr)

already clears the early flag, so inlining that change would break it.
David Hildenbrand Dec. 18, 2019, 12:06 p.m. UTC | #4
On 18.12.19 09:53, David Hildenbrand wrote:
> On 18.12.19 04:21, Dan Williams wrote:
>> On Tue, Dec 17, 2019 at 2:47 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> When we remove an early section, we don't free the usage map, as the
>>> usage maps of other sections are placed into the same page. Once the
>>> section is removed, it is no longer an early section (especially, the
>>> memmap is freed). When we re-add that section, the usage map is reused,
>>> however, it is no longer an early section. When removing that section
>>> again, we try to kfree() a usage map that was allocated during early
>>> boot - bad.
>>>
>>> Let's check against PageReserved() to see if we are dealing with an usage
>>> map that was allocated during boot. We could also check against
>>> !(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved()
>>> is cleaner.
>>>
>>> Can be triggered using memtrace under ppc64/powernv:
>>>
>>> $ mount -t debugfs none /sys/kernel/debug/
>>> $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
>>> $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
>>> [   12.093442] ------------[ cut here ]------------
>>> [   12.093469] kernel BUG at mm/slub.c:3969!
>>> [   12.093656] Oops: Exception in kernel mode, sig: 5 [#1]
>>> [   12.093961] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
>>> [   12.094320] Modules linked in:
>>> [   12.094615] CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
>>> [   12.095066] NIP:  c000000000396b38 LR: c000000000385848 CTR: c000000000143d30
>>> [   12.095427] REGS: c000000073077680 TRAP: 0700   Not tainted  (5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0)
>>> [   12.095886] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28004828  XER: 20000000
>>> [   12.096395] CFAR: c000000000396b9c IRQMASK: 0
>>> [   12.096395] GPR00: c000000000385848 c000000073077910 c00000000110f300 c00000007ffffc00
>>> [   12.096395] GPR04: 0000000000000000 ffffffffffffffff 0000000000000000 0000000000000000
>>> [   12.096395] GPR08: 0000000000000000 0000000000000001 0000000000000000 ffffffffffffffc8
>>> [   12.096395] GPR12: 0000000000004000 c0000000012d0000 0000000000001000 c000000000d33c78
>>> [   12.096395] GPR16: 0000000000000000 c0000000011bfeb0 ffffffffffffe000 c0000000000b6370
>>> [   12.096395] GPR20: ffffffffe0000000 c0000000011411c0 0000000000006000 c0000000000b6390
>>> [   12.096395] GPR24: 0000000010000000 0000000000000040 0000000000000000 0000000000000000
>>> [   12.096395] GPR28: c000000000385848 c00c0000001fffc0 0000000000004000 0000000000000000
>>> [   12.099882] NIP [c000000000396b38] kfree+0x338/0x3b0
>>> [   12.100135] LR [c000000000385848] section_deactivate+0x138/0x200
>>> [   12.100508] Call Trace:
>>> [   12.100927] [c000000073077910] [c0000000010599a8] 0xc0000000010599a8 (unreliable)
>>> [   12.101338] [c000000073077960] [c000000000385848] section_deactivate+0x138/0x200
>>> [   12.101696] [c000000073077a10] [c00000000039b9f4] __remove_pages+0x114/0x150
>>> [   12.102025] [c000000073077a60] [c00000000006793c] arch_remove_memory+0x3c/0x160
>>> [   12.102381] [c000000073077ae0] [c00000000039c154] try_remove_memory+0x114/0x1a0
>>> [   12.102715] [c000000073077b90] [c00000000039c020] __remove_memory+0x20/0x40
>>> [   12.103041] [c000000073077bb0] [c0000000000b6714] memtrace_enable_set+0x254/0x850
>>> [   12.103402] [c000000073077cb0] [c0000000004197e8] simple_attr_write+0x138/0x160
>>> [   12.103751] [c000000073077d10] [c000000000558c9c] full_proxy_write+0x8c/0x110
>>> [   12.104100] [c000000073077d60] [c0000000003d02a8] __vfs_write+0x38/0x70
>>> [   12.104409] [c000000073077d80] [c0000000003d3c5c] vfs_write+0x11c/0x2a0
>>> [   12.104711] [c000000073077dd0] [c0000000003d4054] ksys_write+0x84/0x140
>>> [   12.105011] [c000000073077e20] [c00000000000b594] system_call+0x5c/0x68
>>> [   12.105357] Instruction dump:
>>> [   12.105606] e93d0000 75290001 41820090 8bfd0051 38a0ffff 7ca5f830 7bff0020 7ca507b4
>>> [   12.105993] e95d0000 39200000 754a0001 4182005c <0b090000> 893d0007 3d42000b 38800006
>>> [   12.106583] ---[ end trace 4b053cbd84e0db62 ]---
>>>
>>> The first invocation will offline+remove memory blocks. The second
>>> invocation will first add+online them again, in order to offline+remove
>>> them again (usually we are lucky and the exact same memory blocks will
>>> get "reallocated").
>>>
>>> Tested on powernv with boot memory: The usage map will not get freed.
>>> Tested on x86-64 with DIMMs: The usage map will get freed.
>>>
>>> Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> Cc: Oscar Salvador <osalvador@suse.de>
>>> Cc: Michal Hocko <mhocko@kernel.org>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>  mm/sparse.c | 9 ++++++++-
>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/sparse.c b/mm/sparse.c
>>> index b20ab7cdac86..3822ecbd8a1f 100644
>>> --- a/mm/sparse.c
>>> +++ b/mm/sparse.c
>>> @@ -777,7 +777,14 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>>>         if (bitmap_empty(subsection_map, SUBSECTIONS_PER_SECTION)) {
>>>                 unsigned long section_nr = pfn_to_section_nr(pfn);
>>>
>>> -               if (!section_is_early) {
>>
>> With this change this function no longer needs a local
>> section_is_early variable...
> 
> Right. I'll send a cleanup patch and we can decide if we want to squash
> them in Andrews tree.

Ah, just realized that we cannot do that.

ms->section_mem_map = sparse_encode_mem_map(NULL, section_nr)

already clears the early flag, so inlining that condition would break it.
David Hildenbrand Jan. 8, 2020, 8:19 a.m. UTC | #5
On 17.12.19 11:46, David Hildenbrand wrote:
> When we remove an early section, we don't free the usage map, as the
> usage maps of other sections are placed into the same page. Once the
> section is removed, it is no longer an early section (especially, the
> memmap is freed). When we re-add that section, the usage map is reused,
> however, it is no longer an early section. When removing that section
> again, we try to kfree() a usage map that was allocated during early
> boot - bad.
> 
> Let's check against PageReserved() to see if we are dealing with an usage
> map that was allocated during boot. We could also check against
> !(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved()
> is cleaner.
> 
> Can be triggered using memtrace under ppc64/powernv:
> 
> $ mount -t debugfs none /sys/kernel/debug/
> $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
> $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
> [   12.093442] ------------[ cut here ]------------
> [   12.093469] kernel BUG at mm/slub.c:3969!
> [   12.093656] Oops: Exception in kernel mode, sig: 5 [#1]
> [   12.093961] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
> [   12.094320] Modules linked in:
> [   12.094615] CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
> [   12.095066] NIP:  c000000000396b38 LR: c000000000385848 CTR: c000000000143d30
> [   12.095427] REGS: c000000073077680 TRAP: 0700   Not tainted  (5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0)
> [   12.095886] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28004828  XER: 20000000
> [   12.096395] CFAR: c000000000396b9c IRQMASK: 0
> [   12.096395] GPR00: c000000000385848 c000000073077910 c00000000110f300 c00000007ffffc00
> [   12.096395] GPR04: 0000000000000000 ffffffffffffffff 0000000000000000 0000000000000000
> [   12.096395] GPR08: 0000000000000000 0000000000000001 0000000000000000 ffffffffffffffc8
> [   12.096395] GPR12: 0000000000004000 c0000000012d0000 0000000000001000 c000000000d33c78
> [   12.096395] GPR16: 0000000000000000 c0000000011bfeb0 ffffffffffffe000 c0000000000b6370
> [   12.096395] GPR20: ffffffffe0000000 c0000000011411c0 0000000000006000 c0000000000b6390
> [   12.096395] GPR24: 0000000010000000 0000000000000040 0000000000000000 0000000000000000
> [   12.096395] GPR28: c000000000385848 c00c0000001fffc0 0000000000004000 0000000000000000
> [   12.099882] NIP [c000000000396b38] kfree+0x338/0x3b0
> [   12.100135] LR [c000000000385848] section_deactivate+0x138/0x200
> [   12.100508] Call Trace:
> [   12.100927] [c000000073077910] [c0000000010599a8] 0xc0000000010599a8 (unreliable)
> [   12.101338] [c000000073077960] [c000000000385848] section_deactivate+0x138/0x200
> [   12.101696] [c000000073077a10] [c00000000039b9f4] __remove_pages+0x114/0x150
> [   12.102025] [c000000073077a60] [c00000000006793c] arch_remove_memory+0x3c/0x160
> [   12.102381] [c000000073077ae0] [c00000000039c154] try_remove_memory+0x114/0x1a0
> [   12.102715] [c000000073077b90] [c00000000039c020] __remove_memory+0x20/0x40
> [   12.103041] [c000000073077bb0] [c0000000000b6714] memtrace_enable_set+0x254/0x850
> [   12.103402] [c000000073077cb0] [c0000000004197e8] simple_attr_write+0x138/0x160
> [   12.103751] [c000000073077d10] [c000000000558c9c] full_proxy_write+0x8c/0x110
> [   12.104100] [c000000073077d60] [c0000000003d02a8] __vfs_write+0x38/0x70
> [   12.104409] [c000000073077d80] [c0000000003d3c5c] vfs_write+0x11c/0x2a0
> [   12.104711] [c000000073077dd0] [c0000000003d4054] ksys_write+0x84/0x140
> [   12.105011] [c000000073077e20] [c00000000000b594] system_call+0x5c/0x68
> [   12.105357] Instruction dump:
> [   12.105606] e93d0000 75290001 41820090 8bfd0051 38a0ffff 7ca5f830 7bff0020 7ca507b4
> [   12.105993] e95d0000 39200000 754a0001 4182005c <0b090000> 893d0007 3d42000b 38800006
> [   12.106583] ---[ end trace 4b053cbd84e0db62 ]---
> 
> The first invocation will offline+remove memory blocks. The second
> invocation will first add+online them again, in order to offline+remove
> them again (usually we are lucky and the exact same memory blocks will
> get "reallocated").
> 
> Tested on powernv with boot memory: The usage map will not get freed.
> Tested on x86-64 with DIMMs: The usage map will get freed.
> 
> Fixes: 326e1b8f83a4 ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/sparse.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index b20ab7cdac86..3822ecbd8a1f 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -777,7 +777,14 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>  	if (bitmap_empty(subsection_map, SUBSECTIONS_PER_SECTION)) {
>  		unsigned long section_nr = pfn_to_section_nr(pfn);
>  
> -		if (!section_is_early) {
> +		/*
> +		 * When removing an early section, the usage map is kept (as the
> +		 * usage maps of other sections fall into the same page). It
> +		 * will be re-used when re-adding the section - which is then no
> +		 * longer an early section. If the usage map is PageReserved, it
> +		 * was allocated during boot.
> +		 */
> +		if (!PageReserved(virt_to_page(ms->usage))) {
>  			kfree(ms->usage);
>  			ms->usage = NULL;
>  		}
> 

We have another reproducer that is used more frequently than memtrace :)

Using Dynamic Memory under a Power DLAPR can trigger it easily.
Triggering removal (I assume after previously removed+re-added) of
memory from the HMC GUI can crash the kernel with the same call trace
and is fixed by this patch.

CCing Pingfan who verified that this patch fixes the issue - maybe he
wants to provide a Tested-by:

@Andrew, any chance we can get this upstream soon-ish? Thanks!
Pingfan Liu Jan. 8, 2020, 9:11 a.m. UTC | #6
On Wed, Jan 8, 2020 at 4:24 PM David Hildenbrand <david@redhat.com> wrote:
>
>
> We have another reproducer that is used more frequently than memtrace :)
>
> Using Dynamic Memory under a Power DLAPR can trigger it easily.
> Triggering removal (I assume after previously removed+re-added) of
> memory from the HMC GUI can crash the kernel with the same call trace
> and is fixed by this patch.
>
> CCing Pingfan who verified that this patch fixes the issue - maybe he
> wants to provide a Tested-by:
Yes, this patch fixes the hot remove+add+remove bug on POWERVM, which
can be reproduced by dlpar.

Tested-by: Pingfan Liu <piliu@redhat.com>
>
> @Andrew, any chance we can get this upstream soon-ish? Thanks!
>
> --
> Thanks,
>
> David / dhildenb
>
diff mbox series

Patch

diff --git a/mm/sparse.c b/mm/sparse.c
index b20ab7cdac86..3822ecbd8a1f 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -777,7 +777,14 @@  static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 	if (bitmap_empty(subsection_map, SUBSECTIONS_PER_SECTION)) {
 		unsigned long section_nr = pfn_to_section_nr(pfn);
 
-		if (!section_is_early) {
+		/*
+		 * When removing an early section, the usage map is kept (as the
+		 * usage maps of other sections fall into the same page). It
+		 * will be re-used when re-adding the section - which is then no
+		 * longer an early section. If the usage map is PageReserved, it
+		 * was allocated during boot.
+		 */
+		if (!PageReserved(virt_to_page(ms->usage))) {
 			kfree(ms->usage);
 			ms->usage = NULL;
 		}