
[v1] drivers/base/memory.c: indicate all memory blocks as removable

Message ID 20200128093542.6908-1-david@redhat.com (mailing list archive)
State New, archived

Commit Message

David Hildenbrand Jan. 28, 2020, 9:35 a.m. UTC
We see multiple issues with the implementation/interface to compute
whether a memory block can be offlined (exposed via
/sys/devices/system/memory/memoryX/removable) and would like to simplify
it (remove the implementation).

1. It runs basically lockless. While this might be good for performance,
   we see possible races with memory offlining that will require at least
   some sort of locking to fix.

2. Nowadays, more false positives are possible. No arch-specific checks
   are performed to validate that memory offlining will not be denied
   right away (and such a check would require locking). For example, arm64
   does not allow offlining any memory block that was added during boot,
   which implies a very high error rate. Other archs have other
   constraints.

3. The interface is inherently racy. E.g., if a memory block is
   detected to be removable (and was not a false positive at that time),
   there is still no guarantee that offlining will actually succeed. So
   any caller already has to deal with false positives.

4. It is unclear which performance benefit this interface actually
   provides. The introducing commit 5c755e9fd813 ("memory-hotplug: add
   sysfs removable attribute for hotplug memory remove") mentioned
	"A user-level agent must be able to identify which sections of
	 memory are likely to be removable before attempting the
	 potentially expensive operation."
   However, no actual performance comparison was included.
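
To illustrate point 3, here is a minimal userspace sketch, assuming the
usual sysfs layout where writing "offline" to memoryX/state requests
offlining. The helper name try_offline_block() and its directory parameter
are hypothetical. Whatever "removable" reported earlier, the caller still
has to handle a failed attempt (for example -EBUSY):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to offline one memory block.
 * sysfs_memory_dir would normally be "/sys/devices/system/memory".
 * Returns 0 on success or a negative errno value on failure.
 */
static int try_offline_block(const char *sysfs_memory_dir, unsigned int block)
{
	static const char request[] = "offline";
	char path[256];
	int fd, len, err = 0;

	len = snprintf(path, sizeof(path), "%s/memory%u/state",
		       sysfs_memory_dir, block);
	if (len < 0 || (size_t)len >= sizeof(path))
		return -ENAMETOOLONG;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	/* The offline attempt itself is the only reliable test. */
	if (write(fd, request, strlen(request)) < 0)
		err = -errno;

	close(fd);
	return err;
}
```

A tool that wants N blocks offlined can simply iterate over candidate
blocks and move on to the next one whenever this returns an error.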

Known users:
- lsmem: Will group memory blocks based on the "removable" property. [1]
- chmem: Indirect user. It has a RANGE mode where one can specify
	 removable ranges identified via lsmem to be offlined. However, it
	 also has a "SIZE" mode, which allows a sysadmin to skip the manual
	 "identify removable blocks" step. [2]
- powerpc-utils: Uses the "removable" attribute to skip some memory
		 blocks right away when trying to find some to
		 offline+remove. However, with ballooning enabled, it
		 already skips this information completely (because it
		 once resulted in many false negatives). Therefore, the
		 implementation can deal with false positives properly
		 already. [3]

According to Nathan Fontenot, DLPAR on powerpc is no longer driven from
userspace via the drmgr command (powerpc-utils). It is now managed in
the kernel - including onlining/offlining of memory blocks - triggered
by drmgr writing to /sys/kernel/dlpar. So the affected legacy userspace
handling is only active on old kernels. Only very old versions of drmgr
on a new kernel (unlikely) might execute slower, which is totally
acceptable.

With CONFIG_MEMORY_HOTREMOVE, always indicating "removable" should not
break any user space tool; the attribute then effectively implements a
very bad heuristic. Without CONFIG_MEMORY_HOTREMOVE we cannot offline
anything, so report "not removable" as before.
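
The resulting attribute value is a compile-time constant. The following
userspace sketch is not the kernel's actual macro; it only shows, in
simplified form, how a possibly undefined config symbol can be turned into
0 or 1 without #ifdef. The SKETCH_* macros, the CONFIG_DEMO_* symbols and
removable_show_sketch() are hypothetical names, and the kernel's real
IS_ENABLED() additionally covers the =m (module) case:

```c
#include <stddef.h>
#include <stdio.h>

/* If the option expands to 1, this placeholder injects an extra argument. */
#define SKETCH_ARG_PLACEHOLDER_1 0,
#define SKETCH_TAKE_SECOND(ignored, val, ...) val
#define SKETCH_IS_DEFINED_3(arg1_or_junk) \
	SKETCH_TAKE_SECOND(arg1_or_junk 1, 0, 0)
#define SKETCH_IS_DEFINED_2(val) \
	SKETCH_IS_DEFINED_3(SKETCH_ARG_PLACEHOLDER_##val)
#define SKETCH_IS_ENABLED(option) SKETCH_IS_DEFINED_2(option)

/* Hypothetical stand-in for a config symbol that is set to 'y'. */
#define CONFIG_DEMO_HOTREMOVE 1
/* CONFIG_DEMO_MISSING is intentionally left undefined. */

/* Mirrors the shape of the patched show function: print a constant. */
static int removable_show_sketch(char *buf, size_t len)
{
	return snprintf(buf, len, "%d\n",
			(int)SKETCH_IS_ENABLED(CONFIG_DEMO_HOTREMOVE));
}
```

Because the macro expands to a plain integer expression, both branches of
the configuration are still seen by the compiler, which is the usual
argument for preferring this style over #ifdef blocks.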

Original discussion can be found in [4] ("[PATCH RFC v1] mm:
is_mem_section_removable() overhaul").

Other users of is_mem_section_removable() will be removed next, so that
we can remove is_mem_section_removable() completely.

[1] http://man7.org/linux/man-pages/man1/lsmem.1.html
[2] http://man7.org/linux/man-pages/man8/chmem.8.html
[3] https://github.com/ibm-power-utilities/powerpc-utils
[4] https://lkml.kernel.org/r/20200117105759.27905-1-david@redhat.com

Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: powerpc-utils-devel@googlegroups.com
Cc: util-linux@vger.kernel.org
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Karel Zak <kzak@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---

RFC -> v1:
- Use IS_ENABLED() instead of ifdefs
- Add information from Nathan (thanks!)

---
 drivers/base/memory.c | 23 +++--------------------
 1 file changed, 3 insertions(+), 20 deletions(-)

Comments

Fontenot, Nathan Jan. 31, 2020, 1:41 p.m. UTC | #1
On 1/28/2020 3:35 AM, David Hildenbrand wrote:
> We see multiple issues with the implementation/interface to compute
> whether a memory block can be offlined (exposed via
> /sys/devices/system/memory/memoryX/removable) and would like to simplify
> it (remove the implementation).
> 
> 1. It runs basically lockless. While this might be good for performance,
>    we see possible races with memory offlining that will require at least
>    some sort of locking to fix.
> 
> 2. Nowadays, more false positives are possible. No arch-specific checks
>    are performed that validate if memory offlining will not be denied
>    right away (and such check will require locking). For example, arm64
>    won't allow to offline any memory block that was added during boot -
>    which will imply a very high error rate. Other archs have other
>    constraints.
> 
> 3. The interface is inherently racy. E.g., if a memory block is
>    detected to be removable (and was not a false positive at that time),
>    there is still no guarantee that offlining will actually succeed. So
>    any caller already has to deal with false positives.
> 
> 4. It is unclear which performance benefit this interface actually
>    provides. The introducing commit 5c755e9fd813 ("memory-hotplug: add
>    sysfs removable attribute for hotplug memory remove") mentioned
> 	"A user-level agent must be able to identify which sections of
> 	 memory are likely to be removable before attempting the
> 	 potentially expensive operation."
>    However, no actual performance comparison was included.
> 
> Known users:
> - lsmem: Will group memory blocks based on the "removable" property. [1]
> - chmem: Indirect user. It has a RANGE mode where one can specify
> 	 removable ranges identified via lsmem to be offlined. However, it
> 	 also has a "SIZE" mode, which allows a sysadmin to skip the manual
> 	 "identify removable blocks" step. [2]
> - powerpc-utils: Uses the "removable" attribute to skip some memory
> 		 blocks right away when trying to find some to
> 		 offline+remove. However, with ballooning enabled, it
> 		 already skips this information completely (because it
> 		 once resulted in many false negatives). Therefore, the
> 		 implementation can deal with false positives properly
> 		 already. [3]
> 
> According to Nathan Fontenot, DLPAR on powerpc is nowadays no longer
> driven from userspace via the drmgr command (powerpc-utils). Nowadays
> it's managed in the kernel - including onlining/offlining of memory
> blocks - triggered by drmgr writing to /sys/kernel/dlpar. So the
> affected legacy userspace handling is only active on old kernels. Only very
> old versions of drmgr on a new kernel (unlikely) might execute slower -
> totally acceptable.
> 
> With CONFIG_MEMORY_HOTREMOVE, always indicating "removable" should not
> break any user space tool. We implement a very bad heuristic now.  Without
> CONFIG_MEMORY_HOTREMOVE we cannot offline anything, so report
> "not removable" as before.
> 
> Original discussion can be found in [4] ("[PATCH RFC v1] mm:
> is_mem_section_removable() overhaul").
> 
> Other users of is_mem_section_removable() will be removed next, so that
> we can remove is_mem_section_removable() completely.
> 
> [1] http://man7.org/linux/man-pages/man1/lsmem.1.html
> [2] http://man7.org/linux/man-pages/man8/chmem.8.html
> [3] https://github.com/ibm-power-utilities/powerpc-utils
> [4] https://lkml.kernel.org/r/20200117105759.27905-1-david@redhat.com
> 
> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: powerpc-utils-devel@googlegroups.com
> Cc: util-linux@vger.kernel.org
> Cc: Badari Pulavarty <pbadari@us.ibm.com>
> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
> Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Karel Zak <kzak@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Nathan Fontenot <ndfont@gmail.com>

> ---
> 
> RFC -> v1:
> - Use IS_ENABLED() instead of ifdefs
> - Add information from Nathan (thanks!)
> 
> ---
>  drivers/base/memory.c | 23 +++--------------------
>  1 file changed, 3 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 6503f5d0b749..9664be00a4de 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -105,30 +105,13 @@ static ssize_t phys_index_show(struct device *dev,
>  }
>  
>  /*
> - * Show whether the memory block is likely to be offlineable (or is already
> - * offline). Once offline, the memory block could be removed. The return
> - * value does, however, not indicate that there is a way to remove the
> - * memory block.
> + * Legacy interface that we cannot remove. Always indicate "removable"
> + * with CONFIG_MEMORY_HOTREMOVE - bad heuristic.
>   */
>  static ssize_t removable_show(struct device *dev, struct device_attribute *attr,
>  			      char *buf)
>  {
> -	struct memory_block *mem = to_memory_block(dev);
> -	unsigned long pfn;
> -	int ret = 1, i;
> -
> -	if (mem->state != MEM_ONLINE)
> -		goto out;
> -
> -	for (i = 0; i < sections_per_block; i++) {
> -		if (!present_section_nr(mem->start_section_nr + i))
> -			continue;
> -		pfn = section_nr_to_pfn(mem->start_section_nr + i);
> -		ret &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
> -	}
> -
> -out:
> -	return sprintf(buf, "%d\n", ret);
> +	return sprintf(buf, "%d\n", (int)IS_ENABLED(CONFIG_MEMORY_HOTREMOVE));
>  }
>  
>  /*
>
Dan Williams March 27, 2020, 6:24 a.m. UTC | #2
On Tue, Jan 28, 2020 at 1:44 AM David Hildenbrand <david@redhat.com> wrote:
>
> We see multiple issues with the implementation/interface to compute
> whether a memory block can be offlined (exposed via
> /sys/devices/system/memory/memoryX/removable) and would like to simplify
> it (remove the implementation).
>
> 1. It runs basically lockless. While this might be good for performance,
>    we see possible races with memory offlining that will require at least
>    some sort of locking to fix.
>
> 2. Nowadays, more false positives are possible. No arch-specific checks
>    are performed that validate if memory offlining will not be denied
>    right away (and such check will require locking). For example, arm64
>    won't allow to offline any memory block that was added during boot -
>    which will imply a very high error rate. Other archs have other
>    constraints.
>
> 3. The interface is inherently racy. E.g., if a memory block is
>    detected to be removable (and was not a false positive at that time),
>    there is still no guarantee that offlining will actually succeed. So
>    any caller already has to deal with false positives.
>
> 4. It is unclear which performance benefit this interface actually
>    provides. The introducing commit 5c755e9fd813 ("memory-hotplug: add
>    sysfs removable attribute for hotplug memory remove") mentioned
>         "A user-level agent must be able to identify which sections of
>          memory are likely to be removable before attempting the
>          potentially expensive operation."
>    However, no actual performance comparison was included.
>
> Known users:
> - lsmem: Will group memory blocks based on the "removable" property. [1]
> - chmem: Indirect user. It has a RANGE mode where one can specify
>          removable ranges identified via lsmem to be offlined. However, it
>          also has a "SIZE" mode, which allows a sysadmin to skip the manual
>          "identify removable blocks" step. [2]
> - powerpc-utils: Uses the "removable" attribute to skip some memory
>                  blocks right away when trying to find some to
>                  offline+remove. However, with ballooning enabled, it
>                  already skips this information completely (because it
>                  once resulted in many false negatives). Therefore, the
>                  implementation can deal with false positives properly
>                  already. [3]
>
> According to Nathan Fontenot, DLPAR on powerpc is nowadays no longer
> driven from userspace via the drmgr command (powerpc-utils). Nowadays
> it's managed in the kernel - including onlining/offlining of memory
> blocks - triggered by drmgr writing to /sys/kernel/dlpar. So the
> affected legacy userspace handling is only active on old kernels. Only very
> old versions of drmgr on a new kernel (unlikely) might execute slower -
> totally acceptable.
>
> With CONFIG_MEMORY_HOTREMOVE, always indicating "removable" should not
> break any user space tool. We implement a very bad heuristic now.  Without
> CONFIG_MEMORY_HOTREMOVE we cannot offline anything, so report
> "not removable" as before.
>
> Original discussion can be found in [4] ("[PATCH RFC v1] mm:
> is_mem_section_removable() overhaul").
>
> Other users of is_mem_section_removable() will be removed next, so that
> we can remove is_mem_section_removable() completely.
>
> [1] http://man7.org/linux/man-pages/man1/lsmem.1.html
> [2] http://man7.org/linux/man-pages/man8/chmem.8.html
> [3] https://github.com/ibm-power-utilities/powerpc-utils
> [4] https://lkml.kernel.org/r/20200117105759.27905-1-david@redhat.com
>
> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Cc: Dan Williams <dan.j.williams@intel.com>

David, Andrew,

I'd like to recommend this patch for -stable as it likely (test
underway) solves this crash report from Steve:

[  148.796036] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
[  148.796074] ------------[ cut here ]------------
[  148.796098] kernel BUG at include/linux/mm.h:1087!
[  148.796126] invalid opcode: 0000 [#1] SMP NOPTI
[  148.796146] CPU: 63 PID: 5471 Comm: lsmem Not tainted 5.5.10-200.fc31.x86_64+debug #1
[  148.796173] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.02.01.0010.010620200716 01/06/2020
[  148.796212] RIP: 0010:is_mem_section_removable+0x1a4/0x1b0
[  148.796561] Call Trace:
[  148.796591]  removable_show+0x6e/0xa0
[  148.796608]  dev_attr_show+0x19/0x40
[  148.796625]  sysfs_kf_seq_show+0xa9/0x100
[  148.796640]  seq_read+0xd5/0x450
[  148.796657]  vfs_read+0xc5/0x180
[  148.796672]  ksys_read+0x68/0xe0
[  148.796688]  do_syscall_64+0x5c/0xa0
[  148.796704]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  148.796721] RIP: 0033:0x7f3ab1646412

...on a non-debug kernel it just crashes.

In this case lsmem is failing when reading memory96:

openat(3, "memory96/removable", O_RDONLY|O_CLOEXEC) = 4
fcntl(4, F_GETFL)                       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
fstat(4, {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
read(4,  <unfinished ...>)              = ?
+++ killed by SIGSEGV +++
Segmentation fault (core dumped)

...which is phys_index 0x60 => memory address 0x3000000000

On this platform that lands us here:

100000000-303fffffff : System RAM
  291f000000-291fe00f70 : Kernel code
  2920000000-292051efff : Kernel rodata
  2920600000-292093b0bf : Kernel data
  29214f3000-2922dfffff : Kernel bss
3040000000-305fffffff : Reserved
3060000000-1aa5fffffff : Persistent Memory

...where the last memory block of System RAM is shared with persistent
memory. I.e. the block is only partially online which means that
page_to_nid() in is_mem_section_removable() will assert or crash for
some of the offline pages in that block.
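
The address arithmetic above can be sketched as follows. This is an
illustration only: the 2 GiB block size is an assumption inferred from
phys_index 0x60 mapping to 0x3000000000 (on a live system it can be read
from /sys/devices/system/memory/block_size_bytes), and struct range,
block_range() and fully_covered() are hypothetical names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Physical address range with an inclusive end, as in /proc/iomem. */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Range spanned by the memory block with the given phys_index. */
static struct range block_range(uint64_t phys_index, uint64_t block_size)
{
	struct range r;

	r.start = phys_index * block_size;
	r.end = r.start + block_size - 1;
	return r;
}

/* True only if every byte of the block lies inside the RAM resource. */
static bool fully_covered(struct range block, struct range ram)
{
	return ram.start <= block.start && block.end <= ram.end;
}
```

With the System RAM resource 100000000-303fffffff from the listing, block
0x60 spans 3000000000-307fffffff, so fully_covered() returns false: only
the first part of the block is System RAM.
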
Michal Hocko March 27, 2020, 7:47 a.m. UTC | #3
On Thu 26-03-20 23:24:08, Dan Williams wrote:
[...]
> David, Andrew,
> 
> I'd like to recommend this patch for -stable as it likely (test
> underway) solves this crash report from Steve:
> 
> [  148.796036] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> [  148.796074] ------------[ cut here ]------------
> [  148.796098] kernel BUG at include/linux/mm.h:1087!
> [  148.796126] invalid opcode: 0000 [#1] SMP NOPTI
> [  148.796146] CPU: 63 PID: 5471 Comm: lsmem Not tainted 5.5.10-200.fc31.x86_64+debug #1
> [  148.796173] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.02.01.0010.010620200716 01/06/2020
> [  148.796212] RIP: 0010:is_mem_section_removable+0x1a4/0x1b0
> [  148.796561] Call Trace:
> [  148.796591]  removable_show+0x6e/0xa0
> [  148.796608]  dev_attr_show+0x19/0x40
> [  148.796625]  sysfs_kf_seq_show+0xa9/0x100
> [  148.796640]  seq_read+0xd5/0x450
> [  148.796657]  vfs_read+0xc5/0x180
> [  148.796672]  ksys_read+0x68/0xe0
> [  148.796688]  do_syscall_64+0x5c/0xa0
> [  148.796704]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  148.796721] RIP: 0033:0x7f3ab1646412
> 
> ...on a non-debug kernel it just crashes.
> 
> In this case lsmem is failing when reading memory96:
> 
> openat(3, "memory96/removable", O_RDONLY|O_CLOEXEC) = 4
> fcntl(4, F_GETFL)                       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
> fstat(4, {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
> read(4,  <unfinished ...>)              = ?
> +++ killed by SIGSEGV +++
> Segmentation fault (core dumped)
> 
> ...which is phys_index 0x60 => memory address 0x3000000000
> 
> On this platform that lands us here:
> 
> 100000000-303fffffff : System RAM
>   291f000000-291fe00f70 : Kernel code
>   2920000000-292051efff : Kernel rodata
>   2920600000-292093b0bf : Kernel data
>   29214f3000-2922dfffff : Kernel bss
> 3040000000-305fffffff : Reserved
> 3060000000-1aa5fffffff : Persistent Memory

OK, 2GB memblocks, and that would mean [0x3000000000, 0x3080000000)

> ...where the last memory block of System RAM is shared with persistent
> memory. I.e. the block is only partially online which means that
> page_to_nid() in is_mem_section_removable() will assert or crash for
> some of the offline pages in that block.

Yes, this patch is a simple workaround. Normal memory hotplug will not
blow up because it should be able to find out that test_pages_in_a_zone()
returns false. Who knows how other potential pfn walkers handle that.

At the risk of sounding like a broken record, I will repeat that I have
been pushing for having _all_ existing struct pages initialized; then we
wouldn't have problems like this popping up here and there.

That being said, I do not have any objections to backporting to stable
trees.
David Hildenbrand March 27, 2020, 9 a.m. UTC | #4
On 27.03.20 08:47, Michal Hocko wrote:
> On Thu 26-03-20 23:24:08, Dan Williams wrote:
> [...]
>> David, Andrew,
>>
>> I'd like to recommend this patch for -stable as it likely (test
>> underway) solves this crash report from Steve:
>>
>> [  148.796036] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>> [  148.796074] ------------[ cut here ]------------
>> [  148.796098] kernel BUG at include/linux/mm.h:1087!
>> [  148.796126] invalid opcode: 0000 [#1] SMP NOPTI
>> [  148.796146] CPU: 63 PID: 5471 Comm: lsmem Not tainted 5.5.10-200.fc31.x86_64+debug #1
>> [  148.796173] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.02.01.0010.010620200716 01/06/2020
>> [  148.796212] RIP: 0010:is_mem_section_removable+0x1a4/0x1b0
>> [  148.796561] Call Trace:
>> [  148.796591]  removable_show+0x6e/0xa0
>> [  148.796608]  dev_attr_show+0x19/0x40
>> [  148.796625]  sysfs_kf_seq_show+0xa9/0x100
>> [  148.796640]  seq_read+0xd5/0x450
>> [  148.796657]  vfs_read+0xc5/0x180
>> [  148.796672]  ksys_read+0x68/0xe0
>> [  148.796688]  do_syscall_64+0x5c/0xa0
>> [  148.796704]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [  148.796721] RIP: 0033:0x7f3ab1646412
>>
>> ...on a non-debug kernel it just crashes.
>>
>> In this case lsmem is failing when reading memory96:
>>
>> openat(3, "memory96/removable", O_RDONLY|O_CLOEXEC) = 4
>> fcntl(4, F_GETFL)                       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
>> fstat(4, {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
>> read(4,  <unfinished ...>)              = ?
>> +++ killed by SIGSEGV +++
>> Segmentation fault (core dumped)
>>
>> ...which is phys_index 0x60 => memory address 0x3000000000
>>
>> On this platform that lands us here:
>>
>> 100000000-303fffffff : System RAM
>>   291f000000-291fe00f70 : Kernel code
>>   2920000000-292051efff : Kernel rodata
>>   2920600000-292093b0bf : Kernel data
>>   29214f3000-2922dfffff : Kernel bss
>> 3040000000-305fffffff : Reserved
>> 3060000000-1aa5fffffff : Persistent Memory
> 
> OK, 2GB memblocks and that would mean [0x3000000000, 0x3080000000]
> 
>> ...where the last memory block of System RAM is shared with persistent
>> memory. I.e. the block is only partially online which means that
>> page_to_nid() in is_mem_section_removable() will assert or crash for
>> some of the offline pages in that block.
> 
> Yes, this patch is a simple workaround. Normal memory hotplug will not
> blow up because it should be able to find out that test_pages_in_a_zone
> is false. Who knows how other potential pfn walkers handle that.

All other pfn walkers now correctly use pfn_to_online_page() - which
will also result in false positives in this scenario and is still to be
fixed by Dan IIRC. [1]

> 
> Risking to sound like a broken record I will remind that I have been
> pushing for having _all_ existing struct pages initialized and we
> wouldn't have problems like this popping out here and there.

The real issue is that we have uninitialized memmap within a section
that is marked to contain initialized/online memmap. As I said, even
pfn_to_online_page()/SECTION_IS_ONLINE does not help here.

Also, there is no way to mark devmem to have an initialized memmap
(something like SECTION_IS_ONLINE). I expressed my feeling about that
already.
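
A toy userspace model of the problem described above, not kernel code: a
section carries a single "online" flag, while only part of its memmap was
ever initialized. All names (toy_page, toy_section, TOY_POISON, ...) are
hypothetical, and the section-granular lookup only mimics the spirit of
pfn_to_online_page():

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGES_PER_SECTION 8
#define TOY_POISON 0xFF	/* marks a memmap entry that was never initialized */

struct toy_page {
	unsigned char nid;
};

struct toy_section {
	bool online;
	struct toy_page memmap[TOY_PAGES_PER_SECTION];
};

/* Mark the section online, but initialize only the first n entries. */
static void toy_section_init_partial(struct toy_section *s, size_t n)
{
	size_t i;

	s->online = true;
	for (i = 0; i < TOY_PAGES_PER_SECTION; i++)
		s->memmap[i].nid = (i < n) ? 0 : TOY_POISON;
}

/* Section-granular check: it knows nothing about individual entries. */
static struct toy_page *toy_pfn_to_online_page(struct toy_section *s,
					       size_t idx)
{
	if (!s->online || idx >= TOY_PAGES_PER_SECTION)
		return NULL;
	return &s->memmap[idx];
}

/* Count how many poisoned entries a walker is handed as "online". */
static size_t toy_count_poisoned_hits(struct toy_section *s)
{
	size_t i, hits = 0;

	for (i = 0; i < TOY_PAGES_PER_SECTION; i++) {
		struct toy_page *p = toy_pfn_to_online_page(s, i);

		if (p && p->nid == TOY_POISON)
			hits++;
	}
	return hits;
}
```

In this model a walker that trusts the per-section flag is still handed
every uninitialized entry of a partially populated section, which is the
kind of false positive discussed here.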

[1] contains a discussion of how it could be addressed.

> 
> That being said, I do not have any objections to backporting to stable
> trees.
> 

This one is one of the remaining places where we don't use
pfn_to_online_page(). So yeah, this patch shouldn't hurt.

[1] https://lkml.kernel.org/r/20191024120938.11237-1-david@redhat.com
Dan Williams March 27, 2020, 4:28 p.m. UTC | #5
On Fri, Mar 27, 2020 at 2:00 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 27.03.20 08:47, Michal Hocko wrote:
> > On Thu 26-03-20 23:24:08, Dan Williams wrote:
> > [...]
> >> David, Andrew,
> >>
> >> I'd like to recommend this patch for -stable as it likely (test
> >> underway) solves this crash report from Steve:
> >>
> >> [  148.796036] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> >> [  148.796074] ------------[ cut here ]------------
> >> [  148.796098] kernel BUG at include/linux/mm.h:1087!
> >> [  148.796126] invalid opcode: 0000 [#1] SMP NOPTI
> >> [  148.796146] CPU: 63 PID: 5471 Comm: lsmem Not tainted 5.5.10-200.fc31.x86_64+debug #1
> >> [  148.796173] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.02.01.0010.010620200716 01/06/2020
> >> [  148.796212] RIP: 0010:is_mem_section_removable+0x1a4/0x1b0
> >> [  148.796561] Call Trace:
> >> [  148.796591]  removable_show+0x6e/0xa0
> >> [  148.796608]  dev_attr_show+0x19/0x40
> >> [  148.796625]  sysfs_kf_seq_show+0xa9/0x100
> >> [  148.796640]  seq_read+0xd5/0x450
> >> [  148.796657]  vfs_read+0xc5/0x180
> >> [  148.796672]  ksys_read+0x68/0xe0
> >> [  148.796688]  do_syscall_64+0x5c/0xa0
> >> [  148.796704]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >> [  148.796721] RIP: 0033:0x7f3ab1646412
> >>
> >> ...on a non-debug kernel it just crashes.
> >>
> >> In this case lsmem is failing when reading memory96:
> >>
> >> openat(3, "memory96/removable", O_RDONLY|O_CLOEXEC) = 4
> >> fcntl(4, F_GETFL)                       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
> >> fstat(4, {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
> >> read(4,  <unfinished ...>)              = ?
> >> +++ killed by SIGSEGV +++
> >> Segmentation fault (core dumped)
> >>
> >> ...which is phys_index 0x60 => memory address 0x3000000000
> >>
> >> On this platform that lands us here:
> >>
> >> 100000000-303fffffff : System RAM
> >>   291f000000-291fe00f70 : Kernel code
> >>   2920000000-292051efff : Kernel rodata
> >>   2920600000-292093b0bf : Kernel data
> >>   29214f3000-2922dfffff : Kernel bss
> >> 3040000000-305fffffff : Reserved
> >> 3060000000-1aa5fffffff : Persistent Memory
> >
> > OK, 2GB memblocks and that would mean [0x3000000000, 0x3080000000]
> >
> >> ...where the last memory block of System RAM is shared with persistent
> >> memory. I.e. the block is only partially online which means that
> >> page_to_nid() in is_mem_section_removable() will assert or crash for
> >> some of the offline pages in that block.
> >
> > Yes, this patch is a simple workaround. Normal memory hotplug will not
> > blow up because it should be able to find out that test_pages_in_a_zone
> > is false. Who knows how other potential pfn walkers handle that.
>
> All other pfn walkers now correctly use pfn_to_online_page() - which
> will also result in false positives in this scenario and is still to be
> fixed by Dan IIRC. [1]

Sorry, it's been too long and this fell out of my cache. I also turned
away once the major fire in KVM was put out with special consideration
for devmem pages. What's left these days? ...besides
removable_show()?
David Hildenbrand March 27, 2020, 4:50 p.m. UTC | #6
On 27.03.20 17:28, Dan Williams wrote:
> On Fri, Mar 27, 2020 at 2:00 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 27.03.20 08:47, Michal Hocko wrote:
>>> On Thu 26-03-20 23:24:08, Dan Williams wrote:
>>> [...]
>>>> David, Andrew,
>>>>
>>>> I'd like to recommend this patch for -stable as it likely (test
>>>> underway) solves this crash report from Steve:
>>>>
>>>> [  148.796036] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>>> [  148.796074] ------------[ cut here ]------------
>>>> [  148.796098] kernel BUG at include/linux/mm.h:1087!
>>>> [  148.796126] invalid opcode: 0000 [#1] SMP NOPTI
>>>> [  148.796146] CPU: 63 PID: 5471 Comm: lsmem Not tainted 5.5.10-200.fc31.x86_64+debug #1
>>>> [  148.796173] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.02.01.0010.010620200716 01/06/2020
>>>> [  148.796212] RIP: 0010:is_mem_section_removable+0x1a4/0x1b0
>>>> [  148.796561] Call Trace:
>>>> [  148.796591]  removable_show+0x6e/0xa0
>>>> [  148.796608]  dev_attr_show+0x19/0x40
>>>> [  148.796625]  sysfs_kf_seq_show+0xa9/0x100
>>>> [  148.796640]  seq_read+0xd5/0x450
>>>> [  148.796657]  vfs_read+0xc5/0x180
>>>> [  148.796672]  ksys_read+0x68/0xe0
>>>> [  148.796688]  do_syscall_64+0x5c/0xa0
>>>> [  148.796704]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>> [  148.796721] RIP: 0033:0x7f3ab1646412
>>>>
>>>> ...on a non-debug kernel it just crashes.
>>>>
>>>> In this case lsmem is failing when reading memory96:
>>>>
>>>> openat(3, "memory96/removable", O_RDONLY|O_CLOEXEC) = 4
>>>> fcntl(4, F_GETFL)                       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
>>>> fstat(4, {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
>>>> read(4,  <unfinished ...>)              = ?
>>>> +++ killed by SIGSEGV +++
>>>> Segmentation fault (core dumped)
>>>>
>>>> ...which is phys_index 0x60 => memory address 0x3000000000
>>>>
>>>> On this platform that lands us here:
>>>>
>>>> 100000000-303fffffff : System RAM
>>>>   291f000000-291fe00f70 : Kernel code
>>>>   2920000000-292051efff : Kernel rodata
>>>>   2920600000-292093b0bf : Kernel data
>>>>   29214f3000-2922dfffff : Kernel bss
>>>> 3040000000-305fffffff : Reserved
>>>> 3060000000-1aa5fffffff : Persistent Memory
>>>
>>> OK, 2GB memblocks and that would mean [0x3000000000, 0x3080000000]
>>>
>>>> ...where the last memory block of System RAM is shared with persistent
>>>> memory. I.e. the block is only partially online which means that
>>>> page_to_nid() in is_mem_section_removable() will assert or crash for
>>>> some of the offline pages in that block.
>>>
>>> Yes, this patch is a simple workaround. Normal memory hotplug will not
>>> blow up because it should be able to find out that test_pages_in_a_zone
>>> is false. Who knows how other potential pfn walkers handle that.
>>
>> All other pfn walkers now correctly use pfn_to_online_page() - which
>> will also result in false positives in this scenario and is still to be
>> fixed by Dan IIRC. [1]
> 
> Sorry, it's been too long and this fell out of my cache. I also turned
> away once the major fire in KVM was put out with special consideration
> for devmem pages. What's left these days? ...besides
> removable_show()?

Essentially any pfn_to_online_page() is a candidate.

E.g.,

mm/memory-failure.c:memory_failure()

is obviously broken (could be worked around)

Also

mm/memory-failure.c:soft_offline_page()

is obviously broken.


Also set_zone_contiguous()->__pageblock_pfn_to_page() is broken, when it
checks for "page_zone(start_page) != zone" if the memmap contains garbage.

And I only checked a handful of examples.
Dan Williams March 27, 2020, 10:13 p.m. UTC | #7
On Fri, Mar 27, 2020 at 9:50 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 27.03.20 17:28, Dan Williams wrote:
> > On Fri, Mar 27, 2020 at 2:00 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 27.03.20 08:47, Michal Hocko wrote:
> >>> On Thu 26-03-20 23:24:08, Dan Williams wrote:
> >>> [...]
> >>>> David, Andrew,
> >>>>
> >>>> I'd like to recommend this patch for -stable as it likely (test
> >>>> underway) solves this crash report from Steve:
> >>>>
> >>>> [  148.796036] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> >>>> [  148.796074] ------------[ cut here ]------------
> >>>> [  148.796098] kernel BUG at include/linux/mm.h:1087!
> >>>> [  148.796126] invalid opcode: 0000 [#1] SMP NOPTI
> >>>> [  148.796146] CPU: 63 PID: 5471 Comm: lsmem Not tainted 5.5.10-200.fc31.x86_64+debug #1
> >>>> [  148.796173] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.02.01.0010.010620200716 01/06/2020
> >>>> [  148.796212] RIP: 0010:is_mem_section_removable+0x1a4/0x1b0
> >>>> [  148.796561] Call Trace:
> >>>> [  148.796591]  removable_show+0x6e/0xa0
> >>>> [  148.796608]  dev_attr_show+0x19/0x40
> >>>> [  148.796625]  sysfs_kf_seq_show+0xa9/0x100
> >>>> [  148.796640]  seq_read+0xd5/0x450
> >>>> [  148.796657]  vfs_read+0xc5/0x180
> >>>> [  148.796672]  ksys_read+0x68/0xe0
> >>>> [  148.796688]  do_syscall_64+0x5c/0xa0
> >>>> [  148.796704]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>>> [  148.796721] RIP: 0033:0x7f3ab1646412
> >>>>
> >>>> ...on a non-debug kernel it just crashes.
> >>>>
> >>>> In this case lsmem is failing when reading memory96:
> >>>>
> >>>> openat(3, "memory96/removable", O_RDONLY|O_CLOEXEC) = 4
> >>>> fcntl(4, F_GETFL)                       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
> >>>> fstat(4, {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
> >>>> read(4,  <unfinished ...>)              = ?
> >>>> +++ killed by SIGSEGV +++
> >>>> Segmentation fault (core dumped)
> >>>>
> >>>> ...which is phys_index 0x60 => memory address 0x3000000000
> >>>>
> >>>> On this platform that lands us here:
> >>>>
> >>>> 100000000-303fffffff : System RAM
> >>>>   291f000000-291fe00f70 : Kernel code
> >>>>   2920000000-292051efff : Kernel rodata
> >>>>   2920600000-292093b0bf : Kernel data
> >>>>   29214f3000-2922dfffff : Kernel bss
> >>>> 3040000000-305fffffff : Reserved
> >>>> 3060000000-1aa5fffffff : Persistent Memory
> >>>
> >>> OK, 2GB memblocks and that would mean [0x3000000000, 0x3080000000]
> >>>
> >>>> ...where the last memory block of System RAM is shared with persistent
> >>>> memory. I.e. the block is only partially online which means that
> >>>> page_to_nid() in is_mem_section_removable() will assert or crash for
> >>>> some of the offline pages in that block.
> >>>
> >>> Yes, this patch is a simple workaround. Normal memory hotplug will not
> >>> blow up because it should be able to find out that test_pages_in_a_zone
> >>> is false. Who knows how other potential pfn walkers handle that.
> >>
> >> All other pfn walkers now correctly use pfn_to_online_page() - which
> >> will also result in false positives in this scenario and is still to be
> >> fixed by Dan IIRC. [1]
> >
> > Sorry, it's been too long and this fell out of my cache. I also turned
> > away once the major fire in KVM was put out with special consideration
> > for devmem pages. What's left these days? ...besides
> > removable_show()?
>
> Essentially any pfn_to_online_page() is a candidate.
>
> E.g.,
>
> mm/memory-failure.c:memory_failure()
>
> is obviously broken (could be worked around)

Ooh, the current state looks worse than when I looked previously. I
wasn't copied on commit 96c804a6ae8c ("mm/memory-failure.c: don't
access uninitialized memmaps in memory_failure()"). That commit seems
to ensure the pmem errors in memory sections that overlap with
System-RAM are not handled. So that change looks broken to me.
Previously get_dev_pagemap() was sufficient protection.

>
> Also
>
> mm/memory-failure.c:soft_offline_page()
>
> is obviously broken.

How exactly? The soft_offline_page() callers seem to already account
for System-RAM vs devmem.

>
>
> Also set_zone_contiguous()->__pageblock_pfn_to_page() is broken, when it
> checks for "page_zone(start_page) != zone" if the memmap contains garbage.
>
> And I only checked a handful of examples.

Ok, but as the first example shows, in the absence of a problem report
these pre-emptive changes might make things worse, so I don't think
it's as simple as instrumenting all the pfn_to_online_page() users.
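
For reference, the block arithmetic quoted above (phys_index 0x60 with
2GB memory blocks) can be checked with a small user-space sketch. This
is only an illustration, not kernel code: the toy_* names are made up
for this example, the 2 GiB block size is the value assumed in the
thread, and the ranges are the top-level /proc/iomem entries quoted
above.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption taken from the thread: 2 GiB memory blocks on this machine. */
#define TOY_MEMORY_BLOCK_SIZE	(2ULL << 30)

/* Inclusive byte range, in the style of /proc/iomem (hypothetical type). */
struct toy_range {
	uint64_t start;
	uint64_t end;
};

/* Physical range covered by the memory block with the given phys_index. */
static struct toy_range toy_block_range(uint64_t phys_index)
{
	struct toy_range r;

	r.start = phys_index * TOY_MEMORY_BLOCK_SIZE;
	r.end = r.start + TOY_MEMORY_BLOCK_SIZE - 1;
	return r;
}

/* Number of bytes that two inclusive ranges have in common. */
static uint64_t toy_overlap(struct toy_range a, struct toy_range b)
{
	uint64_t start = a.start > b.start ? a.start : b.start;
	uint64_t end = a.end < b.end ? a.end : b.end;

	return start <= end ? end - start + 1 : 0;
}

/* Print how much of one block each quoted top-level iomem entry covers. */
static void toy_print_block_layout(uint64_t phys_index)
{
	static const struct {
		const char *name;
		struct toy_range r;
	} iomem[] = {
		{ "System RAM",        { 0x100000000ULL,  0x303fffffffULL } },
		{ "Reserved",          { 0x3040000000ULL, 0x305fffffffULL } },
		{ "Persistent Memory", { 0x3060000000ULL, 0x1aa5fffffffULL } },
	};
	struct toy_range blk = toy_block_range(phys_index);
	size_t i;

	printf("memory%llu: 0x%llx-0x%llx\n",
	       (unsigned long long)phys_index,
	       (unsigned long long)blk.start,
	       (unsigned long long)blk.end);
	for (i = 0; i < sizeof(iomem) / sizeof(iomem[0]); i++)
		printf("  %-17s %llu MiB\n", iomem[i].name,
		       (unsigned long long)(toy_overlap(blk, iomem[i].r) >> 20));
}
```

Calling toy_print_block_layout(0x60) reports a block starting at
0x3000000000 that holds 1024 MiB of System RAM, 512 MiB of Reserved
space and 512 MiB of Persistent Memory, i.e. only half of memory96 is
actually online RAM.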
David Hildenbrand March 27, 2020, 10:42 p.m. UTC | #8
> On 27.03.2020 at 23:13, Dan Williams <dan.j.williams@intel.com> wrote:
> 
> On Fri, Mar 27, 2020 at 9:50 AM David Hildenbrand <david@redhat.com> wrote:
>> 
>>> On 27.03.20 17:28, Dan Williams wrote:
>>> On Fri, Mar 27, 2020 at 2:00 AM David Hildenbrand <david@redhat.com> wrote:
>>>> 
>>>> On 27.03.20 08:47, Michal Hocko wrote:
>>>>> On Thu 26-03-20 23:24:08, Dan Williams wrote:
>>>>> [...]
>>>>>> David, Andrew,
>>>>>> 
>>>>>> I'd like to recommend this patch for -stable as it likely (test
>>>>>> underway) solves this crash report from Steve:
>>>>>> 
>>>>>> [  148.796036] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>>>>> [  148.796074] ------------[ cut here ]------------
>>>>>> [  148.796098] kernel BUG at include/linux/mm.h:1087!
>>>>>> [  148.796126] invalid opcode: 0000 [#1] SMP NOPTI
>>>>>> [  148.796146] CPU: 63 PID: 5471 Comm: lsmem Not tainted 5.5.10-200.fc31.x86_64+debug #1
>>>>>> [  148.796173] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.02.01.0010.010620200716 01/06/2020
>>>>>> [  148.796212] RIP: 0010:is_mem_section_removable+0x1a4/0x1b0
>>>>>> [  148.796561] Call Trace:
>>>>>> [  148.796591]  removable_show+0x6e/0xa0
>>>>>> [  148.796608]  dev_attr_show+0x19/0x40
>>>>>> [  148.796625]  sysfs_kf_seq_show+0xa9/0x100
>>>>>> [  148.796640]  seq_read+0xd5/0x450
>>>>>> [  148.796657]  vfs_read+0xc5/0x180
>>>>>> [  148.796672]  ksys_read+0x68/0xe0
>>>>>> [  148.796688]  do_syscall_64+0x5c/0xa0
>>>>>> [  148.796704]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>>> [  148.796721] RIP: 0033:0x7f3ab1646412
>>>>>> 
>>>>>> ...on a non-debug kernel it just crashes.
>>>>>> 
>>>>>> In this case lsmem is failing when reading memory96:
>>>>>> 
>>>>>> openat(3, "memory96/removable", O_RDONLY|O_CLOEXEC) = 4
>>>>>> fcntl(4, F_GETFL)                       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
>>>>>> fstat(4, {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
>>>>>> read(4,  <unfinished ...>)              = ?
>>>>>> +++ killed by SIGSEGV +++
>>>>>> Segmentation fault (core dumped)
>>>>>> 
>>>>>> ...which is phys_index 0x60 => memory address 0x3000000000
>>>>>> 
>>>>>> On this platform that lands us here:
>>>>>> 
>>>>>> 100000000-303fffffff : System RAM
>>>>>>  291f000000-291fe00f70 : Kernel code
>>>>>>  2920000000-292051efff : Kernel rodata
>>>>>>  2920600000-292093b0bf : Kernel data
>>>>>>  29214f3000-2922dfffff : Kernel bss
>>>>>> 3040000000-305fffffff : Reserved
>>>>>> 3060000000-1aa5fffffff : Persistent Memory
>>>>> 
>>>>> OK, 2GB memblocks and that would mean [0x3000000000, 0x3080000000]
>>>>> 
>>>>>> ...where the last memory block of System RAM is shared with persistent
>>>>>> memory. I.e. the block is only partially online which means that
>>>>>> page_to_nid() in is_mem_section_removable() will assert or crash for
>>>>>> some of the offline pages in that block.
>>>>> 
>>>>> Yes, this patch is a simple workaround. Normal memory hotplug will not
>>>>> blow up because it should be able to find out that test_pages_in_a_zone
>>>>> is false. Who knows how other potential pfn walkers handle that.
>>>> 
>>>> All other pfn walkers now correctly use pfn_to_online_page() - which
>>>> will also result in false positives in this scenario and is still to be
>>>> fixed by Dan IIRC. [1]
>>> 
>>> Sorry, it's been too long and this fell out of my cache. I also turned
>>> away once the major fire in KVM was put out with special consideration
>>> for devmem pages. What's left these days? ...besides
>>> removable_show()?
>> 
>> Essentially any pfn_to_online_page() is a candidate.
>> 
>> E.g.,
>> 
>> mm/memory-failure.c:memory_failure()
>> 
>> is obviously broken (could be worked around)
> 
> Ooh, the current state looks worse than when I looked previously. I
> wasn't copied on commit 96c804a6ae8c ("mm/memory-failure.c: don't
> access uninitialized memmaps in memory_failure()"). That commit seems
> to ensure the pmem errors in memory sections that overlap with
> System-RAM are not handled. So that change looks broken to me.
> Previously get_dev_pagemap() was sufficient protection.
> 

Well, it went in before we learned that pfn_to_online_page() has been broken in corner cases since sub-section hot-add was introduced.
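
To illustrate the kind of corner case meant here, below is a toy
user-space model, not kernel code. It assumes the x86-64 geometry
(4 KiB pages, 128 MiB sections, 2 MiB sub-sections) and a hypothetical
section whose upper half was hot-added as devmem. The struct and the
toy_* helpers are invented for this sketch; the model only captures
the idea that an "online" flag kept once per section cannot tell the
two halves apart.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86-64 geometry; other architectures use other values. */
#define PAGE_SHIFT		12
#define SECTION_SIZE_BITS	27	/* 128 MiB sections */
#define SUBSECTION_SHIFT	21	/* 2 MiB sub-sections */
#define PFN_SUBSECTION_SHIFT	(SUBSECTION_SHIFT - PAGE_SHIFT)
#define SUBSECTIONS_PER_SECTION	(1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT))

/* Toy stand-in for one memory section (hypothetical, not struct mem_section). */
struct toy_section {
	bool online;			/* tracked once for the whole section */
	uint64_t devmem_subsections;	/* bit n set: sub-section n is devmem */
};

/* Index of the sub-section that a pfn falls into within its section. */
static unsigned int toy_pfn_to_subsection(uint64_t pfn)
{
	return (pfn >> PFN_SUBSECTION_SHIFT) & (SUBSECTIONS_PER_SECTION - 1);
}

/* Section-granular check: every pfn of an online section looks online. */
static bool toy_pfn_online_coarse(const struct toy_section *s, uint64_t pfn)
{
	(void)pfn;	/* the pfn cannot influence a per-section answer */
	return s->online;
}

/* Sub-section-aware check: one hypothetical way to avoid the false positive. */
static bool toy_pfn_online_fine(const struct toy_section *s, uint64_t pfn)
{
	if (!s->online)
		return false;
	return !(s->devmem_subsections & (1ULL << toy_pfn_to_subsection(pfn)));
}
```

With the upper 32 sub-sections marked as devmem, the coarse check
still reports a pfn in sub-section 40 as online, while the fine check
rejects it. That is the false positive a pfn walker would then act on.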


>> 
>> Also
>> 
>> mm/memory-failure.c:soft_offline_page()
>> 
>> is obviously broken.
> 
> How exactly? The soft_offline_page() callers seem to already account
> for System-RAM vs devmem.

Then my quick scan was maybe wrong :)

> 
>> 
>> 
>> Also set_zone_contiguous()->__pageblock_pfn_to_page() is broken, when it
>> checks for "page_zone(start_page) != zone" if the memmap contains garbage.
>> 
>> And I only checked a handful of examples.
> 
> Ok, but as the first example shows, in the absence of a problem report
> these pre-emptive changes might make things worse, so I don't think
> it's as simple as instrumenting all the pfn_to_online_page() users.
> 

IMHO, fixing pfn_to_online_page() is the right thing to do, rather than working around the false positives it may return.

Patch

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 6503f5d0b749..9664be00a4de 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -105,30 +105,13 @@  static ssize_t phys_index_show(struct device *dev,
 }
 
 /*
- * Show whether the memory block is likely to be offlineable (or is already
- * offline). Once offline, the memory block could be removed. The return
- * value does, however, not indicate that there is a way to remove the
- * memory block.
+ * Legacy interface that we cannot remove. Always indicate "removable"
+ * with CONFIG_MEMORY_HOTREMOVE - bad heuristic.
  */
 static ssize_t removable_show(struct device *dev, struct device_attribute *attr,
 			      char *buf)
 {
-	struct memory_block *mem = to_memory_block(dev);
-	unsigned long pfn;
-	int ret = 1, i;
-
-	if (mem->state != MEM_ONLINE)
-		goto out;
-
-	for (i = 0; i < sections_per_block; i++) {
-		if (!present_section_nr(mem->start_section_nr + i))
-			continue;
-		pfn = section_nr_to_pfn(mem->start_section_nr + i);
-		ret &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
-	}
-
-out:
-	return sprintf(buf, "%d\n", ret);
+	return sprintf(buf, "%d\n", (int)IS_ENABLED(CONFIG_MEMORY_HOTREMOVE));
 }
 
 /*