Message ID | 20250410125110.1232329-1-gshan@redhat.com (mailing list archive)
---|---
State | New
Series | drivers/base/memory: Avoid overhead from for_each_present_section_nr()
On Thu, Apr 10, 2025 at 10:51:10PM +1000, Gavin Shan wrote:
> for_each_present_section_nr() was introduced to add_boot_memory_block()
> by commit 61659efdb35c ("drivers/base/memory: improve add_boot_memory_block()").
> It causes unnecessary overhead when the present sections are really
> sparse. next_present_section_nr() called by the macro to find the next
> present section, which is far away from the spanning sections in the
> specified block. Too much time consumed by next_present_section_nr()
> in this case, which can lead to softlockup as observed by Aditya Gupta
> on IBM Power10 machine.
>
> [...]
>
> Avoid the overhead by folding for_each_present_section_nr() to the outer
> loop. add_boot_memory_block() is dropped after that.
>
> Fixes: 61659efdb35c ("drivers/base/memory: improve add_boot_memory_block()")
> Closes: https://lore.kernel.org/linux-mm/20250409180344.477916-1-adityag@linux.ibm.com
> Reported-by: Aditya Gupta <adityag@linux.ibm.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>

Acked-by: Oscar Salvador <osalvador@suse.de>
On 10.04.25 14:51, Gavin Shan wrote:
> for_each_present_section_nr() was introduced to add_boot_memory_block()
> by commit 61659efdb35c ("drivers/base/memory: improve add_boot_memory_block()").
> It causes unnecessary overhead when the present sections are really
> sparse. next_present_section_nr() called by the macro to find the next
> present section, which is far away from the spanning sections in the
> specified block. Too much time consumed by next_present_section_nr()
> in this case, which can lead to softlockup as observed by Aditya Gupta
> on IBM Power10 machine.
>
> [...]
>
> Avoid the overhead by folding for_each_present_section_nr() to the outer
> loop. add_boot_memory_block() is dropped after that.
>
> [...]
>
>  	/*
> -	 * Create entries for memory sections that were found
> -	 * during boot and have been initialized
> +	 * Create entries for memory sections that were found during boot
> +	 * and have been initialized. Use @block_id to track the last
> +	 * handled block and initialize it to an invalid value (ULONG_MAX)
> +	 * to bypass the block ID matching check for the first present
> +	 * block so that it can be covered.
>  	 */
>
> [...]

Staring at the end result and the particularly long comment, are we now
really any better than before 61659efdb35c? Revert instead?
On Thu, Apr 10, 2025 at 03:18:00PM +0200, David Hildenbrand wrote:
> Staring at the end result and the particularly long comment, are we now
> really any better than before 61659efdb35c?

I think we are.

I mean, we made it slightly worse with 61659efdb35c because of what I
explained in the error report, but I think this version is faster than the
code before 61659efdb35c, as before that the outer loop was incremented by
1 each time, meaning that the section we were passing to
add_boot_memory_block() could already have been checked in there for
memory-blocks spanning multiple sections.

All in all, I think we are better, and the code is slightly simpler?
On Thu, Apr 10, 2025 at 03:55:19PM +0200, Oscar Salvador wrote:
> All in all, I think we are better, and the code is slightly simpler?
One thing to notice is that maybe we could further improve and leap 'nr'
by the number of sections_per_block, so in those scenarios where
a memory-block spans multiple sections this could be faster?
Just a thought, and maybe not worth it.
In the end, we have paid more than once the price of trying to be too smart
wrt. sections and boot tricks :-).
On 10.04.25 15:55, Oscar Salvador wrote:
> On Thu, Apr 10, 2025 at 03:18:00PM +0200, David Hildenbrand wrote:
>> Staring at the end result and the particularly long comment, are we now
>> really any better than before 61659efdb35c?
>
> I think we are.

I think you are right. The whole comment can IMHO be heavily simplified.
Makes it sound more complicated than it actually is ...

/*
 * Create memory blocks to span all present memory sections. Take care
 * of memory blocks that span multiple sections.
 */

Acked-by: David Hildenbrand <david@redhat.com>
On 10.04.25 16:12, Oscar Salvador wrote:
> On Thu, Apr 10, 2025 at 03:55:19PM +0200, Oscar Salvador wrote:
>> All in all, I think we are better, and the code is slightly simpler?
>
> One thing to notice is that maybe we could further improve and leap 'nr'
> by the number of sections_per_block, so in those scenarios where
> a memory-block spans multiple sections this could be faster?

Essentially, when we created a block we could always start with the next
section that starts after the block.
On 25/04/10 10:51PM, Gavin Shan wrote:
> for_each_present_section_nr() was introduced to add_boot_memory_block()
> by commit 61659efdb35c ("drivers/base/memory: improve add_boot_memory_block()").
> It causes unnecessary overhead when the present sections are really
> sparse.
>
> [...]
>
> Fixes: 61659efdb35c ("drivers/base/memory: improve add_boot_memory_block()")
> Closes: https://lore.kernel.org/linux-mm/20250409180344.477916-1-adityag@linux.ibm.com
> Reported-by: Aditya Gupta <adityag@linux.ibm.com>
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
> drivers/base/memory.c | 41 +++++++++++++++++------------------------
> 1 file changed, 17 insertions(+), 24 deletions(-)

Thanks for the fix, Gavin. Tested on the PowerNV Power10 hardware on which
the issue was originally seen; the patch fixes the softlockup issue. Hence,

Tested-by: Aditya Gupta <adityag@linux.ibm.com>

Thanks,
- Aditya G
On 4/11/25 12:25 AM, David Hildenbrand wrote:
> On 10.04.25 16:12, Oscar Salvador wrote:
>> On Thu, Apr 10, 2025 at 03:55:19PM +0200, Oscar Salvador wrote:
>>> All in all, I think we are better, and the code is slightly simpler?
>>
>> One thing to notice is that maybe we could further improve and leap 'nr'
>> by the number of sections_per_block, so in those scenarios where
>> a memory-block spans multiple sections this could be faster?
>
> Essentially, when we created a block we could always start with the next
> section that starts after the block.

I think it's a good point. I tried a quick test on an ARM64 machine whose
memory capacity is 1TB. Leaping 'nr' by 'sections_per_block' improves the
performance a bit, even if it's not too much: the time taken by
memory_dev_init() drops from 110ms to 100ms. For the IBM Power9 machine
(64GB memory) I have, there is not much room for improvement because the
time taken by memory_dev_init() is only 10ms. I will post a patch for
review after this patch gets merged, if you agree.

	for_each_present_section_nr(0, nr) {
-		if (block_id != ULONG_MAX && memory_block_id(nr) == block_id)
-			continue;
-
-		block_id = memory_block_id(nr);
-		ret = add_memory_block(block_id, MEM_ONLINE, NULL, NULL);
+		ret = add_memory_block(memory_block_id(nr), MEM_ONLINE, NULL, NULL);
		if (ret) {
			panic("%s() failed to add memory block: %d\n",
			      __func__, ret);
		}
+
+		/* Align to next block, minus one section */
+		nr = ALIGN(nr + 1, sections_per_block) - 1;
	}

Thanks,
Gavin
On 11.04.25 07:04, Gavin Shan wrote:
> On 4/11/25 12:25 AM, David Hildenbrand wrote:
>> On 10.04.25 16:12, Oscar Salvador wrote:
>>> On Thu, Apr 10, 2025 at 03:55:19PM +0200, Oscar Salvador wrote:
>>>> All in all, I think we are better, and the code is slightly simpler?
>>>
>>> One thing to notice is that maybe we could further improve and leap 'nr'
>>> by the number of sections_per_block, so in those scenarios where
>>> a memory-block spans multiple sections this could be faster?
>>
>> Essentially, when we created a block we could always start with the next
>> section that starts after the block.
>
> I think it's a good point. I tried a quick test on an ARM64 machine whose
> memory capacity is 1TB. Leaping 'nr' by 'sections_per_block' improves the
> performance a bit, even if it's not too much: the time taken by
> memory_dev_init() drops from 110ms to 100ms. For the IBM Power9 machine
> (64GB memory) I have, there is not much room for improvement because the
> time taken by memory_dev_init() is only 10ms. I will post a patch for
> review after this patch gets merged, if you agree.
>
> [...]
> +
> +		/* Align to next block, minus one section */

/*
 * Forward to the last section in this block, so we'll process the first
 * section of the next block in the next iteration.
 */

> +		nr = ALIGN(nr + 1, sections_per_block) - 1;

Yeah, that should work.
On Fri, Apr 11, 2025 at 03:04:16PM +1000, Gavin Shan wrote:
> I think it's a good point. I tried a quick test on an ARM64 machine whose
> memory capacity is 1TB. Leaping 'nr' by 'sections_per_block' improves the
> performance a bit, even if it's not too much: the time taken by
> memory_dev_init() drops from 110ms to 100ms. For the IBM Power9 machine
> (64GB memory) I have, there is not much room for improvement because the
> time taken by memory_dev_init() is only 10ms. I will post a patch for
> review after this patch gets merged, if you agree.

I have a patch that looks pretty much the same, because I wanted to try it
out after commenting it to David, to see the "gains". On an x86 system with
100GB and memory blocks spanning 8 sections, I saw a gain of 12us. Of
course, that kind of accumulates wrt. memory capacity.
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 8f3a41d9bfaa..19469e7f88c2 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -816,21 +816,6 @@ static int add_memory_block(unsigned long block_id, unsigned long state,
 	return 0;
 }
 
-static int __init add_boot_memory_block(unsigned long base_section_nr)
-{
-	unsigned long nr;
-
-	for_each_present_section_nr(base_section_nr, nr) {
-		if (nr >= (base_section_nr + sections_per_block))
-			break;
-
-		return add_memory_block(memory_block_id(base_section_nr),
-					MEM_ONLINE, NULL, NULL);
-	}
-
-	return 0;
-}
-
 static int add_hotplug_memory_block(unsigned long block_id,
 				    struct vmem_altmap *altmap,
 				    struct memory_group *group)
@@ -957,7 +942,7 @@ static const struct attribute_group *memory_root_attr_groups[] = {
 void __init memory_dev_init(void)
 {
 	int ret;
-	unsigned long block_sz, nr;
+	unsigned long block_sz, block_id, nr;
 
 	/* Validate the configured memory block size */
 	block_sz = memory_block_size_bytes();
@@ -970,15 +955,23 @@ void __init memory_dev_init(void)
 		panic("%s() failed to register subsystem: %d\n", __func__, ret);
 
 	/*
-	 * Create entries for memory sections that were found
-	 * during boot and have been initialized
+	 * Create entries for memory sections that were found during boot
+	 * and have been initialized. Use @block_id to track the last
+	 * handled block and initialize it to an invalid value (ULONG_MAX)
+	 * to bypass the block ID matching check for the first present
+	 * block so that it can be covered.
 	 */
-	for (nr = 0; nr <= __highest_present_section_nr;
-	     nr += sections_per_block) {
-		ret = add_boot_memory_block(nr);
-		if (ret)
-			panic("%s() failed to add memory block: %d\n", __func__,
-			      ret);
+	block_id = ULONG_MAX;
+	for_each_present_section_nr(0, nr) {
+		if (block_id != ULONG_MAX && memory_block_id(nr) == block_id)
+			continue;
+
+		block_id = memory_block_id(nr);
+		ret = add_memory_block(block_id, MEM_ONLINE, NULL, NULL);
+		if (ret) {
+			panic("%s() failed to add memory block: %d\n",
+			      __func__, ret);
+		}
 	}
 }
for_each_present_section_nr() was introduced to add_boot_memory_block()
by commit 61659efdb35c ("drivers/base/memory: improve add_boot_memory_block()").
It causes unnecessary overhead when the present sections are really
sparse: next_present_section_nr(), called by the macro to find the next
present section, may have to scan far beyond the sections spanned by the
specified block. Too much time is consumed by next_present_section_nr()
in this case, which can lead to a softlockup, as observed by Aditya Gupta
on an IBM Power10 machine.

  watchdog: BUG: soft lockup - CPU#248 stuck for 22s! [swapper/248:1]
  Modules linked in:
  CPU: 248 UID: 0 PID: 1 Comm: swapper/248 Not tainted 6.15.0-rc1-next-20250408 #1 VOLUNTARY
  Hardware name: 9105-22A POWER10 (raw) 0x800200 opal:v7.1-107-gfda75d121942 PowerNV
  NIP: c00000000209218c LR: c000000002092204 CTR: 0000000000000000
  REGS: c00040000418fa30 TRAP: 0900 Not tainted (6.15.0-rc1-next-20250408)
  MSR: 9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 28000428 XER: 00000000
  CFAR: 0000000000000000 IRQMASK: 0
  GPR00: c000000002092204 c00040000418fcd0 c000000001b08100 0000000000000040
  GPR04: 0000000000013e00 c000c03ffebabb00 0000000000c03fff c000400fff587f80
  GPR08: 0000000000000000 00000000001196f7 0000000000000000 0000000028000428
  GPR12: 0000000000000000 c000000002e80000 c00000000001007c 0000000000000000
  GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR28: c000000002df7f70 0000000000013dc0 c0000000011dd898 0000000008000000
  NIP [c00000000209218c] memory_dev_init+0x114/0x1e0
  LR [c000000002092204] memory_dev_init+0x18c/0x1e0
  Call Trace:
  [c00040000418fcd0] [c000000002092204] memory_dev_init+0x18c/0x1e0 (unreliable)
  [c00040000418fd50] [c000000002091348] driver_init+0x78/0xa4
  [c00040000418fd70] [c0000000020063ac] kernel_init_freeable+0x22c/0x370
  [c00040000418fde0] [c0000000000100a8] kernel_init+0x34/0x25c
  [c00040000418fe50] [c00000000000cd94] ret_from_kernel_user_thread+0x14/0x1c

Avoid the overhead by folding for_each_present_section_nr() into the outer
loop. add_boot_memory_block() is dropped after that.

Fixes: 61659efdb35c ("drivers/base/memory: improve add_boot_memory_block()")
Closes: https://lore.kernel.org/linux-mm/20250409180344.477916-1-adityag@linux.ibm.com
Reported-by: Aditya Gupta <adityag@linux.ibm.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
---
drivers/base/memory.c | 41 +++++++++++++++++------------------------
1 file changed, 17 insertions(+), 24 deletions(-)