Message ID | 1522915478-5044-3-git-send-email-hejianet@gmail.com (mailing list archive) |
---|---|
State | New, archived |
On Thu, Apr 05, 2018 at 01:04:35AM -0700, Jia He wrote:
> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
> where possible") optimized the loop in memmap_init_zone(). But there is
> still some room for improvement. E.g. if pfn and pfn+1 are in the same
> memblock region, we can simply pfn++ instead of doing the binary search
> in memblock_next_valid_pfn.

Sure, but I bet if we are >end_pfn, we're almost certainly going to the
start_pfn of the next block, so why not test that as well?

> + /* fast path, return pfn+1 if next pfn is in the same region */
> + if (early_region_idx != -1) {
> + start_pfn = PFN_DOWN(regions[early_region_idx].base);
> + end_pfn = PFN_DOWN(regions[early_region_idx].base +
> + regions[early_region_idx].size);
> +
> + if (pfn >= start_pfn && pfn < end_pfn)
> + return pfn;

	early_region_idx++;
	start_pfn = PFN_DOWN(regions[early_region_idx].base);
	if (pfn >= end_pfn && pfn <= start_pfn)
		return start_pfn;

> + }
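A minimal sketch of how Matthew's suggestion could be folded into the fast path (illustration only, not code from the posted series; the bounds check against type->cnt is an added assumption, since the snippet in the mail does not show one):

	if (early_region_idx != -1) {
		start_pfn = PFN_DOWN(regions[early_region_idx].base);
		end_pfn = PFN_DOWN(regions[early_region_idx].base +
				regions[early_region_idx].size);

		/* still inside the cached region: pfn itself is valid */
		if (pfn >= start_pfn && pfn < end_pfn)
			return pfn;

		/* just ran past the cached region: try the next region's start */
		if (pfn >= end_pfn && early_region_idx + 1 < type->cnt) {
			early_region_idx++;
			start_pfn = PFN_DOWN(regions[early_region_idx].base);
			if (pfn <= start_pfn)
				return start_pfn;
		}
	}

Either way, a miss here still falls through to the search below, so correctness does not depend on the memo being right.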
On 4/5/2018 7:34 PM, Matthew Wilcox Wrote:
> On Thu, Apr 05, 2018 at 01:04:35AM -0700, Jia He wrote:
>> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
>> where possible") optimized the loop in memmap_init_zone(). But there is
>> still some room for improvement. E.g. if pfn and pfn+1 are in the same
>> memblock region, we can simply pfn++ instead of doing the binary search
>> in memblock_next_valid_pfn.
> Sure, but I bet if we are >end_pfn, we're almost certainly going to the
> start_pfn of the next block, so why not test that as well?
>
>> + /* fast path, return pfn+1 if next pfn is in the same region */
>> + if (early_region_idx != -1) {
>> + start_pfn = PFN_DOWN(regions[early_region_idx].base);
>> + end_pfn = PFN_DOWN(regions[early_region_idx].base +
>> + regions[early_region_idx].size);
>> +
>> + if (pfn >= start_pfn && pfn < end_pfn)
>> + return pfn;
> early_region_idx++;
> start_pfn = PFN_DOWN(regions[early_region_idx].base);
> if (pfn >= end_pfn && pfn <= start_pfn)
> return start_pfn;

Thanks, thus the binary search in next step can be discarded?
On Thu, Apr 05, 2018 at 08:44:12PM +0800, Jia He wrote:
>
>
> On 4/5/2018 7:34 PM, Matthew Wilcox Wrote:
> > On Thu, Apr 05, 2018 at 01:04:35AM -0700, Jia He wrote:
> > > Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
> > > where possible") optimized the loop in memmap_init_zone(). But there is
> > > still some room for improvement. E.g. if pfn and pfn+1 are in the same
> > > memblock region, we can simply pfn++ instead of doing the binary search
> > > in memblock_next_valid_pfn.
> > Sure, but I bet if we are >end_pfn, we're almost certainly going to the
> > start_pfn of the next block, so why not test that as well?
> >
> > > + /* fast path, return pfn+1 if next pfn is in the same region */
> > > + if (early_region_idx != -1) {
> > > + start_pfn = PFN_DOWN(regions[early_region_idx].base);
> > > + end_pfn = PFN_DOWN(regions[early_region_idx].base +
> > > + regions[early_region_idx].size);
> > > +
> > > + if (pfn >= start_pfn && pfn < end_pfn)
> > > + return pfn;
> > early_region_idx++;
> > start_pfn = PFN_DOWN(regions[early_region_idx].base);
> > if (pfn >= end_pfn && pfn <= start_pfn)
> > return start_pfn;
> Thanks, thus the binary search in next step can be discarded?

I don't know all the circumstances in which this is called. Maybe a linear
search with memo is more appropriate than a binary search.
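As a rough sketch of what a "linear search with memo" could look like here (an interpretation, not code from the series; the function name and the last_idx memo are invented, and it assumes memblock.memory has at least one region and keeps regions sorted by base):

static int last_idx __init_memblock;

ulong __init_memblock memblock_next_valid_pfn_linear(ulong pfn)
{
	struct memblock_region *regions = memblock.memory.regions;
	unsigned int cnt = memblock.memory.cnt;
	phys_addr_t addr = PFN_PHYS(++pfn);
	unsigned int i;

	/* pfn moved backwards (or first call): forget the memo */
	if (addr < regions[last_idx].base)
		last_idx = 0;

	/* regions are sorted by base, so scan forward from the memo */
	for (i = last_idx; i < cnt; i++) {
		if (addr < regions[i].base) {
			/* pfn sits in a hole: jump to the next region */
			last_idx = i;
			return PHYS_PFN(regions[i].base);
		}
		if (addr < regions[i].base + regions[i].size) {
			/* pfn is inside region i, so it is already valid */
			last_idx = i;
			return pfn;
		}
	}

	return -1UL;	/* no valid pfn past this point */
}

For a caller like memmap_init_zone(), which walks pfns in ascending order, this degenerates to one or two comparisons per call, much like the memoized fast path in the patch.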
On Thu, Apr 05, 2018 at 05:50:54AM -0700, Matthew Wilcox wrote:
> On Thu, Apr 05, 2018 at 08:44:12PM +0800, Jia He wrote:
> >
> >
> > On 4/5/2018 7:34 PM, Matthew Wilcox Wrote:
> > > On Thu, Apr 05, 2018 at 01:04:35AM -0700, Jia He wrote:
> > > > Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
> > > > where possible") optimized the loop in memmap_init_zone(). But there is
> > > > still some room for improvement. E.g. if pfn and pfn+1 are in the same
> > > > memblock region, we can simply pfn++ instead of doing the binary search
> > > > in memblock_next_valid_pfn.
> > > Sure, but I bet if we are >end_pfn, we're almost certainly going to the
> > > start_pfn of the next block, so why not test that as well?
> > >
> > > > + /* fast path, return pfn+1 if next pfn is in the same region */
> > > > + if (early_region_idx != -1) {
> > > > + start_pfn = PFN_DOWN(regions[early_region_idx].base);
> > > > + end_pfn = PFN_DOWN(regions[early_region_idx].base +
> > > > + regions[early_region_idx].size);
> > > > +
> > > > + if (pfn >= start_pfn && pfn < end_pfn)
> > > > + return pfn;
> > > early_region_idx++;
> > > start_pfn = PFN_DOWN(regions[early_region_idx].base);
> > > if (pfn >= end_pfn && pfn <= start_pfn)
> > > return start_pfn;
> > Thanks, thus the binary search in next step can be discarded?
>
> I don't know all the circumstances in which this is called. Maybe a linear
> search with memo is more appropriate than a binary search.

That's been brought up before, and the reasoning appears to be
something along the lines of...

Academics and published wisdom is that on cached architectures, binary
searches are bad because it doesn't operate efficiently due to the
overhead from having to load cache lines. Consequently, there seems
to be a knee-jerk reaction that "all binary searches are bad, we must
eliminate them."

What is failed to be grasped here, though, is that it is typical that
the number of entries in this array tend to be small, so the entire
array takes up one or two cache lines, maybe a maximum of four lines
depending on your cache line length and number of entries.

This means that the binary search expense is reduced, and is lower
than a linear search for the majority of cases.

What is key here as far as performance is concerned is whether the
general usage of pfn_valid() by the kernel is optimal. We should
not optimise only for the boot case, which means evaluating the
effect of these changes with _real_ workloads, not just "does my
machine boot a milliseconds faster".
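To put rough numbers on that argument (assumed sizes, not measurements from the thread): struct memblock_region is on the order of 16-32 bytes depending on configuration, so even eight regions occupy roughly 128-256 bytes, i.e. two to four 64-byte cache lines, and a binary search over eight entries needs at most log2(8) = 3 comparisons.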
On Fri, Apr 6, 2018 at 11:09 AM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> On Thu, Apr 05, 2018 at 05:50:54AM -0700, Matthew Wilcox wrote:
>> On Thu, Apr 05, 2018 at 08:44:12PM +0800, Jia He wrote:
>> >
>> >
>> > On 4/5/2018 7:34 PM, Matthew Wilcox Wrote:
>> > > On Thu, Apr 05, 2018 at 01:04:35AM -0700, Jia He wrote:
>> > > > Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
>> > > > where possible") optimized the loop in memmap_init_zone(). But there is
>> > > > still some room for improvement. E.g. if pfn and pfn+1 are in the same
>> > > > memblock region, we can simply pfn++ instead of doing the binary search
>> > > > in memblock_next_valid_pfn.
>> > > Sure, but I bet if we are >end_pfn, we're almost certainly going to the
>> > > start_pfn of the next block, so why not test that as well?
>> > >
>> > > > + /* fast path, return pfn+1 if next pfn is in the same region */
>> > > > + if (early_region_idx != -1) {
>> > > > + start_pfn = PFN_DOWN(regions[early_region_idx].base);
>> > > > + end_pfn = PFN_DOWN(regions[early_region_idx].base +
>> > > > + regions[early_region_idx].size);
>> > > > +
>> > > > + if (pfn >= start_pfn && pfn < end_pfn)
>> > > > + return pfn;
>> > > early_region_idx++;
>> > > start_pfn = PFN_DOWN(regions[early_region_idx].base);
>> > > if (pfn >= end_pfn && pfn <= start_pfn)
>> > > return start_pfn;
>> > Thanks, thus the binary search in next step can be discarded?
>>
>> I don't know all the circumstances in which this is called. Maybe a linear
>> search with memo is more appropriate than a binary search.

This is actually a good point.

> That's been brought up before, and the reasoning appears to be
> something along the lines of...
>
> Academics and published wisdom is that on cached architectures, binary
> searches are bad because it doesn't operate efficiently due to the
> overhead from having to load cache lines. Consequently, there seems
> to be a knee-jerk reaction that "all binary searches are bad, we must
> eliminate them."

a) This does not make sense, at least not in the general case.
b) It is not the case here. Here it's really mostly called with
sequentially incremented pfns, AFAICT.

> What is failed to be grasped here, though, is that it is typical that
> the number of entries in this array tend to be small, so the entire
> array takes up one or two cache lines, maybe a maximum of four lines
> depending on your cache line length and number of entries.
>
> This means that the binary search expense is reduced, and is lower
> than a linear search for the majority of cases.

In this case it hits mostly the last result, or eventually the
sequentially next one.

> What is key here as far as performance is concerned is whether the
> general usage of pfn_valid() by the kernel is optimal. We should
> not optimise only for the boot case, which means evaluating the
> effect of these changes with _real_ workloads, not just "does my
> machine boot a milliseconds faster".

IIUC, this is only used during early boot (and memory hotplug), and it
does not influence regular runtime. Whether the general usage of
pfn_valid() by the kernel is optimal is another good question, but
that's totally unrelated to this series, IMHO.

On the other hand, I also wonder whether this is all really worth the
negligible boot-time speedup.

--nX

> --
> RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
> According to speedtest.net: 8.21Mbps down 510kbps up
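For reference, the early-boot caller has roughly this shape (paraphrased from memmap_init_zone(); details vary between kernel versions, so treat it as a sketch): the pfn only ever moves forward, which is why a memoized lookup hits almost every time.

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		if (!early_pfn_valid(pfn)) {
			/* skip to just before the next valid pfn;
			 * the loop's pfn++ then lands on it */
			pfn = memblock_next_valid_pfn(pfn) - 1;
			continue;
		}
		/* ... initialise the struct page for this pfn ... */
	}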
Thanks for your comments, Russell

On 4/6/2018 5:09 PM, Russell King - ARM Linux Wrote:
> On Thu, Apr 05, 2018 at 05:50:54AM -0700, Matthew Wilcox wrote:
>> On Thu, Apr 05, 2018 at 08:44:12PM +0800, Jia He wrote:
>>>
>>> On 4/5/2018 7:34 PM, Matthew Wilcox Wrote:
>>>> On Thu, Apr 05, 2018 at 01:04:35AM -0700, Jia He wrote:
>>>>> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
>>>>> where possible") optimized the loop in memmap_init_zone(). But there is
>>>>> still some room for improvement. E.g. if pfn and pfn+1 are in the same
>>>>> memblock region, we can simply pfn++ instead of doing the binary search
>>>>> in memblock_next_valid_pfn.
>>>> Sure, but I bet if we are >end_pfn, we're almost certainly going to the
>>>> start_pfn of the next block, so why not test that as well?
>>>>
>>>>> + /* fast path, return pfn+1 if next pfn is in the same region */
>>>>> + if (early_region_idx != -1) {
>>>>> + start_pfn = PFN_DOWN(regions[early_region_idx].base);
>>>>> + end_pfn = PFN_DOWN(regions[early_region_idx].base +
>>>>> + regions[early_region_idx].size);
>>>>> +
>>>>> + if (pfn >= start_pfn && pfn < end_pfn)
>>>>> + return pfn;
>>>> early_region_idx++;
>>>> start_pfn = PFN_DOWN(regions[early_region_idx].base);
>>>> if (pfn >= end_pfn && pfn <= start_pfn)
>>>> return start_pfn;
>>> Thanks, thus the binary search in next step can be discarded?
>> I don't know all the circumstances in which this is called. Maybe a linear
>> search with memo is more appropriate than a binary search.
> That's been brought up before, and the reasoning appears to be
> something along the lines of...
>
> Academics and published wisdom is that on cached architectures, binary
> searches are bad because it doesn't operate efficiently due to the
> overhead from having to load cache lines. Consequently, there seems
> to be a knee-jerk reaction that "all binary searches are bad, we must
> eliminate them."

IIUC, are you opposed to removing the binary search entirely, rather
than to my previous patch set?

>
> What is failed to be grasped here, though, is that it is typical that
> the number of entries in this array tend to be small, so the entire
> array takes up one or two cache lines, maybe a maximum of four lines
> depending on your cache line length and number of entries.
>
> This means that the binary search expense is reduced, and is lower
> than a linear search for the majority of cases.
>
> What is key here as far as performance is concerned is whether the
> general usage of pfn_valid() by the kernel is optimal. We should
> not optimise only for the boot case, which means evaluating the
> effect of these changes with _real_ workloads, not just "does my
> machine boot a milliseconds faster".

Hmm... but pfn increases linearly during boot time. That assumption
does not hold for pfn_valid() in real workloads outside of boot, which
is why my patch set defines a separate pfn_valid_region for boot time
only.

I don't have many arm/arm64 boxes to verify this on; what I can do is
guarantee the improvement on my armv8a (Qualcomm Centriq 2400). Sorry
about that.

--
Cheers,
Jia
diff --git a/include/linux/arm96_common.h b/include/linux/arm96_common.h
index a6f68ea..2f4dea4 100644
--- a/include/linux/arm96_common.h
+++ b/include/linux/arm96_common.h
@@ -5,32 +5,47 @@
 #ifndef __ARM96_COMMON_H
 #define __ARM96_COMMON_H
 #ifdef CONFIG_HAVE_ARCH_PFN_VALID
+static int early_region_idx __init_memblock = -1;
 /* HAVE_MEMBLOCK is always enabled on arm and arm64 */
 ulong __init_memblock memblock_next_valid_pfn(ulong pfn)
 {
 	struct memblock_type *type = &memblock.memory;
-	unsigned int right = type->cnt;
-	unsigned int mid, left = 0;
+	struct memblock_region *regions = type->regions;
+	uint right = type->cnt;
+	uint mid, left = 0;
+	ulong start_pfn, end_pfn;
 	phys_addr_t addr = PFN_PHYS(++pfn);
 
+	/* fast path, return pfn+1 if next pfn is in the same region */
+	if (early_region_idx != -1) {
+		start_pfn = PFN_DOWN(regions[early_region_idx].base);
+		end_pfn = PFN_DOWN(regions[early_region_idx].base +
+				regions[early_region_idx].size);
+
+		if (pfn >= start_pfn && pfn < end_pfn)
+			return pfn;
+	}
+
+	/* slow path, do the binary searching */
 	do {
 		mid = (right + left) / 2;
 
-		if (addr < type->regions[mid].base)
+		if (addr < regions[mid].base)
 			right = mid;
-		else if (addr >= (type->regions[mid].base +
-				  type->regions[mid].size))
+		else if (addr >= (regions[mid].base + regions[mid].size))
 			left = mid + 1;
 		else {
-			/* addr is within the region, so pfn is valid */
+			early_region_idx = mid;
 			return pfn;
 		}
 	} while (left < right);
 
 	if (right == type->cnt)
 		return -1UL;
-	else
-		return PHYS_PFN(type->regions[right].base);
+
+	early_region_idx = right;
+
+	return PHYS_PFN(regions[early_region_idx].base);
 }
 EXPORT_SYMBOL(memblock_next_valid_pfn);
 #endif /*CONFIG_HAVE_ARCH_PFN_VALID*/
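A hypothetical walk-through of the patched function (the address and region layout are invented for illustration): assume a single memory region based at 0x80000000 and large enough to cover the pfns below.

	ulong pfn = PHYS_PFN(0x80000000);

	/* first call: memo is empty, so the binary search runs and
	 * records the matching region in early_region_idx */
	pfn = memblock_next_valid_pfn(pfn);

	/* second call: ++pfn is still inside the cached region, so the
	 * fast path returns immediately without searching */
	pfn = memblock_next_valid_pfn(pfn);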
Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
where possible") optimized the loop in memmap_init_zone(). But there is
still some room for improvement. E.g. if pfn and pfn+1 are in the same
memblock region, we can simply pfn++ instead of doing the binary search
in memblock_next_valid_pfn.

Signed-off-by: Jia He <jia.he@hxt-semitech.com>
---
 include/linux/arm96_common.h | 31 +++++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 8 deletions(-)