Message ID | 20240822230635.954557-4-andrew.cooper3@citrix.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | xen/bitops: hweight() cleanup/improvements | expand |
On 23.08.2024 01:06, Andrew Cooper wrote: > Using hweight() is an especially expensive way of determining simply if > multiple bits are set in a value. Worse, 4 of the 10 hweight() calls in Xen > are of this form. > > Switch to the new multiple_bits_set() helper. This is far more efficient than > the longhand hweight() algorithm and, owing to its simplicity, likely more > efficient than even a dedicated instruction on a superscalar processor. > > No functional change. > > Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Just to mention it: With the title being sufficiently generic, I half expected the similar {bitmap,cpumask}_weight() to also be taken care of. Yet certainly those (and maybe also their ... == 1 forms) can be covered later. Jan
On 26/08/2024 11:37 am, Jan Beulich wrote: > On 23.08.2024 01:06, Andrew Cooper wrote: >> Using hweight() is an especially expensive way of determining simply if >> multiple bits are set in a value. Worse, 4 of the 10 hweight() calls in Xen >> are of this form. >> >> Switch to the new multiple_bits_set() helper. This is far more efficient than >> the longhand hweight() algorithm and, owing to its simplicity, likely more >> efficient than even a dedicated instruction on a superscalar processor. >> >> No functional change. >> >> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> > Reviewed-by: Jan Beulich <jbeulich@suse.com> Thanks. > > Just to mention it: With the title being sufficiently generic, I half > expected the similar {bitmap,cpumask}_weight() to also be taken care of. > Yet certainly those (and maybe also their ... == 1 forms) can be covered > later. They are on my radar, but are not as easy to transform. We'd need a new bitmap API to optimise this case; looking at the callers, it is definitely worth optimising. There are also an awful lot of cpumask_weight() == 1 which want converting, albeit slightly differently. That's a full series of work in and of itself. ~Andrew
diff --git a/xen/arch/x86/cpu/vpmu.c b/xen/arch/x86/cpu/vpmu.c index b2ba9994129b..a5bb1689c7d5 100644 --- a/xen/arch/x86/cpu/vpmu.c +++ b/xen/arch/x86/cpu/vpmu.c @@ -673,7 +673,7 @@ long do_xenpmu_op( { if ( (pmu_params.val & ~(XENPMU_MODE_SELF | XENPMU_MODE_HV | XENPMU_MODE_ALL)) || - (hweight64(pmu_params.val) > 1) ) + multiple_bits_set(pmu_params.val) ) return -EINVAL; /* 32-bit dom0 can only sample itself. */ diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c index 2ec95942713e..4a3e21a65f9d 100644 --- a/xen/arch/x86/hvm/vlapic.c +++ b/xen/arch/x86/hvm/vlapic.c @@ -467,12 +467,14 @@ static bool is_multicast_dest(struct vlapic *vlapic, unsigned int short_hand, return short_hand != APIC_DEST_SELF; if ( vlapic_x2apic_mode(vlapic) ) - return dest_mode ? hweight16(dest) > 1 : dest == 0xffffffffU; + return dest_mode ? multiple_bits_set((uint16_t)dest) + : dest == 0xffffffffU; if ( dest_mode ) - return hweight8(dest & - GET_xAPIC_DEST_FIELD(vlapic_get_reg(vlapic, - APIC_DFR))) > 1; + { + dest &= GET_xAPIC_DEST_FIELD(vlapic_get_reg(vlapic, APIC_DFR)); + return multiple_bits_set((uint8_t)dest); + } return dest == 0xff; } diff --git a/xen/common/numa.c b/xen/common/numa.c index 28a09766fabc..ce3991929ce5 100644 --- a/xen/common/numa.c +++ b/xen/common/numa.c @@ -546,7 +546,7 @@ static int __init numa_emulation(unsigned long start_pfn, uint64_t sz = pfn_to_paddr(end_pfn - start_pfn) / numa_fake; /* Kludge needed for the hash function */ - if ( hweight64(sz) > 1 ) + if ( multiple_bits_set(sz) ) { uint64_t x = 1;
Using hweight() is an especially expensive way of determining simply if multiple bits are set in a value. Worse, 4 of the 10 hweight() calls in Xen are of this form. Switch to the new multiple_bits_set() helper. This is far more efficient than the longhand hweight() algorithm and, owing to its simplicity, likely more efficient than even a dedicated instruction on a superscalar processor. No functional change. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> --- CC: Jan Beulich <JBeulich@suse.com> CC: Roger Pau Monné <roger.pau@citrix.com> CC: Wei Liu <wl@xen.org> CC: Stefano Stabellini <sstabellini@kernel.org> CC: Julien Grall <julien@xen.org> CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com> CC: Bertrand Marquis <bertrand.marquis@arm.com> CC: Michal Orzel <michal.orzel@amd.com> CC: Oleksii Kurochko <oleksii.kurochko@gmail.com> CC: Shawn Anastasio <sanastasio@raptorengineering.com> On x86, the code reduction speaks for itself: add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-240 (-240) Function old new delta vlapic_ipi 722 650 -72 numa_emulation 577 497 -80 do_xenpmu_op 1665 1577 -88 That's an aweful lot of wasted calculation for the same answer. I can't find a way of enabling CONFIG_NUMA on ARM (yet?) so right now there's no change in any other architecture. However, a synthetic helper shows the following on ARM32: add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-128 (-128) Function old new delta test_mbs 176 48 -128 and on ARM64: add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-44 (-44) Function old new delta test_mbs 60 16 -44 PPC64 has POPCNTD in a default build of Xen and by chance both algorithms compile to the same number of instructions. RISC-V doesn't have hweight() wired up yet at all. --- xen/arch/x86/cpu/vpmu.c | 2 +- xen/arch/x86/hvm/vlapic.c | 10 ++++++---- xen/common/numa.c | 2 +- 3 files changed, 8 insertions(+), 6 deletions(-)