
[08/11] x86/shadow: reduce effort of hash calculation

Message ID acf0f5f6-f4da-cd88-1515-2546153322b4@suse.com (mailing list archive)
State Superseded
Series x86/shadow: misc tidying

Commit Message

Jan Beulich Jan. 5, 2023, 4:05 p.m. UTC
The "n" input is a GFN value and hence bounded by the physical address
bits in use on a system. The hash quality won't improve by also
including the upper always-zero bits in the calculation. To keep things
as compile-time-constant as they were before, use PADDR_BITS (not
paddr_bits) for loop bounding. This reduces loop iterations from 8 to 5.
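
For concreteness, with Xen's x86-64 values PADDR_BITS = 52 and
PAGE_SHIFT = 12, the new bound works out as

    (PADDR_BITS - PAGE_SHIFT + 7) / 8 = (52 - 12 + 7) / 8 = 47 / 8 = 5

against the previous sizeof(n) = 8.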

While there, also drop the unnecessary cast to u32.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I was tempted to also change the types of "p" (pointer to const) and "i"
(unsigned) right here (and perhaps even the "byte" in the comment ahead
of the function), but then thought this might be going too far ...
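
For reference, the loop body's update is the classic sdbm string hash:
p[i] + (k << 6) + (k << 16) - k equals k * 65599 + p[i], since
(1 << 6) + (1 << 16) - 1 = 65599 (0x1003f, visible as the imul constant in
the disassembly quoted in the comments below). Each dropped all-zero byte
would only have multiplied k by that odd constant once more, a fixed
bijection of the 32-bit key space, so including those bytes cannot improve
the spread before the final modulus.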

Comments

Andrew Cooper Jan. 6, 2023, 2:03 a.m. UTC | #1
On 05/01/2023 4:05 pm, Jan Beulich wrote:
> The "n" input is a GFN value and hence bounded by the physical address
> bits in use on a system.

The one case where this isn't obviously true is in sh_audit().  It comes
from a real MFN in the system, not a GFN, which will have the same
property WRT PADDR_BITS.

>  The hash quality won't improve by also
> including the upper always-zero bits in the calculation. To keep things
> as compile-time-constant as they were before, use PADDR_BITS (not
> paddr_bits) for loop bounding. This reduces loop iterations from 8 to 5.

While this is all true, you'll get a much better improvement by not
forcing 'n' onto the stack just to access it bytewise.  Right now, the
loop looks like:

<shadow_hash_insert>:
    48 83 ec 10                 sub    $0x10,%rsp
    49 89 c9                    mov    %rcx,%r9
    41 89 d0                    mov    %edx,%r8d
    48 8d 44 24 08              lea    0x8(%rsp),%rax
    48 8d 4c 24 10              lea    0x10(%rsp),%rcx
    48 89 74 24 08              mov    %rsi,0x8(%rsp)
    0f 1f 80 00 00 00 00        nopl   0x0(%rax)
/-> 0f b6 10                    movzbl (%rax),%edx
|   48 83 c0 01                 add    $0x1,%rax
|   45 69 c0 3f 00 01 00        imul   $0x1003f,%r8d,%r8d
|   41 01 d0                    add    %edx,%r8d
|   48 39 c1                    cmp    %rax,%rcx
\-- 75 ea                       jne    ffff82d0402efda0 <shadow_hash_insert+0x20>


which doesn't even have a compile-time constant loop bound.  It's
calculated at runtime by the second lea, which constructs the upper
pointer bound.

Given this further delta:

diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
index 4a8bcec10fe8..902c749f2724 100644
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1397,13 +1397,12 @@ static unsigned int shadow_get_allocation(struct domain *d)
 typedef u32 key_t;
 static inline key_t sh_hash(unsigned long n, unsigned int t)
 {
-    unsigned char *p = (unsigned char *)&n;
     key_t k = t;
     int i;
 
     BUILD_BUG_ON(PADDR_BITS > BITS_PER_LONG + PAGE_SHIFT);
-    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++ )
-        k = p[i] + (k << 6) + (k << 16) - k;
+    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++, n >>= 8 )
+        k = (uint8_t)n + (k << 6) + (k << 16) - k;
 
     return k % SHADOW_HASH_BUCKETS;
 }

the code gen becomes:

<shadow_hash_insert>:
    41 89 d0                    mov    %edx,%r8d
    49 89 c9                    mov    %rcx,%r9
    b8 05 00 00 00              mov    $0x5,%eax
/-> 45 69 c0 3f 00 01 00        imul   $0x1003f,%r8d,%r8d
|   40 0f b6 d6                 movzbl %sil,%edx
|   48 c1 ee 08                 shr    $0x8,%rsi
|   41 01 d0                    add    %edx,%r8d
|   83 e8 01                    sub    $0x1,%eax
\-- 75 e9                       jne    ffff82d0402efd8b <shadow_hash_insert+0xb>

with an actual constant loop bound, and not a memory operand in sight. 
This form (even at 8 iterations) will easily execute faster than the
stack-spilled form.

~Andrew
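
For readers who want to experiment outside the Xen tree, the following is a
minimal standalone sketch of the two sh_hash() variants discussed above.
The constant values are assumptions matching a typical x86-64 Xen build
rather than quotes from this thread, and the bytewise variant additionally
relies on a little-endian host, as on x86:

#include <stdint.h>
#include <stdio.h>

#define PADDR_BITS          52   /* assumed x86-64 value */
#define PAGE_SHIFT          12
#define SHADOW_HASH_BUCKETS 251  /* assumed bucket count */

/* Original form: taking n's address forces it onto the stack. */
static uint32_t sh_hash_bytewise(unsigned long n, unsigned int t)
{
    unsigned char *p = (unsigned char *)&n;  /* p[0] is the low byte (LE) */
    uint32_t k = t;
    unsigned int i;

    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++ )
        k = p[i] + (k << 6) + (k << 16) - k;

    return k % SHADOW_HASH_BUCKETS;
}

/* Suggested form: n stays in a register and is shifted down instead. */
static uint32_t sh_hash_shift(unsigned long n, unsigned int t)
{
    uint32_t k = t;
    unsigned int i;

    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++, n >>= 8 )
        k = (uint8_t)n + (k << 6) + (k << 16) - k;

    return k % SHADOW_HASH_BUCKETS;
}

int main(void)
{
    unsigned long n = 0x123456789abUL;  /* arbitrary test input */

    /* On little-endian both variants consume the same low five bytes,
     * so the two results must match. */
    printf("%u %u\n", sh_hash_bytewise(n, 1), sh_hash_shift(n, 1));
    return 0;
}

Compiling this with gcc -O2 -S and diffing the two functions' assembly
should show the same stack-spill difference as the dumps in this thread,
though the exact instruction sequences vary by compiler version.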
Jan Beulich Jan. 9, 2023, 9:48 a.m. UTC | #2
On 06.01.2023 03:03, Andrew Cooper wrote:
> On 05/01/2023 4:05 pm, Jan Beulich wrote:
>> The "n" input is a GFN value and hence bounded by the physical address
>> bits in use on a system.
> 
> The one case where this isn't obviously true is in sh_audit().  It comes
> from a real MFN in the system, not a GFN, which will have the same
> property WRT PADDR_BITS.

I'm afraid I was more wrong with that than just for the audit case. Only
FL1 shadows use GFNs. All other shadows use MFNs. I'll update the sentence.

>>  The hash quality won't improve by also
>> including the upper always-zero bits in the calculation. To keep things
>> as compile-time-constant as they were before, use PADDR_BITS (not
>> paddr_bits) for loop bounding. This reduces loop iterations from 8 to 5.
> 
> While this is all true, you'll get a much better improvement by not
> forcing 'n' onto the stack just to access it bytewise.  Right now, the
> loop looks like:
> 
> <shadow_hash_insert>:
>     48 83 ec 10                 sub    $0x10,%rsp
>     49 89 c9                    mov    %rcx,%r9
>     41 89 d0                    mov    %edx,%r8d
>     48 8d 44 24 08              lea    0x8(%rsp),%rax
>     48 8d 4c 24 10              lea    0x10(%rsp),%rcx
>     48 89 74 24 08              mov    %rsi,0x8(%rsp)
>     0f 1f 80 00 00 00 00        nopl   0x0(%rax)
> /-> 0f b6 10                    movzbl (%rax),%edx
> |   48 83 c0 01                 add    $0x1,%rax
> |   45 69 c0 3f 00 01 00        imul   $0x1003f,%r8d,%r8d
> |   41 01 d0                    add    %edx,%r8d
> |   48 39 c1                    cmp    %rax,%rcx
> \-- 75 ea                       jne    ffff82d0402efda0 <shadow_hash_insert+0x20>
> 
> 
> which doesn't even have a compile-time constant loop bound.  It's
> calculated at runtime by the second lea, which constructs the upper
> pointer bound.
> 
> Given this further delta:
> 
> diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
> index 4a8bcec10fe8..902c749f2724 100644
> --- a/xen/arch/x86/mm/shadow/common.c
> +++ b/xen/arch/x86/mm/shadow/common.c
> @@ -1397,13 +1397,12 @@ static unsigned int shadow_get_allocation(struct domain *d)
>  typedef u32 key_t;
>  static inline key_t sh_hash(unsigned long n, unsigned int t)
>  {
> -    unsigned char *p = (unsigned char *)&n;
>      key_t k = t;
>      int i;
>  
>      BUILD_BUG_ON(PADDR_BITS > BITS_PER_LONG + PAGE_SHIFT);
> -    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++ )
> -        k = p[i] + (k << 6) + (k << 16) - k;
> +    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++, n >>= 8 )
> +        k = (uint8_t)n + (k << 6) + (k << 16) - k;
>  
>      return k % SHADOW_HASH_BUCKETS;
>  }
> 
> the code gen becomes:
> 
> <shadow_hash_insert>:
>     41 89 d0                    mov    %edx,%r8d
>     49 89 c9                    mov    %rcx,%r9
>     b8 05 00 00 00              mov    $0x5,%eax
> /-> 45 69 c0 3f 00 01 00        imul   $0x1003f,%r8d,%r8d
> |   40 0f b6 d6                 movzbl %sil,%edx
> |   48 c1 ee 08                 shr    $0x8,%rsi
> |   41 01 d0                    add    %edx,%r8d
> |   83 e8 01                    sub    $0x1,%eax
> \-- 75 e9                       jne    ffff82d0402efd8b <shadow_hash_insert+0xb>
> 
> with an actual constant loop bound, and not a memory operand in sight. 
> This form (even at 8 iterations) will easily execute faster than the
> stack-spilled form.

Oh, yes, good idea. Will adjust.

Jan

Patch

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1400,7 +1400,11 @@ static inline key_t sh_hash(unsigned long n, unsigned int t)
     unsigned char *p = (unsigned char *)&n;
     key_t k = t;
     int i;
-    for ( i = 0; i < sizeof(n) ; i++ ) k = (u32)p[i] + (k<<6) + (k<<16) - k;
+
+    BUILD_BUG_ON(PADDR_BITS > BITS_PER_LONG + PAGE_SHIFT);
+    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++ )
+        k = p[i] + (k << 6) + (k << 16) - k;
+
     return k % SHADOW_HASH_BUCKETS;
 }
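
(The hunk above is the form as posted, still reading the bytes through
p[i]; per the discussion above, Jan planned to switch the next revision to
the register-shift form from Andrew's delta, hence, presumably, the
Superseded state.)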