
[08/11] x86/shadow: reduce effort of hash calculation

Message ID acf0f5f6-f4da-cd88-1515-2546153322b4@suse.com (mailing list archive)
State Superseded
Series x86/shadow: misc tidying

Commit Message

Jan Beulich Jan. 5, 2023, 4:05 p.m. UTC
The "n" input is a GFN value and hence bounded by the physical address
bits in use on a system. The hash quality won't improve by also
including the upper always-zero bits in the calculation. To keep things
as compile-time-constant as they were before, use PADDR_BITS (not
paddr_bits) for loop bounding. This reduces loop iterations from 8 to 5.
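
For concreteness, with Xen's x86-64 values PADDR_BITS = 52 and
PAGE_SHIFT = 12, the new bound works out as

    (PADDR_BITS - PAGE_SHIFT + 7) / 8 = (52 - 12 + 7) / 8 = 47 / 8 = 5

against the previous sizeof(n) = 8.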

While there, also drop the unnecessary cast to u32.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I was tempted to also change the types of "p" (pointer to const) and "i"
(unsigned) right here (and perhaps even the "byte" in the comment ahead
of the function), but then thought this might be going too far ...
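
For reference, the loop body's update is the classic sdbm string hash:
p[i] + (k << 6) + (k << 16) - k equals k * 65599 + p[i], since
(1 << 6) + (1 << 16) - 1 = 65599 (0x1003f, visible as the imul constant in
the disassembly quoted in the comments below). Each dropped all-zero byte
would only have multiplied k by that odd constant once more, a fixed
bijection of the 32-bit key space, so including those bytes cannot improve
the spread before the final modulus.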

Comments

Andrew Cooper Jan. 6, 2023, 2:03 a.m. UTC | #1
On 05/01/2023 4:05 pm, Jan Beulich wrote:
> The "n" input is a GFN value and hence bounded by the physical address
> bits in use on a system.

The one case where this isn't obviously true is in sh_audit().  It comes
from a real MFN in the system, not a GFN, which will have the same
property WRT PADDR_BITS.

>  The hash quality won't improve by also
> including the upper always-zero bits in the calculation. To keep things
> as compile-time-constant as they were before, use PADDR_BITS (not
> paddr_bits) for loop bounding. This reduces loop iterations from 8 to 5.

While this is all true, you'll get a much better improvement by not
forcing 'n' onto the stack just to access it bytewise.  Right now, the
loop looks like:

<shadow_hash_insert>:
    48 83 ec 10                 sub    $0x10,%rsp
    49 89 c9                    mov    %rcx,%r9
    41 89 d0                    mov    %edx,%r8d
    48 8d 44 24 08              lea    0x8(%rsp),%rax
    48 8d 4c 24 10              lea    0x10(%rsp),%rcx
    48 89 74 24 08              mov    %rsi,0x8(%rsp)
    0f 1f 80 00 00 00 00        nopl   0x0(%rax)
/-> 0f b6 10                    movzbl (%rax),%edx
|   48 83 c0 01                 add    $0x1,%rax
|   45 69 c0 3f 00 01 00        imul   $0x1003f,%r8d,%r8d
|   41 01 d0                    add    %edx,%r8d
|   48 39 c1                    cmp    %rax,%rcx
\-- 75 ea                       jne    ffff82d0402efda0 <shadow_hash_insert+0x20>


which doesn't even have a compile-time constant loop bound.  It's
calculated at runtime by the second lea, which constructs the upper
pointer bound.

Given this further delta:

diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
index 4a8bcec10fe8..902c749f2724 100644
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1397,13 +1397,12 @@ static unsigned int shadow_get_allocation(struct domain *d)
 typedef u32 key_t;
 static inline key_t sh_hash(unsigned long n, unsigned int t)
 {
-    unsigned char *p = (unsigned char *)&n;
     key_t k = t;
     int i;
 
     BUILD_BUG_ON(PADDR_BITS > BITS_PER_LONG + PAGE_SHIFT);
-    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++ )
-        k = p[i] + (k << 6) + (k << 16) - k;
+    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++, n >>= 8 )
+        k = (uint8_t)n + (k << 6) + (k << 16) - k;
 
     return k % SHADOW_HASH_BUCKETS;
 }

the code gen becomes:

<shadow_hash_insert>:
    41 89 d0                    mov    %edx,%r8d
    49 89 c9                    mov    %rcx,%r9
    b8 05 00 00 00              mov    $0x5,%eax
/-> 45 69 c0 3f 00 01 00        imul   $0x1003f,%r8d,%r8d
|   40 0f b6 d6                 movzbl %sil,%edx
|   48 c1 ee 08                 shr    $0x8,%rsi
|   41 01 d0                    add    %edx,%r8d
|   83 e8 01                    sub    $0x1,%eax
\-- 75 e9                       jne    ffff82d0402efd8b <shadow_hash_insert+0xb>

with an actual constant loop bound, and not a memory operand in sight. 
This form (even at 8 iterations) will easily execute faster than the
stack-spilled form.

~Andrew
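
For readers who want to experiment outside the Xen tree, the following is a
minimal standalone sketch of the two sh_hash() variants discussed above.
The constant values are assumptions matching a typical x86-64 Xen build
rather than quotes from this thread, and the bytewise variant additionally
relies on a little-endian host, as on x86:

#include <stdint.h>
#include <stdio.h>

#define PADDR_BITS          52   /* assumed x86-64 value */
#define PAGE_SHIFT          12
#define SHADOW_HASH_BUCKETS 251  /* assumed bucket count */

/* Original form: taking n's address forces it onto the stack. */
static uint32_t sh_hash_bytewise(unsigned long n, unsigned int t)
{
    unsigned char *p = (unsigned char *)&n;  /* p[0] is the low byte (LE) */
    uint32_t k = t;
    unsigned int i;

    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++ )
        k = p[i] + (k << 6) + (k << 16) - k;

    return k % SHADOW_HASH_BUCKETS;
}

/* Suggested form: n stays in a register and is shifted down instead. */
static uint32_t sh_hash_shift(unsigned long n, unsigned int t)
{
    uint32_t k = t;
    unsigned int i;

    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++, n >>= 8 )
        k = (uint8_t)n + (k << 6) + (k << 16) - k;

    return k % SHADOW_HASH_BUCKETS;
}

int main(void)
{
    unsigned long n = 0x123456789abUL;  /* arbitrary test input */

    /* On little-endian both variants consume the same low five bytes,
     * so the two results must match. */
    printf("%u %u\n", sh_hash_bytewise(n, 1), sh_hash_shift(n, 1));
    return 0;
}

Compiling this with gcc -O2 -S and diffing the two functions' assembly
should show the same stack-spill difference as the dumps in this thread,
though the exact instruction sequences vary by compiler version.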
Jan Beulich Jan. 9, 2023, 9:48 a.m. UTC | #2
On 06.01.2023 03:03, Andrew Cooper wrote:
> On 05/01/2023 4:05 pm, Jan Beulich wrote:
>> The "n" input is a GFN value and hence bounded by the physical address
>> bits in use on a system.
> 
> The one case where this isn't obviously true is in sh_audit().  It comes
> from a real MFN in the system, not a GFN, which will have the same
> property WRT PADDR_BITS.

I'm afraid I was more wrong with that than just for the audit case. Only
FL1 shadows use GFNs. All other shadows use MFNs. I'll update the sentence.

>>  The hash quality won't improve by also
>> including the upper always-zero bits in the calculation. To keep things
>> as compile-time-constant as they were before, use PADDR_BITS (not
>> paddr_bits) for loop bounding. This reduces loop iterations from 8 to 5.
> 
> While this is all true, you'll get a much better improvement by not
> forcing 'n' onto the stack just to access it bytewise.  Right now, the
> loop looks like:
> 
> <shadow_hash_insert>:
>     48 83 ec 10                 sub    $0x10,%rsp
>     49 89 c9                    mov    %rcx,%r9
>     41 89 d0                    mov    %edx,%r8d
>     48 8d 44 24 08              lea    0x8(%rsp),%rax
>     48 8d 4c 24 10              lea    0x10(%rsp),%rcx
>     48 89 74 24 08              mov    %rsi,0x8(%rsp)
>     0f 1f 80 00 00 00 00        nopl   0x0(%rax)
> /-> 0f b6 10                    movzbl (%rax),%edx
> |   48 83 c0 01                 add    $0x1,%rax
> |   45 69 c0 3f 00 01 00        imul   $0x1003f,%r8d,%r8d
> |   41 01 d0                    add    %edx,%r8d
> |   48 39 c1                    cmp    %rax,%rcx
> \-- 75 ea                       jne    ffff82d0402efda0 <shadow_hash_insert+0x20>
> 
> 
> which doesn't even have a compile-time constant loop bound.  It's
> calculated at runtime by the second lea, which constructs the upper
> pointer bound.
> 
> Given this further delta:
> 
> diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
> index 4a8bcec10fe8..902c749f2724 100644
> --- a/xen/arch/x86/mm/shadow/common.c
> +++ b/xen/arch/x86/mm/shadow/common.c
> @@ -1397,13 +1397,12 @@ static unsigned int shadow_get_allocation(struct domain *d)
>  typedef u32 key_t;
>  static inline key_t sh_hash(unsigned long n, unsigned int t)
>  {
> -    unsigned char *p = (unsigned char *)&n;
>      key_t k = t;
>      int i;
>  
>      BUILD_BUG_ON(PADDR_BITS > BITS_PER_LONG + PAGE_SHIFT);
> -    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++ )
> -        k = p[i] + (k << 6) + (k << 16) - k;
> +    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++, n >>= 8 )
> +        k = (uint8_t)n + (k << 6) + (k << 16) - k;
>  
>      return k % SHADOW_HASH_BUCKETS;
>  }
> 
> the code gen becomes:
> 
> <shadow_hash_insert>:
>     41 89 d0                    mov    %edx,%r8d
>     49 89 c9                    mov    %rcx,%r9
>     b8 05 00 00 00              mov    $0x5,%eax
> /-> 45 69 c0 3f 00 01 00        imul   $0x1003f,%r8d,%r8d
> |   40 0f b6 d6                 movzbl %sil,%edx
> |   48 c1 ee 08                 shr    $0x8,%rsi
> |   41 01 d0                    add    %edx,%r8d
> |   83 e8 01                    sub    $0x1,%eax
> \-- 75 e9                       jne    ffff82d0402efd8b <shadow_hash_insert+0xb>
> 
> with an actual constant loop bound, and not a memory operand in sight. 
> This form (even at 8 iterations) will easily execute faster than the
> stack-spilled form.

Oh, yes, good idea. Will adjust.

Jan

Patch

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1400,7 +1400,11 @@ static inline key_t sh_hash(unsigned long n, unsigned int t)
     unsigned char *p = (unsigned char *)&n;
     key_t k = t;
     int i;
-    for ( i = 0; i < sizeof(n) ; i++ ) k = (u32)p[i] + (k<<6) + (k<<16) - k;
+
+    BUILD_BUG_ON(PADDR_BITS > BITS_PER_LONG + PAGE_SHIFT);
+    for ( i = 0; i < (PADDR_BITS - PAGE_SHIFT + 7) / 8; i++ )
+        k = p[i] + (k << 6) + (k << 16) - k;
+
     return k % SHADOW_HASH_BUCKETS;
 }
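
(The hunk above is the form as posted, still reading the bytes through
p[i]; per the discussion above, Jan planned to switch the next revision to
the register-shift form from Andrew's delta, hence, presumably, the
Superseded state.)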