
[v2,4/5] x86/boot: Simplify pagetable manipulation loops

Message ID 20200117204223.30076-5-andrew.cooper3@citrix.com (mailing list archive)
State New, archived
Series x86: Remove more 16M total-size restrictions

Commit Message

Andrew Cooper Jan. 17, 2020, 8:42 p.m. UTC
For __page_tables_{start,end} and L3 bootmap initialisation, the logic is
unnecessarily complicated owing to its attempt to use the LOOP instruction,
which results in an off-by-8 memory address to compensate for LOOP's
termination condition.

Rewrite both loops for improved clarity and speed.

Misc notes:
 * TEST $IMM, MEM can't macrofuse.  The loop has 0x1200 iterations, so pull
   the $_PAGE_PRESENT constant out into a spare register to turn the TEST into
   its %REG, MEM form, which can macrofuse.
 * Avoid the use of %fs-relative references.  %esi-relative is the more common
   form in the code, and doesn't suffer an address generation overhead.
 * Avoid LOOP.  CMP/JB isn't microcoded and is faster to execute in all
   cases (see the sketch below).
 * For a trivial 4-iteration loop, even compilers unroll these.  The
   generated code is fractionally larger, but this is init code and the asm
   is far easier to follow.
 * Reposition the l2=>l1 bootmap construction so the asm reads in pagetable
   level order.
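
As a sketch of the difference (illustrative only; `base', `start', `end'
and N stand in for the real symbols): LOOP only terminates once %ecx
reaches zero, so the body sees %ecx running N..1, forcing a -8 bias into
the addressing:

        mov     $N, %ecx
1:      testl   $_PAGE_PRESENT, base-8(,%ecx,8)
        ...
        loop    1b              /* dec %ecx; jnz 1b - microcoded */

whereas a plain pointer walk needs no bias, and the CMP/JB pair can
macrofuse:

        lea     start, %eax
        lea     end, %edi
1:      test    %edx, (%eax)    /* %edx preloaded with $_PAGE_PRESENT */
        ...
        add     $8, %eax
        cmp     %edi, %eax
        jb      1b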

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/boot/head.S | 41 ++++++++++++++++++++++++++---------------
 1 file changed, 26 insertions(+), 15 deletions(-)

Comments

Jan Beulich Jan. 20, 2020, 10:46 a.m. UTC | #1
On 17.01.2020 21:42, Andrew Cooper wrote:
> For __page_tables_{start,end} and L3 bootmap initialisation, the logic is
> unnecessarily complicated owing to its attempt to use the LOOP instruction,
> which results in an off-by-8 memory address to compensate for LOOP's
> termination condition.
> 
> Rewrite both loops for improved clarity and speed.
> 
> Misc notes:
>  * TEST $IMM, MEM can't macrofuse.  The loop has 0x1200 iterations, so pull
>    the $_PAGE_PRESENT constant out into a spare register to turn the TEST into
>    its %REG, MEM form, which can macrofuse.
>  * Avoid the use of %fs-relative references.  %esi-relative is the more common
>    form in the code, and doesn't suffer an address generation overhead.
>  * Avoid LOOP.  CMP/JB isn't microcoded and is faster to execute in all
>    cases (see the sketch below).
>  * For a trivial 4-iteration loop, even compilers unroll these.  The
>    generated code is fractionally larger, but this is init code and the asm
>    is far easier to follow.
>  * Reposition the l2=>l1 bootmap construction so the asm reads in pagetable
>    level order.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>
with two remarks/questions, but leaving it up to you whether
you want to adjust the code:

> --- a/xen/arch/x86/boot/head.S
> +++ b/xen/arch/x86/boot/head.S
> @@ -662,11 +662,17 @@ trampoline_setup:
>          mov     %edx,sym_fs(boot_tsc_stamp)+4
>  
>          /* Relocate pagetables to point at Xen's current location in memory. */
> -        mov     $((__page_tables_end-__page_tables_start)/8),%ecx
> -1:      testl   $_PAGE_PRESENT,sym_fs(__page_tables_start)-8(,%ecx,8)
> +        mov     $_PAGE_PRESENT, %edx
> +        lea     sym_esi(__page_tables_start), %eax
> +        lea     sym_esi(__page_tables_end), %edi
> +
> +1:      testb   %dl, (%eax)  /* if page present */

When it's an immediate, using TESTB is generally helpful because
there's no (sign- or whatever-)extended immediate form of it.
When using a register, I think it would generally be better to
use native size, even if for register reads the partial register
access penalty may (today) be zero.
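
For reference, the encodings at play (assuming _PAGE_PRESENT == 1 and
the operands used in the patch; note TEST has no sign-extended imm8
form, unlike most ALU instructions):

        testl   $_PAGE_PRESENT, (%eax)   /* F7 00 01 00 00 00 - imm32 only */
        testb   $_PAGE_PRESENT, (%eax)   /* F6 00 01                       */
        test    %edx, (%eax)             /* 85 10             - reg form   */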

> @@ -701,22 +707,27 @@ trampoline_setup:
>          cmp     %edx, %ecx
>          jbe     1b
>  
> -        /* Initialize L3 boot-map page directory entries. */
> -        lea     __PAGE_HYPERVISOR+(L2_PAGETABLE_ENTRIES*8)*3+sym_esi(l2_bootmap),%eax
> -        mov     $4,%ecx
> -1:      mov     %eax,sym_fs(l3_bootmap)-8(,%ecx,8)
> -        sub     $(L2_PAGETABLE_ENTRIES*8),%eax
> -        loop    1b
> -
> -        /* Map the permanent trampoline page into l{1,2}_bootmap[]. */
> +        /* Map 4x l2_bootmap[] into l3_bootmap[0...3] */
> +        lea     __PAGE_HYPERVISOR + sym_esi(l2_bootmap), %eax
> +        mov     $PAGE_SIZE, %edx
> +        mov     %eax, 0  + sym_esi(l3_bootmap)
> +        add     %edx, %eax
> +        mov     %eax, 8  + sym_esi(l3_bootmap)
> +        add     %edx, %eax
> +        mov     %eax, 16 + sym_esi(l3_bootmap)
> +        add     %edx, %eax
> +        mov     %eax, 24 + sym_esi(l3_bootmap)

It took me a moment to realize the code is correct despite there
not being any mention of PAGE_SIZE between each of the MOVs. As
you don't view code size as a (primary) concern, perhaps worth
using

        add     $PAGE_SIZE, %eax

everywhere, the more that this has a special, ModR/M-less
encoding?

Jan
Andrew Cooper Jan. 22, 2020, 3:43 p.m. UTC | #2
On 20/01/2020 10:46, Jan Beulich wrote:
> On 17.01.2020 21:42, Andrew Cooper wrote:
>> For __page_tables_{start,end} and L3 bootmap initialisation, the logic is
>> unnecessarily complicated owing to its attempt to use the LOOP instruction,
>> which results in an off-by-8 memory address to compensate for LOOP's
>> termination condition.
>>
>> Rewrite both loops for improved clarity and speed.
>>
>> Misc notes:
>>  * TEST $IMM, MEM can't macrofuse.  The loop has 0x1200 iterations, so pull
>>    the $_PAGE_PRESENT constant out into a spare register to turn the TEST into
>>    its %REG, MEM form, which can macrofuse.
>>  * Avoid the use of %fs-relative references.  %esi-relative is the more common
>>    form in the code, and doesn't suffer an address generation overhead.
>>  * Avoid LOOP.  CMP/JB isn't microcoded and is faster to execute in all
>>    cases (see the sketch below).
>>  * For a trivial 4-iteration loop, even compilers unroll these.  The
>>    generated code is fractionally larger, but this is init code and the asm
>>    is far easier to follow.
>>  * Reposition the l2=>l1 bootmap construction so the asm reads in pagetable
>>    level order.
>>
>> No functional change.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> with two remarks/questions, but leaving it up to you whether
> you want to adjust the code:
>
>> --- a/xen/arch/x86/boot/head.S
>> +++ b/xen/arch/x86/boot/head.S
>> @@ -662,11 +662,17 @@ trampoline_setup:
>>          mov     %edx,sym_fs(boot_tsc_stamp)+4
>>  
>>          /* Relocate pagetables to point at Xen's current location in memory. */
>> -        mov     $((__page_tables_end-__page_tables_start)/8),%ecx
>> -1:      testl   $_PAGE_PRESENT,sym_fs(__page_tables_start)-8(,%ecx,8)
>> +        mov     $_PAGE_PRESENT, %edx
>> +        lea     sym_esi(__page_tables_start), %eax
>> +        lea     sym_esi(__page_tables_end), %edi
>> +
>> +1:      testb   %dl, (%eax)  /* if page present */
> When it's an immediate, using TESTB is generally helpful because
> there's no (sign- or whatever-)extended immediate form of it.
> When using a register, I think it would generally be better to
> use native size, even if for register reads the partial register
> access penalty may (today) be zero.

I don't think it is plausible that partial access penalties will be
introduced.  Partial merge penalties occur as a consequence of making
register reads consistent under renaming, and implicit zeroing behaviour
exists to remove merge penalties.

Any 32bit or larger register write results in allocating a fresh
physical register entry, filling it with the data provided, and updating
the register allocation table.

For 16bit or 8bit writes, either the physical register file needs to
support RMW updates to an architectural register, or an extra set of
uops are needed to perform the merge in the pipeline itself, before
making a 32bit writeback.

What matters in this case is the size of the memory access, and whether
an 8bit access within the same cache line ever behaves differently from
a 32bit one.

However, we should switch to a 32bit access here, so we don't intermix
an 8bit read with a 32bit RMW.  Memory disambiguation speculation will
have an easier time of it on some parts, which will make an overall
difference.
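
i.e. something like (a sketch of that adjustment, with %edx still
holding $_PAGE_PRESENT):

1:      test    %edx, (%eax)  /* 32bit read, matching the 32bit RMW */
        jz      2f
        add     %esi, (%eax)  /* pte += base */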

>
>> @@ -701,22 +707,27 @@ trampoline_setup:
>>          cmp     %edx, %ecx
>>          jbe     1b
>>  
>> -        /* Initialize L3 boot-map page directory entries. */
>> -        lea     __PAGE_HYPERVISOR+(L2_PAGETABLE_ENTRIES*8)*3+sym_esi(l2_bootmap),%eax
>> -        mov     $4,%ecx
>> -1:      mov     %eax,sym_fs(l3_bootmap)-8(,%ecx,8)
>> -        sub     $(L2_PAGETABLE_ENTRIES*8),%eax
>> -        loop    1b
>> -
>> -        /* Map the permanent trampoline page into l{1,2}_bootmap[]. */
>> +        /* Map 4x l2_bootmap[] into l3_bootmap[0...3] */
>> +        lea     __PAGE_HYPERVISOR + sym_esi(l2_bootmap), %eax
>> +        mov     $PAGE_SIZE, %edx
>> +        mov     %eax, 0  + sym_esi(l3_bootmap)
>> +        add     %edx, %eax
>> +        mov     %eax, 8  + sym_esi(l3_bootmap)
>> +        add     %edx, %eax
>> +        mov     %eax, 16 + sym_esi(l3_bootmap)
>> +        add     %edx, %eax
>> +        mov     %eax, 24 + sym_esi(l3_bootmap)
> It took me a moment to realize the code is correct despite there
> not being any mention of PAGE_SIZE between each of the MOVs. As
> you don't view code size as a (primary) concern, perhaps worth
> using
>
>         add     $PAGE_SIZE, %eax
>
> everywhere, the more that this has a special, ModR/M-less
> encoding?

I had it that way first time around.  Sadly, $PAGE_SIZE can't be
expressed as imm8, which is why I switched to using %edx.

I'm not overly fussed either way, so given the confusion, I'll switch
back to this form.
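
For reference: $PAGE_SIZE (0x1000) doesn't fit the sign-extended imm8
form (83 /0 ib), so the two candidate encodings are:

        add     $PAGE_SIZE, %eax   /* 05 00 10 00 00 - ModR/M-less, 5 bytes */
        add     %edx, %eax         /* 01 D0          -              2 bytes */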

~Andrew

Patch

diff --git a/xen/arch/x86/boot/head.S b/xen/arch/x86/boot/head.S
index 1deeae2f2a..1acaf817ba 100644
--- a/xen/arch/x86/boot/head.S
+++ b/xen/arch/x86/boot/head.S
@@ -662,11 +662,17 @@  trampoline_setup:
         mov     %edx,sym_fs(boot_tsc_stamp)+4
 
         /* Relocate pagetables to point at Xen's current location in memory. */
-        mov     $((__page_tables_end-__page_tables_start)/8),%ecx
-1:      testl   $_PAGE_PRESENT,sym_fs(__page_tables_start)-8(,%ecx,8)
+        mov     $_PAGE_PRESENT, %edx
+        lea     sym_esi(__page_tables_start), %eax
+        lea     sym_esi(__page_tables_end), %edi
+
+1:      testb   %dl, (%eax)  /* if page present */
         jz      2f
-        add     %esi,sym_fs(__page_tables_start)-8(,%ecx,8)
-2:      loop    1b
+        add     %esi, (%eax) /* pte += base */
+2:      add     $8, %eax
+
+        cmp     %edi, %eax
+        jb      1b
 
         /* Map Xen into the higher mappings using 2M superpages. */
         lea     _PAGE_PSE + PAGE_HYPERVISOR_RWX + sym_esi(_start), %eax
@@ -701,22 +707,27 @@  trampoline_setup:
         cmp     %edx, %ecx
         jbe     1b
 
-        /* Initialize L3 boot-map page directory entries. */
-        lea     __PAGE_HYPERVISOR+(L2_PAGETABLE_ENTRIES*8)*3+sym_esi(l2_bootmap),%eax
-        mov     $4,%ecx
-1:      mov     %eax,sym_fs(l3_bootmap)-8(,%ecx,8)
-        sub     $(L2_PAGETABLE_ENTRIES*8),%eax
-        loop    1b
-
-        /* Map the permanent trampoline page into l{1,2}_bootmap[]. */
+        /* Map 4x l2_bootmap[] into l3_bootmap[0...3] */
+        lea     __PAGE_HYPERVISOR + sym_esi(l2_bootmap), %eax
+        mov     $PAGE_SIZE, %edx
+        mov     %eax, 0  + sym_esi(l3_bootmap)
+        add     %edx, %eax
+        mov     %eax, 8  + sym_esi(l3_bootmap)
+        add     %edx, %eax
+        mov     %eax, 16 + sym_esi(l3_bootmap)
+        add     %edx, %eax
+        mov     %eax, 24 + sym_esi(l3_bootmap)
+
+        /* Map l1_bootmap[] into l2_bootmap[0]. */
+        lea     __PAGE_HYPERVISOR + sym_esi(l1_bootmap), %eax
+        mov     %eax, sym_esi(l2_bootmap)
+
+        /* Map the permanent trampoline page into l1_bootmap[]. */
         mov     sym_esi(trampoline_phys), %ecx
         lea     __PAGE_HYPERVISOR_RX(%ecx), %edx /* %edx = PTE to write  */
         shr     $PAGE_SHIFT, %ecx                /* %ecx = Slot to write */
         mov     %edx, sym_offs(l1_bootmap)(%esi, %ecx, 8)
 
-        lea     __PAGE_HYPERVISOR + sym_esi(l1_bootmap), %edx
-        mov     %edx, sym_esi(l2_bootmap)
-
         /* Apply relocations to bootstrap trampoline. */
         mov     sym_esi(trampoline_phys), %edx
         lea     sym_esi(__trampoline_rel_start), %edi