diff mbox series

[1/3] x86/boot: Drop pte_update_limit from physical relocation logic

Message ID 20211207105339.28440-2-andrew.cooper3@citrix.com (mailing list archive)
State New, archived
Headers show
Series x86/boot: Cleanup | expand

Commit Message

Andrew Cooper Dec. 7, 2021, 10:53 a.m. UTC
This check has existed in one form or another since c/s 369bafdb1c1 "xen: Big
changes to x86 start-of-day" in 2007.  Even then, AFAICT, it wasn't necessary
because there was a clean split between the statically created entries (L3 and
higher) and the dynamically created ones (L2 and lower).

Without dissecting the myriad changes over the past 14 years, I'm pretty
certain Xen only booted by accident when l2_xenmap[0] was handled specially
and skipped the pte_update_limit check which would have left it corrupt.

Nevertheless, as of right now, all non-leaf PTEs (the first loop), and all 2M
xenmap leaf mappings (the second loop) need relocating.  They are no unused
mappings in the range, no mappings which will be encountered multiple times,
and it is unlikely that such mappings would be introduced.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
---
 xen/arch/x86/setup.c | 18 +++---------------
 1 file changed, 3 insertions(+), 15 deletions(-)

Comments

Jan Beulich Dec. 7, 2021, 11:37 a.m. UTC | #1
On 07.12.2021 11:53, Andrew Cooper wrote:
> This check has existed in one form or another since c/s 369bafdb1c1 "xen: Big
> changes to x86 start-of-day" in 2007.  Even then, AFAICT, it wasn't necessary
> because there was a clean split between the statically created entries (L3 and
> higher) and the dynamically created ones (L2 and lower).
> 
> Without dissecting the myriad changes over the past 14 years, I'm pretty
> certain Xen only booted by accident when l2_xenmap[0] was handled specially
> and skipped the pte_update_limit check which would have left it corrupt.
> 
> Nevertheless, as of right now, all non-leaf PTEs (the first loop), and all 2M
> xenmap leaf mappings (the second loop) need relocating.  They are no unused
> mappings in the range, no mappings which will be encountered multiple times,
> and it is unlikely that such mappings would be introduced.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

However, as to the description and ...

> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -1230,7 +1230,6 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>              l3_pgentry_t *pl3e;
>              l2_pgentry_t *pl2e;
>              int i, j, k;
> -            unsigned long pte_update_limit;
>  
>              /* Select relocation address. */
>              xen_phys_start = end - reloc_size;
> @@ -1238,14 +1237,6 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>              bootsym(trampoline_xen_phys_start) = xen_phys_start;
>  
>              /*
> -             * No PTEs pointing above this address are candidates for relocation.
> -             * Due to possibility of partial overlap of the end of source image
> -             * and the beginning of region for destination image some PTEs may
> -             * point to addresses in range [e, e + XEN_IMG_OFFSET).
> -             */
> -            pte_update_limit = PFN_DOWN(e);

... considering the comment here: Isn't it 0d31d1680868 ("x86/setup: do
not relocate Xen over current Xen image placement") where need for this
disappeared? Afaict the non-overlap of source and destination is the
crucial factor here, yet your description doesn't mention this aspect at
all. I'd therefore like to ask for an adjustment there.

Jan
Andrew Cooper Dec. 9, 2021, 5:34 p.m. UTC | #2
On 07/12/2021 11:37, Jan Beulich wrote:
> On 07.12.2021 11:53, Andrew Cooper wrote:
>> This check has existed in one form or another since c/s 369bafdb1c1 "xen: Big
>> changes to x86 start-of-day" in 2007.  Even then, AFAICT, it wasn't necessary
>> because there was a clean split between the statically created entries (L3 and
>> higher) and the dynamically created ones (L2 and lower).
>>
>> Without dissecting the myriad changes over the past 14 years, I'm pretty
>> certain Xen only booted by accident when l2_xenmap[0] was handled specially
>> and skipped the pte_update_limit check which would have left it corrupt.
>>
>> Nevertheless, as of right now, all non-leaf PTEs (the first loop), and all 2M
>> xenmap leaf mappings (the second loop) need relocating.  They are no unused
>> mappings in the range, no mappings which will be encountered multiple times,
>> and it is unlikely that such mappings would be introduced.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>
> However, as to the description and ...
>
>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -1230,7 +1230,6 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>              l3_pgentry_t *pl3e;
>>              l2_pgentry_t *pl2e;
>>              int i, j, k;
>> -            unsigned long pte_update_limit;
>>  
>>              /* Select relocation address. */
>>              xen_phys_start = end - reloc_size;
>> @@ -1238,14 +1237,6 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>              bootsym(trampoline_xen_phys_start) = xen_phys_start;
>>  
>>              /*
>> -             * No PTEs pointing above this address are candidates for relocation.
>> -             * Due to possibility of partial overlap of the end of source image
>> -             * and the beginning of region for destination image some PTEs may
>> -             * point to addresses in range [e, e + XEN_IMG_OFFSET).
>> -             */
>> -            pte_update_limit = PFN_DOWN(e);
> ... considering the comment here: Isn't it 0d31d1680868 ("x86/setup: do
> not relocate Xen over current Xen image placement") where need for this
> disappeared? Afaict the non-overlap of source and destination is the
> crucial factor here, yet your description doesn't mention this aspect at
> all. I'd therefore like to ask for an adjustment there.

I don't consider that commit relevant.

There is no circumstance ever where you can relocate Xen with
actually-overlapping ranges, because one way or another, you'd end up
copying non-pagetable data over the live pagetables.

That particular commit was part of trying to make Xen's entry code
relocatable, so the MB2 path could load Xen at somewhere which wasn't 0,
but to this day we still skip any physical relocation if Xen isn't
started at 0.


To the comment specifically, it's actively wrong.

Back when XEN_IMG_OFFSET was 1M, and we had 16M worth of unconditional
mappings, then we could get PTE overlap as described, in the corner case
where we were moving Xen from 0 to anywhere between 4 and 16M physical
(2M physical was in principle a problem, but not an eligible position to
relocate to, given Xen's compile size).

And in that corner case, the logic would "corrupt" various PTEs by not
relocating them.  The PTE coving _start at 1M was special cased ahead of
the 2nd loop (finally fixed last week) and the PTEs mapping beyond _end
were unused which is why nothing actually went wrong.

~Andrew
Jan Beulich Dec. 10, 2021, 7:17 a.m. UTC | #3
On 09.12.2021 18:34, Andrew Cooper wrote:
> On 07/12/2021 11:37, Jan Beulich wrote:
>> On 07.12.2021 11:53, Andrew Cooper wrote:
>>> --- a/xen/arch/x86/setup.c
>>> +++ b/xen/arch/x86/setup.c
>>> @@ -1230,7 +1230,6 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>>              l3_pgentry_t *pl3e;
>>>              l2_pgentry_t *pl2e;
>>>              int i, j, k;
>>> -            unsigned long pte_update_limit;
>>>  
>>>              /* Select relocation address. */
>>>              xen_phys_start = end - reloc_size;
>>> @@ -1238,14 +1237,6 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>>              bootsym(trampoline_xen_phys_start) = xen_phys_start;
>>>  
>>>              /*
>>> -             * No PTEs pointing above this address are candidates for relocation.
>>> -             * Due to possibility of partial overlap of the end of source image
>>> -             * and the beginning of region for destination image some PTEs may
>>> -             * point to addresses in range [e, e + XEN_IMG_OFFSET).
>>> -             */
>>> -            pte_update_limit = PFN_DOWN(e);
>> ... considering the comment here: Isn't it 0d31d1680868 ("x86/setup: do
>> not relocate Xen over current Xen image placement") where need for this
>> disappeared? Afaict the non-overlap of source and destination is the
>> crucial factor here, yet your description doesn't mention this aspect at
>> all. I'd therefore like to ask for an adjustment there.
> 
> I don't consider that commit relevant.
> 
> There is no circumstance ever where you can relocate Xen with
> actually-overlapping ranges, because one way or another, you'd end up
> copying non-pagetable data over the live pagetables.

That was fragile, yes. I think I (vaguely!) recall a discussion I had
with Keir about this, with him pointing out that the logic builds upon
all necessary mappings being held in the TLB. If you strictly think
that's not worthwhile to consider or mention, then so be it.

> That particular commit was part of trying to make Xen's entry code
> relocatable, so the MB2 path could load Xen at somewhere which wasn't 0,
> but to this day we still skip any physical relocation if Xen isn't
> started at 0.
> 
> 
> To the comment specifically, it's actively wrong.
> 
> Back when XEN_IMG_OFFSET was 1M, and we had 16M worth of unconditional
> mappings, then we could get PTE overlap as described, in the corner case
> where we were moving Xen from 0 to anywhere between 4 and 16M physical
> (2M physical was in principle a problem, but not an eligible position to
> relocate to, given Xen's compile size).
> 
> And in that corner case, the logic would "corrupt" various PTEs by not
> relocating them.  The PTE coving _start at 1M was special cased ahead of
> the 2nd loop (finally fixed last week) and the PTEs mapping beyond _end
> were unused which is why nothing actually went wrong.

The latter fact being why I guess you've put "corrupt" in quotes.

Jan
Andrew Cooper Dec. 9, 2022, 8:36 p.m. UTC | #4
On 10/12/2021 07:17, Jan Beulich wrote:
> On 09.12.2021 18:34, Andrew Cooper wrote:
>> On 07/12/2021 11:37, Jan Beulich wrote:
>>> On 07.12.2021 11:53, Andrew Cooper wrote:
>>>> --- a/xen/arch/x86/setup.c
>>>> +++ b/xen/arch/x86/setup.c
>>>> @@ -1230,7 +1230,6 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>>>              l3_pgentry_t *pl3e;
>>>>              l2_pgentry_t *pl2e;
>>>>              int i, j, k;
>>>> -            unsigned long pte_update_limit;
>>>>  
>>>>              /* Select relocation address. */
>>>>              xen_phys_start = end - reloc_size;
>>>> @@ -1238,14 +1237,6 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>>>              bootsym(trampoline_xen_phys_start) = xen_phys_start;
>>>>  
>>>>              /*
>>>> -             * No PTEs pointing above this address are candidates for relocation.
>>>> -             * Due to possibility of partial overlap of the end of source image
>>>> -             * and the beginning of region for destination image some PTEs may
>>>> -             * point to addresses in range [e, e + XEN_IMG_OFFSET).
>>>> -             */
>>>> -            pte_update_limit = PFN_DOWN(e);
>>> ... considering the comment here: Isn't it 0d31d1680868 ("x86/setup: do
>>> not relocate Xen over current Xen image placement") where need for this
>>> disappeared? Afaict the non-overlap of source and destination is the
>>> crucial factor here, yet your description doesn't mention this aspect at
>>> all. I'd therefore like to ask for an adjustment there.
>> I don't consider that commit relevant.
>>
>> There is no circumstance ever where you can relocate Xen with
>> actually-overlapping ranges, because one way or another, you'd end up
>> copying non-pagetable data over the live pagetables.
> That was fragile, yes. I think I (vaguely!) recall a discussion I had
> with Keir about this, with him pointing out that the logic builds upon
> all necessary mappings being held in the TLB. If you strictly think
> that's not worthwhile to consider or mention, then so be it.

Ok, I'll tweak the commit message.  But having come back to this after
more than a year away, I think it's worth pointing out that the details
in 0d31d1680868 demonstrate that the "partial overlap" logic was indeed
buggy.

In a VM, a vcpu migration at just the wrong moment will empty the TLB
under your feet.  So will a whole slew of micro-architectural
conditions, or a poorly timed SMI. 

Whatever people have tried to call it in the past, it was broken plain
and simple.

~Andrew
diff mbox series

Patch

diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index c8641c227d9a..0492856292cf 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1230,7 +1230,6 @@  void __init noreturn __start_xen(unsigned long mbi_p)
             l3_pgentry_t *pl3e;
             l2_pgentry_t *pl2e;
             int i, j, k;
-            unsigned long pte_update_limit;
 
             /* Select relocation address. */
             xen_phys_start = end - reloc_size;
@@ -1238,14 +1237,6 @@  void __init noreturn __start_xen(unsigned long mbi_p)
             bootsym(trampoline_xen_phys_start) = xen_phys_start;
 
             /*
-             * No PTEs pointing above this address are candidates for relocation.
-             * Due to possibility of partial overlap of the end of source image
-             * and the beginning of region for destination image some PTEs may
-             * point to addresses in range [e, e + XEN_IMG_OFFSET).
-             */
-            pte_update_limit = PFN_DOWN(e);
-
-            /*
              * Perform relocation to new physical address.
              * Before doing so we must sync static/global data with main memory
              * with a barrier(). After this we must *not* modify static/global
@@ -1267,8 +1258,7 @@  void __init noreturn __start_xen(unsigned long mbi_p)
                 {
                     /* Not present, 1GB mapping, or already relocated? */
                     if ( !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) ||
-                         (l3e_get_flags(*pl3e) & _PAGE_PSE) ||
-                         (l3e_get_pfn(*pl3e) >= pte_update_limit) )
+                         (l3e_get_flags(*pl3e) & _PAGE_PSE) )
                         continue;
                     *pl3e = l3e_from_intpte(l3e_get_intpte(*pl3e) +
                                             xen_phys_start);
@@ -1277,8 +1267,7 @@  void __init noreturn __start_xen(unsigned long mbi_p)
                     {
                         /* Not present, PSE, or already relocated? */
                         if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) ||
-                             (l2e_get_flags(*pl2e) & _PAGE_PSE) ||
-                             (l2e_get_pfn(*pl2e) >= pte_update_limit) )
+                             (l2e_get_flags(*pl2e) & _PAGE_PSE) )
                             continue;
                         *pl2e = l2e_from_intpte(l2e_get_intpte(*pl2e) +
                                                 xen_phys_start);
@@ -1291,8 +1280,7 @@  void __init noreturn __start_xen(unsigned long mbi_p)
             for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++, pl2e++ )
             {
                 if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) ||
-                     !(l2e_get_flags(*pl2e) & _PAGE_PSE) ||
-                     (l2e_get_pfn(*pl2e) >= pte_update_limit) )
+                     !(l2e_get_flags(*pl2e) & _PAGE_PSE) )
                     continue;
 
                 *pl2e = l2e_from_intpte(l2e_get_intpte(*pl2e) + xen_phys_start);