diff mbox series

[PATCHv3,3/4] x86/64/kexec: Map original relocate_kernel() in init_transition_pgtable()

Message ID 20240819070827.3620020-4-kirill.shutemov@linux.intel.com (mailing list archive)
State Handled Elsewhere, archived
Series x86: Reduce code duplication on page table initialization

Commit Message

Kirill A. Shutemov Aug. 19, 2024, 7:08 a.m. UTC
The init_transition_pgtable() function sets up transitional page tables.
It ensures that the relocate_kernel() function is present in the
identity mapping at the same location as in the kernel page tables.
relocate_kernel() switches to the identity mapping, and the function
must be present at the same location in the virtual address space before
and after switching page tables.

init_transition_pgtable() maps a copy of relocate_kernel() in
image->control_code_page at the relocate_kernel() virtual address, but
the original physical address of relocate_kernel() would also work.

It is safe to use the original relocate_kernel() physical address
because it cannot be overwritten until swap_pages() is called, and the
relocate_kernel() virtual address will not be used by then.

Map the original relocate_kernel() at the relocate_kernel() virtual
address in the identity mapping. This is preparation for replacing the
init_transition_pgtable() implementation with a call to
kernel_ident_mapping_init().

Note that while relocate_kernel() switches to the identity mapping, it
does not flush global TLB entries (CR4.PGE is not cleared). This means
that, even before this change, the kernel in most cases still runs
relocate_kernel() from the original physical address.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/machine_kexec_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Huang, Kai Aug. 19, 2024, 11:16 a.m. UTC | #1
On Mon, 2024-08-19 at 10:08 +0300, Kirill A. Shutemov wrote:
> The init_transition_pgtable() function sets up transitional page tables.
> It ensures that the relocate_kernel() function is present in the
> identity mapping at the same location as in the kernel page tables.
> relocate_kernel() switches to the identity mapping, and the function
> must be present at the same location in the virtual address space before
> and after switching page tables.
> 
> init_transition_pgtable() maps a copy of relocate_kernel() in
> image->control_code_page at the relocate_kernel() virtual address, but
> the original physical address of relocate_kernel() would also work.
> 
> It is safe to use the original relocate_kernel() physical address
> because it cannot be overwritten until swap_pages() is called, and the
> relocate_kernel() virtual address will not be used by then.
> 
> Map the original relocate_kernel() at the relocate_kernel() virtual
> address in the identity mapping. It is preparation to replace the
> init_transition_pgtable() implementation with a call to
> kernel_ident_mapping_init().
> 
> Note that while relocate_kernel() switches to the identity mapping, it
> does not flush global TLB entries (CR4.PGE is not cleared). This means
> that in most cases, the kernel still runs relocate_kernel() from the
> original physical address before the change.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/kernel/machine_kexec_64.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 9c9ac606893e..645690e81c2d 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -157,7 +157,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
>  	pte_t *pte;
>  
>  	vaddr = (unsigned long)relocate_kernel;
> -	paddr = __pa(page_address(image->control_code_page)+PAGE_SIZE);
> +	paddr = __pa(relocate_kernel);
>  	pgd += pgd_index(vaddr);
>  	if (!pgd_present(*pgd)) {
>  		p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL);


IIUC, this breaks KEXEC_JUMP (image->preserve_context is true).

relocate_kernel() first saves a couple of registers and some other data,
such as the PA of the swap page, to the control page.  Note that
VA_CONTROL_PAGE is used to access the control page here, so that data is
saved to the control page.

SYM_CODE_START_NOALIGN(relocate_kernel)
        UNWIND_HINT_END_OF_STACK
        ANNOTATE_NOENDBR
        /*      
         * %rdi indirection_page
         * %rsi page_list
         * %rdx start address
         * %rcx preserve_context
         * %r8  bare_metal
         */

	...

        movq    PTR(VA_CONTROL_PAGE)(%rsi), %r11                             
        movq    %rsp, RSP(%r11)                                              
        movq    %cr0, %rax
        movq    %rax, CR0(%r11)
        movq    %cr3, %rax
        movq    %rax, CR3(%r11)
        movq    %cr4, %rax
        movq    %rax, CR4(%r11)

	...

	/*
         * get physical address of control page now
         * this is impossible after page table switch
         */
        movq    PTR(PA_CONTROL_PAGE)(%rsi), %r8

        /* get physical address of page table now too */
        movq    PTR(PA_TABLE_PAGE)(%rsi), %r9

        /* get physical address of swap page now */
        movq    PTR(PA_SWAP_PAGE)(%rsi), %r10

        /* save some information for jumping back */
        movq    %r9, CP_PA_TABLE_PAGE(%r11)
        movq    %r10, CP_PA_SWAP_PAGE(%r11)
        movq    %rdi, CP_PA_BACKUP_PAGES_MAP(%r11)

	...

And after jumping back from the second kernel, relocate_kernel() tries to
restore the saved data:

	...

        /* get the re-entry point of the peer system */
        movq    0(%rsp), %rbp
        leaq    relocate_kernel(%rip), %r8		<---------  (*) 
        movq    CP_PA_SWAP_PAGE(%r8), %r10
        movq    CP_PA_BACKUP_PAGES_MAP(%r8), %rdi
        movq    CP_PA_TABLE_PAGE(%r8), %rax
        movq    %rax, %cr3
        lea     PAGE_SIZE(%r8), %rsp
        call    swap_pages
        movq    $virtual_mapped, %rax
        pushq   %rax
        ANNOTATE_UNRET_SAFE
        ret
        int3
SYM_CODE_END(identity_mapped)

Note that the code at (*) uses the VA of relocate_kernel() to access the
control page.  IIUC, that means if we map the VA of relocate_kernel() to the
original PA where the relocate_kernel() code resides, then the above code will
never be able to read that data back, since it was saved to the control page.

Did I miss anything?
Kirill A. Shutemov Aug. 19, 2024, 11:57 a.m. UTC | #2
On Mon, Aug 19, 2024 at 11:16:52AM +0000, Huang, Kai wrote:
> Note the above code (*) uses the VA of relocate_kernel() to access the control
> page.  IIUC, that means if we map VA of relocate_kernel() to the original PA
> where the code relocate_kernel() resides, then the above code will never be
> able to read those data back since they were saved to the control page.
> 
> Did I miss anything?

Note that the relocate_kernel() usage at (*) is inside identity_mapped(). We
run from the identity mapping there. Nothing has changed in the identity
mapping around relocate_kernel(); only the top mapping (at __START_KERNEL_map)
is affected.

But I didn't test the kexec jump path. Do you (or anybody else) have a setup
to test it?
Huang, Kai Aug. 19, 2024, 12:39 p.m. UTC | #3
On Mon, 2024-08-19 at 14:57 +0300, kirill.shutemov@linux.intel.com wrote:
> Note that relocate_kernel() usage at (*) is inside identity_mapped(). We
> run from identity mapping there. Nothing changed to identity mapping
> around relocate_kernel(), only top mapping (at __START_KERNEL_map) is
> affected.

Yes, but before this patch the VA of relocate_kernel() is mapped to the copied
one, which resides in the control page:

        control_page = page_address(image->control_code_page) + PAGE_SIZE;
        __memcpy(control_page, relocate_kernel, KEXEC_CONTROL_CODE_MAX_SIZE);
        
        page_list[PA_CONTROL_PAGE] = virt_to_phys(control_page);
        page_list[VA_CONTROL_PAGE] = (unsigned long)control_page;    

So (*) can actually access the control page, IIUC.

Now if we change the mapping so that the VA of relocate_kernel() points to the
original one, then (*) won't be able to access the control page.

> 
> But I didn't test kexec jump thing. Do you (or anybody else) have setup to
> test it?
> 

No, I don't know how to test it either; this is just my understanding of the
code :-(

Git blame says Ying is the original author, so +Ying here hoping he can
provide some insight.

Anyway, my opinion is that we should do patch 4 first but still map the VA of
relocate_kernel() to the control page so there is no functional change.  This
patchset is about reducing duplicated code anyway.
Kirill A. Shutemov Aug. 20, 2024, 10:13 a.m. UTC | #4
On Mon, Aug 19, 2024 at 12:39:23PM +0000, Huang, Kai wrote:
> Yes, but before this patch the VA of relocate_kernel() is mapped to the copied
> one, which resides in the control page:
> 
>         control_page = page_address(image->control_code_page) + PAGE_SIZE;
>         __memcpy(control_page, relocate_kernel, KEXEC_CONTROL_CODE_MAX_SIZE);
>         
>         page_list[PA_CONTROL_PAGE] = virt_to_phys(control_page);
>         page_list[VA_CONTROL_PAGE] = (unsigned long)control_page;    
> 
> So the (*) can actually access to the control page IIUC.
> 
> Now if we change to map VA of relocate_kernel() to the original one, then (*)
> won't be able to access the control page.

No, it will still be able to access the control page.

We call relocate_kernel() in normal kernel text (within
__START_KERNEL_map).

relocate_kernel() switches to the identity mapping; the VA is still the same.

relocate_kernel() jumps to identity_mapped() in the control page:


	/*
	 * get physical address of control page now
	 * this is impossible after page table switch
	 */
	movq	PTR(PA_CONTROL_PAGE)(%rsi), %r8

	...

	/* jump to identity mapped page */
	addq	$(identity_mapped - relocate_kernel), %r8
	pushq	%r8
	ANNOTATE_UNRET_SAFE
	ret

The ADDQ computes the address of identity_mapped() within the control page.
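
As a rough illustration of that ADDQ, the jump target is just the physical
base of the control page plus a link-time constant offset. The sketch below is
a userspace model only; the function name and the example addresses used with
it are hypothetical and not taken from the kernel:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of "addq $(identity_mapped - relocate_kernel), %r8":
 * %r8 starts as the physical address of the control page (where the
 * copy of relocate_kernel() begins) and the immediate is the distance
 * between the two symbols, fixed when the kernel is linked.
 */
static uint64_t identity_mapped_entry(uint64_t pa_control_page,
				      uint64_t link_relocate_kernel,
				      uint64_t link_identity_mapped)
{
	/* identity_mapped is assumed to follow relocate_kernel in the code. */
	assert(link_identity_mapped >= link_relocate_kernel);

	return pa_control_page + (link_identity_mapped - link_relocate_kernel);
}
```

Only the difference between the two link-time addresses enters the result, so
the target always lands inside the copy in the control page.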

identity_mapped() finds the start of the control page from the *relative*
position of relocate_kernel() to the current RIP in the control page:

	leaq	relocate_kernel(%rip), %r8

It looks like this in my kernel binary:

	lea    -0xfa(%rip),%r8

Which PA is mapped at the normal kernel text VA of relocate_kernel() has
zero effect on the calculation.

Does it make sense?
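
To make the relative-addressing argument concrete, here is a minimal userspace
sketch in C. It models what the assembler and the CPU do for
"leaq relocate_kernel(%rip), %r8": the displacement is fixed at link time, so
the result depends only on where the executing copy lives. All function names
and example addresses are hypothetical, chosen for illustration; this is not
kernel code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Displacement the assembler would encode for
 * "leaq relocate_kernel(%rip), %r8": the target minus the address of
 * the *next* instruction, both taken from the link-time layout.
 */
static int64_t link_time_disp(uint64_t link_target, uint64_t link_next_rip)
{
	return (int64_t)(link_target - link_next_rip);
}

/* What the CPU computes at run time: next-instruction RIP plus displacement. */
static uint64_t lea_rip_relative(uint64_t runtime_next_rip, int64_t disp)
{
	return runtime_next_rip + (uint64_t)disp;
}

/*
 * Start of the copy as seen by code executing inside that copy.
 * Note that the physical page backing the kernel-text VA of
 * relocate_kernel() is not an input to this calculation at all.
 */
static uint64_t copy_base_seen_from_copy(uint64_t link_base,
					 uint64_t link_next_rip,
					 uint64_t copy_base)
{
	int64_t disp;
	uint64_t offset_in_code;

	/* In this model the lea sits after the symbol it refers back to. */
	assert(link_next_rip >= link_base);

	disp = link_time_disp(link_base, link_next_rip);
	offset_in_code = link_next_rip - link_base;

	return lea_rip_relative(copy_base + offset_in_code, disp);
}
```

With a displacement of -0xfa as in the disassembly above, the instruction
following the lea sits 0xfa bytes into the code, and the model returns the
base of whichever copy is executing.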
Huang, Kai Aug. 20, 2024, 11:06 a.m. UTC | #5
> > 
> > So the (*) can actually access to the control page IIUC.
> > 
> > Now if we change to map VA of relocate_kernel() to the original one, then (*)
> > won't be able to access the control page.
> 
> No, it still will be able to access control page.
> 
> So we call relocate_kernel() in normal kernel text (within
> __START_KERNEL_map).
> 
> relocate_kernel() switches to identity mapping, VA is still the same.
> 
> relocate_kernel() jumps to identity_mapped() in the control page:
> 
> 
> 	/*
> 	 * get physical address of control page now
> 	 * this is impossible after page table switch
> 	 */
> 	movq	PTR(PA_CONTROL_PAGE)(%rsi), %r8
> 
> 	...
> 
> 	/* jump to identity mapped page */
> 	addq	$(identity_mapped - relocate_kernel), %r8
> 	pushq	%r8
> 	ANNOTATE_UNRET_SAFE
> 	ret
> 
> The ADDQ finds offset of identity_mapped() in the control page.
> 
> identity_mapped() finds the start of the control page from the *relative*
> position of relocate_kernel() to the current RIP in the control page:
> 
> 	leaq	relocate_kernel(%rip), %r8
> 
> It looks like this in my kernel binary:
> 
> 	lea    -0xfa(%rip),%r8

Ah I see.  I missed the *relative* addressing. :-)

> 
> Which PA is mapped at the normal kernel text VA of relocate_kernel() has
> zero effect on the calculation.

Yeah.

> 
> Does it make sense?
> 

Yes.  Thanks for explanation.

At a later time:

        call    swap_pages                                                   
        movq    $virtual_mapped, %rax  	<---- (1)                            
        pushq   %rax
        ANNOTATE_UNRET_SAFE
        ret  				<---- (2)

(1) loads a VA within __START_KERNEL_map into %rax, and after (2) the kernel
runs at the VA of the original relocate_kernel(), which maps to the PA of the
original relocate_kernel().  But I think the memory page of the original
relocate_kernel() won't get corrupted after returning from the second kernel,
so it should be safe to use?
Kirill A. Shutemov Aug. 20, 2024, 11:14 a.m. UTC | #6
On Tue, Aug 20, 2024 at 11:06:34AM +0000, Huang, Kai wrote:
> At later time:
> 
>         call    swap_pages                                                   
>         movq    $virtual_mapped, %rax  	<---- (1)                            
>         pushq   %rax
>         ANNOTATE_UNRET_SAFE
>         ret  				<---- (2)
> 
> (1) will load the VA which has __START_KERNEL_map to %rax, and after (2) the
> kernel will run at VA of the original relocate_kernel() which maps to the PA
> of the original relocate_kernel().  But I think the memory page of the
> original relocate_kernel() won't get corrupted after returning from the second
> kernel, so should be safe to use?

Yes.
Huang, Kai Aug. 20, 2024, 11:52 a.m. UTC | #7
On Tue, 2024-08-20 at 14:14 +0300, kirill.shutemov@linux.intel.com wrote:
> On Tue, Aug 20, 2024 at 11:06:34AM +0000, Huang, Kai wrote:
> > At later time:
> > 
> >         call    swap_pages                                                   
> >         movq    $virtual_mapped, %rax  	<---- (1)                            
> >         pushq   %rax
> >         ANNOTATE_UNRET_SAFE
> >         ret  				<---- (2)
> > 
> > (1) will load the VA which has __START_KERNEL_map to %rax, and after (2) the
> > kernel will run at VA of the original relocate_kernel() which maps to the PA
> > of the original relocate_kernel().  But I think the memory page of the
> > original relocate_kernel() won't get corrupted after returning from the second
> > kernel, so should be safe to use?
> 
> Yes.
> 

Reviewed-by: Kai Huang <kai.huang@intel.com>

Patch

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 9c9ac606893e..645690e81c2d 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -157,7 +157,7 @@  static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
 	pte_t *pte;
 
 	vaddr = (unsigned long)relocate_kernel;
-	paddr = __pa(page_address(image->control_code_page)+PAGE_SIZE);
+	paddr = __pa(relocate_kernel);
 	pgd += pgd_index(vaddr);
 	if (!pgd_present(*pgd)) {
 		p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL);
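
The hunk above begins the page-table walk with pgd_index(vaddr). As a rough
illustration of how such per-level indices fall out of a virtual address, here
is a small userspace sketch that assumes 4-level paging with 4 KiB pages (with
5-level paging the top-level shift differs). The sketch_* names are
hypothetical and are not the kernel's helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 4-level x86-64 paging, 9 index bits per level. */
#define SKETCH_PAGE_SHIFT	12
#define SKETCH_PMD_SHIFT	21
#define SKETCH_PUD_SHIFT	30
#define SKETCH_PGDIR_SHIFT	39
#define SKETCH_PTRS_PER_TABLE	512

struct sketch_indices {
	unsigned int pgd;
	unsigned int pud;
	unsigned int pmd;
	unsigned int pte;
};

/* Extract the 9-bit table index that starts at bit position 'shift'. */
static unsigned int sketch_index(uint64_t vaddr, unsigned int shift)
{
	assert(shift < 64);
	return (unsigned int)((vaddr >> shift) & (SKETCH_PTRS_PER_TABLE - 1));
}

/* Indices used at each level when walking the tables for 'vaddr'. */
static struct sketch_indices sketch_walk_indices(uint64_t vaddr)
{
	struct sketch_indices idx;

	idx.pgd = sketch_index(vaddr, SKETCH_PGDIR_SHIFT);
	idx.pud = sketch_index(vaddr, SKETCH_PUD_SHIFT);
	idx.pmd = sketch_index(vaddr, SKETCH_PMD_SHIFT);
	idx.pte = sketch_index(vaddr, SKETCH_PAGE_SHIFT);
	return idx;
}
```

The patch changes only which physical address ends up in the final PTE; the
virtual address, and therefore every index along the walk, stays the same.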