diff mbox series

KVM: arm64: Flush hyp bss section after initialization of variables in bss

Message ID 20250120151514.1769597-1-lokeshvutla@google.com (mailing list archive)
State New
Series KVM: arm64: Flush hyp bss section after initialization of variables in bss

Commit Message

Lokesh Vutla Jan. 20, 2025, 3:15 p.m. UTC
To determine CPU features during initialization, the nVHE hypervisor
utilizes sanitized values of the host's CPU feature registers. These
values, stored in the u64 id_aa64*_el1_sys_val variables, are updated by
the kvm_hyp_init_symbols() function at EL1. To ensure they are visible
at EL2 with the MMU off, the data cache needs to be flushed after these
updates. However, individually flushing each variable using
kvm_flush_dcache_to_poc() is inefficient.

These CPU feature variables are part of the BSS section of the
hypervisor. Hence, flush the entire hypervisor BSS section once the
initialization is complete.

Motivation for this change:
* Since the existing variables are not flushed from EL1,
  id_aa64pfr0_el1_sys_val is seen as 0 from EL2.
* Based on this value, the check_override macro in the hypervisor skips
  updating SVE state (cpacr_el1) at finalise_el2_state.
* The default value of cpacr_el1 enables SVE traps to EL2.
* With SVE enabled, during the context switch from EL0 -> EL1 (which is
  much later in the boot process), the SVE registers are saved/restored.
* Since SVE traps are enabled, accessing SVE registers at EL1 causes a
  trap to EL2.
* However, the hypervisor is not ready to handle SVE traps at this
  stage, causing the kernel crash below during boot:

[    0.320850][    T1] Run /init as init process
[    0.321392][    T1] kvm [1]: nVHE hyp BUG at: [<ffffffc08112ee8c>] __kvm_nvhe_$x.24+0x254/0x254!
[    0.321522][    T1] kvm [1]: Cannot dump pKVM nVHE stacktrace: !CONFIG_PROTECTED_NVHE_STACKTRACE
[    0.321635][    T1] kvm [1]: Hyp Offset: 0xffffff6e60000000
[    0.321710][    T1] Kernel panic - not syncing: HYP panic:
[    0.321710][    T1] PS:634023c9 PC:000000522112ee8c ESR:00000000f2000800
[    0.321710][    T1] FAR:0000000000000000 CPACR:0000000000310000 PAR:0000000000000800
[    0.321710][    T1] VCPU:0000000000000000
[...]
[    0.322251][    T1] Call trace:
[    0.322292][    T1]  dump_backtrace+0x100/0x180
[    0.322355][    T1]  show_stack+0x20/0x30
[    0.322410][    T1]  dump_stack_lvl+0x40/0x88
[    0.322471][    T1]  dump_stack+0x18/0x24
[    0.322523][    T1]  panic+0x13c/0x364
[    0.322578][    T1]  nvhe_hyp_panic_handler+0x148/0x1cc
[    0.322646][    T1]  do_sve_acc+0xec/0x260
[    0.322706][    T1]  el0_sve_acc+0x34/0x68

Fixes: 6c30bfb18d0b ("KVM: arm64: Add handlers for protected VM System Registers")
Suggested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Lokesh Vutla <lokeshvutla@google.com>
---
 arch/arm64/kvm/arm.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Marc Zyngier Jan. 20, 2025, 4:13 p.m. UTC | #1
On Mon, 20 Jan 2025 15:15:14 +0000,
Lokesh Vutla <lokeshvutla@google.com> wrote:
> 
> To determine CPU features during initialization, the NVHE hypervisor

s/NVHE/nVHE/

> utilizes sanitized values of the host's CPU features registers. These
> values, stored in u64 idaa64*_el1_sys_val variables are updated by the
> kvm_hyp_init_symbols() function at EL1. To ensure EL2 visibility, the

visibility *with the MMU off*

> data cache needs to be flushed after these updates. However,
> individually flushing each variable using kvm_flush_dcache_to_poc() is
> inefficient.
> 
> These cpu feature variables would be part of the bss section of
> the hypervisor. Hence, flush the entire bss section of hypervisor
> once the initialization is complete.
> 
> Motivation for this change:
> * Since the existing variables are not flushed from EL1, the
>   id_aa64pfr0_el1_sys_val is seen as 0 from EL2.
> * based on this value check_override macro in hypervisor skips
>   updating the sve (cpacr_el1) at finalise_el2_state.
> * The default value for cpacr_el1 enables the sve traps to EL2.
> * With sve enabled, during the context switch from EL0 -> EL1 (which is
>   much later in the boot process), the sve registers are saved/restored.
> * Since sve traps are enabled, accessing sve registers at EL1 caused a
>   trap to EL2.
> * However, hypervisor is not ready to handle sve traps at this stage
>   causing the below kernel crash during the boot:

Drop this section, it doesn't bring much to the discussion.

> 
> [    0.320850][    T1] Run /init as init process
> [    0.321392][    T1] kvm [1]: nVHE hyp BUG at: [<ffffffc08112ee8c>] __kvm_nvhe_$x.24+0x254/0x254!
> [    0.321522][    T1] kvm [1]: Cannot dump pKVM nVHE stacktrace: !CONFIG_PROTECTED_NVHE_STACKTRACE
> [    0.321635][    T1] kvm [1]: Hyp Offset: 0xffffff6e60000000
> [    0.321710][    T1] Kernel panic - not syncing: HYP panic:
> [    0.321710][    T1] PS:634023c9 PC:000000522112ee8c ESR:00000000f2000800
> [    0.321710][    T1] FAR:0000000000000000 CPACR:0000000000310000 PAR:0000000000000800
> [    0.321710][    T1] VCPU:0000000000000000
> [...]
> [    0.322251][    T1] Call trace:
> [    0.322292][    T1]  dump_backtrace+0x100/0x180
> [    0.322355][    T1]  show_stack+0x20/0x30
> [    0.322410][    T1]  dump_stack_lvl+0x40/0x88
> [    0.322471][    T1]  dump_stack+0x18/0x24
> [    0.322523][    T1]  panic+0x13c/0x364
> [    0.322578][    T1]  nvhe_hyp_panic_handler+0x148/0x1cc
> [    0.322646][    T1]  do_sve_acc+0xec/0x260
> [    0.322706][    T1]  el0_sve_acc+0x34/0x68

This is essentially content-free, given that there is no
backtrace. Please drop this.

>
> Fixes: 6c30bfb18d0b ("KVM: arm64: Add handlers for protected VM System Registers")
> Suggested-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Lokesh Vutla <lokeshvutla@google.com>
> ---
>  arch/arm64/kvm/arm.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index a102c3aebdbc..5d3b2069a2d5 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -2661,6 +2661,12 @@ static int __init init_hyp_mode(void)
>  		}
>  	}
>  
> +	/*
> +	 * Flush entire BSS since part of its data is read while the MMU is off.
> +	 */
> +	kvm_flush_dcache_to_poc(kvm_ksym_ref(__hyp_bss_start),
> +				kvm_ksym_ref(__hyp_bss_end) - kvm_ksym_ref(__hyp_bss_start));
> +
>  	return 0;
>  
>  out_err:

I don't understand how this fixes anything. At this stage, the
hypervisor has already been initialised, and I expect it will have
evaluated the wrong values.

Even worse, I strongly suspect that by the time you perform this, S2
is enabled on the host, and that the BSS is off-limit. Which means it
could fault and send you to lalaland.

Have you actually tested this with upstream?

I would have expected the clean operations to be called from
kvm_hyp_init_symbols(), which runs before EL2 gets initialised in
protected mode.

	M.
Lokesh Vutla Jan. 21, 2025, 4:24 a.m. UTC | #2
Hi Marc,

On Mon, Jan 20, 2025 at 9:44 PM Marc Zyngier <maz@kernel.org> wrote:
>
> On Mon, 20 Jan 2025 15:15:14 +0000,
> Lokesh Vutla <lokeshvutla@google.com> wrote:
> >
> > To determine CPU features during initialization, the NVHE hypervisor
>
> s/NVHE/nVHE/

ok

>
> > utilizes sanitized values of the host's CPU features registers. These
> > values, stored in u64 idaa64*_el1_sys_val variables are updated by the
> > kvm_hyp_init_symbols() function at EL1. To ensure EL2 visibility, the
>
> visibility *with the MMU off*

ok

>
> > data cache needs to be flushed after these updates. However,
> > individually flushing each variable using kvm_flush_dcache_to_poc() is
> > inefficient.
> >
> > These cpu feature variables would be part of the bss section of
> > the hypervisor. Hence, flush the entire bss section of hypervisor
> > once the initialization is complete.
> >
> > Motivation for this change:
> > * Since the existing variables are not flushed from EL1, the
> >   id_aa64pfr0_el1_sys_val is seen as 0 from EL2.
> > * based on this value check_override macro in hypervisor skips
> >   updating the sve (cpacr_el1) at finalise_el2_state.
> > * The default value for cpacr_el1 enables the sve traps to EL2.
> > * With sve enabled, during the context switch from EL0 -> EL1 (which is
> >   much later in the boot process), the sve registers are saved/restored.
> > * Since sve traps are enabled, accessing sve registers at EL1 caused a
> >   trap to EL2.
> > * However, hypervisor is not ready to handle sve traps at this stage
> >   causing the below kernel crash during the boot:
>
> Drop this section, it doesn't bring much to the discussion.
>
> >
> > [    0.320850][    T1] Run /init as init process
> > [    0.321392][    T1] kvm [1]: nVHE hyp BUG at: [<ffffffc08112ee8c>] __kvm_nvhe_$x.24+0x254/0x254!
> > [    0.321522][    T1] kvm [1]: Cannot dump pKVM nVHE stacktrace: !CONFIG_PROTECTED_NVHE_STACKTRACE
> > [    0.321635][    T1] kvm [1]: Hyp Offset: 0xffffff6e60000000
> > [    0.321710][    T1] Kernel panic - not syncing: HYP panic:
> > [    0.321710][    T1] PS:634023c9 PC:000000522112ee8c ESR:00000000f2000800
> > [    0.321710][    T1] FAR:0000000000000000 CPACR:0000000000310000 PAR:0000000000000800
> > [    0.321710][    T1] VCPU:0000000000000000
> > [...]
> > [    0.322251][    T1] Call trace:
> > [    0.322292][    T1]  dump_backtrace+0x100/0x180
> > [    0.322355][    T1]  show_stack+0x20/0x30
> > [    0.322410][    T1]  dump_stack_lvl+0x40/0x88
> > [    0.322471][    T1]  dump_stack+0x18/0x24
> > [    0.322523][    T1]  panic+0x13c/0x364
> > [    0.322578][    T1]  nvhe_hyp_panic_handler+0x148/0x1cc
> > [    0.322646][    T1]  do_sve_acc+0xec/0x260
> > [    0.322706][    T1]  el0_sve_acc+0x34/0x68
>
> This is essentially content-free, given that there is no
> backtrace. Please drop this.

Sure.. will drop the above 2 sections in v2.

>
> >
> > Fixes: 6c30bfb18d0b ("KVM: arm64: Add handlers for protected VM System Registers")
> > Suggested-by: Fuad Tabba <tabba@google.com>
> > Signed-off-by: Lokesh Vutla <lokeshvutla@google.com>
> > ---
> >  arch/arm64/kvm/arm.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index a102c3aebdbc..5d3b2069a2d5 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -2661,6 +2661,12 @@ static int __init init_hyp_mode(void)
> >               }
> >       }
> >
> > +     /*
> > +      * Flush entire BSS since part of its data is read while the MMU is off.
> > +      */
> > +     kvm_flush_dcache_to_poc(kvm_ksym_ref(__hyp_bss_start),
> > +                             kvm_ksym_ref(__hyp_bss_end) - kvm_ksym_ref(__hyp_bss_start));
> > +
> >       return 0;
> >
> >  out_err:
>
> I don't understand how this fixes anything. At this stage, the
> hypervisor has already been initialised, and I expect it will have
> evaluated the wrong values.
>
> Even worse, I strongly suspect that by the time you perform this, S2
> is enabled on the host, and that the BSS is off-limit. Which means it
> could fault and send you to lalaland.
>
> Have you actually tested this with upstream?
>
> I would have expected the clean operations to be called from
> kvm_hyp_init_symbols(), which runs before EL2 gets initialised in
> protected mode.

Sorry, my bad. We initially flushed each variable individually and
tested that solution; with it, the hypervisor was able to read the data
properly with the MMU off. We then kept optimizing by flushing the
entire BSS and moving the flush around, and I guess the last run got
lucky with this wrong placement.
I will add the flush in kvm_hyp_init_symbols() and post v2.

Thanks and Regards,
Lokesh



>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.
Patch

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a102c3aebdbc..5d3b2069a2d5 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -2661,6 +2661,12 @@  static int __init init_hyp_mode(void)
 		}
 	}
 
+	/*
+	 * Flush entire BSS since part of its data is read while the MMU is off.
+	 */
+	kvm_flush_dcache_to_poc(kvm_ksym_ref(__hyp_bss_start),
+				kvm_ksym_ref(__hyp_bss_end) - kvm_ksym_ref(__hyp_bss_start));
+
 	return 0;
 
 out_err: