diff mbox series

[v4,2/2] xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32

Message ID 20220302164032.14569-3-dongli.zhang@oracle.com (mailing list archive)
State Accepted
Commit c147a3a0efbdfac7f4b1fb029772ec5e9e1ba39b
Headers show
Series xen: fix HVM kexec kernel panic | expand

Commit Message

Dongli Zhang March 2, 2022, 4:40 p.m. UTC
The sched_clock() can be used very early since commit 857baa87b642
("sched/clock: Enable sched clock early"). In addition, with commit
38669ba205d1 ("x86/xen/time: Output xen sched_clock time from 0"), kdump
kernel in Xen HVM guest may panic at very early stage when accessing
&__this_cpu_read(xen_vcpu)->time as in below:

setup_arch()
 -> init_hypervisor_platform()
     -> x86_init.hyper.init_platform = xen_hvm_guest_init()
         -> xen_hvm_init_time_ops()
             -> xen_clocksource_read()
                 -> src = &__this_cpu_read(xen_vcpu)->time;

This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.

However, when Xen HVM guest panic on vcpu >= 32, since
xen_vcpu_info_reset(0) would set per_cpu(xen_vcpu, cpu) = NULL when
vcpu >= 32, xen_clocksource_read() on vcpu >= 32 would panic.

This patch calls xen_hvm_init_time_ops() again later in
xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
registered when the boot vcpu is >= 32.

This issue can be reproduced on purpose via below command at the guest
side when kdump/kexec is enabled:

"taskset -c 33 echo c > /proc/sysrq-trigger"

The bugfix for PVM is not implemented due to the lack of testing
environment.

Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
Changed since v1:
  - Add commit message to explain why xen_hvm_init_time_ops() is delayed
    for any vcpus. (Suggested by Boris Ostrovsky)
  - Add a comment in xen_hvm_smp_prepare_boot_cpu() referencing the related
    code in xen_hvm_guest_init(). (suggested by Juergen Gross)
Changed since v2:
  - Delay for all VCPUs. (Suggested by Boris Ostrovsky)
  - Add commit message that why PVM is not supported by this patch
  - Test if kexec/kdump works with mainline xen (HVM and PVM)
Changed since v3:
  - Re-use v2 but move the login into xen_hvm_init_time_ops() (Suggested
    by Boris Ostrovsky)

 arch/x86/xen/smp_hvm.c |  6 ++++++
 arch/x86/xen/time.c    | 25 ++++++++++++++++++++++++-
 2 files changed, 30 insertions(+), 1 deletion(-)

Comments

Boris Ostrovsky March 3, 2022, 12:20 a.m. UTC | #1
On 3/2/22 11:40 AM, Dongli Zhang wrote:
>   void __init xen_hvm_init_time_ops(void)
>   {
> +	static bool hvm_time_initialized;
> +
> +	if (hvm_time_initialized)
> +		return;
> +
>   	/*
>   	 * vector callback is needed otherwise we cannot receive interrupts
>   	 * on cpu > 0 and at this point we don't know how many cpus are
>   	 * available.
>   	 */
>   	if (!xen_have_vector_callback)
> -		return;
> +		goto exit;


Why not just return? Do we expect the value of xen_have_vector_callback to change?


-boris


>   
>   	if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
>   		pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
> +		goto exit;
> +	}
> +
> +	/*
> +	 * Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'.
> +	 * The __this_cpu_read(xen_vcpu) is still NULL when Xen HVM guest
> +	 * boots on vcpu >= MAX_VIRT_CPUS (e.g., kexec), To access
> +	 * __this_cpu_read(xen_vcpu) via xen_clocksource_read() will panic.
> +	 *
> +	 * The xen_hvm_init_time_ops() should be called again later after
> +	 * __this_cpu_read(xen_vcpu) is available.
> +	 */
> +	if (!__this_cpu_read(xen_vcpu)) {
> +		pr_info("Delay xen_init_time_common() as kernel is running on vcpu=%d\n",
> +			xen_vcpu_nr(0));
>   		return;
>   	}
>   
> @@ -577,6 +597,9 @@ void __init xen_hvm_init_time_ops(void)
>   	x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
>   
>   	x86_platform.set_wallclock = xen_set_wallclock;
> +
> +exit:
> +	hvm_time_initialized = true;
>   }
>   #endif
>
Dongli Zhang March 3, 2022, 12:31 a.m. UTC | #2
Hi Boris,

On 3/2/22 4:20 PM, Boris Ostrovsky wrote:
> 
> On 3/2/22 11:40 AM, Dongli Zhang wrote:
>>   void __init xen_hvm_init_time_ops(void)
>>   {
>> +    static bool hvm_time_initialized;
>> +
>> +    if (hvm_time_initialized)
>> +        return;
>> +
>>       /*
>>        * vector callback is needed otherwise we cannot receive interrupts
>>        * on cpu > 0 and at this point we don't know how many cpus are
>>        * available.
>>        */
>>       if (!xen_have_vector_callback)
>> -        return;
>> +        goto exit;
> 
> 
> Why not just return? Do we expect the value of xen_have_vector_callback to change?

I just want to keep above sync with ....

> 
> 
> -boris
> 
> 
>>         if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
>>           pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
>> +        goto exit;
>> +    }

... here.

That is, I want the main logic of xen_hvm_init_time_ops() to run for at most
once. Both of above two if statements will "go to exit".

Thank you very much!

Dongli Zhang

>> +
>> +    /*
>> +     * Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'.
>> +     * The __this_cpu_read(xen_vcpu) is still NULL when Xen HVM guest
>> +     * boots on vcpu >= MAX_VIRT_CPUS (e.g., kexec), To access
>> +     * __this_cpu_read(xen_vcpu) via xen_clocksource_read() will panic.
>> +     *
>> +     * The xen_hvm_init_time_ops() should be called again later after
>> +     * __this_cpu_read(xen_vcpu) is available.
>> +     */
>> +    if (!__this_cpu_read(xen_vcpu)) {
>> +        pr_info("Delay xen_init_time_common() as kernel is running on
>> vcpu=%d\n",
>> +            xen_vcpu_nr(0));
>>           return;
>>       }
>>   @@ -577,6 +597,9 @@ void __init xen_hvm_init_time_ops(void)
>>       x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
>>         x86_platform.set_wallclock = xen_set_wallclock;
>> +
>> +exit:
>> +    hvm_time_initialized = true;
>>   }
>>   #endif
>>
Boris Ostrovsky March 3, 2022, 2:11 a.m. UTC | #3
On 3/2/22 7:31 PM, Dongli Zhang wrote:
> Hi Boris,
>
> On 3/2/22 4:20 PM, Boris Ostrovsky wrote:
>> On 3/2/22 11:40 AM, Dongli Zhang wrote:
>>>    void __init xen_hvm_init_time_ops(void)
>>>    {
>>> +    static bool hvm_time_initialized;
>>> +
>>> +    if (hvm_time_initialized)
>>> +        return;
>>> +
>>>        /*
>>>         * vector callback is needed otherwise we cannot receive interrupts
>>>         * on cpu > 0 and at this point we don't know how many cpus are
>>>         * available.
>>>         */
>>>        if (!xen_have_vector_callback)
>>> -        return;
>>> +        goto exit;
>>
>> Why not just return? Do we expect the value of xen_have_vector_callback to change?
> I just want to keep above sync with ....
>
>>
>> -boris
>>
>>
>>>          if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
>>>            pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
>>> +        goto exit;
>>> +    }
> ... here.
>
> That is, I want the main logic of xen_hvm_init_time_ops() to run for at most
> once. Both of above two if statements will "go to exit".


I didn't notice this actually.


I think both of them should return early, there is no reason to set hvm_time_initialized to true when, in fact, we have not initialized anything. And to avoid printing the warning twice we can just replace it with pr_info_once().


I can fix it up when committing so no need to resend. So unless you disagree


Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>


> Thank you very much!
>
> Dongli Zhang
>
>>> +
>>> +    /*
>>> +     * Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'.
>>> +     * The __this_cpu_read(xen_vcpu) is still NULL when Xen HVM guest
>>> +     * boots on vcpu >= MAX_VIRT_CPUS (e.g., kexec), To access
>>> +     * __this_cpu_read(xen_vcpu) via xen_clocksource_read() will panic.
>>> +     *
>>> +     * The xen_hvm_init_time_ops() should be called again later after
>>> +     * __this_cpu_read(xen_vcpu) is available.
>>> +     */
>>> +    if (!__this_cpu_read(xen_vcpu)) {
>>> +        pr_info("Delay xen_init_time_common() as kernel is running on
>>> vcpu=%d\n",
>>> +            xen_vcpu_nr(0));
>>>            return;
>>>        }
>>>    @@ -577,6 +597,9 @@ void __init xen_hvm_init_time_ops(void)
>>>        x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
>>>          x86_platform.set_wallclock = xen_set_wallclock;
>>> +
>>> +exit:
>>> +    hvm_time_initialized = true;
>>>    }
>>>    #endif
>>>
Dongli Zhang March 3, 2022, 3:08 a.m. UTC | #4
Hi Boris,

On 3/2/22 6:11 PM, Boris Ostrovsky wrote:
> 
> On 3/2/22 7:31 PM, Dongli Zhang wrote:
>> Hi Boris,
>>
>> On 3/2/22 4:20 PM, Boris Ostrovsky wrote:
>>> On 3/2/22 11:40 AM, Dongli Zhang wrote:
>>>>    void __init xen_hvm_init_time_ops(void)
>>>>    {
>>>> +    static bool hvm_time_initialized;
>>>> +
>>>> +    if (hvm_time_initialized)
>>>> +        return;
>>>> +
>>>>        /*
>>>>         * vector callback is needed otherwise we cannot receive interrupts
>>>>         * on cpu > 0 and at this point we don't know how many cpus are
>>>>         * available.
>>>>         */
>>>>        if (!xen_have_vector_callback)
>>>> -        return;
>>>> +        goto exit;
>>>
>>> Why not just return? Do we expect the value of xen_have_vector_callback to
>>> change?
>> I just want to keep above sync with ....
>>
>>>
>>> -boris
>>>
>>>
>>>>          if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
>>>>            pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
>>>> +        goto exit;
>>>> +    }
>> ... here.
>>
>> That is, I want the main logic of xen_hvm_init_time_ops() to run for at most
>> once. Both of above two if statements will "go to exit".
> 
> 
> I didn't notice this actually.
> 
> 
> I think both of them should return early, there is no reason to set
> hvm_time_initialized to true when, in fact, we have not initialized anything.
> And to avoid printing the warning twice we can just replace it with pr_info_once().
> 
> 
> I can fix it up when committing so no need to resend. So unless you disagree

Thank you very much for fixing it during committing.

Dongli Zhang
Boris Ostrovsky March 11, 2022, 2:19 p.m. UTC | #5
On 3/2/22 11:40 AM, Dongli Zhang wrote:
> The sched_clock() can be used very early since commit 857baa87b642
> ("sched/clock: Enable sched clock early"). In addition, with commit
> 38669ba205d1 ("x86/xen/time: Output xen sched_clock time from 0"), kdump
> kernel in Xen HVM guest may panic at very early stage when accessing
> &__this_cpu_read(xen_vcpu)->time as in below:
>
> setup_arch()
>   -> init_hypervisor_platform()
>       -> x86_init.hyper.init_platform = xen_hvm_guest_init()
>           -> xen_hvm_init_time_ops()
>               -> xen_clocksource_read()
>                   -> src = &__this_cpu_read(xen_vcpu)->time;
>
> This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
> embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
> used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.
>
> However, when Xen HVM guest panic on vcpu >= 32, since
> xen_vcpu_info_reset(0) would set per_cpu(xen_vcpu, cpu) = NULL when
> vcpu >= 32, xen_clocksource_read() on vcpu >= 32 would panic.
>
> This patch calls xen_hvm_init_time_ops() again later in
> xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
> registered when the boot vcpu is >= 32.
>
> This issue can be reproduced on purpose via below command at the guest
> side when kdump/kexec is enabled:
>
> "taskset -c 33 echo c > /proc/sysrq-trigger"
>
> The bugfix for PVM is not implemented due to the lack of testing
> environment.
>
> Cc: Joe Jin <joe.jin@oracle.com>
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>


Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>


Applied to for-linus-5.18 (with return path adjusted)
diff mbox series

Patch

diff --git a/arch/x86/xen/smp_hvm.c b/arch/x86/xen/smp_hvm.c
index 6ff3c887e0b9..b70afdff419c 100644
--- a/arch/x86/xen/smp_hvm.c
+++ b/arch/x86/xen/smp_hvm.c
@@ -19,6 +19,12 @@  static void __init xen_hvm_smp_prepare_boot_cpu(void)
 	 */
 	xen_vcpu_setup(0);
 
+	/*
+	 * Called again in case the kernel boots on vcpu >= MAX_VIRT_CPUS.
+	 * Refer to comments in xen_hvm_init_time_ops().
+	 */
+	xen_hvm_init_time_ops();
+
 	/*
 	 * The alternative logic (which patches the unlock/lock) runs before
 	 * the smp bootup up code is activated. Hence we need to set this up
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 55b3407358a9..dcf292cc859e 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -558,16 +558,36 @@  static void xen_hvm_setup_cpu_clockevents(void)
 
 void __init xen_hvm_init_time_ops(void)
 {
+	static bool hvm_time_initialized;
+
+	if (hvm_time_initialized)
+		return;
+
 	/*
 	 * vector callback is needed otherwise we cannot receive interrupts
 	 * on cpu > 0 and at this point we don't know how many cpus are
 	 * available.
 	 */
 	if (!xen_have_vector_callback)
-		return;
+		goto exit;
 
 	if (!xen_feature(XENFEAT_hvm_safe_pvclock)) {
 		pr_info("Xen doesn't support pvclock on HVM, disable pv timer");
+		goto exit;
+	}
+
+	/*
+	 * Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'.
+	 * The __this_cpu_read(xen_vcpu) is still NULL when Xen HVM guest
+	 * boots on vcpu >= MAX_VIRT_CPUS (e.g., kexec), To access
+	 * __this_cpu_read(xen_vcpu) via xen_clocksource_read() will panic.
+	 *
+	 * The xen_hvm_init_time_ops() should be called again later after
+	 * __this_cpu_read(xen_vcpu) is available.
+	 */
+	if (!__this_cpu_read(xen_vcpu)) {
+		pr_info("Delay xen_init_time_common() as kernel is running on vcpu=%d\n",
+			xen_vcpu_nr(0));
 		return;
 	}
 
@@ -577,6 +597,9 @@  void __init xen_hvm_init_time_ops(void)
 	x86_cpuinit.setup_percpu_clockev = xen_hvm_setup_cpu_clockevents;
 
 	x86_platform.set_wallclock = xen_set_wallclock;
+
+exit:
+	hvm_time_initialized = true;
 }
 #endif