diff mbox series

[v2,1/1] xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32

Message ID 20211028012543.8776-1-dongli.zhang@oracle.com (mailing list archive)
State New, archived
Headers show
Series [v2,1/1] xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32 | expand

Commit Message

Dongli Zhang Oct. 28, 2021, 1:25 a.m. UTC
The sched_clock() can be used very early since
commit 857baa87b642 ("sched/clock: Enable sched clock early"). In addition,
with commit 38669ba205d1 ("x86/xen/time: Output xen sched_clock time from
0"), kdump kernel in Xen HVM guest may panic at very early stage when
accessing &__this_cpu_read(xen_vcpu)->time as in below:

setup_arch()
 -> init_hypervisor_platform()
     -> x86_init.hyper.init_platform = xen_hvm_guest_init()
         -> xen_hvm_init_time_ops()
             -> xen_clocksource_read()
                 -> src = &__this_cpu_read(xen_vcpu)->time;

This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.

However, when Xen HVM guest panic on vcpu >= 32, since
xen_vcpu_info_reset(0) would set per_cpu(xen_vcpu, cpu) = NULL when
vcpu >= 32, xen_clocksource_read() on vcpu >= 32 would panic.

This patch delays xen_hvm_init_time_ops() to later in
xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
registered when the boot vcpu is >= 32.

Another option is to always delay xen_hvm_init_time_ops() for any vcpus
(including vcpu=0). Since to delay xen_hvm_init_time_ops() may lead to
clock backward issue, it is preferred to avoid that for regular boot (The
pv_sched_clock=native_sched_clock() is used at the very beginning until
xen_sched_clock() is registered). That requires to adjust
xen_sched_clock_offset. That's why we only delay xen_hvm_init_time_ops()
for vcpu>=32.

This issue can be reproduced on purpose via below command at the guest
side when kdump/kexec is enabled:

"taskset -c 33 echo c > /proc/sysrq-trigger"

Reference:
https://lists.xenproject.org/archives/html/xen-devel/2021-10/msg00571.html
Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
Changed since v1:
  - Add commit message to explain why xen_hvm_init_time_ops() is delayed
    for any vcpus. (Suggested by Boris Ostrovsky)
  - Add a comment in xen_hvm_smp_prepare_boot_cpu() referencing the related
    code in xen_hvm_guest_init(). (suggested by Juergen Gross)

 arch/x86/xen/enlighten_hvm.c | 20 +++++++++++++++++++-
 arch/x86/xen/smp_hvm.c       |  8 ++++++++
 2 files changed, 27 insertions(+), 1 deletion(-)

Comments

Boris Ostrovsky Nov. 1, 2021, 5:34 p.m. UTC | #1
On 10/27/21 9:25 PM, Dongli Zhang wrote:
> The sched_clock() can be used very early since
> commit 857baa87b642 ("sched/clock: Enable sched clock early"). In addition,
> with commit 38669ba205d1 ("x86/xen/time: Output xen sched_clock time from
> 0"), kdump kernel in Xen HVM guest may panic at very early stage when
> accessing &__this_cpu_read(xen_vcpu)->time as in below:
>
> setup_arch()
>   -> init_hypervisor_platform()
>       -> x86_init.hyper.init_platform = xen_hvm_guest_init()
>           -> xen_hvm_init_time_ops()
>               -> xen_clocksource_read()
>                   -> src = &__this_cpu_read(xen_vcpu)->time;
>
> This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
> embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
> used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.
>
> However, when Xen HVM guest panic on vcpu >= 32, since
> xen_vcpu_info_reset(0) would set per_cpu(xen_vcpu, cpu) = NULL when
> vcpu >= 32, xen_clocksource_read() on vcpu >= 32 would panic.
>
> This patch delays xen_hvm_init_time_ops() to later in
> xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
> registered when the boot vcpu is >= 32.
>
> Another option is to always delay xen_hvm_init_time_ops() for any vcpus
> (including vcpu=0). Since to delay xen_hvm_init_time_ops() may lead to
> clock backward issue,


This is referring to https://lists.xenproject.org/archives/html/xen-devel/2021-10/msg01516.html I assume?


>   it is preferred to avoid that for regular boot (The
> pv_sched_clock=native_sched_clock() is used at the very beginning until
> xen_sched_clock() is registered). That requires to adjust
> xen_sched_clock_offset. That's why we only delay xen_hvm_init_time_ops()
> for vcpu>=32.


We delay only on VCPU>=32 because we want to avoid the clock going backwards due to hypervisor problem pointed to be the link above, not because we need to adjust xen_sched_clock_offset (which we could if we wanted).


>
> This issue can be reproduced on purpose via below command at the guest
> side when kdump/kexec is enabled:
>
> "taskset -c 33 echo c > /proc/sysrq-trigger"
>
> Reference:
> https://lists.xenproject.org/archives/html/xen-devel/2021-10/msg00571.html
> Cc: Joe Jin <joe.jin@oracle.com>
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
> Changed since v1:
>    - Add commit message to explain why xen_hvm_init_time_ops() is delayed
>      for any vcpus. (Suggested by Boris Ostrovsky)
>    - Add a comment in xen_hvm_smp_prepare_boot_cpu() referencing the related
>      code in xen_hvm_guest_init(). (suggested by Juergen Gross)
>
>   arch/x86/xen/enlighten_hvm.c | 20 +++++++++++++++++++-
>   arch/x86/xen/smp_hvm.c       |  8 ++++++++
>   2 files changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
> index e68ea5f4ad1c..7734dec52794 100644
> --- a/arch/x86/xen/enlighten_hvm.c
> +++ b/arch/x86/xen/enlighten_hvm.c
> @@ -216,7 +216,25 @@ static void __init xen_hvm_guest_init(void)
>   	WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm));
>   	xen_unplug_emulated_devices();
>   	x86_init.irqs.intr_init = xen_init_IRQ;
> -	xen_hvm_init_time_ops();
> +
> +	/*
> +	 * Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'
> +	 * and the VM would use them until xen_vcpu_setup() is used to
> +	 * allocate/relocate them at arbitrary address.
> +	 *
> +	 * However, when Xen HVM guest panic on vcpu >= MAX_VIRT_CPUS,
> +	 * per_cpu(xen_vcpu, cpu) is still NULL at this stage. To access
> +	 * per_cpu(xen_vcpu, cpu) via xen_clocksource_read() would panic.
> +	 *
> +	 * Therefore we delay xen_hvm_init_time_ops() to
> +	 * xen_hvm_smp_prepare_boot_cpu() when boot vcpu is >= MAX_VIRT_CPUS.
> +	 */
> +	if (xen_vcpu_nr(0) >= MAX_VIRT_CPUS)
> +		pr_info("Delay xen_hvm_init_time_ops() as kernel is running on vcpu=%d\n",
> +			xen_vcpu_nr(0));
> +	else
> +		xen_hvm_init_time_ops();
> +
>   	xen_hvm_init_mmu_ops();
>   
>   #ifdef CONFIG_KEXEC_CORE
> diff --git a/arch/x86/xen/smp_hvm.c b/arch/x86/xen/smp_hvm.c
> index 6ff3c887e0b9..f99043df8bb5 100644
> --- a/arch/x86/xen/smp_hvm.c
> +++ b/arch/x86/xen/smp_hvm.c
> @@ -19,6 +19,14 @@ static void __init xen_hvm_smp_prepare_boot_cpu(void)
>   	 */
>   	xen_vcpu_setup(0);
>   
> +	/*
> +	 * The xen_hvm_init_time_ops() is delayed from
> +	 * xen_hvm_guest_init() to here to avoid panic when the kernel
> +	 * boots from vcpu>=MAX_VIRT_CPUS (32).
> +	 */


How about

   /* Deferred call to xen_hvm_init_time_ops(). See comment in xen_hvm_guest_init() */


-boris



> +	if (xen_vcpu_nr(0) >= MAX_VIRT_CPUS)
> +		xen_hvm_init_time_ops();
> +
>   	/*
>   	 * The alternative logic (which patches the unlock/lock) runs before
>   	 * the smp bootup up code is activated. Hence we need to set this up
Dongli Zhang Nov. 4, 2021, 5:59 p.m. UTC | #2
Hi Boris,

On 11/1/21 10:34 AM, Boris Ostrovsky wrote:
> 
> On 10/27/21 9:25 PM, Dongli Zhang wrote:
>> The sched_clock() can be used very early since
>> commit 857baa87b642 ("sched/clock: Enable sched clock early"). In addition,
>> with commit 38669ba205d1 ("x86/xen/time: Output xen sched_clock time from
>> 0"), kdump kernel in Xen HVM guest may panic at very early stage when
>> accessing &__this_cpu_read(xen_vcpu)->time as in below:
>>
>> setup_arch()
>>   -> init_hypervisor_platform()
>>       -> x86_init.hyper.init_platform = xen_hvm_guest_init()
>>           -> xen_hvm_init_time_ops()
>>               -> xen_clocksource_read()
>>                   -> src = &__this_cpu_read(xen_vcpu)->time;
>>
>> This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
>> embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
>> used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.
>>
>> However, when Xen HVM guest panic on vcpu >= 32, since
>> xen_vcpu_info_reset(0) would set per_cpu(xen_vcpu, cpu) = NULL when
>> vcpu >= 32, xen_clocksource_read() on vcpu >= 32 would panic.
>>
>> This patch delays xen_hvm_init_time_ops() to later in
>> xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
>> registered when the boot vcpu is >= 32.
>>
>> Another option is to always delay xen_hvm_init_time_ops() for any vcpus
>> (including vcpu=0). Since to delay xen_hvm_init_time_ops() may lead to
>> clock backward issue,
> 
> 
> This is referring to
> https://lists.xenproject.org/archives/html/xen-devel/2021-10/msg01516.html I
> assume?

No.

So far there are clock forward (e.g., from 0 to 65345) issue and clock backward
issue (e.g., from 2.432 to 0).

The clock forward issue can be resolved by above link to enforce clock update
during vcpu info registration. However, so far I am only able to reproduce clock
forward when "taskset -c 33 echo c > /proc/sysrq-trigger".

That is, I am not able to see any clock forward drift during regular boot (on
CPU=0), without the fix at hypervisor side.

The clock backward issue is because the native clock source is used if we delay
the initialization of xen clock source. We will see a backward when the source
is switched from native to xen.

> 
> 
>>   it is preferred to avoid that for regular boot (The
>> pv_sched_clock=native_sched_clock() is used at the very beginning until
>> xen_sched_clock() is registered). That requires to adjust
>> xen_sched_clock_offset. That's why we only delay xen_hvm_init_time_ops()
>> for vcpu>=32.
> 
> 
> We delay only on VCPU>=32 because we want to avoid the clock going backwards due
> to hypervisor problem pointed to be the link above, not because we need to
> adjust xen_sched_clock_offset (which we could if we wanted).

I will add that.

Just to clarify that so far I am not able to reproduce the clock backward issue
during regular boot (on CPU=0), when the fix is not available at hypervisor side.

> 
> 
>>
>> This issue can be reproduced on purpose via below command at the guest
>> side when kdump/kexec is enabled:
>>
>> "taskset -c 33 echo c > /proc/sysrq-trigger"
>>
>> Reference:
>> https://lists.xenproject.org/archives/html/xen-devel/2021-10/msg00571.html
>> Cc: Joe Jin <joe.jin@oracle.com>
>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>> ---
>> Changed since v1:
>>    - Add commit message to explain why xen_hvm_init_time_ops() is delayed
>>      for any vcpus. (Suggested by Boris Ostrovsky)
>>    - Add a comment in xen_hvm_smp_prepare_boot_cpu() referencing the related
>>      code in xen_hvm_guest_init(). (suggested by Juergen Gross)
>>
>>   arch/x86/xen/enlighten_hvm.c | 20 +++++++++++++++++++-
>>   arch/x86/xen/smp_hvm.c       |  8 ++++++++
>>   2 files changed, 27 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
>> index e68ea5f4ad1c..7734dec52794 100644
>> --- a/arch/x86/xen/enlighten_hvm.c
>> +++ b/arch/x86/xen/enlighten_hvm.c
>> @@ -216,7 +216,25 @@ static void __init xen_hvm_guest_init(void)
>>       WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm));
>>       xen_unplug_emulated_devices();
>>       x86_init.irqs.intr_init = xen_init_IRQ;
>> -    xen_hvm_init_time_ops();
>> +
>> +    /*
>> +     * Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'
>> +     * and the VM would use them until xen_vcpu_setup() is used to
>> +     * allocate/relocate them at arbitrary address.
>> +     *
>> +     * However, when Xen HVM guest panic on vcpu >= MAX_VIRT_CPUS,
>> +     * per_cpu(xen_vcpu, cpu) is still NULL at this stage. To access
>> +     * per_cpu(xen_vcpu, cpu) via xen_clocksource_read() would panic.
>> +     *
>> +     * Therefore we delay xen_hvm_init_time_ops() to
>> +     * xen_hvm_smp_prepare_boot_cpu() when boot vcpu is >= MAX_VIRT_CPUS.
>> +     */
>> +    if (xen_vcpu_nr(0) >= MAX_VIRT_CPUS)
>> +        pr_info("Delay xen_hvm_init_time_ops() as kernel is running on
>> vcpu=%d\n",
>> +            xen_vcpu_nr(0));
>> +    else
>> +        xen_hvm_init_time_ops();
>> +
>>       xen_hvm_init_mmu_ops();
>>     #ifdef CONFIG_KEXEC_CORE
>> diff --git a/arch/x86/xen/smp_hvm.c b/arch/x86/xen/smp_hvm.c
>> index 6ff3c887e0b9..f99043df8bb5 100644
>> --- a/arch/x86/xen/smp_hvm.c
>> +++ b/arch/x86/xen/smp_hvm.c
>> @@ -19,6 +19,14 @@ static void __init xen_hvm_smp_prepare_boot_cpu(void)
>>        */
>>       xen_vcpu_setup(0);
>>   +    /*
>> +     * The xen_hvm_init_time_ops() is delayed from
>> +     * xen_hvm_guest_init() to here to avoid panic when the kernel
>> +     * boots from vcpu>=MAX_VIRT_CPUS (32).
>> +     */
> 
> 
> How about
> 
>   /* Deferred call to xen_hvm_init_time_ops(). See comment in
> xen_hvm_guest_init() */
> 

I will do that.

Thank you very much!

Dongli Zhang

> 
> -boris
> 
> 
> 
>> +    if (xen_vcpu_nr(0) >= MAX_VIRT_CPUS)
>> +        xen_hvm_init_time_ops();
>> +
>>       /*
>>        * The alternative logic (which patches the unlock/lock) runs before
>>        * the smp bootup up code is activated. Hence we need to set this up
Boris Ostrovsky Nov. 5, 2021, 8:41 p.m. UTC | #3
On 11/4/21 1:59 PM, Dongli Zhang wrote:
> Hi Boris,
>
> On 11/1/21 10:34 AM, Boris Ostrovsky wrote:
>> On 10/27/21 9:25 PM, Dongli Zhang wrote:
>>> The sched_clock() can be used very early since
>>> commit 857baa87b642 ("sched/clock: Enable sched clock early"). In addition,
>>> with commit 38669ba205d1 ("x86/xen/time: Output xen sched_clock time from
>>> 0"), kdump kernel in Xen HVM guest may panic at very early stage when
>>> accessing &__this_cpu_read(xen_vcpu)->time as in below:
>>>
>>> setup_arch()
>>>    -> init_hypervisor_platform()
>>>        -> x86_init.hyper.init_platform = xen_hvm_guest_init()
>>>            -> xen_hvm_init_time_ops()
>>>                -> xen_clocksource_read()
>>>                    -> src = &__this_cpu_read(xen_vcpu)->time;
>>>
>>> This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
>>> embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
>>> used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.
>>>
>>> However, when Xen HVM guest panic on vcpu >= 32, since
>>> xen_vcpu_info_reset(0) would set per_cpu(xen_vcpu, cpu) = NULL when
>>> vcpu >= 32, xen_clocksource_read() on vcpu >= 32 would panic.
>>>
>>> This patch delays xen_hvm_init_time_ops() to later in
>>> xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
>>> registered when the boot vcpu is >= 32.
>>>
>>> Another option is to always delay xen_hvm_init_time_ops() for any vcpus
>>> (including vcpu=0). Since to delay xen_hvm_init_time_ops() may lead to
>>> clock backward issue,
>>
>> This is referring to
>> https://lists.xenproject.org/archives/html/xen-devel/2021-10/msg01516.html I
>> assume?
> No.
>
> So far there are clock forward (e.g., from 0 to 65345) issue and clock backward
> issue (e.g., from 2.432 to 0).
>
> The clock forward issue can be resolved by above link to enforce clock update
> during vcpu info registration. However, so far I am only able to reproduce clock
> forward when "taskset -c 33 echo c > /proc/sysrq-trigger".
>
> That is, I am not able to see any clock forward drift during regular boot (on
> CPU=0), without the fix at hypervisor side.
>
> The clock backward issue is because the native clock source is used if we delay
> the initialization of xen clock source. We will see a backward when the source
> is switched from native to xen.
>
>>
>>>    it is preferred to avoid that for regular boot (The
>>> pv_sched_clock=native_sched_clock() is used at the very beginning until
>>> xen_sched_clock() is registered). That requires to adjust
>>> xen_sched_clock_offset. That's why we only delay xen_hvm_init_time_ops()
>>> for vcpu>=32.
>>
>> We delay only on VCPU>=32 because we want to avoid the clock going backwards due
>> to hypervisor problem pointed to be the link above, not because we need to
>> adjust xen_sched_clock_offset (which we could if we wanted).
> I will add that.
>
> Just to clarify that so far I am not able to reproduce the clock backward issue
> during regular boot (on CPU=0), when the fix is not available at hypervisor side.


FTR, Dongli and I had a chat and he is going to provide another version of the patch where time initialization is deferred for all vcpus.



-boris
diff mbox series

Patch

diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index e68ea5f4ad1c..7734dec52794 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -216,7 +216,25 @@  static void __init xen_hvm_guest_init(void)
 	WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm));
 	xen_unplug_emulated_devices();
 	x86_init.irqs.intr_init = xen_init_IRQ;
-	xen_hvm_init_time_ops();
+
+	/*
+	 * Only MAX_VIRT_CPUS 'vcpu_info' are embedded inside 'shared_info'
+	 * and the VM would use them until xen_vcpu_setup() is used to
+	 * allocate/relocate them at arbitrary address.
+	 *
+	 * However, when Xen HVM guest panic on vcpu >= MAX_VIRT_CPUS,
+	 * per_cpu(xen_vcpu, cpu) is still NULL at this stage. To access
+	 * per_cpu(xen_vcpu, cpu) via xen_clocksource_read() would panic.
+	 *
+	 * Therefore we delay xen_hvm_init_time_ops() to
+	 * xen_hvm_smp_prepare_boot_cpu() when boot vcpu is >= MAX_VIRT_CPUS.
+	 */
+	if (xen_vcpu_nr(0) >= MAX_VIRT_CPUS)
+		pr_info("Delay xen_hvm_init_time_ops() as kernel is running on vcpu=%d\n",
+			xen_vcpu_nr(0));
+	else
+		xen_hvm_init_time_ops();
+
 	xen_hvm_init_mmu_ops();
 
 #ifdef CONFIG_KEXEC_CORE
diff --git a/arch/x86/xen/smp_hvm.c b/arch/x86/xen/smp_hvm.c
index 6ff3c887e0b9..f99043df8bb5 100644
--- a/arch/x86/xen/smp_hvm.c
+++ b/arch/x86/xen/smp_hvm.c
@@ -19,6 +19,14 @@  static void __init xen_hvm_smp_prepare_boot_cpu(void)
 	 */
 	xen_vcpu_setup(0);
 
+	/*
+	 * The xen_hvm_init_time_ops() is delayed from
+	 * xen_hvm_guest_init() to here to avoid panic when the kernel
+	 * boots from vcpu>=MAX_VIRT_CPUS (32).
+	 */
+	if (xen_vcpu_nr(0) >= MAX_VIRT_CPUS)
+		xen_hvm_init_time_ops();
+
 	/*
 	 * The alternative logic (which patches the unlock/lock) runs before
 	 * the smp bootup up code is activated. Hence we need to set this up