diff mbox series

[v2] ACPI / processor_idle: use ndelay instead of io port access for wait

Message ID 20191015080404.6013-1-fengwei.yin@intel.com (mailing list archive)
State Changes Requested, archived

Commit Message

Yin, Fengwei Oct. 15, 2019, 8:04 a.m. UTC
In acpi_idle_do_entry(), an I/O port access is used as a dummy wait to
guarantee hardware behavior. In a virtualized environment, however, it
can trigger unnecessary VM exits.

When we run Linux as a guest and export all available native C-states
to the guest, we see many VM exits triggered by PM timer accesses when
the guest enters deeper C-states (we used the ACRN hypervisor, which
emulates the PM timer and exports all native C-states to the guest,
instead of KVM or Xen).

According to the original comment in this part of the code, the I/O
port access serves only as a dummy wait. We can use a busy wait instead
to guarantee the same hardware behavior and avoid the unnecessary VM
exits.

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
ChangeLog:
v1 -> v2:
   - Use ndelay() instead of a busy loop for the dummy delay.

 drivers/acpi/processor_idle.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)
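
The approach in the commit message (time the port read once, then busy-wait for that long on every later idle entry) can be modeled in userspace roughly as follows. This is only a sketch under stated assumptions: now_ns(), slow_op(), calibrate_ns() and busy_wait_ns() are hypothetical names, slow_op() stands in for the inl() of the PM timer port, and the spin loop stands in for ndelay(). It is not the code from the patch.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic clock in nanoseconds (userspace stand-in for a kernel clock). */
static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Hypothetical stand-in for the slow port read that is being replaced. */
static void slow_op(void)
{
	for (volatile int i = 0; i < 100000; i++)
		; /* burn some time */
}

/* Time a single call of op(), as the patch does once on first run. */
static int64_t calibrate_ns(void (*op)(void))
{
	int64_t t0 = now_ns();

	op();
	return now_ns() - t0;
}

/* Spin for at least ns nanoseconds (userspace stand-in for ndelay()). */
static void busy_wait_ns(int64_t ns)
{
	int64_t start = now_ns();

	assert(ns >= 0);
	while (now_ns() - start < ns)
		; /* spin without touching any I/O port */
}
```

A caller would run calibrate_ns(slow_op) once and pass the result to busy_wait_ns() on each later entry. Note that a single calibration sample inherits whatever noise that one measurement had, which is one of the concerns raised in the review below.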

Comments

David Laight Oct. 15, 2019, 11:48 a.m. UTC | #1
From: Yin Fengwei
> Sent: 15 October 2019 09:04
> In function acpi_idle_do_entry(), an ioport access is used for dummy
> wait to guarantee hardware behavior. But it could trigger unnecessary
> vmexit in virtualization environment.
> 
> If we run linux as guest and export all available native C state to
> guest, we did see many PM timer access triggered VMexit when guest
> enter deeper C state in our environment (We used ACRN hypervisor
> instead of kvm or xen which has PM timer emulated and exports all
> native C state to guest).
> 
> According to the original comments of this part of code, io port
> access is only for dummy wait. We could use busy wait instead of io
> port access to guarantee hardware behavior and avoid unnecessary
> VMexit.

You need some hard synchronisation instruction(s) after the inb()
and before any kind of delay to ensure your delay code is executed
after the inb() completes.

I'm pretty sure that inb() is only synchronised with memory reads.

...
> +	/* profiling the time used for dummy wait op */
> +	ktime_get_real_ts64(&ts0);
> +	inl(acpi_gbl_FADT.xpm_timer_block.address);
> +	ktime_get_real_ts64(&ts1);

That could be dominated by the cost of ktime_get_real_ts64().
It also needs synchronising instructions.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Yin, Fengwei Oct. 16, 2019, 5:56 a.m. UTC | #2
Hi David,

On 10/15/2019 7:48 PM, David Laight wrote:
> From: Yin Fengwei
>> Sent: 15 October 2019 09:04
>> In function acpi_idle_do_entry(), an ioport access is used for dummy
>> wait to guarantee hardware behavior. But it could trigger unnecessary
>> vmexit in virtualization environment.
>>
>> If we run linux as guest and export all available native C state to
>> guest, we did see many PM timer access triggered VMexit when guest
>> enter deeper C state in our environment (We used ACRN hypervisor
>> instead of kvm or xen which has PM timer emulated and exports all
>> native C state to guest).
>>
>> According to the original comments of this part of code, io port
>> access is only for dummy wait. We could use busy wait instead of io
>> port access to guarantee hardware behavior and avoid unnecessary
>> VMexit.
> 
> You need some hard synchronisation instruction(s) after the inb()
> and before any kind of delay to ensure your delay code is executed
> after the inb() completes.
> 
> I'm pretty sure that inb() is only synchronised with memory reads.
Thanks a lot for the comments.

I couldn't find a common serializing-instruction API in the kernel (only
memory barriers, which order memory accesses). On Intel x86, cpuid could
be used as a serializing instruction, but it's not suitable for common
code here. Do you have any suggestions?

> 
> ...
>> +	/* profiling the time used for dummy wait op */
>> +	ktime_get_real_ts64(&ts0);
>> +	inl(acpi_gbl_FADT.xpm_timer_block.address);
>> +	ktime_get_real_ts64(&ts1);
> 
> That could be dominated by the cost of ktime_get_real_ts64().
> It also need synchronising instructions.
I did some testing. ktime_get_real_ts64() takes much less time than io
port access.

The test code is like:
1.
	local_irq_save(flag);
	ktime_get_real_ts64(&ts0);
	inl(acpi_gbl_FADT.xpm_timer_block.address);
	ktime_get_real_ts64(&ts1);
	local_irq_restore(flag);

2.
	local_irq_save(flag);
	ktime_get_real_ts64(&ts0);
	ktime_get_real_ts64(&ts1);
	local_irq_restore(flag);

The delta in case 1 is about 500000 ns, and the delta in case 2 is about
2000 ns. The data was collected on an Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz.
So I suppose the impact of ktime_get_real_ts64() is small.
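
The comparison above (timing the clock calls alone versus the clock calls around the operation) can be sketched in userspace as shown here. The names struct span, measure() and work() are hypothetical, and work() is only a placeholder for the port read. Recording both the minimum and the maximum over several runs gives some idea of how far a single sample can be off.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Smallest and largest interval seen over a series of runs. */
struct span {
	int64_t min_ns;
	int64_t max_ns;
};

/* Monotonic clock in nanoseconds. */
static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Placeholder for the operation under test (hypothetical). */
static void work(void)
{
	for (volatile int i = 0; i < 100000; i++)
		; /* burn some time */
}

/*
 * Time op() 'runs' times. Passing op == NULL measures only the two
 * back-to-back clock reads, i.e. the overhead of the timer itself.
 */
static struct span measure(void (*op)(void), int runs)
{
	struct span s = { INT64_MAX, 0 };

	assert(runs > 0);
	for (int i = 0; i < runs; i++) {
		int64_t t0 = now_ns();

		if (op)
			op();
		int64_t d = now_ns() - t0;

		if (d < s.min_ns)
			s.min_ns = d;
		if (d > s.max_ns)
			s.max_ns = d;
	}
	return s;
}
```

Comparing measure(NULL, n) with measure(work, n) separates the timer overhead from the cost of the operation. If the maximum of the baseline is close to the minimum of the operation, a single sample is not a reliable calibration.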

Regards
Yin, Fengwei

> 
> 	David
> 
>
Rafael J. Wysocki Oct. 18, 2019, 10:12 a.m. UTC | #3
On Wednesday, October 16, 2019 7:56:17 AM CEST Yin, Fengwei wrote:
> Hi David,
> 
> On 10/15/2019 7:48 PM, David Laight wrote:
> > From: Yin Fengwei
> >> Sent: 15 October 2019 09:04
> >> In function acpi_idle_do_entry(), an ioport access is used for dummy
> >> wait to guarantee hardware behavior. But it could trigger unnecessary
> >> vmexit in virtualization environment.
> >>
> >> If we run linux as guest and export all available native C state to
> >> guest, we did see many PM timer access triggered VMexit when guest
> >> enter deeper C state in our environment (We used ACRN hypervisor
> >> instead of kvm or xen which has PM timer emulated and exports all
> >> native C state to guest).
> >>
> >> According to the original comments of this part of code, io port
> >> access is only for dummy wait. We could use busy wait instead of io
> >> port access to guarantee hardware behavior and avoid unnecessary
> >> VMexit.
> > 
> > You need some hard synchronisation instruction(s) after the inb()
> > and before any kind of delay to ensure your delay code is executed
> > after the inb() completes.
> > 
> > I'm pretty sure that inb() is only synchronised with memory reads.
> Thanks a lot for the comments.
> 
> I didn't find the common serializing instructions API in kernel (only
> memory  barrier which is used to make sure of memory access). For Intel
> x86, cpuid could be used as serializing instruction. But it's not
> suitable for common code here. Do you have any suggestion?

In the virt guest case you don't need to worry at all AFAICS, because the inb()
itself will trap to the HV.

> > 
> > ...
> >> +	/* profiling the time used for dummy wait op */
> >> +	ktime_get_real_ts64(&ts0);
> >> +	inl(acpi_gbl_FADT.xpm_timer_block.address);
> >> +	ktime_get_real_ts64(&ts1);

You may as well use ktime_get() for this, as it's almost the same code as
ktime_get_real_ts64() AFAICS, only simpler.

Plus, static vars need not be initialized to 0.

> > 
> > That could be dominated by the cost of ktime_get_real_ts64().
> > It also need synchronising instructions.
> I did some testing. ktime_get_real_ts64() takes much less time than io
> port access.
> 
> The test code is like:
> 1.
> 	local_irq_save(flag);
> 	ktime_get_real_ts64(&ts0);
> 	inl(acpi_gbl_FADT.xpm_timer_block.address);
> 	ktime_get_real_ts64(&ts1);
> 	local_irq_restore(flag);
> 
> 2.
> 	local_irq_save(flag);
> 	ktime_get_real_ts64(&ts0);
> 	ktime_get_real_ts64(&ts1);
> 	local_irq_restore(flag);
> 
> The delta in 1 is about 500000ns. And delta in 2 is about
> 2000ns. The date is gotten on Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz.
> So I suppose the impact of ktime_get_real_ts64 is small.

You may not be hitting the worst case for ktime_get_real_ts64(), though.

I wonder if special casing the virt guest would be a better approach.

Then, you could leave the code as is for non-virt and I'm not sure if the
delay is needed in the virt guest case at all.

So maybe do something like "if not in a virt guest, do the dummy inl()"
and that would be it?
Yin, Fengwei Oct. 18, 2019, 10:39 a.m. UTC | #4
On 10/18/2019 6:12 PM, Rafael J. Wysocki wrote:
> On Wednesday, October 16, 2019 7:56:17 AM CEST Yin, Fengwei wrote:
>> Hi David,
>>
>> On 10/15/2019 7:48 PM, David Laight wrote:
>>> From: Yin Fengwei
>>>> Sent: 15 October 2019 09:04
>>>> In function acpi_idle_do_entry(), an ioport access is used for dummy
>>>> wait to guarantee hardware behavior. But it could trigger unnecessary
>>>> vmexit in virtualization environment.
>>>>
>>>> If we run linux as guest and export all available native C state to
>>>> guest, we did see many PM timer access triggered VMexit when guest
>>>> enter deeper C state in our environment (We used ACRN hypervisor
>>>> instead of kvm or xen which has PM timer emulated and exports all
>>>> native C state to guest).
>>>>
>>>> According to the original comments of this part of code, io port
>>>> access is only for dummy wait. We could use busy wait instead of io
>>>> port access to guarantee hardware behavior and avoid unnecessary
>>>> VMexit.
>>>
>>> You need some hard synchronisation instruction(s) after the inb()
>>> and before any kind of delay to ensure your delay code is executed
>>> after the inb() completes.
>>>
>>> I'm pretty sure that inb() is only synchronised with memory reads.
>> Thanks a lot for the comments.
>>
>> I didn't find the common serializing instructions API in kernel (only
>> memory  barrier which is used to make sure of memory access). For Intel
>> x86, cpuid could be used as serializing instruction. But it's not
>> suitable for common code here. Do you have any suggestion?
> 
> In the virt guest case you don't need to worry at all AFAICS, because the inb()
> itself will trap to the HV.
This is not always the case. If the physical CPU is owned entirely by one
guest (not shared with other guests), the C-state port may be passed
through to that guest. In that case, the inb() that triggers the C-state
transition doesn't trap to the HV.

> 
>>>
>>> ...
>>>> +	/* profiling the time used for dummy wait op */
>>>> +	ktime_get_real_ts64(&ts0);
>>>> +	inl(acpi_gbl_FADT.xpm_timer_block.address);
>>>> +	ktime_get_real_ts64(&ts1);
> 
> You may as well use ktime_get() for this, as it's almost the same code as
> ktime_get_real_ts64() AFAICS, only simpler.
> 
> Plus, static vars need not be initialized to 0.
Thanks for pointing this out. Will update the patch accordingly.

> 
>>>
>>> That could be dominated by the cost of ktime_get_real_ts64().
>>> It also need synchronising instructions.
>> I did some testing. ktime_get_real_ts64() takes much less time than io
>> port access.
>>
>> The test code is like:
>> 1.
>> 	local_irq_save(flag);
>> 	ktime_get_real_ts64(&ts0);
>> 	inl(acpi_gbl_FADT.xpm_timer_block.address);
>> 	ktime_get_real_ts64(&ts1);
>> 	local_irq_restore(flag);
>>
>> 2.
>> 	local_irq_save(flag);
>> 	ktime_get_real_ts64(&ts0);
>> 	ktime_get_real_ts64(&ts1);
>> 	local_irq_restore(flag);
>>
>> The delta in 1 is about 500000ns. And delta in 2 is about
>> 2000ns. The date is gotten on Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz.
>> So I suppose the impact of ktime_get_real_ts64 is small.
> 
> You may not be hitting the worst case for ktime_get_real_ts64(), though.
> 
> I wonder if special casing the virt guest would be a better approach.
> 
> Then, you could leave the code as is for non-virt and I'm not sure if the
> delay is needed in the virt guest case at all.
> 
> So maybe do something like "if not in a virt guest, do the dummy inl()"
> and that would be it?
Yes, this is better. That way we can limit the impact on the non-virt environment.

Regards
Yin, Fengwei

> 
> 
>
Yin, Fengwei Oct. 22, 2019, 2:25 p.m. UTC | #5
On 2019/10/18 6:12 PM, Rafael J. Wysocki wrote:
> On Wednesday, October 16, 2019 7:56:17 AM CEST Yin, Fengwei wrote:
>> Hi David,
>>
>> On 10/15/2019 7:48 PM, David Laight wrote:
>>> From: Yin Fengwei
>>>> Sent: 15 October 2019 09:04
>>>> In function acpi_idle_do_entry(), an ioport access is used for dummy
>>>> wait to guarantee hardware behavior. But it could trigger unnecessary
>>>> vmexit in virtualization environment.
>>>>
>>>> If we run linux as guest and export all available native C state to
>>>> guest, we did see many PM timer access triggered VMexit when guest
>>>> enter deeper C state in our environment (We used ACRN hypervisor
>>>> instead of kvm or xen which has PM timer emulated and exports all
>>>> native C state to guest).
>>>>
>>>> According to the original comments of this part of code, io port
>>>> access is only for dummy wait. We could use busy wait instead of io
>>>> port access to guarantee hardware behavior and avoid unnecessary
>>>> VMexit.
>>>
>>> You need some hard synchronisation instruction(s) after the inb()
>>> and before any kind of delay to ensure your delay code is executed
>>> after the inb() completes.
>>>
>>> I'm pretty sure that inb() is only synchronised with memory reads.
>> Thanks a lot for the comments.
>>
>> I didn't find the common serializing instructions API in kernel (only
>> memory  barrier which is used to make sure of memory access). For Intel
>> x86, cpuid could be used as serializing instruction. But it's not
>> suitable for common code here. Do you have any suggestion?
> 
> In the virt guest case you don't need to worry at all AFAICS, because the inb()
> itself will trap to the HV.
> 
>>>
>>> ...
>>>> +	/* profiling the time used for dummy wait op */
>>>> +	ktime_get_real_ts64(&ts0);
>>>> +	inl(acpi_gbl_FADT.xpm_timer_block.address);
>>>> +	ktime_get_real_ts64(&ts1);
> 
> You may as well use ktime_get() for this, as it's almost the same code as
> ktime_get_real_ts64() AFAICS, only simpler.
> 
> Plus, static vars need not be initialized to 0.
> 
>>>
>>> That could be dominated by the cost of ktime_get_real_ts64().
>>> It also need synchronising instructions.
>> I did some testing. ktime_get_real_ts64() takes much less time than io
>> port access.
>>
>> The test code is like:
>> 1.
>> 	local_irq_save(flag);
>> 	ktime_get_real_ts64(&ts0);
>> 	inl(acpi_gbl_FADT.xpm_timer_block.address);
>> 	ktime_get_real_ts64(&ts1);
>> 	local_irq_restore(flag);
>>
>> 2.
>> 	local_irq_save(flag);
>> 	ktime_get_real_ts64(&ts0);
>> 	ktime_get_real_ts64(&ts1);
>> 	local_irq_restore(flag);
>>
>> The delta in 1 is about 500000ns. And delta in 2 is about
>> 2000ns. The date is gotten on Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz.
>> So I suppose the impact of ktime_get_real_ts64 is small.
> 
> You may not be hitting the worst case for ktime_get_real_ts64(), though.
> 
> I wonder if special casing the virt guest would be a better approach.
> 
> Then, you could leave the code as is for non-virt and I'm not sure if the
> delay is needed in the virt guest case at all.
> 
> So maybe do something like "if not in a virt guest, do the dummy inl()"
> and that would be it?
After thinking through the scenario again, I'd like to change the patch
as Rafael suggested, along these lines:

If we are not in a virt guest, we still use inl() for the dummy wait.
If we are in a virt guest, we can assume inb() will be trapped to the HV
and drop the dummy wait.
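
The plan above can be modeled with stubs like this. It is a sketch, not the v3 patch: model_inb_cstate(), model_inl_pm_timer() and model_idle_entry() are hypothetical names, the port accesses are replaced by counters, and the guest check is passed in as a parameter. How the kernel would detect a guest is an assumption left open here; on x86 a check such as boot_cpu_has(X86_FEATURE_HYPERVISOR) would be one candidate.

```c
#include <assert.h>
#include <stdbool.h>

/* Counters so the control flow can be checked without real hardware. */
static int entry_reads;	/* models inb(cx->address) */
static int dummy_reads;	/* models inl(acpi_gbl_FADT.xpm_timer_block.address) */

static void model_inb_cstate(void)
{
	entry_reads++;
}

static void model_inl_pm_timer(void)
{
	dummy_reads++;
}

/*
 * Hypothetical model of the SYSTEMIO branch of acpi_idle_do_entry():
 * the C-state read always happens, the dummy wait only on bare metal.
 */
static void model_idle_entry(bool in_virt_guest)
{
	model_inb_cstate();
	if (!in_virt_guest)
		model_inl_pm_timer();
}
```

This keeps the non-virt path identical to the existing code, so only guests see a behavior change.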

I will generate v3 soon.

Regards
Yin, Fengwei

> 
> 
>

Patch

diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index ed56c6d20b08..38968d31af28 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -14,6 +14,7 @@ 
 
 #include <linux/module.h>
 #include <linux/acpi.h>
+#include <linux/delay.h>
 #include <linux/dmi.h>
 #include <linux/sched.h>       /* need_resched() */
 #include <linux/tick.h>
@@ -55,6 +56,8 @@  struct cpuidle_driver acpi_idle_driver = {
 };
 
 #ifdef CONFIG_ACPI_PROCESSOR_CSTATE
+static s64 cx_dummy_wait_ns = 0L;
+
 static
 DEFINE_PER_CPU(struct acpi_processor_cx * [CPUIDLE_STATE_MAX], acpi_cstate);
 
@@ -64,6 +67,11 @@  static int disabled_by_idle_boot_param(void)
 		boot_option_idle_override == IDLE_HALT;
 }
 
+static void cx_dummy_wait(void)
+{
+	ndelay(cx_dummy_wait_ns);
+}
+
 /*
  * IBM ThinkPad R40e crashes mysteriously when going into C2 or C3.
  * For now disable this. Probably a bug somewhere else.
@@ -660,8 +668,13 @@  static void __cpuidle acpi_idle_do_entry(struct acpi_processor_cx *cx)
 		inb(cx->address);
 		/* Dummy wait op - must do something useless after P_LVL2 read
 		   because chipsets cannot guarantee that STPCLK# signal
-		   gets asserted in time to freeze execution properly. */
-		inl(acpi_gbl_FADT.xpm_timer_block.address);
+		   gets asserted in time to freeze execution properly.
+
+		   Previously, an io port access was used for the delay here,
+		   which could trigger an unnecessary trap to the HV. Now we
+		   use ndelay() instead to avoid the impact on virtualization. */
+
+		cx_dummy_wait();
 	}
 }
 
@@ -683,7 +696,7 @@  static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
 		else if (cx->entry_method == ACPI_CSTATE_SYSTEMIO) {
 			inb(cx->address);
 			/* See comment in acpi_idle_do_entry() */
-			inl(acpi_gbl_FADT.xpm_timer_block.address);
+			cx_dummy_wait();
 		} else
 			return -ENODEV;
 	}
@@ -902,6 +915,7 @@  static inline void acpi_processor_cstate_first_run_checks(void)
 {
 	acpi_status status;
 	static int first_run;
+	struct timespec64 ts0, ts1;
 
 	if (first_run)
 		return;
@@ -912,6 +926,14 @@  static inline void acpi_processor_cstate_first_run_checks(void)
 			  max_cstate);
 	first_run++;
 
+	/* profiling the time used for dummy wait op */
+	ktime_get_real_ts64(&ts0);
+	inl(acpi_gbl_FADT.xpm_timer_block.address);
+	ktime_get_real_ts64(&ts1);
+
+	ts1 = timespec64_sub(ts1, ts0);
+	cx_dummy_wait_ns = timespec64_to_ns(&ts1);
+
 	if (acpi_gbl_FADT.cst_control && !nocst) {
 		status = acpi_os_write_port(acpi_gbl_FADT.smi_command,
 					    acpi_gbl_FADT.cst_control, 8);