
x86/time: update TSC stamp on restore from deep C-state

Message ID 1579030581-7929-1-git-send-email-igor.druzhinin@citrix.com (mailing list archive)

Commit Message

Igor Druzhinin Jan. 14, 2020, 7:36 p.m. UTC
If ITSC is not available on the CPU (e.g. if running nested as PV shim)
then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e. on
all AMD and some old Intel processors. In that case the TSC needs to be
restored on the CPU from platform time by Xen upon exiting deep C-states.

As platform time might be behind the last TSC stamp recorded for the
current CPU, the invariant that the TSC stamp is always behind the local
TSC counter is violated. This has the effect of get_s_time() going
negative, resulting in an eventual system hang or crash.

Fix this issue by updating the local TSC stamp along with the TSC counter
write.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
---
This caused reliable hangs of shim domains with multiple vCPUs on all AMD
systems. The problem was also reproduced on bare metal by artificially
masking the ITSC feature bit. The proposed fix has been verified for both
cases.
---
 xen/arch/x86/time.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
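[Editor's note: the failure mode can be illustrated with a minimal standalone model. This is not Xen's actual code — the struct, function names, and the 1 tick == 1 ns scale are simplifying assumptions (the real get_s_time_fixed() scales the TSC delta by the measured frequency) — but the arithmetic shows why writing the TSC backwards without refreshing the stamp makes time go backwards.]

```c
#include <stdint.h>

/*
 * Minimal model (hypothetical names, not Xen's implementation).
 * System time is extrapolated as:
 *     now = local_stime + (tsc_now - local_tsc)
 * assuming a 1 GHz TSC so 1 tick == 1 ns.  If the TSC is written back
 * to a value derived from platform time that lags local_tsc, without
 * also refreshing the stamp, the delta goes negative and time moves
 * backwards.
 */

struct cpu_stamp {
    uint64_t local_tsc;   /* TSC reading at the time of the stamp */
    int64_t  local_stime; /* system time (ns) at the time of the stamp */
};

static int64_t model_get_s_time(const struct cpu_stamp *s, uint64_t tsc_now)
{
    /* Unsigned subtraction wraps when tsc_now < local_tsc; the cast
     * turns the wrapped value into a negative delta. */
    return s->local_stime + (int64_t)(tsc_now - s->local_tsc);
}

/* The fix sketched in the patch: refresh the stamp pair together with
 * the TSC write, so local_tsc <= tsc_now holds again. */
static void model_restore(struct cpu_stamp *s, int64_t platform_stime,
                          uint64_t restored_tsc)
{
    s->local_stime = platform_stime;
    s->local_tsc = restored_tsc;
}
```

With the stamp refreshed alongside the TSC write, the delta stays non-negative and the modeled time remains monotonic.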

Comments

Roger Pau Monné Jan. 15, 2020, 9:47 a.m. UTC | #1
On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote:
> If ITSC is not available on CPU (e.g if running nested as PV shim)
> then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e.
> all AMD and some old Intel processors. In which case TSC would need to
> be restored on CPU from platform time by Xen upon exiting deep C-states.
> 
> As platform time might be behind the last TSC stamp recorded for the
> current CPU, invariant of TSC stamp being always behind local TSC counter
> is violated. This has an effect of get_s_time() going negative resulting
> in eventual system hang or crash.
> 
> Fix this issue by updating local TSC stamp along with TSC counter write.

Thanks! I haven't seen such an issue because I've been running the shim
with nomigrate in order to prevent the vTSC overhead.

> 
> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> ---
> This caused reliable hangs of shim domains with multiple vCPUs on all AMD
> systems. The problem got also reproduced on bare-metal by artifically
> masking ITSC feature bit. The proposed fix has been verified for both
> cases.
> ---
>  xen/arch/x86/time.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> index e79cb4d..f6b26f8 100644
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime)
>  
>  void cstate_restore_tsc(void)
>  {
> +    struct cpu_time *t = &this_cpu(cpu_time);
> +
>      if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
>          return;
>  
> -    write_tsc(stime2tsc(read_platform_stime(NULL)));
> +    t->stamp.master_stime = read_platform_stime(NULL);
> +    t->stamp.local_tsc = stime2tsc(t->stamp.master_stime);
> +    t->stamp.local_stime = t->stamp.master_stime;
> +
> +    write_tsc(t->stamp.local_tsc);

In order to avoid the TSC write (and the likely associated vmexit),
could you instead do:

t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL);
t->stamp.local_tsc = rdtsc_ordered();

I think it should achieve the same, as it syncs the local TSC stamp and
times; it would avoid the TSC write and slightly simplify the logic.

Thanks, Roger.
Jan Beulich Jan. 15, 2020, 11:32 a.m. UTC | #2
On 14.01.2020 20:36, Igor Druzhinin wrote:
> If ITSC is not available on CPU (e.g if running nested as PV shim)
> then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e.
> all AMD and some old Intel processors. In which case TSC would need to
> be restored on CPU from platform time by Xen upon exiting deep C-states.

How does waking from deep C states correspond to the PV shim? I notice
that cstate_restore_tsc() gets called irrespective of the C state being
exited, so I wonder whether there's room for improvement there
independent of the issue at hand. As far as this change is concerned,
I think you want to drop the notion of "deep" from the description.

Jan
Jan Beulich Jan. 15, 2020, 11:40 a.m. UTC | #3
On 15.01.2020 10:47, Roger Pau Monné wrote:
> On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote:
>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime)
>>  
>>  void cstate_restore_tsc(void)
>>  {
>> +    struct cpu_time *t = &this_cpu(cpu_time);
>> +
>>      if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
>>          return;
>>  
>> -    write_tsc(stime2tsc(read_platform_stime(NULL)));
>> +    t->stamp.master_stime = read_platform_stime(NULL);
>> +    t->stamp.local_tsc = stime2tsc(t->stamp.master_stime);
>> +    t->stamp.local_stime = t->stamp.master_stime;
>> +
>> +    write_tsc(t->stamp.local_tsc);
> 
> In order to avoid the TSC write (and the likely associated vmexit),
> could you instead do:
> 
> t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL);
> t->stamp.local_tsc = rdtsc_ordered();
> 
> I think it should achieve the same as it syncs the local TSC stamp and
> times, would avoid the TSC write and slightly simplifies the logic.

Wouldn't this result in guests possibly observing the TSC moving
backwards?

Jan
Roger Pau Monné Jan. 15, 2020, 11:53 a.m. UTC | #4
On Wed, Jan 15, 2020 at 12:40:27PM +0100, Jan Beulich wrote:
> On 15.01.2020 10:47, Roger Pau Monné wrote:
> > On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote:
> >> --- a/xen/arch/x86/time.c
> >> +++ b/xen/arch/x86/time.c
> >> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime)
> >>  
> >>  void cstate_restore_tsc(void)
> >>  {
> >> +    struct cpu_time *t = &this_cpu(cpu_time);
> >> +
> >>      if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
> >>          return;
> >>  
> >> -    write_tsc(stime2tsc(read_platform_stime(NULL)));
> >> +    t->stamp.master_stime = read_platform_stime(NULL);
> >> +    t->stamp.local_tsc = stime2tsc(t->stamp.master_stime);
> >> +    t->stamp.local_stime = t->stamp.master_stime;
> >> +
> >> +    write_tsc(t->stamp.local_tsc);
> > 
> > In order to avoid the TSC write (and the likely associated vmexit),
> > could you instead do:
> > 
> > t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL);
> > t->stamp.local_tsc = rdtsc_ordered();
> > 
> > I think it should achieve the same as it syncs the local TSC stamp and
> > times, would avoid the TSC write and slightly simplifies the logic.
> 
> Wouldn't this result in guests possibly observing the TSC moving
> backwards?

Isn't local_tsc always storing a TSC value read from the same CPU, and
hence it could only go backwards if rdtsc actually goes backwards?

Ie: cpu_frequency_change seems to do something similar, together with
a re-adjusting of the time scale, but doesn't perform any TSC write.

Thanks, Roger.
Igor Druzhinin Jan. 15, 2020, 12:25 p.m. UTC | #5
On 15/01/2020 11:40, Jan Beulich wrote:
> On 15.01.2020 10:47, Roger Pau Monné wrote:
>> On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote:
>>> --- a/xen/arch/x86/time.c
>>> +++ b/xen/arch/x86/time.c
>>> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime)
>>>  
>>>  void cstate_restore_tsc(void)
>>>  {
>>> +    struct cpu_time *t = &this_cpu(cpu_time);
>>> +
>>>      if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
>>>          return;
>>>  
>>> -    write_tsc(stime2tsc(read_platform_stime(NULL)));
>>> +    t->stamp.master_stime = read_platform_stime(NULL);
>>> +    t->stamp.local_tsc = stime2tsc(t->stamp.master_stime);
>>> +    t->stamp.local_stime = t->stamp.master_stime;
>>> +
>>> +    write_tsc(t->stamp.local_tsc);
>>
>> In order to avoid the TSC write (and the likely associated vmexit),
>> could you instead do:
>>
>> t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL);
>> t->stamp.local_tsc = rdtsc_ordered();
>>
>> I think it should achieve the same as it syncs the local TSC stamp and
>> times, would avoid the TSC write and slightly simplifies the logic.
> 
> Wouldn't this result in guests possibly observing the TSC moving
> backwards?

Yes, I think so. Would restoring from the TSC stamp when it's higher than
platform time be better, do you think?

Igor
Igor Druzhinin Jan. 15, 2020, 12:28 p.m. UTC | #6
On 15/01/2020 11:32, Jan Beulich wrote:
> On 14.01.2020 20:36, Igor Druzhinin wrote:
>> If ITSC is not available on CPU (e.g if running nested as PV shim)
>> then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e.
>> all AMD and some old Intel processors. In which case TSC would need to
>> be restored on CPU from platform time by Xen upon exiting deep C-states.
> 
> How does waking from deep C states correspond to the PV shim? I notice
> that cstate_restore_tsc() gets called irrespective of the C state being
> exited, so I wonder whether there's room for improvement there
> independent of the issue at hand. As far as this change is concerned,
> I think you want to drop the notion of "deep" from the description.

I'm not sure what exactly qualifies as a "deep C-state"; for me it was
anything deeper than C1. If you prefer "deep" to be dropped - so be it.

Igor
Igor Druzhinin Jan. 15, 2020, 12:31 p.m. UTC | #7
On 15/01/2020 12:25, Igor Druzhinin wrote:
> On 15/01/2020 11:40, Jan Beulich wrote:
>> On 15.01.2020 10:47, Roger Pau Monné wrote:
>>> On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote:
>>>> --- a/xen/arch/x86/time.c
>>>> +++ b/xen/arch/x86/time.c
>>>> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime)
>>>>  
>>>>  void cstate_restore_tsc(void)
>>>>  {
>>>> +    struct cpu_time *t = &this_cpu(cpu_time);
>>>> +
>>>>      if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
>>>>          return;
>>>>  
>>>> -    write_tsc(stime2tsc(read_platform_stime(NULL)));
>>>> +    t->stamp.master_stime = read_platform_stime(NULL);
>>>> +    t->stamp.local_tsc = stime2tsc(t->stamp.master_stime);
>>>> +    t->stamp.local_stime = t->stamp.master_stime;
>>>> +
>>>> +    write_tsc(t->stamp.local_tsc);
>>>
>>> In order to avoid the TSC write (and the likely associated vmexit),
>>> could you instead do:
>>>
>>> t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL);
>>> t->stamp.local_tsc = rdtsc_ordered();
>>>
>>> I think it should achieve the same as it syncs the local TSC stamp and
>>> times, would avoid the TSC write and slightly simplifies the logic.
>>
>> Wouldn't this result in guests possibly observing the TSC moving
>> backwards?
> 
> Yes, I think so. Would restoring from TSC stamp if it's higher than
> platform time better you think?
> 

Ignore my reply. I thought you were asking whether the original code
would do such a thing. Still, I'm concerned that what you say may apply
to the original code as well. Do you think the existing logic already
handles it?

Igor
Igor Druzhinin Jan. 15, 2020, 12:36 p.m. UTC | #8
On 15/01/2020 09:47, Roger Pau Monné wrote:
> On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote:
>> If ITSC is not available on CPU (e.g if running nested as PV shim)
>> then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e.
>> all AMD and some old Intel processors. In which case TSC would need to
>> be restored on CPU from platform time by Xen upon exiting deep C-states.
>>
>> As platform time might be behind the last TSC stamp recorded for the
>> current CPU, invariant of TSC stamp being always behind local TSC counter
>> is violated. This has an effect of get_s_time() going negative resulting
>> in eventual system hang or crash.
>>
>> Fix this issue by updating local TSC stamp along with TSC counter write.
> 
> Thanks! I haven't seen such issue because I've been running the shim
> with nomigrate in order to prevent the vTSC overhead.
> 
>>
>> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>> ---
>> This caused reliable hangs of shim domains with multiple vCPUs on all AMD
>> systems. The problem got also reproduced on bare-metal by artifically
>> masking ITSC feature bit. The proposed fix has been verified for both
>> cases.
>> ---
>>  xen/arch/x86/time.c | 8 +++++++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
>> index e79cb4d..f6b26f8 100644
>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime)
>>  
>>  void cstate_restore_tsc(void)
>>  {
>> +    struct cpu_time *t = &this_cpu(cpu_time);
>> +
>>      if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
>>          return;
>>  
>> -    write_tsc(stime2tsc(read_platform_stime(NULL)));
>> +    t->stamp.master_stime = read_platform_stime(NULL);
>> +    t->stamp.local_tsc = stime2tsc(t->stamp.master_stime);
>> +    t->stamp.local_stime = t->stamp.master_stime;
>> +
>> +    write_tsc(t->stamp.local_tsc);
> 
> In order to avoid the TSC write (and the likely associated vmexit),
> could you instead do:
> 
> t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL);
> t->stamp.local_tsc = rdtsc_ordered();

I think in that case RDTSC might return something behind platform time,
which doesn't seem right to me.

Igor
Jan Beulich Jan. 15, 2020, 12:39 p.m. UTC | #9
On 15.01.2020 13:28, Igor Druzhinin wrote:
> On 15/01/2020 11:32, Jan Beulich wrote:
>> On 14.01.2020 20:36, Igor Druzhinin wrote:
>>> If ITSC is not available on CPU (e.g if running nested as PV shim)
>>> then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e.
>>> all AMD and some old Intel processors. In which case TSC would need to
>>> be restored on CPU from platform time by Xen upon exiting deep C-states.
>>
>> How does waking from deep C states correspond to the PV shim? I notice
>> that cstate_restore_tsc() gets called irrespective of the C state being
>> exited, so I wonder whether there's room for improvement there
>> independent of the issue at hand. As far as this change is concerned,
>> I think you want to drop the notion of "deep" from the description.
> 
> I'm not familiar with what to call "deep C-state" so for me it was anything
> higher than C1. If you prefer "deep" to be dropped - so be it.

"Higher than C1" may be fine (albeit I vaguely recall TSC issues
starting with C3 only), but at least mwait_idle() calls the
function even for C1. As to the PV shim - does it know about any
C-states at all (beyond HLT-invoked C1)?

Jan
Jan Beulich Jan. 15, 2020, 12:41 p.m. UTC | #10
On 15.01.2020 13:31, Igor Druzhinin wrote:
> On 15/01/2020 12:25, Igor Druzhinin wrote:
>> On 15/01/2020 11:40, Jan Beulich wrote:
>>> On 15.01.2020 10:47, Roger Pau Monné wrote:
>>>> On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote:
>>>>> --- a/xen/arch/x86/time.c
>>>>> +++ b/xen/arch/x86/time.c
>>>>> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime)
>>>>>  
>>>>>  void cstate_restore_tsc(void)
>>>>>  {
>>>>> +    struct cpu_time *t = &this_cpu(cpu_time);
>>>>> +
>>>>>      if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
>>>>>          return;
>>>>>  
>>>>> -    write_tsc(stime2tsc(read_platform_stime(NULL)));
>>>>> +    t->stamp.master_stime = read_platform_stime(NULL);
>>>>> +    t->stamp.local_tsc = stime2tsc(t->stamp.master_stime);
>>>>> +    t->stamp.local_stime = t->stamp.master_stime;
>>>>> +
>>>>> +    write_tsc(t->stamp.local_tsc);
>>>>
>>>> In order to avoid the TSC write (and the likely associated vmexit),
>>>> could you instead do:
>>>>
>>>> t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL);
>>>> t->stamp.local_tsc = rdtsc_ordered();
>>>>
>>>> I think it should achieve the same as it syncs the local TSC stamp and
>>>> times, would avoid the TSC write and slightly simplifies the logic.
>>>
>>> Wouldn't this result in guests possibly observing the TSC moving
>>> backwards?
>>
>> Yes, I think so. Would restoring from TSC stamp if it's higher than
>> platform time better you think?
>>
> 
> Ignore my reply. I was thinking you're asking whether the original code
> would do such a thing. Although I'm concerned if what you say actually
> applies to the original code as well. Would you think the existing logic
> handles it already?

I think the original code doesn't guarantee the absence of backwards
moves at all, but it also wouldn't produce large backwards jumps. It's
large ones I was mainly thinking of when replying to Roger.
But let me also reply back to him...

Jan
Igor Druzhinin Jan. 15, 2020, 12:47 p.m. UTC | #11
On 15/01/2020 12:39, Jan Beulich wrote:
> On 15.01.2020 13:28, Igor Druzhinin wrote:
>> On 15/01/2020 11:32, Jan Beulich wrote:
>>> On 14.01.2020 20:36, Igor Druzhinin wrote:
>>>> If ITSC is not available on CPU (e.g if running nested as PV shim)
>>>> then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e.
>>>> all AMD and some old Intel processors. In which case TSC would need to
>>>> be restored on CPU from platform time by Xen upon exiting deep C-states.
>>>
>>> How does waking from deep C states correspond to the PV shim? I notice
>>> that cstate_restore_tsc() gets called irrespective of the C state being
>>> exited, so I wonder whether there's room for improvement there
>>> independent of the issue at hand. As far as this change is concerned,
>>> I think you want to drop the notion of "deep" from the description.
>>
>> I'm not familiar with what to call "deep C-state" so for me it was anything
>> higher than C1. If you prefer "deep" to be dropped - so be it.
> 
> "Higher than C1" may be fine (albeit I vaguely recall TSC issues
> starting with C3 only), but at least mwait_idle() calls the
> function even for C1. As to the PV shim - does it know about any
> C-states at all (beyond HLT-invoked C1)?

Yes, the PV shim knows about C-states, as it looks like they are tied to
the processor ID in some cases. For AMD specifically, C2 is HLT.

Igor
Jan Beulich Jan. 15, 2020, 12:49 p.m. UTC | #12
On 15.01.2020 12:53, Roger Pau Monné wrote:
> On Wed, Jan 15, 2020 at 12:40:27PM +0100, Jan Beulich wrote:
>> On 15.01.2020 10:47, Roger Pau Monné wrote:
>>> On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote:
>>>> --- a/xen/arch/x86/time.c
>>>> +++ b/xen/arch/x86/time.c
>>>> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime)
>>>>  
>>>>  void cstate_restore_tsc(void)
>>>>  {
>>>> +    struct cpu_time *t = &this_cpu(cpu_time);
>>>> +
>>>>      if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
>>>>          return;
>>>>  
>>>> -    write_tsc(stime2tsc(read_platform_stime(NULL)));
>>>> +    t->stamp.master_stime = read_platform_stime(NULL);
>>>> +    t->stamp.local_tsc = stime2tsc(t->stamp.master_stime);
>>>> +    t->stamp.local_stime = t->stamp.master_stime;
>>>> +
>>>> +    write_tsc(t->stamp.local_tsc);
>>>
>>> In order to avoid the TSC write (and the likely associated vmexit),
>>> could you instead do:
>>>
>>> t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL);
>>> t->stamp.local_tsc = rdtsc_ordered();
>>>
>>> I think it should achieve the same as it syncs the local TSC stamp and
>>> times, would avoid the TSC write and slightly simplifies the logic.
>>
>> Wouldn't this result in guests possibly observing the TSC moving
>> backwards?
> 
> Isn't local_tsc storing a TSC value read from the same CPU always, and
> hence could only go backwards if rdtsc actually goes backwards?

For one I have to admit I was (mistakenly) thinking of wakeup
from S states more than that from C states. So assuming the
TSC indeed only stops (but won't get e.g. restarted), backwards
moves ought to be excluded. What I'm then worried about is too
little progress observable by guests. The PV time protocol
ought to be fine in this regard (and consumers of raw TSC values
are on their own anyway), but wouldn't you need to update TSC
offsets of HVM guests in order to compensate for the elapsed
time?

> Ie: cpu_frequency_change seems to do something similar, together with
> a re-adjusting of the time scale, but doesn't perform any TSC write.

A P-state change at most alters the tick rate, but wouldn't
stop or even reset the TSC (afaict).

Jan
Jan Beulich Jan. 15, 2020, 12:54 p.m. UTC | #13
On 15.01.2020 13:47, Igor Druzhinin wrote:
> On 15/01/2020 12:39, Jan Beulich wrote:
>> On 15.01.2020 13:28, Igor Druzhinin wrote:
>>> On 15/01/2020 11:32, Jan Beulich wrote:
>>>> On 14.01.2020 20:36, Igor Druzhinin wrote:
>>>>> If ITSC is not available on CPU (e.g if running nested as PV shim)
>>>>> then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e.
>>>>> all AMD and some old Intel processors. In which case TSC would need to
>>>>> be restored on CPU from platform time by Xen upon exiting deep C-states.
>>>>
>>>> How does waking from deep C states correspond to the PV shim? I notice
>>>> that cstate_restore_tsc() gets called irrespective of the C state being
>>>> exited, so I wonder whether there's room for improvement there
>>>> independent of the issue at hand. As far as this change is concerned,
>>>> I think you want to drop the notion of "deep" from the description.
>>>
>>> I'm not familiar with what to call "deep C-state" so for me it was anything
>>> higher than C1. If you prefer "deep" to be dropped - so be it.
>>
>> "Higher than C1" may be fine (albeit I vaguely recall TSC issues
>> starting with C3 only), but at least mwait_idle() calls the
>> function even for C1. As to the PV shim - does it know about any
>> C-states at all (beyond HLT-invoked C1)?
> 
> Yes, PV-shim knows about C states as it looks they are tied to
> processor ID in some cases. For AMD specifically C2 is HLT.

The AMD part is pretty new, and is - afaict - an exception compared
to everything else. Under PVH there's no respective ACPI data (iirc),
and we also don't surface MONITOR/MWAIT to HVM/PVH guests, making
mwait_idle_probe() bail early. I wonder whether this special
behavior on AMD Fam17 should be suppressed in this case.

Jan
Roger Pau Monné Jan. 15, 2020, 1:23 p.m. UTC | #14
On Wed, Jan 15, 2020 at 12:36:08PM +0000, Igor Druzhinin wrote:
> On 15/01/2020 09:47, Roger Pau Monné wrote:
> > On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote:
> >> If ITSC is not available on CPU (e.g if running nested as PV shim)
> >> then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e.
> >> all AMD and some old Intel processors. In which case TSC would need to
> >> be restored on CPU from platform time by Xen upon exiting deep C-states.
> >>
> >> As platform time might be behind the last TSC stamp recorded for the
> >> current CPU, invariant of TSC stamp being always behind local TSC counter
> >> is violated. This has an effect of get_s_time() going negative resulting
> >> in eventual system hang or crash.
> >>
> >> Fix this issue by updating local TSC stamp along with TSC counter write.
> > 
> > Thanks! I haven't seen such issue because I've been running the shim
> > with nomigrate in order to prevent the vTSC overhead.
> > 
> >>
> >> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> >> ---
> >> This caused reliable hangs of shim domains with multiple vCPUs on all AMD
> >> systems. The problem got also reproduced on bare-metal by artifically
> >> masking ITSC feature bit. The proposed fix has been verified for both
> >> cases.
> >> ---
> >>  xen/arch/x86/time.c | 8 +++++++-
> >>  1 file changed, 7 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> >> index e79cb4d..f6b26f8 100644
> >> --- a/xen/arch/x86/time.c
> >> +++ b/xen/arch/x86/time.c
> >> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime)
> >>  
> >>  void cstate_restore_tsc(void)
> >>  {
> >> +    struct cpu_time *t = &this_cpu(cpu_time);
> >> +
> >>      if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
> >>          return;
> >>  
> >> -    write_tsc(stime2tsc(read_platform_stime(NULL)));
> >> +    t->stamp.master_stime = read_platform_stime(NULL);
> >> +    t->stamp.local_tsc = stime2tsc(t->stamp.master_stime);
> >> +    t->stamp.local_stime = t->stamp.master_stime;
> >> +
> >> +    write_tsc(t->stamp.local_tsc);
> > 
> > In order to avoid the TSC write (and the likely associated vmexit),
> > could you instead do:
> > 
> > t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL);
> > t->stamp.local_tsc = rdtsc_ordered();
> 
> I think in that case RDTSC might return something behind platform time
> which is not right I guess.

The TSC and the platform time are completely independent from Xen's
PoV; you can have a platform time not based on the TSC (ie: PIT, HPET
or PM), and hence there's no direct relation between the two.

The TSC is used as a way to get an approximate platform time based on
the last platform time value and the TSC delta between that value and
the current TSC value; I assume that's done because reading the TSC is
much cheaper than reading the platform time.

As long as the platform time and the TSC stamps are both updated at the
same time it should be fine.

Roger.
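[Editor's note: the point about the snapshot pair can be sketched with a toy model — illustrative names only, not Xen's code. Since extrapolation consumes only the delta since a (stime, tsc) snapshot captured together, the absolute TSC value never matters; a 1 GHz TSC (1 tick == 1 ns) is assumed to keep scaling trivial.]

```c
#include <stdint.h>

/*
 * Toy model: platform time and the TSC are independent counters.
 * Extrapolation uses only the TSC delta since a snapshot taken as a
 * pair, so two unrelated TSC bases yield the same extrapolated time.
 */

struct snapshot {
    int64_t  stime; /* platform/system time (ns) at the snapshot */
    uint64_t tsc;   /* TSC reading taken at the same instant */
};

static int64_t extrapolate(const struct snapshot *s, uint64_t tsc_now)
{
    return s->stime + (int64_t)(tsc_now - s->tsc);
}
```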
Roger Pau Monné Jan. 15, 2020, 1:44 p.m. UTC | #15
On Wed, Jan 15, 2020 at 01:49:22PM +0100, Jan Beulich wrote:
> On 15.01.2020 12:53, Roger Pau Monné wrote:
> > On Wed, Jan 15, 2020 at 12:40:27PM +0100, Jan Beulich wrote:
> >> On 15.01.2020 10:47, Roger Pau Monné wrote:
> >>> On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote:
> >>>> --- a/xen/arch/x86/time.c
> >>>> +++ b/xen/arch/x86/time.c
> >>>> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime)
> >>>>  
> >>>>  void cstate_restore_tsc(void)
> >>>>  {
> >>>> +    struct cpu_time *t = &this_cpu(cpu_time);
> >>>> +
> >>>>      if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
> >>>>          return;
> >>>>  
> >>>> -    write_tsc(stime2tsc(read_platform_stime(NULL)));
> >>>> +    t->stamp.master_stime = read_platform_stime(NULL);
> >>>> +    t->stamp.local_tsc = stime2tsc(t->stamp.master_stime);
> >>>> +    t->stamp.local_stime = t->stamp.master_stime;
> >>>> +
> >>>> +    write_tsc(t->stamp.local_tsc);
> >>>
> >>> In order to avoid the TSC write (and the likely associated vmexit),
> >>> could you instead do:
> >>>
> >>> t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL);
> >>> t->stamp.local_tsc = rdtsc_ordered();
> >>>
> >>> I think it should achieve the same as it syncs the local TSC stamp and
> >>> times, would avoid the TSC write and slightly simplifies the logic.
> >>
> >> Wouldn't this result in guests possibly observing the TSC moving
> >> backwards?
> > 
> > Isn't local_tsc storing a TSC value read from the same CPU always, and
> > hence could only go backwards if rdtsc actually goes backwards?
> 
> For one I have to admit I was (mistakenly) thinking of wakeup
> from S states more than that from C states. So assuming the
> TSC indeed only stops (but won't get e.g. restarted), backwards
> moves ought to be excluded.

Even if the TSC was restarted I think my proposed approach should be
fine. The only requirement is that the stored TSC stamp must always be
behind the value returned by rdtsc. See get_s_time_fixed: as
long as the delta is positive the returned time should be correct.

> What I'm then worried about is too
> little progress observable by guests. The PV time protocol
> ought to be fine in this regard (and consumers of raw TSC values
> are on their own anyway), but wouldn't you need to update TSC
> offsets of HVM guests in order to compensate for the elapsed
> time?

That will be done when the HVM vCPU gets scheduled in as part of the
update_vcpu_system_time call AFAICT. cstate_restore_tsc will always be
called with the idle vCPU context, and hence there's always going to
be a vCPU switch before scheduling anything else.

> > Ie: cpu_frequency_change seems to do something similar, together with
> > a re-adjusting of the time scale, but doesn't perform any TSC write.
> 
> A P-state change at most alters the the tick rate, but wouldn't
> stop or even reset the TSC (afaict).

Right, just wanted to point out that the cpu_time stamp can be
updated without having to write to the TSC. Anyway, not sure it's very
relevant or useful, so forget this reference.

Roger.
Igor Druzhinin Jan. 15, 2020, 2:45 p.m. UTC | #16
On 15/01/2020 13:23, Roger Pau Monné wrote:
> On Wed, Jan 15, 2020 at 12:36:08PM +0000, Igor Druzhinin wrote:
>> On 15/01/2020 09:47, Roger Pau Monné wrote:
>>> On Tue, Jan 14, 2020 at 07:36:21PM +0000, Igor Druzhinin wrote:
>>>> If ITSC is not available on CPU (e.g if running nested as PV shim)
>>>> then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e.
>>>> all AMD and some old Intel processors. In which case TSC would need to
>>>> be restored on CPU from platform time by Xen upon exiting deep C-states.
>>>>
>>>> As platform time might be behind the last TSC stamp recorded for the
>>>> current CPU, invariant of TSC stamp being always behind local TSC counter
>>>> is violated. This has an effect of get_s_time() going negative resulting
>>>> in eventual system hang or crash.
>>>>
>>>> Fix this issue by updating local TSC stamp along with TSC counter write.
>>>
>>> Thanks! I haven't seen such issue because I've been running the shim
>>> with nomigrate in order to prevent the vTSC overhead.
>>>
>>>>
>>>> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
>>>> ---
>>>> This caused reliable hangs of shim domains with multiple vCPUs on all AMD
>>>> systems. The problem got also reproduced on bare-metal by artifically
>>>> masking ITSC feature bit. The proposed fix has been verified for both
>>>> cases.
>>>> ---
>>>>  xen/arch/x86/time.c | 8 +++++++-
>>>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
>>>> index e79cb4d..f6b26f8 100644
>>>> --- a/xen/arch/x86/time.c
>>>> +++ b/xen/arch/x86/time.c
>>>> @@ -955,10 +955,16 @@ u64 stime2tsc(s_time_t stime)
>>>>  
>>>>  void cstate_restore_tsc(void)
>>>>  {
>>>> +    struct cpu_time *t = &this_cpu(cpu_time);
>>>> +
>>>>      if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
>>>>          return;
>>>>  
>>>> -    write_tsc(stime2tsc(read_platform_stime(NULL)));
>>>> +    t->stamp.master_stime = read_platform_stime(NULL);
>>>> +    t->stamp.local_tsc = stime2tsc(t->stamp.master_stime);
>>>> +    t->stamp.local_stime = t->stamp.master_stime;
>>>> +
>>>> +    write_tsc(t->stamp.local_tsc);
>>>
>>> In order to avoid the TSC write (and the likely associated vmexit),
>>> could you instead do:
>>>
>>> t->stamp.local_stime = t->stamp.master_stime = read_platform_stime(NULL);
>>> t->stamp.local_tsc = rdtsc_ordered();
>>
>> I think in that case RDTSC might return something behind platform time
>> which is not right I guess.
> 
> The TSC and the platform time are completely independent from Xen's
> PoV, you can have a platform time not based on the TSC (ie: PIT, HPET
> or PM), and hence there's no direct relation between both.
> 
> The TSC is used as a way to get an approximate platform time based on
> the last platform time value and the TSC delta between than value and
> the current TSC value, I assume that's done because reading the TSC is
> much cheaper than reading the platform time.
> 
> As long as the platform time and the TSC stamps are both updated at the
> same time it should be fine.

I see your point. I'll test your approach and get back here with the results.

Igor
Andrew Cooper Jan. 15, 2020, 2:55 p.m. UTC | #17
On 15/01/2020 12:54, Jan Beulich wrote:
> On 15.01.2020 13:47, Igor Druzhinin wrote:
>> On 15/01/2020 12:39, Jan Beulich wrote:
>>> On 15.01.2020 13:28, Igor Druzhinin wrote:
>>>> On 15/01/2020 11:32, Jan Beulich wrote:
>>>>> On 14.01.2020 20:36, Igor Druzhinin wrote:
>>>>>> If ITSC is not available on CPU (e.g if running nested as PV shim)
>>>>>> then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e.
>>>>>> all AMD and some old Intel processors. In which case TSC would need to
>>>>>> be restored on CPU from platform time by Xen upon exiting deep C-states.
>>>>> How does waking from deep C states correspond to the PV shim? I notice
>>>>> that cstate_restore_tsc() gets called irrespective of the C state being
>>>>> exited, so I wonder whether there's room for improvement there
>>>>> independent of the issue at hand. As far as this change is concerned,
>>>>> I think you want to drop the notion of "deep" from the description.
>>>> I'm not familiar with what qualifies as a "deep C-state", so for me it was
>>>> anything higher than C1. If you prefer "deep" to be dropped - so be it.
>>> "Higher than C1" may be fine (albeit I vaguely recall TSC issues
>>> starting with C3 only), but at least mwait_idle() calls the
>>> function even for C1. As to the PV shim - does it know about any
>>> C-states at all (beyond HLT-invoked C1)?
>> Yes, PV-shim knows about C-states, as it looks like they are tied to the
>> processor ID in some cases. For AMD specifically C2 is HLT.
> The AMD part is pretty new, and is - afaict - an exception compared
> to everything else. Under PVH there's no respective ACPI data (iirc),
> and we also don't surface MONITOR/MWAIT to HVM/PVH guests, making
> mwait_idle_probe() bail early. I wonder whether this special
> behavior on AMD Fam17 should be suppressed in this case.

Anything which knows it is virtualised should use hypercalls to yield,
and failing that, hlt.

A VM isn't going to get a plausible idea of C/P states (give or take our
dom0 special case of seeing the real ACPI tables, which is still an open
problem).

~Andrew
Jan Beulich Jan. 15, 2020, 4:21 p.m. UTC | #18
On 15.01.2020 14:44, Roger Pau Monné wrote:
> On Wed, Jan 15, 2020 at 01:49:22PM +0100, Jan Beulich wrote:
>> What I'm then worried about is too
>> little progress observable by guests. The PV time protocol
>> ought to be fine in this regard (and consumers of raw TSC values
>> are on their own anyway), but wouldn't you need to update TSC
>> offsets of HVM guests in order to compensate for the elapsed
>> time?
> 
> That will be done when the HVM vCPU gets scheduled in as part of the
> update_vcpu_system_time call AFAICT. cstate_restore_tsc will always be
> called with the idle vCPU context, and hence there's always going to
> be a vCPU switch before scheduling anything else.

Which step would this be? All I see is a call to hvm_scale_tsc().
In time.c only tsc_set_info() calls hvm_set_tsc_offset().

Jan
Roger Pau Monné Jan. 16, 2020, 9:33 a.m. UTC | #19
On Wed, Jan 15, 2020 at 05:21:16PM +0100, Jan Beulich wrote:
> On 15.01.2020 14:44, Roger Pau Monné wrote:
> > On Wed, Jan 15, 2020 at 01:49:22PM +0100, Jan Beulich wrote:
> >> What I'm then worried about is too
> >> little progress observable by guests. The PV time protocol
> >> ought to be fine in this regard (and consumers of raw TSC values
> >> are on their own anyway), but wouldn't you need to update TSC
> >> offsets of HVM guests in order to compensate for the elapsed
> >> time?
> > 
> > That will be done when the HVM vCPU gets scheduled in as part of the
> > update_vcpu_system_time call AFAICT. cstate_restore_tsc will always be
> > called with the idle vCPU context, and hence there's always going to
> > be a vCPU switch before scheduling anything else.
> 
> Which step would this be? All I see is a call to hvm_scale_tsc().
> In time.c only tsc_set_info() calls hvm_set_tsc_offset().

My bad, I've mistaken the scaling for the offset.

Accounting for the offset in update_vcpu_system_time seems quite a bit
more complicated than just updating the TSC here, so:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

I also wonder whether it would be interesting to add at least an
assert to get_s_time_fixed in order to ensure that the TSC stamp is
always behind the current TSC value; this would likely have helped
catch this issue earlier.

Thanks, Roger.
Jan Beulich Jan. 16, 2020, 9:38 a.m. UTC | #20
On 16.01.2020 10:33, Roger Pau Monné wrote:
> On Wed, Jan 15, 2020 at 05:21:16PM +0100, Jan Beulich wrote:
>> On 15.01.2020 14:44, Roger Pau Monné wrote:
>>> On Wed, Jan 15, 2020 at 01:49:22PM +0100, Jan Beulich wrote:
>>>> What I'm then worried about is too
>>>> little progress observable by guests. The PV time protocol
>>>> ought to be fine in this regard (and consumers of raw TSC values
>>>> are on their own anyway), but wouldn't you need to update TSC
>>>> offsets of HVM guests in order to compensate for the elapsed
>>>> time?
>>>
>>> That will be done when the HVM vCPU gets scheduled in as part of the
>>> update_vcpu_system_time call AFAICT. cstate_restore_tsc will always be
>>> called with the idle vCPU context, and hence there's always going to
>>> be a vCPU switch before scheduling anything else.
>>
>> Which step would this be? All I see is a call to hvm_scale_tsc().
>> In time.c only tsc_set_info() calls hvm_set_tsc_offset().
> 
> My bad, I've mistaken the scaling for the offset.
> 
> Accounting for the offset in update_vcpu_system_time seems quite a bit
> more complicated than just updating the TSC here, so:
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

And then (preferably with "deep" dropped from the description,
if you, Igor, agree, and which can be done while committing)
Acked-by: Jan Beulich <jbeulich@suse.com>

Jan
Igor Druzhinin Jan. 16, 2020, 12:09 p.m. UTC | #21
On 16/01/2020 09:38, Jan Beulich wrote:
> On 16.01.2020 10:33, Roger Pau Monné wrote:
>> On Wed, Jan 15, 2020 at 05:21:16PM +0100, Jan Beulich wrote:
>>> On 15.01.2020 14:44, Roger Pau Monné wrote:
>>>> On Wed, Jan 15, 2020 at 01:49:22PM +0100, Jan Beulich wrote:
>>>>> What I'm then worried about is too
>>>>> little progress observable by guests. The PV time protocol
>>>>> ought to be fine in this regard (and consumers of raw TSC values
>>>>> are on their own anyway), but wouldn't you need to update TSC
>>>>> offsets of HVM guests in order to compensate for the elapsed
>>>>> time?
>>>>
>>>> That will be done when the HVM vCPU gets scheduled in as part of the
>>>> update_vcpu_system_time call AFAICT. cstate_restore_tsc will always be
>>>> called with the idle vCPU context, and hence there's always going to
>>>> be a vCPU switch before scheduling anything else.
>>>
>>> Which step would this be? All I see is a call to hvm_scale_tsc().
>>> In time.c only tsc_set_info() calls hvm_set_tsc_offset().
>>
>> My bad, I've mistaken the scaling for the offset.
>>
>> Accounting for the offset in update_vcpu_system_time seems quite a bit
>> more complicated than just updating the TSC here, so:
>>
>> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> And then (preferably with "deep" dropped from the description,
> if you, Igor, agree, and which can be done while committing)
> Acked-by: Jan Beulich <jbeulich@suse.com>

No objection. FYI I tested Roger's approach last night and it seems
to work, at least in the sense of fixing the original issue.

Igor
Roger Pau Monné Jan. 16, 2020, 12:25 p.m. UTC | #22
On Thu, Jan 16, 2020 at 12:09:12PM +0000, Igor Druzhinin wrote:
> On 16/01/2020 09:38, Jan Beulich wrote:
> > On 16.01.2020 10:33, Roger Pau Monné wrote:
> >> On Wed, Jan 15, 2020 at 05:21:16PM +0100, Jan Beulich wrote:
> >>> On 15.01.2020 14:44, Roger Pau Monné wrote:
> >>>> On Wed, Jan 15, 2020 at 01:49:22PM +0100, Jan Beulich wrote:
> >>>>> What I'm then worried about is too
> >>>>> little progress observable by guests. The PV time protocol
> >>>>> ought to be fine in this regard (and consumers of raw TSC values
> >>>>> are on their own anyway), but wouldn't you need to update TSC
> >>>>> offsets of HVM guests in order to compensate for the elapsed
> >>>>> time?
> >>>>
> >>>> That will be done when the HVM vCPU gets scheduled in as part of the
> >>>> update_vcpu_system_time call AFAICT. cstate_restore_tsc will always be
> >>>> called with the idle vCPU context, and hence there's always going to
> >>>> be a vCPU switch before scheduling anything else.
> >>>
> >>> Which step would this be? All I see is a call to hvm_scale_tsc().
> >>> In time.c only tsc_set_info() calls hvm_set_tsc_offset().
> >>
> >> My bad, I've mistaken the scaling for the offset.
> >>
> >> Accounting for the offset in update_vcpu_system_time seems quite a bit
> >> more complicated than just updating the TSC here, so:
> >>
> >> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> > 
> > And then (preferably with "deep" dropped from the description,
> > if you, Igor, agree, and which can be done while committing)
> > Acked-by: Jan Beulich <jbeulich@suse.com>
> 
> No objection. FYI I tested Roger's approach last night and it seems
> to work, at least in the sense of fixing the original issue.

Right, but HVM guests using a timekeeping approach similar to what Xen
does (read the platform timer periodically and calculate time based on
the last read plus the TSC delta) would get results that lag behind
real time AFAICT.

Roger.

Patch

diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index e79cb4d..f6b26f8 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -955,10 +955,16 @@  u64 stime2tsc(s_time_t stime)
 
 void cstate_restore_tsc(void)
 {
+    struct cpu_time *t = &this_cpu(cpu_time);
+
     if ( boot_cpu_has(X86_FEATURE_NONSTOP_TSC) )
         return;
 
-    write_tsc(stime2tsc(read_platform_stime(NULL)));
+    t->stamp.master_stime = read_platform_stime(NULL);
+    t->stamp.local_tsc = stime2tsc(t->stamp.master_stime);
+    t->stamp.local_stime = t->stamp.master_stime;
+
+    write_tsc(t->stamp.local_tsc);
 }
 
 /***************************************************************************