Message ID: 20220214110914.268126-4-adrian.hunter@intel.com (mailing list archive)
State: New, archived
Series: perf intel-pt: Add perf event clocks to better support VM tracing
On Mon, Feb 14, 2022 at 01:09:06PM +0200, Adrian Hunter wrote:
> Currently, when Intel PT is used within a VM guest, it is not possible to
> make use of TSC because perf clock is subject to paravirtualization.

Yeah, so how much of that still makes sense, or ever did? AFAIK the
whole pv_clock thing is utter crazy. Should we not fix that instead?
On 04/03/2022 15:41, Peter Zijlstra wrote:
> On Mon, Feb 14, 2022 at 01:09:06PM +0200, Adrian Hunter wrote:
>> Currently, when Intel PT is used within a VM guest, it is not possible to
>> make use of TSC because perf clock is subject to paravirtualization.
>
> Yeah, so how much of that still makes sense, or ever did? AFAIK the
> whole pv_clock thing is utter crazy. Should we not fix that instead?

Presumably pv_clock must work with different host operating systems.
Similarly, KVM must work with different guest operating systems.
Perhaps I'm wrong, but I imagine re-engineering time virtualization
might be a pretty big deal, far exceeding the scope of these patches.
While it is not something that I really need, it is also not obvious
that the virtualization people would see any benefit.

My primary goal is to be able to make a trace covering the host and
(Linux) guests. Intel PT can do that. It can trace straight through
VM-Entries/Exits, politely noting them on the way past. Perf tools
already supports decoding that, but only for tracing the kernel,
because it needs more information (so-called side-band events) to
decode guest user space. The simplest way to get that is to run perf
inside the guests to capture the side-band events, and then inject
them into the host perf.data file during post processing. That,
however, requires a clock that works for both host and guests. TSC is
suitable because KVM largely leaves it alone, except for VMX TSC
Offset and Scaling, but that has to be dealt with anyway because it
also affects the Intel PT trace.
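[For reference: the VMX TSC Offset and Scaling that Adrian mentions derive
the guest-visible TSC from the host TSC via two VMCS fields. A minimal
sketch of that derivation per the Intel SDM; the function name is
illustrative only, not kernel code, and the multiplier is a fixed-point
value with 48 fractional bits:]

#include <stdint.h>

/*
 * Illustrative only: how a guest RDTSC value relates to the host TSC
 * when the VMCS "use TSC scaling" and "use TSC offsetting" execution
 * controls are enabled (per the Intel SDM).
 */
static uint64_t guest_tsc_from_host(uint64_t host_tsc,
				    uint64_t tsc_multiplier, /* VMCS field, 48 fractional bits */
				    uint64_t tsc_offset)     /* VMCS field, treated as signed */
{
	/* 128-bit product of the host TSC and the fixed-point multiplier */
	unsigned __int128 prod = (unsigned __int128)host_tsc * tsc_multiplier;

	/* Discard the 48 fractional bits, then apply the offset */
	return (uint64_t)(prod >> 48) + tsc_offset;
}

[A decoder that knows both VMCS fields can invert this mapping, which is
why the series has to deal with offset/scaling for the Intel PT trace in
any case.]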
On Fri, Mar 04, 2022 at 08:27:45PM +0200, Adrian Hunter wrote:
> On 04/03/2022 15:41, Peter Zijlstra wrote:
> > On Mon, Feb 14, 2022 at 01:09:06PM +0200, Adrian Hunter wrote:
> >> Currently, when Intel PT is used within a VM guest, it is not possible to
> >> make use of TSC because perf clock is subject to paravirtualization.
> >
> > Yeah, so how much of that still makes sense, or ever did? AFAIK the
> > whole pv_clock thing is utter crazy. Should we not fix that instead?
>
> Presumably pv_clock must work with different host operating systems.
> Similarly, KVM must work with different guest operating systems.
> Perhaps I'm wrong, but I imagine re-engineering time virtualization
> might be a pretty big deal, far exceeding the scope of these patches.

I think not; on both counts. That is, I don't think it's going to be
hard, and even if it were, it would still be the right thing to do.

We're not going to add an interface just to work around a known broken
piece of crap just because we don't want to fix it.

So I'm thinking we should do the below and simply ignore any paravirt
sched clock offered when there's ART on.

---
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 4420499f7bb4..a1f179ed39bf 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
 
 void paravirt_set_sched_clock(u64 (*func)(void))
 {
+	/*
+	 * Anything with ART on promises to have sane TSC, otherwise the whole
+	 * ART thing is useless. In order to make ART useful for guests, we
+	 * should continue to use the TSC. As such, ignore any paravirt
+	 * muckery.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_ART))
+		return;
+
 	static_call_update(pv_sched_clock, func);
 }
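[Why returning early here is enough: sched_clock on x86 is routed through
a static call whose default target is the TSC-based native_sched_clock(),
and guest code only diverts it via paravirt_set_sched_clock(). Roughly how
the consumer side looks, abridged from arch/x86/include/asm/paravirt.h of
kernels of that era (details may differ by version):]

u64 dummy_sched_clock(void);
DECLARE_STATIC_CALL(pv_sched_clock, dummy_sched_clock);

static __always_inline u64 paravirt_sched_clock(void)
{
	/* patched call; stays native_sched_clock() if never updated */
	return static_call(pv_sched_clock)();
}

[So with the early return above, a guest with ART would simply keep the
native TSC-based sched_clock.]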
On 07.03.22 10:50, Peter Zijlstra wrote:
> On Fri, Mar 04, 2022 at 08:27:45PM +0200, Adrian Hunter wrote:
>> On 04/03/2022 15:41, Peter Zijlstra wrote:
>>> On Mon, Feb 14, 2022 at 01:09:06PM +0200, Adrian Hunter wrote:
>>>> Currently, when Intel PT is used within a VM guest, it is not possible to
>>>> make use of TSC because perf clock is subject to paravirtualization.
>>>
>>> Yeah, so how much of that still makes sense, or ever did? AFAIK the
>>> whole pv_clock thing is utter crazy. Should we not fix that instead?
>>
>> Presumably pv_clock must work with different host operating systems.
>> Similarly, KVM must work with different guest operating systems.
>> Perhaps I'm wrong, but I imagine re-engineering time virtualization
>> might be a pretty big deal, far exceeding the scope of these patches.
>
> I think not; on both counts. That is, I don't think it's going to be
> hard, and even if it were, it would still be the right thing to do.
>
> We're not going to add an interface just to work around a known broken
> piece of crap just because we don't want to fix it.
>
> So I'm thinking we should do the below and simply ignore any paravirt
> sched clock offered when there's ART on.
>
> ---
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index 4420499f7bb4..a1f179ed39bf 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
>
>  void paravirt_set_sched_clock(u64 (*func)(void))
>  {
> +	/*
> +	 * Anything with ART on promises to have sane TSC, otherwise the whole
> +	 * ART thing is useless. In order to make ART useful for guests, we
> +	 * should continue to use the TSC. As such, ignore any paravirt
> +	 * muckery.
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_ART))
> +		return;
> +
>  	static_call_update(pv_sched_clock, func);
>  }
>

NAK, this will break live migration of a guest coming from a host
without this feature.


Juergen
On Mon, Mar 07, 2022 at 11:06:46AM +0100, Juergen Gross wrote:
> > diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> > index 4420499f7bb4..a1f179ed39bf 100644
> > --- a/arch/x86/kernel/paravirt.c
> > +++ b/arch/x86/kernel/paravirt.c
> > @@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
> >
> >  void paravirt_set_sched_clock(u64 (*func)(void))
> >  {
> > +	/*
> > +	 * Anything with ART on promises to have sane TSC, otherwise the whole
> > +	 * ART thing is useless. In order to make ART useful for guests, we
> > +	 * should continue to use the TSC. As such, ignore any paravirt
> > +	 * muckery.
> > +	 */
> > +	if (cpu_feature_enabled(X86_FEATURE_ART))
> > +		return;
> > +
> >  	static_call_update(pv_sched_clock, func);
> >  }
> >
>
> NAK, this will break live migration of a guest coming from a host
> without this feature.

I thought the whole live-migration nonsense made sure to equalize crud
like that. That is, then don't expose ART to the guest.
On 07.03.22 11:38, Peter Zijlstra wrote:
> On Mon, Mar 07, 2022 at 11:06:46AM +0100, Juergen Gross wrote:
>
>>> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
>>> index 4420499f7bb4..a1f179ed39bf 100644
>>> --- a/arch/x86/kernel/paravirt.c
>>> +++ b/arch/x86/kernel/paravirt.c
>>> @@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
>>>
>>>  void paravirt_set_sched_clock(u64 (*func)(void))
>>>  {
>>> +	/*
>>> +	 * Anything with ART on promises to have sane TSC, otherwise the whole
>>> +	 * ART thing is useless. In order to make ART useful for guests, we
>>> +	 * should continue to use the TSC. As such, ignore any paravirt
>>> +	 * muckery.
>>> +	 */
>>> +	if (cpu_feature_enabled(X86_FEATURE_ART))
>>> +		return;
>>> +
>>>  	static_call_update(pv_sched_clock, func);
>>>  }
>>>
>>
>> NAK, this will break live migration of a guest coming from a host
>> without this feature.
>
> I thought the whole live-migration nonsense made sure to equalize crud
> like that. That is, then don't expose ART to the guest.

Oh, right. I managed to confuse host-side and guest-side usage.

Sorry for the noise.


Juergen
On 07/03/2022 11:50, Peter Zijlstra wrote:
> On Fri, Mar 04, 2022 at 08:27:45PM +0200, Adrian Hunter wrote:
>> On 04/03/2022 15:41, Peter Zijlstra wrote:
>>> On Mon, Feb 14, 2022 at 01:09:06PM +0200, Adrian Hunter wrote:
>>>> Currently, when Intel PT is used within a VM guest, it is not possible to
>>>> make use of TSC because perf clock is subject to paravirtualization.
>>>
>>> Yeah, so how much of that still makes sense, or ever did? AFAIK the
>>> whole pv_clock thing is utter crazy. Should we not fix that instead?
>>
>> Presumably pv_clock must work with different host operating systems.
>> Similarly, KVM must work with different guest operating systems.
>> Perhaps I'm wrong, but I imagine re-engineering time virtualization
>> might be a pretty big deal, far exceeding the scope of these patches.
>
> I think not; on both counts. That is, I don't think it's going to be
> hard, and even if it were, it would still be the right thing to do.
>
> We're not going to add an interface just to work around a known broken
> piece of crap just because we don't want to fix it.
>
> So I'm thinking we should do the below and simply ignore any paravirt
> sched clock offered when there's ART on.
>
> ---
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index 4420499f7bb4..a1f179ed39bf 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
>
>  void paravirt_set_sched_clock(u64 (*func)(void))
>  {
> +	/*
> +	 * Anything with ART on promises to have sane TSC, otherwise the whole
> +	 * ART thing is useless. In order to make ART useful for guests, we
> +	 * should continue to use the TSC. As such, ignore any paravirt
> +	 * muckery.
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_ART))

Does not seem to work because the feature X86_FEATURE_ART does not seem
to get set. Possibly because detect_art() excludes anything running on
a hypervisor.

> +		return;
> +
>  	static_call_update(pv_sched_clock, func);
>  }
>
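[For context, the check Adrian refers to lives in arch/x86/kernel/tsc.c.
Roughly, abridged from kernels of that era (exact lines may differ by
version):]

/* arch/x86/kernel/tsc.c (abridged) */
static void __init detect_art(void)
{
	unsigned int unused[2];

	if (boot_cpu_data.cpuid_level < ART_CPUID_LEAF)
		return;

	/*
	 * Don't enable ART in a VM; non-stop TSC and TSC_ADJUST are
	 * required, and TSC counter resets must not occur asynchronously.
	 */
	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
	    !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
	    !boot_cpu_has(X86_FEATURE_TSC_ADJUST) ||
	    tsc_async_resets)
		return;

	cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator,
	      &art_to_tsc_numerator, unused, unused + 1);

	if (art_to_tsc_denominator < ART_MIN_DENOMINATOR)
		return;

	rdmsrl(MSR_IA32_TSC_ADJUST, art_to_tsc_offset);

	/* Make this sticky over multiple CPU init calls */
	setup_force_cpu_cap(X86_FEATURE_ART);
}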
On Mon, Mar 07, 2022 at 02:36:03PM +0200, Adrian Hunter wrote:
> > diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> > index 4420499f7bb4..a1f179ed39bf 100644
> > --- a/arch/x86/kernel/paravirt.c
> > +++ b/arch/x86/kernel/paravirt.c
> > @@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
> >
> >  void paravirt_set_sched_clock(u64 (*func)(void))
> >  {
> > +	/*
> > +	 * Anything with ART on promises to have sane TSC, otherwise the whole
> > +	 * ART thing is useless. In order to make ART useful for guests, we
> > +	 * should continue to use the TSC. As such, ignore any paravirt
> > +	 * muckery.
> > +	 */
> > +	if (cpu_feature_enabled(X86_FEATURE_ART))
>
> Does not seem to work because the feature X86_FEATURE_ART does not seem
> to get set. Possibly because detect_art() excludes anything running on
> a hypervisor.

Simple enough to delete that clause I suppose. Christopher, what is
needed to make that go away? I suppose the guest needs to be aware of
the active TSC scaling parameters to make it work?
On 7.3.2022 16.42, Peter Zijlstra wrote:
> On Mon, Mar 07, 2022 at 02:36:03PM +0200, Adrian Hunter wrote:
>
>>> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
>>> index 4420499f7bb4..a1f179ed39bf 100644
>>> --- a/arch/x86/kernel/paravirt.c
>>> +++ b/arch/x86/kernel/paravirt.c
>>> @@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
>>>
>>>  void paravirt_set_sched_clock(u64 (*func)(void))
>>>  {
>>> +	/*
>>> +	 * Anything with ART on promises to have sane TSC, otherwise the whole
>>> +	 * ART thing is useless. In order to make ART useful for guests, we
>>> +	 * should continue to use the TSC. As such, ignore any paravirt
>>> +	 * muckery.
>>> +	 */
>>> +	if (cpu_feature_enabled(X86_FEATURE_ART))
>>
>> Does not seem to work because the feature X86_FEATURE_ART does not seem
>> to get set. Possibly because detect_art() excludes anything running on
>> a hypervisor.
>
> Simple enough to delete that clause I suppose. Christopher, what is
> needed to make that go away? I suppose the guest needs to be aware of
> the active TSC scaling parameters to make it work?

There is also no X86_FEATURE_NONSTOP_TSC, nor values for
art_to_tsc_denominator or art_to_tsc_numerator. Also, from the VM's
point of view, TSC will jump forwards on every VM-Exit / VM-Entry
unless the hypervisor changes the offset on every VM-Entry, which KVM
does not, so it still cannot be used as a stable clocksource.
Adrian Hunter wrote:
> On 7.3.2022 16.42, Peter Zijlstra wrote:
> > On Mon, Mar 07, 2022 at 02:36:03PM +0200, Adrian Hunter wrote:
> >
> >>> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> >>> index 4420499f7bb4..a1f179ed39bf 100644
> >>> --- a/arch/x86/kernel/paravirt.c
> >>> +++ b/arch/x86/kernel/paravirt.c
> >>> @@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
> >>>
> >>>  void paravirt_set_sched_clock(u64 (*func)(void))
> >>>  {
> >>> +	/*
> >>> +	 * Anything with ART on promises to have sane TSC, otherwise the whole
> >>> +	 * ART thing is useless. In order to make ART useful for guests, we
> >>> +	 * should continue to use the TSC. As such, ignore any paravirt
> >>> +	 * muckery.
> >>> +	 */
> >>> +	if (cpu_feature_enabled(X86_FEATURE_ART))
> >>
> >> Does not seem to work because the feature X86_FEATURE_ART does not seem
> >> to get set. Possibly because detect_art() excludes anything running on
> >> a hypervisor.
> >
> > Simple enough to delete that clause I suppose. Christopher, what is
> > needed to make that go away? I suppose the guest needs to be aware of
> > the active TSC scaling parameters to make it work?
>
> There is also no X86_FEATURE_NONSTOP_TSC, nor values for
> art_to_tsc_denominator or art_to_tsc_numerator. Also, from the VM's
> point of view, TSC will jump forwards on every VM-Exit / VM-Entry
> unless the hypervisor changes the offset on every VM-Entry, which KVM
> does not, so it still cannot be used as a stable clocksource.

Translating between ART and the guest TSC can be a difficult problem,
and ART software support is disabled by default in a VM.

There are two major issues translating ART to TSC in a VM:

The range of the TSC scaling field in the VMCS is much larger than the
range of values that can be represented using CPUID[15H], i.e. it is
not possible to communicate this to the VM using the current CPUID
interface. Either the range of scaling would need to be restricted, or
another para-virtualized method - preferably OS/hypervisor agnostic -
would need to be invented to communicate the scaling factor to the
guest.

TSC offsetting may also be a problem. The VMCS TSC offset must be
discoverable by the guest. This can be done via the TSC_ADJUST MSR.
The offset in the VMCS and the guest TSC_ADJUST MSR must always be
equivalent, i.e. a write to TSC_ADJUST in the guest must be reflected
in the VMCS, and any changes to the offset in the VMCS must be
reflected in the TSC_ADJUST MSR. Otherwise a para-virtualized method
must be invented to communicate an arbitrary VMCS TSC offset to the
guest.
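[The CPUID[15H] interface Christopher refers to expresses the ART-to-TSC
relationship as a simple ratio plus an offset. A sketch of that
relationship; the function and parameter names are illustrative, the
formula follows the Intel SDM:]

#include <stdint.h>

/*
 * Per the Intel SDM: TSC = ART * (CPUID.15H:EBX / CPUID.15H:EAX) + K,
 * where K is an offset (related to IA32_TSC_ADJUST on bare metal).
 * The VM problem: an arbitrary VMCS multiplier/offset pair generally
 * cannot be expressed as this 32-bit numerator/denominator plus K.
 */
static uint64_t art_to_tsc(uint64_t art,
			   uint32_t numerator,   /* CPUID.15H:EBX */
			   uint32_t denominator, /* CPUID.15H:EAX */
			   int64_t k)
{
	unsigned __int128 tmp = (unsigned __int128)art * numerator;

	return (uint64_t)(tmp / denominator) + k;
}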
On 08/03/2022 23:06, Hall, Christopher S wrote:
> Adrian Hunter wrote:
>> On 7.3.2022 16.42, Peter Zijlstra wrote:
>>> On Mon, Mar 07, 2022 at 02:36:03PM +0200, Adrian Hunter wrote:
>>>
>>>>> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
>>>>> index 4420499f7bb4..a1f179ed39bf 100644
>>>>> --- a/arch/x86/kernel/paravirt.c
>>>>> +++ b/arch/x86/kernel/paravirt.c
>>>>> @@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
>>>>>
>>>>>  void paravirt_set_sched_clock(u64 (*func)(void))
>>>>>  {
>>>>> +	/*
>>>>> +	 * Anything with ART on promises to have sane TSC, otherwise the whole
>>>>> +	 * ART thing is useless. In order to make ART useful for guests, we
>>>>> +	 * should continue to use the TSC. As such, ignore any paravirt
>>>>> +	 * muckery.
>>>>> +	 */
>>>>> +	if (cpu_feature_enabled(X86_FEATURE_ART))
>>>>
>>>> Does not seem to work because the feature X86_FEATURE_ART does not seem
>>>> to get set. Possibly because detect_art() excludes anything running on
>>>> a hypervisor.
>>>
>>> Simple enough to delete that clause I suppose. Christopher, what is
>>> needed to make that go away? I suppose the guest needs to be aware of
>>> the active TSC scaling parameters to make it work?
>>
>> There is also no X86_FEATURE_NONSTOP_TSC, nor values for
>> art_to_tsc_denominator or art_to_tsc_numerator. Also, from the VM's
>> point of view, TSC will jump forwards on every VM-Exit / VM-Entry
>> unless the hypervisor changes the offset on every VM-Entry, which KVM
>> does not, so it still cannot be used as a stable clocksource.
>
> Translating between ART and the guest TSC can be a difficult problem,
> and ART software support is disabled by default in a VM.
>
> There are two major issues translating ART to TSC in a VM:
>
> The range of the TSC scaling field in the VMCS is much larger than the
> range of values that can be represented using CPUID[15H], i.e. it is
> not possible to communicate this to the VM using the current CPUID
> interface. Either the range of scaling would need to be restricted, or
> another para-virtualized method - preferably OS/hypervisor agnostic -
> would need to be invented to communicate the scaling factor to the
> guest.
>
> TSC offsetting may also be a problem. The VMCS TSC offset must be
> discoverable by the guest. This can be done via the TSC_ADJUST MSR.
> The offset in the VMCS and the guest TSC_ADJUST MSR must always be
> equivalent, i.e. a write to TSC_ADJUST in the guest must be reflected
> in the VMCS, and any changes to the offset in the VMCS must be
> reflected in the TSC_ADJUST MSR. Otherwise a para-virtualized method
> must be invented to communicate an arbitrary VMCS TSC offset to the
> guest.
>

In my view it is reasonable for perf to support TSC as a perf clock in
any case because:
 a) it allows users to work entirely with TSC if they wish
 b) other kernel performance / debug facilities like ftrace already
    support TSC
 c) the patches to add TSC support are relatively small and
    straight-forward

May we have support for TSC as a perf event clock?
On 14/03/22 13:50, Adrian Hunter wrote:
> On 08/03/2022 23:06, Hall, Christopher S wrote:
>> Adrian Hunter wrote:
>>> On 7.3.2022 16.42, Peter Zijlstra wrote:
>>>> On Mon, Mar 07, 2022 at 02:36:03PM +0200, Adrian Hunter wrote:
>>>>
>>>>>> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
>>>>>> index 4420499f7bb4..a1f179ed39bf 100644
>>>>>> --- a/arch/x86/kernel/paravirt.c
>>>>>> +++ b/arch/x86/kernel/paravirt.c
>>>>>> @@ -145,6 +145,15 @@ DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
>>>>>>
>>>>>>  void paravirt_set_sched_clock(u64 (*func)(void))
>>>>>>  {
>>>>>> +	/*
>>>>>> +	 * Anything with ART on promises to have sane TSC, otherwise the whole
>>>>>> +	 * ART thing is useless. In order to make ART useful for guests, we
>>>>>> +	 * should continue to use the TSC. As such, ignore any paravirt
>>>>>> +	 * muckery.
>>>>>> +	 */
>>>>>> +	if (cpu_feature_enabled(X86_FEATURE_ART))
>>>>>
>>>>> Does not seem to work because the feature X86_FEATURE_ART does not seem
>>>>> to get set. Possibly because detect_art() excludes anything running on
>>>>> a hypervisor.
>>>>
>>>> Simple enough to delete that clause I suppose. Christopher, what is
>>>> needed to make that go away? I suppose the guest needs to be aware of
>>>> the active TSC scaling parameters to make it work?
>>>
>>> There is also no X86_FEATURE_NONSTOP_TSC, nor values for
>>> art_to_tsc_denominator or art_to_tsc_numerator. Also, from the VM's
>>> point of view, TSC will jump forwards on every VM-Exit / VM-Entry
>>> unless the hypervisor changes the offset on every VM-Entry, which KVM
>>> does not, so it still cannot be used as a stable clocksource.
>>
>> Translating between ART and the guest TSC can be a difficult problem,
>> and ART software support is disabled by default in a VM.
>>
>> There are two major issues translating ART to TSC in a VM:
>>
>> The range of the TSC scaling field in the VMCS is much larger than the
>> range of values that can be represented using CPUID[15H], i.e. it is
>> not possible to communicate this to the VM using the current CPUID
>> interface. Either the range of scaling would need to be restricted, or
>> another para-virtualized method - preferably OS/hypervisor agnostic -
>> would need to be invented to communicate the scaling factor to the
>> guest.
>>
>> TSC offsetting may also be a problem. The VMCS TSC offset must be
>> discoverable by the guest. This can be done via the TSC_ADJUST MSR.
>> The offset in the VMCS and the guest TSC_ADJUST MSR must always be
>> equivalent, i.e. a write to TSC_ADJUST in the guest must be reflected
>> in the VMCS, and any changes to the offset in the VMCS must be
>> reflected in the TSC_ADJUST MSR. Otherwise a para-virtualized method
>> must be invented to communicate an arbitrary VMCS TSC offset to the
>> guest.
>>
>
> In my view it is reasonable for perf to support TSC as a perf clock in
> any case because:
>  a) it allows users to work entirely with TSC if they wish
>  b) other kernel performance / debug facilities like ftrace already
>     support TSC
>  c) the patches to add TSC support are relatively small and
>     straight-forward
>
> May we have support for TSC as a perf event clock?

Any update on this?
On Mon, Apr 25 2022 at 08:30, Adrian Hunter wrote:
> On 14/03/22 13:50, Adrian Hunter wrote:
>>> TSC offsetting may also be a problem. The VMCS TSC offset must be
>>> discoverable by the guest. This can be done via the TSC_ADJUST MSR.
>>> The offset in the VMCS and the guest TSC_ADJUST MSR must always be
>>> equivalent, i.e. a write to TSC_ADJUST in the guest must be reflected
>>> in the VMCS, and any changes to the offset in the VMCS must be
>>> reflected in the TSC_ADJUST MSR. Otherwise a para-virtualized method
>>> must be invented to communicate an arbitrary VMCS TSC offset to the
>>> guest.
>>>
>>
>> In my view it is reasonable for perf to support TSC as a perf clock in
>> any case because:
>>  a) it allows users to work entirely with TSC if they wish
>>  b) other kernel performance / debug facilities like ftrace already
>>     support TSC
>>  c) the patches to add TSC support are relatively small and
>>     straight-forward
>>
>> May we have support for TSC as a perf event clock?
>
> Any update on this?

If TSC is reliable on the host, then there is absolutely no reason not
to use it in the guest all over the place. And that is independent of
exposing ART to the guest.

So why do we need extra solutions for PT and perf, ftrace and whatever?

Can we just fix the underlying problem and make the hypervisor tell the
guest that TSC is stable, reliable and good to use?

Then everything else just falls into place and using TSC is a
substantial performance gain in general. Just look at the VDSO
implementation of __arch_get_hw_counter() -> vread_pvclock():

Instead of just reading the TSC, this needs to take a nested seqcount,
read TSC and do yet another mult/shift, which makes clock_gettime()
~20% slower than necessary.

It's hilarious that we still cling to this pvclock abomination, while
we happily expose the TSC deadline timer to the guest. TSC virt scaling
was implemented in hardware for a reason.

Thanks,

        tglx
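[The overhead Thomas describes can be seen by comparing the two read
paths. A simplified rendering modelled on the x86 pvclock helpers
(abridged, not a verbatim copy of the vDSO code):]

/* Simplified sketch of the pvclock read path */
static u64 pvclock_read(const struct pvclock_vcpu_time_info *pvti)
{
	u32 version;
	u64 tsc, ns;

	do {
		version = pvclock_read_begin(pvti);	/* seqcount begin */
		tsc = rdtsc_ordered();			/* the actual counter read */
		ns = pvti->system_time +		/* extra mult/shift on top */
		     pvclock_scale_delta(tsc - pvti->tsc_timestamp,
					 pvti->tsc_to_system_mul,
					 pvti->tsc_shift);
	} while (pvclock_read_retry(pvti, version));	/* seqcount retry */

	return ns;
}

[With a plain TSC clocksource, __arch_get_hw_counter() reduces to the
single rdtsc_ordered() line alone.]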
On 25/04/22 12:32, Thomas Gleixner wrote:
> On Mon, Apr 25 2022 at 08:30, Adrian Hunter wrote:
>> On 14/03/22 13:50, Adrian Hunter wrote:
>>>> TSC offsetting may also be a problem. The VMCS TSC offset must be
>>>> discoverable by the guest. This can be done via the TSC_ADJUST MSR.
>>>> The offset in the VMCS and the guest TSC_ADJUST MSR must always be
>>>> equivalent, i.e. a write to TSC_ADJUST in the guest must be reflected
>>>> in the VMCS, and any changes to the offset in the VMCS must be
>>>> reflected in the TSC_ADJUST MSR. Otherwise a para-virtualized method
>>>> must be invented to communicate an arbitrary VMCS TSC offset to the
>>>> guest.
>>>>
>>>
>>> In my view it is reasonable for perf to support TSC as a perf clock in
>>> any case because:
>>>  a) it allows users to work entirely with TSC if they wish
>>>  b) other kernel performance / debug facilities like ftrace already
>>>     support TSC
>>>  c) the patches to add TSC support are relatively small and
>>>     straight-forward
>>>
>>> May we have support for TSC as a perf event clock?
>>
>> Any update on this?
>
> If TSC is reliable on the host, then there is absolutely no reason not
> to use it in the guest all over the place. And that is independent of
> exposing ART to the guest.
>
> So why do we need extra solutions for PT and perf, ftrace and whatever?
>
> Can we just fix the underlying problem and make the hypervisor tell the
> guest that TSC is stable, reliable and good to use?
>
> Then everything else just falls into place and using TSC is a
> substantial performance gain in general. Just look at the VDSO
> implementation of __arch_get_hw_counter() -> vread_pvclock():
>
> Instead of just reading the TSC, this needs to take a nested seqcount,
> read TSC and do yet another mult/shift, which makes clock_gettime()
> ~20% slower than necessary.
>
> It's hilarious that we still cling to this pvclock abomination, while
> we happily expose the TSC deadline timer to the guest. TSC virt scaling
> was implemented in hardware for a reason.

So you are talking about changing the VMX TSC Offset on every VM-Entry
to try to hide the time jumps when the VM is scheduled out? Or neglect
that and just let the time jumps happen?

If changing the VMX TSC Offset, how can TSC be kept consistent between
each VCPU, i.e. wouldn't that mean each VCPU has to have the same VMX
TSC Offset?
On Mon, Apr 25 2022 at 16:15, Adrian Hunter wrote:
> On 25/04/22 12:32, Thomas Gleixner wrote:
>> It's hilarious that we still cling to this pvclock abomination, while
>> we happily expose the TSC deadline timer to the guest. TSC virt scaling
>> was implemented in hardware for a reason.
>
> So you are talking about changing the VMX TSC Offset on every VM-Entry
> to try to hide the time jumps when the VM is scheduled out? Or neglect
> that and just let the time jumps happen?
>
> If changing the VMX TSC Offset, how can TSC be kept consistent between
> each VCPU, i.e. wouldn't that mean each VCPU has to have the same VMX
> TSC Offset?

Obviously so. That's the only thing which makes sense, no?

Thanks,

        tglx
On 25/04/22 20:05, Thomas Gleixner wrote:
> On Mon, Apr 25 2022 at 16:15, Adrian Hunter wrote:
>> On 25/04/22 12:32, Thomas Gleixner wrote:
>>> It's hilarious that we still cling to this pvclock abomination, while
>>> we happily expose the TSC deadline timer to the guest. TSC virt scaling
>>> was implemented in hardware for a reason.
>>
>> So you are talking about changing the VMX TSC Offset on every VM-Entry
>> to try to hide the time jumps when the VM is scheduled out? Or neglect
>> that and just let the time jumps happen?
>>
>> If changing the VMX TSC Offset, how can TSC be kept consistent between
>> each VCPU, i.e. wouldn't that mean each VCPU has to have the same VMX
>> TSC Offset?
>
> Obviously so. That's the only thing which makes sense, no?

[ Sending this again, because I notice I messed up the email "From" ]

But wouldn't that mean changing all the VCPUs' VMX TSC Offsets at the
same time, which means when none are currently executing? How could
that be done?
On Tue, Apr 26 2022 at 09:51, Adrian Hunter wrote:
> On 25/04/22 20:05, Thomas Gleixner wrote:
>> On Mon, Apr 25 2022 at 16:15, Adrian Hunter wrote:
>>> On 25/04/22 12:32, Thomas Gleixner wrote:
>>>> It's hilarious that we still cling to this pvclock abomination, while
>>>> we happily expose the TSC deadline timer to the guest. TSC virt scaling
>>>> was implemented in hardware for a reason.
>>>
>>> So you are talking about changing the VMX TSC Offset on every VM-Entry
>>> to try to hide the time jumps when the VM is scheduled out? Or neglect
>>> that and just let the time jumps happen?
>>>
>>> If changing the VMX TSC Offset, how can TSC be kept consistent between
>>> each VCPU, i.e. wouldn't that mean each VCPU has to have the same VMX
>>> TSC Offset?
>>
>> Obviously so. That's the only thing which makes sense, no?
>
> [ Sending this again, because I notice I messed up the email "From" ]
>
> But wouldn't that mean changing all the VCPUs' VMX TSC Offsets at the
> same time, which means when none are currently executing? How could
> that be done?

Why would you change the TSC offset after the point where a VM is
started, and why would it be different per vCPU?

Time is global and time moves on when a vCPU is scheduled out. Anything
else is bonkers, really. If the hypervisor tries to screw with that,
then how does the guest do timekeeping in a consistent way?

      CLOCK_REALTIME = CLOCK_MONOTONIC + offset

That offset changes when something sets the clock, i.e. clock_settime(),
settimeofday() or adjtimex() in case NTP cannot compensate, or for the
beloved leap seconds adjustment. At any other time the offset is
constant.

CLOCK_MONOTONIC is derived from the underlying clocksource, which is
expected to increment with constant frequency, and that has to be
consistent across _all_ vCPUs of a particular VM.

So how would a hypervisor 'hide' scheduled-out time w/o screwing up
timekeeping completely?

The guest TSC, which is based on the host TSC, is:

      guestTSC = offset + hostTSC * factor;

If you make the offset different between guest vCPUs, then timekeeping
in the guest is screwed.

The whole point of that paravirt clock was to handle migration between
hosts which did not have the VMCS TSC scaling/offset mechanism. The CPUs
which did not have that went EOL at least 10 years ago.

So what are you concerned about?

Thanks,

        tglx
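[Thomas's point about per-vCPU offsets can be made concrete with made-up
numbers: if each vCPU carried its own offset to hide its scheduled-out
time, a thread migrating between vCPUs could see the clock go backwards.
An illustrative, self-contained sketch:]

#include <stdint.h>
#include <stdio.h>

/* guestTSC = offset + hostTSC * factor, with factor == 1 for simplicity */
static uint64_t guest_tsc(uint64_t host_tsc, int64_t offset)
{
	return host_tsc + offset;
}

int main(void)
{
	/* vCPU1 "hid" 5000 cycles of scheduled-out time => offset -5000 */
	uint64_t t0 = guest_tsc(100000, 0);	/* read on vCPU0        */
	uint64_t t1 = guest_tsc(101000, -5000);	/* read later, on vCPU1 */

	/* prints: t0=100000 t1=96000 -> time went backwards */
	printf("t0=%llu t1=%llu -> %s\n",
	       (unsigned long long)t0, (unsigned long long)t1,
	       t1 < t0 ? "time went backwards" : "monotonic");
	return 0;
}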
On 28/04/22 02:10, Thomas Gleixner wrote:
> On Tue, Apr 26 2022 at 09:51, Adrian Hunter wrote:
>> On 25/04/22 20:05, Thomas Gleixner wrote:
>>> On Mon, Apr 25 2022 at 16:15, Adrian Hunter wrote:
>>>> On 25/04/22 12:32, Thomas Gleixner wrote:
>>>>> It's hilarious that we still cling to this pvclock abomination, while
>>>>> we happily expose the TSC deadline timer to the guest. TSC virt scaling
>>>>> was implemented in hardware for a reason.
>>>>
>>>> So you are talking about changing the VMX TSC Offset on every VM-Entry
>>>> to try to hide the time jumps when the VM is scheduled out? Or neglect
>>>> that and just let the time jumps happen?
>>>>
>>>> If changing the VMX TSC Offset, how can TSC be kept consistent between
>>>> each VCPU, i.e. wouldn't that mean each VCPU has to have the same VMX
>>>> TSC Offset?
>>>
>>> Obviously so. That's the only thing which makes sense, no?
>>
>> [ Sending this again, because I notice I messed up the email "From" ]
>>
>> But wouldn't that mean changing all the VCPUs' VMX TSC Offsets at the
>> same time, which means when none are currently executing? How could
>> that be done?
>
> Why would you change the TSC offset after the point where a VM is
> started, and why would it be different per vCPU?
>
> Time is global and time moves on when a vCPU is scheduled out. Anything
> else is bonkers, really. If the hypervisor tries to screw with that,
> then how does the guest do timekeeping in a consistent way?
>
>       CLOCK_REALTIME = CLOCK_MONOTONIC + offset
>
> That offset changes when something sets the clock, i.e. clock_settime(),
> settimeofday() or adjtimex() in case NTP cannot compensate, or for the
> beloved leap seconds adjustment. At any other time the offset is
> constant.
>
> CLOCK_MONOTONIC is derived from the underlying clocksource, which is
> expected to increment with constant frequency, and that has to be
> consistent across _all_ vCPUs of a particular VM.
>
> So how would a hypervisor 'hide' scheduled-out time w/o screwing up
> timekeeping completely?
>
> The guest TSC, which is based on the host TSC, is:
>
>       guestTSC = offset + hostTSC * factor;
>
> If you make the offset different between guest vCPUs, then timekeeping
> in the guest is screwed.
>
> The whole point of that paravirt clock was to handle migration between
> hosts which did not have the VMCS TSC scaling/offset mechanism. The CPUs
> which did not have that went EOL at least 10 years ago.
>
> So what are you concerned about?

Thanks for the explanation. Changing TSC offset / scaling makes it much
harder for Intel PT on the host to use, so there is no sense in my
pushing for that at this time, when there is anyway the kernel option
no-kvmclock.
Currently, when Intel PT is used within a VM guest, it is not possible
to make use of TSC because perf clock is subject to paravirtualization.

If the hypervisor leaves rdtsc alone, the TSC value will be subject
only to the VMCS TSC Offset and Scaling, the same as the TSC packet
from Intel PT. The new clock is based on rdtsc and not subject to
paravirtualization. Hence it would be possible to use this new clock
for Intel PT decoding within a VM guest.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 arch/x86/events/core.c            | 41 ++++++++++++++++++++-----------
 arch/x86/include/asm/perf_event.h |  2 ++
 include/uapi/linux/perf_event.h   |  6 +++++
 kernel/events/core.c              |  6 +++++
 4 files changed, 40 insertions(+), 15 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 51d5345de30a..905975a7d475 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -41,6 +41,7 @@
 #include <asm/desc.h>
 #include <asm/ldt.h>
 #include <asm/unwind.h>
+#include <asm/tsc.h>
 
 #include "perf_event.h"
 
@@ -2728,18 +2729,26 @@ void arch_perf_update_userpage(struct perf_event *event,
 		!!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
 	userpg->pmc_width = x86_pmu.cntval_bits;
 
-	if (event->attr.use_clockid &&
-	    event->attr.ns_clockid &&
-	    event->attr.clockid == CLOCK_PERF_HW_CLOCK) {
-		userpg->cap_user_time_zero = 1;
-		userpg->time_mult = 1;
-		userpg->time_shift = 0;
-		userpg->time_offset = 0;
-		userpg->time_zero = 0;
-		return;
+	if (event->attr.use_clockid && event->attr.ns_clockid) {
+		if (event->attr.clockid == CLOCK_PERF_HW_CLOCK) {
+			userpg->cap_user_time_zero = 1;
+			userpg->time_mult = 1;
+			userpg->time_shift = 0;
+			userpg->time_offset = 0;
+			userpg->time_zero = 0;
+			return;
+		}
+		if (event->attr.clockid == CLOCK_PERF_HW_CLOCK_NS)
+			userpg->cap_user_time_zero = 1;
+	}
+
+	if (using_native_sched_clock() && sched_clock_stable()) {
+		userpg->cap_user_time = 1;
+		if (!event->attr.use_clockid)
+			userpg->cap_user_time_zero = 1;
 	}
 
-	if (!using_native_sched_clock() || !sched_clock_stable())
+	if (!userpg->cap_user_time && !userpg->cap_user_time_zero)
 		return;
 
 	cyc2ns_read_begin(&data);
@@ -2750,19 +2759,16 @@ void arch_perf_update_userpage(struct perf_event *event,
 	 * Internal timekeeping for enabled/running/stopped times
 	 * is always in the local_clock domain.
 	 */
-	userpg->cap_user_time = 1;
 	userpg->time_mult = data.cyc2ns_mul;
 	userpg->time_shift = data.cyc2ns_shift;
 	userpg->time_offset = offset - now;
 
 	/*
 	 * cap_user_time_zero doesn't make sense when we're using a different
-	 * time base for the records.
+	 * time base for the records, except for CLOCK_PERF_HW_CLOCK_NS.
 	 */
-	if (!event->attr.use_clockid) {
-		userpg->cap_user_time_zero = 1;
+	if (userpg->cap_user_time_zero)
 		userpg->time_zero = offset;
-	}
 
 	cyc2ns_read_end();
 }
@@ -2996,6 +3002,11 @@ u64 perf_hw_clock(void)
 	return rdtsc_ordered();
 }
 
+u64 perf_hw_clock_ns(void)
+{
+	return native_sched_clock_from_tsc(perf_hw_clock());
+}
+
 void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 {
 	cap->version = x86_pmu.version;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 5288ea1ae2ba..46cbca90cdd1 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -453,6 +453,8 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);
 
 extern u64 perf_hw_clock(void);
 #define perf_hw_clock perf_hw_clock
+extern u64 perf_hw_clock_ns(void);
+#define perf_hw_clock_ns perf_hw_clock_ns
 
 #include <asm/stacktrace.h>
 
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index e8617efd552b..0edc005f8ddf 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -298,6 +298,12 @@ enum {
  * paravirtualized.
  */
 #define CLOCK_PERF_HW_CLOCK		0x10000000
+/*
+ * Same as CLOCK_PERF_HW_CLOCK but in nanoseconds. Note support of
+ * CLOCK_PERF_HW_CLOCK_NS does not necessarily imply support of
+ * CLOCK_PERF_HW_CLOCK or vice versa.
+ */
+#define CLOCK_PERF_HW_CLOCK_NS		0x10000001
 
 /*
  * The format of the data returned by read() on a perf event fd,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 15dee265a5b9..65e70fb669fd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12019,6 +12019,12 @@ static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id, bool
 		event->clock = &perf_hw_clock;
 		nmi_safe = true;
 		break;
+#endif
+#ifdef perf_hw_clock_ns
+	case CLOCK_PERF_HW_CLOCK_NS:
+		event->clock = &perf_hw_clock_ns;
+		nmi_safe = true;
+		break;
 #endif
 	default:
 		return -EINVAL;
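[To round out the picture, a sketch of how a tool might select the
proposed clock via perf_event_open(). The CLOCK_PERF_HW_CLOCK* values and
the ns_clockid attribute bit come from this series and are not upstream
ABI, so the clockid value is defined locally here and the example only
builds against headers with the series applied:]

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>

#define CLOCK_PERF_HW_CLOCK_NS	0x10000001	/* from this series */

static int open_dummy_event_on_tsc_ns(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_DUMMY;
	attr.use_clockid = 1;
	attr.ns_clockid = 1;	/* attr bit added earlier in this series */
	attr.clockid = CLOCK_PERF_HW_CLOCK_NS;

	/* this process, any CPU, no group leader, no flags */
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}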