Message ID | alpine.DEB.2.20.1709252322120.2418@nanos (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 26/09/2017 00:00, Thomas Gleixner wrote: >> + * @read_with_stamp: saves cycles value (got from timekeeper) and cycles >> + * stamp (got from hardware counter value and used by >> + * timekeeper to calculate the cycles value) to >> + * corresponding input pointers return true if cycles >> + * stamp holds real cycles and false if it holds some >> + * time derivative value > > Neither the changelog nor this comment make any sense. Not to talk about > the actual TSC side implementation which does the same as tsc_read() - it > actually uses tsc_read() - but stores the same value twice and > unconditionally returns true. > > Unless I'm missing something important all of this can be achieved with a > minimal and actually understandable patch. See below. If that is acceptable, that's certainly okay for us too as a start, in order to clean up the KVM code. *However*, I must once more ask for help understanding ktime_get_snapshot, otherwise there is no way that I can start using it in kvmclock. In particular I'd like to understand the contract of ktime_get_snapshot, namely the meaning of the "cycles" field in struct system_time_snapshot. Does it have to be system clock cycles, like TSC, or can it be just the value of the clock source? And if the latter, are the users broken, because they can receive any random clocksource output in ->cycles? It doesn't help that, for all of the users of ->cycles (which are all reached from get_device_system_crosststamp), the code that actually uses the member is never called in the current git master. I asked these questions to John in the previous review, but got no answer. My understanding is that get_device_system_crosststamp is broken for !tsc clocksource. Because struct system_counterval_t must have a TSC value, history_begin needs to contain a cross-timestamp of the TSC (in ->cycles) and the clock (in ->real and ->raw). And if this cross-timestamp is not available, get_device_system_crosststamp needs to know, so that it doesn't use history_begin at all. Now, such cross-timestamp functionality is exactly what "read_with_stamp" provides. Patch 3 also introduces "->cycles_valid" in struct system_time_snapshot, to report whether the cross-timestamp is available. In the case of the TSC clocksource, of course, the two values of course are the same, and always available (so the function always returns true). However, if get_device_system_crosststamp ran with kvmclock or Hyper-V clocksource, the two values stored by read_with_stamp would be different, basically a (TSC, nanoseconds) pair. ktime_get_snapshot would then return the TSC in ->cycles, and use the nanoseconds value to compute ->real and ->raw. Because the cross timestamp is available, ->cycles_valid would be true. So, if history_begin _is_ broken, after fixing it (with "read_with_stamp" or otherwise) ktime_get_snapshot would provide KVM with everything it needs. If not, I'd be completely off base, and I'd have to apologize to Denis for screwing up. Paolo
On Tue, 26 Sep 2017, Paolo Bonzini wrote: > On 26/09/2017 00:00, Thomas Gleixner wrote: > >> + * @read_with_stamp: saves cycles value (got from timekeeper) and cycles > >> + * stamp (got from hardware counter value and used by > >> + * timekeeper to calculate the cycles value) to > >> + * corresponding input pointers return true if cycles > >> + * stamp holds real cycles and false if it holds some > >> + * time derivative value > > > > Neither the changelog nor this comment make any sense. Not to talk about > > the actual TSC side implementation which does the same as tsc_read() - it > > actually uses tsc_read() - but stores the same value twice and > > unconditionally returns true. > > > > Unless I'm missing something important all of this can be achieved with a > > minimal and actually understandable patch. See below. > > If that is acceptable, that's certainly okay for us too as a start, in > order to clean up the KVM code. > > *However*, I must once more ask for help understanding > ktime_get_snapshot, otherwise there is no way that I can start using it > in kvmclock. > > In particular I'd like to understand the contract of ktime_get_snapshot, > namely the meaning of the "cycles" field in struct system_time_snapshot. > > Does it have to be system clock cycles, like TSC, or can it be just the > value of the clock source? And if the latter, are the users broken, > because they can receive any random clocksource output in ->cycles? It > doesn't help that, for all of the users of ->cycles (which are all > reached from get_device_system_crosststamp), the code that actually uses > the member is never called in the current git master. stamp->cycles is always the raw counter value retrieved via clock->read(). clock is the current clock source. > I asked these questions to John in the previous review, but got no > answer. My understanding is that get_device_system_crosststamp is > broken for !tsc clocksource. > > Because struct system_counterval_t must have a TSC value, history_begin > needs to contain a cross-timestamp of the TSC (in ->cycles) and the clock > (in ->real and ->raw). And if this cross-timestamp is not available, > get_device_system_crosststamp needs to know, so that it doesn't use > history_begin at all. Forget about TSC. TSC is one particular use case. What's required here are two clocks which are strictly coupled, i.e. have a constant frequency ratio and a constant offset. And that is enforced: /* * Verify that the clocksource associated with the captured * system counter value is the same as the currently installed * timekeeper clocksource */ if (tk->tkr_mono.clock != system_counterval.cs) return -ENODEV; And that's not broken, that merily makes sure that the function CAN provide valid data and is not trying to correlate stuff which is not coupled. > Now, such cross-timestamp functionality is exactly what > "read_with_stamp" provides. No, it's a hack. > So, if history_begin _is_ broken, after fixing it (with > "read_with_stamp" or otherwise) ktime_get_snapshot would provide KVM > with everything it needs. If not, I'd be completely off base, and I'd > have to apologize to Denis for screwing up. I told you folks before that you are trying to abuse infrastructure which was designed for a different purpose. Now you're claiming it's broken. It's not broken, it works perfectly fine for the intended use cases. It correlates hardware captured timestamps with system time. One particular use case is high precision PTP. The network card can capture both PTP time and ART time atomically. ART and TSC have a constant ratio and constant offset. The cross time stamp machinery relates that values, which might be outside of the current timekeeping window, back to CLOCK_MONOTONIC and CLOCK_REALTIME. But there is no requirement that current clocksource is TSC. The requirement is: The hardware reference clock, the one which can be captured atomically with device clock (PTP, audio whatever), is the coupled clock of the current timekeeping clocksource. Both have a fixed ratio and offset. That's completely independend of TSC. TSC/ART are a particular hardware implementation which can use that infrastructure because they fulfil the requirement. So please stop these uninformed claims about brokeness and TSC requirements. Instead please sit down and figure out whether your particular use case of kvmclock/hyperv clock actually fit into that functionality, i.e. whether you have 1) 'device time' 2) 'system reference time' 3) 'system time' where #1 and #2 can be captured atomically #2 and #3 are coupled clocks with a fixed ratio and offset If those requirements are fulfilled then you can simply use it as is and it will give you CLOCK_MONOTONIC and CLOCK_REALTIME for the captured device time. If your use case is different and does not fit into that picture then please write it up clearly what you are trying to achieve and we can discuss how to make it work w/o adding duct tape hackery. Thanks, tglx
On Tue, 26 Sep 2017, Paolo Bonzini wrote: > However, if get_device_system_crosststamp ran with kvmclock or Hyper-V > clocksource, the two values stored by read_with_stamp would be > different, basically a (TSC, nanoseconds) pair. And that's exactly the problem. The cross time stamp infrastructure keeps the system clocksource (e.g. TSC) and the reference clock (e.g. ART) separate. There is enforcement that the two are coupled which I pointed out in the other mail, but conceptually they are different things and we are not going to change that and add voodoo callbacks which return different timestamps from different sources into the clocksource. So if you want to map kvmclock to the existing infrastructure then: device clock: maps to the hypervisor clock system reference clock: maps to REF_TSC system timekeeper clock: maps to TSC Even if REF_TSC and TSC are the same physically conceptually they are different. To make use of the existing infrastructure you have to provide means to capture HV clock and reference clock atomically. Thats probably something he hypervisor provides as a value pair HVCLOCK_NSEC, REF_TSC_VAL. And then you can correlate system time to HVCLOCK_NSEC via REF_TSC_VAL because that has a fixed ratio / offset to the system TSC. Thanks, tglx
On 27/09/2017 10:52, Thomas Gleixner wrote: > But there is no requirement that current clocksource is TSC. The > requirement is: > > The hardware reference clock, the one which can be captured atomically > with device clock (PTP, audio whatever), is the coupled clock of the > current timekeeping clocksource. Both have a fixed ratio and offset. > > That's completely independend of TSC. TSC/ART are a particular hardware > implementation which can use that infrastructure because they fulfil the > requirement. Ok, this helps. > So please stop these uninformed claims about brokeness and TSC > requirements. It was a question, not an uninformed claim. You answered the question now. > Instead please sit down and figure out whether your > particular use case of kvmclock/hyperv clock actually fit into that > functionality, i.e. whether you have > > 1) 'device time' > 2) 'system reference time' > 3) 'system time' > > where > > #1 and #2 can be captured atomically > > #2 and #3 are coupled clocks with a fixed ratio and offset > > If those requirements are fulfilled then you can simply use it as is and it > will give you CLOCK_MONOTONIC and CLOCK_REALTIME for the captured device > time. > > If your use case is different and does not fit into that picture then > please write it up clearly what you are trying to achieve and we can > discuss how to make it work w/o adding duct tape hackery. Yes, I understand better now why you consider read_with_stamp a hack. And it is---but I was confused and couldn't think of anything better. The definitions do fit KVM, and indeed there is ptp-kvm that does something very similar to what you describe in the other mail. We have: #1 is host time #2 is host TSC #3 is guest TSC We know that #2 and #3 have a fixed ratio and offset. The point is whether #1 and #2 can be captured atomically. For PTP-KVM, the host tells the guest; if capturing the two is impossible, it fails the hypercall and ioctl(PTP_SYS_OFFSET_PRECISE) fails too. Right now, the hypercall fails if the host clocksource is not the TSC clocksource, which is safe. These patches are about ascertaining whether #1 and #2 can be captured atomically in a more generic way. In the read_with_stamp case: - if it returns true, it gives an atomic reading of #1 and #2 - if it returns false, it gives a reading of #1 only. I think the hook should be specific to x86. For example it could be an array of function pointers, indexed by vclock_mode, with the same semantics as read_with_stamp. Paolo
On Wed, 27 Sep 2017, Paolo Bonzini wrote: > The definitions do fit KVM, and indeed there is ptp-kvm that does > something very similar to what you describe in the other mail. We have: > > #1 is host time > #2 is host TSC > #3 is guest TSC > > We know that #2 and #3 have a fixed ratio and offset. The point is > whether #1 and #2 can be captured atomically. > > For PTP-KVM, the host tells the guest; if capturing the two is > impossible, it fails the hypercall and ioctl(PTP_SYS_OFFSET_PRECISE) > fails too. > > Right now, the hypercall fails if the host clocksource is not the TSC > clocksource, which is safe. > > These patches are about ascertaining whether #1 and #2 can be captured > atomically in a more generic way. In the read_with_stamp case: > > - if it returns true, it gives an atomic reading of #1 and #2 > > - if it returns false, it gives a reading of #1 only. > > > I think the hook should be specific to x86. For example it could be an > array of function pointers, indexed by vclock_mode, with the same > semantics as read_with_stamp. I don't think you need that. The get_time_fn() which is handed in to get_device_system_crossstamp() can convey that information: /* * Try to synchronously capture device time and a system * counter value calling back into the device driver */ ret = get_time_fn(&xtstamp->device, &system_counterval, ctx); if (ret) return ret; So in your case get_time_fn() would be kvmclock or hyperv clock specific and the actual hypercall implementation can return a failure code if the requirements are not met: 1) host clock source is TSC 2) capturing of host time and TSC is atomic The host side of the hypercall can use ktime_get_snapshot() with the boot time extension and the extra bits I posted yesterday and fail if the snapshot is not valid for that purpose, i.e. host clocksource is not valid for guest correlation. The hypercall side on the guest can then do: if (hypercall(hypercalldata)) return -ENODEV; system_counterval->cs = tsc_clocksource; system_counterval->data = hypercalldata; return 0; Then the following check in get_device_system_crossstamp() will do the last sanity check: /* * Verify that the clocksource associated with the captured * system counter value is the same as the currently installed * timekeeper clocksource */ if (tk->tkr_mono.clock != system_counterval.cs) return -ENODEV; which catches the case where the guest side clock source is !TSC. Thanks, tglx
On 27/09/2017 13:53, Thomas Gleixner wrote: >> I think the hook should be specific to x86. For example it could be an >> array of function pointers, indexed by vclock_mode, with the same >> semantics as read_with_stamp. > I don't think you need that. > > The get_time_fn() which is handed in to get_device_system_crossstamp() can > convey that information: > > /* > * Try to synchronously capture device time and a system > * counter value calling back into the device driver > */ > ret = get_time_fn(&xtstamp->device, &system_counterval, ctx); > if (ret) > return ret; > > So in your case get_time_fn() would be kvmclock or hyperv clock specific > and the actual hypercall implementation can return a failure code if the > requirements are not met: > > 1) host clock source is TSC > 2) capturing of host time and TSC is atomic So you are suggesting reusing the cross-timestamp hypercall to implement nested pvclock. There are advantages and disadvantages to that. With read_with_stamp-like callbacks: + running on old KVM or on Hyper-V is supported - pvclock_gtod_copy does not go away With hypercall-based callbacks on the contrary: + KVM can use ktime_get_snapshot for the bare metal case - only very new KVM is supported Paolo
On Wed, 27 Sep 2017, Paolo Bonzini wrote: > On 27/09/2017 13:53, Thomas Gleixner wrote: > >> I think the hook should be specific to x86. For example it could be an > >> array of function pointers, indexed by vclock_mode, with the same > >> semantics as read_with_stamp. > > I don't think you need that. > > > > The get_time_fn() which is handed in to get_device_system_crossstamp() can > > convey that information: > > > > /* > > * Try to synchronously capture device time and a system > > * counter value calling back into the device driver > > */ > > ret = get_time_fn(&xtstamp->device, &system_counterval, ctx); > > if (ret) > > return ret; > > > > So in your case get_time_fn() would be kvmclock or hyperv clock specific > > and the actual hypercall implementation can return a failure code if the > > requirements are not met: > > > > 1) host clock source is TSC > > 2) capturing of host time and TSC is atomic > > So you are suggesting reusing the cross-timestamp hypercall to implement > nested pvclock. There are advantages and disadvantages to that. > > With read_with_stamp-like callbacks: > > + running on old KVM or on Hyper-V is supported > - pvclock_gtod_copy does not go away > > With hypercall-based callbacks on the contrary: > > + KVM can use ktime_get_snapshot for the bare metal case > - only very new KVM is supported I don't think that it's an either or decision. get_device_system_crossstamp(get_time_fn, ......) So you can have specific get_time_fn() implementations for your situation: old_kvm_fn() retrieve data from pvclock_gtod copy new_kvm_fn() use hypercall hyperv_fn() do what must be done All implementations need a way to tell you: 1) Host time 2) Host TSC timestamp which corresponds to #1 3) Validity For old_kvm_fn() pvclock_gtod_data.clock.vclock_mode == VCLOCK_TSC For new_kvm_fn() hypercall result For hyperv_fn() whatever it takes Thanks, tglx
--- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -1045,7 +1045,8 @@ static struct clocksource clocksource_ts .read = read_tsc, .mask = CLOCKSOURCE_MASK(64), .flags = CLOCK_SOURCE_IS_CONTINUOUS | - CLOCK_SOURCE_MUST_VERIFY, + CLOCK_SOURCE_MUST_VERIFY | + CLOCK_SOURCE_VALID_FOR_PVCLOCK_UPDATE, .archdata = { .vclock_mode = VCLOCK_TSC }, .resume = tsc_resume, .mark_unstable = tsc_cs_mark_unstable, --- a/include/linux/clocksource.h +++ b/include/linux/clocksource.h @@ -118,7 +118,9 @@ struct clocksource { #define CLOCK_SOURCE_VALID_FOR_HRES 0x20 #define CLOCK_SOURCE_UNSTABLE 0x40 #define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80 -#define CLOCK_SOURCE_RESELECT 0x100 +#define CLOCK_SOURCE_VALID_FOR_PVCLOCK_UPDATE 0x100 + +#define CLOCK_SOURCE_RESELECT 0x200 /* simplify initialization of mask field */ #define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0) --- a/include/linux/timekeeping.h +++ b/include/linux/timekeeping.h @@ -285,6 +285,8 @@ extern void ktime_get_raw_and_real_ts64( * @raw: Monotonic raw system time * @clock_was_set_seq: The sequence number of clock was set events * @cs_was_changed_seq: The sequence number of clocksource change events + * @valid_for_pvclock_update: @cycles is from a clocksource which + * can be used for pvclock updates */ struct system_time_snapshot { u64 cycles; @@ -292,6 +294,7 @@ struct system_time_snapshot { ktime_t raw; unsigned int clock_was_set_seq; u8 cs_was_changed_seq; + bool valid_for_pvclock_update; }; /* --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -948,7 +948,7 @@ time64_t __ktime_get_real_seconds(void) void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot) { struct timekeeper *tk = &tk_core.timekeeper; - unsigned long seq; + unsigned long seq, clk_flags; ktime_t base_raw; ktime_t base_real; u64 nsec_raw; @@ -960,6 +960,7 @@ void ktime_get_snapshot(struct system_ti do { seq = read_seqcount_begin(&tk_core.seq); now = tk_clock_read(&tk->tkr_mono); + clk_flags = tk->tkr_mono.clock->flags; systime_snapshot->cs_was_changed_seq = tk->cs_was_changed_seq; systime_snapshot->clock_was_set_seq = tk->clock_was_set_seq; base_real = ktime_add(tk->tkr_mono.base, @@ -970,6 +971,8 @@ void ktime_get_snapshot(struct system_ti } while (read_seqcount_retry(&tk_core.seq, seq)); systime_snapshot->cycles = now; + systime_snapshot->valid_for_pvclock_update = + clk_flags & CLOCK_SOURCE_VALID_FOR_PVCLOCK_UPDATE, systime_snapshot->real = ktime_add_ns(base_real, nsec_real); systime_snapshot->raw = ktime_add_ns(base_raw, nsec_raw); }