Message ID | 1435157697-28579-1-git-send-email-marc.zyngier@arm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed, Jun 24, 2015 at 03:54:57PM +0100, Marc Zyngier wrote: > Userspace is allowed to set the guest's view of CNTVCT, which turns > into setting CNTVOFF for the whole VM. One thing userspace is not supposed > to do is to update that register while the guest is running. Time will > either move forward (best case) or backward (really bad idea). Either way, > this shouldn't happen. > > This patch prevents userspace from touching CNTVOFF as soon as a vcpu > has been started. This ensures that time will keep monotonically increase. > > Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> > --- > > QEMU seems to trigger this at boot time, and I have no idea why it does so. > It would be good to find out, hence the RFC tag. Is this at kernel boot time you see this, or at system startup time? IIRC, QEMU creates a throw-away VM with the default CPU target time, reads out all the system registers to get the KVM reset values of those, then creates the real VM, and feeds back in all the system register reset values, as a method for QEMU and KVM to be in sync about the reset state of the machine. If we do this, and include CNTVCT, then that would probably trigger this, but the VCPU really shouldn't have been run at that time... We should prevent userspace from fiddling with this register post VCPU start regardless, but yes, it would be good to find out why this is happening in the first place. How did you notice this and does it manifest itself in some user-visible ugliness? Thanks, -Christoffer > > virt/kvm/arm/arch_timer.c | 21 +++++++++++++++++++-- > 1 file changed, 19 insertions(+), 2 deletions(-) > > diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c > index 98c95f2..660f348 100644 > --- a/virt/kvm/arm/arch_timer.c > +++ b/virt/kvm/arm/arch_timer.c > @@ -220,9 +220,26 @@ int kvm_arm_timer_set_reg(struct kvm_vcpu *vcpu, u64 regid, u64 value) > case KVM_REG_ARM_TIMER_CTL: > timer->cntv_ctl = value; > break; > - case KVM_REG_ARM_TIMER_CNT: > - vcpu->kvm->arch.timer.cntvoff = kvm_phys_timer_read() - value; > + case KVM_REG_ARM_TIMER_CNT: { > + struct kvm_vcpu *tmp; > + int i; > + u64 cntvoff = kvm_phys_timer_read() - value; > + > + /* > + * If any of the vcpus has started, don't update > + * CNTVOFF. Userspace is severely brain damaged. > + */ > + kvm_for_each_vcpu(i, tmp, vcpu->kvm) { > + if (tmp->arch.has_run_once) { > + kvm_debug("Won't set CNTVOFF to %llx (%llx)\n", > + cntvoff, > + vcpu->kvm->arch.timer.cntvoff); > + return -1; > + } > + } > + vcpu->kvm->arch.timer.cntvoff = cntvoff; > break; > + } > case KVM_REG_ARM_TIMER_CVAL: > timer->cntv_cval = value; > break; > -- > 2.1.4 > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Christoffer, On 25/06/15 09:04, Christoffer Dall wrote: > On Wed, Jun 24, 2015 at 03:54:57PM +0100, Marc Zyngier wrote: >> Userspace is allowed to set the guest's view of CNTVCT, which turns >> into setting CNTVOFF for the whole VM. One thing userspace is not supposed >> to do is to update that register while the guest is running. Time will >> either move forward (best case) or backward (really bad idea). Either way, >> this shouldn't happen. >> >> This patch prevents userspace from touching CNTVOFF as soon as a vcpu >> has been started. This ensures that time will keep monotonically increase. >> >> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> >> --- >> >> QEMU seems to trigger this at boot time, and I have no idea why it does so. >> It would be good to find out, hence the RFC tag. > > Is this at kernel boot time you see this, or at system startup time? When the kernel boots. The OSV guys also have seen this (time going backward), and this patch seems to have solved their problem. > IIRC, QEMU creates a throw-away VM with the default CPU target time, > reads out all the system registers to get the KVM reset values of those, > then creates the real VM, and feeds back in all the system register > reset values, as a method for QEMU and KVM to be in sync about the reset > state of the machine. If we do this, and include CNTVCT, then that > would probably trigger this, but the VCPU really shouldn't have been run > at that time... I definitely see this process happening, but that doesn't seem to be the problem. > We should prevent userspace from fiddling with this register post VCPU > start regardless, but yes, it would be good to find out why this is > happening in the first place. My main problem is that I cannot easily correlate what is happening in the guest at that time with something touching CNTVCT. That's just odd. > How did you notice this and does it manifest itself in some user-visible > ugliness? It was first reported by the OSV guys, and I then spotted some interesting hrtimers taking too long at guest boot time. Putting some traces quickly identified the issue. Thanks, M.
Hi Christoffer, On 25.06.2015 10:04, Christoffer Dall wrote: > On Wed, Jun 24, 2015 at 03:54:57PM +0100, Marc Zyngier wrote: >> Userspace is allowed to set the guest's view of CNTVCT, which turns >> into setting CNTVOFF for the whole VM. One thing userspace is not supposed >> to do is to update that register while the guest is running. Time will >> either move forward (best case) or backward (really bad idea). Either way, >> this shouldn't happen. >> >> This patch prevents userspace from touching CNTVOFF as soon as a vcpu >> has been started. This ensures that time will keep monotonically increase. >> >> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> >> --- >> >> QEMU seems to trigger this at boot time, and I have no idea why it does so. >> It would be good to find out, hence the RFC tag. > > Is this at kernel boot time you see this, or at system startup time? > > IIRC, QEMU creates a throw-away VM with the default CPU target time, > reads out all the system registers to get the KVM reset values of those, > then creates the real VM, and feeds back in all the system register > reset values, as a method for QEMU and KVM to be in sync about the reset > state of the machine. If we do this, and include CNTVCT, then that > would probably trigger this, but the VCPU really shouldn't have been run > at that time... > > We should prevent userspace from fiddling with this register post VCPU > start regardless, but yes, it would be good to find out why this is > happening in the first place. > > How did you notice this and does it manifest itself in some user-visible > ugliness? > > Thanks, > -Christoffer > You can read the whole history here: https://groups.google.com/forum/#!topic/osv-dev/2w101csH65E It causes clock-related bugs with time jumping backward when relying on the virtual counter register in the guest, whenever a cpu is booted (primary, secondary via PSCI), and actually whenever the monitor is used to stop, info registers etc. Once the VM is created, I think QEMU should not request kvm to change the virtual offset of the VM anymore: maybe an unexpected consequence of QEMU's target-arm/kvm64.c::kvm_arch_put_registers ? Thanks, Claudio -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 25 June 2015 at 09:59, Claudio Fontana <claudio.fontana@huawei.com> wrote: > Once the VM is created, I think QEMU should not request kvm to > change the virtual offset of the VM anymore: maybe an unexpected > consequence of QEMU's target-arm/kvm64.c::kvm_arch_put_registers ? Hmm. In general we assume that we can: * stop the VM * read all the guest system registers * write those values back again * restart the VM if we need to. Is that what's happening here, or are we doing something odder? -- PMM -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 25.06.2015 11:10, Peter Maydell wrote: > On 25 June 2015 at 09:59, Claudio Fontana <claudio.fontana@huawei.com> wrote: >> Once the VM is created, I think QEMU should not request kvm to >> change the virtual offset of the VM anymore: maybe an unexpected >> consequence of QEMU's target-arm/kvm64.c::kvm_arch_put_registers ? > > Hmm. In general we assume that we can: > * stop the VM > * read all the guest system registers > * write those values back again > * restart the VM > > if we need to. Is that what's happening here, or are we doing > something odder? > > -- PMM > What I guess could be happening by looking at the code in linux virt/kvm/arm/arch_timer.c::kvm_arm_timer_set_reg is that QEMU tries to set the KVM_REG_ARM_TIMER_CNT register from exactly the previous value, but just because of the fact that the set function is called, cntvoff is updated, since the value provided by the user is apparently assumed to be _relative_ to the physical timer. This is apparent to me in the code in that function which says: case KVM_REG_ARM_TIMER_CNT: { /* ... */ u64 cntvoff = kvm_phys_timer_read() - value; /* ... */ } And this is matched by the corresponding get function kvm_arm_timer_get_reg where it says: case KVM_REG_ARM_TIMER_CNT: return kvm_phys_timer_read() - vcpu->kvm->arch.timer.cntvoff; The time difference between when the GET is issued by QEMU and when the PUT is issued then would account for the difference. Thanks, Claudio -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2015-06-25 11:25, Claudio Fontana wrote: > On 25.06.2015 11:10, Peter Maydell wrote: >> On 25 June 2015 at 09:59, Claudio Fontana <claudio.fontana@huawei.com> wrote: >>> Once the VM is created, I think QEMU should not request kvm to >>> change the virtual offset of the VM anymore: maybe an unexpected >>> consequence of QEMU's target-arm/kvm64.c::kvm_arch_put_registers ? >> >> Hmm. In general we assume that we can: >> * stop the VM >> * read all the guest system registers >> * write those values back again >> * restart the VM >> >> if we need to. Is that what's happening here, or are we doing >> something odder? >> >> -- PMM >> > > What I guess could be happening by looking at the code in linux > > virt/kvm/arm/arch_timer.c::kvm_arm_timer_set_reg > > is that QEMU tries to set the KVM_REG_ARM_TIMER_CNT register from exactly the previous value, > but just because of the fact that the set function is called, cntvoff is updated, > since the value provided by the user is apparently assumed to be _relative_ to the physical timer. > > This is apparent to me in the code in that function which says: > > case KVM_REG_ARM_TIMER_CNT: { > /* ... */ > u64 cntvoff = kvm_phys_timer_read() - value; > /* ... */ > } > > And this is matched by the corresponding get function kvm_arm_timer_get_reg where it says: > > case KVM_REG_ARM_TIMER_CNT: > return kvm_phys_timer_read() - vcpu->kvm->arch.timer.cntvoff; > > The time difference between when the GET is issued by QEMU and when the PUT is issued then would account for the difference. QEMU has the concept of write-back levels: KVM_PUT_RUNTIME_STATE, KVM_PUT_RESET_STATE and KVM_PUT_FULL_STATE. I suspect this registers is just sorted into the wrong category, thus written as part of the RUNTIME_STATE. We had such bug patterns during the x86 maturing phase as well. Jan
On 26.06.2015 06:49, Jan Kiszka wrote: > On 2015-06-25 11:25, Claudio Fontana wrote: >> On 25.06.2015 11:10, Peter Maydell wrote: >>> On 25 June 2015 at 09:59, Claudio Fontana <claudio.fontana@huawei.com> wrote: >>>> Once the VM is created, I think QEMU should not request kvm to >>>> change the virtual offset of the VM anymore: maybe an unexpected >>>> consequence of QEMU's target-arm/kvm64.c::kvm_arch_put_registers ? >>> >>> Hmm. In general we assume that we can: >>> * stop the VM >>> * read all the guest system registers >>> * write those values back again >>> * restart the VM >>> >>> if we need to. Is that what's happening here, or are we doing >>> something odder? >>> >>> -- PMM >>> >> >> What I guess could be happening by looking at the code in linux >> >> virt/kvm/arm/arch_timer.c::kvm_arm_timer_set_reg >> >> is that QEMU tries to set the KVM_REG_ARM_TIMER_CNT register from exactly the previous value, >> but just because of the fact that the set function is called, cntvoff is updated, >> since the value provided by the user is apparently assumed to be _relative_ to the physical timer. >> >> This is apparent to me in the code in that function which says: >> >> case KVM_REG_ARM_TIMER_CNT: { >> /* ... */ >> u64 cntvoff = kvm_phys_timer_read() - value; >> /* ... */ >> } >> >> And this is matched by the corresponding get function kvm_arm_timer_get_reg where it says: >> >> case KVM_REG_ARM_TIMER_CNT: >> return kvm_phys_timer_read() - vcpu->kvm->arch.timer.cntvoff; >> >> The time difference between when the GET is issued by QEMU and when the PUT is issued then would account for the difference. > > QEMU has the concept of write-back levels: KVM_PUT_RUNTIME_STATE, > KVM_PUT_RESET_STATE and KVM_PUT_FULL_STATE. I suspect this registers is > just sorted into the wrong category, thus written as part of the > RUNTIME_STATE. We had such bug patterns during the x86 maturing phase as > well. > > Jan > It seems that QEMU target-arm ignores the level parameter to kvm_arch_put_registers completely. Is it intended? Actually even the first set of the cntvoff is wrong, although it does not cause any problems. When the VM is created, KVM already sets the cntvoff, so even putting the state once at RESET is I think wrong: the VM already has a CNTVOFF when it is created, so trying to change it afterwards even once should be avoided I think. But if the RFC is committed as is to kvm, it's fine by me also. Ciao Claudio -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 29 June 2015 at 18:20, Claudio Fontana <claudio.fontana@huawei.com> wrote: > On 26.06.2015 06:49, Jan Kiszka wrote: >> QEMU has the concept of write-back levels: KVM_PUT_RUNTIME_STATE, >> KVM_PUT_RESET_STATE and KVM_PUT_FULL_STATE. I suspect this registers is >> just sorted into the wrong category, thus written as part of the >> RUNTIME_STATE. We had such bug patterns during the x86 maturing phase as >> well. > It seems that QEMU target-arm ignores the level parameter to > kvm_arch_put_registers completely. > > Is it intended? Yes, sort of. We don't in general know anything about the semantics of most of the system registers. It should always be safe to read them all out of the kernel and write them back... -- PMM -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 29/06/15 18:37, Peter Maydell wrote: > On 29 June 2015 at 18:20, Claudio Fontana <claudio.fontana@huawei.com> wrote: >> On 26.06.2015 06:49, Jan Kiszka wrote: >>> QEMU has the concept of write-back levels: KVM_PUT_RUNTIME_STATE, >>> KVM_PUT_RESET_STATE and KVM_PUT_FULL_STATE. I suspect this registers is >>> just sorted into the wrong category, thus written as part of the >>> RUNTIME_STATE. We had such bug patterns during the x86 maturing phase as >>> well. > >> It seems that QEMU target-arm ignores the level parameter to >> kvm_arch_put_registers completely. >> >> Is it intended? > > Yes, sort of. We don't in general know anything about the semantics > of most of the system registers. It should always be safe to > read them all out of the kernel and write them back... I'm not sure you can safely assume this for time related things, unless you can guarantee that all vcpus are stopped. Claudio is seeing time jumping in weird ways, and so have I, which would tend to show that QEMU is introducing some jitter. Maybe not easily observable on real hardware, but the FastModel is enough to show the issue. So unless someone has a better solution, I'm seriously considering getting this patch merged. Thanks, M.
On 8 July 2015 at 16:56, Marc Zyngier <marc.zyngier@arm.com> wrote: > On 29/06/15 18:37, Peter Maydell wrote: >> On 29 June 2015 at 18:20, Claudio Fontana <claudio.fontana@huawei.com> wrote: >>> On 26.06.2015 06:49, Jan Kiszka wrote: >>>> QEMU has the concept of write-back levels: KVM_PUT_RUNTIME_STATE, >>>> KVM_PUT_RESET_STATE and KVM_PUT_FULL_STATE. I suspect this registers is >>>> just sorted into the wrong category, thus written as part of the >>>> RUNTIME_STATE. We had such bug patterns during the x86 maturing phase as >>>> well. >> >>> It seems that QEMU target-arm ignores the level parameter to >>> kvm_arch_put_registers completely. >>> >>> Is it intended? >> >> Yes, sort of. We don't in general know anything about the semantics >> of most of the system registers. It should always be safe to >> read them all out of the kernel and write them back... > > I'm not sure you can safely assume this for time related things, unless > you can guarantee that all vcpus are stopped. Claudio is seeing time > jumping in weird ways, and so have I, which would tend to show that QEMU > is introducing some jitter. > > Maybe not easily observable on real hardware, but the FastModel is > enough to show the issue. > > So unless someone has a better solution, I'm seriously considering > getting this patch merged. I'd prefer it if somebody could investigate to see why QEMU is actually doing this -- so far we just have speculation. -- PMM -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 08/07/15 17:06, Peter Maydell wrote: > On 8 July 2015 at 16:56, Marc Zyngier <marc.zyngier@arm.com> wrote: >> On 29/06/15 18:37, Peter Maydell wrote: >>> On 29 June 2015 at 18:20, Claudio Fontana <claudio.fontana@huawei.com> wrote: >>>> On 26.06.2015 06:49, Jan Kiszka wrote: >>>>> QEMU has the concept of write-back levels: KVM_PUT_RUNTIME_STATE, >>>>> KVM_PUT_RESET_STATE and KVM_PUT_FULL_STATE. I suspect this registers is >>>>> just sorted into the wrong category, thus written as part of the >>>>> RUNTIME_STATE. We had such bug patterns during the x86 maturing phase as >>>>> well. >>> >>>> It seems that QEMU target-arm ignores the level parameter to >>>> kvm_arch_put_registers completely. >>>> >>>> Is it intended? >>> >>> Yes, sort of. We don't in general know anything about the semantics >>> of most of the system registers. It should always be safe to >>> read them all out of the kernel and write them back... >> >> I'm not sure you can safely assume this for time related things, unless >> you can guarantee that all vcpus are stopped. Claudio is seeing time >> jumping in weird ways, and so have I, which would tend to show that QEMU >> is introducing some jitter. >> >> Maybe not easily observable on real hardware, but the FastModel is >> enough to show the issue. >> >> So unless someone has a better solution, I'm seriously considering >> getting this patch merged. > > I'd prefer it if somebody could investigate to see why QEMU > is actually doing this -- so far we just have speculation. I'd prefer that too, but so far people seem to be more comfortable waiting for the issue to fix itself. In the meantime, VMs are broken in weird and wonderful ways, and I don't think the current status-quo helps anyone. M.
On 8 July 2015 at 17:37, Marc Zyngier <marc.zyngier@arm.com> wrote: > On 08/07/15 17:06, Peter Maydell wrote: >> I'd prefer it if somebody could investigate to see why QEMU >> is actually doing this -- so far we just have speculation. > > I'd prefer that too, but so far people seem to be more comfortable > waiting for the issue to fix itself. In the meantime, VMs are broken in > weird and wonderful ways, and I don't think the current status-quo helps > anyone. Putting in a patch which might not be the right fix isn't necessarily a good plan either... Does has_run_once get cleared if we do a re-VCPU_INIT of a CPU that's run before? (We need to allow rewriting of guest state at that point so that "reset VM and load migration state" behaves correctly.) I suspect Jan is right and we really need to distinguish the KVM_PUT_*_STATE levels in ARM QEMU. This probably implies some kind of whitelist/override mechanism, since by and large we neither know nor want to know the semantics for system registers, we leave that up to the kernel. Q: if you have a running VM, and you pause it for an hour, what should the CNTVCT register do? Presumably it should not advance, but how do we arrange for that to happen? -- PMM -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Peter and Marc, [cc'ing Paolo for his input on x86 timekeeping] On Wed, Jul 08, 2015 at 08:13:59PM +0100, Peter Maydell wrote: > On 8 July 2015 at 17:37, Marc Zyngier <marc.zyngier@arm.com> wrote: > > On 08/07/15 17:06, Peter Maydell wrote: > >> I'd prefer it if somebody could investigate to see why QEMU > >> is actually doing this -- so far we just have speculation. > > > > I'd prefer that too, but so far people seem to be more comfortable > > waiting for the issue to fix itself. In the meantime, VMs are broken in > > weird and wonderful ways, and I don't think the current status-quo helps > > anyone. > > Putting in a patch which might not be the right fix isn't > necessarily a good plan either... > > Does has_run_once get cleared if we do a re-VCPU_INIT > of a CPU that's run before? (We need to allow rewriting > of guest state at that point so that "reset VM and > load migration state" behaves correctly.) no, it does not, has_run_once is set the first time a VCPU is run and is currently *never* cleared. > > I suspect Jan is right and we really need to distinguish > the KVM_PUT_*_STATE levels in ARM QEMU. This probably > implies some kind of whitelist/override mechanism, since > by and large we neither know nor want to know the > semantics for system registers, we leave that up to the > kernel. > > Q: if you have a running VM, and you pause it for > an hour, what should the CNTVCT register do? Presumably > it should not advance, but how do we arrange for that > to happen? > I think the CNTVCT should not advance when the VM is not scheduled, so if we pause the VM or starve all the VCPUs for enough time, the guest should not see time progressing, since otherwise the guest scheduler cannot maintain fairness and you're bound to see spurious RCU stalls etc. That's exactly why a guest can read both a virtual and physical counter and it is an area where you simply want some level of paravirtualization. I haven't studied how/if Linux deals with this at all. So I think adjusting CNTVOFF should be managed by the kernel for the pause/starvation scenario (which I think Avi once told me x86 does too - does anyone know the current state of the art?). So the only situation where I think userspace should adjust the CNTVOFF value is for migration where we are talking about a brand new VM with has_run_once clear. Thus, if we were designing this from scratch now, the API should be to return an error when trying to set KVM_REG_ARM_TIMER_CNT after the VM has run once, but it's too late for that as we would break userspace. The best alternative IMHO would be to merge Marc's patch and fix CNTVOFF in the kernel side as well, and finally also fix QEMU so that it doesn't try to do the thing that future kernels will ignore. -Christoffer -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 9 July 2015 at 11:22, Christoffer Dall <christoffer.dall@linaro.org> wrote: > On Wed, Jul 08, 2015 at 08:13:59PM +0100, Peter Maydell wrote: >> I suspect Jan is right and we really need to distinguish >> the KVM_PUT_*_STATE levels in ARM QEMU. This probably >> implies some kind of whitelist/override mechanism, since >> by and large we neither know nor want to know the >> semantics for system registers, we leave that up to the >> kernel. >> >> Q: if you have a running VM, and you pause it for >> an hour, what should the CNTVCT register do? Presumably >> it should not advance, but how do we arrange for that >> to happen? > I think the CNTVCT should not advance when the VM is not scheduled, so > if we pause the VM or starve all the VCPUs for enough time, the guest > should not see time progressing, since otherwise the guest scheduler > cannot maintain fairness and you're bound to see spurious RCU stalls > etc. > > That's exactly why a guest can read both a virtual and physical counter > and it is an area where you simply want some level of > paravirtualization. I haven't studied how/if Linux deals with this at > all. > > So I think adjusting CNTVOFF should be managed by the kernel for the > pause/starvation scenario (which I think Avi once told me x86 does too - > does anyone know the current state of the art?). Agreed that we should be aiming for same behaviour as x86 rather than reinventing the wheel... > So the only situation where I think userspace should adjust the CNTVOFF > value is for migration where we are talking about a brand new VM with > has_run_once clear. For incoming migration/loadvm the VM will be freshly *reset*, but it will not be completely created from scratch. It's possible for the user to run a VM for a bit, and then use "loadvm" from the monitor to load an old snapshot. This will do a qemu_system_reset() and then restore the state. So in that case has_run_once won't be clear. (Would it be OK to clear it when userspace does VCPU_INIT? What else does it control?) > Thus, if we were designing this from scratch now, the API should > be to return an error when trying to set KVM_REG_ARM_TIMER_CNT after the > VM has run once, but it's too late for that as we would break userspace. > The best alternative IMHO would be to merge Marc's patch and fix CNTVOFF > in the kernel side as well, and finally also fix QEMU so that it doesn't > try to do the thing that future kernels will ignore. Isn't Marc's patch causing us to return an error to userspace? (It does a 'return -1', at least -- does that get ignored at a higher level?) In principle, if userspace does: * stop all vCPUs * read the counter values from each vCPU * write the counter values back to each vCPU * resume vCPUs this shouldn't cause time to do funny things, should it? (ie the problem only occurs if we try to write the vCPU counter value while only one vCPU is stopped). thanks -- PMM -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2015-07-09 12:22, Christoffer Dall wrote: > Hi Peter and Marc, > > [cc'ing Paolo for his input on x86 timekeeping] > > On Wed, Jul 08, 2015 at 08:13:59PM +0100, Peter Maydell wrote: >> On 8 July 2015 at 17:37, Marc Zyngier <marc.zyngier@arm.com> wrote: >>> On 08/07/15 17:06, Peter Maydell wrote: >>>> I'd prefer it if somebody could investigate to see why QEMU >>>> is actually doing this -- so far we just have speculation. >>> >>> I'd prefer that too, but so far people seem to be more comfortable >>> waiting for the issue to fix itself. In the meantime, VMs are broken in >>> weird and wonderful ways, and I don't think the current status-quo helps >>> anyone. >> >> Putting in a patch which might not be the right fix isn't >> necessarily a good plan either... >> >> Does has_run_once get cleared if we do a re-VCPU_INIT >> of a CPU that's run before? (We need to allow rewriting >> of guest state at that point so that "reset VM and >> load migration state" behaves correctly.) > > no, it does not, has_run_once is set the first time a VCPU is run and is > currently *never* cleared. > >> >> I suspect Jan is right and we really need to distinguish >> the KVM_PUT_*_STATE levels in ARM QEMU. This probably >> implies some kind of whitelist/override mechanism, since >> by and large we neither know nor want to know the >> semantics for system registers, we leave that up to the >> kernel. >> >> Q: if you have a running VM, and you pause it for >> an hour, what should the CNTVCT register do? Presumably >> it should not advance, but how do we arrange for that >> to happen? >> > > I think the CNTVCT should not advance when the VM is not scheduled, so > if we pause the VM or starve all the VCPUs for enough time, the guest > should not see time progressing, since otherwise the guest scheduler > cannot maintain fairness and you're bound to see spurious RCU stalls > etc. > > That's exactly why a guest can read both a virtual and physical counter > and it is an area where you simply want some level of > paravirtualization. I haven't studied how/if Linux deals with this at > all. > > So I think adjusting CNTVOFF should be managed by the kernel for the > pause/starvation scenario (which I think Avi once told me x86 does too - > does anyone know the current state of the art?). > > So the only situation where I think userspace should adjust the CNTVOFF > value is for migration where we are talking about a brand new VM with > has_run_once clear. > > Thus, if we were designing this from scratch now, the API should > be to return an error when trying to set KVM_REG_ARM_TIMER_CNT after the > VM has run once, but it's too late for that as we would break userspace. > The best alternative IMHO would be to merge Marc's patch and fix CNTVOFF > in the kernel side as well, and finally also fix QEMU so that it doesn't > try to do the thing that future kernels will ignore. Fixing QEMU to only write on KVM_PUT_FULL_STATE - yes, that should be done, but I don't think the approach for the kernel is generally right. The kernel should not do any policing on user space requests to change the VCPU or VM state unless - security is affected - userspace lacks information, thus cannot decide correctly - legacy userspace has a bug, we can detect it and want to fix that up without affecting future userspace that has a reason to do it differently Regarding CNTVOFF, the first two criteria do not apply for sure. Maybe the last one, don't know. Just think of the hypothetical scenario that a userspace VM debugger wants to inject certain register manipulations. If you block this by some hidden VM state like proposed, that feature would no longer be implementable easily. Jan
On Thu, Jul 09, 2015 at 11:38:40AM +0100, Peter Maydell wrote: > On 9 July 2015 at 11:22, Christoffer Dall <christoffer.dall@linaro.org> wrote: > > On Wed, Jul 08, 2015 at 08:13:59PM +0100, Peter Maydell wrote: > >> I suspect Jan is right and we really need to distinguish > >> the KVM_PUT_*_STATE levels in ARM QEMU. This probably > >> implies some kind of whitelist/override mechanism, since > >> by and large we neither know nor want to know the > >> semantics for system registers, we leave that up to the > >> kernel. > >> > >> Q: if you have a running VM, and you pause it for > >> an hour, what should the CNTVCT register do? Presumably > >> it should not advance, but how do we arrange for that > >> to happen? > > > I think the CNTVCT should not advance when the VM is not scheduled, so > > if we pause the VM or starve all the VCPUs for enough time, the guest > > should not see time progressing, since otherwise the guest scheduler > > cannot maintain fairness and you're bound to see spurious RCU stalls > > etc. > > > > That's exactly why a guest can read both a virtual and physical counter > > and it is an area where you simply want some level of > > paravirtualization. I haven't studied how/if Linux deals with this at > > all. > > > > So I think adjusting CNTVOFF should be managed by the kernel for the > > pause/starvation scenario (which I think Avi once told me x86 does too - > > does anyone know the current state of the art?). > > Agreed that we should be aiming for same behaviour as x86 > rather than reinventing the wheel... > > > So the only situation where I think userspace should adjust the CNTVOFF > > value is for migration where we are talking about a brand new VM with > > has_run_once clear. > > For incoming migration/loadvm the VM will be freshly *reset*, > but it will not be completely created from scratch. It's > possible for the user to run a VM for a bit, and then use > "loadvm" from the monitor to load an old snapshot. This > will do a qemu_system_reset() and then restore the state. > So in that case has_run_once won't be clear. (Would it be > OK to clear it when userspace does VCPU_INIT? What else > does it control?) We would have to introduce another flag. It is currently used to only map the VGIC resources once (we have to do this before running the VCPU the first time because only then do we have the required info of where to map the resources etc.), and we use it to enforce that the user doesn't initialize a VCPU with one setting, and then initialize it with a different setting afterwards (for example with power-on state or the emulated CPU target). > > > Thus, if we were designing this from scratch now, the API should > > be to return an error when trying to set KVM_REG_ARM_TIMER_CNT after the > > VM has run once, but it's too late for that as we would break userspace. > > The best alternative IMHO would be to merge Marc's patch and fix CNTVOFF > > in the kernel side as well, and finally also fix QEMU so that it doesn't > > try to do the thing that future kernels will ignore. > > Isn't Marc's patch causing us to return an error to userspace? > (It does a 'return -1', at least -- does that get ignored at > a higher level?) I remembered it as it simply ignored it, but you're right, it returns -1 to userspace (which is a strange thing we do in arch_timer.c, I'm not sure why we return -EPERM here - probably a bug). So that can't work obviously, because it will regress userspace. > > In principle, if userspace does: > * stop all vCPUs > * read the counter values from each vCPU > * write the counter values back to each vCPU > * resume vCPUs > > this shouldn't cause time to do funny things, should it? No, it shouldn't. > (ie the problem only occurs if we try to write the vCPU > counter value while only one vCPU is stopped). > As I understand it, the problem is that if we ever run a VCPU after reading the value, and write back the value afterwards, you potentially make time go backwards and get inconsistent views of time from different VCPUs because they may have read the time before/after updating the CNTVOFF. -Christoffer -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 9 July 2015 at 13:05, Christoffer Dall <christoffer.dall@linaro.org> wrote: > As I understand it, the problem is that if we ever run a VCPU after > reading the value, and write back the value afterwards, you potentially > make time go backwards and get inconsistent views of time from different > VCPUs because they may have read the time before/after updating the > CNTVOFF. Right, but I think if QEMU does that it's a bug (and more to the point I don't entirely understand why we would do that yet, even given that we don't have a distinction between "registers to sync always" and "registers to sync only on reset"...) -- PMM -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Jan, On Thu, Jul 09, 2015 at 12:40:39PM +0200, Jan Kiszka wrote: > On 2015-07-09 12:22, Christoffer Dall wrote: > > Hi Peter and Marc, > > > > [cc'ing Paolo for his input on x86 timekeeping] > > > > On Wed, Jul 08, 2015 at 08:13:59PM +0100, Peter Maydell wrote: > >> On 8 July 2015 at 17:37, Marc Zyngier <marc.zyngier@arm.com> wrote: > >>> On 08/07/15 17:06, Peter Maydell wrote: > >>>> I'd prefer it if somebody could investigate to see why QEMU > >>>> is actually doing this -- so far we just have speculation. > >>> > >>> I'd prefer that too, but so far people seem to be more comfortable > >>> waiting for the issue to fix itself. In the meantime, VMs are broken in > >>> weird and wonderful ways, and I don't think the current status-quo helps > >>> anyone. > >> > >> Putting in a patch which might not be the right fix isn't > >> necessarily a good plan either... > >> > >> Does has_run_once get cleared if we do a re-VCPU_INIT > >> of a CPU that's run before? (We need to allow rewriting > >> of guest state at that point so that "reset VM and > >> load migration state" behaves correctly.) > > > > no, it does not, has_run_once is set the first time a VCPU is run and is > > currently *never* cleared. > > > >> > >> I suspect Jan is right and we really need to distinguish > >> the KVM_PUT_*_STATE levels in ARM QEMU. This probably > >> implies some kind of whitelist/override mechanism, since > >> by and large we neither know nor want to know the > >> semantics for system registers, we leave that up to the > >> kernel. > >> > >> Q: if you have a running VM, and you pause it for > >> an hour, what should the CNTVCT register do? Presumably > >> it should not advance, but how do we arrange for that > >> to happen? > >> > > > > I think the CNTVCT should not advance when the VM is not scheduled, so > > if we pause the VM or starve all the VCPUs for enough time, the guest > > should not see time progressing, since otherwise the guest scheduler > > cannot maintain fairness and you're bound to see spurious RCU stalls > > etc. > > > > That's exactly why a guest can read both a virtual and physical counter > > and it is an area where you simply want some level of > > paravirtualization. I haven't studied how/if Linux deals with this at > > all. > > > > So I think adjusting CNTVOFF should be managed by the kernel for the > > pause/starvation scenario (which I think Avi once told me x86 does too - > > does anyone know the current state of the art?). > > > > So the only situation where I think userspace should adjust the CNTVOFF > > value is for migration where we are talking about a brand new VM with > > has_run_once clear. > > > > Thus, if we were designing this from scratch now, the API should > > be to return an error when trying to set KVM_REG_ARM_TIMER_CNT after the > > VM has run once, but it's too late for that as we would break userspace. > > The best alternative IMHO would be to merge Marc's patch and fix CNTVOFF > > in the kernel side as well, and finally also fix QEMU so that it doesn't > > try to do the thing that future kernels will ignore. > > Fixing QEMU to only write on KVM_PUT_FULL_STATE - yes, that should be > done, but I don't think the approach for the kernel is generally right. > The kernel should not do any policing on user space requests to change > the VCPU or VM state unless > > - security is affected > - userspace lacks information, thus cannot decide correctly > - legacy userspace has a bug, we can detect it and want to fix that up > without affecting future userspace that has a reason to do it > differently > > Regarding CNTVOFF, the first two criteria do not apply for sure. Maybe > the last one, don't know. Just think of the hypothetical scenario that a > userspace VM debugger wants to inject certain register manipulations. If > you block this by some hidden VM state like proposed, that feature would > no longer be implementable easily. > Hmm, I didn't think about this, I guess reverse execution and record/replay scenarios would also want you to make time appear to be going backwards? But the third case kind of does apply, if we care about fixing legacy userspace in the kernel. We could define a new register index for the ioctl that allows time to move forward that future userspace can use, and impose the restriction of not making time go backwards with the current index. Would that break any currently working use cases on arm64 with legacy userspace, such as migration? Thanks, -Christoffer
On Thu, Jul 09, 2015 at 01:07:24PM +0100, Peter Maydell wrote: > On 9 July 2015 at 13:05, Christoffer Dall <christoffer.dall@linaro.org> wrote: > > As I understand it, the problem is that if we ever run a VCPU after > > reading the value, and write back the value afterwards, you potentially > > make time go backwards and get inconsistent views of time from different > > VCPUs because they may have read the time before/after updating the > > CNTVOFF. > > Right, but I think if QEMU does that it's a bug (and more to > the point I don't entirely understand why we would do that > yet, even given that we don't have a distinction between > "registers to sync always" and "registers to sync only on > reset"...) > I think we have evidence that it does that, but we don't know why/how. -Christoffer -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 09, 2015 at 02:24:06PM +0200, Christoffer Dall wrote: > On Thu, Jul 09, 2015 at 01:07:24PM +0100, Peter Maydell wrote: > > On 9 July 2015 at 13:05, Christoffer Dall <christoffer.dall@linaro.org> wrote: > > > As I understand it, the problem is that if we ever run a VCPU after > > > reading the value, and write back the value afterwards, you potentially > > > make time go backwards and get inconsistent views of time from different > > > VCPUs because they may have read the time before/after updating the > > > CNTVOFF. > > > > Right, but I think if QEMU does that it's a bug (and more to > > the point I don't entirely understand why we would do that > > yet, even given that we don't have a distinction between > > "registers to sync always" and "registers to sync only on > > reset"...) > > > I think we have evidence that it does that, but we don't know why/how. > So I ran this through GDB, and this happens when the guest probes the virtio devices, specifically the backtrace tells me that virtio_current_cpu_endian () at /root/src/qemu/hw/virtio/virtio.c:594 => cc->virtio_is_big_endian -> arm_cpu_is_big_endian -> cpu_synchronize_state -> kvm_cpu_synchronize_state which causes cpu->kvm_vcpu_dirty = true, which causes the run-loop to write the CNTVOFF on a per-vcpu basis without stopping anything, as far as I can tell. So yeah, I guess the only required fix here is to fix QEMU in some way as to not fiddle with the timer registers in this way, and then I honestly don't know if we should try to fix legacy userspace in the kernel, but based on the feedback from Jan, I suppose not. Thanks, -Christoffer -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 9 July 2015 at 15:17, Christoffer Dall <christoffer.dall@linaro.org> wrote: > On Thu, Jul 09, 2015 at 02:24:06PM +0200, Christoffer Dall wrote: > So I ran this through GDB, and this happens when the guest probes the > virtio devices, specifically the backtrace tells me that > > virtio_current_cpu_endian () at /root/src/qemu/hw/virtio/virtio.c:594 > => cc->virtio_is_big_endian > -> arm_cpu_is_big_endian > -> cpu_synchronize_state > -> kvm_cpu_synchronize_state > > which causes cpu->kvm_vcpu_dirty = true, which causes the run-loop to > write the CNTVOFF on a per-vcpu basis without stopping anything, as far > as I can tell. Ah, I was wondering if it was going to turn out to be this, but I hadn't figured out why it was going to cause us to do a write-back of state rather than just a read. > So yeah, I guess the only required fix here is to fix QEMU in some way > as to not fiddle with the timer registers in this way, and then I > honestly don't know if we should try to fix legacy userspace in the > kernel, but based on the feedback from Jan, I suppose not. arm_cpu_is_big_endian() doesn't actually want to write back any state -- all it's interested in is reading. So we really ought not to need to write anything back there. kvm-all.c's sync functions don't seem to provide a "sync kernel state to userspace but I promise I'm not going to dirty it" function though. I guess we could add one. (Overall it's kind of fragile even if we can avoid this specific case where we're writing back the counter state, though -- we should probably think about the sync level stuff as well.) -- PMM -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 09, 2015 at 03:26:20PM +0100, Peter Maydell wrote: > On 9 July 2015 at 15:17, Christoffer Dall <christoffer.dall@linaro.org> wrote: > > On Thu, Jul 09, 2015 at 02:24:06PM +0200, Christoffer Dall wrote: > > So I ran this through GDB, and this happens when the guest probes the > > virtio devices, specifically the backtrace tells me that > > > > virtio_current_cpu_endian () at /root/src/qemu/hw/virtio/virtio.c:594 > > => cc->virtio_is_big_endian > > -> arm_cpu_is_big_endian > > -> cpu_synchronize_state > > -> kvm_cpu_synchronize_state > > > > which causes cpu->kvm_vcpu_dirty = true, which causes the run-loop to > > write the CNTVOFF on a per-vcpu basis without stopping anything, as far > > as I can tell. > > Ah, I was wondering if it was going to turn out to be this, > but I hadn't figured out why it was going to cause us to do > a write-back of state rather than just a read. > > > So yeah, I guess the only required fix here is to fix QEMU in some way > > as to not fiddle with the timer registers in this way, and then I > > honestly don't know if we should try to fix legacy userspace in the > > kernel, but based on the feedback from Jan, I suppose not. > > arm_cpu_is_big_endian() doesn't actually want to write > back any state -- all it's interested in is reading. > So we really ought not to need to write anything back there. > kvm-all.c's sync functions don't seem to provide a "sync > kernel state to userspace but I promise I'm not going to > dirty it" function though. I guess we could add one. > > (Overall it's kind of fragile even if we can avoid this > specific case where we're writing back the counter state, > though -- we should probably think about the sync level > stuff as well.) Sounds to me like we should add the sync level stuff first, as it should catch all callers, and afterwards consider adding the sync-one-way optimization. -Christoffer -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c index 98c95f2..660f348 100644 --- a/virt/kvm/arm/arch_timer.c +++ b/virt/kvm/arm/arch_timer.c @@ -220,9 +220,26 @@ int kvm_arm_timer_set_reg(struct kvm_vcpu *vcpu, u64 regid, u64 value) case KVM_REG_ARM_TIMER_CTL: timer->cntv_ctl = value; break; - case KVM_REG_ARM_TIMER_CNT: - vcpu->kvm->arch.timer.cntvoff = kvm_phys_timer_read() - value; + case KVM_REG_ARM_TIMER_CNT: { + struct kvm_vcpu *tmp; + int i; + u64 cntvoff = kvm_phys_timer_read() - value; + + /* + * If any of the vcpus has started, don't update + * CNTVOFF. Userspace is severely brain damaged. + */ + kvm_for_each_vcpu(i, tmp, vcpu->kvm) { + if (tmp->arch.has_run_once) { + kvm_debug("Won't set CNTVOFF to %llx (%llx)\n", + cntvoff, + vcpu->kvm->arch.timer.cntvoff); + return -1; + } + } + vcpu->kvm->arch.timer.cntvoff = cntvoff; break; + } case KVM_REG_ARM_TIMER_CVAL: timer->cntv_cval = value; break;
Userspace is allowed to set the guest's view of CNTVCT, which turns into setting CNTVOFF for the whole VM. One thing userspace is not supposed to do is to update that register while the guest is running. Time will either move forward (best case) or backward (really bad idea). Either way, this shouldn't happen. This patch prevents userspace from touching CNTVOFF as soon as a vcpu has been started. This ensures that time will keep monotonically increase. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> --- QEMU seems to trigger this at boot time, and I have no idea why it does so. It would be good to find out, hence the RFC tag. virt/kvm/arm/arch_timer.c | 21 +++++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-)