diff mbox

[RFC] KVM: arm/arm64: Don't let userspace update CNTVOFF once guest is running

Message ID 1435157697-28579-1-git-send-email-marc.zyngier@arm.com (mailing list archive)
State New, archived
Headers show

Commit Message

Marc Zyngier June 24, 2015, 2:54 p.m. UTC
Userspace is allowed to set the guest's view of CNTVCT, which turns
into setting CNTVOFF for the whole VM. One thing userspace is not supposed
to do is to update that register while the guest is running. Time will
either move forward (best case) or backward (really bad idea). Either way,
this shouldn't happen.

This patch prevents userspace from touching CNTVOFF as soon as a vcpu
has been started. This ensures that time will keep monotonically increase.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---

QEMU seems to trigger this at boot time, and I have no idea why it does so.
It would be good to find out, hence the RFC tag.

 virt/kvm/arm/arch_timer.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

Comments

Christoffer Dall June 25, 2015, 8:04 a.m. UTC | #1
On Wed, Jun 24, 2015 at 03:54:57PM +0100, Marc Zyngier wrote:
> Userspace is allowed to set the guest's view of CNTVCT, which turns
> into setting CNTVOFF for the whole VM. One thing userspace is not supposed
> to do is to update that register while the guest is running. Time will
> either move forward (best case) or backward (really bad idea). Either way,
> this shouldn't happen.
> 
> This patch prevents userspace from touching CNTVOFF as soon as a vcpu
> has been started. This ensures that time will keep monotonically increase.
> 
> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
> ---
> 
> QEMU seems to trigger this at boot time, and I have no idea why it does so.
> It would be good to find out, hence the RFC tag.

Is this at kernel boot time you see this, or at system startup time?

IIRC, QEMU creates a throw-away VM with the default CPU target time,
reads out all the system registers to get the KVM reset values of those,
then creates the real VM, and feeds back in all the system register
reset values, as a method for QEMU and KVM to be in sync about the reset
state of the machine.  If we do this, and include CNTVCT, then that
would probably trigger this, but the VCPU really shouldn't have been run
at that time...

We should prevent userspace from fiddling with this register post VCPU
start regardless, but yes, it would be good to find out why this is
happening in the first place.

How did you notice this and does it manifest itself in some user-visible
ugliness?

Thanks,
-Christoffer

> 
>  virt/kvm/arm/arch_timer.c | 21 +++++++++++++++++++--
>  1 file changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
> index 98c95f2..660f348 100644
> --- a/virt/kvm/arm/arch_timer.c
> +++ b/virt/kvm/arm/arch_timer.c
> @@ -220,9 +220,26 @@ int kvm_arm_timer_set_reg(struct kvm_vcpu *vcpu, u64 regid, u64 value)
>  	case KVM_REG_ARM_TIMER_CTL:
>  		timer->cntv_ctl = value;
>  		break;
> -	case KVM_REG_ARM_TIMER_CNT:
> -		vcpu->kvm->arch.timer.cntvoff = kvm_phys_timer_read() - value;
> +	case KVM_REG_ARM_TIMER_CNT: {
> +		struct kvm_vcpu *tmp;
> +		int i;
> +		u64 cntvoff = kvm_phys_timer_read() - value;
> +
> +		/*
> +		 * If any of the vcpus has started, don't update
> +		 * CNTVOFF. Userspace is severely brain damaged.
> +		 */
> +		kvm_for_each_vcpu(i, tmp, vcpu->kvm) {
> +			if (tmp->arch.has_run_once) {
> +				kvm_debug("Won't set CNTVOFF to %llx (%llx)\n",
> +					  cntvoff,
> +					  vcpu->kvm->arch.timer.cntvoff);
> +				return -1;
> +			}
> +		}
> +		vcpu->kvm->arch.timer.cntvoff = cntvoff;
>  		break;
> +	}
>  	case KVM_REG_ARM_TIMER_CVAL:
>  		timer->cntv_cval = value;
>  		break;
> -- 
> 2.1.4
>
Marc Zyngier June 25, 2015, 8:48 a.m. UTC | #2
Hi Christoffer,

On 25/06/15 09:04, Christoffer Dall wrote:
> On Wed, Jun 24, 2015 at 03:54:57PM +0100, Marc Zyngier wrote:
>> Userspace is allowed to set the guest's view of CNTVCT, which turns
>> into setting CNTVOFF for the whole VM. One thing userspace is not supposed
>> to do is to update that register while the guest is running. Time will
>> either move forward (best case) or backward (really bad idea). Either way,
>> this shouldn't happen.
>>
>> This patch prevents userspace from touching CNTVOFF as soon as a vcpu
>> has been started. This ensures that time will keep monotonically increase.
>>
>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
>> ---
>>
>> QEMU seems to trigger this at boot time, and I have no idea why it does so.
>> It would be good to find out, hence the RFC tag.
> 
> Is this at kernel boot time you see this, or at system startup time?

When the kernel boots. The OSV guys also have seen this (time going
backward), and this patch seems to have solved their problem.

> IIRC, QEMU creates a throw-away VM with the default CPU target time,
> reads out all the system registers to get the KVM reset values of those,
> then creates the real VM, and feeds back in all the system register
> reset values, as a method for QEMU and KVM to be in sync about the reset
> state of the machine.  If we do this, and include CNTVCT, then that
> would probably trigger this, but the VCPU really shouldn't have been run
> at that time...

I definitely see this process happening, but that doesn't seem to be the
problem.

> We should prevent userspace from fiddling with this register post VCPU
> start regardless, but yes, it would be good to find out why this is
> happening in the first place.

My main problem is that I cannot easily correlate what is happening in
the guest at that time with something touching CNTVCT. That's just odd.

> How did you notice this and does it manifest itself in some user-visible
> ugliness?

It was first reported by the OSV guys, and I then spotted some
interesting hrtimers taking too long at guest boot time. Putting some
traces quickly identified the issue.

Thanks,

	M.
Claudio Fontana June 25, 2015, 8:59 a.m. UTC | #3
Hi Christoffer,

On 25.06.2015 10:04, Christoffer Dall wrote:
> On Wed, Jun 24, 2015 at 03:54:57PM +0100, Marc Zyngier wrote:
>> Userspace is allowed to set the guest's view of CNTVCT, which turns
>> into setting CNTVOFF for the whole VM. One thing userspace is not supposed
>> to do is to update that register while the guest is running. Time will
>> either move forward (best case) or backward (really bad idea). Either way,
>> this shouldn't happen.
>>
>> This patch prevents userspace from touching CNTVOFF as soon as a vcpu
>> has been started. This ensures that time will keep monotonically increase.
>>
>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
>> ---
>>
>> QEMU seems to trigger this at boot time, and I have no idea why it does so.
>> It would be good to find out, hence the RFC tag.
> 
> Is this at kernel boot time you see this, or at system startup time?
> 
> IIRC, QEMU creates a throw-away VM with the default CPU target time,
> reads out all the system registers to get the KVM reset values of those,
> then creates the real VM, and feeds back in all the system register
> reset values, as a method for QEMU and KVM to be in sync about the reset
> state of the machine.  If we do this, and include CNTVCT, then that
> would probably trigger this, but the VCPU really shouldn't have been run
> at that time...
> 
> We should prevent userspace from fiddling with this register post VCPU
> start regardless, but yes, it would be good to find out why this is
> happening in the first place.
> 
> How did you notice this and does it manifest itself in some user-visible
> ugliness?
> 
> Thanks,
> -Christoffer
> 


You can read the whole history here:

https://groups.google.com/forum/#!topic/osv-dev/2w101csH65E

It causes clock-related bugs with time jumping backward when relying on the virtual counter register in the guest, whenever a cpu is booted (primary, secondary via PSCI), and actually whenever the monitor is used to stop, info registers etc.

Once the VM is created, I think QEMU should not request kvm to change the virtual offset of the VM anymore: maybe an unexpected consequence of QEMU's target-arm/kvm64.c::kvm_arch_put_registers ?

Thanks,

Claudio
Peter Maydell June 25, 2015, 9:10 a.m. UTC | #4
On 25 June 2015 at 09:59, Claudio Fontana <claudio.fontana@huawei.com> wrote:
> Once the VM is created, I think QEMU should not request kvm to
> change the virtual offset of the VM anymore: maybe an unexpected
> consequence of QEMU's target-arm/kvm64.c::kvm_arch_put_registers ?

Hmm. In general we assume that we can:
 * stop the VM
 * read all the guest system registers
 * write those values back again
 * restart the VM

if we need to. Is that what's happening here, or are we doing
something odder?

-- PMM
Claudio Fontana June 25, 2015, 9:25 a.m. UTC | #5
On 25.06.2015 11:10, Peter Maydell wrote:
> On 25 June 2015 at 09:59, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>> Once the VM is created, I think QEMU should not request kvm to
>> change the virtual offset of the VM anymore: maybe an unexpected
>> consequence of QEMU's target-arm/kvm64.c::kvm_arch_put_registers ?
> 
> Hmm. In general we assume that we can:
>  * stop the VM
>  * read all the guest system registers
>  * write those values back again
>  * restart the VM
> 
> if we need to. Is that what's happening here, or are we doing
> something odder?
> 
> -- PMM
> 

What I guess could be happening by looking at the code in linux

virt/kvm/arm/arch_timer.c::kvm_arm_timer_set_reg

is that QEMU tries to set the KVM_REG_ARM_TIMER_CNT register from exactly the previous value,
but just because of the fact that the set function is called, cntvoff is updated,
since the value provided by the user is apparently assumed to be _relative_ to the physical timer.

This is apparent to me in the code in that function which says:

case KVM_REG_ARM_TIMER_CNT: {
/* ... */
    u64 cntvoff = kvm_phys_timer_read() - value;
/* ... */
}

And this is matched by the corresponding get function kvm_arm_timer_get_reg where it says:

case KVM_REG_ARM_TIMER_CNT:
   return kvm_phys_timer_read() - vcpu->kvm->arch.timer.cntvoff;

The time difference between when the GET is issued by QEMU and when the PUT is issued then would account for the difference.

Thanks,

Claudio
Jan Kiszka June 26, 2015, 4:49 a.m. UTC | #6
On 2015-06-25 11:25, Claudio Fontana wrote:
> On 25.06.2015 11:10, Peter Maydell wrote:
>> On 25 June 2015 at 09:59, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>>> Once the VM is created, I think QEMU should not request kvm to
>>> change the virtual offset of the VM anymore: maybe an unexpected
>>> consequence of QEMU's target-arm/kvm64.c::kvm_arch_put_registers ?
>>
>> Hmm. In general we assume that we can:
>>  * stop the VM
>>  * read all the guest system registers
>>  * write those values back again
>>  * restart the VM
>>
>> if we need to. Is that what's happening here, or are we doing
>> something odder?
>>
>> -- PMM
>>
> 
> What I guess could be happening by looking at the code in linux
> 
> virt/kvm/arm/arch_timer.c::kvm_arm_timer_set_reg
> 
> is that QEMU tries to set the KVM_REG_ARM_TIMER_CNT register from exactly the previous value,
> but just because of the fact that the set function is called, cntvoff is updated,
> since the value provided by the user is apparently assumed to be _relative_ to the physical timer.
> 
> This is apparent to me in the code in that function which says:
> 
> case KVM_REG_ARM_TIMER_CNT: {
> /* ... */
>     u64 cntvoff = kvm_phys_timer_read() - value;
> /* ... */
> }
> 
> And this is matched by the corresponding get function kvm_arm_timer_get_reg where it says:
> 
> case KVM_REG_ARM_TIMER_CNT:
>    return kvm_phys_timer_read() - vcpu->kvm->arch.timer.cntvoff;
> 
> The time difference between when the GET is issued by QEMU and when the PUT is issued then would account for the difference.

QEMU has the concept of write-back levels: KVM_PUT_RUNTIME_STATE,
KVM_PUT_RESET_STATE and KVM_PUT_FULL_STATE. I suspect this registers is
just sorted into the wrong category, thus written as part of the
RUNTIME_STATE. We had such bug patterns during the x86 maturing phase as
well.

Jan
Claudio Fontana June 29, 2015, 5:20 p.m. UTC | #7
On 26.06.2015 06:49, Jan Kiszka wrote:
> On 2015-06-25 11:25, Claudio Fontana wrote:
>> On 25.06.2015 11:10, Peter Maydell wrote:
>>> On 25 June 2015 at 09:59, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>>>> Once the VM is created, I think QEMU should not request kvm to
>>>> change the virtual offset of the VM anymore: maybe an unexpected
>>>> consequence of QEMU's target-arm/kvm64.c::kvm_arch_put_registers ?
>>>
>>> Hmm. In general we assume that we can:
>>>  * stop the VM
>>>  * read all the guest system registers
>>>  * write those values back again
>>>  * restart the VM
>>>
>>> if we need to. Is that what's happening here, or are we doing
>>> something odder?
>>>
>>> -- PMM
>>>
>>
>> What I guess could be happening by looking at the code in linux
>>
>> virt/kvm/arm/arch_timer.c::kvm_arm_timer_set_reg
>>
>> is that QEMU tries to set the KVM_REG_ARM_TIMER_CNT register from exactly the previous value,
>> but just because of the fact that the set function is called, cntvoff is updated,
>> since the value provided by the user is apparently assumed to be _relative_ to the physical timer.
>>
>> This is apparent to me in the code in that function which says:
>>
>> case KVM_REG_ARM_TIMER_CNT: {
>> /* ... */
>>     u64 cntvoff = kvm_phys_timer_read() - value;
>> /* ... */
>> }
>>
>> And this is matched by the corresponding get function kvm_arm_timer_get_reg where it says:
>>
>> case KVM_REG_ARM_TIMER_CNT:
>>    return kvm_phys_timer_read() - vcpu->kvm->arch.timer.cntvoff;
>>
>> The time difference between when the GET is issued by QEMU and when the PUT is issued then would account for the difference.
> 
> QEMU has the concept of write-back levels: KVM_PUT_RUNTIME_STATE,
> KVM_PUT_RESET_STATE and KVM_PUT_FULL_STATE. I suspect this registers is
> just sorted into the wrong category, thus written as part of the
> RUNTIME_STATE. We had such bug patterns during the x86 maturing phase as
> well.
> 
> Jan
> 

It seems that QEMU target-arm ignores the level parameter to kvm_arch_put_registers completely.

Is it intended?

Actually even the first set of the cntvoff is wrong, although it does not cause any problems.

When the VM is created, KVM already sets the cntvoff, so even putting the state once at RESET is I think wrong: the VM already has a CNTVOFF when it is created, so trying to change it afterwards even once should be avoided I think.

But if the RFC is committed as is to kvm, it's fine by me also.

Ciao

Claudio
Peter Maydell June 29, 2015, 5:37 p.m. UTC | #8
On 29 June 2015 at 18:20, Claudio Fontana <claudio.fontana@huawei.com> wrote:
> On 26.06.2015 06:49, Jan Kiszka wrote:
>> QEMU has the concept of write-back levels: KVM_PUT_RUNTIME_STATE,
>> KVM_PUT_RESET_STATE and KVM_PUT_FULL_STATE. I suspect this registers is
>> just sorted into the wrong category, thus written as part of the
>> RUNTIME_STATE. We had such bug patterns during the x86 maturing phase as
>> well.

> It seems that QEMU target-arm ignores the level parameter to
> kvm_arch_put_registers completely.
>
> Is it intended?

Yes, sort of. We don't in general know anything about the semantics
of most of the system registers. It should always be safe to
read them all out of the kernel and write them back...

-- PMM
Marc Zyngier July 8, 2015, 3:56 p.m. UTC | #9
On 29/06/15 18:37, Peter Maydell wrote:
> On 29 June 2015 at 18:20, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>> On 26.06.2015 06:49, Jan Kiszka wrote:
>>> QEMU has the concept of write-back levels: KVM_PUT_RUNTIME_STATE,
>>> KVM_PUT_RESET_STATE and KVM_PUT_FULL_STATE. I suspect this registers is
>>> just sorted into the wrong category, thus written as part of the
>>> RUNTIME_STATE. We had such bug patterns during the x86 maturing phase as
>>> well.
> 
>> It seems that QEMU target-arm ignores the level parameter to
>> kvm_arch_put_registers completely.
>>
>> Is it intended?
> 
> Yes, sort of. We don't in general know anything about the semantics
> of most of the system registers. It should always be safe to
> read them all out of the kernel and write them back...

I'm not sure you can safely assume this for time related things, unless
you can guarantee that all vcpus are stopped. Claudio is seeing time
jumping in weird ways, and so have I, which would tend to show that QEMU
is introducing some jitter.

Maybe not easily observable on real hardware, but the FastModel is
enough to show the issue.

So unless someone has a better solution, I'm seriously considering
getting this patch merged.

Thanks,

	M.
Peter Maydell July 8, 2015, 4:06 p.m. UTC | #10
On 8 July 2015 at 16:56, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 29/06/15 18:37, Peter Maydell wrote:
>> On 29 June 2015 at 18:20, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>>> On 26.06.2015 06:49, Jan Kiszka wrote:
>>>> QEMU has the concept of write-back levels: KVM_PUT_RUNTIME_STATE,
>>>> KVM_PUT_RESET_STATE and KVM_PUT_FULL_STATE. I suspect this registers is
>>>> just sorted into the wrong category, thus written as part of the
>>>> RUNTIME_STATE. We had such bug patterns during the x86 maturing phase as
>>>> well.
>>
>>> It seems that QEMU target-arm ignores the level parameter to
>>> kvm_arch_put_registers completely.
>>>
>>> Is it intended?
>>
>> Yes, sort of. We don't in general know anything about the semantics
>> of most of the system registers. It should always be safe to
>> read them all out of the kernel and write them back...
>
> I'm not sure you can safely assume this for time related things, unless
> you can guarantee that all vcpus are stopped. Claudio is seeing time
> jumping in weird ways, and so have I, which would tend to show that QEMU
> is introducing some jitter.
>
> Maybe not easily observable on real hardware, but the FastModel is
> enough to show the issue.
>
> So unless someone has a better solution, I'm seriously considering
> getting this patch merged.

I'd prefer it if somebody could investigate to see why QEMU
is actually doing this -- so far we just have speculation.

-- PMM
Marc Zyngier July 8, 2015, 4:37 p.m. UTC | #11
On 08/07/15 17:06, Peter Maydell wrote:
> On 8 July 2015 at 16:56, Marc Zyngier <marc.zyngier@arm.com> wrote:
>> On 29/06/15 18:37, Peter Maydell wrote:
>>> On 29 June 2015 at 18:20, Claudio Fontana <claudio.fontana@huawei.com> wrote:
>>>> On 26.06.2015 06:49, Jan Kiszka wrote:
>>>>> QEMU has the concept of write-back levels: KVM_PUT_RUNTIME_STATE,
>>>>> KVM_PUT_RESET_STATE and KVM_PUT_FULL_STATE. I suspect this registers is
>>>>> just sorted into the wrong category, thus written as part of the
>>>>> RUNTIME_STATE. We had such bug patterns during the x86 maturing phase as
>>>>> well.
>>>
>>>> It seems that QEMU target-arm ignores the level parameter to
>>>> kvm_arch_put_registers completely.
>>>>
>>>> Is it intended?
>>>
>>> Yes, sort of. We don't in general know anything about the semantics
>>> of most of the system registers. It should always be safe to
>>> read them all out of the kernel and write them back...
>>
>> I'm not sure you can safely assume this for time related things, unless
>> you can guarantee that all vcpus are stopped. Claudio is seeing time
>> jumping in weird ways, and so have I, which would tend to show that QEMU
>> is introducing some jitter.
>>
>> Maybe not easily observable on real hardware, but the FastModel is
>> enough to show the issue.
>>
>> So unless someone has a better solution, I'm seriously considering
>> getting this patch merged.
> 
> I'd prefer it if somebody could investigate to see why QEMU
> is actually doing this -- so far we just have speculation.

I'd prefer that too, but so far people seem to be more comfortable
waiting for the issue to fix itself. In the meantime, VMs are broken in
weird and wonderful ways, and I don't think the current status-quo helps
anyone.

	M.
Peter Maydell July 8, 2015, 7:13 p.m. UTC | #12
On 8 July 2015 at 17:37, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 08/07/15 17:06, Peter Maydell wrote:
>> I'd prefer it if somebody could investigate to see why QEMU
>> is actually doing this -- so far we just have speculation.
>
> I'd prefer that too, but so far people seem to be more comfortable
> waiting for the issue to fix itself. In the meantime, VMs are broken in
> weird and wonderful ways, and I don't think the current status-quo helps
> anyone.

Putting in a patch which might not be the right fix isn't
necessarily a good plan either...

Does has_run_once get cleared if we do a re-VCPU_INIT
of a CPU that's run before? (We need to allow rewriting
of guest state at that point so that "reset VM and
load migration state" behaves correctly.)

I suspect Jan is right and we really need to distinguish
the KVM_PUT_*_STATE levels in ARM QEMU. This probably
implies some kind of whitelist/override mechanism, since
by and large we neither know nor want to know the
semantics for system registers, we leave that up to the
kernel.

Q: if you have a running VM, and you pause it for
an hour, what should the CNTVCT register do? Presumably
it should not advance, but how do we arrange for that
to happen?

-- PMM
Christoffer Dall July 9, 2015, 10:22 a.m. UTC | #13
Hi Peter and Marc,

[cc'ing Paolo for his input on x86 timekeeping]

On Wed, Jul 08, 2015 at 08:13:59PM +0100, Peter Maydell wrote:
> On 8 July 2015 at 17:37, Marc Zyngier <marc.zyngier@arm.com> wrote:
> > On 08/07/15 17:06, Peter Maydell wrote:
> >> I'd prefer it if somebody could investigate to see why QEMU
> >> is actually doing this -- so far we just have speculation.
> >
> > I'd prefer that too, but so far people seem to be more comfortable
> > waiting for the issue to fix itself. In the meantime, VMs are broken in
> > weird and wonderful ways, and I don't think the current status-quo helps
> > anyone.
> 
> Putting in a patch which might not be the right fix isn't
> necessarily a good plan either...
> 
> Does has_run_once get cleared if we do a re-VCPU_INIT
> of a CPU that's run before? (We need to allow rewriting
> of guest state at that point so that "reset VM and
> load migration state" behaves correctly.)

no, it does not, has_run_once is set the first time a VCPU is run and is
currently *never* cleared.

> 
> I suspect Jan is right and we really need to distinguish
> the KVM_PUT_*_STATE levels in ARM QEMU. This probably
> implies some kind of whitelist/override mechanism, since
> by and large we neither know nor want to know the
> semantics for system registers, we leave that up to the
> kernel.
> 
> Q: if you have a running VM, and you pause it for
> an hour, what should the CNTVCT register do? Presumably
> it should not advance, but how do we arrange for that
> to happen?
> 

I think the CNTVCT should not advance when the VM is not scheduled, so
if we pause the VM or starve all the VCPUs for enough time, the guest
should not see time progressing, since otherwise the guest scheduler
cannot maintain fairness and you're bound to see spurious RCU stalls
etc.

That's exactly why a guest can read both a virtual and physical counter
and it is an area where you simply want some level of
paravirtualization.  I haven't studied how/if Linux deals with this at
all.

So I think adjusting CNTVOFF should be managed by the kernel for the
pause/starvation scenario (which I think Avi once told me x86 does too -
does anyone know the current state of the art?).

So the only situation where I think userspace should adjust the CNTVOFF
value is for migration where we are talking about a brand new VM with
has_run_once clear.

Thus, if we were designing this from scratch now, the API should
be to return an error when trying to set KVM_REG_ARM_TIMER_CNT after the
VM has run once, but it's too late for that as we would break userspace.
The best alternative IMHO would be to merge Marc's patch and fix CNTVOFF
in the kernel side as well, and finally also fix QEMU so that it doesn't
try to do the thing that future kernels will ignore.

-Christoffer
Peter Maydell July 9, 2015, 10:38 a.m. UTC | #14
On 9 July 2015 at 11:22, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> On Wed, Jul 08, 2015 at 08:13:59PM +0100, Peter Maydell wrote:
>> I suspect Jan is right and we really need to distinguish
>> the KVM_PUT_*_STATE levels in ARM QEMU. This probably
>> implies some kind of whitelist/override mechanism, since
>> by and large we neither know nor want to know the
>> semantics for system registers, we leave that up to the
>> kernel.
>>
>> Q: if you have a running VM, and you pause it for
>> an hour, what should the CNTVCT register do? Presumably
>> it should not advance, but how do we arrange for that
>> to happen?

> I think the CNTVCT should not advance when the VM is not scheduled, so
> if we pause the VM or starve all the VCPUs for enough time, the guest
> should not see time progressing, since otherwise the guest scheduler
> cannot maintain fairness and you're bound to see spurious RCU stalls
> etc.
>
> That's exactly why a guest can read both a virtual and physical counter
> and it is an area where you simply want some level of
> paravirtualization.  I haven't studied how/if Linux deals with this at
> all.
>
> So I think adjusting CNTVOFF should be managed by the kernel for the
> pause/starvation scenario (which I think Avi once told me x86 does too -
> does anyone know the current state of the art?).

Agreed that we should be aiming for same behaviour as x86
rather than reinventing the wheel...

> So the only situation where I think userspace should adjust the CNTVOFF
> value is for migration where we are talking about a brand new VM with
> has_run_once clear.

For incoming migration/loadvm the VM will be freshly *reset*,
but it will not be completely created from scratch. It's
possible for the user to run a VM for a bit, and then use
"loadvm" from the monitor to load an old snapshot. This
will do a qemu_system_reset() and then restore the state.
So in that case has_run_once won't be clear. (Would it be
OK to clear it when userspace does VCPU_INIT? What else
does it control?)

> Thus, if we were designing this from scratch now, the API should
> be to return an error when trying to set KVM_REG_ARM_TIMER_CNT after the
> VM has run once, but it's too late for that as we would break userspace.
> The best alternative IMHO would be to merge Marc's patch and fix CNTVOFF
> in the kernel side as well, and finally also fix QEMU so that it doesn't
> try to do the thing that future kernels will ignore.

Isn't Marc's patch causing us to return an error to userspace?
(It does a 'return -1', at least -- does that get ignored at
a higher level?)

In principle, if userspace does:
 * stop all vCPUs
 * read the counter values from each vCPU
 * write the counter values back to each vCPU
 * resume vCPUs

this shouldn't cause time to do funny things, should it?
(ie the problem only occurs if we try to write the vCPU
counter value while only one vCPU is stopped).

thanks
-- PMM
Jan Kiszka July 9, 2015, 10:40 a.m. UTC | #15
On 2015-07-09 12:22, Christoffer Dall wrote:
> Hi Peter and Marc,
> 
> [cc'ing Paolo for his input on x86 timekeeping]
> 
> On Wed, Jul 08, 2015 at 08:13:59PM +0100, Peter Maydell wrote:
>> On 8 July 2015 at 17:37, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>> On 08/07/15 17:06, Peter Maydell wrote:
>>>> I'd prefer it if somebody could investigate to see why QEMU
>>>> is actually doing this -- so far we just have speculation.
>>>
>>> I'd prefer that too, but so far people seem to be more comfortable
>>> waiting for the issue to fix itself. In the meantime, VMs are broken in
>>> weird and wonderful ways, and I don't think the current status-quo helps
>>> anyone.
>>
>> Putting in a patch which might not be the right fix isn't
>> necessarily a good plan either...
>>
>> Does has_run_once get cleared if we do a re-VCPU_INIT
>> of a CPU that's run before? (We need to allow rewriting
>> of guest state at that point so that "reset VM and
>> load migration state" behaves correctly.)
> 
> no, it does not, has_run_once is set the first time a VCPU is run and is
> currently *never* cleared.
> 
>>
>> I suspect Jan is right and we really need to distinguish
>> the KVM_PUT_*_STATE levels in ARM QEMU. This probably
>> implies some kind of whitelist/override mechanism, since
>> by and large we neither know nor want to know the
>> semantics for system registers, we leave that up to the
>> kernel.
>>
>> Q: if you have a running VM, and you pause it for
>> an hour, what should the CNTVCT register do? Presumably
>> it should not advance, but how do we arrange for that
>> to happen?
>>
> 
> I think the CNTVCT should not advance when the VM is not scheduled, so
> if we pause the VM or starve all the VCPUs for enough time, the guest
> should not see time progressing, since otherwise the guest scheduler
> cannot maintain fairness and you're bound to see spurious RCU stalls
> etc.
> 
> That's exactly why a guest can read both a virtual and physical counter
> and it is an area where you simply want some level of
> paravirtualization.  I haven't studied how/if Linux deals with this at
> all.
> 
> So I think adjusting CNTVOFF should be managed by the kernel for the
> pause/starvation scenario (which I think Avi once told me x86 does too -
> does anyone know the current state of the art?).
> 
> So the only situation where I think userspace should adjust the CNTVOFF
> value is for migration where we are talking about a brand new VM with
> has_run_once clear.
> 
> Thus, if we were designing this from scratch now, the API should
> be to return an error when trying to set KVM_REG_ARM_TIMER_CNT after the
> VM has run once, but it's too late for that as we would break userspace.
> The best alternative IMHO would be to merge Marc's patch and fix CNTVOFF
> in the kernel side as well, and finally also fix QEMU so that it doesn't
> try to do the thing that future kernels will ignore.

Fixing QEMU to only write on KVM_PUT_FULL_STATE - yes, that should be
done, but I don't think the approach for the kernel is generally right.
The kernel should not do any policing on user space requests to change
the VCPU or VM state unless

 - security is affected
 - userspace lacks information, thus cannot decide correctly
 - legacy userspace has a bug, we can detect it and want to fix that up
   without affecting future userspace that has a reason to do it
   differently

Regarding CNTVOFF, the first two criteria do not apply for sure. Maybe
the last one, don't know. Just think of the hypothetical scenario that a
userspace VM debugger wants to inject certain register manipulations. If
you block this by some hidden VM state like proposed, that feature would
no longer be implementable easily.

Jan
Christoffer Dall July 9, 2015, 12:05 p.m. UTC | #16
On Thu, Jul 09, 2015 at 11:38:40AM +0100, Peter Maydell wrote:
> On 9 July 2015 at 11:22, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> > On Wed, Jul 08, 2015 at 08:13:59PM +0100, Peter Maydell wrote:
> >> I suspect Jan is right and we really need to distinguish
> >> the KVM_PUT_*_STATE levels in ARM QEMU. This probably
> >> implies some kind of whitelist/override mechanism, since
> >> by and large we neither know nor want to know the
> >> semantics for system registers, we leave that up to the
> >> kernel.
> >>
> >> Q: if you have a running VM, and you pause it for
> >> an hour, what should the CNTVCT register do? Presumably
> >> it should not advance, but how do we arrange for that
> >> to happen?
> 
> > I think the CNTVCT should not advance when the VM is not scheduled, so
> > if we pause the VM or starve all the VCPUs for enough time, the guest
> > should not see time progressing, since otherwise the guest scheduler
> > cannot maintain fairness and you're bound to see spurious RCU stalls
> > etc.
> >
> > That's exactly why a guest can read both a virtual and physical counter
> > and it is an area where you simply want some level of
> > paravirtualization.  I haven't studied how/if Linux deals with this at
> > all.
> >
> > So I think adjusting CNTVOFF should be managed by the kernel for the
> > pause/starvation scenario (which I think Avi once told me x86 does too -
> > does anyone know the current state of the art?).
> 
> Agreed that we should be aiming for same behaviour as x86
> rather than reinventing the wheel...
> 
> > So the only situation where I think userspace should adjust the CNTVOFF
> > value is for migration where we are talking about a brand new VM with
> > has_run_once clear.
> 
> For incoming migration/loadvm the VM will be freshly *reset*,
> but it will not be completely created from scratch. It's
> possible for the user to run a VM for a bit, and then use
> "loadvm" from the monitor to load an old snapshot. This
> will do a qemu_system_reset() and then restore the state.
> So in that case has_run_once won't be clear. (Would it be
> OK to clear it when userspace does VCPU_INIT? What else
> does it control?)

We would have to introduce another flag.  It is currently used to only
map the VGIC resources once (we have to do this before running the VCPU
the first time because only then do we have the required info of where
to map the resources etc.), and we use it to enforce that the user
doesn't initialize a VCPU with one setting, and then initialize it with
a different setting afterwards (for example with power-on state or the
emulated CPU target).

> 
> > Thus, if we were designing this from scratch now, the API should
> > be to return an error when trying to set KVM_REG_ARM_TIMER_CNT after the
> > VM has run once, but it's too late for that as we would break userspace.
> > The best alternative IMHO would be to merge Marc's patch and fix CNTVOFF
> > in the kernel side as well, and finally also fix QEMU so that it doesn't
> > try to do the thing that future kernels will ignore.
> 
> Isn't Marc's patch causing us to return an error to userspace?
> (It does a 'return -1', at least -- does that get ignored at
> a higher level?)

I remembered it as it simply ignored it, but you're right, it returns -1
to userspace (which is a strange thing we do in arch_timer.c, I'm not
sure why we return -EPERM here - probably a bug).

So that can't work obviously, because it will regress userspace.

> 
> In principle, if userspace does:
>  * stop all vCPUs
>  * read the counter values from each vCPU
>  * write the counter values back to each vCPU
>  * resume vCPUs
> 
> this shouldn't cause time to do funny things, should it?

No, it shouldn't.

> (ie the problem only occurs if we try to write the vCPU
> counter value while only one vCPU is stopped).
> 

As I understand it, the problem is that if we ever run a VCPU after
reading the value, and write back the value afterwards, you potentially
make time go backwards and get inconsistent views of time from different
VCPUs because they may have read the time before/after updating the
CNTVOFF.

-Christoffer
Peter Maydell July 9, 2015, 12:07 p.m. UTC | #17
On 9 July 2015 at 13:05, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> As I understand it, the problem is that if we ever run a VCPU after
> reading the value, and write back the value afterwards, you potentially
> make time go backwards and get inconsistent views of time from different
> VCPUs because they may have read the time before/after updating the
> CNTVOFF.

Right, but I think if QEMU does that it's a bug (and more to
the point I don't entirely understand why we would do that
yet, even given that we don't have a distinction between
"registers to sync always" and "registers to sync only on
reset"...)

-- PMM
Christoffer Dall July 9, 2015, 12:08 p.m. UTC | #18
Hi Jan,

On Thu, Jul 09, 2015 at 12:40:39PM +0200, Jan Kiszka wrote:
> On 2015-07-09 12:22, Christoffer Dall wrote:
> > Hi Peter and Marc,
> > 
> > [cc'ing Paolo for his input on x86 timekeeping]
> > 
> > On Wed, Jul 08, 2015 at 08:13:59PM +0100, Peter Maydell wrote:
> >> On 8 July 2015 at 17:37, Marc Zyngier <marc.zyngier@arm.com> wrote:
> >>> On 08/07/15 17:06, Peter Maydell wrote:
> >>>> I'd prefer it if somebody could investigate to see why QEMU
> >>>> is actually doing this -- so far we just have speculation.
> >>>
> >>> I'd prefer that too, but so far people seem to be more comfortable
> >>> waiting for the issue to fix itself. In the meantime, VMs are broken in
> >>> weird and wonderful ways, and I don't think the current status-quo helps
> >>> anyone.
> >>
> >> Putting in a patch which might not be the right fix isn't
> >> necessarily a good plan either...
> >>
> >> Does has_run_once get cleared if we do a re-VCPU_INIT
> >> of a CPU that's run before? (We need to allow rewriting
> >> of guest state at that point so that "reset VM and
> >> load migration state" behaves correctly.)
> > 
> > no, it does not, has_run_once is set the first time a VCPU is run and is
> > currently *never* cleared.
> > 
> >>
> >> I suspect Jan is right and we really need to distinguish
> >> the KVM_PUT_*_STATE levels in ARM QEMU. This probably
> >> implies some kind of whitelist/override mechanism, since
> >> by and large we neither know nor want to know the
> >> semantics for system registers, we leave that up to the
> >> kernel.
> >>
> >> Q: if you have a running VM, and you pause it for
> >> an hour, what should the CNTVCT register do? Presumably
> >> it should not advance, but how do we arrange for that
> >> to happen?
> >>
> > 
> > I think the CNTVCT should not advance when the VM is not scheduled, so
> > if we pause the VM or starve all the VCPUs for enough time, the guest
> > should not see time progressing, since otherwise the guest scheduler
> > cannot maintain fairness and you're bound to see spurious RCU stalls
> > etc.
> > 
> > That's exactly why a guest can read both a virtual and physical counter
> > and it is an area where you simply want some level of
> > paravirtualization.  I haven't studied how/if Linux deals with this at
> > all.
> > 
> > So I think adjusting CNTVOFF should be managed by the kernel for the
> > pause/starvation scenario (which I think Avi once told me x86 does too -
> > does anyone know the current state of the art?).
> > 
> > So the only situation where I think userspace should adjust the CNTVOFF
> > value is for migration where we are talking about a brand new VM with
> > has_run_once clear.
> > 
> > Thus, if we were designing this from scratch now, the API should
> > be to return an error when trying to set KVM_REG_ARM_TIMER_CNT after the
> > VM has run once, but it's too late for that as we would break userspace.
> > The best alternative IMHO would be to merge Marc's patch and fix CNTVOFF
> > in the kernel side as well, and finally also fix QEMU so that it doesn't
> > try to do the thing that future kernels will ignore.
> 
> Fixing QEMU to only write on KVM_PUT_FULL_STATE - yes, that should be
> done, but I don't think the approach for the kernel is generally right.
> The kernel should not do any policing on user space requests to change
> the VCPU or VM state unless
> 
>  - security is affected
>  - userspace lacks information, thus cannot decide correctly
>  - legacy userspace has a bug, we can detect it and want to fix that up
>    without affecting future userspace that has a reason to do it
>    differently
> 
> Regarding CNTVOFF, the first two criteria do not apply for sure. Maybe
> the last one, don't know. Just think of the hypothetical scenario that a
> userspace VM debugger wants to inject certain register manipulations. If
> you block this by some hidden VM state like proposed, that feature would
> no longer be implementable easily.
> 
Hmm, I didn't think about this, I guess reverse execution and
record/replay scenarios would also want you to make time appear to be
going backwards?

But the third case kind of does apply, if we care about fixing legacy
userspace in the kernel.  We could define a new register index for the
ioctl that allows time to move forward that future userspace can use,
and impose the restriction of not making time go backwards with the
current index.  Would that break any currently working use cases on
arm64 with legacy userspace, such as migration?

Thanks,
-Christoffer
Christoffer Dall July 9, 2015, 12:24 p.m. UTC | #19
On Thu, Jul 09, 2015 at 01:07:24PM +0100, Peter Maydell wrote:
> On 9 July 2015 at 13:05, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> > As I understand it, the problem is that if we ever run a VCPU after
> > reading the value, and write back the value afterwards, you potentially
> > make time go backwards and get inconsistent views of time from different
> > VCPUs because they may have read the time before/after updating the
> > CNTVOFF.
> 
> Right, but I think if QEMU does that it's a bug (and more to
> the point I don't entirely understand why we would do that
> yet, even given that we don't have a distinction between
> "registers to sync always" and "registers to sync only on
> reset"...)
> 
I think we have evidence that it does that, but we don't know why/how.

-Christoffer
Christoffer Dall July 9, 2015, 2:17 p.m. UTC | #20
On Thu, Jul 09, 2015 at 02:24:06PM +0200, Christoffer Dall wrote:
> On Thu, Jul 09, 2015 at 01:07:24PM +0100, Peter Maydell wrote:
> > On 9 July 2015 at 13:05, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> > > As I understand it, the problem is that if we ever run a VCPU after
> > > reading the value, and write back the value afterwards, you potentially
> > > make time go backwards and get inconsistent views of time from different
> > > VCPUs because they may have read the time before/after updating the
> > > CNTVOFF.
> > 
> > Right, but I think if QEMU does that it's a bug (and more to
> > the point I don't entirely understand why we would do that
> > yet, even given that we don't have a distinction between
> > "registers to sync always" and "registers to sync only on
> > reset"...)
> > 
> I think we have evidence that it does that, but we don't know why/how.
> 
So I ran this through GDB, and this happens when the guest probes the
virtio devices, specifically the backtrace tells me that

virtio_current_cpu_endian () at /root/src/qemu/hw/virtio/virtio.c:594
=> cc->virtio_is_big_endian
  -> arm_cpu_is_big_endian
    -> cpu_synchronize_state
      -> kvm_cpu_synchronize_state

which causes cpu->kvm_vcpu_dirty = true, which causes the run-loop to
write the CNTVOFF on a per-vcpu basis without stopping anything, as far
as I can tell.

So yeah, I guess the only required fix here is to fix QEMU in some way
as to not fiddle with the timer registers in this way, and then I
honestly don't know if we should try to fix legacy userspace in the
kernel, but based on the feedback from Jan, I suppose not.

Thanks,
-Christoffer
Peter Maydell July 9, 2015, 2:26 p.m. UTC | #21
On 9 July 2015 at 15:17, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> On Thu, Jul 09, 2015 at 02:24:06PM +0200, Christoffer Dall wrote:
> So I ran this through GDB, and this happens when the guest probes the
> virtio devices, specifically the backtrace tells me that
>
> virtio_current_cpu_endian () at /root/src/qemu/hw/virtio/virtio.c:594
> => cc->virtio_is_big_endian
>   -> arm_cpu_is_big_endian
>     -> cpu_synchronize_state
>       -> kvm_cpu_synchronize_state
>
> which causes cpu->kvm_vcpu_dirty = true, which causes the run-loop to
> write the CNTVOFF on a per-vcpu basis without stopping anything, as far
> as I can tell.

Ah, I was wondering if it was going to turn out to be this,
but I hadn't figured out why it was going to cause us to do
a write-back of state rather than just a read.

> So yeah, I guess the only required fix here is to fix QEMU in some way
> as to not fiddle with the timer registers in this way, and then I
> honestly don't know if we should try to fix legacy userspace in the
> kernel, but based on the feedback from Jan, I suppose not.

arm_cpu_is_big_endian() doesn't actually want to write
back any state -- all it's interested in is reading.
So we really ought not to need to write anything back there.
kvm-all.c's sync functions don't seem to provide a "sync
kernel state to userspace but I promise I'm not going to
dirty it" function though. I guess we could add one.

(Overall it's kind of fragile even if we can avoid this
specific case where we're writing back the counter state,
though -- we should probably think about the sync level
stuff as well.)

-- PMM
Christoffer Dall July 9, 2015, 4:06 p.m. UTC | #22
On Thu, Jul 09, 2015 at 03:26:20PM +0100, Peter Maydell wrote:
> On 9 July 2015 at 15:17, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> > On Thu, Jul 09, 2015 at 02:24:06PM +0200, Christoffer Dall wrote:
> > So I ran this through GDB, and this happens when the guest probes the
> > virtio devices, specifically the backtrace tells me that
> >
> > virtio_current_cpu_endian () at /root/src/qemu/hw/virtio/virtio.c:594
> > => cc->virtio_is_big_endian
> >   -> arm_cpu_is_big_endian
> >     -> cpu_synchronize_state
> >       -> kvm_cpu_synchronize_state
> >
> > which causes cpu->kvm_vcpu_dirty = true, which causes the run-loop to
> > write the CNTVOFF on a per-vcpu basis without stopping anything, as far
> > as I can tell.
> 
> Ah, I was wondering if it was going to turn out to be this,
> but I hadn't figured out why it was going to cause us to do
> a write-back of state rather than just a read.
> 
> > So yeah, I guess the only required fix here is to fix QEMU in some way
> > as to not fiddle with the timer registers in this way, and then I
> > honestly don't know if we should try to fix legacy userspace in the
> > kernel, but based on the feedback from Jan, I suppose not.
> 
> arm_cpu_is_big_endian() doesn't actually want to write
> back any state -- all it's interested in is reading.
> So we really ought not to need to write anything back there.
> kvm-all.c's sync functions don't seem to provide a "sync
> kernel state to userspace but I promise I'm not going to
> dirty it" function though. I guess we could add one.
> 
> (Overall it's kind of fragile even if we can avoid this
> specific case where we're writing back the counter state,
> though -- we should probably think about the sync level
> stuff as well.)

Sounds to me like we should add the sync level stuff first, as it should
catch all callers, and afterwards consider adding the sync-one-way
optimization.

-Christoffer
diff mbox

Patch

diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
index 98c95f2..660f348 100644
--- a/virt/kvm/arm/arch_timer.c
+++ b/virt/kvm/arm/arch_timer.c
@@ -220,9 +220,26 @@  int kvm_arm_timer_set_reg(struct kvm_vcpu *vcpu, u64 regid, u64 value)
 	case KVM_REG_ARM_TIMER_CTL:
 		timer->cntv_ctl = value;
 		break;
-	case KVM_REG_ARM_TIMER_CNT:
-		vcpu->kvm->arch.timer.cntvoff = kvm_phys_timer_read() - value;
+	case KVM_REG_ARM_TIMER_CNT: {
+		struct kvm_vcpu *tmp;
+		int i;
+		u64 cntvoff = kvm_phys_timer_read() - value;
+
+		/*
+		 * If any of the vcpus has started, don't update
+		 * CNTVOFF. Userspace is severely brain damaged.
+		 */
+		kvm_for_each_vcpu(i, tmp, vcpu->kvm) {
+			if (tmp->arch.has_run_once) {
+				kvm_debug("Won't set CNTVOFF to %llx (%llx)\n",
+					  cntvoff,
+					  vcpu->kvm->arch.timer.cntvoff);
+				return -1;
+			}
+		}
+		vcpu->kvm->arch.timer.cntvoff = cntvoff;
 		break;
+	}
 	case KVM_REG_ARM_TIMER_CVAL:
 		timer->cntv_cval = value;
 		break;