diff mbox

[v2,01/14] KVM: x86: change PIT discard tick policy

Message ID 1455736496-374-2-git-send-email-rkrcmar@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Radim Krčmář Feb. 17, 2016, 7:14 p.m. UTC
Discard policy uses ack_notifiers to prevent injection of PIT interrupts
before EOI from the last one.

This patch changes the policy to always try to deliver the interrupt,
which makes a difference when its vector is in ISR.
Old implementation would drop the interrupt, but proposed one injects to
IRR, like real hardware would.

The old policy breaks legacy NMI watchdogs, where PIT is used through
virtual wire (LVT0): PIT never sends an interrupt before receiving EOI,
thus a guest deadlock with disabled interrupts will stop NMIs.

Note that NMI doesn't do EOI, so PIT also had to send a normal interrupt
through IOAPIC.  (KVM's PIT is deeply rotten and luckily not used much
in modern systems.)

Even though there is a chance of regressions, I think we can fix the
LVT0 NMI bug without introducing a new tick policy.

Cc: <stable@vger.kernel.org>
Reported-by: Yuki Shibuya <shibuya.yk@ncos.nec.co.jp>
Signed-off-by: Radim Kr?má? <rkrcmar@redhat.com>
---
 v2: resolve dependencies to make it the first patch

 arch/x86/kvm/i8254.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

Comments

Paolo Bonzini Feb. 18, 2016, 4:13 p.m. UTC | #1
On 17/02/2016 20:14, Radim Kr?má? wrote:
> Discard policy uses ack_notifiers to prevent injection of PIT interrupts
> before EOI from the last one.
> 
> This patch changes the policy to always try to deliver the interrupt,
> which makes a difference when its vector is in ISR.
> Old implementation would drop the interrupt, but proposed one injects to
> IRR, like real hardware would.

This seems like what libvirt calls the "merge" policy:

    Merge the missed tick(s) into one tick and inject. The guest time
    may be delayed, depending on how the OS reacts to the merging of
    ticks

where the merged tick is the one placed into IRR.  Unlike discard,
"merge" can starve the guest through an interrupt storm.

Rik, I think you originally worked on the missed tick policies in Xen.
Is the above correct?

If so, do you recall what would be the reason to use the merge policy
instead of the discard policy?

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Radim Krčmář Feb. 18, 2016, 4:56 p.m. UTC | #2
2016-02-18 17:13+0100, Paolo Bonzini:
> On 17/02/2016 20:14, Radim Kr?má? wrote:
>> Discard policy uses ack_notifiers to prevent injection of PIT interrupts
>> before EOI from the last one.
>> 
>> This patch changes the policy to always try to deliver the interrupt,
>> which makes a difference when its vector is in ISR.
>> Old implementation would drop the interrupt, but proposed one injects to
>> IRR, like real hardware would.
> 
> This seems like what libvirt calls the "merge" policy:

Oops, I never looked beyond QEMU after seeing that the naming in libvirt
doesn't even match ...

I think the policy that KVM implements (which I call discard) is "delay"
in libvirt.  (https://libvirt.org/formatdomain.html#elementsTime)

> 
>     Merge the missed tick(s) into one tick and inject. The guest time
>     may be delayed, depending on how the OS reacts to the merging of
>     ticks

The "may be delayed" there makes me feel like the timer has to support a
guest visible counter of missed ticks.
PIT will definitely be delayed if we get another tick while the previous
one is still in IRR and there is nothing that the guest can do with it.

"catchup" is the other KVM policy and "discard" also needs to allow the
guest to handle lost ticks.

> where the merged tick is the one placed into IRR.  Unlike discard,
> "merge" can starve the guest through an interrupt storm.

Yeah, starving a VCPU with an interrupt storm is more likely with the
changed policy.  It's a pretty sad situation if all the time that VCPU
gets isn't even enough to run a PIT handler, so I didn't care.

The NMI watchdog bug can also be solved without changing the policy.
(It's a hack in any case.)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini Feb. 18, 2016, 5:33 p.m. UTC | #3
On 18/02/2016 17:56, Radim Kr?má? wrote:
> 2016-02-18 17:13+0100, Paolo Bonzini:
> > On 17/02/2016 20:14, Radim Kr?má? wrote:
> > > Discard policy uses ack_notifiers to prevent injection of PIT interrupts
> > > before EOI from the last one.
> > > 
> > > This patch changes the policy to always try to deliver the interrupt,
> > > which makes a difference when its vector is in ISR.
> > > Old implementation would drop the interrupt, but proposed one injects to
> > > IRR, like real hardware would.
> > 
> > This seems like what libvirt calls the "merge" policy:
> 
> Oops, I never looked beyond QEMU after seeing that the naming in libvirt
> doesn't even match ...
> 
> I think the policy that KVM implements (which I call discard) is "delay"
> in libvirt.  (https://libvirt.org/formatdomain.html#elementsTime)

Suppose the scheduled ticks are at times 0, 20, 40, 60, 80.  The EOI for
time 0 is only delivered at time 42, other EOIs are timely.

The resulting injections are:

- for discard: 0, 60, 80.

- for catchup, which QEMU calls slew: 0, 42, 51, 60, 80.

- for merge: 0, 20 (in IRR, delivered at 42), 60, 80.

For delay I *think* it would be 0, 42, 62, 82, 102.

You know the i8254 code better than I do.  Does this make sense to you?
 (Or in other words, does the code *really* do the above?...)

> The "may be delayed" there makes me feel like the timer has to support a
> guest visible counter of missed ticks.

Yes, it depends whether the guest uses PIT to count time, or just to do
periodic stuff (and then it reads the time from e.g. the PMTimer).

In either case, only catchup ensures that time is not delayed, and it's
used for Windows which uses the RTC periodic clock to count time.

>> > where the merged tick is the one placed into IRR.  Unlike discard,
>> > "merge" can starve the guest through an interrupt storm.
> Yeah, starving a VCPU with an interrupt storm is more likely with the
> changed policy.  It's a pretty sad situation if all the time that VCPU
> gets isn't even enough to run a PIT handler, so I didn't care.

True.  On one hand the hardware policy is clearly merge, not discard.
The i8259 has an IRR!  On the other hand I'm a bit wary of changing the
policy without seeing exactly what the old OSes were doing in the PIT
handler.

> The NMI watchdog bug can also be solved without changing the policy.
> (It's a hack in any case.)

Can you send a patch for that?

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini Feb. 18, 2016, 5:55 p.m. UTC | #4
On 18/02/2016 18:33, Paolo Bonzini wrote:
> On 18/02/2016 17:56, Radim Kr?má? wrote:
>> 2016-02-18 17:13+0100, Paolo Bonzini:
>>> On 17/02/2016 20:14, Radim Kr?má? wrote:
>>>> Discard policy uses ack_notifiers to prevent injection of PIT interrupts
>>>> before EOI from the last one.
>>>>
>>>> This patch changes the policy to always try to deliver the interrupt,
>>>> which makes a difference when its vector is in ISR.
>>>> Old implementation would drop the interrupt, but proposed one injects to
>>>> IRR, like real hardware would.
>>>
>>> This seems like what libvirt calls the "merge" policy:
>>
>> Oops, I never looked beyond QEMU after seeing that the naming in libvirt
>> doesn't even match ...
>>
>> I think the policy that KVM implements (which I call discard) is "delay"
>> in libvirt.  (https://libvirt.org/formatdomain.html#elementsTime)
> 
> Suppose the scheduled ticks are at times 0, 20, 40, 60, 80.  The EOI for
> time 0 is only delivered at time 42, other EOIs are timely.
> 
> The resulting injections are:
> 
> - for discard: 0, 60, 80.
> 
> - for catchup, which QEMU calls slew: 0, 42, 51, 60, 80.
> 
> - for merge: 0, 20 (in IRR, delivered at 42), 60, 80.
> 
> For delay I *think* it would be 0, 42, 62, 82, 102.

Wrong: for delay it is something like 0, 42, 43, 60, 80.

Your patch does the right thing, QEMU is wrong in calling the policy
"discard" where it should have been "merge".  In fact both i8254 and RTC
use the same wrong nomenclature.

And it is indeed superfluous to use ack notifiers to implement the
default policy, as the default policy is already baked into the i8259.

Sorry, it shows that I'm swamped as I'm messing up things a bit lately.
 At least I have you to correct me. :)

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini Feb. 18, 2016, 6:49 p.m. UTC | #5
On 17/02/2016 20:14, Radim Kr?má? wrote:
> 
> Even though there is a chance of regressions, I think we can fix the
> LVT0 NMI bug without introducing a new tick policy.

I agree, but please change the KVM_CAP_REINJECT_CONTROL check to return
2.  This way we can make QEMU reject old kernels when the user provides
"-global kvm-pit.lost_tick_policy=merge".

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Radim Krčmář Feb. 19, 2016, 2:44 p.m. UTC | #6
[Cc'd Peter, the last guy that touched timers in libvirt, because he
 might know what tick policies are supposed to be.]

2016-02-18 18:55+0100, Paolo Bonzini:
> On 18/02/2016 18:33, Paolo Bonzini wrote:
>> On 18/02/2016 17:56, Radim Kr?má? wrote:
>>> 2016-02-18 17:13+0100, Paolo Bonzini:
>>>> On 17/02/2016 20:14, Radim Kr?má? wrote:
>>>>> Discard policy uses ack_notifiers to prevent injection of PIT interrupts
>>>>> before EOI from the last one.
>>>>>
>>>>> This patch changes the policy to always try to deliver the interrupt,
>>>>> which makes a difference when its vector is in ISR.
>>>>> Old implementation would drop the interrupt, but proposed one injects to
>>>>> IRR, like real hardware would.
>>>>
>>>> This seems like what libvirt calls the "merge" policy:
>>>
>>> Oops, I never looked beyond QEMU after seeing that the naming in libvirt
>>> doesn't even match ...
>>>
>>> I think the policy that KVM implements (which I call discard) is "delay"
>>> in libvirt.  (https://libvirt.org/formatdomain.html#elementsTime)

(I looked at libvirt code, but couldn't find any use of merge or discard
 policies, so please bear with me as I disagree wherever it's possible.)

>> Suppose the scheduled ticks are at times 0, 20, 40, 60, 80.  The EOI for
>> time 0 is only delivered at time 42, other EOIs are timely.
>> 
>> The resulting injections are:
>> - for catchup, which QEMU calls slew: 0, 42, 51, 60, 80.
>> 
>> - for merge: 0, 20 (in IRR, delivered at 42), 60, 80.
>> 
>> For delay I *think* it would be 0, 42, 62, 82, 102.

I could call this "delay".

  Continue to deliver ticks at the normal rate. The guest time will be
  delayed due to the late tick

At 82 time units, the guest thinks it's 60, so the guest will do
everything late.  (Leading us to call it delayed?!)

Few examples of "delay" that I find easier to accept:
 0, 60, 80.
 0, 42, 60, 80.  Because we haven't missed the tick at 20, it just took
                 a while to be delivered.  (Semantics ...)

> Wrong: for delay it is something like 0, 42, 43, 60, 80.

Aargh!  One KVM policy does this and QEMU calls it 'delay'.  I think
that libvirt would call it "catchup".

  Deliver ticks at a higher rate to catch up with the missed tick. The
  guest time should not be delayed once catchup is complete.

At 80, the guest time is 80;  no signs of delay.

> Your patch does the right thing, QEMU is wrong in calling the policy
> "discard" where it should have been "merge".  In fact both i8254 and RTC
> use the same wrong nomenclature.

Terminlogy does suck. (Maybe it stems from the fact that QEMU talks
about lost ticks, but libvirt about ticks?)
Nevertheless, I don't think that libvirt "merge" covers what PIT does in
KVM or real hardware.

  Merge the missed tick(s) into one tick and inject. The guest time may
  be delayed, depending on how the OS reacts to the merging of ticks

No merging is happening in KVM or real hardware:  every tick is exactly
one tick, so the guest cannot tell that we missed some ticks and the
time is delayed.  If a tick made it into clear IRR, it's not missed.

In the example:

>> - for merge: 0, 20 (in IRR, delivered at 42), 60, 80.

at 80, the guest thinks it's 60.

I think that merge might do:  0, 42, 60, 80.
But the tick at 42 is counted as two ticks (20, 40) in the guest.

The main problem of this interpretation is that discard is a subset of
merge:

>> - for discard: 0, 60, 80.

The tick at 60 has to be counted as three ticks (20, 40, 60).

*throws hands into the air and runs in circles*
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Krempa Feb. 25, 2016, 12:34 p.m. UTC | #7
On Fri, Feb 19, 2016 at 15:44:23 +0100, Radim Kr?má? wrote:
> [Cc'd Peter, the last guy that touched timers in libvirt, because he
>  might know what tick policies are supposed to be.]

I found the following RFC that describes the design of timer access in
libvirt:

http://www.redhat.com/archives/libvir-list/2010-March/msg00304.html

> 
> 2016-02-18 18:55+0100, Paolo Bonzini:
> > On 18/02/2016 18:33, Paolo Bonzini wrote:
> >> On 18/02/2016 17:56, Radim Kr?má? wrote:
> >>> 2016-02-18 17:13+0100, Paolo Bonzini:
> >>>> On 17/02/2016 20:14, Radim Kr?má? wrote:
> >>>>> Discard policy uses ack_notifiers to prevent injection of PIT interrupts
> >>>>> before EOI from the last one.
> >>>>>
> >>>>> This patch changes the policy to always try to deliver the interrupt,
> >>>>> which makes a difference when its vector is in ISR.
> >>>>> Old implementation would drop the interrupt, but proposed one injects to
> >>>>> IRR, like real hardware would.
> >>>>
> >>>> This seems like what libvirt calls the "merge" policy:
> >>>
> >>> Oops, I never looked beyond QEMU after seeing that the naming in libvirt
> >>> doesn't even match ...
> >>>
> >>> I think the policy that KVM implements (which I call discard) is "delay"
> >>> in libvirt.  (https://libvirt.org/formatdomain.html#elementsTime)
> 
> (I looked at libvirt code, but couldn't find any use of merge or discard
>  policies, so please bear with me as I disagree wherever it's possible.)

Indeed, it looks like it never was implemented. Unfortunately i'm not
able to assist more in this case.

Peter
Paolo Bonzini Feb. 25, 2016, 1:38 p.m. UTC | #8
On 19/02/2016 15:44, Radim Kr?má? wrote:
>> The resulting injections are:
>>> - for catchup, which QEMU calls slew: 0, 42, 51, 60, 80.

I think we agree that 0, 42, 43, 60, 80 is also a "catchup"/"slew"
policy.  So we can change QEMU's kvm-i8254 to accept "slew" and warn if
"delay" is given.

In fact "slew" means "a large number or quantity of something" and
indeed that's a good word to characterize kvm-i8254's reinjection behavior.

>>> - for merge: 0, 20 (in IRR, delivered at 42), 60, 80.
>>>
>>> For delay I *think* it would be 0, 42, 62, 82, 102.
> 
> I could call this "delay".
> 
>   Continue to deliver ticks at the normal rate. The guest time will be
>   delayed due to the late tick
> 
> At 82 time units, the guest thinks it's 60, so the guest will do
> everything late.  (Leading us to call it delayed?!)

Yup, this was my reasoning.

> Few examples of "delay" that I find easier to accept:
>  0, 60, 80.

This is "discard".

>  0, 42, 60, 80.  Because we haven't missed the tick at 20, it just took
>                  a while to be delivered.  (Semantics ...)

This is not discard.  The ideal implementation of the tick policies is
that the timer devices enjoy information from the interrupt controller,
that lets them know when a tick cannot be delivered.  In that case they
do stuff like:

- save it for later (catchup)

- drop it for good (which is discard, and not the same as stashing it in
IRR)

- pause the timer (delay)

- coalesce the ticks into one late tick (merge)

> Terminlogy does suck. (Maybe it stems from the fact that QEMU talks
> about lost ticks, but libvirt about ticks?)
> Nevertheless, I don't think that libvirt "merge" covers what PIT does in
> KVM or real hardware.
> 
>   Merge the missed tick(s) into one tick and inject. The guest time may
>   be delayed, depending on how the OS reacts to the merging of ticks
> 
> No merging is happening in KVM or real hardware:  every tick is exactly
> one tick, so the guest cannot tell that we missed some ticks and the
> time is delayed.  If a tick made it into clear IRR, it's not missed.

The libvirt documentation is written from the point of view of the
guest, ignoring whether the late tick is recorded in some guest-visible
register (IRR) or in the processor state (as is the case for NMI) or in
the timer device state or who knows where.

Therefore, it _also_ happens that thanks to IRR and NMI latching you can
implement "merge" without having that kind of relationship between the
timer device and the interrupt controller.  But libvirt doesn't care,
all the <timer> element wants is the guest-visible behavior in terms of
when the ticks are delivered.

This was my reasoning when proposing to change QEMU regarding "discard"
as well:

- RTC would warn for "discard" and accept "merge"

- kvm-i8254 would warn for "discard", and accept "merge" if the
capability says that your patch is in.  The idea is that "discard" is
such a bad idea, that "merge" should fail if the hypervisor would cause
the watchdog to hang.

> In the example:
> 
>>> - for merge: 0, 20 (in IRR, delivered at 42), 60, 80.
> 
> at 80, the guest thinks it's 60.
> 
> I think that merge might do:  0, 42, 60, 80.
> But the tick at 42 is counted as two ticks (20, 40) in the guest.

Yes, merge is a good policy if the guest can somehow realize that 42
stood for (20, 40).  Discard would be a good policy too if the guest can
somehow realize that 60 stood for (20, 40, 60) but it has the problem
that NMIs don't do EOIs.

> The main problem of this interpretation is that discard is a subset of
> merge:
> 
>>> - for discard: 0, 60, 80.
> 
> The tick at 60 has to be counted as three ticks (20, 40, 60).

Why is it a problem?

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Radim Krčmář Feb. 25, 2016, 5:36 p.m. UTC | #9
I think I understand your definitions of tick policies and I remain
convinced that libvirt defines them differently.

Would you prefer that I got a confirmation on the libvirt list before
continuing the discussion here?

Two relevant parts of my reply that don't depend on libvirt's reply are
marked with ** .

2016-02-25 14:38+0100, Paolo Bonzini:
> On 19/02/2016 15:44, Radim Kr?má? wrote:
>>> The resulting injections are:
>>>> - for catchup, which QEMU calls slew: 0, 42, 51, 60, 80.
> 
> I think we agree that 0, 42, 43, 60, 80 is also a "catchup"/"slew"
> policy.

We do.  (Libvirt "catchup" is also QEMU "delay".)

>          So we can change QEMU's kvm-i8254 to accept "slew" and warn if
> "delay" is given.
**
QEMU 4e4fa398db69 ("qdev: Introduce lost tick policy property") defines:

  delay   - replay all lost ticks in a row once the guest accepts them
            again
  slew    - lost ticks are gradually replayed at a higher frequency than
            the original tick

"delay" is exactly how kvm-i8254 behaves (in its "reinject" mode), so I
think we shouldn't change it.

> In fact "slew" means "a large number or quantity of something" and
> indeed that's a good word to characterize kvm-i8254's reinjection behavior.

(Isn't it a verb, with a similar meaning as "drift"? ;])

>>>> - for merge: 0, 20 (in IRR, delivered at 42), 60, 80.
>>>>
>>>> For delay I *think* it would be 0, 42, 62, 82, 102.
>> 
>> I could call this "delay".
>> 
>>   Continue to deliver ticks at the normal rate. The guest time will be
>>   delayed due to the late tick
>> 
>> At 82 time units, the guest thinks it's 60, so the guest will do
>> everything late.  (Leading us to call it delayed?!)
> 
> Yup, this was my reasoning.
> 
>> Few examples of "delay" that I find easier to accept:
>>  0, 60, 80.
> 
> This is "discard".

At 80, the guest thinks that the time is 40, so every action it does
will still be delayed.  I don't see why it isn't libvirt "delay":
 - ticks are delivered at the normal rate
 - guest time is delayed

(The original RFC that Peter shared called this policy "none".)

I don't think it is libvirt "discard":
 - missed ticks were thrown away
 - future injection continues normally

which is fine, but
 - the guest time is delayed, because there isn't a way to handle lost
   ticks

and this is incompatible with libvirt's definition of "discard"

  The guest time may be delayed, unless the OS has explicit handling of
  lost ticks.

"may" doesn't fit.  You can only say
 - the guest time is delayed.

which is best described by "delay".

>>  0, 42, 60, 80.  Because we haven't missed the tick at 20, it just took
>>                  a while to be delivered.  (Semantics ...)
> 
> This is not discard.  The ideal implementation of the tick policies is
> that the timer devices enjoy information from the interrupt controller,
> that lets them know when a tick cannot be delivered.  In that case they
> do stuff like:

Yes.  The case where the timer doesn't know about missed ticks also
deserves a tick policy and I think that libvirt calls it "delay", QEMU
"discard".

> - save it for later (catchup)
> 
> - drop it for good (which is discard, and not the same as stashing it in
> IRR)
> 
> - pause the timer (delay)
> 
> - coalesce the ticks into one late tick (merge)
> 
>> Terminlogy does suck. (Maybe it stems from the fact that QEMU talks
>> about lost ticks, but libvirt about ticks?)
>> Nevertheless, I don't think that libvirt "merge" covers what PIT does in
>> KVM or real hardware.
>> 
>>   Merge the missed tick(s) into one tick and inject. The guest time may
>>   be delayed, depending on how the OS reacts to the merging of ticks
>> 
>> No merging is happening in KVM or real hardware:  every tick is exactly
>> one tick, so the guest cannot tell that we missed some ticks and the
>> time is delayed.  If a tick made it into clear IRR, it's not missed.
> 
> The libvirt documentation is written from the point of view of the
> guest, ignoring whether the late tick is recorded in some guest-visible
> register (IRR) or in the processor state (as is the case for NMI) or in
> the timer device state or who knows where.

Yes.

> Therefore, it _also_ happens that thanks to IRR and NMI latching you can
> implement "merge" without having that kind of relationship between the
> timer device and the interrupt controller.

I disagree.  IRR can catch at most one interrupt, so it is insufficient
to implement libvirt's merge.  (libvirt's merge also has the conditional
"The guest time may be delayed".)

>                                             But libvirt doesn't care,
> all the <timer> element wants is the guest-visible behavior in terms of
> when the ticks are delivered.

Yes, and the guest can't behave like the libvirt definition says.

Libvirt mainly cares if guest's time was irreversibly delayed or not and
for how long can the time be delayed.

(3 of 4 policies in libvirt don't need to delay the time.
 Only 1 policy in QEMU lets the guest recover time.  2 if you say that
 libvirt "merge" = QEMU "merge" and keep pit as QEMU "discard".)

> This was my reasoning when proposing to change QEMU regarding "discard"
> as well:
> 
> - RTC would warn for "discard" and accept "merge"
> 
> - kvm-i8254 would warn for "discard", and accept "merge" if the
> capability says that your patch is in.  The idea is that "discard" is
> such a bad idea, that "merge" should fail if the hypervisor would cause
> the watchdog to hang.
**
I think it's better to say that the tick in IRR wasn't lost and policy
is "discard".
  discard - ignore lost ticks (e.g. if the guest compensates for them
            already)

(We don't need to say that a tick is lost just because it couldn't be
 delivered within some interval.)

"merge" also fits the behavior of kvm-i8254 if you treat IRR as a part
of the timer policy.
  merge   - if multiple ticks are lost, all of them are merged into one
            which is replayed once the guest accepts it again

(It means that every tick was "lost" and then replayed, usually right
 after the tick was "lost" -- this is because you can be only be merging
 lost ticks, so if there could be a non-lost tick in IRR while no tick
 was being handled, e.g. disabled interrupts, and you wanted to deliver
 a next one, then you'd need to have another bit that remembers lost
 ticks.)

>> In the example:
>> 
>>>> - for merge: 0, 20 (in IRR, delivered at 42), 60, 80.
>> 
>> at 80, the guest thinks it's 60.
>> 
>> I think that merge might do:  0, 42, 60, 80.
>> But the tick at 42 is counted as two ticks (20, 40) in the guest.
> 
> Yes, merge is a good policy if the guest can somehow realize that 42
> stood for (20, 40).  Discard would be a good policy too if the guest can
> somehow realize that 60 stood for (20, 40, 60) but it has the problem
> that NMIs don't do EOIs.

Yes, recognizing lost ticks is a prerequisite for catchup/delay/discard.
Standard i8254 & friends aren't able to implement any policy other than
libvirt "delay", because there is no notion of lost ticks.
(KVM's changes to EOI notifiers were very shortsighted.)

>> The main problem of this interpretation is that discard is a subset of
>> merge:
>> 
>>>> - for discard: 0, 60, 80.
>> 
>> The tick at 60 has to be counted as three ticks (20, 40, 60).
> 
> Why is it a problem?

Defining a subset without using the superset is a major flaw in a spec.
It got clarified in the link that Peter gave, where merge delivers
"asap" while discard just waits for the next tick. :)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini Feb. 25, 2016, 7:11 p.m. UTC | #10
> 2016-02-25 14:38+0100, Paolo Bonzini:
>> On 19/02/2016 15:44, Radim Kr?má? wrote:
>>>> The resulting injections are:
>>>>> - for catchup, which QEMU calls slew: 0, 42, 51, 60, 80.
>>
>> I think we agree that 0, 42, 43, 60, 80 is also a "catchup"/"slew"
>> policy.
> 
> We do.  (Libvirt "catchup" is also QEMU "delay".)
> 
>>          So we can change QEMU's kvm-i8254 to accept "slew" and warn if
>> "delay" is given.
> **
> QEMU 4e4fa398db69 ("qdev: Introduce lost tick policy property") defines:
> 
>   delay   - replay all lost ticks in a row once the guest accepts them
>             again
>   slew    - lost ticks are gradually replayed at a higher frequency than
>             the original tick
> 
> "delay" is exactly how kvm-i8254 behaves (in its "reinject" mode), so I
> think we shouldn't change it.

Ooh, I missed this commit message indeed.  Then libvirt delay != QEMU
delay, isn't it?

>> In fact "slew" means "a large number or quantity of something" and
>> indeed that's a good word to characterize kvm-i8254's reinjection behavior.
> 
> (Isn't it a verb, with a similar meaning as "drift"? ;])

It's a noun too, like "I've just gotten a whole slew of bugs assigned to
me".

>>> Few examples of "delay" that I find easier to accept:
>>>  0, 60, 80.
>>
>> This is "discard".
> 
> At 80, the guest thinks that the time is 40, so every action it does
> will still be delayed.  I don't see why it isn't libvirt "delay":
>  - ticks are delivered at the normal rate
>  - guest time is delayed

I can buy this. :)

> I don't think it is libvirt "discard":
>  - missed ticks were thrown away
>  - future injection continues normally
> 
> which is fine, but
>  - the guest time is delayed, because there isn't a way to handle lost
>    ticks
> 
> and this is incompatible with libvirt's definition of "discard"
> 
>   The guest time may be delayed, unless the OS has explicit handling of
>   lost ticks.
> 
> "may" doesn't fit.  You can only say
>  - the guest time is delayed.
> 
> which is best described by "delay".

I think we can safely ignore the "may be" -- you cannot say for sure
that the guest time "will" be delayed since you could always have a very
enlightened guest.

... but then, by removing the handwavy "may be" would you say that
libvirt delay and libvirt discard are the same?  Would 0, 42, 62, 82 be
a valid implementation of the libvirt "delay" policy?

>> Therefore, it _also_ happens that thanks to IRR and NMI latching you can
>> implement "merge" without having that kind of relationship between the
>> timer device and the interrupt controller.
> 
> I disagree.  IRR can catch at most one interrupt, so it is insufficient
> to implement libvirt's merge.  (libvirt's merge also has the conditional
> "The guest time may be delayed".)

Hmm... is your point that the i8254 _alone_ is implementing discard, and
the tick delivery time is _actually_ 0, 20, 60, 80 (and the t=20 tick is
delivered late but not lost due to the i8259 buffer)?  And hence the
QEMU device model should see it as discard.  I can definitely agree with
that.

There is still the matter of:

- improving the documentation

- clarifying the meaning of libvirt delay

- deciding whether it's worth changing the meaning of QEMU delay to
match libvirt's (and the default kvm-pit policy from delay to slew)

But if we can agree on this, I can apply patch 1 as is, even for 4.5.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Radim Krčmář Feb. 26, 2016, 1:44 p.m. UTC | #11
2016-02-25 20:11+0100, Paolo Bonzini:
>> 2016-02-25 14:38+0100, Paolo Bonzini:
>>> On 19/02/2016 15:44, Radim Kr?má? wrote:
>>>          So we can change QEMU's kvm-i8254 to accept "slew" and warn if
>>> "delay" is given.
>> **
>> QEMU 4e4fa398db69 ("qdev: Introduce lost tick policy property") defines:
>> 
>>   delay   - replay all lost ticks in a row once the guest accepts them
>>             again
>>   slew    - lost ticks are gradually replayed at a higher frequency than
>>             the original tick
>> 
>> "delay" is exactly how kvm-i8254 behaves (in its "reinject" mode), so I
>> think we shouldn't change it.
> 
> Ooh, I missed this commit message indeed.  Then libvirt delay != QEMU
> delay, isn't it?

Exactly, it's sad.  libvirt delay -> QEMU discard.

(Lost) tick policy relations look like this
   libvirt  | QEMU
   catchup -> delay
   delay   -> discard
   discard -> n/a
   catchup -> slew
   delay   -> merge (?)
   merge   -> n/a (?)

Delay, discard, and merge are too ambiguous to make sense.

>> [...]
>> and this is incompatible with libvirt's definition of "discard"
>> 
>>   The guest time may be delayed, unless the OS has explicit handling of
>>   lost ticks.
>> 
>> "may" doesn't fit.  You can only say
>>  - the guest time is delayed.
>> 
>> which is best described by "delay".
> 
> I think we can safely ignore the "may be" -- you cannot say for sure
> that the guest time "will" be delayed since you could always have a very
> enlightened guest.
> ... but then, by removing the handwavy "may be" would you say that
> libvirt delay and libvirt discard are the same?

I would, which is why the "may be" is significant -- the timer has to
provide tools to the guest, even if the guest ignores them.

Do you agree that following rephrasing is identical to libvirt discard?
  Lost ticks will delay the guest time, unless the guest OS has explicit
  handling of lost ticks.

>                                                  Would 0, 42, 62, 82 be
> a valid implementation of the libvirt "delay" policy?

Distinguishing it from saner variants (0, 60, 80 or 0, 20/42, 60, 80)
would be nice, because shifting the phase is not "continuing at normal
rate" for me, but I'd frown and accept it ...
The important part is that guest time never recovers from the lost tick.

>>> Therefore, it _also_ happens that thanks to IRR and NMI latching you can
>>> implement "merge" without having that kind of relationship between the
>>> timer device and the interrupt controller.
>> 
>> I disagree.  IRR can catch at most one interrupt, so it is insufficient
>> to implement libvirt's merge.  (libvirt's merge also has the conditional
>> "The guest time may be delayed".)
> 
> Hmm... is your point that the i8254 _alone_ is implementing discard, and
> the tick delivery time is _actually_ 0, 20, 60, 80 (and the t=20 tick is
> delivered late but not lost due to the i8259 buffer)?  And hence the
> QEMU device model should see it as discard.  I can definitely agree with
> that.

Yes.  Only the tick at 40 is lost.

> There is still the matter of:
> 
> - improving the documentation

Absolutely, this email thread shouldn't have existed.

QEMU merge policy isn't defined de facto ... do we say that it falls
into libvirt delay?  (Or just remove it?)

  merge   - if multiple ticks are lost, all of them are merged into one
            which is replayed once the guest accepts it again

> - clarifying the meaning of libvirt delay

Yes, I wouldn't mind excluding 0, 42, 62, 82. :)

> - deciding whether it's worth changing the meaning of QEMU delay to
> match libvirt's (and the default kvm-pit policy from delay to slew)

Changing QEMU's lost_tick_policy seems like a recipe for confusion.
It'd be nice to unify QEMU and libvirt terms, but QEMU does care about
backward compatibility and I think it is not wise to do this change
without a new interface.  I'd rather just document and forget. :)

> But if we can agree on this, I can apply patch 1 as is, even for 4.5.

I think we agree on all parts that affect this series.
I'll start preparing v3.  (Likely posting on Tuesday/Wednesday.)

Thank you.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index b0ea42b78ccd..ab5318727579 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -245,7 +245,7 @@  static void kvm_pit_ack_irq(struct kvm_irq_ack_notifier *kian)
 		 * PIC is being reset.  Handle it gracefully here
 		 */
 		atomic_inc(&ps->pending);
-	else if (value > 0)
+	else if (value > 0 && ps->reinject)
 		/* in this case, we had multiple outstanding pit interrupts
 		 * that we needed to inject.  Reinject
 		 */
@@ -288,7 +288,9 @@  static void pit_do_work(struct kthread_work *work)
 	 * last one has been acked.
 	 */
 	spin_lock(&ps->inject_lock);
-	if (ps->irq_ack) {
+	if (!ps->reinject)
+		inject = 1;
+	else if (ps->irq_ack) {
 		ps->irq_ack = 0;
 		inject = 1;
 	}
@@ -317,10 +319,10 @@  static enum hrtimer_restart pit_timer_fn(struct hrtimer *data)
 	struct kvm_kpit_state *ps = container_of(data, struct kvm_kpit_state, timer);
 	struct kvm_pit *pt = ps->kvm->arch.vpit;
 
-	if (ps->reinject || !atomic_read(&ps->pending)) {
+	if (ps->reinject)
 		atomic_inc(&ps->pending);
-		queue_kthread_work(&pt->worker, &pt->expired);
-	}
+
+	queue_kthread_work(&pt->worker, &pt->expired);
 
 	if (ps->is_periodic) {
 		hrtimer_add_expires_ns(&ps->timer, ps->period);