
[v2] xen/arm: implement GICD_I[S/C]ACTIVER reads

Message ID 20200327023451.20271-1-sstabellini@kernel.org (mailing list archive)
State New, archived
Series [v2] xen/arm: implement GICD_I[S/C]ACTIVER reads

Commit Message

Stefano Stabellini March 27, 2020, 2:34 a.m. UTC
This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER
reads. It doesn't take into account the latest state of interrupts on
other vCPUs. Only the current vCPU is up-to-date. A full solution is
not possible because it would require synchronization among all vCPUs,
which would be very expensive in terms or latency.

Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
CC: Wei Xu <xuwei5@hisilicon.com>
CC: Peng Fan <peng.fan@nxp.com>
---
Changes in v2:
- improve commit message
- do not invert result
- use 1U
- use common pattern with vgic_rank_offset
- move implementation into a separate function called vgic_isactiver
- add vgic2 implementation
- tested on vgic2 by hacking the Linux gicv2 driver
---
 xen/arch/arm/vgic-v2.c     | 13 +++++++++++--
 xen/arch/arm/vgic-v3.c     | 15 ++++++++++++---
 xen/arch/arm/vgic.c        | 20 ++++++++++++++++++++
 xen/include/asm-arm/vgic.h |  1 +
 4 files changed, 44 insertions(+), 5 deletions(-)
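
As a rough illustration of the approach (a hedged sketch, not the actual
diff; irq_to_pending() and GIC_IRQ_GUEST_ACTIVE exist in Xen's vGIC, but
the plumbing around them is assumed here):

    /* Build the I[S/C]ACTIVER value for the 32 interrupts starting at
     * 'first_irq', using only the state Xen has recorded for this vCPU. */
    static uint32_t vgic_isactiver(struct vcpu *v, unsigned int first_irq)
    {
        uint32_t reg = 0;
        unsigned int i;

        for ( i = 0; i < 32; i++ )
        {
            struct pending_irq *p = irq_to_pending(v, first_irq + i);

            /* Reflect only what the current vCPU knows to be active. */
            if ( p != NULL && test_bit(GIC_IRQ_GUEST_ACTIVE, &p->status) )
                reg |= 1U << i;
        }

        return reg;
    }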

Comments

Julien Grall March 28, 2020, 11:52 a.m. UTC | #1
Hi Stefano,

On 27/03/2020 02:34, Stefano Stabellini wrote:
> This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER
> reads. It doesn't take into account the latest state of interrupts on
> other vCPUs. Only the current vCPU is up-to-date. A full solution is
> not possible because it would require synchronization among all vCPUs,
> which would be very expensive in terms or latency.

Your sentence suggests you have numbers showing that correctly emulating 
the registers would be too slow. Mind sharing them?

Cheers,
Stefano Stabellini March 30, 2020, 4:35 p.m. UTC | #2
On Sat, 28 Mar 2020, Julien Grall wrote:
> Hi Stefano,
> 
> On 27/03/2020 02:34, Stefano Stabellini wrote:
> > This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER
> > reads. It doesn't take into account the latest state of interrupts on
> > other vCPUs. Only the current vCPU is up-to-date. A full solution is
> > not possible because it would require synchronization among all vCPUs,
> > which would be very expensive in terms or latency.
> 
> Your sentence suggests you have numbers showing that correctly emulating the
> registers would be too slow. Mind sharing them?

No, I don't have any numbers. Would you prefer a different wording or a
better explanation? I also realized there is a typo in there (or/of).
Julien Grall March 30, 2020, 9:19 p.m. UTC | #3
Hi Stefano,

On 30/03/2020 17:35, Stefano Stabellini wrote:
> On Sat, 28 Mar 2020, Julien Grall wrote:
>> Hi Stefano,
>>
>> On 27/03/2020 02:34, Stefano Stabellini wrote:
>>> This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER
>>> reads. It doesn't take into account the latest state of interrupts on
>>> other vCPUs. Only the current vCPU is up-to-date. A full solution is
>>> not possible because it would require synchronization among all vCPUs,
>>> which would be very expensive in terms or latency.
>>
>> Your sentence suggests you have numbers showing that correctly emulating the
>> registers would be too slow. Mind sharing them?
> 
> No, I don't have any numbers. Would you prefer a different wording or a
> better explanation? I also realized there is a typo in there (or/of).
Let me start by saying that I think correctness is more important than speed.
So I would have expected your commit message to contain some facts about why 
synchronization is going to be slow and why this is a problem.

To give you a concrete example, the implementation of set/way 
instructions is really slow (it could take a few seconds depending on 
the setup). However, this was fine because not implementing them 
correctly would have a greater impact on the guest (corruption) and they 
are not used often.

I don't think the performance in our case will be in the same order of 
magnitude. It is most likely to be in the range of milliseconds (if not 
less) which I think is acceptable for emulation (particularly for the 
vGIC) and the current uses.

So let's take a step back and look at how we could implement 
ISACTIVER/ICACTIVER correctly.

The only thing we need is a snapshot of the interrupt state at a given 
point. I originally thought it would be necessary to use domain_pause(), 
which is quite heavy, but I think we only need the vCPU to enter Xen 
and sync the state of the LRs.

Roughly the code would look like (This is not optimized):

     for_each_vcpu(d, v)
     {
        if ( v->is_running )
          smp_call_function(do_nothing, v->cpu);
     }

     /* Read the state */

A few remarks:
    * The function do_nothing() is basically a NOP.
    * I am suggesting to use smp_call_function() rather than 
smp_send_event_check_cpu() because we need to know when the vCPU has 
synchronized the LRs. As the function do_nothing() will be called 
afterwards, we know that the snapshot of the LRs has been taken
    * It would be possible to do everything from one vCPU.
    * We can possibly optimize it for the SGIs/PPIs case

This is still not perfect, but I don't think the performance would be 
that bad.
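
A slightly more fleshed-out sketch of the same idea (hedged:
on_selected_cpus() is Xen's existing cross-CPU call primitive, but the
helper names here are made up):

    /* Entering Xen on the target pCPU is the whole point: the LRs are
     * synchronized with the software view of the vGIC on entry. */
    static void do_nothing(void *unused)
    {
    }

    static void vgic_snapshot_active(struct domain *d)
    {
        struct vcpu *v;

        for_each_vcpu ( d, v )
        {
            /* Only running vCPUs can hold stale state in their LRs. */
            if ( v->is_running )
                on_selected_cpus(cpumask_of(v->processor), do_nothing,
                                 NULL, 1 /* wait for completion */);
        }

        /* Every vCPU has now trapped at least once, so the recorded
         * active state forms a coherent snapshot and can be read. */
    }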

Cheers,
Julien Grall March 31, 2020, 12:05 a.m. UTC | #4
Hi,

On 27/03/2020 02:34, Stefano Stabellini wrote:
> It doesn't take into account the latest state of interrupts on
> other vCPUs.

So I think your implementation is going to introduce a deadlock in the 
guest. Let's imagine a guest with 2 vCPUs (A and B) with the following 
setup:
     * The HW SPI 32 is routed to and serviced by vCPU B.
     * vCPU A will routinely wait for any pending SPI 32 to be finished 
before performing a specific task.

In the current form of the vGIC, vCPU B will not exit to Xen when SPI 32 
has been deactivated. Instead, the vCPU will continue to run until an 
unrelated trap happens (I/O emulation, IRQs...). Depending on your setup 
(think NULL scheduler), this may not happen for a really long time (or ever).

Until vCPU B exits to Xen, SPI 32 may be considered active. 
Therefore vCPU A will keep waiting and stay blocked until vCPU B finally 
traps into Xen.

My example above is basically a cut down version of 
__synchronize_hardirq() in Linux. In practice, you may be lucky most of 
the time because traps happen from time to time. However, it 
also means the task you need to perform on vCPU A will be delayed.
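
For reference, the guest-side wait is roughly the following (a simplified
model of Linux's __synchronize_hardirq(); the helper name is illustrative,
and GICD_ISACTIVERn lives at offset 0x300 in the GICD frame):

    #define GICD_ISACTIVER(n)   (0x300 + 4 * (n))

    /* Spin until the interrupt is no longer reported active.  If the
     * hypervisor's view of the other vCPU's active state is never
     * refreshed, this loop never terminates. */
    static void wait_for_irq_inactive(void __iomem *gicd, unsigned int irq)
    {
        uint32_t mask = 1U << (irq % 32);

        while ( readl(gicd + GICD_ISACTIVER(irq / 32)) & mask )
            cpu_relax();
    }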

So I would not really bet on the trap here. You have two options:
      1) Force the vCPU to trap when deactivating an interrupt
      2) Force the vCPUs to exit when reading I{S,C}ACTIVER

1) will incur a cost on every interrupt, which is not great. So I think 
your best option is 2) here.

Cheers,
Stefano Stabellini April 1, 2020, 12:57 a.m. UTC | #5
On Mon, 30 Mar 2020, Julien Grall wrote:
> Hi Stefano,
> 
> On 30/03/2020 17:35, Stefano Stabellini wrote:
> > On Sat, 28 Mar 2020, Julien Grall wrote:
> > > Hi Stefano,
> > > 
> > > On 27/03/2020 02:34, Stefano Stabellini wrote:
> > > > This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER
> > > > reads. It doesn't take into account the latest state of interrupts on
> > > > other vCPUs. Only the current vCPU is up-to-date. A full solution is
> > > > not possible because it would require synchronization among all vCPUs,
> > > > which would be very expensive in terms or latency.
> > > 
> > > Your sentence suggests you have numbers showing that correctly emulating
> > > the
> > > registers would be too slow. Mind sharing them?
> > 
> > No, I don't have any numbers. Would you prefer a different wording or a
> > better explanation? I also realized there is a typo in there (or/of).
> Let me start by saying that I think correctness is more important than speed.
> So I would have expected your commit message to contain some facts about why
> synchronization is going to be slow and why this is a problem.
> 
> To give you a concrete example, the implementation of set/way instructions is
> really slow (it could take a few seconds depending on the setup). However,
> this was fine because not implementing them correctly would have a greater
> impact on the guest (corruption) and they are not used often.
> 
> I don't think the performance in our case will be in the same order of magnitude. It
> is most likely to be in the range of milliseconds (if not less) which I think
> is acceptable for emulation (particularly for the vGIC) and the current uses.

Writing on the mailing list some of our discussions today.

Correctness is not just in terms of compliance to a specification but it
is also about not breaking guests. Introducing latency in the range of
milliseconds, or hundreds of microseconds, would break any latency
sensitive workloads. We don't have numbers so we don't know for certain
the effect that your suggestion would have.

It would be interesting to have those numbers, and I'll add to my TODO
list to run the experiments you suggested, but I'll put it on the
back-burner (from a Xilinx perspective it is low priority as no
customers are affected.)


> So let's take a step back and look at how we could implement
> ISACTIVER/ICACTIVER correctly.
> 
> The only thing we need is a snapshot of the interrupt state at a given point. I
> originally thought it would be necessary to use domain_pause(), which is quite
> heavy, but I think we only need the vCPU to enter Xen and sync the state
> of the LRs.
> 
> Roughly the code would look like (This is not optimized):
> 
>     for_each_vcpu(d, v)
>     {
>        if ( v->is_running )
>          smp_call_function(do_nothing, v->cpu);
>     }
> 
>     /* Read the state */
> 
> A few remarks:
>    * The function do_nothing() is basically a NOP.
>    * I am suggesting to use smp_call_function() rather than
> smp_send_event_check_cpu() because we need to know when the vCPU has
> synchronized the LRs. As the function do_nothing() will be called afterwards,
> we know that the snapshot of the LRs has been taken
>    * It would be possible to do everything from one vCPU.
>    * We can possibly optimize it for the SGIs/PPIs case
> 
> This is still not perfect, but I don't think the performance would be that
> bad.
Julien Grall April 1, 2020, 8:30 a.m. UTC | #6
On 01/04/2020 01:57, Stefano Stabellini wrote:
> On Mon, 30 Mar 2020, Julien Grall wrote:
>> Hi Stefano,
>>
>> On 30/03/2020 17:35, Stefano Stabellini wrote:
>>> On Sat, 28 Mar 2020, Julien Grall wrote:
>>>> Hi Stefano,
>>>>
>>>> On 27/03/2020 02:34, Stefano Stabellini wrote:
>>>>> This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER
>>>>> reads. It doesn't take into account the latest state of interrupts on
>>>>> other vCPUs. Only the current vCPU is up-to-date. A full solution is
>>>>> not possible because it would require synchronization among all vCPUs,
>>>>> which would be very expensive in terms or latency.
>>>>
>>>> Your sentence suggests you have numbers showing that correctly emulating
>>>> the
>>>> registers would be too slow. Mind sharing them?
>>>
>>> No, I don't have any numbers. Would you prefer a different wording or a
>>> better explanation? I also realized there is a typo in there (or/of).
>> Let me start by saying that I think correctness is more important than speed.
>> So I would have expected your commit message to contain some facts about why
>> synchronization is going to be slow and why this is a problem.
>>
>> To give you a concrete example, the implementation of set/way instructions is
>> really slow (it could take a few seconds depending on the setup). However,
>> this was fine because not implementing them correctly would have a greater
>> impact on the guest (corruption) and they are not used often.
>>
>> I don't think the performance in our case will be in the same order of magnitude. It
>> is most likely to be in the range of milliseconds (if not less) which I think
>> is acceptable for emulation (particularly for the vGIC) and the current uses.
> 
> Writing on the mailing list some of our discussions today.
> 
> Correctness is not just in terms of compliance to a specification but it
> is also about not breaking guests. Introducing latency in the range of
> milliseconds, or hundreds of microseconds, would break any latency
> sensitive workloads. We don't have numbers so we don't know for certain
> the effect that your suggestion would have.

You missed part of the discussion. I don't disagree that latency is 
important. However, if an implementation is only 95% reliable, then it 
means 5% of the time your guest may break (corruption, crash, 
deadlock...). At which point the latency is the last of your concern.

> 
> It would be interesting to have those numbers, and I'll add to my TODO
> list to run the experiments you suggested, but I'll put it on the
> back-burner (from a Xilinx perspective it is low priority as no
> customers are affected.)

How about we get a correct implementation merged first and then discuss 
optimization? This would allow the community to check whether there is 
actually noticeable latency in their workloads.

Cheers,
Bertrand Marquis April 1, 2020, 9:54 a.m. UTC | #7
On 1 Apr 2020, at 09:30, Julien Grall <julien@xen.org> wrote:



On 01/04/2020 01:57, Stefano Stabellini wrote:
On Mon, 30 Mar 2020, Julien Grall wrote:
Hi Stefano,

On 30/03/2020 17:35, Stefano Stabellini wrote:
On Sat, 28 Mar 2020, Julien Grall wrote:
Hi Stefano,

On 27/03/2020 02:34, Stefano Stabellini wrote:
This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER
reads. It doesn't take into account the latest state of interrupts on
other vCPUs. Only the current vCPU is up-to-date. A full solution is
not possible because it would require synchronization among all vCPUs,
which would be very expensive in terms or latency.

Your sentence suggests you have numbers showing that correctly emulating
the
registers would be too slow. Mind sharing them?

No, I don't have any numbers. Would you prefer a different wording or a
better explanation? I also realized there is a typo in there (or/of).
Let me start by saying that I think correctness is more important than speed.
So I would have expected your commit message to contain some facts about why
synchronization is going to be slow and why this is a problem.

To give you a concrete example, the implementation of set/way instructions is
really slow (it could take a few seconds depending on the setup). However,
this was fine because not implementing them correctly would have a greater
impact on the guest (corruption) and they are not used often.

I don't think the performance in our case will be in the same order of magnitude. It
is most likely to be in the range of milliseconds (if not less) which I think
is acceptable for emulation (particularly for the vGIC) and the current uses.
Writing on the mailing list some of our discussions today.
Correctness is not just in terms of compliance to a specification but it
is also about not breaking guests. Introducing latency in the range of
milliseconds, or hundreds of microseconds, would break any latency
sensitive workloads. We don't have numbers so we don't know for certain
the effect that your suggestion would have.

You missed part of the discussion. I don't disagree that latency is important. However, if an implementation is only 95% reliable, then it means 5% of the time your guest may break (corruption, crash, deadlock...). At which point the latency is the last of your concern.

It would be interesting to have those numbers, and I'll add to my TODO
list to run the experiments you suggested, but I'll put it on the
back-burner (from a Xilinx perspective it is low priority as no
customers are affected.)

How about we get a correct implementation merged first and then discuss optimization? This would allow the community to check whether there is actually noticeable latency in their workloads.

Hi,

I am not sure that pushing something with a performance impact now, to fix it later, is the right approach here.

The patch is an improvement compared to the current code and it can be further improved later to handle more cases (other cores).

If we really have to sync all vCPUs here, it will cost a lot, and the result will still be a status from the past, because nothing will make sure that the value is still valid by the time the guest gets it back.

Cheers

--
Bertrand
George Dunlap April 1, 2020, 5:02 p.m. UTC | #8
On Wed, Apr 1, 2020 at 10:54 AM Bertrand Marquis <Bertrand.Marquis@arm.com>
wrote:

>
>
> On 1 Apr 2020, at 09:30, Julien Grall <julien@xen.org> wrote:
>
>
>
> On 01/04/2020 01:57, Stefano Stabellini wrote:
>
> On Mon, 30 Mar 2020, Julien Grall wrote:
>
> Hi Stefano,
>
> On 30/03/2020 17:35, Stefano Stabellini wrote:
>
> On Sat, 28 Mar 2020, Julien Grall wrote:
>
> Hi Stefano,
>
> On 27/03/2020 02:34, Stefano Stabellini wrote:
>
> This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER
> reads. It doesn't take into account the latest state of interrupts on
> other vCPUs. Only the current vCPU is up-to-date. A full solution is
> not possible because it would require synchronization among all vCPUs,
> which would be very expensive in terms or latency.
>
>
> Your sentence suggests you have numbers showing that correctly emulating
> the
> registers would be too slow. Mind sharing them?
>
>
> No, I don't have any numbers. Would you prefer a different wording or a
> better explanation? I also realized there is a typo in there (or/of).
>
> Let me start by saying that I think correctness is more important than speed.
> So I would have expected your commit message to contain some facts about why
> synchronization is going to be slow and why this is a problem.
>
> To give you a concrete example, the implementation of set/way instructions
> is really slow (it could take a few seconds depending on the setup). However,
> this was fine because not implementing them correctly would have a greater
> impact on the guest (corruption) and they are not used often.
>
> I don't think the performance in our case will be in the same order of
> magnitude. It
> is most likely to be in the range of milliseconds (if not less) which I
> think
> is acceptable for emulation (particularly for the vGIC) and the current
> uses.
>
> Writing on the mailing list some of our discussions today.
> Correctness is not just in terms of compliance to a specification but it
> is also about not breaking guests. Introducing latency in the range of
> milliseconds, or hundreds of microseconds, would break any latency
> sensitive workloads. We don't have numbers so we don't know for certain
> the effect that your suggestion would have.
>
>
> You missed part of the discussion. I don't disagree that latency is
> important. However, if an implementation is only 95% reliable, then it
> means 5% of the time your guest may break (corruption, crash, deadlock...).
> At which point the latency is the last of your concern.
>
> It would be interesting to have those numbers, and I'll add to my TODO
> list to run the experiments you suggested, but I'll put it on the
> back-burner (from a Xilinx perspective it is low priority as no
> customers are affected.)
>
>
> How about we get a correct implementation merged first and then discuss
> optimization? This would allow the community to check whether there
> is actually noticeable latency in their workloads.
>
>
> Hi,
>
> I am not sure that pushing something with a performance impact now, to
> fix it later, is the right approach here.
>
> The patch is an improvement compared to the current code and it can be
> further improved later to handle more cases (other cores).
>
> If we really have to sync all vCPUs here, it will cost a lot, and the
> result will still be a status from the past, because nothing will
> make sure that the value is still valid by the time the guest gets it back.
>

The same is true on real hardware, right?

Looking at this discussion as a non-ARM person, it sounds a bit like ARM
specified this in a way that was useless-but-easy-to-implement-so-why-not;
but it turns out to be useless-but-hard-to-implement virtualized.

On the one hand, I'm sympathetic to Julien's point of view; I very much
don't like the idea of silently changing behavior just because the
specified behavior is inconvenient for us.

On the other hand, I'm sympathetic to Stefano's point of view, that it's
pointless to introduce a load of overhead and jitter to implement behavior
which nobody in their right mind would even want to use (at least
virtualized).

What I think would be *ideal* would be for ARM to change the specification
to make it easier to virtualize.  For instance, by specifying that the
register *may* contain information about other cores, but need not; or that
the register will contain information on other cores on real hardware, but
not when virtualized.

Barring that, I think we should have a centralized place to document
deviations from the spec; and I think changes to this spec should be
coordinated with KVM (and maybe ACRN?), to minimize hypervisor-specific
deviations.

Thoughts?

 -George
Julien Grall April 1, 2020, 5:23 p.m. UTC | #9
On 01/04/2020 10:54, Bertrand Marquis wrote:
> 
> 
>> On 1 Apr 2020, at 09:30, Julien Grall <julien@xen.org> wrote:
>>
>>
>>
>> On 01/04/2020 01:57, Stefano Stabellini wrote:
>>> On Mon, 30 Mar 2020, Julien Grall wrote:
>>>> Hi Stefano,
>>>>
>>>> On 30/03/2020 17:35, Stefano Stabellini wrote:
>>>>> On Sat, 28 Mar 2020, Julien Grall wrote:
>>>>>> Hi Stefano,
>>>>>>
>>>>>> On 27/03/2020 02:34, Stefano Stabellini wrote:
>>>>>>> This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER
>>>>>>> reads. It doesn't take into account the latest state of interrupts on
>>>>>>> other vCPUs. Only the current vCPU is up-to-date. A full solution is
>>>>>>> not possible because it would require synchronization among all 
>>>>>>> vCPUs,
>>>>>>> which would be very expensive in terms or latency.
>>>>>>
>>>>>> Your sentence suggests you have numbers showing that correctly
>>>>>> emulating
>>>>>> the
>>>>>> registers would be too slow. Mind sharing them?
>>>>>
>>>>> No, I don't have any numbers. Would you prefer a different wording or a
>>>>> better explanation? I also realized there is a typo in there (or/of).
>>>> Let me start by saying that I think correctness is more important than speed.
>>>> So I would have expected your commit message to contain some facts about why
>>>> synchronization is going to be slow and why this is a problem.
>>>>
>>>> To give you a concrete example, the implementation of set/way 
>>>> instructions is
>>>> really slow (it could take a few seconds depending on the setup). 
>>>> However,
>>>> this was fine because not implementing them correctly would have a 
>>>> greater
>>>> impact on the guest (corruption) and they are not used often.
>>>>
>>>> I don't think the performance in our case will be in the same order 
>>>> of magnitude. It
>>>> is most likely to be in the range of milliseconds (if not less) 
>>>> which I think
>>>> is acceptable for emulation (particularly for the vGIC) and the 
>>>> current uses.
>>> Writing on the mailing list some of our discussions today.
>>> Correctness is not just in terms of compliance to a specification but it
>>> is also about not breaking guests. Introducing latency in the range of
>>> milliseconds, or hundreds of microseconds, would break any latency
>>> sensitive workloads. We don't have numbers so we don't know for certain
>>> the effect that your suggestion would have.
>>
>> You missed part of the discussion. I don't disagree that latency is 
>> important. However, if an implementation is only 95% reliable, then it 
>> means 5% of the time your guest may break (corruption, crash, 
>> deadlock...). At which point the latency is the last of your concern.
>>
>>> It would be interesting to have those numbers, and I'll add to my TODO
>>> list to run the experiments you suggested, but I'll put it on the
>>> back-burner (from a Xilinx perspective it is low priority as no
>>> customers are affected.)
>>
>> How about we get a correct implementation merged first and then discuss 
>> optimization? This would allow the community to check whether 
>> there is actually noticeable latency in their workloads.
> 
> Hi,

Hi,

> I am not sure that pushing something with a performance impact now, to 
> fix it later, is the right approach here.
> 
> The patch is an improvement compared to the current code and it can be 
> further improved later to handle more cases (other cores).

If you read my other answer on this patch you will see that this is 
going to introduce a deadlock in the guest using multiple vCPUs. How is 
it an improvement compared to today?

> If we really have to sync all vCPUs here, it will cost a lot, and the 
> result will still be a status from the past, because nothing will 
> make sure that the value is still valid by the time the guest gets it back.

I hope you are aware the problem is exactly the same on the hardware. As 
soon as you read ISACTIVER, the value may not be correct 
anymore. So I don't see the issue with having an outdated value here.

As I said to Stefano yesterday, I would be happy to compromise as long 
as the implementation does not introduce an outright deadlock in the guest.

At the moment, I don't have a better approach than kicking all the vCPUs. 
Feel free to suggest a better one.

Cheers,
Julien Grall April 1, 2020, 5:56 p.m. UTC | #10
On 01/04/2020 18:02, George Dunlap wrote:
> 
> 
> On Wed, Apr 1, 2020 at 10:54 AM Bertrand Marquis
> <Bertrand.Marquis@arm.com> wrote:
> 
> 
> 
>>     On 1 Apr 2020, at 09:30, Julien Grall <julien@xen.org> wrote:
>>
>>
>>
>>     On 01/04/2020 01:57, Stefano Stabellini wrote:
>>>     On Mon, 30 Mar 2020, Julien Grall wrote:
>>>>     Hi Stefano,
>>>>
>>>>     On 30/03/2020 17:35, Stefano Stabellini wrote:
>>>>>     On Sat, 28 Mar 2020, Julien Grall wrote:
>>>>>>     Hi Stefano,
>>>>>>
>>>>>>     On 27/03/2020 02:34, Stefano Stabellini wrote:
>>>>>>>     This is a simple implementation of GICD_ICACTIVER /
>>>>>>>     GICD_ISACTIVER
>>>>>>>     reads. It doesn't take into account the latest state of
>>>>>>>     interrupts on
>>>>>>>     other vCPUs. Only the current vCPU is up-to-date. A full
>>>>>>>     solution is
>>>>>>>     not possible because it would require synchronization among
>>>>>>>     all vCPUs,
>>>>>>>     which would be very expensive in terms or latency.
>>>>>>
>>>>>>     Your sentence suggests you have numbers showing that correctly
>>>>>>     emulating
>>>>>>     the
>>>>>>     registers would be too slow. Mind sharing them?
>>>>>
>>>>>     No, I don't have any numbers. Would you prefer a different
>>>>>     wording or a
>>>>>     better explanation? I also realized there is a typo in there
>>>>>     (or/of).
>>>>     Let me start by saying that I think correctness is more important
>>>>     than speed. So I would have expected your commit message to contain
>>>>     some facts about why
>>>>     synchronization is going to be slow and why this is a problem.
>>>>
>>>>     To give you a concrete example, the implementation of set/way
>>>>     instructions is
>>>>     really slow (it could take a few seconds depending on the
>>>>     setup). However,
>>>>     this was fine because not implementing them correctly would have
>>>>     a greater
>>>>     impact on the guest (corruption) and they are not used often.
>>>>
>>>>     I don't think the performance in our case will be in the same order
>>>>     of magnitude. It
>>>>     is most likely to be in the range of milliseconds (if not less)
>>>>     which I think
>>>>     is acceptable for emulation (particularly for the vGIC) and the
>>>>     current uses.
>>>     Writing on the mailing list some of our discussions today.
>>>     Correctness is not just in terms of compliance to a specification
>>>     but it
>>>     is also about not breaking guests. Introducing latency in the
>>>     range of
>>>     milliseconds, or hundreds of microseconds, would break any latency
>>>     sensitive workloads. We don't have numbers so we don't know for
>>>     certain
>>>     the effect that your suggestion would have.
>>
>>     You missed part of the discussion. I don't disagree that latency
>>     is important. However, if an implementation is only 95% reliable,
>>     then it means 5% of the time your guest may break (corruption,
>>     crash, deadlock...). At which point the latency is the last of
>>     your concern.
>>
>>>     It would be interesting to have those numbers, and I'll add to my
>>>     TODO
>>>     list to run the experiments you suggested, but I'll put it on the
>>>     back-burner (from a Xilinx perspective it is low priority as no
>>>     customers are affected.)
>>
>>     How about we get a correct implementation merged first and then
>>     discuss optimization? This would allow the community to
>>     check whether there is actually noticeable latency in their workloads.
> 
>     Hi,
> 
>     I am not sure that pushing something with a performance impact now,
>     to fix it later, is the right approach here.
> 
>     The patch is an improvement compared to the current code and it can
>     be further improved later to handle more cases (other cores).
> 
>     If we really have to sync all vCPUs here, it will cost a lot, and
>     the result will still be a status from the past, because nothing
>     will make sure that the value is still valid by the time the guest
>     gets it back.
> 
> 
> The same is true on real hardware, right?
> 
> Looking at this discussion as a non-ARM person, it sounds a bit like ARM 
> specified this in a way that was 
> useless-but-easy-to-implement-so-why-not; but it turns out to be 
> useless-but-hard-to-implement virtualized.
> 
> On the one hand, I'm sympathetic to Julien's point of view; I very much 
> don't like the idea of silently changing behavior just because the 
> specified behavior is inconvenient for us.
> 
> On the other hand, I'm sympathetic to Stefano's point of view, that it's 
> pointless to introduce a load of overhead and jitter to implement 
> behavior which nobody in their right mind would even want to use (at 
> least virtualized).
> 
> What I think would be *ideal* would be for ARM to change the 
> specification to make it easier to virtualize.  For instance, by 
> specifying that the register *may* contain information about other 
> cores, but need not; or that the register will contain information on 
> other cores on real hardware, but not when virtualized.

The goal of those registers is to give you a picture of the interrupts 
activated on your HW. For instance, this is used by Linux to properly 
synchronize pending interrupts as there would be a race otherwise.

I wrote a long e-mail a few months ago explaining why Linux is using it 
and why we should implement it correctly [1].

Even on real HW, the value you read may already be incorrect (i.e. an 
interrupt may have become pending).

What Stefano implemented is using a snapshot of what Xen thinks is the 
state of the vGIC. I think this is in line with the existing specification.

However, this has a major flaw: it relies on the vCPUs exiting in order 
to synchronize the HW state with what Xen thinks the vGIC state is. We 
have configured the vGIC not to exit when the virtual interrupt is 
backed by an HW interrupt. So we need to rely on another trap for the 
synchronization.

With the NULL scheduler (and even credit2 with only 1 vCPU running per 
pCPU), the number of traps is very limited. So reading the I*ACTIVER 
registers may report that an interrupt is active for a really long time.

The result here is a thread in the guest may be blocked for a long time.

To be brutally honest, this totally defeats the argument that "this patch 
has a better latency".

> 
> Barring that, I think we should have a centralized place to document 
> deviations from the spec; and I think changes to this spec should be 
> coordinated with KVM (and maybe ACRN?), to minimize hypervisor-specific 
> deviations.

AFAIK, KVM is implementing the ACTIVER registers the same way as 
Stefano's patch. It might be harder (or not possible) to hit the 
deadlock depending on how they configure the vGIC (i.e. trap the active 
state) and the scheduling.

As I discussed with Stefano yesterday, I can accept that the state is 
not up-to-date. But I can't ignore the fact this patch is introducing a 
deadlock in some circumstances.

Cheers,

[1] 
https://lore.kernel.org/xen-devel/7289f75f-1ab2-2d42-cd88-1be5306b3072@xen.org/
Bertrand Marquis April 2, 2020, 8:17 a.m. UTC | #11
> On 1 Apr 2020, at 18:23, Julien Grall <julien@xen.org> wrote:
> 
> 
> 
> On 01/04/2020 10:54, Bertrand Marquis wrote:
>>> On 1 Apr 2020, at 09:30, Julien Grall <julien@xen.org <mailto:julien@xen.org>> wrote:
>>> 
>>> 
>>> 
>>> On 01/04/2020 01:57, Stefano Stabellini wrote:
>>>> On Mon, 30 Mar 2020, Julien Grall wrote:
>>>>> Hi Stefano,
>>>>> 
>>>>> On 30/03/2020 17:35, Stefano Stabellini wrote:
>>>>>> On Sat, 28 Mar 2020, Julien Grall wrote:
>>>>>>> Hi Stefano,
>>>>>>> 
>>>>>>> On 27/03/2020 02:34, Stefano Stabellini wrote:
>>>>>>>> This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER
>>>>>>>> reads. It doesn't take into account the latest state of interrupts on
>>>>>>>> other vCPUs. Only the current vCPU is up-to-date. A full solution is
>>>>>>>> not possible because it would require synchronization among all vCPUs,
>>>>>>>> which would be very expensive in terms or latency.
>>>>>>> 
>>>>>> Your sentence suggests you have numbers showing that correctly emulating
>>>>>>> the
>>>>>>> registers would be too slow. Mind sharing them?
>>>>>> 
>>>>>> No, I don't have any numbers. Would you prefer a different wording or a
>>>>>> better explanation? I also realized there is a typo in there (or/of).
>>>>> Let me start by saying that I think correctness is more important than speed.
>>>>> So I would have expected your commit message to contain some facts about why
>>>>> synchronization is going to be slow and why this is a problem.
>>>>> 
>>>>> To give you a concrete example, the implementation of set/way instructions is
>>>>> really slow (it could take a few seconds depending on the setup). However,
>>>>> this was fine because not implementing them correctly would have a greater
>>>>> impact on the guest (corruption) and they are not used often.
>>>>> 
>>>>> I don't think the performance in our case will be in the same order of magnitude. It
>>>>> is most likely to be in the range of milliseconds (if not less) which I think
>>>>> is acceptable for emulation (particularly for the vGIC) and the current uses.
>>>> Writing on the mailing list some of our discussions today.
>>>> Correctness is not just in terms of compliance to a specification but it
>>>> is also about not breaking guests. Introducing latency in the range of
>>>> milliseconds, or hundreds of microseconds, would break any latency
>>>> sensitive workloads. We don't have numbers so we don't know for certain
>>>> the effect that your suggestion would have.
>>> 
>>> You missed part of the discussion. I don't disagree that latency is important. However, if an implementation is only 95% reliable, then it means 5% of the time your guest may break (corruption, crash, deadlock...). At which point the latency is the last of your concern.
>>> 
>>>> It would be interesting to have those numbers, and I'll add to my TODO
>>>> list to run the experiments you suggested, but I'll put it on the
>>>> back-burner (from a Xilinx perspective it is low priority as no
>>>> customers are affected.)
>>> 
>>> How about we get a correct implementation merged first and then discuss optimization? This would allow the community to check whether there is actually noticeable latency in their workloads.
>> Hi,
> 
> Hi,

Hi,

> 
>> I am not sure that pushing something with a performance impact now, to fix it later, is the right approach here.
>> The patch is an improvement compared to the current code and it can be further improved later to handle more cases (other cores).
> 
> If you read my other answer on this patch you will see that this is going to introduce a deadlock in the guest using multiple vCPUs. How is it an improvement compared to today?

I fully agree with the fact that a deadlock, even with low probability, is a no-go.
The current implementation, always returning 0, does not create this deadlock case, but it can mislead in the other direction, letting one CPU think that the interrupt has already been handled when it possibly has not been.

> 
>> If we really have to sync all vCPUs here, it will cost a lot, and the result will still be a status from the past, because nothing will make sure that the value is still valid by the time the guest gets it back.
> 
> I hope you are aware the problem is exactly the same on the hardware. As soon as you read ISACTIVER, the value may not be correct anymore. So I don't see the issue with having an outdated value here.

The main difference being that on the hardware you can still poll for it and be sure that it won’t end up in your deadlock.
So I agree, we need to find a clean solution here.

> 
> As I said to Stefano yesterday, I would be happy to compromise as long as the implementation does not introduce an outright deadlock in the guest.
> 
> At the moment, I don't have a better approach than kicking all the vCPUs. Feel free to suggest a better one.

The only one that I see is to enforce a service interrupt when interrupts have been handled, to make sure that we clean up the internal active status as soon as the interrupt has actually been handled.
This would ensure that the polling behaves properly from the guest point of view, but it would avoid the need for the cross-CPU sync (the status will be right as soon as the interrupt has been serviced).
But the trade-off will be a bigger overhead, by enforcing a return back to Xen each time an interrupt is handled by a guest.
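
To make the idea concrete, a hedged sketch for a GICv2 GICH_LR (GICv3's
ICH_LR_EL2 has an equivalent EOI bit; the field layout follows the GIC
architecture specification, but the helper name is made up):

    #define GICH_LR_VIRQ_SHIFT      0
    #define GICH_LR_PRIORITY_SHIFT  23
    #define GICH_LR_STATE_PENDING   (1U << 28)
    #define GICH_LR_EOI             (1U << 19) /* valid when HW bit is 0 */

    /* Leave the HW bit clear and set EOI, so that the guest's
     * deactivation raises a maintenance interrupt.  Xen then
     * deactivates the physical interrupt itself and can clear its
     * software "active" state immediately. */
    static uint32_t make_trapping_lr(unsigned int virq, unsigned int prio)
    {
        return (virq << GICH_LR_VIRQ_SHIFT) |
               ((prio >> 3) << GICH_LR_PRIORITY_SHIFT) | /* priority[7:3] */
               GICH_LR_STATE_PENDING |
               GICH_LR_EOI; /* HW bit deliberately left clear */
    }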

I will try to spend more time on this and discuss with the GIC engineers at Arm to dig deeper into this case.

Cheers
Bertrand

> 
> Cheers,
> 
> -- 
> Julien Grall
Stefano Stabellini April 2, 2020, 5:19 p.m. UTC | #12
On Wed, 1 Apr 2020, Julien Grall wrote:
> On 01/04/2020 01:57, Stefano Stabellini wrote:
> > On Mon, 30 Mar 2020, Julien Grall wrote:
> > > Hi Stefano,
> > > 
> > > On 30/03/2020 17:35, Stefano Stabellini wrote:
> > > > On Sat, 28 Mar 2020, Julien Grall wrote:
> > > > > Hi Stefano,
> > > > > 
> > > > > On 27/03/2020 02:34, Stefano Stabellini wrote:
> > > > > > This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER
> > > > > > reads. It doesn't take into account the latest state of interrupts
> > > > > > on
> > > > > > other vCPUs. Only the current vCPU is up-to-date. A full solution is
> > > > > > not possible because it would require synchronization among all
> > > > > > vCPUs,
> > > > > > which would be very expensive in terms or latency.
> > > > > 
> > > > > Your sentence suggests you have numbers showing that correctly
> > > > > emulating
> > > > > the
> > > > > registers would be too slow. Mind sharing them?
> > > > 
> > > > No, I don't have any numbers. Would you prefer a different wording or a
> > > > better explanation? I also realized there is a typo in there (or/of).
> > > Let me start by saying that I think correctness is more important than speed.
> > > So I would have expected your commit message to contain some facts about why
> > > synchronization is going to be slow and why this is a problem.
> > > 
> > > To give you a concrete example, the implementation of set/way instructions
> > > is
> > > really slow (it could take a few seconds depending on the setup). However,
> > > this was fine because not implementing them correctly would have a greater
> > > impact on the guest (corruption) and they are not used often.
> > > 
> > > I don't think the performance in our case will be in the same order of
> > > magnitude. It
> > > is most likely to be in the range of milliseconds (if not less) which I
> > > think
> > > is acceptable for emulation (particularly for the vGIC) and the current
> > > uses.
> > 
> > Writing on the mailing list some of our discussions today.
> > 
> > Correctness is not just in terms of compliance to a specification but it
> > is also about not breaking guests. Introducing latency in the range of
> > milliseconds, or hundreds of microseconds, would break any latency
> > sensitive workloads. We don't have numbers so we don't know for certain
> > the effect that your suggestion would have.
> 
> You missed part of the discussion. I don't disagree that latency is important.
> However, if an implementation is only 95% reliable, then it means 5% of the
> time your guest may break (corruption, crash, deadlock...). At which point the
> latency is the last of your concern.

Yeah, I failed to highlight it, also because I look at it from a slightly 
different perspective: I think IRQ latency is part of correctness.

If we have a solution that is not 100% faithful to the specification we
are going to break guests that rely on a faithful implementation of
ISACTIVER.

If we have a solution that is 100% faithful to the specification but
causes latency spikes it breaks RT guests.

But they are different sets of guests, it is not like one is necessarily
a subset of the other: there are guests that cannot tolerate any latency
spikes but they are OK with an imprecise implementation of ISACTIVER.

My preference is a solution that is both spec-faithful and also doesn't
cause any latency spikes. If that is not possible then we'll have to
compromise or come up with "creative" ideas.


> > It would be interesting to have those numbers, and I'll add to my TODO
> > list to run the experiments you suggested, but I'll put it on the
> > back-burner (from a Xilinx perspective it is low priority as no
> > customers are affected.)
> 
> How about we get a correct implementation merged first and then discuss
> optimization? This would allow the community to check whether there is
> actually noticeable latency in their workloads.

A correct implementation to me means that it is correct from both the
specification point of view as well as from a latency point of view. A
patch that potentially introduces latency spikes could cause guest
breakage and I don't think it should be considered correct. The
tests would have to be done beforehand.



In terms of other "creative" ideas, here are some:

One idea, as George suggested, would be to document the interface
deviation. The intention is still to remove any deviation but at least
we would be clear about what we have. Ideally in a single place together
with other hypervisors. This is my preference.

Another idea is that we could crash the guest if GICD_ISACTIVER is read
from a multi-vcpu guest. It is similar to what we already do today but
better because we would do it purposely (not because of a typo) and
because it will work for single vcpu guests at least.

We could also leave it as is (crash all the time) but it implies that
vendors that are seeing issues with Linux today will have to keep
maintaining patches in their private trees until a better solution is
found. This would also be the case if we crash multi-vcpus guests as
previously suggested.
Julien Grall April 2, 2020, 6:52 p.m. UTC | #13
(+Marc)

Hi Stefano,

On 02/04/2020 18:19, Stefano Stabellini wrote:
> On Wed, 1 Apr 2020, Julien Grall wrote:
>> On 01/04/2020 01:57, Stefano Stabellini wrote:
>>> On Mon, 30 Mar 2020, Julien Grall wrote:
>>>> Hi Stefano,
>>>>
>>>> On 30/03/2020 17:35, Stefano Stabellini wrote:
>>>>> On Sat, 28 Mar 2020, Julien Grall wrote:
>>>>>> Hi Stefano,
>>>>>>
>>>>>> On 27/03/2020 02:34, Stefano Stabellini wrote:
>>>>>>> This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER
>>>>>>> reads. It doesn't take into account the latest state of interrupts
>>>>>>> on
>>>>>>> other vCPUs. Only the current vCPU is up-to-date. A full solution is
>>>>>>> not possible because it would require synchronization among all
>>>>>>> vCPUs,
>>>>>>> which would be very expensive in terms or latency.
>>>>>>
>>>>>> Your sentence suggests you have numbers showing that correctly
>>>>>> emulating
>>>>>> the
>>>>>> registers would be too slow. Mind sharing them?
>>>>>
>>>>> No, I don't have any numbers. Would you prefer a different wording or a
>>>>> better explanation? I also realized there is a typo in there (or/of).
>>>> Let me start by saying that I think correctness is more important than speed.
>>>> So I would have expected your commit message to contain some facts about why
>>>> synchronization is going to be slow and why this is a problem.
>>>>
>>>> To give you a concrete example, the implementation of set/way instructions
>>>> is
>>>> really slow (it could take a few seconds depending on the setup). However,
>>>> this was fine because not implementing them correctly would have a greater
>>>> impact on the guest (corruption) and they are not used often.
>>>>
>>>> I don't think the performance in our case will be in the same order of
>>>> magnitude. It
>>>> is most likely to be in the range of milliseconds (if not less) which I
>>>> think
>>>> is acceptable for emulation (particularly for the vGIC) and the current
>>>> uses.
>>>
>>> Writing on the mailing list some of our discussions today.
>>>
>>> Correctness is not just in terms of compliance to a specification but it
>>> is also about not breaking guests. Introducing latency in the range of
>>> milliseconds, or hundreds of microseconds, would break any latency
>>> sensitive workloads. We don't have numbers so we don't know for certain
>>> the effect that your suggestion would have.
>>
>> You missed part of the discussion. I don't disagree that latency is important.
>> However, if an implementation is only 95% reliable, then it means 5% of the
>> time your guest may break (corruption, crash, deadlock...). At which point the
>> latency is the last of your concern.
> 
> Yeah, I failed to highlight it, also because I look at it from a slightly
> different perspective: I think IRQ latency is part of correctness.
> 
> If we have a solution that is not 100% faithful to the specification we
> are going to break guests that rely on a faithful implementation of
> ISACTIVER.
> 
> If we have a solution that is 100% faithful to the specification but
> causes latency spikes it breaks RT guests.
> 
> But they are different sets of guests, it is not like one is necessarily
> a subset of the other: there are guests that cannot tolerate any latency
> spikes but they are OK with an imprecise implementation of ISACTIVER.
> 
> My preference is a solution that is both spec-faithful and also doesn't
> cause any latency spikes. If that is not possible then we'll have to
> compromise or come up with "creative" ideas.

I do agree that latency is important. However, this needs to be based on 
numbers or a good grasp as to why this would be an issue. Neither of 
these has been provided so far.

As we discussed on Tuesday, the cost for other vCPUs is only going to be 
a trap to the hypervisor and then back again. The cost is likely smaller 
than receiving and forwarding an interrupt.

You actually agreed on this analysis. So can you enlighten me as to why 
receiving an interrupt is not a problem for latency but this is?

> 
>>> It would be interesting to have those numbers, and I'll add to my TODO
>>> list to run the experiments you suggested, but I'll put it on the
>>> back-burner (from a Xilinx perspective it is low priority as no
>>> customers are affected.)
>>
>> How about we get a correct implementation merged first and then discuss
>> optimization? This would allow the community to check whether there is
>> actually noticeable latency in their workloads.
> 
> A correct implementation to me means that it is correct from both the
> specification point of view as well as from a latency point of view. A
> patch that potentially introduces latency spikes could cause guest
> breakage and I don't think it should be considered correct. The
> tests would have to be done beforehand.

In all honesty, writing and testing the implementation would likely have 
taken you less time than trying to push for "creative" ideas or your patch.

> In terms of other "creative" ideas, here are some:

"creative" ideas should really be the last resort. Correct me if I am 
wrong, but I don't think we are there yet.

> 
> One idea, as George suggested, would be to document the interface
> deviation. The intention is still to remove any deviation but at least
> we would be clear about what we have. Ideally in a single place together
> with other hypervisors. This is my preference.

It goes without saying that deviation from the specification should not be 
taken lightly and has risks.

The main risk is you are never going to be able to reliably run an OS on 
Xen unless you manage to get the deviation accepted by the wider 
community and Arm.

> 
> Another idea is that we could crash the guest if GICD_ISACTIVER is read
> from a multi-vcpu guest. It is similar to what we already do today but
> better because we would do it purposely (not because of a typo) and
> because it will work for single vcpu guests at least.
> 
> We could also leave it as is (crash all the time) but it implies that
> vendors that are seeing issues with Linux today will have to keep
> maintaining patches in their private trees until a better solution is
> found. This would also be the case if we crash multi-vcpus guests as
> previously suggested.

The crash only happened when using vGICv3, not vGICv2. But did you look 
at Xen recently? Particularly at the following patch:

xen/arm: Handle unimplemented VGICv3 registers as RAZ/WI

Per the ARM Generic Interrupt Controller Architecture Specification (ARM
IHI 0069E), reserved registers should generally be treated as RAZ/WI.
To simplify the VGICv3 design and improve guest compatibility, treat the
default case for GICD and GICR registers as read_as_zero/write_ignore.

Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com>
Acked-by: Julien Grall <julien@xen.org>

I actually pointed the patch to you during one of our weekly calls. Yet 
we agreed it would still be good to implement the register properly and 
you said you would write a patch.

Anyway, I gave you a solution and also explained why I think it would still be 
fine in terms of IRQ latency. The ball is now in your court.

Cheers,
Marc Zyngier April 3, 2020, 8:47 a.m. UTC | #14
On 2020-04-02 19:52, Julien Grall wrote:
> (+Marc)

Thanks for looping me in. Definitely an interesting read, but also a very
puzzling one.

> 
> Hi Stefano,
> 
> On 02/04/2020 18:19, Stefano Stabellini wrote:
>> On Wed, 1 Apr 2020, Julien Grall wrote:
>>> On 01/04/2020 01:57, Stefano Stabellini wrote:
>>>> On Mon, 30 Mar 2020, Julien Grall wrote:
>>>>> Hi Stefano,
>>>>> 
>>>>> On 30/03/2020 17:35, Stefano Stabellini wrote:
>>>>>> On Sat, 28 Mar 2020, Julien Grall wrote:
>>>>>>> Hi Stefano,
>>>>>>> 
>>>>>>> On 27/03/2020 02:34, Stefano Stabellini wrote:
>>>>>>>> This is a simple implementation of GICD_ICACTIVER / 
>>>>>>>> GICD_ISACTIVER
>>>>>>>> reads. It doesn't take into account the latest state of 
>>>>>>>> interrupts
>>>>>>>> on
>>>>>>>> other vCPUs. Only the current vCPU is up-to-date. A full 
>>>>>>>> solution is
>>>>>>>> not possible because it would require synchronization among all
>>>>>>>> vCPUs,
>>>>>>>> which would be very expensive in terms or latency.
>>>>>>> 
>>>>>>> Your sentence suggests you have numbers showing that correctly
>>>>>>> emulating
>>>>>>> the
>>>>>>> registers would be too slow. Mind sharing them?
>>>>>> 
>>>>>> No, I don't have any numbers. Would you prefer a different wording 
>>>>>> or a
>>>>>> better explanation? I also realized there is a typo in there 
>>>>>> (or/of).
>>>>> Let me start by saying that I think correctness is more important than
>>>>> speed. So I would have expected your commit message to contain some
>>>>> facts about why synchronization is going to be slow and why this is a
>>>>> problem.
>>>>> 
>>>>> To give you a concrete example, the implementation of set/way
>>>>> instructions is
>>>>> really slow (it could take a few seconds depending on the setup). 
>>>>> However,
>>>>> this was fine because not implementing them correctly would have a 
>>>>> greater
>>>>> impact on the guest (corruption) and they are not used often.
>>>>> 
>>>>> I don't think the performance in our case will be in the same order
>>>>> of magnitude. It
>>>>> is most likely to be in the range of milliseconds (if not less) 
>>>>> which I
>>>>> think
>>>>> is acceptable for emulation (particularly for the vGIC) and the 
>>>>> current
>>>>> uses.
>>>> 
>>>> Writing on the mailing list some of our discussions today.
>>>> 
>>>> Correctness is not just in terms of compliance to a specification 
>>>> but it
>>>> is also about not breaking guests. Introducing latency in the range 
>>>> of
>>>> milliseconds, or hundreds of microseconds, would break any latency
>>>> sensitive workloads. We don't have numbers so we don't know for 
>>>> certain
>>>> the effect that your suggestion would have.
>>> 
>>> You missed part of the discussion. I don't disagree that latency is 
>>> important.
>>> However, if an implementation is only 95% reliable, then it means 5% 
>>> of the
>>> time your guest may break (corruption, crash, deadlock...). At which 
>>> point the
>>> latency is the last of your concern.
>> 
>> Yeah, I failed to highlight it, also because I look at it from a slightly
>> different perspective: I think IRQ latency is part of correctness.

No. Low latency is a very desirable thing, but it doesn't matter at all when
you don't even have functional correctness. To use my favourite car analogy,
having a bigger engine doesn't help when you're about to hit the wall and
have no brakes... You just hit the wall faster.

>> 
>> If we have a solution that is not 100% faithful to the specification 
>> we
>> are going to break guests that rely on a faithful implementation of
>> ISACTIVER.
>> 
>> If we have a solution that is 100% faithful to the specification but
>> causes latency spikes it breaks RT guests.
>> 
>> But they are different sets of guests, it is not like one is 
>> necessarily
>> a subset of the other: there are guests that cannot tolerate any 
>> latency
>> spikes but they are OK with an imprecise implementation of ISACTIVER.

s/imprecise/massively incorrect/

>> 
>> My preference is a solution that is both spec-faithful and also 
>> doesn't
>> cause any latency spikes. If that is not possible then we'll have to
>> compromise or come up with "creative" ideas.

You want your cake and eat it. Always a good thing to want.

As long as you don't pretend you implement the GIC architecture, expose
something else altogether to the guests and have specific drivers in all
the guest operating systems under the sun, by all means, be creative.

> I do agree that latency is important. However, this needs to be based
> on numbers or a good grasp as to why this would be an issue. Neither
> of these has been provided so far.
> 
> As we discussed on Tuesday, the cost for other vCPUs is only going to
> be a trap to the hypervisor and then back again. The cost is likely
> smaller than receiving and forwarding an interrupt.
> 
> You actually agreed on this analysis. So can you enlighten me as to
> why receiving an interrupt is not a problem for latency but this is?
> 
>> 
>>>> It would be interesting to have those numbers, and I'll add to my 
>>>> TODO
>>>> list to run the experiments you suggested, but I'll put it on the
>>>> back-burner (from a Xilinx perspective it is low priority as no
>>>> customers are affected.)
>>> 
>>> How about we get a correct implementation merged first and then discuss
>>> optimization? This would allow the community to check whether there is
>>> actually noticeable latency in their workloads.
>> 
>> A correct implementation to me means that it is correct from both the
>> specification point of view as well as from a latency point of view. A
>> patch that potentially introduces latency spikes could cause guest
>> breakage and I don't think it should be considered correct. The
>> tests would have to be done beforehand.

Please find anything that specifies latency in the GIC spec. Oh wait,
there is none. Because that's a quality of implementation thing, and
not a correctness issue.

> 
> In all honesty, writing and testing the implementation would likely
> have taken you less time than trying to push for "creative" ideas or your
> patch.
> 
>> In terms of other "creative" ideas, here are some:
> 
> "creative" ideas should really be the last resort. Correct me if I am
> wrong, but I don't think we are there yet.
> 
>> 
>> One idea, as George suggested, would be to document the interface
>> deviation. The intention is still to remove any deviation but at least
>> we would be clear about what we have. Ideally in a single place 
>> together
>> with other hypervisors. This is my preference.
> 
> It goes without saying that deviation from the specification should not
> be taken lightly and has risks.
> 
> The main risk is you are never going to be able to reliably run an OS
> on Xen unless you manage to get the deviation accepted by the wider
> community and Arm.

There is just no way I'll ever accept a change to the GIC interrupt 
state
machine for Linux. Feel free to try and convince other OS maintainers.

If you want to be creative, be really creative. Implement a fully PV 
interrupt
controller that gives you simple enough semantics so that you can 
actually be
deterministic. Don't pretend you implement the GIC architecture.

>> Another idea is that we could crash the guest if GICD_ISACTIVER is 
>> read
>> from a multi-vcpu guest. It is similar to what we already do today but
>> better because we would do it purposely (not because of a typo) and
>> because it will work for single vcpu guests at least.
>> 
>> We could also leave it as is (crash all the time) but it implies that
>> vendors that are seeing issues with Linux today will have to keep
>> maintaining patches in their private trees until a better solution is
>> found. This would also be the case if we crash multi-vcpus guests as
>> previously suggested.

OK, that's just insane. Suggesting that guests should be left crashing
because they are doing *the right thing* is just barking mad. I'm all for
exposing guest bugs when they take shortcuts with the architecture, but
blaming the guest for what is just a bug in Xen?

If I was someone developing a product using Xen/ARM, I'd be very worried
about what you have written above. Because it really reads "we don't 
care
about reliability as long as we can show amazing numbers". I really hope
it isn't what you mean.

         M.
George Dunlap April 3, 2020, 10:43 a.m. UTC | #15
> On Apr 3, 2020, at 9:47 AM, Marc Zyngier <maz@kernel.org> wrote:
> 
> On 2020-04-02 19:52, Julien Grall wrote:
>> (+Marc)
> 
> Thanks for looping me in. Definitely an interesting read, but also a very
> puzzling one.

[snip]

> No. Low latency is a very desirable thing, but it doesn't matter at all when
> you don't even have functional correctness. To use my favourite car analogy,
> having a bigger engine doesn't help when you're about to hit the wall and
> have no brakes... You just hit the wall faster.

[snip]

> s/imprecise/massively incorrect/

[snip]

> There is just no way I'll ever accept a change to the GIC interrupt state
> machine for Linux. Feel free to try and convince other OS maintainers.

[snip]

> If I was someone developing a product using Xen/ARM, I'd be very worried
> about what you have written above. Because it really reads "we don't care
> about reliability as long as we can show amazing numbers". I really hope
> it isn't what you mean.

What's puzzling to me is that everyone else in this thread understands that what Stefano is trying to do is to get Xen to behave like KVM.

Are they wrong?  If so, we can just do whatever Linux does.  If not, then you need to turn all your imprecations about correctness, smashing into walls, concern for the sanity of maintainers, and so on towards your own code first.

 -George
Marc Zyngier April 3, 2020, 10:59 a.m. UTC | #16
George,

On 2020-04-03 11:43, George Dunlap wrote:
>> On Apr 3, 2020, at 9:47 AM, Marc Zyngier <maz@kernel.org> wrote:
>> 
>> On 2020-04-02 19:52, Julien Grall wrote:
>>> (+Marc)
>> 
>> Thanks for looping me in. Definitely an interesting read, but also a 
>> very
>> puzzling one.
> 
> [snip]
> 
>> No. Low latency is a very desirable thing, but it doesn't matter at 
>> all when
>> you don't even have functional correctness. To use my favourite car 
>> analogy,
>> having a bigger engine doesn't help when you're about to hit the wall 
>> and
>> have no brakes... You just hit the wall faster.
> 
> [snip]
> 
>> s/imprecise/massively incorrect/
> 
> [snip]
> 
>> There is just no way I'll ever accept a change to the GIC interrupt 
>> state
>> machine for Linux. Feel free to try and convince other OS maintainers.
> 
> [snip]
> 
>> If I was someone developing a product using Xen/ARM, I'd be very 
>> worried
>> about what you have written above. Because it really reads "we don't 
>> care
>> about reliability as long as we can show amazing numbers". I really 
>> hope
>> it isn't what you mean.
> 
> What's puzzling to me is that everyone else in this thread understands
> that what Stefano is trying to do is to get Xen to behave like KVM.

Sorry, I don't get what you mean here. KVM at least aims to be 
architecturally
compliant. Is it perfect? Most probably not, as we fix it all the time.

Dealing with the active registers is hard. But as far as I can see,
we do get them right. Do we sacrifice latency for correctness? Yes.

And if you have spotted a problem in the way we handle those, pray tell.

> Are they wrong?  If so, we can just do whatever Linux does.  If not,
> then you need to turn all your imprecations about correctness,
> smashing into walls, concern for the sanity of maintainers, and so on
> towards your own code first.

I'm really sorry, but you seem to have the wrong end of the stick here.
I'm not trying to compare Xen to KVM at all. I'm concerned about
implementing only a small part of the architecture, ignoring the rest,
and letting guests crash, which is what was suggested here.

         M.
George Dunlap April 3, 2020, 11:15 a.m. UTC | #17
> On Apr 3, 2020, at 11:59 AM, Marc Zyngier <maz@kernel.org> wrote:
> 
> George,
> 
> On 2020-04-03 11:43, George Dunlap wrote:
>>> On Apr 3, 2020, at 9:47 AM, Marc Zyngier <maz@kernel.org> wrote:
>>> On 2020-04-02 19:52, Julien Grall wrote:
>>>> (+Marc)
>>> Thanks for looping me in. Definitely an interesting read, but also a very
>>> puzzling one.
>> [snip]
>>> No. Low latency is a very desirable thing, but it doesn't matter at all when
>>> you don't even have functional correctness. To use my favourite car analogy,
>>> having a bigger engine doesn't help when you're about to hit the wall and
>>> have no brakes... You just hit the wall faster.
>> [snip]
>>> s/imprecise/massively incorrect/
>> [snip]
>>> There is just no way I'll ever accept a change to the GIC interrupt state
>>> machine for Linux. Feel free to try and convince other OS maintainers.
>> [snip]
>>> If I was someone developing a product using Xen/ARM, I'd be very worried
>>> about what you have written above. Because it really reads "we don't care
>>> about reliability as long as we can show amazing numbers". I really hope
>>> it isn't what you mean.
>> What's puzzling to me is that everyone else in this thread understands
>> that what Stefano is trying to do is to get Xen to behave like KVM.
> 
> Sorry, I don't get what you mean here. KVM at least aims to be architecturally
> compliant. Is it perfect? Most probably not, as we fix it all the time.
> 
> Dealing with the active registers is hard. But as far as I can see,
> we do get them right. Do we sacrifice latency for correctness? Yes.
> 
> And if you have spotted a problem in the way we handle those, pray tell.
> 
>> Are they wrong?  If so, we can just do whatever Linux does.  If not,
>> then you need to turn all your imprecations about correctness,
>> smashing into walls, concern for the sanity of maintainers, and so on
>> towards your own code first.
> 
> I'm really sorry, but you seem to have the wrong end of the stick here.
> I'm not trying to compare Xen to KVM at all. I'm concerned about
> implementing only a small part of the architecture, ignoring the rest,
> and letting guests crash, which is what was suggested here.

The current situation (as I understand it) is that Xen never implemented this functionality at all.

Stefano has been trying to convince Julien to implement this register KVM’s way, which has good latency in the median case, but in particular configurations, has arbitrarily bad worst-case latency for multi-vcpu guests.  Julien thinks what KVM is doing is wrong and against the spec, and has been refusing to let the patches in. 

He has proposed an alternative which is closer to the spec in terms of functionality, and has bounded worst-case latency, but which has worse latency and uncontrollable jitter in the median case (or at least, so Stefano believes).

As a compromise, Stefano suggested implementing KVM’s way for single-vcpu guests.  This is a strict improvement over the current situation, since at least a lot of new guests start working, while they hash out what to do about multi-vcpu guests.

My proposal has been to work with KVM to document this deviation from the spec for guests running virtualized.

So it’s you who have the wrong end of the stick; your contempt is misplaced.

If you don’t think it’s a deviation, then please help us convince Julien, so we can take Stefano’s patch as-is.  Or, if Julien convinces you that KVM is deviating from the spec, then let’s try to work together to see how we can implement the necessary functionality efficiently in a virtualized environment.

 -George
Julien Grall April 3, 2020, 11:16 a.m. UTC | #18
On 03/04/2020 11:43, George Dunlap wrote:
> 
>> On Apr 3, 2020, at 9:47 AM, Marc Zyngier <maz@kernel.org> wrote:
>>
>> On 2020-04-02 19:52, Julien Grall wrote:
>>> (+Marc)
>>
>> Thanks for looping me in. Definitely an interesting read, but also a very
>> puzzling one.
> 
> [snip]
> 
>> No. Low latency is a very desirable thing, but it doesn't matter at all when
>> you don't even have functional correctness. To use my favourite car analogy,
>> having a bigger engine doesn't help when you're about to hit the wall and
>> have no breaks... You just hit the wall faster.
> 
> [snip]
> 
>> s/imprecise/massively incorrect/
> 
> [snip]
> 
>> There is just no way I'll ever accept a change to the GIC interrupt state
>> machine for Linux. Feel free to try and convince other OS maintainers.
> 
> [snip]
> 
>> If I was someone developing a product using Xen/ARM, I'd be very worried
>> about what you have written above. Because it really reads "we don't care
>> about reliability as long as we can show amazing numbers". I really hope
>> it isn't what you mean.
> 
> What's puzzling to me is that everyone else in this thread understands that what Stefano is trying to do is to get Xen to behave like KVM.

This reads to me as "bugs don't exist".

As I actually said in a previous e-mail, our vGIC is significantly
different compared to KVM's. It *might* be that they are not affected
because they may trap when a guest is deactivating an interrupt *or* by
other means.

I didn't look at the details of their implementation, but this suggests
you or Stefano may have. Why do you think Stefano's implementation is
following what KVM does? If the behavior is the same, was the problem 
reported to KVM ML? What was the answer from the maintainers?

I suspect this was never discussed on KVM ML. So that should really be 
the first step.

Cheers,
Stefano Stabellini April 3, 2020, 4:18 p.m. UTC | #19
On Fri, 3 Apr 2020, Marc Zyngier wrote:
> > > Yeah I forgot to highlight it, also because I look at it from a slightly
> > > different perspective: I think IRQ latency is part of correctness.
> 
> No. Low latency is a very desirable thing, but it doesn't matter at all when
> you don't even have functional correctness.

Hi Marc,

It is good to hear from you. Functional correctness can still lead to
breakage. I wrote more details in the last part of this email to explain
in detail; it is long, but please read it all!

[...]

> Please find anything that specifies latency in the GIC spec. Oh wait,
> there is none. Because that's a quality of implementation thing, and
> not a correctness issue.

The goal of being faithful to the spec is not compliance for the sake of
it. I take it that neither you nor I actually care about labels with
compliance logos to attach anywhere. The goal is to run guests correctly when
virtualized. Do you agree with me so far?

The difficult question is: what happens when the hypervisor is faithful
to the spec, but guests still break?


> > In all honesty, writing and testing the implementation would likely
> > have taken you less time than trying to push for "creative" ideas or
> > your patch.
> > 
> > > In terms of other "creative" ideas, here are some:
> > 
> > "creative" ideas should really be the last resort. Correct me if I am
> > wrong, but I don't think we are there yet.
> > 
> > > 
> > > One idea, as George suggested, would be to document the interface
> > > deviation. The intention is still to remove any deviation but at least
> > > we would be clear about what we have. Ideally in a single place together
> > > with other hypervisors. This is my preference.
> > 
> > It goes without saying that deviation from the specification should not
> > be taken lightly and has risks.
> > 
> > The main risk is you are never going to be able to reliably run an OS
> > on Xen unless you manage to get the deviation accepted by the wider
> > community and Arm.
> 
> There is just no way I'll ever accept a change to the GIC interrupt state
> machine for Linux. Feel free to try and convince other OS maintainers.
> 
> If you want to be creative, be really creative. Implement a fully PV interrupt
> controller that gives you simple enough semantics so that you can actually be
> deterministic. Don't pretend you implement the GIC architecture.

Last time we looked at KVM, KVM was actually doing what my patch here
does, not what Julien suggested. (In short, my patch is implementing
ISACTIVER accurately only for the current vcpu and not the others;
Julien is suggesting to send an IPI to other pcpus running vcpus so that
the implementation becomes accurate for other vcpus too.)
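
For reference, the gist of my patch is roughly the following. This is a
sketch rather than the exact code; it assumes Xen's existing
irq_to_pending() helper and the GIC_IRQ_GUEST_ACTIVE status bit:

/*
 * Sketch: build the ISACTIVER bitmap for one 32-interrupt rank using
 * only the local vGIC bookkeeping. This is accurate for interrupts
 * handled by the current vCPU, but may be stale for interrupts that
 * are active on other vCPUs (no IPI, no synchronization).
 */
static uint32_t vgic_isactiver(struct vcpu *v, unsigned int first_irq)
{
    uint32_t bits = 0;
    unsigned int i;

    for ( i = 0; i < 32; i++ )
    {
        struct pending_irq *p = irq_to_pending(v, first_irq + i);

        if ( p != NULL && test_bit(GIC_IRQ_GUEST_ACTIVE, &p->status) )
            bits |= 1U << i;
    }

    return bits;
}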


> > > Another idea is that we could crash the guest if GICD_ISACTIVER is read
> > > from a multi-vcpu guest. It is similar to what we already do today but
> > > better because we would do it purposely (not because of a typo) and
> > > because it will work for single vcpu guests at least.
> > > 
> > > We could also leave it as is (crash all the time) but it implies that
> > > vendors that are seeing issues with Linux today will have to keep
> > > maintaining patches in their private trees until a better solution is
> > > found. This would also be the case if we crash multi-vcpus guests as
> > > previously suggested.
> 
> OK, that's just insane. Suggesting that guests should be left crashing
> because they are doing *the right thing* is just barking mad. I'm all for
> exposing guest bugs when they take shortcuts with the architecture, but
> blaming the guest for what is just a bug in Xen?
> 
> If I was someone developing a product using Xen/ARM, I'd be very worried
> about what you have written above. Because it really reads "we don't care
> about reliability as long as we can show amazing numbers". I really hope
> it isn't what you mean.

It is not what I am saying, and I am glad I have an opportunity to
explain it.

Don't think about low latency numbers as: "great it runs faster!" I
don't care about that. Well, I care, but not that much, certainly not at
the expense of being faithful to the spec.

Customers here are running guests and workloads that fail spectacularly
if they don't get IRQ latency within low deterministic ranges. A latency
spike can cause a severe failure, as bad a failure as a total system
crash.

Today, these guests don't use ISACTIVER at all AFAIK, but let's imagine
that in the future they will. Being latency-sensitive guests, I doubt
they'll ever loop over ISACTIVER, but maybe they could call it once at
initialization time or during shutdown or something? Also, let's say
implementing ISACTIVER faithfully with an IPI to other vcpus might
cause latency spikes (note that we don't have numbers on this).

Does this scenario make sense to you so far?

If we implement ISACTIVER using an IPI we introduce a significant source
of non-determinism. We yank the other vcpu into hypervisor mode when it
could have been running the critical section. It can very well cause one
of those spectacular failures. It might take the engineers putting the
system together a very significant amount of time to figure out the
problem.

Doing what my patch here does might be OK until one of these guests
starts to rely on ISACTIVER being accurate. So far I have not seen any
examples of it, but I agree it could happen, hence, it is risky.
Frankly, I thought that KVM was already behaving the way my patch does,
and that made me feel more confident about this solution.

Finally, the last option is to just crash early. It is not blaming the
guest -- it would serve as an early warning that something is not right
and needs to be done about the system. Typically, it is better to fail
early and clearly rather than at some point later more subtly.
Julien Grall April 3, 2020, 4:23 p.m. UTC | #20
On 03/04/2020 17:18, Stefano Stabellini wrote:
> On Fri, 3 Apr 2020, Marc Zyngier wrote:
  > Doing what my patch here does might be OK until one of these guests
> starts to rely on ISACTIVER being accurate. So far I have not seen any
> examples of it, but I agree it could happen, hence, it is risky.

I am only going to answer this. This is not about *accuracy* but a
deadlock in your guest. I actually wrote a long e-mail on this thread
explaining the possible deadlock.

It is not because you can't reproduce the deadlock that the deadlock is
not there. When are you going to stop dismissing real bugs in your
implementation?

Cheers,
Julien Grall April 3, 2020, 5:46 p.m. UTC | #21
On 03/04/2020 17:23, Julien Grall wrote:
> 
> 
> On 03/04/2020 17:18, Stefano Stabellini wrote:
>> On Fri, 3 Apr 2020, Marc Zyngier wrote:
>   > Doing what my patch here does might be OK until one of these guests
>> starts to rely on ISACTIVER being accurate. So far I have not seen any
>> examples of it, but I agree it could happen, hence, it is risky.
> 
> I am only going to answer this. This is not about *accuracy* but a
> deadlock in your guest. I actually wrote a long e-mail on this thread
> explaining the possible deadlock.
>
> It is not because you can't reproduce the deadlock that the deadlock is
> not there. When are you going to stop dismissing real bugs in your
> implementation?

So that everyone is on the same page, here is the link to the e-mail I
wrote about the potential problem:

https://lists.xenproject.org/archives/html/xen-devel/2020-03/msg02039.html

Cheers,
Stefano Stabellini April 3, 2020, 7:41 p.m. UTC | #22
On Thu, 2 Apr 2020, Julien Grall wrote:
> As we discussed on Tuesday, the cost for other vCPUs is only going to be a
> trap to the hypervisor and then back again. The cost is likely smaller than
> receiving and forwarding an interrupt.
> 
> You actually agreed on this analysis. So can you enlighten me as to why
> receiving an interrupt is not a problem for latency but this is?

My answer was that the difference is that an operating system can
disable interrupts, but it cannot disable receiving this special IPI.

 
> The crash only happened when using vGICv3, not vGICv2. But did you look at Xen
> recently? Particularly at the following patch:
> 
> xen/arm: Handle unimplemented VGICv3 registers as RAZ/WI
> 
> Per the ARM Generic Interrupt Controller Architecture Specification (ARM
> IHI 0069E), reserved registers should generally be treated as RAZ/WI.
> To simplify the VGICv3 design and improve guest compatibility, treat the
> default case for GICD and GICR registers as read_as_zero/write_ignore.
> 
> Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com>
> Acked-by: Julien Grall <julien@xen.org>
> 
> I actually pointed the patch to you during one of our weekly calls. Yet we
> agreed it would still be good to implement the register properly and you said
> you would write a patch.

As you know, I cannot reproduce the crash myself, so I asked Peng and
Wei for help with that. I cannot be certain Jeff's patch makes a difference,
but looking at the code, if you open
xen/arch/arm/vgic-v3.c:__vgic_v3_distr_common_mmio_read you can see that
the range mistake is still there:


    /* Read the active status of an IRQ via GICD/GICR is not supported */
    case VRANGE32(GICD_ISACTIVER, GICD_ISACTIVER):
    case VRANGE32(GICD_ICACTIVER, GICD_ICACTIVERN):
        goto read_as_zero;


So a read of any GICD_ISACTIVER register but the first should end up hitting
the default case:

    default:
        printk(XENLOG_G_ERR
               "%pv: %s: unhandled read r%d offset %#08x\n",
               v, name, dabt.reg, reg);
        return 0;
    }

Which returns 0 (IO_ABORT).

Would you be happy to have the range fixed to be:

    case VRANGE32(GICD_ISACTIVER, GICD_ISACTIVERN):

instead?
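
i.e. the full block would then read:

    /* Read the active status of an IRQ via GICD/GICR is not supported */
    case VRANGE32(GICD_ISACTIVER, GICD_ISACTIVERN):
    case VRANGE32(GICD_ICACTIVER, GICD_ICACTIVERN):
        goto read_as_zero;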
Julien Grall April 3, 2020, 8:27 p.m. UTC | #23
On Fri, 3 Apr 2020 at 20:41, Stefano Stabellini <sstabellini@kernel.org> wrote:
>
> On Thu, 2 Apr 2020, Julien Grall wrote:
> > As we discussed on Tuesday, the cost for other vCPUs is only going to be a
> > trap to the hypervisor and then back again. The cost is likely smaller than
> > receiving and forwarding an interrupt.
> >
> > You actually agreed on this analysis. So can you enlighten me as to why
> > receiving an interrupt is not a problem for latency but this is?
>
> My answer was that the difference is that an operating system can
> disable interrupts, but it cannot disable receiving this special IPI.

An OS can *only* disable its own interrupts, yet interrupts will still
be received by Xen even if the interrupts are masked at the processor
(e.g. local_irq_disable()).

You would need to disable interrupts one by one at the GIC level (using
ICENABLER) in order not to receive any interrupts. Yet, Xen may still
receive interrupts for operational purposes (e.g. serial, console,
maintenance IRQ...). So traps will happen.

But if your concern is disturbing a vCPU which has interrupts
disabled, then there is a way to mitigate this in an implementation: you
can easily know whether there are interrupts inflight at a given point
for a given vCPU. So you can avoid the IPI if you know there are no
interrupts potentially active.

This is why I clearly wrote that the implementation I suggested was *not*
optimized, in the same e-mail where I made the proposal.

Cheers,
George Dunlap April 6, 2020, 5:58 p.m. UTC | #24
> On Apr 3, 2020, at 9:27 PM, Julien Grall <julien.grall.oss@gmail.com> wrote:
> 
> On Fri, 3 Apr 2020 at 20:41, Stefano Stabellini <sstabellini@kernel.org> wrote:
>> 
>> On Thu, 2 Apr 2020, Julien Grall wrote:
>>> As we discussed on Tuesday, the cost for other vCPUs is only going to be a
>>> trap to the hypervisor and then back again. The cost is likely smaller than
>>> receiving and forwarding an interrupt.
>>> 
>>> You actually agreed on this analysis. So can you enlighten me as to why
>>> receiving an interrupt is not a problem for latency but this is?
>> 
>> My answer was that the difference is that an operating system can
>> disable interrupts, but it cannot disable receiving this special IPI.
> 
> An OS can *only* disable its own interrupts, yet interrupts will still
> be received by Xen even if the interrupts are masked at the processor
> (e.g. local_irq_disable()).
>
> You would need to disable interrupts one by one at the GIC level (using
> ICENABLER) in order not to receive any interrupts. Yet, Xen may still
> receive interrupts for operational purposes (e.g. serial, console,
> maintenance IRQ...). So traps will happen.

I think Stefano’s assertion is that the users he has in mind will be configuring the system such that RT workloads get as few hypervisor-related interrupts as possible.  On a 4-core system, you could have non-RT workloads running on cores 0-1, and RT workloads running with the NULL scheduler on cores 2-3.  In such a system, you’d obviously arrange that serial and maintenance IRQs are delivered to cores 0-1.

Julien, I think your argument above about interrupts somewhat undermines your point about deadlock.  Your basic argument is that a guest will be interrupted by Xen quite frequently anyway for lots of reasons; adding one more to the list shouldn’t measurably increase the jitter.  But if it’s true that a guest will be interrupted by Xen frequently, then deadlock shouldn’t be much of an issue; and insofar as deadlock is an issue, it’s because there were no interrupts — and thus, adding an extra IPI will have a statistically significant effect on jitter.

On the other hand, Stefano’s argument about deadlock (if I understand him correctly) has been that guests shouldn’t really be spinning on that register anyway.  But it sounds from Julien’s other email that perhaps spinning on the register is exactly what newer versions of Linux do — so Linux would certainly hang on such systems if Stefano’s assertion about the low number of Xen-related interrupts is true.

If the latter is true, then it sounds like addressing the deadlock issue will be necessary?  And so effort needs to be put towards figuring out how to minimize jitter due to Xen-related IPIs.

Wei Xu / Peng Fan, any opinions here?

 - George
Julien Grall April 6, 2020, 6:47 p.m. UTC | #25
On 06/04/2020 18:58, George Dunlap wrote:
> 
> 
>> On Apr 3, 2020, at 9:27 PM, Julien Grall <julien.grall.oss@gmail.com> wrote:
>>
>> On Fri, 3 Apr 2020 at 20:41, Stefano Stabellini <sstabellini@kernel.org> wrote:
>>>
>>> On Thu, 2 Apr 2020, Julien Grall wrote:
>>>> As we discussed on Tuesday, the cost for other vCPUs is only going to be a
>>>> trap to the hypervisor and then back again. The cost is likely smaller than
>>>> receiving and forwarding an interrupt.
>>>>
>>>> You actually agreed on this analysis. So can you enlighten me as to why
>>>> receiving an interrupt is not a problem for latency but this is?
>>>
>>> My answer was that the difference is that an operating system can
>>> disable interrupts, but it cannot disable receiving this special IPI.
>>
>> An OS can *only* disable its own interrupts, yet interrupts will still
>> be received by Xen even if the interrupts are masked at the processor
>> (e.g. local_irq_disable()).
>>
>> You would need to disable interrupts one by one at the GIC level (using
>> ICENABLER) in order not to receive any interrupts. Yet, Xen may still
>> receive interrupts for operational purposes (e.g. serial, console,
>> maintenance IRQ...). So traps will happen.
> 
> I think Stefano’s assertion is that the users he has in mind will be configuring the system such that RT workloads get as few hypervisor-related interrupts as possible.  On a 4-core system, you could have non-RT workloads running on cores 0-1, and RT workloads running with the NULL scheduler on cores 2-3.  In such a system, you’d obviously arrange that serial and maintenance IRQs are delivered to cores 0-1.
Well maintenance IRQs are per pCPU so you can't route them to another one...

But, I think you missed my point that local_irq_disable() from the guest 
will not prevent the hypervisor from receiving interrupts, *even* the
ones routed to the vCPU itself. They will just not be delivered to the guest
context until local_irq_enable() is called.

> 
> Julien, I think your argument above about interrupts somewhat undermines your point about deadlock.  Your basic argument is that a guest will be interrupted by Xen quite frequently anyway for lots of reasons; adding one more to the list shouldn’t measurably increase the jitter.  But if it’s true that a guest will be interrupted by Xen frequently, then deadlock shouldn’t be much of an issue; and insofar as deadlock is an issue, it’s because there were no interrupts — and thus, adding an extra IPI will have a statistically significant effect on jitter.

I presented two ways to disable interrupts because Stefano's e-mail was
not clear enough about which one he was referring to. So I was hoping to
save some round-trips, but it looks like I muddled my point.

If Stefano was referring to critical sections where interrupts are
masked using local_irq_disable(), then you are not going to limit the
number of traps at all (see why above). So the point here was moot.

I don't believe Stefano was referring to the second case, as disabling
interrupts at the GIC level will require trapping into the hypervisor.
But I thought I would mention it.

Regarding the livelock (Marc pointed out it wasn't a deadlock), there 
are multiple conflicting use cases. In an ideal world, there will be 
limited reasons to interrupt the guest. And yes, it will result in a
livelock.

In a less ideal world, there will sometimes be interruptions. But you
don't know when. If you look at determinism, then I am afraid that
Stefano's implementation is not deterministic, because you don't know
when the vCPU will exit.

> 
> On the other hand, Stefano’s argument about deadlock (if I understand him correctly) has been that guests shouldn’t really be spinning on that register anyway.  But it sounds from Julien’s other email that perhaps spinning on the register is exactly what newer versions of Linux do — so Linux would certainly hang on such systems if Stefano’s assertion about the low number of Xen-related interrupts is true.

Rather than playing the game "one person's word against the other 
person's word", from Linux 5.4:

do {
     unsigned long flags;

     /*
      * Wait until we're out of the critical section.  This might
      * give the wrong answer due to the lack of memory barriers.
      */
     while (irqd_irq_inprogress(&desc->irq_data))
         cpu_relax();

     /* Ok, that indicated we're done: double-check carefully. */
     raw_spin_lock_irqsave(&desc->lock, flags);
     inprogress = irqd_irq_inprogress(&desc->irq_data);

     /*
      * If requested and supported, check at the chip whether it
      * is in flight at the hardware level, i.e. already pending
      * in a CPU and waiting for service and acknowledge.
      */
     if (!inprogress && sync_chip) {
         /*
          * Ignore the return code. inprogress is only updated
          * when the chip supports it.
          */
         __irq_get_irqchip_state(irqd, IRQCHIP_STATE_ACTIVE,
                                 &inprogress);
     }
     raw_spin_unlock_irqrestore(&desc->lock, flags);
     /* Oops, that failed? */
} while (inprogress);

> 
> If the latter is true, then it sounds like addressing the deadlock issue will be necessary?  And so effort needs to be put towards figuring out how to minimize jitter due to Xen-related IPIs.
Here is what I wrote before:

"
A few remarks:
     * The function do_nothing() is basically a NOP.
     * I am suggesting to use smp_call_function() rather than
smp_send_event_check_cpu() because we need to know when the vCPU has
synchronized the LRs. As the function do_nothing() will be called
afterwards, we know the snapshot of the LRs has been done.
     * It would be possible to do everything in one vCPU.
     * We can possibly optimize it for the SGIs/PPIs case

This is still not perfect, but I don't think the performance would be
that bad.
"

"
But if your concern is disturbing a vCPU which has interrupts
disabled, then there is a way to mitigate this in an implementation: you
can easily know whether there are interrupts inflight at a given point
for a given vCPU. So you can avoid the IPI if you know there are no
interrupts potentially active.
"

I am looking forward to hearing about the potential improvements.

Cheers,
George Dunlap April 7, 2020, 4:16 p.m. UTC | #26
> On Apr 6, 2020, at 7:47 PM, Julien Grall <julien@xen.org> wrote:
> 
> On 06/04/2020 18:58, George Dunlap wrote:
>>> On Apr 3, 2020, at 9:27 PM, Julien Grall <julien.grall.oss@gmail.com> wrote:
>>> 
>>> On Fri, 3 Apr 2020 at 20:41, Stefano Stabellini <sstabellini@kernel.org> wrote:
>>>> 
>>>> On Thu, 2 Apr 2020, Julien Grall wrote:
>>>>> As we discussed on Tuesday, the cost for other vCPUs is only going to be a
>>>>> trap to the hypervisor and then back again. The cost is likely smaller than
>>>>> receiving and forwarding an interrupt.
>>>>> 
>>>>> You actually agreed on this analysis. So can you enlighten me as to why
>>>>> receiving an interrupt is not a problem for latency but this is?
>>>> 
>>>> My answer was that the difference is that an operating system can
>>>> disable interrupts, but it cannot disable receiving this special IPI.
>>> 
>>> An OS can *only* disable its own interrupts, yet interrupts will still
>>> be received by Xen even if the interrupts are masked at the processor
>>> (e.g. local_irq_disable()).
>>>
>>> You would need to disable interrupts one by one at the GIC level (using
>>> ICENABLER) in order not to receive any interrupts. Yet, Xen may still
>>> receive interrupts for operational purposes (e.g. serial, console,
>>> maintenance IRQ...). So traps will happen.
>> I think Stefano’s assertion is that the users he has in mind will be configuring the system such that RT workloads get as few hypervisor-related interrupts as possible.  On a 4-core system, you could have non-RT workloads running on cores 0-1, and RT workloads running with the NULL scheduler on cores 2-3.  In such a system, you’d obviously arrange that serial and maintenance IRQs are delivered to cores 0-1.
> Well maintenance IRQs are per pCPU so you can't route them to another one...
> 
> But, I think you missed my point that local_irq_disable() from the guest will not prevent the hypervisor from receiving interrupts, *even* the ones routed to the vCPU itself. They will just not be delivered to the guest context until local_irq_enable() is called.

My understanding from Stefano is that what his customers are concerned about is the time between the time a physical IRQ is delivered to the guest and the time the guest OS can respond appropriately.  The key thing here isn’t necessarily speed, but predictability — system designers need to know that, with a high probability, their interrupt routines will complete within X amount of cycles.

Further interrupts delivered to a guest are not a problem in this scenario, if the guest can disable them until the critical IRQ has been handled.

Xen-related IPIs, however, could potentially cause a problem if not mitigated.  Consider a guest where vcpu 1 loops over the register, while vcpu 2 is handling a latency-critical IRQ.  A naive implementation might send an IPI every time vcpu 1 does a read, spamming vcpu 2 with dozens of IPIs.  Then an IRQ routine which reliably finishes well within the required time normally suddenly overruns and causes an issue.

I don’t know what maintenance IRQs are, but if they only happen intermittently, it’s possible that you’d never get more than a single one in a latency-critical IRQ routine; and as such, the variability in execution time (jitter) wouldn’t be an issue in practice.  But every time you add a new unblockable IPI, you increase this jitter; particularly if this unblockable IPI might be repeated an arbitrary number of times.

(Stefano, let me know if I’ve misunderstood something.)

So stepping back a moment, here’s all the possible ideas that I think have been discussed (or are there implicitly) so far.

1. [Default] Do nothing; guests using this register continue crashing

2. Make the I?ACTIVER registers RAZ/WI.

3. Make I?ACTIVER return the most recent known value; i.e. KVM’s current behavior (as far as we understand it)

4. Use a simple IPI with do_noop to update I?ACTIVER

4a.  Use an IPI, but come up with clever tricks to avoid interrupting guests handling IRQs.

5. Trap to Xen on guest EOI, so that we know when the interrupt has been deactivated.

6. Some clever paravirtualized option

Obviously nobody wants #1, and #3 is clearly not really an option now either.

#2 is not great, but it’s simple and quick to implement for now.  Julien, I’m not sure of your position on this one: you rejected the idea back in v1 of this patch series, but seemed to refer to it again earlier in this thread.

For #4, a “dumb” version is relatively quick to implement, but such a “dumb” version has a high risk of causing unacceptable jitter (or at least, so Stefano believes).

#4a or #6 are further potential lines to explore, but would require a lot of additional design to get working right.

I think if I understand Stefano’s PoV, then #5 would actually be acceptable — the overall amount of time spent in the hypervisor would probably be greater, but it would be bounded and predictable: once someone got their IRQ handler working reliably, it would likely continue to work.

It sounds like #5 might be pretty quick to implement; and then at some point in the future if someone wants to improve performance, they can work on 4a or 6.
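
For the record, my (possibly naive) understanding of how #5 could work on GICv2: for a software-injected vIRQ (HW == 0), setting bit 19 of the List Register requests a maintenance interrupt when the guest EOIs it, so Xen regains control exactly when the interrupt leaves the active state.  A sketch, with the bit layout taken from the GICv2 spec and the helper name made up:

#define GICH_LR_STATE_PENDING   (1U << 28)  /* State[29:28] = pending */
#define GICH_LR_EOI_MAINT       (1U << 19)  /* HW == 0: maintenance IRQ on EOI */

static uint32_t make_lr_with_eoi_trap(uint32_t virq, unsigned int priority)
{
    uint32_t lr = virq & 0x3ff;             /* VirtualID[9:0] */

    lr |= (priority & 0x1f) << 23;          /* Priority[27:23] */
    lr |= GICH_LR_STATE_PENDING;
    lr |= GICH_LR_EOI_MAINT;                /* trap to Xen on guest EOI */

    return lr;
}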

Any thoughts?  Anything I’m missing?

 -George
Bertrand Marquis April 7, 2020, 4:25 p.m. UTC | #27
On 7 Apr 2020, at 17:16, George Dunlap <George.Dunlap@citrix.com> wrote:



On Apr 6, 2020, at 7:47 PM, Julien Grall <julien@xen.org> wrote:

On 06/04/2020 18:58, George Dunlap wrote:
On Apr 3, 2020, at 9:27 PM, Julien Grall <julien.grall.oss@gmail.com> wrote:

On Fri, 3 Apr 2020 at 20:41, Stefano Stabellini <sstabellini@kernel.org> wrote:

On Thu, 2 Apr 2020, Julien Grall wrote:
As we discussed on Tuesday, the cost for other vCPUs is only going to be a
trap to the hypervisor and then back again. The cost is likely smaller than
receiving and forwarding an interrupt.

You actually agreed on this analysis. So can you enlighten me as to why
receiving an interrupt is not a problem for latency but this is?

My answer was that the difference is that an operating system can
disable interrupts, but it cannot disable receiving this special IPI.

An OS can *only* disable its own interrupts, yet interrupts will still
be received by Xen even if the interrupts are masked at the processor
(e.g. local_irq_disable()).

You would need to disable interrupts one by one at the GIC level (using
ICENABLER) in order not to receive any interrupts. Yet, Xen may still
receive interrupts for operational purposes (e.g. serial, console,
maintenance IRQ...). So traps will happen.
I think Stefano’s assertion is that the users he has in mind will be configuring the system such that RT workloads get as few hypervisor-related interrupts as possible.  On a 4-core system, you could have non-RT workloads running on cores 0-1, and RT workloads running with the NULL scheduler on cores 2-3.  In such a system, you’d obviously arrange that serial and maintenance IRQs are delivered to cores 0-1.
Well maintenance IRQs are per pCPU so you can't route them to another one...

But, I think you missed my point that local_irq_disable() from the guest will not prevent the hypervisor from receiving interrupts, *even* the ones routed to the vCPU itself. They will just not be delivered to the guest context until local_irq_enable() is called.

My understanding from Stefano is that what his customers are concerned about is the time between the time a physical IRQ is delivered to the guest and the time the guest OS can respond appropriately.  The key thing here isn’t necessarily speed, but predictability — system designers need to know that, with a high probability, their interrupt routines will complete within X amount of cycles.

Further interrupts delivered to a guest are not a problem in this scenario, if the guest can disable them until the critical IRQ has been handled.

Xen-related IPIs, however, could potentially cause a problem if not mitigated.  Consider a guest where vcpu 1 loops over the register, while vcpu 2 is handling a latency-critical IRQ.  A naive implementation might send an IPI every time vcpu 1 does a read, spamming vcpu 2 with dozens of IPIs.  Then an IRQ routine which reliably finishes well within the required time normally suddenly overruns and causes an issue.

I don’t know what maintenance IRQs are, but if they only happen intermittently, it’s possible that you’d never get more than a single one in a latency-critical IRQ routine; and as such, the variability in execution time (jitter) wouldn’t be an issue in practice.  But every time you add a new unblockable IPI, you increase this jitter; particularly if this unblockable IPI might be repeated an arbitrary number of times.

(Stefano, let me know if I’ve misunderstood something.)

So stepping back a moment, here’s all the possible ideas that I think have been discussed (or are there implicitly) so far.

1. [Default] Do nothing; guests using this register continue crashing

2. Make the I?ACTIVER registers RAZ/WI.

3. Make I?ACTIVER return the most recent known value; i.e. KVM’s current behavior (as far as we understand it)

4. Use a simple IPI with do_noop to update I?ACTIVER

4a.  Use an IPI, but come up with clever tricks to avoid interrupting guests handling IRQs.

5. Trap to Xen on guest EOI, so that we know when the interrupt has been deactivated.

This is possible to do on a per-interrupt basis or when all interrupts in the LR registers have been handled (maintenance interrupt when there is nothing left to handle in the LR registers, used to fire other interrupts if we have more pending interrupts than the number of LR registers).

Maybe a solution making sure we get a maintenance interrupt once all interrupts in the LR registers have been handled could be a good mitigation?

This would avoid syncing the other cores but would make sure the time until we clean up active interrupts is bounded, so the poller could be sure to get the information at some acceptable point in time.
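
For GICv2 this could map onto GICH_HCR.UIE (Underflow Interrupt Enable), which signals a maintenance interrupt when the List Registers are empty or hold at most one valid entry.  Something like the following sketch (the register accessors are made up):

#define GICH_HCR_EN    (1U << 0)
#define GICH_HCR_UIE   (1U << 1)   /* Underflow Interrupt Enable */

static void vgic_request_maintenance_on_empty_lrs(void)
{
    /* readl_gich()/writel_gich() are hypothetical accessors. */
    uint32_t hcr = readl_gich(GICH_HCR);

    writel_gich(hcr | GICH_HCR_UIE, GICH_HCR);
}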


6. Some clever paravirtualized option

Obviously nobody wants #1, and #3 is clearly not really an option now either.

#2 is not great, but it’s simple and quick to implement for now.  Julien, I’m not sure of your position on this one: you rejected the idea back in v1 of this patch series, but seemed to refer to it again earlier in this thread.

For #4, a “dumb” version is relatively quick to implement, but such a “dumb” version has a high risk of causing unacceptable jitter (or at least, so Stefano believes).

#4a or #6 are further potential lines to explore, but would require a lot of additional design to get working right.

I think if I understand Stefano’s PoV, then #5 would actually be acceptable — the overall amount of time spent in the hypervisor would probably be greater, but it would be bounded and predictable: once someone got their IRQ handler working reliably, it would likely continue to work.

It sounds like #5 might be pretty quick to implement; and then at some point in the future if someone wants to improve performance, they can work on 4a or 6.

I agree this could be a good mitigation.

Bertrand
Julien Grall April 7, 2020, 4:50 p.m. UTC | #28
On 07/04/2020 17:16, George Dunlap wrote:
> 
> 
>> On Apr 6, 2020, at 7:47 PM, Julien Grall <julien@xen.org> wrote:
>>
>> On 06/04/2020 18:58, George Dunlap wrote:
>>>> On Apr 3, 2020, at 9:27 PM, Julien Grall <julien.grall.oss@gmail.com> wrote:
>>>>
>>>> On Fri, 3 Apr 2020 at 20:41, Stefano Stabellini <sstabellini@kernel.org> wrote:
>>>>>
>>>>> On Thu, 2 Apr 2020, Julien Grall wrote:
>>>>>> As we discussed on Tuesday, the cost for other vCPUs is only going to be a
>>>>>> trap to the hypervisor and then back again. The cost is likely smaller than
>>>>>> receiving and forwarding an interrupt.
>>>>>>
>>>>>> You actually agreed on this analysis. So can you enlighten me as to why
>>>>>> receiving an interrupt is not a problem for latency but this is?
>>>>>
>>>>> My answer was that the difference is that an operating system can
>>>>> disable interrupts, but it cannot disable receiving this special IPI.
>>>>
>>>> An OS can *only* disable its own interrupts, yet interrupts will still
>>>> be received by Xen even if the interrupts are masked at the processor
>>>> (e.g. local_irq_disable()).
>>>>
>>>> You would need to disable interrupts one by one at the GIC level (using
>>>> ICENABLER) in order not to receive any interrupts. Yet, Xen may still
>>>> receive interrupts for operational purposes (e.g. serial, console,
>>>> maintenance IRQ...). So traps will happen.
>>> I think Stefano’s assertion is that the users he has in mind will be configuring the system such that RT workloads get as few hypervisor-related interrupts as possible.  On a 4-core system, you could have non-RT workloads running on cores 0-1, and RT workloads running with the NULL scheduler on cores 2-3.  In such a system, you’d obviously arrange that serial and maintenance IRQs are delivered to cores 0-1.
>> Well maintenance IRQs are per pCPU so you can't route them to another one...
>>
>> But, I think you missed my point that local_irq_disable() from the guest will not prevent the hypervisor from receiving interrupts, *even* the ones routed to the vCPU itself. They will just not be delivered to the guest context until local_irq_enable() is called.
> 
> My understanding from Stefano is that what his customers are concerned about is the time between the time a physical IRQ is delivered to the guest and the time the guest OS can respond appropriately.  The key thing here isn’t necessarily speed, but predictability — system designers need to know that, with a high probability, their interrupt routines will complete within X amount of cycles.
> 
> Further interrupts delivered to a guest are not a problem in this scenario, if the guest can disable them until the critical IRQ has been handled.

You keep saying a guest can disable interrupts, but it can't do it via 
local_irq_disable(). So what method are you thinking of? Disabling at the
GIC level? That involves traps and is most likely not going to help
with predictability...

> 
> Xen-related IPIs, however, could potentially cause a problem if not mitigated.  Consider a guest where vcpu 1 loops over the register, while vcpu 2 is handling a latency-critical IRQ.  A naive implementation might send an IPI every time vcpu 1 does a read, spamming vcpu 2 with dozens of IPIs.  Then an IRQ routine which reliably finishes well within the required time normally suddenly overruns and causes an issue.

I never suggested the naive implementation would be perfect. That's why 
I said it can be optimized...

> 
> I don’t know what maintenance IRQs are, but if they only happen intermittently, it’s possible that you’d never get more than a single one in a latency-critical IRQ routine; and as such, the variability in execution time (jitter) wouldn’t be an issue in practice.  But every time you add a new unblockable IPI, you increase this jitter; particularly if this unblockable IPI might be repeated an arbitrary number of times.
> 
> (Stefano, let me know if I’ve misunderstood something.)
> 
> So stepping back a moment, here’s all the possible ideas that I think have been discussed (or are there implicitly) so far.
> 
> 1. [Default] Do nothing; guests using this register continue crashing
> 
> 2. Make the I?ACTIVER registers RAZ/WI.
> 
> 3. Make I?ACTIVER return the most recent known value; i.e. KVM’s current behavior (as far as we understand it)
> 
> 4. Use a simple IPI with do_noop to update I?ACTIVER
> 
> 4a.  Use an IPI, but come up with clever tricks to avoid interrupting guests handling IRQs.
> 
> 5. Trap to Xen on guest EOI, so that we know when the interrupt has been deactivated.
> 
> 6. Some clever paravirtualized option

7. Use an IPI if we are confident the interrupts may be active.
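
For instance (a rough sketch; locking is elided, and it relies on the
inflight_irqs list Xen already keeps per vCPU):

static bool vcpu_may_have_active_irq(const struct vcpu *v)
{
    /*
     * If no interrupt is inflight for this vCPU, nothing can be in the
     * ACTIVE state, so there is no need to IPI the pCPU running it.
     */
    return !list_empty(&v->arch.vgic.inflight_irqs);
}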

> 
> Obviously nobody wants #1, and #3 is clearly not really an option now either.
> 
> #2 is not great, but it’s simple and quick to implement for now.  Julien, I’m not sure of your position on this one: you rejected the idea back in v1 of this patch series, but seemed to refer to it again earlier in this thread.
> 
> For #4, a “dumb” version is relatively quick to implement, but such a “dumb” version has a high risk of causing unacceptable jitter (or at least, so Stefano believes).
> 
> #4a or #6 are further potential lines to explore, but would require a lot of additional design to get working right.
> 
> I think if I understand Stefano’s PoV, then #5 would actually be acceptable — the overall amount of time spent in the hypervisor would probably be greater, but it would be bounded and predictable: once someone got their IRQ handler working reliably, it would likely continue to work.
> It sounds like #5 might be pretty quick to implement; and then at some point in the future if someone wants to improve performance, they can work on 4a or 6.
With the existing solution, you may only trap when receiving the next
interrupt. But with your approach you are now trapping once on
deactivation and potentially trapping again soon after when receiving an
interrupt. So you increase the cost per interrupt.

While I agree this is going to make Stefano's customers happy, you are 
also going to make unhappy anyone with a workload using a high number of
interrupts (e.g. media).

Given that none of Stefano's customers are actually using ISACTIVER
today, it feels to me it is wrong to add pain for everyone else.

It feels to me we want to either implement 4. or 7. and then optimize it 
based on feedback.

Cheers,
Julien Grall April 7, 2020, 4:52 p.m. UTC | #29
On 07/04/2020 17:25, Bertrand Marquis wrote:
> This is possible to do on a per-interrupt basis or when all interrupts
> in the LR registers have been handled (maintenance interrupt when there
> is nothing left to handle in the LR registers, used to fire other
> interrupts if we have more pending interrupts than the number of LR
> registers).
> 
> Maybe a solution making sure we get a maintenance interrupt once all
> interrupts in the LR registers have been handled could be a good
> mitigation?

The chance of having multiple interrupts inflight is fairly limited. So 
effectively, you will trap for every interrupt.

Cheers,
Julien Grall April 7, 2020, 4:56 p.m. UTC | #30
On 07/04/2020 17:50, Julien Grall wrote:
> 
> 
> On 07/04/2020 17:16, George Dunlap wrote:
>>
>>
>>> On Apr 6, 2020, at 7:47 PM, Julien Grall <julien@xen.org> wrote:
>>>
>>> On 06/04/2020 18:58, George Dunlap wrote:
>>>>> On Apr 3, 2020, at 9:27 PM, Julien Grall 
>>>>> <julien.grall.oss@gmail.com> wrote:
>>>>>
>>>>> On Fri, 3 Apr 2020 at 20:41, Stefano Stabellini 
>>>>> <sstabellini@kernel.org> wrote:
>>>>>>
>>>>>> On Thu, 2 Apr 2020, Julien Grall wrote:
>>>>>>> As we discussed on Tuesday, the cost for other vCPUs is only 
>>>>>>> going to be a
>>>>>>> trap to the hypervisor and then back again. The cost is likely 
>>>>>>> smaller than
>>>>>>> receiving and forwarding an interrupt.
>>>>>>>
>>>>>>> You actually agreed on this analysis. So can you enlighten me as 
>>>>>>> to why
>>>>>>> receiving an interrupt is not a problem for latency but this is?
>>>>>>
>>>>>> My answer was that the difference is that an operating system can
>>>>>> disable interrupts, but it cannot disable receiving this special IPI.
>>>>>
>>>>> An OS can *only* disable its own interrupts, yet interrupts will still
>>>>> be received by Xen even if the interrupts are masked at the processor
>>>>> (e.g. local_irq_disable()).
>>>>>
>>>>> You would need to disable interrupts one by one at the GIC level (using
>>>>> ICENABLER) in order not to receive any interrupts. Yet, Xen may still
>>>>> receive interrupts for operational purposes (e.g. serial, console,
>>>>> maintenance IRQ...). So traps will happen.
>>>> I think Stefano’s assertion is that the users he has in mind will be 
>>>> configuring the system such that RT workloads get as few
>>>> hypervisor-related interrupts as possible.  On a 4-core system, you
>>>> could have non-RT workloads running on cores 0-1, and RT workloads
>>>> running with the NULL scheduler on cores 2-3.  In such a system, 
>>>> you’d obviously arrange that serial and maintenance IRQs are 
>>>> delivered to cores 0-1.
>>> Well maintenance IRQs are per pCPU so you can't route them to another one...
>>>
>>> But, I think you missed my point that local_irq_disable() from the
>>> guest will not prevent the hypervisor from receiving interrupts,
>>> *even* the ones routed to the vCPU itself. They will just not be
>>> delivered to the guest context until local_irq_enable() is called.
>>
>> My understanding from Stefano is that what his customers are
>> concerned about is the time between the time a physical IRQ is 
>> delivered to the guest and the time the guest OS can respond 
>> appropriately.  The key thing here isn’t necessarily speed, but 
>> predictability — system designers need to know that, with a high 
>> probability, their interrupt routines will complete within X amount of 
>> cycles.
>>
>> Further interrupts delivered to a guest are not a problem in this 
>> scenario, if the guest can disable them until the critical IRQ has 
>> been handled.
> 
> You keep saying a guest can disable interrupts, but it can't do it via 
> local_irq_disable(). So what method are you thinking of? Disabling at the
> GIC level? That involves traps and is most likely not going to help
> with predictability...

Just to be clear, I meant interrupts received by Xen, including the ones
routed to that vCPU.

Cheers,
George Dunlap April 7, 2020, 6:16 p.m. UTC | #31
> On Apr 7, 2020, at 5:50 PM, Julien Grall <julien@xen.org> wrote:
> 
> 
> 
> On 07/04/2020 17:16, George Dunlap wrote:
>>> On Apr 6, 2020, at 7:47 PM, Julien Grall <julien@xen.org> wrote:
>>> 
>>> On 06/04/2020 18:58, George Dunlap wrote:
>>>>> On Apr 3, 2020, at 9:27 PM, Julien Grall <julien.grall.oss@gmail.com> wrote:
>>>>> 
>>>>> On Fri, 3 Apr 2020 at 20:41, Stefano Stabellini <sstabellini@kernel.org> wrote:
>>>>>> 
>>>>>> On Thu, 2 Apr 2020, Julien Grall wrote:
>>>>>>> As we discussed on Tuesday, the cost for other vCPUs is only going to be a
>>>>>>> trap to the hypervisor and then back again. The cost is likely smaller than
>>>>>>> receiving and forwarding an interrupt.
>>>>>>> 
>>>>>>> You actually agreed on this analysis. So can you enlighten me as to why
>>>>>>> receiving an interrupt is not a problem for latency but this is?
>>>>>> 
>>>>>> My answer was that the difference is that an operating system can
>>>>>> disable interrupts, but it cannot disable receiving this special IPI.
>>>>> 
>>>>> An OS can *only* disable its own interrupts, yet interrupts will still
>>>>> be received by Xen even if the interrupts are masked at the processor
>>>>> (e.g. local_irq_disable()).
>>>>>
>>>>> You would need to disable interrupts one by one at the GIC level (using
>>>>> ICENABLER) in order not to receive any interrupts. Yet, Xen may still
>>>>> receive interrupts for operational purposes (e.g. serial, console,
>>>>> maintenance IRQ...). So traps will happen.
>>>> I think Stefano’s assertion is that the users he has in mind will be configuring the system such that RT workloads get as few hypervisor-related interrupts as possible.  On a 4-core system, you could have non-RT workloads running on cores 0-1, and RT workloads running with the NULL scheduler on cores 2-3.  In such a system, you’d obviously arrange that serial and maintenance IRQs are delivered to cores 0-1.
>>> Well maintenance IRQs are per pCPU so you can't route them to another one...
>>> 
>>> But, I think you missed my point that local_irq_disable() from the guest will not prevent the hypervisor from receiving interrupts, *even* the ones routed to the vCPU itself. They will just not be delivered to the guest context until local_irq_enable() is called.
>> My understanding from Stefano is that what his customers are concerned about is the time between the time a physical IRQ is delivered to the guest and the time the guest OS can respond appropriately.  The key thing here isn’t necessarily speed, but predictability — system designers need to know that, with a high probability, their interrupt routines will complete within X amount of cycles.
>> Further interrupts delivered to a guest are not a problem in this scenario, if the guest can disable them until the critical IRQ has been handled.
> 
> You keep saying a guest can disable interrupts, but it can't do it via local_irq_disable(). So what method are you thinking of? Disabling at the GIC level? That involves traps and is most likely not going to help with predictability...

So you’ll have to forgive me for making educated guesses here, as I’m trying to collect all the information.  On x86, if you use device pass-through on a system with a virtualized APIC and posted interrupts, then when the device generates interrupts, those are delivered directly to the guest without involvement of Xen; and when the guest disables interrupts in the vAPIC, those interrupts will be disabled, and be delivered when the guest re-enables interrupts.  Given what Stefano said about disabling interrupts, I assumed that ARM had the same sort of functionality.  Is that not the case?

>> Xen-related IPIs, however, could potentially cause a problem if not mitigated. Consider a guest where vcpu 1 loops over the register, while vcpu 2 is handling a latency-critical IRQ.  A naive implementation might send an IPI every time vcpu 1 does a read, spamming vcpu 2 with dozens of IPIs.  Then an IRQ routine which reliably finishes well within the required time normally suddenly overruns and causes an issue.
> 
> I never suggested the naive implementation would be perfect. That's why I said it can be optimized...

It didn’t seem to me that you understood what Stefano’s concerns were, so I was trying to explain the situation he is trying to avoid (as well as making sure that I had a clear understanding myself).  The reason I said “a naive implementation” was to make clear that I knew that’s not what you were suggesting. :-)

>> I don’t know what maintenance IRQs are, but if they only happen intermittently, it’s possible that you’d never get more than a single one in a latency-critical IRQ routine; and as such, the variability in execution time (jitter) wouldn’t be an issue in practice.  But every time you add a new unblockable IPI, you increase this jitter; particularly if this unblockable IPI might be repeated an arbitrary number of times.
>> (Stefano, let me know if I’ve misunderstood something.)
>> So stepping back a moment, here are all the possible ideas that I think have been discussed (or are implicitly there) so far.
>> 1. [Default] Do nothing; guests using this register continue crashing
>> 2. Make the I?ACTIVER registers RZWI.
>> 3. Make I?ACTIVER return the most recent known value; i.e. KVM’s current behavior (as far as we understand it)
>> 4. Use a simple IPI with do_noop to update I?ACTIVER
>> 4a.  Use an IPI, but come up with clever tricks to avoid interrupting guests handling IRQs.
>> 5. Trap to Xen on guest EOI, so that we know when the interrupt is deactivated
>> 6. Some clever paravirtualized option
> 
> 7. Use an IPI if we are confident the interrupts may be active.

I don’t understand this one.  How is it different from 4 or 4a?  And in particular, how does it evaluate on the “how much additional design work would it take” criterion?

 -George
Julien Grall April 7, 2020, 7:58 p.m. UTC | #32
Hi George,

On 07/04/2020 19:16, George Dunlap wrote:
> 
> 
>> On Apr 7, 2020, at 5:50 PM, Julien Grall <julien@xen.org> wrote:
>>
>>
>>
>> On 07/04/2020 17:16, George Dunlap wrote:
>>>> On Apr 6, 2020, at 7:47 PM, Julien Grall <julien@xen.org> wrote:
>>>>
>>>> On 06/04/2020 18:58, George Dunlap wrote:
>>>>>> On Apr 3, 2020, at 9:27 PM, Julien Grall <julien.grall.oss@gmail.com> wrote:
>>>>>>
>>>>>> On Fri, 3 Apr 2020 at 20:41, Stefano Stabellini <sstabellini@kernel.org> wrote:
>>>>>>>
>>>>>>> On Thu, 2 Apr 2020, Julien Grall wrote:
>>>>>>>> As we discussed on Tuesday, the cost for other vCPUs is only going to be a
>>>>>>>> trap to the hypervisor and then back again. The cost is likely smaller than
>>>>>>>> receiving and forwarding an interrupt.
>>>>>>>>
>>>>>>>> You actually agreed on this analysis. So can you enlighten me as to why
>>>>>>>> receiving an interrupt is not a problem for latency but this is?
>>>>>>>
>>>>>>> My answer was that the difference is that an operating system can
>>>>>>> disable interrupts, but it cannot disable receiving this special IPI.
>>>>>>
>>>>>> An OS can *only* disable its own interrupts, yet interrupts will still
>>>>>> be received by Xen even if the interrupts are masked at the processor
>>>>>> (e.g. local_irq_disable()).
>>>>>>
>>>>>> You would need to disable interrupts one by one at the GIC level (using
>>>>>> ICENABLER) in order to not receive any interrupts. Yet, Xen may still
>>>>>> receive interrupts for operational purposes (e.g. serial, console,
>>>>>> maintenance IRQ...). So traps will happen.
>>>>> I think Stefano’s assertion is that the users he has in mind will be configuring the system such that RT workloads get the minimum number of hypervisor-related interrupts possible.  On a 4-core system, you could have non-RT workloads running on cores 0-1, and RT workloads running with the NULL scheduler on cores 2-3.  In such a system, you’d obviously arrange that serial and maintenance IRQs are delivered to cores 0-1.
>>>> Well, maintenance IRQs are per pCPU so you can't route them to another one...
>>>>
>>>> But I think you missed my point that local_irq_disable() from the guest will not prevent the hypervisor from receiving interrupts, *even* the ones routed to the vCPU itself. They will just not be delivered to the guest context until local_irq_enable() is called.
>>> My understanding, from Stefano, was that what his customers are concerned about is the time between when a physical IRQ is delivered to the guest and when the guest OS can respond appropriately.  The key thing here isn’t necessarily speed, but predictability — system designers need to know that, with a high probability, their interrupt routines will complete within X cycles.
>>> Further interrupts delivered to a guest are not a problem in this scenario, if the guest can disable them until the critical IRQ has been handled.
>>
>> You keep saying a guest can disable interrupts, but it can't do it via local_irq_disable(). So what method are you thinking of? Disabling at the GIC level? That involves traps and is most likely not going to help with predictability...
> 
> So you’ll have to forgive me for making educated guesses here, as I’m trying to collect all the information.  On x86, if you use device pass-through on a system with a virtualized APIC and posted interrupts, then when the device generates interrupts, those are delivered directly to the guest without involvement of Xen; and when the guest disables interrupts in the vAPIC, those interrupts will be disabled, and be delivered when the guest re-enables interrupts.  Given what Stefano said about disabling interrupts, I assumed that ARM had the same sort of functionality.  Is that not the case?

Posted interrupts are only present from GICv4 onwards. GICv4 only 
supports direct injection for LPIs (e.g. MSIs) and GICv4.1 added support 
for direct injection of SGIs (aka IPIs).

Xen on Arm does not support any posted interrupts at all, and I don't 
believe Stefano has HW (at least in production) using GICv4.

> 
>>> Xen-related IPIs, however, could potentially cause a problem if not mitigated. Consider a guest where vcpu 1 loops over the register, while vcpu 2 is handling a latency-critical IRQ.  A naive implementation might send an IPI every time vcpu 1 does a read, spamming vcpu 2 with dozens of IPIs.  Then an IRQ routine which normally finishes reliably well within the required time suddenly overruns and causes an issue.
>>
>> I never suggested the naive implementation would be perfect. That's why I said it can be optimized...
> 
> It didn’t seem to me that you understood what Stefano’s concerns were, so I was trying to explain the situation he is trying to avoid (as well as making sure that I had a clear understanding myself).  The reason I said “a naive implementation” was to make clear that I knew that’s not what you were suggesting. :-)

I know Stefano's concerns, sorry it was not clear enough :).

> 
>>> I don’t know what maintenance IRQs are, but if they only happen intermittently, it’s possible that you’d never get more than a single one in a latency-critical IRQ routine; and as such, the variability in execution time (jitter) wouldn’t be an issue in practice.  But every time you add a new unblockable IPI, you increase this jitter; particularly if this unblockable IPI might be repeated an arbitrary number of times.
>>> (Stefano, let me know if I’ve misunderstood something.)
>>> So stepping back a moment, here are all the possible ideas that I think have been discussed (or are implicitly there) so far.
>>> 1. [Default] Do nothing; guests using this register continue crashing
>>> 2. Make the I?ACTIVER registers RZWI.
>>> 3. Make I?ACTIVER return the most recent known value; i.e. KVM’s current behavior (as far as we understand it)
>>> 4. Use a simple IPI with do_noop to update I?ACTIVER
>>> 4a.  Use an IPI, but come up with clever tricks to avoid interrupting guests handling IRQs.
>>> 5. Trap to Xen on guest EOI, so that we know when the interrupt is deactivated
>>> 6. Some clever paravirtualized option
>>
>> 7. Use an IPI if we are confident the interrupts may be active.
> 
> I don’t understand this one.  How is it different from 4 or 4a?  And in particular, how does it evaluate on the “how much additional design work would it take” criterion?

Let me start with this: if we want to have a vGIC to rule them all, then 
I am afraid you are going to have to compromise. We can discuss 
tailoring the vGIC, but I think before that we need a vGIC that is 
faithful to the spec (e.g. differentiating level vs edge interrupts, 
handling the active state...). Note that Arm spent some effort to get a 
new vGIC merged, but this was never made a first-class citizen.

However, even if you tailor the vGIC, you are not going to be able to 
avoid IPIs on some occasions. This would happen when using event 
channels, in-guest IPIs... Another example is when enabling an 
interrupt, although I realize that our vGIC does not do it today, 
meaning that an interrupt that became pending while disabled will not 
be forwarded until the vCPU exits.

Furthermore, implementing a write to I{C,S}ACTIVER (to 
activate/de-activate an interrupt) is going to be very difficult (not 
to say impossible) to do without an IPI.

If you are worried about a vCPU being interrupted in a critical 
section, then I think you should tailor your guest to avoid using those 
registers.

An alternative would be to use a spinlock/mutex within the guest code 
to prevent a vCPU from accessing the vGIC registers while another vCPU 
doesn't want to be interrupted.
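
A rough guest-side sketch of that idea (gicd_lock, gicd_readl() and 
GICD_ISACTIVER(n) are made-up names standing in for whatever the guest 
kernel actually uses):

    /*
     * A vCPU that must not be interrupted takes gicd_lock (with IRQs
     * off) around its critical section, so a sibling vCPU cannot
     * trigger the I?ACTIVER emulation (and its IPIs) in the meantime.
     */
    static DEFINE_SPINLOCK(gicd_lock);

    static u32 read_isactiver(unsigned int n)
    {
        u32 val;

        spin_lock(&gicd_lock);
        val = gicd_readl(GICD_ISACTIVER(n)); /* hypothetical accessor */
        spin_unlock(&gicd_lock);

        return val;
    }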

Regarding 4a, the only way I could think of would be to trap the 
instructions that mask/unmask interrupts. If I read the Armv8 spec 
correctly, it is not possible to do that.

7. is basically 4a: the goal would be to avoid interrupting the vCPU as 
much as possible, but you may be wrong sometimes. Obviously we want to 
be correct most of the time.

I understand this may not be the ideal solution, but this is probably 
the best we could come up with, and it does not involve a lot of work 
because we already have all the information in place (we know when an 
interrupt was injected to a vCPU).
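
As an illustrative sketch only (vgic_vcpu_maybe_active() is a 
hypothetical predicate built on that "injected but not yet EOIed" 
state; the other helpers are existing Xen ones):

    /*
     * On an I?ACTIVER read, kick only the vCPUs that may hold an
     * active interrupt instead of IPI'ing every vCPU. The no-op IPI
     * forces an exit, which is enough to sync the vGIC state.
     */
    static void vgic_kick_maybe_active(struct domain *d)
    {
        struct vcpu *v;

        for_each_vcpu ( d, v )
        {
            if ( v == current || !vgic_vcpu_maybe_active(v) )
                continue;

            smp_send_event_check_mask(cpumask_of(v->processor));
        }
    }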

The next best solution is to prevent your guest from modifying some 
registers of the vGIC at the same time as another vCPU is in a critical 
section.

Cheers,
Julien Grall April 8, 2020, 3:54 p.m. UTC | #33
Hi Stefano,

On 03/04/2020 20:41, Stefano Stabellini wrote:
> On Thu, 2 Apr 2020, Julien Grall wrote:
> As you know I cannot reproduce the crash myself, so I asked Peng and Wei
> for help with that. I cannot be certain Jeff's patch makes a difference,
> but looking at the code, if you open
> xen/arch/arm/vgic-v3.c:__vgic_v3_distr_common_mmio_read you can see that
> the range mistake is still there:
> 
> 
>      /* Read the active status of an IRQ via GICD/GICR is not supported */
>      case VRANGE32(GICD_ISACTIVER, GICD_ISACTIVER):
>      case VRANGE32(GICD_ICACTIVER, GICD_ICACTIVERN):
>          goto read_as_zero;
> 
> 
> So a GICD_ISACTIVER read of any register but the first should end up hitting
> the default case:
> 
>      default:
>          printk(XENLOG_G_ERR
>                 "%pv: %s: unhandled read r%d offset %#08x\n",
>                 v, name, dabt.reg, reg);
>          return 0;
>      }
> 
> Which returns 0 (IO_ABORT).
> 
> Would you be happy to have the range fixed to be:
> 
>      case VRANGE32(GICD_ISACTIVER, GICD_ISACTIVERN):
> 
> instead?
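
For clarity, the minimal read-as-zero variant proposed above would 
amount to something like this, mirroring the existing code:

    /* Read the active status of an IRQ via GICD/GICR is not supported */
    case VRANGE32(GICD_ISACTIVER, GICD_ISACTIVERN):
    case VRANGE32(GICD_ICACTIVER, GICD_ICACTIVERN):
        goto read_as_zero;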

I don't particularly like it, but it is not going to make any difference 
for Linux < 5.4. So I am not opposed to it.

However, I am a bit worried the vGIC is still going to be a pile of hacks.
So I think we should continue the discussion about making it better. 
This includes how to implement I{C, S}ACTIVER properly and what sort of 
use-cases we want to support.

Cheers,
Stefano Stabellini April 9, 2020, 1:26 a.m. UTC | #34
On Tue, 7 Apr 2020, Julien Grall wrote:
> > > > I don’t know what maintenance IRQs are, but if they only happen
> > > > intermittently, it’s possible that you’d never get more than a single
> > > > one in a latency-critical IRQ routine; and as such, the variatibility in
> > > > execution time (jitter) wouldn’t be an issue in practice.  But every
> > > > time you add a new unblockable IPI, you increase this jitter;
> > > > particularly if this unblockable IPI might be repeated an arbitrary
> > > > number of times.
> > > > (Stefano, let me know if I’ve misunderstood something.)
> > > > So stepping back a moment, here are all the possible ideas that I think
> > > > have been discussed (or are implicitly there) so far.
> > > > 1. [Default] Do nothing; guests using this register continue crashing
> > > > 2. Make the I?ACTIVER registers RZWI.
> > > > 3. Make I?ACTIVER return the most recent known value; i.e. KVM’s current
> > > > behavior (as far as we understand it)
> > > > 4. Use a simple IPI with do_noop to update I?ACTIVER
> > > > 4a.  Use an IPI, but come up with clever tricks to avoid interrupting
> > > > guests handling IRQs.
> > > > 5. Trap to Xen on guest EOI, so that we know when the interrupt is deactivated
> > > > 6. Some clever paravirtualized option
> > > 
> > > 7. Use an IPI if we are confident the interrupts may be active.
> > 
> > I don’t understand this one.  How is it different from 4 or 4a?  And in
> > particular, how does it evaluate on the “how much additional design work
> > would it take” criterion?
> 
> Let me start with this: if we want to have a vGIC to rule them all, then I am
> afraid you are going to have to compromise. We can discuss tailoring the
> vGIC, but I think before that we need a vGIC that is faithful to the spec
> (e.g. differentiating level vs edge interrupts, handling the active
> state...). Note that Arm spent some effort to get a new vGIC merged, but
> this was never made a first-class citizen.
> 
> However, even if you tailor the vGIC, you are not going to be able to avoid
> IPIs on some occasions. This would happen when using event channels, in-guest
> IPIs... Another example is when enabling an interrupt, although I realize that
> our vGIC does not do it today, meaning that an interrupt that became pending
> while disabled will not be forwarded until the vCPU exits.
> 
> Furthermore, implementing a write to I{C,S}ACTIVER (to activate/de-activate
> an interrupt) is going to be very difficult (not to say impossible) to do
> without an IPI.
> 
> If you are worried about a vCPU being interrupted in a critical section, then
> I think you should tailor your guest to avoid using those registers.

Let's call it option 8 "tell the user that she needs to modify her
kernel."


> An alternative would be to use a spinlock/mutex within the guest code to
> prevent a vCPU from accessing the vGIC registers while another vCPU
> doesn't want to be interrupted.
> 
> Regarding 4a, the only way I could think of would be to trap the instructions
> that mask/unmask interrupts. If I read the Armv8 spec correctly, it is not
> possible to do that.
> 
> 7. is basically 4a: the goal would be to avoid interrupting the vCPU as much
> as possible, but you may be wrong sometimes. Obviously we want to be correct
> most of the time.
>
> I understand this may not be the ideal solution, but this is probably the
> best we could come up with, and it does not involve a lot of work because we
> already have all the information in place (we know when an interrupt was
> injected to a vCPU).
> 
> The next best solution is to prevent your guest from modifying some registers
> of the vGIC at the same time as another vCPU is in a critical section.

Let's call this option 9.


I am just thinking out loud here :)


- 2 "Make the I?ACTIVER registers RZWI"

  As far as I can tell it should prevent the livelock because it would
  never return an ACTIVE state. It could be improved by returning the
  latest ACTIVE information for local interrupts and returning zero for
  interrupts routed to other pcpus. Not perfect but an option.
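
  One possible reading of that improvement, as a sketch (the name
  vgic_isactiver_local() is made up; it treats only SGIs/PPIs, which
  are banked per vCPU, as "local"):

      /*
       * Report ACTIVE only for per-vCPU interrupts (irq < 32), whose
       * state the reading vCPU always has up to date; read-as-zero
       * for shared SPIs that may be active on other vCPUs.
       */
      static uint32_t vgic_isactiver_local(struct vcpu *v,
                                           unsigned int start_irq)
      {
          unsigned int irq;
          uint32_t r = 0;

          for ( irq = start_irq; irq < start_irq + 32 && irq < 32; irq++ )
          {
              const struct pending_irq *p = irq_to_pending(v, irq);

              if ( p != NULL && test_bit(GIC_IRQ_GUEST_ACTIVE, &p->status) )
                  r |= 1U << (irq - start_irq);
          }

          return r;
      }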


- 5 "maintenance interrupt"

  This is good for jitter-sensitive guests but not the best for the
  others. We could enable it conditionally: enable maintenance
  interrupts only for certain vcpus/pcpus, but it is not great to have to
  make this kind of difference in the implementation. However, it is
  possible. Let's see if we can come up with something better.


- 7 "optimized IPI"
 
  A tiny chance of causing issues. Let's see if we can come up with
  anything better.


- 8 "tell the user to fix modify the kernel"

  We could do it in addition to 7. The issue is really how we do it.
  A warning message if DEBUG && sched==null? That doesn't sound
  right. We could introduce a new nojitter=true command line option for
  Xen? It wouldn't really change the behavior of Xen, but it would
  enable this warning. Or, it could enable the warning and also switch
  the implementation of I?ACTIVER to option 2.
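
  As a sketch, such a knob could be wired up with the existing
  boolean_param() helper ("nojitter" and opt_nojitter are of course
  made-up names here):

      /* Hypothetical Xen command line option: nojitter=true */
      static bool __read_mostly opt_nojitter;
      boolean_param("nojitter", opt_nojitter);

      /* ... then, in the I?ACTIVER read emulation: */
      if ( opt_nojitter )
      {
          gdprintk(XENLOG_WARNING,
                   "I?ACTIVER read as zero: guest may need modifying\n");
          goto read_as_zero; /* i.e. fall back to option 2 */
      }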


- 9 "prevent I?ACTIVER during critical sections"

  This could be good but I have a difficult time thinking of how we
  could implement it. How do we tell that the other vcpu is in or out of
  the critical section?

  (I was brainstorming a way to re-enable trapping the idling
  instruction WFI temporarily so that the other vcpu would definitely
  trap into Xen again when it is out of the critical section. But it
  doesn't seem possible from the spec.)
Julien Grall April 9, 2020, 12:56 p.m. UTC | #35
On 09/04/2020 02:26, Stefano Stabellini wrote:
> On Tue, 7 Apr 2020, Julien Grall wrote:
>>>>> I don’t know what maintenance IRQs are, but if they only happen
>>>>> intermittently, it’s possible that you’d never get more than a single
>>>>> one in a latency-critical IRQ routine; and as such, the variability in
>>>>> execution time (jitter) wouldn’t be an issue in practice.  But every
>>>>> time you add a new unblockable IPI, you increase this jitter;
>>>>> particularly if this unblockable IPI might be repeated an arbitrary
>>>>> number of times.
>>>>> (Stefano, let me know if I’ve misunderstood something.)
>>>>> So stepping back a moment, here are all the possible ideas that I think
>>>>> have been discussed (or are implicitly there) so far.
>>>>> 1. [Default] Do nothing; guests using this register continue crashing
>>>>> 2. Make the I?ACTIVER registers RZWI.
>>>>> 3. Make I?ACTIVER return the most recent known value; i.e. KVM’s current
>>>>> behavior (as far as we understand it)
>>>>> 4. Use a simple IPI with do_noop to update I?ACTIVER
>>>>> 4a.  Use an IPI, but come up with clever tricks to avoid interrupting
>>>>> guests handling IRQs.
>>>>> 5. Trap to Xen on guest EOI, so that we know when the interrupt is deactivated
>>>>> 6. Some clever paravirtualized option
>>>>
>>>> 7. Use an IPI if we are confident the interrupts may be active.
>>>
>>> I don’t understand this one.  How is it different from 4 or 4a?  And in
>>> particular, how does it evaluate on the “how much additional design work
>>> would it take” criterion?
>>
>> Let me start with this: if we want to have a vGIC to rule them all, then I am
>> afraid you are going to have to compromise. We can discuss tailoring the
>> vGIC, but I think before that we need a vGIC that is faithful to the spec
>> (e.g. differentiating level vs edge interrupts, handling the active
>> state...). Note that Arm spent some effort to get a new vGIC merged, but
>> this was never made a first-class citizen.
>>
>> However, even if you tailor the vGIC, you are not going to be able to avoid
>> IPIs on some occasions. This would happen when using event channels, in-guest
>> IPIs... Another example is when enabling an interrupt, although I realize that
>> our vGIC does not do it today, meaning that an interrupt that became pending
>> while disabled will not be forwarded until the vCPU exits.
>>
>> Furthermore, implementing a write to I{C,S}ACTIVER (to activate/de-activate
>> an interrupt) is going to be very difficult (not to say impossible) to do
>> without an IPI.
>>
>> If you are worried about a vCPU being interrupted in a critical section, then
>> I think you should tailor your guest to avoid using those registers.
> 
> Let's call it option 8 "tell the user that she needs to modify her
> kernel."
> 
> 
>> An alternative would be to use a spinlock/mutex within the guest code to
>> prevent a vCPU from accessing the vGIC registers while another vCPU
>> doesn't want to be interrupted.
>>
>> Regarding 4a, the only way I could think of would be to trap the instructions
>> that mask/unmask interrupts. If I read the Armv8 spec correctly, it is not
>> possible to do that.
>>
>> 7. is basically 4a: the goal would be to avoid interrupting the vCPU as much
>> as possible, but you may be wrong sometimes. Obviously we want to be correct
>> most of the time.
>>
>> I understand this may not be the ideal solution, but this is probably the
>> best we could come up with, and it does not involve a lot of work because we
>> already have all the information in place (we know when an interrupt was
>> injected to a vCPU).
>>
>> The next best solution is to prevent your guest from modifying some registers
>> of the vGIC at the same time as another vCPU is in a critical section.
> 
> Let's call this option 9.
> 
> 
> I am just thinking out loud here :)

Thank you for thinking out loud. Sadly, as I pointed out before, there 
are other parts of the vGIC facing the same problems (e.g. I{S,C}ENABLER, 
sending SGIs, sending event channels).

So can you enlighten me why I{S,C}ACTIVER is that much of a concern over 
the others?

> 
> 
> - 2 "Make the I?ACTIVER registers RZWI"
> 
>    As far as I can tell it should prevent the livelock because it would
>    never return an ACTIVE state. It could be improved by returning the
>    latest ACTIVE information for local interrupts and returning zero for
>    interrupts routed to other pcpus. Not perfect but an option.

How would a partial implementation help? Wouldn't this make things more 
difficult for a developer?

Bear in mind that on GICv3 you can read the information of all the 
re-distributors (not only yours).

> 
> 
> - 5 "maintenance interrupt"
> 
>    This is good for jitter-sensitive guests but not the best for the
>    others. We could enable it conditionally: enable maintenance
>    interrupts only for certain vcpus/pcpus, but it is not great to have to
>    make this kind of difference in the implementation. However, it is
>    possible. Let's see if we can come up with something better.
> 
> 
> - 7 "optimized IPI"
>   
>    A tiny chance of causing issues. Let's see if we can come up with
>    anything better.
> 
> 
> - 8 "tell the user to fix modify the kernel"
> 
>    We could do it in addition to 7. The issue is really how we do it.
>    A warning message if DEBUG && sched==null? That doesn't sound
>    right. We could introduce a new nojitter=true command line option for
>    Xen? It wouldn't really change the behavior of Xen, but it would
>    enable this warning. Or, it could enable the warning and also switch
>    the implementation of I?ACTIVER to option 2.

This is not a sched=null specific problem. The problem would be exactly 
the same when you are dedicating a pCPU to a vCPU on credit or credit2.

> 
> 
> - 9 "prevent I?ACTIVER during critical sections"
> 
>    This could be good but I have a difficult time thinking of how we
>    could implement it. How do we tell that the other vcpu is in or out of
>    the critical section?

I believe you misread what I wrote. I didn't suggest Xen would do it, 
but that the guest would, as the vCPUs belong to the same guest.

Cheers,
Julien Grall April 9, 2020, 1 p.m. UTC | #36
On 09/04/2020 13:56, Julien Grall wrote:
> 
> 
> On 09/04/2020 02:26, Stefano Stabellini wrote:
>> On Tue, 7 Apr 2020, Julien Grall wrote:
>>>>>> I don’t know what maintenance IRQs are, but if they only happen
>>>>>> intermittently, it’s possible that you’d never get more than a single
>>>>>> one in a latency-critical IRQ routine; and as such, the variability in
>>>>>> execution time (jitter) wouldn’t be an issue in practice.  But every
>>>>>> time you add a new unblockable IPI, you increase this jitter;
>>>>>> particularly if this unblockable IPI might be repeated an arbitrary
>>>>>> number of times.
>>>>>> (Stefano, let me know if I’ve misunderstood something.)
>>>>>> So stepping back a moment, here are all the possible ideas that I think
>>>>>> have been discussed (or are implicitly there) so far.
>>>>>> 1. [Default] Do nothing; guests using this register continue crashing
>>>>>> 2. Make the I?ACTIVER registers RZWI.
>>>>>> 3. Make I?ACTIVER return the most recent known value; i.e. KVM’s 
>>>>>> current
>>>>>> behavior (as far as we understand it)
>>>>>> 4. Use a simple IPI with do_noop to update I?ACTIVER
>>>>>> 4a.  Use an IPI, but come up with clever tricks to avoid interrupting
>>>>>> guests handling IRQs.
>>>>>> 5. Trap to Xen on guest EOI, so that we know when the interrupt is deactivated
>>>>>> 6. Some clever paravirtualized option
>>>>>
>>>>> 7. Use an IPI if we are confident the interrupts may be active.
>>>>
>>>> I don’t understand this one.  How is it different from 4 or 4a?  And in
>>>> particular, how does it evaluate on the “how much additional design
>>>> work would it take” criterion?
>>>
>>> Let me start with this: if we want to have a vGIC to rule them all, then I am
>>> afraid you are going to have to compromise. We can discuss tailoring the
>>> vGIC, but I think before that we need a vGIC that is faithful to the spec
>>> (e.g. differentiating level vs edge interrupts, handling the active
>>> state...). Note that Arm spent some effort to get a new vGIC merged, but
>>> this was never made a first-class citizen.
>>>
>>> However, even if you tailor the vGIC, you are not going to be able to avoid
>>> IPIs on some occasions. This would happen when using event channels,
>>> in-guest IPIs... Another example is when enabling an interrupt, although I
>>> realize that our vGIC does not do it today, meaning that an interrupt that
>>> became pending while disabled will not be forwarded until the vCPU exits.
>>>
>>> Furthermore, implementing a write to I{C,S}ACTIVER (to activate/de-activate
>>> an interrupt) is going to be very difficult (not to say impossible) to do
>>> without an IPI.
>>>
>>> If you are worried about a vCPU being interrupted in a critical section,
>>> then I think you should tailor your guest to avoid using those registers.
>>
>> Let's call it option 8 "tell the user that she needs to modify her
>> kernel."
>>
>>
>>> An alternative would be to use a spinlock/mutex within the guest code to
>>> prevent a vCPU from accessing the vGIC registers while another vCPU
>>> doesn't want to be interrupted.
>>>
>>> Regarding 4a, the only way I could think of would be to trap the
>>> instructions that mask/unmask interrupts. If I read the Armv8 spec
>>> correctly, it is not possible to do that.
>>>
>>> 7. is basically 4a: the goal would be to avoid interrupting the vCPU as
>>> much as possible, but you may be wrong sometimes. Obviously we want to be
>>> correct most of the time.
>>>
>>> I understand this may not be the ideal solution, but this is probably the
>>> best we could come up with, and it does not involve a lot of work because
>>> we already have all the information in place (we know when an interrupt was
>>> injected to a vCPU).
>>>
>>> The next best solution is to prevent your guest from modifying some
>>> registers of the vGIC at the same time as another vCPU is in a critical
>>> section.
>>
>> Let's call this option 9.
>>
>>
>> I am just thinking out loud here :)
> 
> Thank you for thinking out loud. Sadly, as I pointed out before, there 
> are other parts of the vGIC facing the same problems (e.g. I{S,C}ENABLER, 
> sending SGIs, sending event channels).
> 
> So can you enlighten me why I{S,C}ACTIVER is that much of a concern over 
> the others?

To be clear, I am not saying I{S,C}ACTIVER should not be a concern. But 
I would prefer that we focus on a generic solution if the problem is wider.

Cheers,
Bertrand Marquis April 17, 2020, 3:16 p.m. UTC | #37
On 9 Apr 2020, at 14:00, Julien Grall <julien@xen.org> wrote:



On 09/04/2020 13:56, Julien Grall wrote:
On 09/04/2020 02:26, Stefano Stabellini wrote:
On Tue, 7 Apr 2020, Julien Grall wrote:
I don’t know what maintenance IRQs are, but if they only happen
intermittently, it’s possible that you’d never get more than a single
one in a latency-critical IRQ routine; and as such, the variability in
execution time (jitter) wouldn’t be an issue in practice.  But every
time you add a new unblockable IPI, you increase this jitter;
particularly if this unblockable IPI might be repeated an arbitrary
number of times.
(Stefano, let me know if I’ve misunderstood something.)
So stepping back a moment, here are all the possible ideas that I think
have been discussed (or are implicitly there) so far.
1. [Default] Do nothing; guests using this register continue crashing
2. Make the I?ACTIVER registers RZWI.
3. Make I?ACTIVER return the most recent known value; i.e. KVM’s current
behavior (as far as we understand it)
4. Use a simple IPI with do_noop to update I?ACTIVER
4a.  Use an IPI, but come up with clever tricks to avoid interrupting
guests handling IRQs.
5. Trap to Xen on guest EOI, so that we know when the interrupt is deactivated
6. Some clever paravirtualized option

7. Use an IPI if we are confident the interrupts may be active.

I don’t understand this one.  How is it different from 4 or 4a?  And in
particular, how does it evaluate on the “how much additional design work
would it take” criterion?

Let me start with this: if we want to have a vGIC to rule them all, then I am
afraid you are going to have to compromise. We can discuss tailoring the
vGIC, but I think before that we need a vGIC that is faithful to the spec
(e.g. differentiating level vs edge interrupts, handling the active
state...). Note that Arm spent some effort to get a new vGIC merged, but
this was never made a first-class citizen.

However, even if you tailor the vGIC, you are not going to be able to avoid
IPIs on some occasions. This would happen when using event channels, in-guest
IPIs... Another example is when enabling an interrupt, although I realize that
our vGIC does not do it today, meaning that an interrupt that became pending
while disabled will not be forwarded until the vCPU exits.

Furthermore, implementing a write to I{C,S}ACTIVER (to activate/de-activate
an interrupt) is going to be very difficult (not to say impossible) to do
without an IPI.

If you are worried about a vCPU being interrupted in a critical section, then
I think you should tailor your guest to avoid using those registers.

Let's call it option 8 "tell the user that she needs to modify her
kernel."


An alternative would be to use a spinlock/mutex within the guest code to
prevent a vCPU from accessing the vGIC registers while another vCPU
doesn't want to be interrupted.

Regarding 4a, the only way I could think of would be to trap the instructions
that mask/unmask interrupts. If I read the Armv8 spec correctly, it is not
possible to do that.

7. is basically 4a: the goal would be to avoid interrupting the vCPU as much
as possible, but you may be wrong sometimes. Obviously we want to be correct
most of the time.

I understand this may not be the ideal solution, but this is probably the
best we could come up with, and it does not involve a lot of work because we
already have all the information in place (we know when an interrupt was
injected to a vCPU).

The next best solution is to prevent your guest from modifying some registers
of the vGIC at the same time as another vCPU is in a critical section.

Let's call this option 9.


I am just thinking out loud here :)
Thank you for thinking out loud. Sadly, as I pointed out before, there are other parts of the vGIC facing the same problems (e.g. I{S,C}ENABLER, sending SGIs, sending event channels).
So can you enlighten me why I{S,C}ACTIVER is that much of a concern over the others?

To be clear, I am not saying I{S,C}ACTIVER should not be a concern. But I would prefer that we focus on a generic solution if the problem is wider.

It seems that the problem is a bit bigger than expected and will need more discussion and thinking.
I did some research on my side and there are several designs possible depending on whether the goal should be performance or real-time behaviour (going from just returning the status we have, to firing a service interrupt whenever an interrupt is acked by a guest so as to refresh our internal status).

In the short term, could we not agree to fix the typo by always returning 0 and start a discussion on the vGIC implementation?

Cheers

Bertrand
Julien Grall April 17, 2020, 4:07 p.m. UTC | #38
Hi Bertrand,

On 17/04/2020 16:16, Bertrand Marquis wrote:
> It seems that the problem is a bit bigger than expected and will need 
> more discussion and thinking.
> I did some research on my side and there are several designs possible 
> depending on whether the goal should be performance or real-time 
> behaviour (going from just returning the status we have, to firing a 
> service interrupt whenever an interrupt is acked by a guest so as to 
> refresh our internal status).
> 
> In the short term, could we not agree to fix the typo by always 
> returning 0 and start a discussion on the vGIC implementation?

I have already pointed out multiple times now ([1], [2]) that I would 
not oppose the temporary solution. I think it is just a matter of 
someone (Stefano?) submitting it :).

Cheers,

[1] 
https://lists.xenproject.org/archives/html/xen-devel/2019-11/msg01642.html
[2] 
https://lists.xenproject.org/archives/html/xen-devel/2020-04/msg00459.html
Stefano Stabellini April 17, 2020, 4:31 p.m. UTC | #39
On Fri, 17 Apr 2020, Julien Grall wrote:
> Hi Bertrand,
> 
> On 17/04/2020 16:16, Bertrand Marquis wrote:
> > It seems that the problem is a bit bigger than expected and will need more
> > discussion and thinking.
> > I did some research on my side and there are several designs possible
> > depending on whether the goal should be performance or real-time behaviour
> > (going from just returning the status we have, to firing a service interrupt
> > whenever an interrupt is acked by a guest so as to refresh our internal
> > status).
> > 
> > In the short term, could we not agree to fix the typo by always returning 0
> > and start a discussion on the vGIC implementation?
> 
> I have already pointed out multiple times now ([1], [2]) that I would not
> oppose the temporary solution. I think it is just a matter of someone
> (Stefano?) submitting it :).
> 
> Cheers,
> 
> [1] https://lists.xenproject.org/archives/html/xen-devel/2019-11/msg01642.html
> [2] https://lists.xenproject.org/archives/html/xen-devel/2020-04/msg00459.html

I can submit it. Julien, would you prefer the plain always-return-0
patch, or would you prefer that I try to get the latest ISACTIVER
information (like this patch does) but only for the local processor?
Julien Grall April 17, 2020, 4:47 p.m. UTC | #40
On 17/04/2020 17:31, Stefano Stabellini wrote:
> On Fri, 17 Apr 2020, Julien Grall wrote:
>> Hi Bertrand,
>>
>> On 17/04/2020 16:16, Bertrand Marquis wrote:
>>> It seems that the problem is a bit bigger than expected and will need more
>>> discussion and thinking.
>>> I did some research on my side and there are several designs possible
>>> depending on whether the goal should be performance or real-time behaviour
>>> (going from just returning the status we have, to firing a service interrupt
>>> whenever an interrupt is acked by a guest so as to refresh our internal
>>> status).
>>>
>>> In the short term, could we not agree to fix the typo by always returning 0
>>> and start a discussion on the vGIC implementation?
>>
>> I have already pointed out multiple times now ([1], [2]) that I would not
>> oppose the temporary solution. I think it is just a matter of someone
>> (Stefano?) submitting it :).
>>
>> Cheers,
>>
>> [1] https://lists.xenproject.org/archives/html/xen-devel/2019-11/msg01642.html
>> [2] https://lists.xenproject.org/archives/html/xen-devel/2020-04/msg00459.html
> 
> I can submit it. Julien, would you prefer the plain always-return-0
> patch, or would you prefer that I try to get the latest ISACTIVER
> information (like this patch does) but only for the local processor?

Always returning 0.

Cheers,
diff mbox series

Patch

diff --git a/xen/arch/arm/vgic-v2.c b/xen/arch/arm/vgic-v2.c
index 64b141fea5..454408d11d 100644
--- a/xen/arch/arm/vgic-v2.c
+++ b/xen/arch/arm/vgic-v2.c
@@ -245,10 +245,19 @@  static int vgic_v2_distr_mmio_read(struct vcpu *v, mmio_info_t *info,
     case VRANGE32(GICD_ICPENDR, GICD_ICPENDRN):
         goto read_as_zero;
 
-    /* Read the active status of an IRQ via GICD is not supported */
     case VRANGE32(GICD_ISACTIVER, GICD_ISACTIVERN):
+        if ( dabt.size != DABT_WORD ) goto bad_width;
+        rank = vgic_rank_offset(v, 1, gicd_reg - GICD_ISACTIVER, DABT_WORD);
+        if ( rank == NULL ) goto read_as_zero;
+        *r = vgic_isactiver(v, 32 * rank->index);
+        return 1;
+
     case VRANGE32(GICD_ICACTIVER, GICD_ICACTIVERN):
-        goto read_as_zero;
+        if ( dabt.size != DABT_WORD ) goto bad_width;
+        rank = vgic_rank_offset(v, 1, gicd_reg - GICD_ICACTIVER, DABT_WORD);
+        if ( rank == NULL ) goto read_as_zero;
+        *r = vgic_isactiver(v, 32 * rank->index);
+        return 1;
 
     case VRANGE32(GICD_IPRIORITYR, GICD_IPRIORITYRN):
     {
diff --git a/xen/arch/arm/vgic-v3.c b/xen/arch/arm/vgic-v3.c
index 4e60ba15cc..a2cd39c45d 100644
--- a/xen/arch/arm/vgic-v3.c
+++ b/xen/arch/arm/vgic-v3.c
@@ -712,10 +712,19 @@  static int __vgic_v3_distr_common_mmio_read(const char *name, struct vcpu *v,
     case VRANGE32(GICD_ICPENDR, GICD_ICPENDR):
         goto read_as_zero;
 
-    /* Read the active status of an IRQ via GICD/GICR is not supported */
-    case VRANGE32(GICD_ISACTIVER, GICD_ISACTIVER):
+    case VRANGE32(GICD_ISACTIVER, GICD_ISACTIVERN):
+        if ( dabt.size != DABT_WORD ) goto bad_width;
+        rank = vgic_rank_offset(v, 1, reg - GICD_ISACTIVER, DABT_WORD);
+        if ( rank == NULL ) goto read_as_zero;
+        *r = vgic_isactiver(v, 32 * rank->index);
+        return 1;
+
     case VRANGE32(GICD_ICACTIVER, GICD_ICACTIVERN):
-        goto read_as_zero;
+        if ( dabt.size != DABT_WORD ) goto bad_width;
+        rank = vgic_rank_offset(v, 1, reg - GICD_ICACTIVER, DABT_WORD);
+        if ( rank == NULL ) goto read_as_zero;
+        *r = vgic_isactiver(v, 32 * rank->index);
+        return 1;
 
     case VRANGE32(GICD_IPRIORITYR, GICD_IPRIORITYRN):
     {
diff --git a/xen/arch/arm/vgic.c b/xen/arch/arm/vgic.c
index 82f524a35c..d491fa38a5 100644
--- a/xen/arch/arm/vgic.c
+++ b/xen/arch/arm/vgic.c
@@ -423,6 +423,26 @@  void vgic_enable_irqs(struct vcpu *v, uint32_t r, int n)
     }
 }
 
+uint32_t vgic_isactiver(struct vcpu *v, unsigned int start_irq)
+{
+        struct pending_irq *p;
+        unsigned int irq;
+        uint32_t r = 0;
+
+        /*
+         * The following won't reflect the latest status of interrupts on
+         * other vcpus.
+         */
+        for ( irq = start_irq; irq < start_irq + 32; irq++ )
+        {
+            p = irq_to_pending(v, irq);
+            if ( p != NULL && test_bit(GIC_IRQ_GUEST_ACTIVE, &p->status) )
+                r |= 1U << (irq - start_irq);
+        }
+
+        return r;
+}
+
 bool vgic_to_sgi(struct vcpu *v, register_t sgir, enum gic_sgi_mode irqmode,
                  int virq, const struct sgi_target *target)
 {
diff --git a/xen/include/asm-arm/vgic.h b/xen/include/asm-arm/vgic.h
index ce1e3c4bbd..a9e3f2fa60 100644
--- a/xen/include/asm-arm/vgic.h
+++ b/xen/include/asm-arm/vgic.h
@@ -288,6 +288,7 @@  extern struct vgic_irq_rank *vgic_rank_offset(struct vcpu *v, int b, int n, int
 extern struct vgic_irq_rank *vgic_rank_irq(struct vcpu *v, unsigned int irq);
 extern void vgic_disable_irqs(struct vcpu *v, uint32_t r, int n);
 extern void vgic_enable_irqs(struct vcpu *v, uint32_t r, int n);
+extern uint32_t vgic_isactiver(struct vcpu *v, unsigned int start_irq);
 extern void register_vgic_ops(struct domain *d, const struct vgic_ops *ops);
 int vgic_v2_init(struct domain *d, int *mmio_count);
 int vgic_v3_init(struct domain *d, int *mmio_count);