Message ID | 20200327023451.20271-1-sstabellini@kernel.org (mailing list archive)
---|---
State | New, archived
Series | [v2] xen/arm: implement GICD_I[S/C]ACTIVER reads
Hi Stefano,

On 27/03/2020 02:34, Stefano Stabellini wrote:
> This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER
> reads. It doesn't take into account the latest state of interrupts on
> other vCPUs. Only the current vCPU is up-to-date. A full solution is
> not possible because it would require synchronization among all vCPUs,
> which would be very expensive in terms or latency.

Your sentence suggests you have numbers showing that correctly emulating the registers would be too slow. Mind sharing them?

Cheers,
On Sat, 28 Mar 2020, Julien Grall wrote:
> Hi Stefano,
>
> On 27/03/2020 02:34, Stefano Stabellini wrote:
> > This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER
> > reads. It doesn't take into account the latest state of interrupts on
> > other vCPUs. Only the current vCPU is up-to-date. A full solution is
> > not possible because it would require synchronization among all vCPUs,
> > which would be very expensive in terms or latency.
>
> Your sentence suggests you have numbers showing that correctly emulating the
> registers would be too slow. Mind sharing them?

No, I don't have any numbers. Would you prefer a different wording or a better explanation? I also realized there is a typo in there (or/of).
Hi Stefano,

On 30/03/2020 17:35, Stefano Stabellini wrote:
> On Sat, 28 Mar 2020, Julien Grall wrote:
>> Hi Stefano,
>>
>> On 27/03/2020 02:34, Stefano Stabellini wrote:
>>> This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER
>>> reads. It doesn't take into account the latest state of interrupts on
>>> other vCPUs. Only the current vCPU is up-to-date. A full solution is
>>> not possible because it would require synchronization among all vCPUs,
>>> which would be very expensive in terms or latency.
>>
>> Your sentence suggests you have numbers showing that correctly emulating
>> the registers would be too slow. Mind sharing them?
>
> No, I don't have any numbers. Would you prefer a different wording or a
> better explanation? I also realized there is a typo in there (or/of).

Let me start by saying that I think correctness is more important than speed. So I would have expected your commit message to contain some facts about why synchronization is going to be slow and why this is a problem.

To give you a concrete example, the implementation of the set/way instructions is really slow (it could take a few seconds depending on the setup). However, this was fine because not implementing them correctly would have a greater impact on the guest (corruption) and they are not used often.

I don't think the performance in our case will be in the same order of magnitude. It is most likely to be in the range of milliseconds (if not less), which I think is acceptable for emulation (particularly for the vGIC) and the current uses.

So let's take a step back and look at how we could implement ISACTIVER/ICACTIVER correctly.

The only thing we need is a snapshot of the interrupt state at a given point. I originally thought it would be necessary to use domain_pause(), which is quite heavy, but I think we only need the vCPU to enter Xen and sync the state of the LRs.
Roughly the code would look like (this is not optimized):

    for_each_vcpu(d, v)
    {
        if ( v->is_running )
            smp_call_function(do_nothing(), v->cpu);
    }

    /* Read the state */

A few remarks:
 * The function do_nothing() is basically a NOP.
 * I am suggesting smp_call_function() rather than smp_send_event_check_cpu() because we need to know when the vCPU has synchronized the LRs. As the function do_nothing() will be called afterwards, we know that the snapshot of the LRs has been taken.
 * It would be possible to do everything in one vCPU.
 * We can possibly optimize it for the SGIs/PPIs case.

This is still not perfect, but I don't think the performance would be that bad.

Cheers,
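[Editor's note: the kick-and-snapshot idea above can be sketched as a minimal, single-threaded model. All names here (struct vcpu, sync_lrs(), read_isactiver()) are illustrative stand-ins, not Xen's actual API, and the smp_call_function(do_nothing) IPI is modelled by a direct call — the point of the real IPI being only that entering the hypervisor synchronizes the LRs.]

```c
#include <stdint.h>
#include <stdbool.h>

struct vcpu {
    bool is_running;
    uint32_t lr_active;      /* active state still held in the LRs   */
    uint32_t soft_active;    /* state already visible to the vGIC    */
};

/* Stand-in for the effect of the do_nothing() IPI: the trap into the
 * hypervisor is what folds the LR state back into the software view. */
static void sync_lrs(struct vcpu *v)
{
    v->soft_active |= v->lr_active;
    v->lr_active = 0;
}

/* Emulated ISACTIVER read: kick every running vCPU so its state is up
 * to date, then take the snapshot. */
uint32_t read_isactiver(struct vcpu *d, unsigned int nr_vcpus)
{
    uint32_t val = 0;
    unsigned int i;

    for ( i = 0; i < nr_vcpus; i++ )
        if ( d[i].is_running )
            sync_lrs(&d[i]);   /* models smp_call_function(do_nothing) */

    for ( i = 0; i < nr_vcpus; i++ )
        val |= d[i].soft_active;

    return val;
}
```

As in the real proposal, the snapshot may already be outdated by the time the guest consumes it; the property it buys is that no active bit stays stale indefinitely.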
Hi,

On 27/03/2020 02:34, Stefano Stabellini wrote:
> It doesn't take into account the latest state of interrupts on
> other vCPUs.

So I think your implementation is going to introduce a deadlock in the guest.

Let's imagine a guest with 2 vCPUs (A and B) with the following setup:
 * The HW SPI 32 is routed to and serviced by vCPU B.
 * vCPU A will routinely wait for any pending SPI 32 to be finished before performing a specific task.

In the current form of the vGIC, vCPU B will not exit to Xen when SPI 32 has been deactivated. Instead, the vCPU will continue to run until an unrelated trap happens (I/O emulation, IRQs...). Depending on your setup (think NULL scheduler) this may not happen for a really long time (or ever).

Until vCPU B exits to Xen, SPI 32 may be considered as active. Therefore vCPU A will keep waiting and be blocked until vCPU B finally traps into Xen.

My example above is basically a cut-down version of __synchronize_hardirq() in Linux.

In practice, you may be lucky most of the time because there will be traps happening from time to time. However, it also means the task you need to perform on vCPU A will be delayed. So I would not really bet on the traps here.

You have two options:
 1) Force the vCPU to trap when deactivating an interrupt
 2) Force the vCPUs to exit when reading I{S,C}ACTIVER

1) will incur a cost on every interrupt, which is not great. So I think your best option is 2) here.

Cheers,
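[Editor's note: the guest-side pattern referenced above — a cut-down __synchronize_hardirq() — is essentially a poll on the active bit. A hedged sketch, with hypothetical names, shows why a stale active bit blocks the caller indefinitely:]

```c
#include <stdbool.h>
#include <stdint.h>

/* Whether the given IRQ is reported active in the 32-bit
 * I{S,C}ACTIVER register covering it. */
static bool irq_is_active(const volatile uint32_t *isactiver,
                          unsigned int irq)
{
    return *isactiver & (1u << (irq % 32));
}

/* Cut-down model of Linux's __synchronize_hardirq(): spin until the
 * interrupt is no longer reported active.  If the emulated register
 * only refreshes another vCPU's state when that vCPU traps, this loop
 * can spin until the other vCPU happens to exit -- possibly never
 * with the NULL scheduler. */
void synchronize_irq(const volatile uint32_t *isactiver, unsigned int irq)
{
    while ( irq_is_active(isactiver, irq) )
        ;   /* cpu_relax() on real hardware */
}
```

On real hardware the loop terminates because the distributor clears the active bit the moment the servicing CPU deactivates the interrupt; the deadlock risk is specific to an emulation whose view goes stale.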
On Mon, 30 Mar 2020, Julien Grall wrote: > Hi Stefano, > > On 30/03/2020 17:35, Stefano Stabellini wrote: > > On Sat, 28 Mar 2020, Julien Grall wrote: > > > qHi Stefano, > > > > > > On 27/03/2020 02:34, Stefano Stabellini wrote: > > > > This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER > > > > reads. It doesn't take into account the latest state of interrupts on > > > > other vCPUs. Only the current vCPU is up-to-date. A full solution is > > > > not possible because it would require synchronization among all vCPUs, > > > > which would be very expensive in terms or latency. > > > > > > Your sentence suggests you have number showing that correctly emulating > > > the > > > registers would be too slow. Mind sharing them? > > > > No, I don't have any numbers. Would you prefer a different wording or a > > better explanation? I also realized there is a typo in there (or/of). > Let me start with I think correctness is more important than speed. > So I would have expected your commit message to contain some fact why > synchronization is going to be slow and why this is a problem. > > To give you a concrete example, the implementation of set/way instructions are > really slow (it could take a few seconds depending on the setup). However, > this was fine because not implementing them correctly would have a greater > impact on the guest (corruption) and they are not used often. > > I don't think the performance in our case will be in same order magnitude. It > is most likely to be in the range of milliseconds (if not less) which I think > is acceptable for emulation (particularly for the vGIC) and the current uses. Writing on the mailing list some of our discussions today. Correctness is not just in terms of compliance to a specification but it is also about not breaking guests. Introducing latency in the range of milliseconds, or hundreds of microseconds, would break any latency sensitive workloads. 
We don't have numbers so we don't know for certain the effect that your suggestion would have. It would be interesting to have those numbers, and I'll add to my TODO list to run the experiments you suggested, but I'll put it on the back-burner (from a Xilinx perspective it is low priority as no customers are affected.) > So lets take a step back and look how we could implement ISACTIVER/ICACTIVER > correctly. > > The only thing we need is a snapshot of the interrupts state a given point. I > originally thought it would be necessary to use domain_pause() which is quite > heavy, but I think we only need the vCPU to enter in Xen and sync the states > of the LRs. > > Roughly the code would look like (This is not optimized): > > for_each_vcpu(d, v) > { > if ( v->is_running ) > smp_call_function(do_nothing(), v->cpu); > } > > /* Read the state */ > > A few remarks: > * The function do_nothing() is basically a NOP. > * I am suggesting to use smp_call_function() rather > smp_send_event_check_cpu() is because we need to know when the vCPU has > synchronized the LRs. As the function do_nothing() will be call afterwards, > then we know the the snapshot of the LRs has been done > * It would be possible to everything in one vCPU. > * We can possibly optimize it for the SGIs/PPIs case > > This is still not perfect, but I don't think the performance would be that > bad.
On 01/04/2020 01:57, Stefano Stabellini wrote: > On Mon, 30 Mar 2020, Julien Grall wrote: >> Hi Stefano, >> >> On 30/03/2020 17:35, Stefano Stabellini wrote: >>> On Sat, 28 Mar 2020, Julien Grall wrote: >>>> qHi Stefano, >>>> >>>> On 27/03/2020 02:34, Stefano Stabellini wrote: >>>>> This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER >>>>> reads. It doesn't take into account the latest state of interrupts on >>>>> other vCPUs. Only the current vCPU is up-to-date. A full solution is >>>>> not possible because it would require synchronization among all vCPUs, >>>>> which would be very expensive in terms or latency. >>>> >>>> Your sentence suggests you have number showing that correctly emulating >>>> the >>>> registers would be too slow. Mind sharing them? >>> >>> No, I don't have any numbers. Would you prefer a different wording or a >>> better explanation? I also realized there is a typo in there (or/of). >> Let me start with I think correctness is more important than speed. >> So I would have expected your commit message to contain some fact why >> synchronization is going to be slow and why this is a problem. >> >> To give you a concrete example, the implementation of set/way instructions are >> really slow (it could take a few seconds depending on the setup). However, >> this was fine because not implementing them correctly would have a greater >> impact on the guest (corruption) and they are not used often. >> >> I don't think the performance in our case will be in same order magnitude. It >> is most likely to be in the range of milliseconds (if not less) which I think >> is acceptable for emulation (particularly for the vGIC) and the current uses. > > Writing on the mailing list some of our discussions today. > > Correctness is not just in terms of compliance to a specification but it > is also about not breaking guests. 
> Introducing latency in the range of
> milliseconds, or hundreds of microseconds, would break any latency
> sensitive workloads. We don't have numbers so we don't know for certain
> the effect that your suggestion would have.

You missed part of the discussion. I don't disagree that latency is important. However, if an implementation is only 95% reliable, then it means 5% of the time your guest may break (corruption, crash, deadlock...). At which point the latency is the last of your concern.

> It would be interesting to have those numbers, and I'll add to my TODO
> list to run the experiments you suggested, but I'll put it on the
> back-burner (from a Xilinx perspective it is low priority as no
> customers are affected.)

How about we get a correct implementation merged first and then discuss optimization? This would allow the community to check whether there is actually noticeable latency in their workloads.

Cheers,
On 1 Apr 2020, at 09:30, Julien Grall <julien@xen.org<mailto:julien@xen.org>> wrote: On 01/04/2020 01:57, Stefano Stabellini wrote: On Mon, 30 Mar 2020, Julien Grall wrote: Hi Stefano, On 30/03/2020 17:35, Stefano Stabellini wrote: On Sat, 28 Mar 2020, Julien Grall wrote: qHi Stefano, On 27/03/2020 02:34, Stefano Stabellini wrote: This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER reads. It doesn't take into account the latest state of interrupts on other vCPUs. Only the current vCPU is up-to-date. A full solution is not possible because it would require synchronization among all vCPUs, which would be very expensive in terms or latency. Your sentence suggests you have number showing that correctly emulating the registers would be too slow. Mind sharing them? No, I don't have any numbers. Would you prefer a different wording or a better explanation? I also realized there is a typo in there (or/of). Let me start with I think correctness is more important than speed. So I would have expected your commit message to contain some fact why synchronization is going to be slow and why this is a problem. To give you a concrete example, the implementation of set/way instructions are really slow (it could take a few seconds depending on the setup). However, this was fine because not implementing them correctly would have a greater impact on the guest (corruption) and they are not used often. I don't think the performance in our case will be in same order magnitude. It is most likely to be in the range of milliseconds (if not less) which I think is acceptable for emulation (particularly for the vGIC) and the current uses. Writing on the mailing list some of our discussions today. Correctness is not just in terms of compliance to a specification but it is also about not breaking guests. Introducing latency in the range of milliseconds, or hundreds of microseconds, would break any latency sensitive workloads. 
>> We don't have numbers so we don't know for certain the effect that
>> your suggestion would have.
>
> You missed part of the discussion. I don't disagree that latency is
> important. However, if an implementation is only 95% reliable, then it
> means 5% of the time your guest may break (corruption, crash,
> deadlock...). At which point the latency is the last of your concern.
>
>> It would be interesting to have those numbers, and I'll add to my TODO
>> list to run the experiments you suggested, but I'll put it on the
>> back-burner (from a Xilinx perspective it is low priority as no
>> customers are affected.)
>
> How about we get a correct implementation merged first and then discuss
> optimization? This would allow the community to check whether there are
> actually noticeable latencies in their workload.

Hi,

I am not sure that pushing something with a performance impact, to be fixed later, is the right approach here.

The patch is an improvement compared to the current code and it can be further improved later to handle more cases (other cores).

If we really have to sync all vCPUs here, this will cost a lot, and the result will in fact still be a status from the past, because nothing will make sure that the value is still valid at the point the guest gets it back.

Cheers
--
Bertrand
On Wed, Apr 1, 2020 at 10:54 AM Bertrand Marquis <Bertrand.Marquis@arm.com> wrote: > > > On 1 Apr 2020, at 09:30, Julien Grall <julien@xen.org> wrote: > > > > On 01/04/2020 01:57, Stefano Stabellini wrote: > > On Mon, 30 Mar 2020, Julien Grall wrote: > > Hi Stefano, > > On 30/03/2020 17:35, Stefano Stabellini wrote: > > On Sat, 28 Mar 2020, Julien Grall wrote: > > qHi Stefano, > > On 27/03/2020 02:34, Stefano Stabellini wrote: > > This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER > reads. It doesn't take into account the latest state of interrupts on > other vCPUs. Only the current vCPU is up-to-date. A full solution is > not possible because it would require synchronization among all vCPUs, > which would be very expensive in terms or latency. > > > Your sentence suggests you have number showing that correctly emulating > the > registers would be too slow. Mind sharing them? > > > No, I don't have any numbers. Would you prefer a different wording or a > better explanation? I also realized there is a typo in there (or/of). > > Let me start with I think correctness is more important than speed. > So I would have expected your commit message to contain some fact why > synchronization is going to be slow and why this is a problem. > > To give you a concrete example, the implementation of set/way instructions > are > really slow (it could take a few seconds depending on the setup). However, > this was fine because not implementing them correctly would have a greater > impact on the guest (corruption) and they are not used often. > > I don't think the performance in our case will be in same order magnitude. > It > is most likely to be in the range of milliseconds (if not less) which I > think > is acceptable for emulation (particularly for the vGIC) and the current > uses. > > Writing on the mailing list some of our discussions today. > Correctness is not just in terms of compliance to a specification but it > is also about not breaking guests. 
Introducing latency in the range of > milliseconds, or hundreds of microseconds, would break any latency > sensitive workloads. We don't have numbers so we don't know for certain > the effect that your suggestion would have. > > > You missed part of the discussion. I don't disagree that latency is > important. However, if an implementation is only 95% reliable, then it > means 5% of the time your guest may break (corruption, crash, deadlock...). > At which point the latency is the last of your concern. > > It would be interesting to have those numbers, and I'll add to my TODO > list to run the experiments you suggested, but I'll put it on the > back-burner (from a Xilinx perspective it is low priority as no > customers are affected.) > > > How about we get a correct implementation merge first and then discuss > about optimization? This would allow the community to check whether there > are actually noticeable latency in their workload. > > > Hi, > > I am not sure that pushing something with a performance impact to later > fix it is the right approach here. > > The patch is an improvement compared to the current code and it can be > further improved later to handle more cases (other cores). > > If we really have to sync all vCPUs here, this will cost a lot and the > result will still be the status in the past in fact because nothing will > make sure that at the point the guest gets back the value it is still valid. > The same is true on real hardware, right? Looking at this discussion as a non-ARM person, it sounds a bit like ARM specified this in a way that was useless-but-easy-to-implement-so-why-not; but it turns out to be useless-but-hard-to-implement virtualized. On the one hand, I'm sympathetic to Julien's point of view; I very much don't like the idea of silently changing behavior just because the specified behavior is inconvenient for us. 
On the other hand, I'm sympathetic to Stefano's point of view, that it's pointless to introduce a load of overhead and jitter to implement behavior which nobody in their right mind would even want to use (at least virtualized).

What I think would be *ideal* would be for ARM to change the specification to make it easier to virtualize. For instance, by specifying that the register *may* contain information about other cores, but may not; or, that the register will contain information on other cores on real hardware, but not virtualized.

Barring that, I think we should have a centralized place to document deviations from the spec; and I think changes to this spec should be coordinated with KVM (and maybe ACRN?), to minimize hypervisor-specific deviations.

Thoughts?

-George
On 01/04/2020 10:54, Bertrand Marquis wrote: > > >> On 1 Apr 2020, at 09:30, Julien Grall <julien@xen.org >> <mailto:julien@xen.org>> wrote: >> >> >> >> On 01/04/2020 01:57, Stefano Stabellini wrote: >>> On Mon, 30 Mar 2020, Julien Grall wrote: >>>> Hi Stefano, >>>> >>>> On 30/03/2020 17:35, Stefano Stabellini wrote: >>>>> On Sat, 28 Mar 2020, Julien Grall wrote: >>>>>> qHi Stefano, >>>>>> >>>>>> On 27/03/2020 02:34, Stefano Stabellini wrote: >>>>>>> This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER >>>>>>> reads. It doesn't take into account the latest state of interrupts on >>>>>>> other vCPUs. Only the current vCPU is up-to-date. A full solution is >>>>>>> not possible because it would require synchronization among all >>>>>>> vCPUs, >>>>>>> which would be very expensive in terms or latency. >>>>>> >>>>>> Your sentence suggests you have number showing that correctly >>>>>> emulating >>>>>> the >>>>>> registers would be too slow. Mind sharing them? >>>>> >>>>> No, I don't have any numbers. Would you prefer a different wording or a >>>>> better explanation? I also realized there is a typo in there (or/of). >>>> Let me start with I think correctness is more important than speed. >>>> So I would have expected your commit message to contain some fact why >>>> synchronization is going to be slow and why this is a problem. >>>> >>>> To give you a concrete example, the implementation of set/way >>>> instructions are >>>> really slow (it could take a few seconds depending on the setup). >>>> However, >>>> this was fine because not implementing them correctly would have a >>>> greater >>>> impact on the guest (corruption) and they are not used often. >>>> >>>> I don't think the performance in our case will be in same order >>>> magnitude. It >>>> is most likely to be in the range of milliseconds (if not less) >>>> which I think >>>> is acceptable for emulation (particularly for the vGIC) and the >>>> current uses. 
>>> Writing on the mailing list some of our discussions today.
>>> Correctness is not just in terms of compliance to a specification but it
>>> is also about not breaking guests. Introducing latency in the range of
>>> milliseconds, or hundreds of microseconds, would break any latency
>>> sensitive workloads. We don't have numbers so we don't know for certain
>>> the effect that your suggestion would have.
>>
>> You missed part of the discussion. I don't disagree that latency is
>> important. However, if an implementation is only 95% reliable, then it
>> means 5% of the time your guest may break (corruption, crash,
>> deadlock...). At which point the latency is the last of your concern.
>>
>>> It would be interesting to have those numbers, and I'll add to my TODO
>>> list to run the experiments you suggested, but I'll put it on the
>>> back-burner (from a Xilinx perspective it is low priority as no
>>> customers are affected.)
>>
>> How about we get a correct implementation merge first and then discuss
>> about optimization? This would allow the community to check whether
>> there are actually noticeable latency in their workload.
>
> Hi,

Hi,

> I am not sure that pushing something with a performance impact to later
> fix it is the right approach here.
>
> The patch is an improvement compared to the current code and it can be
> further improved later to handle more cases (other cores).

If you read my other answer on this patch you will see that this is going to introduce a deadlock in the guest using multiple vCPUs. How is it an improvement compared to today?

> If we really have to sync all vCPUs here, this will cost a lot and the
> result will still be the status in the past in fact because nothing will
> make sure that at the point the guest gets back the value it is still valid.

I hope you are aware the problem is exactly the same in the hardware. As soon as you read the ISACTIVER, then the value may not be correct anymore.
So I don't see the issue with having an outdated value here.

As I said to Stefano yesterday, I would be happy to compromise as long as the implementation does not introduce an outright deadlock in the guest.

At the moment, I don't have a better approach than kicking all the vCPUs. Feel free to suggest a better one.

Cheers,
On 01/04/2020 18:02, George Dunlap wrote: > > > On Wed, Apr 1, 2020 at 10:54 AM Bertrand Marquis > <Bertrand.Marquis@arm.com <mailto:Bertrand.Marquis@arm.com>> wrote: > > > >> On 1 Apr 2020, at 09:30, Julien Grall <julien@xen.org >> <mailto:julien@xen.org>> wrote: >> >> >> >> On 01/04/2020 01:57, Stefano Stabellini wrote: >>> On Mon, 30 Mar 2020, Julien Grall wrote: >>>> Hi Stefano, >>>> >>>> On 30/03/2020 17:35, Stefano Stabellini wrote: >>>>> On Sat, 28 Mar 2020, Julien Grall wrote: >>>>>> qHi Stefano, >>>>>> >>>>>> On 27/03/2020 02:34, Stefano Stabellini wrote: >>>>>>> This is a simple implementation of GICD_ICACTIVER / >>>>>>> GICD_ISACTIVER >>>>>>> reads. It doesn't take into account the latest state of >>>>>>> interrupts on >>>>>>> other vCPUs. Only the current vCPU is up-to-date. A full >>>>>>> solution is >>>>>>> not possible because it would require synchronization among >>>>>>> all vCPUs, >>>>>>> which would be very expensive in terms or latency. >>>>>> >>>>>> Your sentence suggests you have number showing that correctly >>>>>> emulating >>>>>> the >>>>>> registers would be too slow. Mind sharing them? >>>>> >>>>> No, I don't have any numbers. Would you prefer a different >>>>> wording or a >>>>> better explanation? I also realized there is a typo in there >>>>> (or/of). >>>> Let me start with I think correctness is more important than speed. >>>> So I would have expected your commit message to contain some >>>> fact why >>>> synchronization is going to be slow and why this is a problem. >>>> >>>> To give you a concrete example, the implementation of set/way >>>> instructions are >>>> really slow (it could take a few seconds depending on the >>>> setup). However, >>>> this was fine because not implementing them correctly would have >>>> a greater >>>> impact on the guest (corruption) and they are not used often. >>>> >>>> I don't think the performance in our case will be in same order >>>> magnitude. 
It >>>> is most likely to be in the range of milliseconds (if not less) >>>> which I think >>>> is acceptable for emulation (particularly for the vGIC) and the >>>> current uses. >>> Writing on the mailing list some of our discussions today. >>> Correctness is not just in terms of compliance to a specification >>> but it >>> is also about not breaking guests. Introducing latency in the >>> range of >>> milliseconds, or hundreds of microseconds, would break any latency >>> sensitive workloads. We don't have numbers so we don't know for >>> certain >>> the effect that your suggestion would have. >> >> You missed part of the discussion. I don't disagree that latency >> is important. However, if an implementation is only 95% reliable, >> then it means 5% of the time your guest may break (corruption, >> crash, deadlock...). At which point the latency is the last of >> your concern. >> >>> It would be interesting to have those numbers, and I'll add to my >>> TODO >>> list to run the experiments you suggested, but I'll put it on the >>> back-burner (from a Xilinx perspective it is low priority as no >>> customers are affected.) >> >> How about we get a correct implementation merge first and then >> discuss about optimization? This would allow the community to >> check whether there are actually noticeable latency in their workload. > > Hi, > > I am not sure that pushing something with a performance impact to > later fix it is the right approach here. > > The patch is an improvement compared to the current code and it can > be further improved later to handle more cases (other cores). > > If we really have to sync all vCPUs here, this will cost a lot and > the result will still be the status in the past in fact because > nothing will make sure that at the point the guest gets back the > value it is still valid. > > > The same is true on real hardware, right? 
>
> Looking at this discussion as a non-ARM person, it sounds a bit like ARM
> specified this in a way that was
> useless-but-easy-to-implement-so-why-not; but it turns out to be
> useless-but-hard-to-implement virtualized.
>
> On the one hand, I'm sympathetic to Julien's point of view; I very much
> don't like the idea of silently changing behavior just because the
> specified behavior is inconvenient for us.
>
> On the other hand, I'm sympathetic to Stefano's point of view, that it's
> pointless to introduce a load of overhead and jitter to implement
> behavior which nobody in their right mind would even want to use (at
> least virtualized).
>
> What I think would be *ideal* would be for ARM to change the
> specification to make it easier to virtualize. For instance, by
> specifying that the register *may* contain information about other
> cores, but may not; or, that the register will contain information on
> other cores on real hardware, but not virtualized.

The goal of those registers is to give you a picture of the interrupts activated on your HW. For instance, this is used by Linux to properly synchronize pending interrupts as there would be a race otherwise. I wrote a long e-mail a few months ago explaining why Linux is using it and why we should implement it correctly [1].

Even on real HW, the value you read may already be incorrect (i.e. an interrupt may have become pending). What Stefano implemented is using a snapshot of what Xen thinks is the state of the vGIC. I think this is in line with the existing specification.

However, this has a major flaw, because it relies on the vCPUs exiting in order to synchronize the HW state with what Xen thinks the state of the vGIC is. We have configured the vGIC to not exit when the virtual interrupt is backed by an HW interrupt, so we need to rely on another trap for the synchronization. With the NULL scheduler (and even credit2 with only 1 vCPU running per pCPU), the number of traps is very limited.
So reading the I*ACTIVER registers may report that an interrupt is active for a really long time. The result here is that a thread in the guest may be blocked for a long time. To be brutally honest, this totally defeats the argument that "this patch has a better latency".

> Barring that, I think we should have a centralized place to document
> deviations from the spec; and I think changes to this spec should be
> coordinated with KVM (and maybe ACRN?), to minimize hypervisor-specific
> deviations.

AFAIK, KVM is implementing the ACTIVER registers the same way as Stefano's patch. It might be harder (or not possible) to hit the deadlock depending on how they configure the vGIC (i.e. trap the active state) and the scheduling.

As I discussed with Stefano yesterday, I can accept that the state is not up-to-date. But I can't ignore the fact this patch is introducing a deadlock in some circumstances.

Cheers,

[1] https://lore.kernel.org/xen-devel/7289f75f-1ab2-2d42-cd88-1be5306b3072@xen.org/
> On 1 Apr 2020, at 18:23, Julien Grall <julien@xen.org> wrote: > > > > On 01/04/2020 10:54, Bertrand Marquis wrote: >>> On 1 Apr 2020, at 09:30, Julien Grall <julien@xen.org <mailto:julien@xen.org>> wrote: >>> >>> >>> >>> On 01/04/2020 01:57, Stefano Stabellini wrote: >>>> On Mon, 30 Mar 2020, Julien Grall wrote: >>>>> Hi Stefano, >>>>> >>>>> On 30/03/2020 17:35, Stefano Stabellini wrote: >>>>>> On Sat, 28 Mar 2020, Julien Grall wrote: >>>>>>> qHi Stefano, >>>>>>> >>>>>>> On 27/03/2020 02:34, Stefano Stabellini wrote: >>>>>>>> This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER >>>>>>>> reads. It doesn't take into account the latest state of interrupts on >>>>>>>> other vCPUs. Only the current vCPU is up-to-date. A full solution is >>>>>>>> not possible because it would require synchronization among all vCPUs, >>>>>>>> which would be very expensive in terms or latency. >>>>>>> >>>>>>> Your sentence suggests you have number showing that correctly emulating >>>>>>> the >>>>>>> registers would be too slow. Mind sharing them? >>>>>> >>>>>> No, I don't have any numbers. Would you prefer a different wording or a >>>>>> better explanation? I also realized there is a typo in there (or/of). >>>>> Let me start with I think correctness is more important than speed. >>>>> So I would have expected your commit message to contain some fact why >>>>> synchronization is going to be slow and why this is a problem. >>>>> >>>>> To give you a concrete example, the implementation of set/way instructions are >>>>> really slow (it could take a few seconds depending on the setup). However, >>>>> this was fine because not implementing them correctly would have a greater >>>>> impact on the guest (corruption) and they are not used often. >>>>> >>>>> I don't think the performance in our case will be in same order magnitude. 
It >>>>> is most likely to be in the range of milliseconds (if not less) which I think >>>>> is acceptable for emulation (particularly for the vGIC) and the current uses. >>>> Writing on the mailing list some of our discussions today. >>>> Correctness is not just in terms of compliance to a specification but it >>>> is also about not breaking guests. Introducing latency in the range of >>>> milliseconds, or hundreds of microseconds, would break any latency >>>> sensitive workloads. We don't have numbers so we don't know for certain >>>> the effect that your suggestion would have. >>> >>> You missed part of the discussion. I don't disagree that latency is important. However, if an implementation is only 95% reliable, then it means 5% of the time your guest may break (corruption, crash, deadlock...). At which point the latency is the last of your concern. >>> >>>> It would be interesting to have those numbers, and I'll add to my TODO >>>> list to run the experiments you suggested, but I'll put it on the >>>> back-burner (from a Xilinx perspective it is low priority as no >>>> customers are affected.) >>> >>> How about we get a correct implementation merge first and then discuss about optimization? This would allow the community to check whether there are actually noticeable latency in their workload. >> Hi, > > Hi, Hi, > >> I am not sure that pushing something with a performance impact to later fix it is the right approach here. >> The patch is an improvement compared to the current code and it can be further improved later to handle more cases (other cores). > > If you read my other answer on this patch you will see that this is going to introduce a deadlock in the guest using multiple vCPUs. How is it an improvement compare to today? I fully agree with the fact that a deadlock even with low probability is a no-go. 
The current implementation, always returning 0, is not creating this deadlock case but can mislead in the other way, letting one CPU think that the interrupt has been handled already when it has possibly not been. > >> If we really have to sync all vCPUs here, this will cost a lot and the result will still be the status in the past in fact because nothing will make sure that at the point the guest gets back the value it is still valid. > > I hope you are aware the problem is exactly the same in the hardware. As soon as you read the ISACTIVER, then the value may not be correct anymore. So I don't see the issue to have an outdated value here. Main difference being that in the hardware you can still poll for it and be sure that it won’t end up in your deadlock. So I agree, we need to find a clean solution here. > > As I said to Stefano yesterday, I would be happy to compromise as long as the implementation does not introduce an outright deadlock in the guest. > > At the moment, I don't have a better approach than kick all the vCPUs. Feel free to suggest a better one. The only one that I see is to enforce a service interrupt when interrupts have been handled, to make sure that we clean up the internal active status as soon as the interrupt has actually been handled. This would ensure that the polling behaves properly from the guest point of view and would avoid the cross-CPU sync (the status will be right as soon as the interrupt has been serviced). But the trade-off will be a bigger overhead, by enforcing a return back to Xen each time an interrupt is handled by a guest. I will try to spend more time on this and discuss with the GIC engineers at Arm to dig deeper into this case. Cheers Bertrand > > Cheers, > > -- > Julien Grall
On Wed, 1 Apr 2020, Julien Grall wrote: > On 01/04/2020 01:57, Stefano Stabellini wrote: > > On Mon, 30 Mar 2020, Julien Grall wrote: > > > Hi Stefano, > > > > > > On 30/03/2020 17:35, Stefano Stabellini wrote: > > > > On Sat, 28 Mar 2020, Julien Grall wrote: > > > > > qHi Stefano, > > > > > > > > > > On 27/03/2020 02:34, Stefano Stabellini wrote: > > > > > > This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER > > > > > > reads. It doesn't take into account the latest state of interrupts > > > > > > on > > > > > > other vCPUs. Only the current vCPU is up-to-date. A full solution is > > > > > > not possible because it would require synchronization among all > > > > > > vCPUs, > > > > > > which would be very expensive in terms or latency. > > > > > > > > > > Your sentence suggests you have number showing that correctly > > > > > emulating > > > > > the > > > > > registers would be too slow. Mind sharing them? > > > > > > > > No, I don't have any numbers. Would you prefer a different wording or a > > > > better explanation? I also realized there is a typo in there (or/of). > > > Let me start with I think correctness is more important than speed. > > > So I would have expected your commit message to contain some fact why > > > synchronization is going to be slow and why this is a problem. > > > > > > To give you a concrete example, the implementation of set/way instructions > > > are > > > really slow (it could take a few seconds depending on the setup). However, > > > this was fine because not implementing them correctly would have a greater > > > impact on the guest (corruption) and they are not used often. > > > > > > I don't think the performance in our case will be in same order magnitude. > > > It > > > is most likely to be in the range of milliseconds (if not less) which I > > > think > > > is acceptable for emulation (particularly for the vGIC) and the current > > > uses. > > > > Writing on the mailing list some of our discussions today. 
> > > > Correctness is not just in terms of compliance to a specification but it > > is also about not breaking guests. Introducing latency in the range of > > milliseconds, or hundreds of microseconds, would break any latency > > sensitive workloads. We don't have numbers so we don't know for certain > > the effect that your suggestion would have. > > You missed part of the discussion. I don't disagree that latency is important. > However, if an implementation is only 95% reliable, then it means 5% of the > time your guest may break (corruption, crash, deadlock...). At which point the > latency is the last of your concern. Yeah I missed to highlight it, also because I look at it from a slightly different perspective: I think IRQ latency is part of correctness. If we have a solution that is not 100% faithful to the specification we are going to break guests that rely on a faithful implementation of ISACTIVER. If we have a solution that is 100% faithful to the specification but causes latency spikes it breaks RT guests. But they are different sets of guests, it is not like one is necessarily a subset of the other: there are guests that cannot tolerate any latency spikes but they are OK with an imprecise implementation of ISACTIVER. My preference is a solution that is both spec-faithful and also doesn't cause any latency spikes. If that is not possible then we'll have to compromise or come up with "creative" ideas. > > It would be interesting to have those numbers, and I'll add to my TODO > > list to run the experiments you suggested, but I'll put it on the > > back-burner (from a Xilinx perspective it is low priority as no > > customers are affected.) > > How about we get a correct implementation merge first and then discuss about > optimization? This would allow the community to check whether there are > actually noticeable latency in their workload. 
A correct implementation to me means that it is correct from both the specification point of view as well as from a latency point of view. A patch that potentially introduces latency spikes could cause guest breakage and I don't think it should be considered correct. The tests would have to be done beforehand. In terms of other "creative" ideas, here are some: One idea, as George suggested, would be to document the interface deviation. The intention is still to remove any deviation but at least we would be clear about what we have. Ideally in a single place together with other hypervisors. This is my preference. Another idea is that we could crash the guest if GICD_ISACTIVER is read from a multi-vcpu guest. It is similar to what we already do today but better because we would do it purposely (not because of a typo) and because it will work for single vcpu guests at least. We could also leave it as is (crash all the time) but it implies that vendors that are seeing issues with Linux today will have to keep maintaining patches in their private trees until a better solution is found. This would also be the case if we crash multi-vcpus guests as previously suggested.
(+Marc) Hi Stefano, On 02/04/2020 18:19, Stefano Stabellini wrote: > On Wed, 1 Apr 2020, Julien Grall wrote: >> On 01/04/2020 01:57, Stefano Stabellini wrote: >>> On Mon, 30 Mar 2020, Julien Grall wrote: >>>> Hi Stefano, >>>> >>>> On 30/03/2020 17:35, Stefano Stabellini wrote: >>>>> On Sat, 28 Mar 2020, Julien Grall wrote: >>>>>> qHi Stefano, >>>>>> >>>>>> On 27/03/2020 02:34, Stefano Stabellini wrote: >>>>>>> This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER >>>>>>> reads. It doesn't take into account the latest state of interrupts >>>>>>> on >>>>>>> other vCPUs. Only the current vCPU is up-to-date. A full solution is >>>>>>> not possible because it would require synchronization among all >>>>>>> vCPUs, >>>>>>> which would be very expensive in terms or latency. >>>>>> >>>>>> Your sentence suggests you have number showing that correctly >>>>>> emulating >>>>>> the >>>>>> registers would be too slow. Mind sharing them? >>>>> >>>>> No, I don't have any numbers. Would you prefer a different wording or a >>>>> better explanation? I also realized there is a typo in there (or/of). >>>> Let me start with I think correctness is more important than speed. >>>> So I would have expected your commit message to contain some fact why >>>> synchronization is going to be slow and why this is a problem. >>>> >>>> To give you a concrete example, the implementation of set/way instructions >>>> are >>>> really slow (it could take a few seconds depending on the setup). However, >>>> this was fine because not implementing them correctly would have a greater >>>> impact on the guest (corruption) and they are not used often. >>>> >>>> I don't think the performance in our case will be in same order magnitude. >>>> It >>>> is most likely to be in the range of milliseconds (if not less) which I >>>> think >>>> is acceptable for emulation (particularly for the vGIC) and the current >>>> uses. >>> >>> Writing on the mailing list some of our discussions today. 
>>> >>> Correctness is not just in terms of compliance to a specification but it >>> is also about not breaking guests. Introducing latency in the range of >>> milliseconds, or hundreds of microseconds, would break any latency >>> sensitive workloads. We don't have numbers so we don't know for certain >>> the effect that your suggestion would have. >> >> You missed part of the discussion. I don't disagree that latency is important. >> However, if an implementation is only 95% reliable, then it means 5% of the >> time your guest may break (corruption, crash, deadlock...). At which point the >> latency is the last of your concern. > > Yeah I missed to highlight it, also because I look at it from a slightly > different perspective: I think IRQ latency is part of correctness. > > If we have a solution that is not 100% faithful to the specification we > are going to break guests that rely on a faithful implementation of > ISACTIVER. > > If we have a solution that is 100% faithful to the specification but > causes latency spikes it breaks RT guests. > > But they are different sets of guests, it is not like one is necessarily > a subset of the other: there are guests that cannot tolerate any latency > spikes but they are OK with an imprecise implementation of ISACTIVER. > > My preference is a solution that is both spec-faithful and also doesn't > cause any latency spikes. If that is not possible then we'll have to > compromise or come up with "creative" ideas. I do agree that latency is important. However, this needs to be based on numbers or a good grasp as to why this would be an issue. Neither of these has been provided so far. As we discussed on Tuesday, the cost for other vCPUs is only going to be a trap to the hypervisor and then back again. The cost is likely smaller than receiving and forwarding an interrupt. You actually agreed on this analysis. So can you enlighten me as to why receiving an interrupt is not a problem for latency but this is?
> >>> It would be interesting to have those numbers, and I'll add to my TODO >>> list to run the experiments you suggested, but I'll put it on the >>> back-burner (from a Xilinx perspective it is low priority as no >>> customers are affected.) >> >> How about we get a correct implementation merge first and then discuss about >> optimization? This would allow the community to check whether there are >> actually noticeable latency in their workload. > > A correct implementation to me means that it is correct from both the > specification point of view as well as from a latency point of view. A > patch that potentially introduces latency spikes could cause guest > breakage and I don't think it should be considered correct. The > tests would have to be done beforehand. In all honesty, writing and testing the implementation would likely have taken you less time than trying to push for "creative" ideas or your patch. > In terms of other "creative" ideas, here are some: "creative" ideas should really be the last resort. Correct me if I am wrong, but I don't think we are there yet. > > One idea, as George suggested, would be to document the interface > deviation. The intention is still to remove any deviation but at least > we would be clear about what we have. Ideally in a single place together > with other hypervisors. This is my preference. It goes without saying that deviation from specification should not be taken lightly and has risks. The main risk is you are never going to be able to reliably run an OS on Xen unless you manage to get the deviation accepted by the wider community and Arm. > > Another idea is that we could crash the guest if GICD_ISACTIVER is read > from a multi-vcpu guest. It is similar to what we already do today but > better because we would do it purposely (not because of a typo) and > because it will work for single vcpu guests at least.
> > We could also leave it as is (crash all the time) but it implies that > vendors that are seeing issues with Linux today will have to keep > maintaining patches in their private trees until a better solution is > found. This would also be the case if we crash multi-vcpus guests as > previously suggested. The crash only happened when using vGICv3, not vGICv2. But did you look at Xen recently? Particularly at the following patch: xen/arm: Handle unimplemented VGICv3 registers as RAZ/WI Per the ARM Generic Interrupt Controller Architecture Specification (ARM IHI 0069E), reserved registers should generally be treated as RAZ/WI. To simplify the VGICv3 design and improve guest compatibility, treat the default case for GICD and GICR registers as read_as_zero/write_ignore. Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com> Acked-by: Julien Grall <julien@xen.org> I actually pointed the patch to you during one of our weekly calls. Yet we agreed it would still be good to implement the register properly and you said you would write a patch. Anyway, I gave you a solution and also why I think it would still be fine in terms of IRQ latency. The ball is now in your court. Cheers,
On 2020-04-02 19:52, Julien Grall wrote: > (+Marc) Thanks for looping me in. Definitely an interesting read, but also a very puzzling one. > > Hi Stefano, > > On 02/04/2020 18:19, Stefano Stabellini wrote: >> On Wed, 1 Apr 2020, Julien Grall wrote: >>> On 01/04/2020 01:57, Stefano Stabellini wrote: >>>> On Mon, 30 Mar 2020, Julien Grall wrote: >>>>> Hi Stefano, >>>>> >>>>> On 30/03/2020 17:35, Stefano Stabellini wrote: >>>>>> On Sat, 28 Mar 2020, Julien Grall wrote: >>>>>>> qHi Stefano, >>>>>>> >>>>>>> On 27/03/2020 02:34, Stefano Stabellini wrote: >>>>>>>> This is a simple implementation of GICD_ICACTIVER / >>>>>>>> GICD_ISACTIVER >>>>>>>> reads. It doesn't take into account the latest state of >>>>>>>> interrupts >>>>>>>> on >>>>>>>> other vCPUs. Only the current vCPU is up-to-date. A full >>>>>>>> solution is >>>>>>>> not possible because it would require synchronization among all >>>>>>>> vCPUs, >>>>>>>> which would be very expensive in terms or latency. >>>>>>> >>>>>>> Your sentence suggests you have number showing that correctly >>>>>>> emulating >>>>>>> the >>>>>>> registers would be too slow. Mind sharing them? >>>>>> >>>>>> No, I don't have any numbers. Would you prefer a different wording >>>>>> or a >>>>>> better explanation? I also realized there is a typo in there >>>>>> (or/of). >>>>> Let me start with I think correctness is more important than speed. >>>>> So I would have expected your commit message to contain some fact >>>>> why >>>>> synchronization is going to be slow and why this is a problem. >>>>> >>>>> To give you a concrete example, the implementation of set/way >>>>> instructions >>>>> are >>>>> really slow (it could take a few seconds depending on the setup). >>>>> However, >>>>> this was fine because not implementing them correctly would have a >>>>> greater >>>>> impact on the guest (corruption) and they are not used often. >>>>> >>>>> I don't think the performance in our case will be in same order >>>>> magnitude. 
>>>>> It >>>>> is most likely to be in the range of milliseconds (if not less) >>>>> which I >>>>> think >>>>> is acceptable for emulation (particularly for the vGIC) and the >>>>> current >>>>> uses. >>>> >>>> Writing on the mailing list some of our discussions today. >>>> >>>> Correctness is not just in terms of compliance to a specification >>>> but it >>>> is also about not breaking guests. Introducing latency in the range >>>> of >>>> milliseconds, or hundreds of microseconds, would break any latency >>>> sensitive workloads. We don't have numbers so we don't know for >>>> certain >>>> the effect that your suggestion would have. >>> >>> You missed part of the discussion. I don't disagree that latency is >>> important. >>> However, if an implementation is only 95% reliable, then it means 5% >>> of the >>> time your guest may break (corruption, crash, deadlock...). At which >>> point the >>> latency is the last of your concern. >> >> Yeah I missed to highlight it, also because I look at it from a >> slightly >> different perspective: I think IRQ latency is part of correctness. No. Low latency is a very desirable thing, but it doesn't matter at all when you don't even have functional correctness. To use my favourite car analogy, having a bigger engine doesn't help when you're about to hit the wall and have no brakes... You just hit the wall faster. >> >> If we have a solution that is not 100% faithful to the specification >> we >> are going to break guests that rely on a faithful implementation of >> ISACTIVER. >> >> If we have a solution that is 100% faithful to the specification but >> causes latency spikes it breaks RT guests. >> >> But they are different sets of guests, it is not like one is >> necessarily >> a subset of the other: there are guests that cannot tolerate any >> latency >> spikes but they are OK with an imprecise implementation of ISACTIVER.
s/imprecise/massively incorrect/ >> >> My preference is a solution that is both spec-faithful and also >> doesn't >> cause any latency spikes. If that is not possible then we'll have to >> compromise or come up with "creative" ideas. You want to have your cake and eat it. Always a good thing to want. As long as you don't pretend you implement the GIC architecture, expose something else altogether to the guests and have specific drivers in all the guest operating systems under the sun, by all means, be creative. > I do agree that latency is important. However, this needs to be based > on numbers or a good grasp as to why this would be an issue. Neither > of these have been provided so far. > > As we discussed on Tuesday, the cost for other vCPUs is only going to > be a trap to the hypervisor and then back again. The cost is likely > smaller than receiving and forwarding an interrupt. > > You actually agreed on this analysis. So can you enlighten me as to > why receiving an interrupt is a not problem for latency but this is? > >> >>>> It would be interesting to have those numbers, and I'll add to my >>>> TODO >>>> list to run the experiments you suggested, but I'll put it on the >>>> back-burner (from a Xilinx perspective it is low priority as no >>>> customers are affected.) >>> >>> How about we get a correct implementation merge first and then >>> discuss about >>> optimization? This would allow the community to check whether there >>> are >>> actually noticeable latency in their workload. >> >> A correct implementation to me means that it is correct from both the >> specification point of view as well as from a latency point of view. A >> patch that potentially introduces latency spikes could cause guest >> breakage and I don't think it should be considered correct. The >> tests would have to be done beforehand. Please find anything that specifies latency in the GIC spec. Oh wait, there is none. Because that's a quality of implementation thing, and not a correctness issue.
> > In all honesty, writing and testing the implementation would have > likely took you less than trying to push for "creative" ideas or your > patch. > >> In terms of other "creative" ideas, here are some: > > "creative" ideas should really be the last resort. Correct me if I am > wrong, but I don't think we are there yet. > >> >> One idea, as George suggested, would be to document the interface >> deviation. The intention is still to remove any deviation but at least >> we would be clear about what we have. Ideally in a single place >> together >> with other hypervisors. This is my preference. > > It is not without saying that deviation from specification should not > be taken lightly and has risks. > > The main risk is you are never going to be able to reliably run an OS > on Xen unless you manage to get the deviation accepted by the wider > community and Arm. There is just no way I'll ever accept a change to the GIC interrupt state machine for Linux. Feel free to try and convince other OS maintainers. If you want to be creative, be really creative. Implement a fully PV interrupt controller that gives you simple enough semantics so that you can actually be deterministic. Don't pretend you implement the GIC architecture. >> Another idea is that we could crash the guest if GICD_ISACTIVER is >> read >> from a multi-vcpu guest. It is similar to what we already do today but >> better because we would do it purposely (not because of a typo) and >> because it will work for single vcpu guests at least. >> >> We could also leave it as is (crash all the time) but it implies that >> vendors that are seeing issues with Linux today will have to keep >> maintaining patches in their private trees until a better solution is >> found. This would also be the case if we crash multi-vcpus guests as >> previously suggested. OK, that's just insane. Suggesting that guests should be left crashing because they are doing *the right thing* is just barking mad.
I'm all for exposing guest bugs when they take shortcuts with the architecture, but blaming the guest for what is just a bug in Xen? If I was someone developing a product using Xen/ARM, I'd be very worried about what you have written above. Because it really reads "we don't care about reliability as long as we can show amazing numbers". I really hope it isn't what you mean. M.
> On Apr 3, 2020, at 9:47 AM, Marc Zyngier <maz@kernel.org> wrote: > > On 2020-04-02 19:52, Julien Grall wrote: >> (+Marc) > > Thanks for looping me in. Definitely an interesting read, but also a very > puzzling one. [snip] > No. Low latency is a very desirable thing, but it doesn't matter at all when > you don't even have functional correctness. To use my favourite car analogy, > having a bigger engine doesn't help when you're about to hit the wall and > have no breaks... You just hit the wall faster. [snip] > s/imprecise/massively incorrect/ [snip] > There is just no way I'll ever accept a change to the GIC interrupt state > machine for Linux. Feel free to try and convince other OS maintainers. [snip] > If I was someone developing a product using Xen/ARM, I'd be very worried > about what you have written above. Because it really reads "we don't care > about reliability as long as we can show amazing numbers". I really hope > it isn't what you mean. What's puzzling to me is that, according to everyone else in this thread, what Stefano is trying to do is to get Xen to behave like KVM. Are they wrong? If so, we can just do whatever Linux does. If not, then you need to first turn all your imprecations about correctness, smashing into walls, concern for the sanity of maintainers and so on towards your own code first. -George
George, On 2020-04-03 11:43, George Dunlap wrote: >> On Apr 3, 2020, at 9:47 AM, Marc Zyngier <maz@kernel.org> wrote: >> On 2020-04-02 19:52, Julien Grall wrote: >>> (+Marc) >> Thanks for looping me in. Definitely an interesting read, but also a >> very >> puzzling one. > [snip] > >> No. Low latency is a very desirable thing, but it doesn't matter at >> all when >> you don't even have functional correctness. To use my favourite car >> analogy, >> having a bigger engine doesn't help when you're about to hit the wall >> and >> have no breaks... You just hit the wall faster. > [snip] > >> s/imprecise/massively incorrect/ > [snip] > >> There is just no way I'll ever accept a change to the GIC interrupt >> state >> machine for Linux. Feel free to try and convince other OS maintainers. > [snip] > >> If I was someone developing a product using Xen/ARM, I'd be very >> worried >> about what you have written above. Because it really reads "we don't >> care >> about reliability as long as we can show amazing numbers". I really >> hope >> it isn't what you mean. > What's puzzling to me, is that what everyone else in this thread is > that what Stefano is trying to do is to get Xen to be have like KVM. Sorry, I don't get what you mean here. KVM at least aims to be architecturally compliant. Is it perfect? Most probably not, as we fix it all the time. Dealing with the active registers is hard. But as far as I can see, we do get them right. Do we sacrifice latency for correctness? Yes. And if you have spotted a problem in the way we handle those, pray tell. > Are they wrong? If so, we can just do whatever Linux does. If not, > then you need to first turn all your imprecations about correctness, > smashing into walls, concern for the sanity of maintainers and so on > towards your own code first. I'm really sorry, but you seem to have the wrong end of the stick here.
I'm concerned about implementing only a small part of the architecture, ignoring the rest, and letting guests crash, which is what was suggested here. M.
> On Apr 3, 2020, at 11:59 AM, Marc Zyngier <maz@kernel.org> wrote: > > George, > > On 2020-04-03 11:43, George Dunlap wrote: >>> On Apr 3, 2020, at 9:47 AM, Marc Zyngier <maz@kernel.org> wrote: >>> On 2020-04-02 19:52, Julien Grall wrote: >>>> (+Marc) >>> Thanks for looping me in. Definitely an interesting read, but also a very >>> puzzling one. >> [snip] >>> No. Low latency is a very desirable thing, but it doesn't matter at all when >>> you don't even have functional correctness. To use my favourite car analogy, >>> having a bigger engine doesn't help when you're about to hit the wall and >>> have no breaks... You just hit the wall faster. >> [snip] >>> s/imprecise/massively incorrect/ >> [snip] >>> There is just no way I'll ever accept a change to the GIC interrupt state >>> machine for Linux. Feel free to try and convince other OS maintainers. >> [snip] >>> If I was someone developing a product using Xen/ARM, I'd be very worried >>> about what you have written above. Because it really reads "we don't care >>> about reliability as long as we can show amazing numbers". I really hope >>> it isn't what you mean. >> What's puzzling to me, is that what everyone else in this thread is >> that what Stefano is trying to do is to get Xen to be have like KVM. > > Sorry, I don't get what you mean here. KVM at least aims to be architecturally > compliant. Is it perfect? Most probably not, as we fix it all the time. > > Dealing with the active registers is hard. But as far as I can see, > we do get them right. Do we sacrifice latency over correctness? Yes. > > And if you have spotted a problem in the way we handle those, pray tell. > >> Are they wrong? If so, we can just do whatever Linux does. If not, >> then you need to first turn all your imprecations about correctness, >> smashing into walls, concern for the sanity of maintainers and so on >> towards your own code first. > > I'm really sorry, but you see to have the wrong end of the stick here. 
> I'm not trying to compare Xen to KVM at all. I'm concerned about only > implementing only a small part of the architecture, ignoring the rest, > and letting guests crash, which is what was suggested here. The current situation (as I understand it) is that Xen never implemented this functionality at all. Stefano has been trying to convince Julien to implement this register KVM’s way, which has good latency in the median case, but in particular configurations, has arbitrarily bad worst-case latency for multi-vcpu guests. Julien thinks what KVM is doing is wrong and against the spec, and has been refusing to let the patches in. He has proposed another suggestion which is closer to the spec in terms of functionality, and has bounded worst-case latency, but which has worse latency and uncontrollable jitter in the median case (or at least, so Stefano believes). As a compromise, Stefano suggested implementing KVM’s way for single-vcpu guests. This is a strict improvement over the current situation, since at least a lot of new guests start working, while they hash out what to do about multi-vcpu guests. My proposal has been to work with KVM to document this deviation from the spec for guests running virtualized. So it’s you who have the wrong end of the stick; your contempt is misplaced. If you don’t think it’s a deviation, then please help us convince Julien, so we can take Stefano’s patch as-is. Or, if Julien convinces you that KVM is deviating from the spec, then let’s try to work together to see how we can implement the necessary functionality efficiently in a virtualized environment. -George
On 03/04/2020 11:43, George Dunlap wrote: > >> On Apr 3, 2020, at 9:47 AM, Marc Zyngier <maz@kernel.org> wrote: >> >> On 2020-04-02 19:52, Julien Grall wrote: >>> (+Marc) >> >> Thanks for looping me in. Definitely an interesting read, but also a very >> puzzling one. > > [snip] > >> No. Low latency is a very desirable thing, but it doesn't matter at all when >> you don't even have functional correctness. To use my favourite car analogy, >> having a bigger engine doesn't help when you're about to hit the wall and >> have no breaks... You just hit the wall faster. > > [snip] > >> s/imprecise/massively incorrect/ > > [snip] > >> There is just no way I'll ever accept a change to the GIC interrupt state >> machine for Linux. Feel free to try and convince other OS maintainers. > > [snip] > >> If I was someone developing a product using Xen/ARM, I'd be very worried >> about what you have written above. Because it really reads "we don't care >> about reliability as long as we can show amazing numbers". I really hope >> it isn't what you mean. > > What's puzzling to me, is that what everyone else in this thread is that what Stefano is trying to do is to get Xen to be have like KVM. This reads to me as "bugs don't exist". As I actually said in a previous e-mail, our vGIC is significantly different compared to KVM. It *might* be possible they are not affected because they may trap when a guest is deactivating an interrupt *or* by other means. I didn't look into the details of their implementation, but this suggests you or Stefano may have. Why do you think Stefano's implementation is following what KVM does? If the behavior is the same, was the problem reported to KVM ML? What was the answer from the maintainers? I suspect this was never discussed on KVM ML. So that should really be the first step. Cheers,
On Fri, 3 Apr 2020, Marc Zyngier wrote:
> > > Yeah I missed to highlight it, also because I look at it from a slightly
> > > different perspective: I think IRQ latency is part of correctness.
>
> No. Low latency is a very desirable thing, but it doesn't matter at all when
> you don't even have functional correctness.

Hi Marc,

It is good to hear from you. Functional correctness can still lead to breakage. I wrote more details in the last part of this email to explain it in detail; it is long but please read it all!

[...]

> Please find anything that specifies latency in the GIC spec. Oh wait,
> there is none. Because that's a quality of implementation thing, and
> not a correctness issue.

The goal of being faithful to the spec is not compliance for the sake of it. I take it neither you nor I actually care about labels with compliance logos to attach anywhere. The goal is to run guests correctly when virtualized. Do you agree with me so far?

The difficult question is: what happens when the hypervisor is faithful to the spec, but guests still break?

> > In all honesty, writing and testing the implementation would have
> > likely taken you less time than trying to push for "creative" ideas or your
> > patch.
> >
> > > In terms of other "creative" ideas, here are some:
> >
> > "creative" ideas should really be the last resort. Correct me if I am
> > wrong, but I don't think we are there yet.
> >
> > > One idea, as George suggested, would be to document the interface
> > > deviation. The intention is still to remove any deviation but at least
> > > we would be clear about what we have. Ideally in a single place together
> > > with other hypervisors. This is my preference.
> >
> > It goes without saying that deviation from the specification should not
> > be taken lightly and has risks.
> >
> > The main risk is you are never going to be able to reliably run an OS
> > on Xen unless you manage to get the deviation accepted by the wider
> > community and Arm.
>
> There is just no way I'll ever accept a change to the GIC interrupt state
> machine for Linux. Feel free to try and convince other OS maintainers.
>
> If you want to be creative, be really creative. Implement a fully PV interrupt
> controller that gives you simple enough semantics so that you can actually be
> deterministic. Don't pretend you implement the GIC architecture.

Last time we looked at KVM, KVM was actually doing what my patch here does, not what Julien suggested. (In short, my patch is implementing ISACTIVER accurately only for the current vcpu and not the others; Julien is suggesting to send an IPI to other pcpus running vcpus so that the implementation becomes accurate for the other vcpus too.)

> > > Another idea is that we could crash the guest if GICD_ISACTIVER is read
> > > from a multi-vcpu guest. It is similar to what we already do today but
> > > better because we would do it purposely (not because of a typo) and
> > > because it will work for single vcpu guests at least.
> > >
> > > We could also leave it as is (crash all the time) but it implies that
> > > vendors that are seeing issues with Linux today will have to keep
> > > maintaining patches in their private trees until a better solution is
> > > found. This would also be the case if we crash multi-vcpus guests as
> > > previously suggested.
>
> OK, that's just insane. Suggesting that guests should be left crashing
> because they are doing *the right thing* is just barking mad. I'm all for
> exposing guest bugs when they take shortcuts with the architecture, but
> blaming the guest for what is just a bug in Xen?
>
> If I was someone developing a product using Xen/ARM, I'd be very worried
> about what you have written above. Because it really reads "we don't care
> about reliability as long as we can show amazing numbers". I really hope
> it isn't what you mean.

It is not what I am saying, and I am glad I have an opportunity to explain it.
Don't think about low latency numbers as: "great, it runs faster!" I don't care about that. Well, I care, but not that much, certainly not at the expense of being faithful to the spec.

Customers here are running guests and workloads that fail spectacularly if they don't get IRQ latency within low deterministic ranges. A latency spike can cause a severe failure, as bad a failure as a total system crash.

Today, these guests don't use ISACTIVER at all AFAIK, but let's imagine that in the future they will. Being latency sensitive guests, I doubt they'll ever loop over ISACTIVER, but maybe they could call it once at initialization time or during shutdown or something? Also, let's say implementing ISACTIVER faithfully with an IPI to other vcpus might cause latency spikes (note that we don't have numbers on this). Does this scenario make sense to you so far?

If we implement ISACTIVER using an IPI we introduce a significant source of non-determinism. We yank the other vcpu into hypervisor mode when it could have been running the critical section. It can very well cause one of those spectacular failures. It might take the engineers putting the system together a very significant amount of time to figure out the problem.

Doing what my patch here does might be OK until one of these guests starts to rely on ISACTIVER to be accurate. So far I have not seen any examples of it, but I agree it could happen; hence, it is risky. Frankly, I thought that KVM was already behaving the way of my patch and it was making me feel more confident about this solution.

Finally, the last option is to just crash early. It is not blaming the guest -- it would serve as an early warning that something is not right and needs to be done about the system. Typically, it is better to fail early and clearly rather than at some point later more subtly.
On 03/04/2020 17:18, Stefano Stabellini wrote:
> On Fri, 3 Apr 2020, Marc Zyngier wrote:
>
> Doing what my patch here does might be OK until one of these guests
> start to rely on ISACTIVER to be accurate. So far I have not seen any
> examples of it, but I agree it could happen, hence, it is risky.

I am only going to answer to this. This is not about *accuracy* but a deadlock in your guest. I actually wrote a long e-mail on this thread explaining the possible deadlock.

It is not because you can't reproduce the deadlock that the deadlock is not there. When are you going to stop dismissing a real bug in your implementation?

Cheers,
On 03/04/2020 17:23, Julien Grall wrote:
>
> On 03/04/2020 17:18, Stefano Stabellini wrote:
>> On Fri, 3 Apr 2020, Marc Zyngier wrote:
>>
>> Doing what my patch here does might be OK until one of these guests
>> start to rely on ISACTIVER to be accurate. So far I have not seen any
>> examples of it, but I agree it could happen, hence, it is risky.
>
> I am only going to answer to this. This is not about *accuracy* but a
> deadlock in your guest. I actually wrote a long e-mail on this thread
> explaining the possible deadlock.
>
> It is not because you can't reproduce the deadlock that the deadlock is
> not there. When are you going to stop dismissing a real bug in your
> implementation?

So everyone is on the same page, here is the link to the e-mail I wrote about the potential problem:

https://lists.xenproject.org/archives/html/xen-devel/2020-03/msg02039.html

Cheers,
On Thu, 2 Apr 2020, Julien Grall wrote:
> As we discussed on Tuesday, the cost for other vCPUs is only going to be a
> trap to the hypervisor and then back again. The cost is likely smaller than
> receiving and forwarding an interrupt.
>
> You actually agreed on this analysis. So can you enlighten me as to why
> receiving an interrupt is not a problem for latency but this is?

My answer was that the difference is that an operating system can disable interrupts, but it cannot disable receiving this special IPI.

> The crash only happened when using vGICv3 not vGICv2. But did you look at Xen
> recently? Particularly at the following patch:
>
>     xen/arm: Handle unimplemented VGICv3 registers as RAZ/WI
>
>     Per the ARM Generic Interrupt Controller Architecture Specification (ARM
>     IHI 0069E), reserved registers should generally be treated as RAZ/WI.
>     To simplify the VGICv3 design and improve guest compatibility, treat the
>     default case for GICD and GICR registers as read_as_zero/write_ignore.
>
>     Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com>
>     Acked-by: Julien Grall <julien@xen.org>
>
> I actually pointed the patch to you during one of our weekly calls. Yet we
> agreed it would still be good to implement the register properly and you said
> you would write a patch.

As you know I cannot reproduce the crash myself; I asked Peng and Wei for help with that. I cannot be certain Jeff's patch makes a difference, but looking at the code, if you open xen/arch/arm/vgic-v3.c:__vgic_v3_distr_common_mmio_read you can see that the range mistake is still there:

    /* Read the active status of an IRQ via GICD/GICR is not supported */
    case VRANGE32(GICD_ISACTIVER, GICD_ISACTIVER):
    case VRANGE32(GICD_ICACTIVER, GICD_ICACTIVERN):
        goto read_as_zero;

So a read of any GICD_ISACTIVER register but the first should end up hitting the default case:

    default:
        printk(XENLOG_G_ERR
               "%pv: %s: unhandled read r%d offset %#08x\n",
               v, name, dabt.reg, reg);
        return 0;
    }

Which returns 0 (IO_ABORT).
Would you be happy to have the range fixed to be:

    case VRANGE32(GICD_ISACTIVER, GICD_ISACTIVERN):

instead?
On Fri, 3 Apr 2020 at 20:41, Stefano Stabellini <sstabellini@kernel.org> wrote:
>
> On Thu, 2 Apr 2020, Julien Grall wrote:
> > As we discussed on Tuesday, the cost for other vCPUs is only going to be a
> > trap to the hypervisor and then back again. The cost is likely smaller than
> > receiving and forwarding an interrupt.
> >
> > You actually agreed on this analysis. So can you enlighten me as to why
> > receiving an interrupt is not a problem for latency but this is?
>
> My answer was that the difference is that an operating system can
> disable interrupts, but it cannot disable receiving this special IPI.

An OS can *only* disable its own interrupts, yet interrupts will still be received by Xen even if the interrupts are masked at the processor (e.g. local_irq_disable()).

You would need to disable interrupts one by one at the GIC level (using ICENABLER) in order to not receive any interrupts. Yet, Xen may still receive interrupts for operational purposes (e.g. serial, console, maintenance IRQ...). So traps will happen.

But if your concern is disturbing a vCPU which has interrupts disabled, then there is a way to mitigate this in an implementation: you can easily know whether there are interrupts in flight at a given point for a given vCPU, so you can avoid the IPI if you know there are no interrupts potentially active. This is why I clearly wrote, in the same e-mail as where I made the proposal, that the implementation I suggested was *not* optimized.

Cheers,
> On Apr 3, 2020, at 9:27 PM, Julien Grall <julien.grall.oss@gmail.com> wrote:
>
> On Fri, 3 Apr 2020 at 20:41, Stefano Stabellini <sstabellini@kernel.org> wrote:
>>
>> On Thu, 2 Apr 2020, Julien Grall wrote:
>>> As we discussed on Tuesday, the cost for other vCPUs is only going to be a
>>> trap to the hypervisor and then back again. The cost is likely smaller than
>>> receiving and forwarding an interrupt.
>>>
>>> You actually agreed on this analysis. So can you enlighten me as to why
>>> receiving an interrupt is not a problem for latency but this is?
>>
>> My answer was that the difference is that an operating system can
>> disable interrupts, but it cannot disable receiving this special IPI.
>
> An OS can *only* disable its own interrupts, yet interrupts will still
> be received by Xen even if the interrupts are masked at the processor
> (e.g. local_irq_disable()).
>
> You would need to disable interrupts one by one at the GIC level (use
> ICENABLER) in order to not receive any interrupts. Yet, Xen may still
> receive interrupts for operational purposes (e.g. serial, console,
> maintenance IRQ...). So traps will happen.

I think Stefano’s assertion is that the users he has in mind will be configuring the system such that RT workloads get the minimum number of hypervisor-related interrupts possible. On a 4-core system, you could have non-RT workloads running on cores 0-1, and RT workloads running with the NULL scheduler on cores 2-3. In such a system, you’d obviously arrange that serial and maintenance IRQs are delivered to cores 0-1.

Julien, I think your argument above about interrupts somewhat undermines your point about deadlock. Your basic argument is that a guest will be interrupted by Xen quite frequently anyway for lots of reasons; adding one more to the list shouldn’t measurably increase the jitter.
But if it’s true that a guest will be interrupted by Xen frequently, then deadlock shouldn’t be much of an issue; and insofar as deadlock is an issue, it’s because there were no interrupts — and thus, adding an extra IPI will have a statistically significant effect on jitter.

On the other hand, Stefano’s argument about deadlock (if I understand him correctly) has been that guests shouldn’t really be spinning on that register anyway. But it sounds from Julien’s other email that perhaps spinning on the register is exactly what newer versions of Linux do — so Linux would certainly hang on such systems if Stefano’s assertion about the low number of Xen-related interrupts is true.

If the latter is true, then it sounds like addressing the deadlock issue will be necessary? And so effort needs to be put towards figuring out how to minimize jitter due to Xen-related IPIs.

Wei Xu / Peng Fan, any opinions here?

 - George
On 06/04/2020 18:58, George Dunlap wrote:
>
>> On Apr 3, 2020, at 9:27 PM, Julien Grall <julien.grall.oss@gmail.com> wrote:
>>
>> On Fri, 3 Apr 2020 at 20:41, Stefano Stabellini <sstabellini@kernel.org> wrote:
>>>
>>> On Thu, 2 Apr 2020, Julien Grall wrote:
>>>> As we discussed on Tuesday, the cost for other vCPUs is only going to be a
>>>> trap to the hypervisor and then back again. The cost is likely smaller than
>>>> receiving and forwarding an interrupt.
>>>>
>>>> You actually agreed on this analysis. So can you enlighten me as to why
>>>> receiving an interrupt is not a problem for latency but this is?
>>>
>>> My answer was that the difference is that an operating system can
>>> disable interrupts, but it cannot disable receiving this special IPI.
>>
>> An OS can *only* disable its own interrupts, yet interrupts will still
>> be received by Xen even if the interrupts are masked at the processor
>> (e.g. local_irq_disable()).
>>
>> You would need to disable interrupts one by one at the GIC level (use
>> ICENABLER) in order to not receive any interrupts. Yet, Xen may still
>> receive interrupts for operational purposes (e.g. serial, console,
>> maintenance IRQ...). So traps will happen.
>
> I think Stefano’s assertion is that the users he has in mind will be configuring the system such that RT workloads get the minimum number of hypervisor-related interrupts possible. On a 4-core system, you could have non-RT workloads running on cores 0-1, and RT workloads running with the NULL scheduler on cores 2-3. In such a system, you’d obviously arrange that serial and maintenance IRQs are delivered to cores 0-1.

Well, maintenance IRQs are per pCPU so you can't route them to another one...

But I think you missed my point that local_irq_disable() from the guest will not prevent the hypervisor from receiving interrupts, *even* the ones routed to the vCPU itself. They will just not be delivered to the guest context until local_irq_enable() is called.
> Julien, I think your argument above about interrupts somewhat undermines your point about deadlock. Your basic argument is that a guest will be interrupted by Xen quite frequently anyway for lots of reasons; adding one more to the list shouldn’t measurably increase the jitter. But if it’s true that a guest will be interrupted by Xen frequently, then deadlock shouldn’t be much of an issue; and insofar as deadlock is an issue, it’s because there were no interrupts — and thus, adding an extra IPI will have a statistically significant effect on jitter.

I presented two ways to disable interrupts because Stefano's e-mail was not clear enough about which one he was referring to. I was hoping to save some round-trips, but it looks like I muddled my point.

If Stefano was referring to critical sections where interrupts are masked using local_irq_disable(), then you are not going to limit the number of traps at all (see why above). So the point here was moot.

I don't believe Stefano was referring to the second case, as disabling interrupts at the GIC level would require trapping into the hypervisor. But I thought I would mention it.

Regarding the livelock (Marc pointed out it wasn't a deadlock), there are multiple conflicting use cases. In an ideal world, there will be limited reasons to interrupt the guest, and yes, it will result in a livelock. In a less ideal world, there will sometimes be interruptions, but you don't know when. If you look at determinism, then I am afraid that Stefano's implementation is not deterministic either, because you don't know when the vCPU will exit.

> On the other hand, Stefano’s argument about deadlock (if I understand him correctly) has been that guests shouldn’t really be spinning on that register anyway. But it sounds from Julien’s other email that perhaps spinning on the register is exactly what newer versions of Linux do — so Linux would certainly hang on such systems if Stefano’s assertion about the low number of Xen-related interrupts is true.
Rather than playing the game "one person's word against the other person's word", from Linux 5.4:

	do {
		unsigned long flags;

		/*
		 * Wait until we're out of the critical section.  This might
		 * give the wrong answer due to the lack of memory barriers.
		 */
		while (irqd_irq_inprogress(&desc->irq_data))
			cpu_relax();

		/* Ok, that indicated we're done: double-check carefully. */
		raw_spin_lock_irqsave(&desc->lock, flags);
		inprogress = irqd_irq_inprogress(&desc->irq_data);

		/*
		 * If requested and supported, check at the chip whether it
		 * is in flight at the hardware level, i.e. already pending
		 * in a CPU and waiting for service and acknowledge.
		 */
		if (!inprogress && sync_chip) {
			/*
			 * Ignore the return code. inprogress is only updated
			 * when the chip supports it.
			 */
			__irq_get_irqchip_state(irqd, IRQCHIP_STATE_ACTIVE,
						&inprogress);
		}
		raw_spin_unlock_irqrestore(&desc->lock, flags);

		/* Oops, that failed? */
	} while (inprogress);

> If the latter is true, then it sounds like addressing the deadlock issue will be necessary? And so effort needs to be put towards figuring out how to minimize jitter due to Xen-related IPIs.

Here is what I wrote before:

"
A few remarks:
    * The function do_nothing() is basically a NOP.
    * I am suggesting to use smp_call_function() rather than
      smp_send_event_check_cpu() because we need to know when the vCPU has
      synchronized the LRs. As the function do_nothing() will be called
      afterwards, then we know the snapshot of the LRs has been done.
    * It would be possible to do everything in one vCPU.
    * We can possibly optimize it for the SGIs/PPIs case.

This is still not perfect, but I don't think the performance would be that bad.
"

"
But if your concern is to disturb a vCPU which has interrupts disabled, then there is a way to mitigate this in an implementation: you can easily know whether there are interrupts in flight at a given point for a given vCPU. So you can avoid the IPI if you know there are no interrupts potentially active.
" I am looking forward to hear about the potential improvements. Cheers,
> On Apr 6, 2020, at 7:47 PM, Julien Grall <julien@xen.org> wrote:
>
> On 06/04/2020 18:58, George Dunlap wrote:
>>> On Apr 3, 2020, at 9:27 PM, Julien Grall <julien.grall.oss@gmail.com> wrote:
>>>
>>> On Fri, 3 Apr 2020 at 20:41, Stefano Stabellini <sstabellini@kernel.org> wrote:
>>>>
>>>> On Thu, 2 Apr 2020, Julien Grall wrote:
>>>>> As we discussed on Tuesday, the cost for other vCPUs is only going to be a
>>>>> trap to the hypervisor and then back again. The cost is likely smaller than
>>>>> receiving and forwarding an interrupt.
>>>>>
>>>>> You actually agreed on this analysis. So can you enlighten me as to why
>>>>> receiving an interrupt is not a problem for latency but this is?
>>>>
>>>> My answer was that the difference is that an operating system can
>>>> disable interrupts, but it cannot disable receiving this special IPI.
>>>
>>> An OS can *only* disable its own interrupts, yet interrupts will still
>>> be received by Xen even if the interrupts are masked at the processor
>>> (e.g. local_irq_disable()).
>>>
>>> You would need to disable interrupts one by one at the GIC level (use
>>> ICENABLER) in order to not receive any interrupts. Yet, Xen may still
>>> receive interrupts for operational purposes (e.g. serial, console,
>>> maintenance IRQ...). So traps will happen.
>> I think Stefano’s assertion is that the users he has in mind will be configuring the system such that RT workloads get the minimum number of hypervisor-related interrupts possible. On a 4-core system, you could have non-RT workloads running on cores 0-1, and RT workloads running with the NULL scheduler on cores 2-3. In such a system, you’d obviously arrange that serial and maintenance IRQs are delivered to cores 0-1.
> Well, maintenance IRQs are per pCPU so you can't route them to another one...
>
> But, I think you missed my point that local_irq_disable() from the guest will not prevent the hypervisor from receiving interrupts, *even* the ones routed to the vCPU itself.
> They will just not be delivered to the guest context until local_irq_enable() is called.

My understanding from Stefano was that what his customers are concerned about is the time between the time a physical IRQ is delivered to the guest and the time the guest OS can respond appropriately. The key thing here isn’t necessarily speed, but predictability — system designers need to know that, with a high probability, their interrupt routines will complete within X amount of cycles.

Further interrupts delivered to a guest are not a problem in this scenario, if the guest can disable them until the critical IRQ has been handled. Xen-related IPIs, however, could potentially cause a problem if not mitigated. Consider a guest where vcpu 1 loops over the register, while vcpu 2 is handling a latency-critical IRQ. A naive implementation might send an IPI every time vcpu 1 does a read, spamming vcpu 2 with dozens of IPIs. Then an IRQ routine which normally finishes reliably well within the required time suddenly overruns and causes an issue.

I don’t know what maintenance IRQs are, but if they only happen intermittently, it’s possible that you’d never get more than a single one in a latency-critical IRQ routine; and as such, the variability in execution time (jitter) wouldn’t be an issue in practice. But every time you add a new unblockable IPI, you increase this jitter; particularly if this unblockable IPI might be repeated an arbitrary number of times.

(Stefano, let me know if I’ve misunderstood something.)

So stepping back a moment, here are all the possible ideas that I think have been discussed (or are there implicitly) so far.

1. [Default] Do nothing; guests using this register continue crashing

2. Make the I?ACTIVER registers RZWI.

3. Make I?ACTIVER return the most recent known value; i.e. KVM’s current behavior (as far as we understand it)

4. Use a simple IPI with do_noop to update I?ACTIVER

4a. Use an IPI, but come up with clever tricks to avoid interrupting guests handling IRQs.

5. Trap to Xen on guest EOI, so that we know when the interrupt is deactivated

6. Some clever paravirtualized option

Obviously nobody wants #1, and #3 is clearly not really an option now either.

#2 is not great, but it’s simple and quick to implement for now. Julien, I’m not sure of your position on this one: you rejected the idea back in v1 of this patch series, but seemed to refer to it again earlier in this thread.

#4 is relatively quick to implement in a “dumb” version, but such a “dumb” version has a high risk of causing unacceptable jitter (or at least, so Stefano believes).

#4a or #6 are further potential lines to explore, but would require a lot of additional design to get working right.

I think if I understand Stefano’s PoV, then #5 would actually be acceptable — the overall amount of time spent in the hypervisor would probably be greater, but it would be bounded and predictable: once someone got their IRQ handler working reliably, it would likely continue to work.

It sounds like #5 might be pretty quick to implement; and then at some point in the future if someone wants to improve performance, they can work on 4a or 6.

Any thoughts? Anything I’m missing?

 -George
On 7 Apr 2020, at 17:16, George Dunlap <George.Dunlap@citrix.com> wrote:
>
>> On Apr 6, 2020, at 7:47 PM, Julien Grall <julien@xen.org> wrote:
>>
>> On 06/04/2020 18:58, George Dunlap wrote:
>>>> On Apr 3, 2020, at 9:27 PM, Julien Grall <julien.grall.oss@gmail.com> wrote:
>>>>
>>>> On Fri, 3 Apr 2020 at 20:41, Stefano Stabellini <sstabellini@kernel.org> wrote:
>>>>>
>>>>> On Thu, 2 Apr 2020, Julien Grall wrote:
>>>>>> As we discussed on Tuesday, the cost for other vCPUs is only going to be a
>>>>>> trap to the hypervisor and then back again. The cost is likely smaller than
>>>>>> receiving and forwarding an interrupt.
>>>>>>
>>>>>> You actually agreed on this analysis. So can you enlighten me as to why
>>>>>> receiving an interrupt is not a problem for latency but this is?
>>>>>
>>>>> My answer was that the difference is that an operating system can
>>>>> disable interrupts, but it cannot disable receiving this special IPI.
>>>>
>>>> An OS can *only* disable its own interrupts, yet interrupts will still
>>>> be received by Xen even if the interrupts are masked at the processor
>>>> (e.g. local_irq_disable()).
>>>>
>>>> You would need to disable interrupts one by one at the GIC level (use
>>>> ICENABLER) in order to not receive any interrupts. Yet, Xen may still
>>>> receive interrupts for operational purposes (e.g. serial, console,
>>>> maintenance IRQ...). So traps will happen.
>>>
>>> I think Stefano’s assertion is that the users he has in mind will be configuring the system such that RT workloads get the minimum number of hypervisor-related interrupts possible. On a 4-core system, you could have non-RT workloads running on cores 0-1, and RT workloads running with the NULL scheduler on cores 2-3. In such a system, you’d obviously arrange that serial and maintenance IRQs are delivered to cores 0-1.
>>
>> Well, maintenance IRQs are per pCPU so you can't route them to another one...
>>
>> But, I think you missed my point that local_irq_disable() from the guest will not prevent the hypervisor from receiving interrupts, *even* the ones routed to the vCPU itself. They will just not be delivered to the guest context until local_irq_enable() is called.
>
> My understanding, from Stefano, was that what his customers are concerned about is the time between the time a physical IRQ is delivered to the guest and the time the guest OS can respond appropriately. The key thing here isn’t necessarily speed, but predictability — system designers need to know that, with a high probability, their interrupt routines will complete within X amount of cycles.
>
> Further interrupts delivered to a guest are not a problem in this scenario, if the guest can disable them until the critical IRQ has been handled. Xen-related IPIs, however, could potentially cause a problem if not mitigated. Consider a guest where vcpu 1 loops over the register, while vcpu 2 is handling a latency-critical IRQ. A naive implementation might send an IPI every time vcpu 1 does a read, spamming vcpu 2 with dozens of IPIs. Then an IRQ routine which reliably finishes well within the required time normally suddenly overruns and causes an issue.
>
> I don’t know what maintenance IRQs are, but if they only happen intermittently, it’s possible that you’d never get more than a single one in a latency-critical IRQ routine; and as such, the variability in execution time (jitter) wouldn’t be an issue in practice. But every time you add a new unblockable IPI, you increase this jitter; particularly if this unblockable IPI might be repeated an arbitrary number of times.
>
> (Stefano, let me know if I’ve misunderstood something.)
>
> So stepping back a moment, here are all the possible ideas that I think have been discussed (or are there implicitly) so far.
>
> 1. [Default] Do nothing; guests using this register continue crashing
>
> 2. Make the I?ACTIVER registers RZWI.
>
> 3. Make I?ACTIVER return the most recent known value; i.e. KVM’s current behavior (as far as we understand it)
>
> 4. Use a simple IPI with do_noop to update I?ACTIVER
>
> 4a. Use an IPI, but come up with clever tricks to avoid interrupting guests handling IRQs.
>
> 5. Trap to Xen on guest EOI, so that we know when the interrupt is deactivated

This is possible to do on a per-interrupt basis, or when all interrupts in the LR registers have been handled (a maintenance interrupt when there is nothing left to handle in the LR registers, used to fire other interrupts if we have more pending interrupts than the number of LR registers).

Maybe a solution making sure we get a maintenance interrupt once all interrupts in the LR registers have been handled could be a good mitigation? This would avoid syncing the other cores but would make sure the time until active interrupts are cleaned up is bounded, so the poller could be sure to have the information at some acceptable point in time.

> 6. Some clever paravirtualized option
>
> Obviously nobody wants #1, and #3 is clearly not really an option now either.
>
> #2 is not great, but it’s simple and quick to implement for now. Julien, I’m not sure of your position on this one: you rejected the idea back in v1 of this patch series, but seemed to refer to it again earlier in this thread.
>
> #4 is relatively quick to implement in a “dumb” version, but such a “dumb” version has a high risk of causing unacceptable jitter (or at least, so Stefano believes).
>
> #4a or #6 are further potential lines to explore, but would require a lot of additional design to get working right.
>
> I think if I understand Stefano’s PoV, then #5 would actually be acceptable — the overall amount of time spent in the hypervisor would probably be greater, but it would be bounded and predictable: once someone got their IRQ handler working reliably, it would likely continue to work. It sounds like #5 might be pretty quick to implement; and then at some point in the future if someone wants to improve performance, they can work on 4a or 6.

I agree this could be a good mitigation.

Bertrand
On 07/04/2020 17:16, George Dunlap wrote:
>
>> On Apr 6, 2020, at 7:47 PM, Julien Grall <julien@xen.org> wrote:
>>
>> On 06/04/2020 18:58, George Dunlap wrote:
>>>> On Apr 3, 2020, at 9:27 PM, Julien Grall <julien.grall.oss@gmail.com> wrote:
>>>>
>>>> On Fri, 3 Apr 2020 at 20:41, Stefano Stabellini <sstabellini@kernel.org> wrote:
>>>>>
>>>>> On Thu, 2 Apr 2020, Julien Grall wrote:
>>>>>> As we discussed on Tuesday, the cost for other vCPUs is only going to be a
>>>>>> trap to the hypervisor and then back again. The cost is likely smaller than
>>>>>> receiving and forwarding an interrupt.
>>>>>>
>>>>>> You actually agreed on this analysis. So can you enlighten me as to why
>>>>>> receiving an interrupt is not a problem for latency but this is?
>>>>>
>>>>> My answer was that the difference is that an operating system can
>>>>> disable interrupts, but it cannot disable receiving this special IPI.
>>>>
>>>> An OS can *only* disable its own interrupts, yet interrupts will still
>>>> be received by Xen even if the interrupts are masked at the processor
>>>> (e.g. local_irq_disable()).
>>>>
>>>> You would need to disable interrupts one by one at the GIC level (use
>>>> ICENABLER) in order to not receive any interrupts. Yet, Xen may still
>>>> receive interrupts for operational purposes (e.g. serial, console,
>>>> maintenance IRQ...). So traps will happen.
>>>
>>> I think Stefano’s assertion is that the users he has in mind will be configuring the system such that RT workloads get the minimum number of hypervisor-related interrupts possible. On a 4-core system, you could have non-RT workloads running on cores 0-1, and RT workloads running with the NULL scheduler on cores 2-3. In such a system, you’d obviously arrange that serial and maintenance IRQs are delivered to cores 0-1.
>>
>> Well maintenance IRQs are per pCPU so you can't route to another one...
>>
>> But, I think you missed my point that local_irq_disable() from the guest will not prevent the hypervisor from receiving interrupts, *even* the ones routed to the vCPU itself. They will just not be delivered to the guest context until local_irq_enable() is called.
>
> My understanding from Stefano was that what his customers are concerned about is the time between the time a physical IRQ is delivered to the guest and the time the guest OS can respond appropriately. The key thing here isn’t necessarily speed, but predictability — system designers need to know that, with a high probability, their interrupt routines will complete within X amount of cycles.
>
> Further interrupts delivered to a guest are not a problem in this scenario, if the guest can disable them until the critical IRQ has been handled.

You keep saying a guest can disable interrupts, but it can't do it via local_irq_disable(). So what method are you thinking of? Disabling at the GIC level? That involves traps and is most likely not going to help with predictability...

> Xen-related IPIs, however, could potentially cause a problem if not mitigated. Consider a guest where vcpu 1 loops over the register, while vcpu 2 is handling a latency-critical IRQ. A naive implementation might send an IPI every time vcpu 1 does a read, spamming vcpu 2 with dozens of IPIs. Then an IRQ routine which reliably finishes well within the required time normally suddenly overruns and causes an issue.

I never suggested the naive implementation would be perfect. That's why I said it can be optimized...

> I don’t know what maintenance IRQs are, but if they only happen intermittently, it’s possible that you’d never get more than a single one in a latency-critical IRQ routine; and as such, the variability in execution time (jitter) wouldn’t be an issue in practice. But every time you add a new unblockable IPI, you increase this jitter; particularly if this unblockable IPI might be repeated an arbitrary number of times.
>
> (Stefano, let me know if I’ve misunderstood something.)
>
> So stepping back a moment, here’s all the possible ideas that I think have been discussed (or are there implicitly) so far.
>
> 1. [Default] Do nothing; guests using this register continue crashing
>
> 2. Make the I?ACTIVER registers RZWI.
>
> 3. Make I?ACTIVER return the most recent known value; i.e. KVM’s current behavior (as far as we understand it)
>
> 4. Use a simple IPI with do_noop to update I?ACTIVER
>
> 4a. Use an IPI, but come up with clever tricks to avoid interrupting guests handling IRQs.
>
> 5. Trap to Xen on guest EOI, so that we know when the
>
> 6. Some clever paravirtualized option

7. Use an IPI if we are confident the interrupts may be active.

> Obviously nobody wants #1, and #3 is clearly not really an option now either.
>
> #2 is not great, but it’s simple and quick to implement for now. Julien, I’m not sure of your position on this one: you rejected the idea back in v1 of this patch series, but seemed to refer to it again earlier in this thread.
>
> #4 is relatively quick to implement in a “dumb” version, but such a “dumb” version has a high risk of causing unacceptable jitter (or at least, so Stefano believes).
>
> #4a or #6 are further potential lines to explore, but would require a lot of additional design to get working right.
>
> I think if I understand Stefano’s PoV, then #5 would actually be acceptable — the overall amount of time spent in the hypervisor would probably be greater, but it would be bounded and predictable: once someone got their IRQ handler working reliably, it would likely continue to work.
>
> It sounds like #5 might be pretty quick to implement; and then at some point in the future if someone wants to improve performance, they can work on 4a or 6.

With the existing solution, you only trap when receiving the next interrupt. But with your approach you are now trapping once on deactivation and potentially trapping again soon after when receiving an interrupt.
So you increase the per-interrupt cost. While I agree this is going to make Stefano's customers happy, you are also going to make unhappy anyone with a workload using a high number of interrupts (e.g. media). Given that none of Stefano's customers are actually using ISACTIVER today, it feels to me it is wrong to add pain for everyone else.

It feels to me we want to either implement 4. or 7. and then optimize based on feedback.

Cheers,
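For reference, option 4 ("simple IPI with do_noop") can be modelled in a few lines. This is a hedged sketch, not the Xen implementation: a plain function call stands in for the IPI-induced trap that would make each vCPU sync its List Registers back into the shared vGIC state before the ISACTIVER value is assembled.

```c
#include <assert.h>
#include <stdint.h>

#define NR_VCPUS 2

/* Toy per-vCPU state: 'lr_active' is what the hardware LRs currently
 * hold, 'shadow_active' is the last state synced back into the vGIC. */
struct vcpu {
    uint32_t lr_active;
    uint32_t shadow_active;
};

/* Stand-in for the IPI + trap: entering the hypervisor syncs the LRs. */
static void sync_from_lrs(struct vcpu *v)
{
    v->shadow_active = v->lr_active;
}

/* Option 4: kick every vCPU so its shadow state is up to date, then
 * build the ISACTIVER value from the shadow copies. */
static uint32_t read_isactiver(struct vcpu *vcpus)
{
    uint32_t val = 0;

    for ( int i = 0; i < NR_VCPUS; i++ )
    {
        sync_from_lrs(&vcpus[i]);  /* real code: send IPI, wait for exit */
        val |= vcpus[i].shadow_active;
    }

    return val;
}
```

The trade-off Julien describes is visible here: the sync is what makes the value correct, and it is also what interrupts every other vCPU.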
On 07/04/2020 17:25, Bertrand Marquis wrote:
> This is possible to do on a per-interrupt basis or when all interrupts
> in the LR registers have been handled (maintenance interrupt when there
> is nothing left to handle in the LR registers, used to fire other interrupts
> if we have more pending interrupts than LR registers).
>
> Maybe a solution making sure we get a maintenance interrupt once all
> interrupts in the LR registers have been handled could be a good mitigation?

The chance of having multiple interrupts in flight is fairly limited. So effectively, you will trap for every interrupt.

Cheers,
On 07/04/2020 17:50, Julien Grall wrote: > > > On 07/04/2020 17:16, George Dunlap wrote: >> >> >>> On Apr 6, 2020, at 7:47 PM, Julien Grall <julien@xen.org> wrote: >>> >>> On 06/04/2020 18:58, George Dunlap wrote: >>>>> On Apr 3, 2020, at 9:27 PM, Julien Grall >>>>> <julien.grall.oss@gmail.com> wrote: >>>>> >>>>> On Fri, 3 Apr 2020 at 20:41, Stefano Stabellini >>>>> <sstabellini@kernel.org> wrote: >>>>>> >>>>>> On Thu, 2 Apr 2020, Julien Grall wrote: >>>>>>> As we discussed on Tuesday, the cost for other vCPUs is only >>>>>>> going to be a >>>>>>> trap to the hypervisor and then back again. The cost is likely >>>>>>> smaller than >>>>>>> receiving and forwarding an interrupt. >>>>>>> >>>>>>> You actually agreed on this analysis. So can you enlighten me as >>>>>>> to why >>>>>>> receiving an interrupt is a not problem for latency but this is? >>>>>> >>>>>> My answer was that the difference is that an operating system can >>>>>> disable interrupts, but it cannot disable receiving this special IPI. >>>>> >>>>> An OS can *only* disable its own interrupts, yet interrupts will still >>>>> be received by Xen even if the interrupts are masked at the processor >>>>> (e.g local_irq_disable()). >>>>> >>>>> You would need to disable interrupts one by one as the GIC level (use >>>>> ICENABLER) in order to not receive any interrupts. Yet, Xen may still >>>>> receive interrupts for operational purposes (e.g serial, console, >>>>> maintainance IRQ...). So trap will happen. >>>> I think Stefano’s assertion is that the users he has in mind will be >>>> configuring the system such that RT workloads get a minimum number >>>> of hypervisor-related interrupts possible. On a 4-core system, you >>>> could have non-RT workloads running on cores 0-1, and RT workloads >>>> running with the NULL scheduler on cores 2-3. In such a system, >>>> you’d obviously arrange that serial and maintenance IRQs are >>>> delivered to cores 0-1. 
>>> Well maintenance IRQs are per pCPU so you can't route them to another one...
>>>
>>> But, I think you missed my point that local_irq_disable() from the guest will not prevent the hypervisor from receiving interrupts, *even* the ones routed to the vCPU itself. They will just not be delivered to the guest context until local_irq_enable() is called.
>>
>> My understanding from Stefano was that what his customers are concerned about is the time between the time a physical IRQ is delivered to the guest and the time the guest OS can respond appropriately. The key thing here isn’t necessarily speed, but predictability — system designers need to know that, with a high probability, their interrupt routines will complete within X amount of cycles.
>>
>> Further interrupts delivered to a guest are not a problem in this scenario, if the guest can disable them until the critical IRQ has been handled.
>
> You keep saying a guest can disable interrupts, but it can't do it via local_irq_disable(). So what method are you thinking of? Disabling at the GIC level? That involves traps and is most likely not going to help with predictability...

Just to be clear, I meant interrupts received by Xen, including the ones routed to that vCPU.

Cheers,
> On Apr 7, 2020, at 5:50 PM, Julien Grall <julien@xen.org> wrote: > > > > On 07/04/2020 17:16, George Dunlap wrote: >>> On Apr 6, 2020, at 7:47 PM, Julien Grall <julien@xen.org> wrote: >>> >>> On 06/04/2020 18:58, George Dunlap wrote: >>>>> On Apr 3, 2020, at 9:27 PM, Julien Grall <julien.grall.oss@gmail.com> wrote: >>>>> >>>>> On Fri, 3 Apr 2020 at 20:41, Stefano Stabellini <sstabellini@kernel.org> wrote: >>>>>> >>>>>> On Thu, 2 Apr 2020, Julien Grall wrote: >>>>>>> As we discussed on Tuesday, the cost for other vCPUs is only going to be a >>>>>>> trap to the hypervisor and then back again. The cost is likely smaller than >>>>>>> receiving and forwarding an interrupt. >>>>>>> >>>>>>> You actually agreed on this analysis. So can you enlighten me as to why >>>>>>> receiving an interrupt is a not problem for latency but this is? >>>>>> >>>>>> My answer was that the difference is that an operating system can >>>>>> disable interrupts, but it cannot disable receiving this special IPI. >>>>> >>>>> An OS can *only* disable its own interrupts, yet interrupts will still >>>>> be received by Xen even if the interrupts are masked at the processor >>>>> (e.g local_irq_disable()). >>>>> >>>>> You would need to disable interrupts one by one as the GIC level (use >>>>> ICENABLER) in order to not receive any interrupts. Yet, Xen may still >>>>> receive interrupts for operational purposes (e.g serial, console, >>>>> maintainance IRQ...). So trap will happen. >>>> I think Stefano’s assertion is that the users he has in mind will be configuring the system such that RT workloads get a minimum number of hypervisor-related interrupts possible. On a 4-core system, you could have non-RT workloads running on cores 0-1, and RT workloads running with the NULL scheduler on cores 2-3. In such a system, you’d obviously arrange that serial and maintenance IRQs are delivered to cores 0-1. >>> Well maintenance IRQs are per pCPU so you can't route to another one... 
>>>
>>> But, I think you missed my point that local_irq_disable() from the guest will not prevent the hypervisor from receiving interrupts, *even* the ones routed to the vCPU itself. They will just not be delivered to the guest context until local_irq_enable() is called.
>>
>> My understanding from Stefano was that what his customers are concerned about is the time between the time a physical IRQ is delivered to the guest and the time the guest OS can respond appropriately. The key thing here isn’t necessarily speed, but predictability — system designers need to know that, with a high probability, their interrupt routines will complete within X amount of cycles.
>>
>> Further interrupts delivered to a guest are not a problem in this scenario, if the guest can disable them until the critical IRQ has been handled.
>
> You keep saying a guest can disable interrupts, but it can't do it via local_irq_disable(). So what method are you thinking of? Disabling at the GIC level? That involves traps and is most likely not going to help with predictability...

So you’ll have to forgive me for making educated guesses here, as I’m trying to collect all the information. On x86, if you use device pass-through on a system with a virtualized APIC and posted interrupts, then when the device generates interrupts, those are delivered directly to the guest without involvement of Xen; and when the guest disables interrupts in the vAPIC, those interrupts will be disabled, and be delivered when the guest re-enables interrupts. Given what Stefano said about disabling interrupts, I assumed that ARM had the same sort of functionality. Is that not the case?

>> Xen-related IPIs, however, could potentially cause a problem if not mitigated. Consider a guest where vcpu 1 loops over the register, while vcpu 2 is handling a latency-critical IRQ. A naive implementation might send an IPI every time vcpu 1 does a read, spamming vcpu 2 with dozens of IPIs.
Then an IRQ routine which reliably finishes well within the required time normally suddenly overruns and causes an issue. > > I never suggested the naive implementation would be perfect. That's why I said it can be optimized... It didn’t seem to me that you understood what Stefano’s concerns were; so I was trying to explain the situation he is trying to avoid (as well as making sure that I had a clear understanding myself). The reason I said “a naive implementation” was to make clear that I knew that’s not what you were suggesting. :-) >> I don’t know what maintenance IRQs are, but if they only happen intermittently, it’s possible that you’d never get more than a single one in a latency-critical IRQ routine; and as such, the variatibility in execution time (jitter) wouldn’t be an issue in practice. But every time you add a new unblockable IPI, you increase this jitter; particularly if this unblockable IPI might be repeated an arbitrary number of times. >> (Stefano, let me know if I’ve misunderstood something.) >> So stepping back a moment, here’s all the possible ideas that I think have been discussed (or are there implicitly) so far. >> 1. [Default] Do nothing; guests using this register continue crashing >> 2. Make the I?ACTIVER registers RZWI. >> 3. Make I?ACTIVER return the most recent known value; i.e. KVM’s current behavior (as far as we understand it) >> 4. Use a simple IPI with do_noop to update I?ACTIVER >> 4a. Use an IPI, but come up with clever tricks to avoid interrupting guests handling IRQs. >> 5. Trap to Xen on guest EOI, so that we know when the >> 6. Some clever paravirtualized option > > 7. Use an IPI if we are confident the interrupts may be active. I don’t understand this one. How is it different than 4 or 4a? And in particular, how does it evaluate on the “how much additional design work would it take” criteria? -George
Hi George, On 07/04/2020 19:16, George Dunlap wrote: > > >> On Apr 7, 2020, at 5:50 PM, Julien Grall <julien@xen.org> wrote: >> >> >> >> On 07/04/2020 17:16, George Dunlap wrote: >>>> On Apr 6, 2020, at 7:47 PM, Julien Grall <julien@xen.org> wrote: >>>> >>>> On 06/04/2020 18:58, George Dunlap wrote: >>>>>> On Apr 3, 2020, at 9:27 PM, Julien Grall <julien.grall.oss@gmail.com> wrote: >>>>>> >>>>>> On Fri, 3 Apr 2020 at 20:41, Stefano Stabellini <sstabellini@kernel.org> wrote: >>>>>>> >>>>>>> On Thu, 2 Apr 2020, Julien Grall wrote: >>>>>>>> As we discussed on Tuesday, the cost for other vCPUs is only going to be a >>>>>>>> trap to the hypervisor and then back again. The cost is likely smaller than >>>>>>>> receiving and forwarding an interrupt. >>>>>>>> >>>>>>>> You actually agreed on this analysis. So can you enlighten me as to why >>>>>>>> receiving an interrupt is a not problem for latency but this is? >>>>>>> >>>>>>> My answer was that the difference is that an operating system can >>>>>>> disable interrupts, but it cannot disable receiving this special IPI. >>>>>> >>>>>> An OS can *only* disable its own interrupts, yet interrupts will still >>>>>> be received by Xen even if the interrupts are masked at the processor >>>>>> (e.g local_irq_disable()). >>>>>> >>>>>> You would need to disable interrupts one by one as the GIC level (use >>>>>> ICENABLER) in order to not receive any interrupts. Yet, Xen may still >>>>>> receive interrupts for operational purposes (e.g serial, console, >>>>>> maintainance IRQ...). So trap will happen. >>>>> I think Stefano’s assertion is that the users he has in mind will be configuring the system such that RT workloads get a minimum number of hypervisor-related interrupts possible. On a 4-core system, you could have non-RT workloads running on cores 0-1, and RT workloads running with the NULL scheduler on cores 2-3. In such a system, you’d obviously arrange that serial and maintenance IRQs are delivered to cores 0-1. 
>>>> Well maintenance IRQs are per pCPU so you can't route to another one... >>>> >>>> But, I think you missed my point that local_irq_disable() from the guest will not prevent the hypervisor to receive interrupts *even* the one routed to the vCPU itself. They will just not be delivered to the guest context until local_irq_enable() is called. >>> My understanding, from Stefano was that what his customers are concerned about is the time between the time a physical IRQ is delivered to the guest and the time the guest OS can respond appropriately. The key thing here isn’t necessarily speed, but predictability — system designers need to know that, with a high probability, their interrupt routines will complete within X amount of cycles. >>> Further interrupts delivered to a guest are not a problem in this scenario, if the guest can disable them until the critical IRQ has been handled. >> >> You keep saying a guest can disable interrupts, but it can't do it via local_irq_disable(). So what method are you thinking? Disabling at the GIC level? That is involving traps and most likely not going to help with predictability... > > So you’ll have to forgive me for making educated guesses here, as I’m trying to collect all the information. On x86, if you use device pass-through on a system with a virtualized APIC and posted interrupts, then when when the device generates interrupts, those are delivered directly to the guest without involvement of Xen; and when the guest disables interrupts in the vAPIC, those interrupts will be disabled, and be delivered when the guest re-enables interrupts. Given what Stefano said about disabling interrupts, I assumed that ARM had the same sort of functionality. Is that not the case? Posted interrupts are only present in GICv4 and onwards. GICv4 only supports direct injections for LPIs (e.g MSIs) and GICv4.1 added support for direct injection of SGIs (aka IPIs). 
Xen on Arm does not support any posted interrupts at all, and I don't believe Stefano has HW (at least in production) using GICv4.

>
>>> Xen-related IPIs, however, could potentially cause a problem if not mitigated. Consider a guest where vcpu 1 loops over the register, while vcpu 2 is handling a latency-critical IRQ. A naive implementation might send an IPI every time vcpu 1 does a read, spamming vcpu 2 with dozens of IPIs. Then an IRQ routine which reliably finishes well within the required time normally suddenly overruns and causes an issue.
>>
>> I never suggested the naive implementation would be perfect. That's why I said it can be optimized...
>
> It didn’t seem to me that you understood what Stefano’s concerns were; so I was trying to explain the situation he is trying to avoid (as well as making sure that I had a clear understanding myself). The reason I said “a naive implementation” was to make clear that I knew that’s not what you were suggesting. :-)

I know Stefano's concerns; sorry it was not clear enough :).

>
>>> I don’t know what maintenance IRQs are, but if they only happen intermittently, it’s possible that you’d never get more than a single one in a latency-critical IRQ routine; and as such, the variability in execution time (jitter) wouldn’t be an issue in practice. But every time you add a new unblockable IPI, you increase this jitter; particularly if this unblockable IPI might be repeated an arbitrary number of times.
>>> (Stefano, let me know if I’ve misunderstood something.)
>>> So stepping back a moment, here’s all the possible ideas that I think have been discussed (or are there implicitly) so far.
>>> 1. [Default] Do nothing; guests using this register continue crashing
>>> 2. Make the I?ACTIVER registers RZWI.
>>> 3. Make I?ACTIVER return the most recent known value; i.e. KVM’s current behavior (as far as we understand it)
>>> 4. Use a simple IPI with do_noop to update I?ACTIVER
>>> 4a.
Use an IPI, but come up with clever tricks to avoid interrupting guests handling IRQs.
>>> 5. Trap to Xen on guest EOI, so that we know when the
>>> 6. Some clever paravirtualized option
>>
>> 7. Use an IPI if we are confident the interrupts may be active.
>
> I don’t understand this one. How is it different than 4 or 4a? And in particular, how does it evaluate on the “how much additional design work would it take” criteria?

Let me start with: if we want to have a vGIC to rule them all, then I am afraid you are going to have to compromise. We can discuss tailoring the vGIC, but before that I think we need a vGIC that is faithful to the spec (e.g. differentiating level vs edge interrupts, handling active state...). Note that Arm spent some effort to get a new vGIC merged, but it was never made a first-class citizen.

However, even if you tailor the vGIC, you are not going to be able to avoid IPIs on some occasions. This would happen when using event channels, in-guest IPIs... Another example is when enabling an interrupt, although I realize that our vGIC does not do it today, meaning that an interrupt that becomes pending while disabled will not be forwarded until the vCPU exits.

Furthermore, implementing a write to I{C,S}ACTIVER (to activate/de-activate) is going to be very difficult (not to say impossible) to do without an IPI.

If you are worried about a vCPU being interrupted in a critical section, then I think you should tailor your guest to avoid using those registers. An alternative would be to use a spinlock/mutex within the guest code to prevent a vCPU from accessing the vGIC registers while another vCPU doesn't want to be interrupted.

Regarding 4a: the only way I could think of would be to trap the instructions that mask/unmask interrupts. If I read the Armv8 spec correctly, it is not possible to do that.

7. is basically 4a; the goal would be to avoid interrupting the vCPU as much as possible, but you may be wrong sometimes. Obviously we want to be correct most of the time.
I understand this may not be the ideal solution, but it is probably the best we could come up with, and it does not involve a lot of work because we already have all the information in place (we know when an interrupt was injected to a vCPU).

The next best solution is to prevent your guest from modifying some registers of the vGIC while another vCPU is in a critical section.

Cheers,
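The guest-side discipline Julien describes could look roughly like the following. All names here are hypothetical; the point is only that a C11 atomic flag, raised around latency-critical sections, lets the polling vCPU defer its I?ACTIVER reads instead of triggering a sync that might IPI the RT vCPU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Guest-side flag: set while any vCPU is in a latency-critical section. */
static atomic_bool in_critical_section;

static void critical_enter(void) { atomic_store(&in_critical_section, true); }
static void critical_exit(void)  { atomic_store(&in_critical_section, false); }

/* The polling vCPU only touches I?ACTIVER (which may cause IPIs to
 * other vCPUs) when no one is in a critical section; otherwise it
 * defers and retries later, e.g. from a timer. */
static bool try_poll_isactiver(bool *deferred)
{
    if ( atomic_load(&in_critical_section) )
    {
        *deferred = true;
        return false;
    }
    *deferred = false;
    /* real code: read GICD_ISACTIVERn here */
    return true;
}
```

This is purely a cooperative scheme inside one guest, which is why it does not need any hypervisor support.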
Hi Stefano,

On 03/04/2020 20:41, Stefano Stabellini wrote:
> On Thu, 2 Apr 2020, Julien Grall wrote:
> As you know I cannot reproduce the crash myself, I asked Peng and Wei
> for help in that. I cannot be certain Jeff's patch makes a difference,
> but looking at the code, if you open
> xen/arch/arm/vgic-v3.c:__vgic_v3_distr_common_mmio_read you can see that
> the range mistake is still there:
>
>     /* Read the active status of an IRQ via GICD/GICR is not supported */
>     case VRANGE32(GICD_ISACTIVER, GICD_ISACTIVER):
>     case VRANGE32(GICD_ICACTIVER, GICD_ICACTIVERN):
>         goto read_as_zero;
>
> So a GICD_ISACTIVER read of any register but the first should end up
> hitting the default case:
>
>     default:
>         printk(XENLOG_G_ERR
>                "%pv: %s: unhandled read r%d offset %#08x\n",
>                v, name, dabt.reg, reg);
>         return 0;
>     }
>
> Which returns 0 (IO_ABORT).
>
> Would you be happy to have the range fixed to be:
>
>     case VRANGE32(GICD_ISACTIVER, GICD_ISACTIVERN):
>
> instead?

I don't particularly like it, but it is not going to make any difference for Linux < 5.4. So I am not opposed to it.

However, I am a bit worried the vGIC is still going to be a pile of hacks. So I think we should continue the discussion about making it better. This includes how to implement I{C,S}ACTIVER properly and what sort of use-cases we want to support.

Cheers,
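The range mistake Stefano points at is easy to demonstrate in isolation. In a hedged model of `VRANGE32`-style matching (register offsets follow the GIC distributor map; the `in_vrange32` helper is made up for illustration), the buggy upper bound `GICD_ISACTIVER` covers only the first 4-byte register, so reads of the remaining ISACTIVER registers fall through to the default case and fault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* GICD register offsets, per the GIC architecture specification. */
#define GICD_ISACTIVER  0x300  /* first set-active register */
#define GICD_ISACTIVERN 0x37C  /* last set-active register */

/* Equivalent of matching "case VRANGE32(start, end):" in the emulation
 * switch: each 32-bit register spans 4 bytes. */
static bool in_vrange32(uint32_t off, uint32_t start, uint32_t end)
{
    return off >= start && off <= end + 3;
}
```

The assertions below show why `VRANGE32(GICD_ISACTIVER, GICD_ISACTIVER)` misses everything past the first register, while the proposed `GICD_ISACTIVERN` bound covers them all.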
On Tue, 7 Apr 2020, Julien Grall wrote: > > > > I don’t know what maintenance IRQs are, but if they only happen > > > > intermittently, it’s possible that you’d never get more than a single > > > > one in a latency-critical IRQ routine; and as such, the variatibility in > > > > execution time (jitter) wouldn’t be an issue in practice. But every > > > > time you add a new unblockable IPI, you increase this jitter; > > > > particularly if this unblockable IPI might be repeated an arbitrary > > > > number of times. > > > > (Stefano, let me know if I’ve misunderstood something.) > > > > So stepping back a moment, here’s all the possible ideas that I think > > > > have been discussed (or are there implicitly) so far. > > > > 1. [Default] Do nothing; guests using this register continue crashing > > > > 2. Make the I?ACTIVER registers RZWI. > > > > 3. Make I?ACTIVER return the most recent known value; i.e. KVM’s current > > > > behavior (as far as we understand it) > > > > 4. Use a simple IPI with do_noop to update I?ACTIVER > > > > 4a. Use an IPI, but come up with clever tricks to avoid interrupting > > > > guests handling IRQs. > > > > 5. Trap to Xen on guest EOI, so that we know when the > > > > 6. Some clever paravirtualized option > > > > > > 7. Use an IPI if we are confident the interrupts may be active. > > > > I don’t understand this one. How is it different than 4 or 4a? And in > > particular, how does it evaluate on the “how much additional design work > > would it take” criteria? > > Let me start with, if we want to have a vGIC to rule them all, then I am > afraid you are going to have to compromise. We can discuss about tailoring the > vGIC but I think before that we need a vGIC that is faithfull with the spec > (e.g differentiating level vs edge interrupts, handling activer...). Note that > Arm spent some effort to get a new vGIC merged but this was never made a first > class citizen. 
> > However, even if you tailor the vGIC, you are not going to be able to avoid > IPI in some occasions. This would happen when using event channels, in-guest > IPI... Another example is when enabling an interrupt, although I realize that > our vGIC does not do it today meaning that a pending interrupt while disabled > will not be forwarded until the vCPU exit. > > Furthermore, implementing a write to I{C,S}ACTIVER (to activate/de-activate) > is going to be very difficult (to not say impossible) to do without IPI. > > If you are worry about a vCPU been interrupted in critical section, then I > think you should tailor your guest to avoid using those registers. Let's call it option 8 "tell the user that she needs to modify her kernel." > An alternative would be to use spinlock/mutex within the code to prevent a > VCPU to access the vGIC registers while another vCPU don't want to be > interrupted. > > Regarding, 4a. The only way I could think of would be to trap the instructions > that mask/unmask interrupts. If I read correctly the Armv8 spec, it is not > possible to do it. > > 7. is basically 4.a the goal would be to avoid interrupts the vCPU has much as > possible but you may be wrong sometimes. Obviously we want to be correct most > of the times. > > I understand this may not be the ideal solution, but this is probably the best > we could come with and does not involve a lot of work because we have already > all the information in place (we know when an interrupt was injected to a > vCPU). > > The next best solution is to prevent in your guest to modify some registers of > the vGIC at the same time as another vCPU is in a critical section. Let's call this option 9. I am just thinking out loud here :) - 2 "Make the I?ACTIVER registers RZWI" As far as I can tell it should prevent the livelock because it would never return an ACTIVE state. 
It could be improved by returning the latest ACTIVE information for local interrupts and returning zero for interrupts routed to other pcpus. Not perfect, but an option.


- 5 "maintenance interrupt"

  This is good for jitter-sensitive guests but not the best for the others. We could enable it conditionally: enable maintenance interrupts only for certain vcpus/pcpus, but it is not great to have to make this kind of difference in the implementation. However, it is possible. Let's see if we can come up with something better.


- 7 "optimized IPI"

  A tiny chance of causing issues. Let's see if we can come up with anything better.


- 8 "tell the user to modify the kernel"

  We could do it in addition to 7. The issue is really how we do it. A warning message if DEBUG && if sched==null? That doesn't sound right. We could introduce a new nojitter=true command line option for Xen? It wouldn't really change the behavior of Xen, but it would enable this warning. Or, it could enable the warning and also switch the implementation of I?ACTIVER to option 2.


- 9 "prevent I?ACTIVER during critical sections"

  This could be good but I have a difficult time thinking of how we could implement it. How do we tell that the other vcpu is in or out of the critical section? (I was brainstorming a way to temporarily re-enable trapping the idling instruction WFI so that the other vcpu would definitely trap into Xen again when it is out of the critical section. But it doesn't seem possible from the spec.)
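Stefano's refinement of option 2 (accurate state for locally-routed interrupts, read-as-zero for the rest) can be sketched as follows. This is a toy model with hypothetical names, not vGIC code; it only shows the partial-visibility policy.

```c
#include <assert.h>
#include <stdint.h>

#define NR_IRQS 32

/* Toy routing/state tables, indexed by interrupt number. */
struct vgic {
    uint8_t  target_vcpu[NR_IRQS]; /* which vCPU each IRQ is routed to */
    uint32_t active;               /* active bitmap, only trusted locally */
};

/* Option 2 (refined): report active state for locally-routed
 * interrupts; interrupts routed to other pCPUs read as zero. */
static uint32_t read_isactiver_partial(const struct vgic *v, uint8_t this_vcpu)
{
    uint32_t val = 0;

    for ( int irq = 0; irq < NR_IRQS; irq++ )
        if ( v->target_vcpu[irq] == this_vcpu && (v->active & (1u << irq)) )
            val |= 1u << irq;

    return val;
}
```

As Julien notes in his reply, this is incomplete on GICv3, where a vCPU can legitimately read other re-distributors' state; the sketch only captures Stefano's proposed behaviour.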
On 09/04/2020 02:26, Stefano Stabellini wrote: > On Tue, 7 Apr 2020, Julien Grall wrote: >>>>> I don’t know what maintenance IRQs are, but if they only happen >>>>> intermittently, it’s possible that you’d never get more than a single >>>>> one in a latency-critical IRQ routine; and as such, the variatibility in >>>>> execution time (jitter) wouldn’t be an issue in practice. But every >>>>> time you add a new unblockable IPI, you increase this jitter; >>>>> particularly if this unblockable IPI might be repeated an arbitrary >>>>> number of times. >>>>> (Stefano, let me know if I’ve misunderstood something.) >>>>> So stepping back a moment, here’s all the possible ideas that I think >>>>> have been discussed (or are there implicitly) so far. >>>>> 1. [Default] Do nothing; guests using this register continue crashing >>>>> 2. Make the I?ACTIVER registers RZWI. >>>>> 3. Make I?ACTIVER return the most recent known value; i.e. KVM’s current >>>>> behavior (as far as we understand it) >>>>> 4. Use a simple IPI with do_noop to update I?ACTIVER >>>>> 4a. Use an IPI, but come up with clever tricks to avoid interrupting >>>>> guests handling IRQs. >>>>> 5. Trap to Xen on guest EOI, so that we know when the >>>>> 6. Some clever paravirtualized option >>>> >>>> 7. Use an IPI if we are confident the interrupts may be active. >>> >>> I don’t understand this one. How is it different than 4 or 4a? And in >>> particular, how does it evaluate on the “how much additional design work >>> would it take” criteria? >> >> Let me start with, if we want to have a vGIC to rule them all, then I am >> afraid you are going to have to compromise. We can discuss about tailoring the >> vGIC but I think before that we need a vGIC that is faithfull with the spec >> (e.g differentiating level vs edge interrupts, handling activer...). Note that >> Arm spent some effort to get a new vGIC merged but this was never made a first >> class citizen. 
>>
>> However, even if you tailor the vGIC, you are not going to be able to avoid
>> IPI in some occasions. This would happen when using event channels, in-guest
>> IPI... Another example is when enabling an interrupt, although I realize that
>> our vGIC does not do it today meaning that a pending interrupt while disabled
>> will not be forwarded until the vCPU exit.
>>
>> Furthermore, implementing a write to I{C,S}ACTIVER (to activate/de-activate)
>> is going to be very difficult (to not say impossible) to do without IPI.
>>
>> If you are worry about a vCPU been interrupted in critical section, then I
>> think you should tailor your guest to avoid using those registers.
>
> Let's call it option 8 "tell the user that she needs to modify her
> kernel."
>
>> An alternative would be to use spinlock/mutex within the code to prevent a
>> VCPU to access the vGIC registers while another vCPU don't want to be
>> interrupted.
>>
>> Regarding, 4a. The only way I could think of would be to trap the instructions
>> that mask/unmask interrupts. If I read correctly the Armv8 spec, it is not
>> possible to do it.
>>
>> 7. is basically 4.a the goal would be to avoid interrupts the vCPU has much as
>> possible but you may be wrong sometimes. Obviously we want to be correct most
>> of the times.
>>
>> I understand this may not be the ideal solution, but this is probably the best
>> we could come with and does not involve a lot of work because we have already
>> all the information in place (we know when an interrupt was injected to a
>> vCPU).
>>
>> The next best solution is to prevent in your guest to modify some registers of
>> the vGIC at the same time as another vCPU is in a critical section.
>
> Let's call this option 9.
>
> I am just thinking out loud here :)

Thank you for thinking out loud. Sadly, as I pointed out before, there are other parts of the vGIC facing the same problems (e.g. I{S,C}ENABLER, sending SGIs, sending event channels).
So can you enlighten me why I{S,C}ENABLER is that much a concern over the other? > > > - 2 "Make the I?ACTIVER registers RZWI" > > As far as I can tell it should prevent the livelock because it would > never return an ACTIVE state. It could be improved by returning the > latest ACTIVE information for local interrupts and returning zero for > interrupts routed to other pcpus. Not perfect but an option. How would a partial implementation help? Wouldn't this make it more difficult for a developer? Bear in mind that on GICv3 you can read all the re-distributors' information (not only yours). > > > - 5 "maintenance interrupt" > > This is good for jitter sensitive guests but not the best for the > others. We could enable it conditionally: enable maintenance > interrupts only for certain vcpus/pcpus but it is not great to have to > make this kind of difference in the implementation. However, it is > possible. Let's see if we can come up with something better. > > > - 7 "optimized IPI" > > A tiny chance of causing issues. Let's see if we can come up with > anything better. > > > - 8 "tell the user to fix modify the kernel" > > We could do it in addition to 7. The issue is really how we do it. > A warning message if DEBUG && if sched==null? That doesn't sound > right. We could introduce a new nojitter=true command line option for > Xen? It wouldn't really change the behavior of Xen, but it would > enable this warning. Or, it could enable the warning and also switch > the implementation of I?ACTIVER to option 2. This is not a sched=null specific problem. The problem would be exactly the same when you are dedicating a pCPU to a vCPU on credit or credit2. > > > - 9 "prevent I?ACTIVER during critical sections" > > This could be good but I have a difficult time thinking of how we > could implement it. How do we tell that the other vcpu is in or out of > the critical section? I believe you misread what I wrote: I didn't suggest that Xen would do it, but that the guest would, as the vCPUs belong to the same guest. Cheers,
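To make option 9 more concrete, here is a minimal, self-contained sketch of how a guest could serialise I?ACTIVER probes against its own jitter-sensitive critical sections. Everything here is hypothetical (the lock name, the helpers, and the placeholder MMIO read are illustrative only, not taken from Xen or any guest kernel):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical guest-side protocol (option 9 sketch): any vCPU reading
 * I?ACTIVER must hold the same lock that a vCPU holds while inside a
 * critical section, so the (potentially IPI-based) emulation can never
 * interrupt a vCPU that is in one.
 */
static atomic_flag gic_probe_lock = ATOMIC_FLAG_INIT;
static bool in_critical_section;

static void lock(void)   { while ( atomic_flag_test_and_set(&gic_probe_lock) ) ; }
static void unlock(void) { atomic_flag_clear(&gic_probe_lock); }

static void enter_critical_section(void) { lock(); in_critical_section = true; }
static void leave_critical_section(void) { in_critical_section = false; unlock(); }

/* Probe the active state; only ever runs with no critical section open. */
static uint32_t read_isactiver(void)
{
    uint32_t r;

    lock();
    assert(!in_critical_section);  /* guaranteed by the locking protocol */
    r = 0;  /* placeholder for the trapped GICD_ISACTIVER MMIO read */
    unlock();
    return r;
}
```

The point of the sketch is only the ordering guarantee: a vCPU cannot observe (and thus trigger emulation of) I?ACTIVER while another path of the same guest holds the lock.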
On 09/04/2020 13:56, Julien Grall wrote: > [...] > > Thank you for thinking out loud. Sadly, as I pointed out before, there are other parts of the vGIC facing the same problems (e.g. I{S,C}ENABLER, sending SGIs, sending event channels). > > So can you enlighten me why I{S,C}ENABLER is that much a concern over the other? To be clear, I am not saying I{S,C}ENABLER should not be a concern. But I would prefer if we focus on a generic solution if the problem is wider. Cheers,
On 9 Apr 2020, at 14:00, Julien Grall <julien@xen.org> wrote: > [...] > > To be clear, I am not saying I{S,C}ENABLER should not be a concern. > But I would prefer if we focus on a generic solution if the problem is wider. It seems that the problem is a bit bigger than expected and will need more discussion and thinking. I did some research on my side and there are several possible designs depending on whether the goal is performance or real-time behaviour (going from just giving the status we already have, to firing a service interrupt whenever any interrupt is acked by a guest in order to refresh our internal status). In the short term, could we not agree to fix the issue by always returning 0, and start a discussion on the vGIC implementation? Cheers Bertrand
Hi Bertrand, On 17/04/2020 16:16, Bertrand Marquis wrote: > It seems that the problem is a bit bigger than expected and will need > more discussion and thinking. > I did some research on my side and there are several possible designs > depending on whether the goal is performance or real-time behaviour > (going from just giving the status we have to firing a service interrupt > when any interrupt is acked by a guest to refresh our internal status). > > In the short term, could we not agree to fix the issue by always > returning 0 and start a discussion on the vGIC implementation? I have already pointed out multiple times now ([1], [2]) that I would not oppose the temporary solution. I think it is now a matter of someone (Stefano?) submitting it :). Cheers, [1] https://lists.xenproject.org/archives/html/xen-devel/2019-11/msg01642.html [2] https://lists.xenproject.org/archives/html/xen-devel/2020-04/msg00459.html
On Fri, 17 Apr 2020, Julien Grall wrote: > Hi Bertrand, > > On 17/04/2020 16:16, Bertrand Marquis wrote: > > It seems that the problem is a bit bigger then expected and will need more > > discussion and thinking. > > I did some research on my side and there are several design possible > > depending on what should be the goal performance or real-time behaviour > > (going from just give the status we have to fire a service interrupts when > > any interrupts is acked by a guest to refresh our internal status). > > > > In the short term, could we not agree to fix the typo by returning always 0 > > and start a discussion on the vgic implementation ? > > I have already pointed out multiple time now ([1], [2]) that I would not > oppose the temporary solution. I think this is a matter of someone (Stefano?) > to submit it :). > > Cheers, > > [1] https://lists.xenproject.org/archives/html/xen-devel/2019-11/msg01642.html > [2] https://lists.xenproject.org/archives/html/xen-devel/2020-04/msg00459.html I can submit it. Julien, would you prefer the plain always return 0 patch, or would you prefer if I tried to get the latest ISACTIVER information (like this patch does) but only for the local processor?
On 17/04/2020 17:31, Stefano Stabellini wrote: > On Fri, 17 Apr 2020, Julien Grall wrote: >> Hi Bertrand, >> >> On 17/04/2020 16:16, Bertrand Marquis wrote: >>> It seems that the problem is a bit bigger then expected and will need more >>> discussion and thinking. >>> I did some research on my side and there are several design possible >>> depending on what should be the goal performance or real-time behaviour >>> (going from just give the status we have to fire a service interrupts when >>> any interrupts is acked by a guest to refresh our internal status). >>> >>> In the short term, could we not agree to fix the typo by returning always 0 >>> and start a discussion on the vgic implementation ? >> >> I have already pointed out multiple time now ([1], [2]) that I would not >> oppose the temporary solution. I think this is a matter of someone (Stefano?) >> to submit it :). >> >> Cheers, >> >> [1] https://lists.xenproject.org/archives/html/xen-devel/2019-11/msg01642.html >> [2] https://lists.xenproject.org/archives/html/xen-devel/2020-04/msg00459.html > > I can submit it. Julien, would you prefer the plain always return 0 > patch, or would you prefer if I tried to get the latest ISACTIVER > information (like this patch does) but only for the local processor? Always returning 0. Cheers,
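For reference, the agreed short-term behaviour — treating I?ACTIVER reads as "read as zero" instead of returning a possibly stale active bitmap — can be sketched as a self-contained model. The register offsets below follow the GICv2 architecture specification; the handler shape only loosely mirrors Xen's vgic_v2_distr_mmio_read and is not the real code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* GICv2 distributor offsets for the active-state register banks. */
#define GICD_ISACTIVER  0x300
#define GICD_ISACTIVERN 0x37c
#define GICD_ICACTIVER  0x380
#define GICD_ICACTIVERN 0x3fc

/*
 * Model of the temporary fix: any 32-bit read that falls in the
 * GICD_ISACTIVERn or GICD_ICACTIVERn ranges is handled by returning 0
 * ("no interrupt is active"), instead of a stale per-vCPU snapshot.
 */
static bool handle_distr_read(uint32_t offset, uint32_t *r)
{
    if ( (offset >= GICD_ISACTIVER && offset <= GICD_ISACTIVERN) ||
         (offset >= GICD_ICACTIVER && offset <= GICD_ICACTIVERN) )
    {
        *r = 0;      /* read as zero */
        return true; /* access handled */
    }
    return false;    /* other registers are handled elsewhere */
}
```

Returning 0 is architecturally conservative: a guest polling I?ACTIVER (e.g. Linux's gic_poke_irq loop) simply sees the interrupt as no longer active and makes progress, avoiding the livelock without any cross-vCPU synchronisation.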
diff --git a/xen/arch/arm/vgic-v2.c b/xen/arch/arm/vgic-v2.c
index 64b141fea5..454408d11d 100644
--- a/xen/arch/arm/vgic-v2.c
+++ b/xen/arch/arm/vgic-v2.c
@@ -245,10 +245,19 @@ static int vgic_v2_distr_mmio_read(struct vcpu *v, mmio_info_t *info,
     case VRANGE32(GICD_ICPENDR, GICD_ICPENDRN):
         goto read_as_zero;
 
-    /* Read the active status of an IRQ via GICD is not supported */
     case VRANGE32(GICD_ISACTIVER, GICD_ISACTIVERN):
+        if ( dabt.size != DABT_WORD ) goto bad_width;
+        rank = vgic_rank_offset(v, 1, gicd_reg - GICD_ISACTIVER, DABT_WORD);
+        if ( rank == NULL ) goto read_as_zero;
+        *r = vgic_isactiver(v, 32 * rank->index);
+        return 1;
+
     case VRANGE32(GICD_ICACTIVER, GICD_ICACTIVERN):
-        goto read_as_zero;
+        if ( dabt.size != DABT_WORD ) goto bad_width;
+        rank = vgic_rank_offset(v, 1, gicd_reg - GICD_ICACTIVER, DABT_WORD);
+        if ( rank == NULL ) goto read_as_zero;
+        *r = vgic_isactiver(v, 32 * rank->index);
+        return 1;
 
     case VRANGE32(GICD_IPRIORITYR, GICD_IPRIORITYRN):
     {
diff --git a/xen/arch/arm/vgic-v3.c b/xen/arch/arm/vgic-v3.c
index 4e60ba15cc..a2cd39c45d 100644
--- a/xen/arch/arm/vgic-v3.c
+++ b/xen/arch/arm/vgic-v3.c
@@ -712,10 +712,19 @@ static int __vgic_v3_distr_common_mmio_read(const char *name, struct vcpu *v,
     case VRANGE32(GICD_ICPENDR, GICD_ICPENDR):
         goto read_as_zero;
 
-    /* Read the active status of an IRQ via GICD/GICR is not supported */
-    case VRANGE32(GICD_ISACTIVER, GICD_ISACTIVER):
+    case VRANGE32(GICD_ISACTIVER, GICD_ISACTIVERN):
+        if ( dabt.size != DABT_WORD ) goto bad_width;
+        rank = vgic_rank_offset(v, 1, reg - GICD_ISACTIVER, DABT_WORD);
+        if ( rank == NULL ) goto read_as_zero;
+        *r = vgic_isactiver(v, 32 * rank->index);
+        return 1;
+
     case VRANGE32(GICD_ICACTIVER, GICD_ICACTIVERN):
-        goto read_as_zero;
+        if ( dabt.size != DABT_WORD ) goto bad_width;
+        rank = vgic_rank_offset(v, 1, reg - GICD_ICACTIVER, DABT_WORD);
+        if ( rank == NULL ) goto read_as_zero;
+        *r = vgic_isactiver(v, 32 * rank->index);
+        return 1;
 
     case VRANGE32(GICD_IPRIORITYR, GICD_IPRIORITYRN):
     {
diff --git a/xen/arch/arm/vgic.c b/xen/arch/arm/vgic.c
index 82f524a35c..d491fa38a5 100644
--- a/xen/arch/arm/vgic.c
+++ b/xen/arch/arm/vgic.c
@@ -423,6 +423,26 @@ void vgic_enable_irqs(struct vcpu *v, uint32_t r, int n)
     }
 }
 
+uint32_t vgic_isactiver(struct vcpu *v, unsigned int start_irq)
+{
+    struct pending_irq *p;
+    unsigned int irq;
+    uint32_t r = 0;
+
+    /*
+     * The following won't reflect the latest status of interrupts on
+     * other vcpus.
+     */
+    for ( irq = start_irq; irq < start_irq + 32; irq++ )
+    {
+        p = irq_to_pending(v, irq);
+        if ( p != NULL && test_bit(GIC_IRQ_GUEST_ACTIVE, &p->status) )
+            r |= 1U << (irq - start_irq);
+    }
+
+    return r;
+}
+
 bool vgic_to_sgi(struct vcpu *v, register_t sgir, enum gic_sgi_mode irqmode,
                  int virq, const struct sgi_target *target)
 {
diff --git a/xen/include/asm-arm/vgic.h b/xen/include/asm-arm/vgic.h
index ce1e3c4bbd..a9e3f2fa60 100644
--- a/xen/include/asm-arm/vgic.h
+++ b/xen/include/asm-arm/vgic.h
@@ -288,6 +288,7 @@ extern struct vgic_irq_rank *vgic_rank_offset(struct vcpu *v, int b, int n, int
 extern struct vgic_irq_rank *vgic_rank_irq(struct vcpu *v, unsigned int irq);
 extern void vgic_disable_irqs(struct vcpu *v, uint32_t r, int n);
 extern void vgic_enable_irqs(struct vcpu *v, uint32_t r, int n);
+extern uint32_t vgic_isactiver(struct vcpu *v, unsigned int start_irq);
 extern void register_vgic_ops(struct domain *d, const struct vgic_ops *ops);
 int vgic_v2_init(struct domain *d, int *mmio_count);
 int vgic_v3_init(struct domain *d, int *mmio_count);
This is a simple implementation of GICD_ICACTIVER / GICD_ISACTIVER
reads. It doesn't take into account the latest state of interrupts on
other vCPUs. Only the current vCPU is up-to-date. A full solution is
not possible because it would require synchronization among all vCPUs,
which would be very expensive in terms of latency.

Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
CC: Wei Xu <xuwei5@hisilicon.com>
CC: Peng Fan <peng.fan@nxp.com>
---
Changes in v2:
- improve commit message
- do not invert result
- use 1U
- use common pattern with vgic_rank_offset
- move implementation into a separate function called vgic_isactiver
- add vgic2 implementation
- tested on vgic2 by hacking the Linux gicv2 driver
---
 xen/arch/arm/vgic-v2.c     | 13 +++++++++++--
 xen/arch/arm/vgic-v3.c     | 15 ++++++++++++---
 xen/arch/arm/vgic.c        | 20 ++++++++++++++++++++
 xen/include/asm-arm/vgic.h |  1 +
 4 files changed, 44 insertions(+), 5 deletions(-)
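For readers who want to check the bitmap arithmetic in the patch's vgic_isactiver() without building Xen, here is a standalone model. The bool array is a stand-in for the per-IRQ pending_irq.status with GIC_IRQ_GUEST_ACTIVE set (in Xen it comes from irq_to_pending()); everything else is illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_IRQS 1024

/* Stand-in for "GIC_IRQ_GUEST_ACTIVE is set in pending_irq.status". */
static bool active[NR_IRQS];

/*
 * Build the 32-bit ACTIVE bitmap for IRQs [start_irq, start_irq + 32),
 * bit (irq - start_irq) set when the IRQ is active -- the same
 * computation as the patch's vgic_isactiver().
 */
static uint32_t model_isactiver(unsigned int start_irq)
{
    uint32_t r = 0;
    unsigned int irq;

    for ( irq = start_irq; irq < start_irq + 32; irq++ )
        if ( active[irq] )
            r |= 1U << (irq - start_irq);

    return r;
}
```

With `start_irq = 32 * rank->index`, this is why each rank of the register bank covers exactly 32 interrupts: a word-sized read of GICD_ISACTIVER1, for example, reports IRQs 32-63.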