
[RFC,10/39] KVM: x86/xen: support upcall vector

Message ID 20190220201609.28290-11-joao.m.martins@oracle.com
State New, archived
Series x86/KVM: Xen HVM guest support

Commit Message

Joao Martins Feb. 20, 2019, 8:15 p.m. UTC
From: Ankur Arora <ankur.a.arora@oracle.com>

Add support for HVM_PARAM_CALLBACK_VIA_TYPE_VECTOR and
HVM_PARAM_CALLBACK_VIA_TYPE_EVTCHN upcall. Some Xen upcall variants do
not have an EOI for received upcalls. We handle that by directly
injecting the interrupt in the VMCS instead of going through the
LAPIC.

Note that the route @vcpu field represents the vcpu index and not a vcpu
id. The vcpu_id is architecture specific e.g. on x86 it's set to the
apic id by userspace.

Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 arch/x86/include/asm/kvm_host.h |  14 ++++++
 arch/x86/kvm/irq.c              |  14 ++++--
 arch/x86/kvm/irq_comm.c         |  11 +++++
 arch/x86/kvm/xen.c              | 106 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/xen.h              |   9 ++++
 include/linux/kvm_host.h        |  24 +++++++++
 include/uapi/linux/kvm.h        |   8 +++
 7 files changed, 183 insertions(+), 3 deletions(-)

Comments

David Woodhouse Dec. 2, 2020, 11:17 a.m. UTC | #1
On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
> @@ -176,6 +177,9 @@ int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
>         int r;
>  
>         switch (e->type) {
> +       case KVM_IRQ_ROUTING_XEN_EVTCHN:
> +               return kvm_xen_set_evtchn(e, kvm, irq_source_id, level,
> +                                      line_status);
>         case KVM_IRQ_ROUTING_HV_SINT:
>                 return kvm_hv_set_sint(e, kvm, irq_source_id, level,
>                                        line_status);
> @@ -325,6 +329,13 @@ int kvm_set_routing_entry(struct kvm *kvm,
>                 e->hv_sint.vcpu = ue->u.hv_sint.vcpu;
>                 e->hv_sint.sint = ue->u.hv_sint.sint;
>                 break;
> +       case KVM_IRQ_ROUTING_XEN_EVTCHN:
> +               e->set = kvm_xen_set_evtchn;
> +               e->evtchn.vcpu = ue->u.evtchn.vcpu;
> +               e->evtchn.vector = ue->u.evtchn.vector;
> +               e->evtchn.via = ue->u.evtchn.via;
> +
> +               return kvm_xen_setup_evtchn(kvm, e);
>         default:
>                 return -EINVAL;
>         }


Hmm. I'm not sure I'd have done it that way.

These IRQ routing entries aren't for individual event channel ports;
they don't map to kvm_xen_evtchn_send().

They actually represent the upcall to the given vCPU when any event
channel is signalled, and it's really per-vCPU configuration.

When the kernel raises (IPI, VIRQ) events on a given CPU, it doesn't
actually use these routing entries; it just uses the values in
vcpu_xen->cb.{via,vector} which were cached from the last of these IRQ
routing entries that happens to have been processed?

The VMM is *expected* to set up precisely one of these for each vCPU,
right? Would it not be better to do that via KVM_XEN_HVM_SET_ATTR?

The usage model for userspace is presumably that the VMM should set the
appropriate bit in evtchn_pending, check evtchn_mask and then call into
the kernel to do the set_irq() to inject the callback vector to the
guest?

I might be more inclined to go for a model where the kernel handles the
evtchn_pending/evtchn_mask for us. What would go into the irq routing
table is { vcpu, port# } which get passed to kvm_xen_evtchn_send().

Does that seem reasonable?

Either way, I do think we need a way for events raised in the kernel to
be signalled to userspace, if they are targeted at a vCPU which has
CALLBACK_VIA_INTX that the kernel can't do directly. So we probably
*do* need that eventfd I was objecting to earlier, except that it's not
a per-evtchn thing; it's per-vCPU.
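
What David is proposing would make the routing entry carry just the target, something like the following (a hypothetical sketch with illustrative field names, not the patch's actual uapi):

```c
#include <linux/types.h>

/* Hypothetical IRQ routing entry carrying only { vcpu, port }: the
 * kernel owns evtchn_pending/evtchn_mask and delivery goes through
 * kvm_xen_evtchn_send(), as suggested above. */
struct kvm_irq_routing_xen_evtchn {
	__u32 port;     /* event channel port number */
	__u32 vcpu;     /* target vCPU index (not APIC ID) */
	__u32 priority; /* reserved, e.g. for FIFO evtchn priority */
};
```

The point of this shape is that per-port state stays in the kernel, and userspace only names the port and its bound vCPU.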
Joao Martins Dec. 2, 2020, 1:12 p.m. UTC | #2
On 12/2/20 11:17 AM, David Woodhouse wrote:
> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
>> @@ -176,6 +177,9 @@ int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
>>         int r;
>>  
>>         switch (e->type) {
>> +       case KVM_IRQ_ROUTING_XEN_EVTCHN:
>> +               return kvm_xen_set_evtchn(e, kvm, irq_source_id, level,
>> +                                      line_status);
>>         case KVM_IRQ_ROUTING_HV_SINT:
>>                 return kvm_hv_set_sint(e, kvm, irq_source_id, level,
>>                                        line_status);
>> @@ -325,6 +329,13 @@ int kvm_set_routing_entry(struct kvm *kvm,
>>                 e->hv_sint.vcpu = ue->u.hv_sint.vcpu;
>>                 e->hv_sint.sint = ue->u.hv_sint.sint;
>>                 break;
>> +       case KVM_IRQ_ROUTING_XEN_EVTCHN:
>> +               e->set = kvm_xen_set_evtchn;
>> +               e->evtchn.vcpu = ue->u.evtchn.vcpu;
>> +               e->evtchn.vector = ue->u.evtchn.vector;
>> +               e->evtchn.via = ue->u.evtchn.via;
>> +
>> +               return kvm_xen_setup_evtchn(kvm, e);
>>         default:
>>                 return -EINVAL;
>>         }
> 
> 
> Hmm. I'm not sure I'd have done it that way.
> 
> These IRQ routing entries aren't for individual event channel ports;
> they don't map to kvm_xen_evtchn_send().
> 
> They actually represent the upcall to the given vCPU when any event
> channel is signalled, and it's really per-vCPU configuration.
> 
Right.

> When the kernel raises (IPI, VIRQ) events on a given CPU, it doesn't
> actually use these routing entries; it just uses the values in
> vcpu_xen->cb.{via,vector} which were cached from the last of these IRQ
> routing entries that happens to have been processed?
> 
Correct.

> The VMM is *expected* to set up precisely one of these for each vCPU,
> right?
> 
From guest PoV, the hypercall:

	HYPERVISOR_hvm_op(HVMOP_set_param, HVM_PARAM_CALLBACK_IRQ, ...)

(...) is global.

But this one (on more recent versions of Xen, used particularly by recent Windows guests):

	HVMOP_set_evtchn_upcall_vector

Is per-vCPU, and it's a LAPIC vector.

But indeed the VMM ends up changing the @via and @vector on an individual vCPU.
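
To make the contrast concrete, the two guest interfaces look roughly like this (struct layout follows the Xen public headers; the helper is illustrative):

```c
#include <stdint.h>

/* Per-vCPU: HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector, &arg)
 * binds an upcall LAPIC vector to a single vCPU. */
struct xen_hvm_evtchn_upcall_vector {
	uint32_t vcpu;   /* which vCPU this vector applies to */
	uint8_t  vector; /* LAPIC vector for upcall delivery */
};

/* Domain-wide: HVM_PARAM_CALLBACK_IRQ packs the delivery method into
 * the top byte of a 64-bit param (0 = GSI, 1 = PCI INTx, 2 = vector). */
static inline uint64_t callback_via_vector(uint8_t vector)
{
	return (2ULL << 56) | vector;
}
```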

> Would it not be better to do that via KVM_XEN_HVM_SET_ATTR?

It's definitely an interesting (better?) alternative, considering we use it as a vCPU attribute.

I suppose you're suggesting something like:

	KVM_XEN_ATTR_TYPE_CALLBACK

And passing the vector and callback type.

> The usage model for userspace is presumably that the VMM should set the
> appropriate bit in evtchn_pending, check evtchn_mask and then call into
> the kernel to do the set_irq() to inject the callback vector to the
> guest?
> 
Correct, that's how it works for userspace handled event channels.
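
That flow — set the bit in evtchn_pending, check evtchn_mask, then inject — can be sketched roughly like this (names are illustrative; the real ABI also has a per-vCPU evtchn_pending_sel level which is omitted here):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define EVTCHN_PORTS  1024
#define BITS_PER_WORD (8 * sizeof(unsigned long))

/* Stand-in for the relevant corner of the Xen shared_info page. */
struct shared_info_stub {
	_Atomic unsigned long evtchn_pending[EVTCHN_PORTS / BITS_PER_WORD];
	_Atomic unsigned long evtchn_mask[EVTCHN_PORTS / BITS_PER_WORD];
};

/* Returns true when the caller should now inject the upcall: the
 * pending bit was newly set and the port is not masked. */
static bool vmm_raise_evtchn(struct shared_info_stub *si, unsigned int port)
{
	unsigned int word = port / BITS_PER_WORD;
	unsigned long bit = 1UL << (port % BITS_PER_WORD);
	unsigned long old = atomic_fetch_or(&si->evtchn_pending[word], bit);

	if (old & bit)
		return false;   /* was already pending */
	if (atomic_load(&si->evtchn_mask[word]) & bit)
		return false;   /* masked; guest sees it on unmask */
	return true;            /* caller proceeds with the set_irq() */
}
```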

> I might be more inclined to go for a model where the kernel handles the
> evtchn_pending/evtchn_mask for us. What would go into the irq routing
> table is { vcpu, port# } which get passed to kvm_xen_evtchn_send().
> 

But wouldn't passing the port to the routing, and handling the sending of events,
lead to unnecessary handling of event channels which aren't handled by the kernel,
compared to just caring about injecting the upcall?

I thought from previous feedback that it was something you wanted to avoid.

> Does that seem reasonable?
> 
Otherwise, it seems reasonable to have it.

I'll let Ankur chip in too, as this was something he was more closely modelling after.

> Either way, I do think we need a way for events raised in the kernel to
> be signalled to userspace, if they are targeted at a vCPU which has
> CALLBACK_VIA_INTX that the kernel can't do directly. So we probably
> *do* need that eventfd I was objecting to earlier, except that it's not
> a per-evtchn thing; it's per-vCPU.
> 

Ah!

I wanted to mention the GSI callback method too, but wasn't entirely sure if eventfd was
enough.

	Joao
David Woodhouse Dec. 2, 2020, 4:47 p.m. UTC | #3
On Wed, 2020-12-02 at 13:12 +0000, Joao Martins wrote:
> On 12/2/20 11:17 AM, David Woodhouse wrote:
> > I might be more inclined to go for a model where the kernel handles the
> > evtchn_pending/evtchn_mask for us. What would go into the irq routing
> > table is { vcpu, port# } which get passed to kvm_xen_evtchn_send().
> > 
> 
> But wouldn't passing the port to the routing, and handling the sending of events,
> lead to unnecessary handling of event channels which aren't handled by the kernel,
> compared to just caring about injecting the upcall?

Well, I'm generally in favour of *not* doing things in the kernel that
don't need to be there.

But if the kernel is going to short-circuit the IPIs and VIRQs, then
it's already going to have to handle the evtchn_pending/evtchn_mask
bitmaps, and actually injecting interrupts.

Given that it has to have that functionality anyway, it seems saner to
let the kernel have full control over it and to just expose
'evtchn_send' to userspace.

The alternative is to have userspace trying to play along with the
atomic handling of those bitmasks too, and injecting events through
KVM_INTERRUPT/KVM_SIGNAL_MSI in parallel to the kernel doing so. That
seems like *more* complexity, not less.

> I wanted to mention the GSI callback method too, but wasn't entirely sure if eventfd was
> enough.

That actually works quite nicely even for userspace irqchip.

Forgetting Xen for the moment... my model for userspace I/OAPIC with
interrupt remapping is that during normal runtime, the irqfd is
assigned and things all work and we can even have IRQ posting for
eventfds which came from VFIO. 

When the IOMMU invalidates an IRQ translation, all it does is
*deassign* the irqfd from the KVM IRQ. So the next time that eventfd
fires, it's caught in the userspace event loop instead. Userspace can
then retranslate through the IOMMU and reassign the irqfd for next
time.

So, back to Xen. As things stand with just the hypercall+shinfo patches
I've already rebased, we have enough to do fully functional Xen
hosting. The event channels are slow but it *can* be done entirely in
userspace — handling *all* the hypercalls, and delivering interrupts to
the guest in whatever mode is required.

Event channels are a very important optimisation though. For the VMM
API I think we should follow the Xen model, mixing the domain-wide and
per-vCPU configuration. It's the best way to faithfully model the
behaviour a true Xen guest would experience.

So KVM_XEN_ATTR_TYPE_CALLBACK_VIA can be used to set one of
 • HVMIRQ_callback_vector, taking a vector#
 • HVMIRQ_callback_gsi for the in-kernel irqchip, taking a GSI#

And *maybe* in a later patch it could also handle
 • HVMIRQ_callback_gsi for split-irqchip, taking an eventfd
 • HVMIRQ_callback_pci_intx, taking an eventfd (or a pair, for EOI?)

I don't know if the latter two really make sense. After all the
argument for handling IPI/VIRQ in kernel kind of falls over if we have
to bounce out to userspace anyway. So it *only* makes sense if those
eventfds actually end up wired *through* userspace to a KVM IRQFD as I
described for the IOMMU stuff.


In addition to that per-domain setup, we'd also have a per-vCPU
KVM_XEN_ATTR_TYPE_VCPU_CALLBACK_VECTOR which takes {vCPU, vector}.
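
A possible uapi shape for those two attributes might be the following (purely a sketch of the proposal; names and layout are assumptions, not a merged interface):

```c
#include <linux/types.h>

/* Hypothetical payload for KVM_XEN_ATTR_TYPE_CALLBACK_VIA: the 'via'
 * discriminant selects which union member is meaningful. */
struct kvm_xen_callback_via {
	__u32 via;          /* HVMIRQ_callback_vector / _gsi / _pci_intx */
	union {
		__u8  vector;   /* HVMIRQ_callback_vector */
		__u32 gsi;      /* HVMIRQ_callback_gsi, in-kernel irqchip */
		__s32 eventfd;  /* split-irqchip or INTX, bounced to userspace */
	} u;
};

/* Hypothetical payload for the per-vCPU variant,
 * KVM_XEN_ATTR_TYPE_VCPU_CALLBACK_VECTOR. */
struct kvm_xen_vcpu_callback {
	__u32 vcpu;
	__u8  vector;
};
```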
Joao Martins Dec. 2, 2020, 6:34 p.m. UTC | #4
On 12/2/20 4:47 PM, David Woodhouse wrote:
> On Wed, 2020-12-02 at 13:12 +0000, Joao Martins wrote:
>> On 12/2/20 11:17 AM, David Woodhouse wrote:
>>> I might be more inclined to go for a model where the kernel handles the
>>> evtchn_pending/evtchn_mask for us. What would go into the irq routing
>>> table is { vcpu, port# } which get passed to kvm_xen_evtchn_send().
>>
>> But wouldn't passing the port to the routing, and handling the sending of events,
>> lead to unnecessary handling of event channels which aren't handled by the kernel,
>> compared to just caring about injecting the upcall?
> 
> Well, I'm generally in favour of *not* doing things in the kernel that
> don't need to be there.
> 
> But if the kernel is going to short-circuit the IPIs and VIRQs, then
> it's already going to have to handle the evtchn_pending/evtchn_mask
> bitmaps, and actually injecting interrupts.
> 
Right. I was trying to point that out in the discussion we had
on the next patch. But truth be told, it was more about touting the idea
of the kernel knowing whether a given event channel is registered for
userspace handling, rather than fully handling the event channel.

I suppose we are able to provide both options to the VMM anyway,
i.e. 1) letting it handle them entirely in userspace by intercepting
EVTCHNOP_send, or 2) through the irq route if we want the kernel to offload it.

> Given that it has to have that functionality anyway, it seems saner to
> let the kernel have full control over it and to just expose
> 'evtchn_send' to userspace.
> 
> The alternative is to have userspace trying to play along with the
> atomic handling of those bitmasks too 

That part is not particularly hard -- we have it done already.

>, and injecting events through
> KVM_INTERRUPT/KVM_SIGNAL_MSI in parallel to the kernel doing so. That
> seems like *more* complexity, not less.
>
/me nods

>> I wanted to mention the GSI callback method too, but wasn't entirely sure if eventfd was
>> enough.
> 
> That actually works quite nicely even for userspace irqchip.
> 
> Forgetting Xen for the moment... my model for userspace I/OAPIC with
> interrupt remapping is that during normal runtime, the irqfd is
> assigned and things all work and we can even have IRQ posting for
> eventfds which came from VFIO. 
> 
> When the IOMMU invalidates an IRQ translation, all it does is
> *deassign* the irqfd from the KVM IRQ. So the next time that eventfd
> fires, it's caught in the userspace event loop instead. Userspace can
> then retranslate through the IOMMU and reassign the irqfd for next
> time.
> 
> So, back to Xen. As things stand with just the hypercall+shinfo patches
> I've already rebased, we have enough to do fully functional Xen
> hosting. 

Yes -- the rest become optimizations in performance sensitive paths.

TBH (and this is slightly offtopic) the somewhat hairy part is xenbus/xenstore.
And the alternative to playing nice with xenstored is the VMM learning
to parse the xenbus messages and fake the xenstored content/transaction stuff
individually per VM.

> The event channels are slow but it *can* be done entirely in

Consulting my old notes: about twice as slow if done in userspace.

> userspace — handling *all* the hypercalls, and delivering interrupts to
> the guest in whatever mode is required.
> 
> Event channels are a very important optimisation though. 

/me nods

> For the VMM
> API I think we should follow the Xen model, mixing the domain-wide and
> per-vCPU configuration. It's the best way to faithfully model the
> behaviour a true Xen guest would experience.
> 
> So KVM_XEN_ATTR_TYPE_CALLBACK_VIA can be used to set one of
>  • HVMIRQ_callback_vector, taking a vector#
>  • HVMIRQ_callback_gsi for the in-kernel irqchip, taking a GSI#
> 
> And *maybe* in a later patch it could also handle
>  • HVMIRQ_callback_gsi for split-irqchip, taking an eventfd
>  • HVMIRQ_callback_pci_intx, taking an eventfd (or a pair, for EOI?)
> 
Most of the Xen versions we cared about had callback_vector and the
per-vCPU callback vector (despite Linux not using the latter). But if you're
dating back to Xen 3.2 and 4.1 (or certain Windows drivers), I suppose
gsi and pci-intx are must-haves.

> I don't know if the latter two really make sense. After all the
> argument for handling IPI/VIRQ in kernel kind of falls over if we have
> to bounce out to userspace anyway. 

Might as well let userspace EVTCHNOP_send handle it, I wonder.

> So it *only* makes sense if those
> eventfds actually end up wired *through* userspace to a KVM IRQFD as I
> described for the IOMMU stuff.
> 
We didn't implement the phys event channels, but most old kernels we
tested at the time (back to 2.6.XX) seemed to play along.

> In addition to that per-domain setup, we'd also have a per-vCPU
> KVM_XEN_ATTR_TYPE_VCPU_CALLBACK_VECTOR which takes {vCPU, vector}.
> 
I feel we could just accommodate it as a subtype in KVM_XEN_ATTR_TYPE_CALLBACK_VIA.
I don't see the advantage in having another xen attr type.

But I kinda have mixed feelings about having the kernel handle the whole event channel ABI,
as opposed to only the ones userspace asked to offload. It looks a tad unnecessary besides
the added gain for VMMs that don't need to care about the internals of event channels.
Performance-wise it wouldn't bring anything better. But maybe the former is reason
enough to consider it.

	Joao
David Woodhouse Dec. 2, 2020, 7:02 p.m. UTC | #5
On Wed, 2020-12-02 at 18:34 +0000, Joao Martins wrote:
> On 12/2/20 4:47 PM, David Woodhouse wrote:
> > On Wed, 2020-12-02 at 13:12 +0000, Joao Martins wrote:
> > > On 12/2/20 11:17 AM, David Woodhouse wrote:
> > > > I might be more inclined to go for a model where the kernel handles the
> > > > evtchn_pending/evtchn_mask for us. What would go into the irq routing
> > > > table is { vcpu, port# } which get passed to kvm_xen_evtchn_send().
> > > 
> > > But wouldn't passing the port to the routing, and handling the sending of events,
> > > lead to unnecessary handling of event channels which aren't handled by the kernel,
> > > compared to just caring about injecting the upcall?
> > 
> > Well, I'm generally in favour of *not* doing things in the kernel that
> > don't need to be there.
> > 
> > But if the kernel is going to short-circuit the IPIs and VIRQs, then
> > it's already going to have to handle the evtchn_pending/evtchn_mask
> > bitmaps, and actually injecting interrupts.
> > 
> 
> Right. I was trying to point that out in the discussion we had
> on the next patch. But truth be told, it was more about touting the idea
> of the kernel knowing whether a given event channel is registered for
> userspace handling, rather than fully handling the event channel.
> 
> I suppose we are able to provide both options to the VMM anyway,
> i.e. 1) letting it handle them entirely in userspace by intercepting
> EVTCHNOP_send, or 2) through the irq route if we want the kernel to offload it.

Right. The kernel takes what it knows about and anything else goes up
to userspace.

I do like the way you've handled the vcpu binding in userspace, and the
kernel just knows that a given port goes to a given target CPU.

> 
> > For the VMM
> > API I think we should follow the Xen model, mixing the domain-wide and
> > per-vCPU configuration. It's the best way to faithfully model the
> > behaviour a true Xen guest would experience.
> > 
> > So KVM_XEN_ATTR_TYPE_CALLBACK_VIA can be used to set one of
> >  • HVMIRQ_callback_vector, taking a vector#
> >  • HVMIRQ_callback_gsi for the in-kernel irqchip, taking a GSI#
> > 
> > And *maybe* in a later patch it could also handle
> >  • HVMIRQ_callback_gsi for split-irqchip, taking an eventfd
> >  • HVMIRQ_callback_pci_intx, taking an eventfd (or a pair, for EOI?)
> > 
> 
> Most of the Xen versions we cared about had callback_vector and the
> per-vCPU callback vector (despite Linux not using the latter). But if you're
> dating back to Xen 3.2 and 4.1 (or certain Windows drivers), I suppose
> gsi and pci-intx are must-haves.

Not sure about GSI, but PCI-INTX is definitely something I've seen in
active use by customers recently. I think SLES10 will use that.

> I feel we could just accommodate it as a subtype in KVM_XEN_ATTR_TYPE_CALLBACK_VIA.
> I don't see the advantage in having another xen attr type.

Yeah, fair enough.

> But I kinda have mixed feelings about having the kernel handle the whole event channel ABI,
> as opposed to only the ones userspace asked to offload. It looks a tad unnecessary besides
> the added gain for VMMs that don't need to care about the internals of event channels.
> Performance-wise it wouldn't bring anything better. But maybe the former is reason
> enough to consider it.

Yeah, we'll see. Especially when it comes to implementing FIFO event
channels, I'd rather just do it in one place — and if the kernel does
it anyway then it's hardly difficult to hook into that.

But I've been about as coherent as I can be in email, and I think we're
generally aligned on the direction. I'll do some more experiments and
see what I can get working, and what it looks like.

I'm focusing on making the shinfo stuff all use kvm_map_gfn() first.
Joao Martins Dec. 2, 2020, 8:12 p.m. UTC | #6
On 12/2/20 7:02 PM, David Woodhouse wrote:
> On Wed, 2020-12-02 at 18:34 +0000, Joao Martins wrote:
>> On 12/2/20 4:47 PM, David Woodhouse wrote:
>>> On Wed, 2020-12-02 at 13:12 +0000, Joao Martins wrote:
>>>> On 12/2/20 11:17 AM, David Woodhouse wrote:
>>> For the VMM
>>> API I think we should follow the Xen model, mixing the domain-wide and
>>> per-vCPU configuration. It's the best way to faithfully model the
>>> behaviour a true Xen guest would experience.
>>>
>>> So KVM_XEN_ATTR_TYPE_CALLBACK_VIA can be used to set one of
>>>  • HVMIRQ_callback_vector, taking a vector#
>>>  • HVMIRQ_callback_gsi for the in-kernel irqchip, taking a GSI#
>>>
>>> And *maybe* in a later patch it could also handle
>>>  • HVMIRQ_callback_gsi for split-irqchip, taking an eventfd
>>>  • HVMIRQ_callback_pci_intx, taking an eventfd (or a pair, for EOI?)
>>>
>>
>> Most of the Xen versions we cared about had callback_vector and the
>> per-vCPU callback vector (despite Linux not using the latter). But if you're
>> dating back to Xen 3.2 and 4.1 (or certain Windows drivers), I suppose
>> gsi and pci-intx are must-haves.
> 
> Not sure about GSI, but PCI-INTX is definitely something I've seen in
> active use by customers recently. I think SLES10 will use that.
> 

Some of the Windows drivers we used were relying on GSI.

I don't know what kernel SLES10 runs, but Linux has been aware
of XENFEAT_hvm_callback_vector since 2.6.35, i.e. for about 10 years.
Unless some out-of-tree patch is opting out of it, I suppose.

> 
>> But I kinda have mixed feelings about having the kernel handle the whole event channel ABI,
>> as opposed to only the ones userspace asked to offload. It looks a tad unnecessary besides
>> the added gain for VMMs that don't need to care about the internals of event channels.
>> Performance-wise it wouldn't bring anything better. But maybe the former is reason
>> enough to consider it.
> 
> Yeah, we'll see. Especially when it comes to implementing FIFO event
> channels, I'd rather just do it in one place — and if the kernel does
> it anyway then it's hardly difficult to hook into that.
> 
Fortunately that's xen 4.3 and up *I think* :) (the FIFO events)

And Linux is the one user I am aware of, IIRC.

> But I've been about as coherent as I can be in email, and I think we're
> generally aligned on the direction. 

Yes, definitely.

> I'll do some more experiments and
> see what I can get working, and what it looks like.
> 
> I'm focusing on making the shinfo stuff all use kvm_map_gfn() first.
> 
I was chatting with Ankur, and we can't fully 100% remember why we dropped using
kvm_vcpu_map/kvm_map_gfn. We were using kvm_vcpu_map() but at the time the new guest
mapping series was in discussion, so we dropped those until it settled in.

One "side effect" of mapping shared_info with kvm_vcpu_map is that we have to loop over all
vcpus, unless we move shared_info elsewhere IIRC. But switching vcpu_info, vcpu_time_info
(and the steal clock) to kvm_vcpu_map is trivial... at least based on our old wip branches here.

	Joao
David Woodhouse Dec. 2, 2020, 8:37 p.m. UTC | #7
On Wed, 2020-12-02 at 20:12 +0000, Joao Martins wrote:
> > I'll do some more experiments and
> > see what I can get working, and what it looks like.
> > 
> > I'm focusing on making the shinfo stuff all use kvm_map_gfn() first.
> > 
> 
> I was chatting with Ankur, and we can't fully 100% remember why we dropped using
> kvm_vcpu_map/kvm_map_gfn. We were using kvm_vcpu_map() but at the time the new guest
> mapping series was in discussion, so we dropped those until it settled in.
> 
> One "side effect" of mapping shared_info with kvm_vcpu_map is that we have to loop over all
> vcpus, unless we move shared_info elsewhere IIRC. But switching vcpu_info, vcpu_time_info
> (and the steal clock) to kvm_vcpu_map is trivial... at least based on our old wip branches here.

I'm not mapping/unmapping every time. I've just changed the
page_to_virt() bit to use kvm_map_gfn() as a one-time setup for now,
because I need it to work for GFNs without a backing page.

I have that working for the shinfo in my tree so far; will do the other
pages next.

In the fullness of time I'd like to eliminate the duplication between
kvm_setup_pvclock_page() and kvm_xen_update_vcpu_time(), which are
doing precisely the same thing. I think that can wait until we fix up
the MMU notifier thing as discussed though, so they can all just do
direct dereferencing.

But I'm inclined not to get hung up on that for now. There are much
more fun things to be playing with.
Ankur Arora Dec. 3, 2020, 1:08 a.m. UTC | #8
On 2020-12-02 11:02 a.m., David Woodhouse wrote:
> On Wed, 2020-12-02 at 18:34 +0000, Joao Martins wrote:
>> On 12/2/20 4:47 PM, David Woodhouse wrote:
>>> On Wed, 2020-12-02 at 13:12 +0000, Joao Martins wrote:
>>>> On 12/2/20 11:17 AM, David Woodhouse wrote:
>>>>> I might be more inclined to go for a model where the kernel handles the
>>>>> evtchn_pending/evtchn_mask for us. What would go into the irq routing
>>>>> table is { vcpu, port# } which get passed to kvm_xen_evtchn_send().
>>>>
>>>> But wouldn't passing the port to the routing, and handling the sending of events,
>>>> lead to unnecessary handling of event channels which aren't handled by the kernel,
>>>> compared to just caring about injecting the upcall?
>>>
>>> Well, I'm generally in favour of *not* doing things in the kernel that
>>> don't need to be there.
>>>
>>> But if the kernel is going to short-circuit the IPIs and VIRQs, then
>>> it's already going to have to handle the evtchn_pending/evtchn_mask
>>> bitmaps, and actually injecting interrupts.
>>>
>>
>> Right. I was trying to point that out in the discussion we had
>> on the next patch. But truth be told, it was more about touting the idea
>> of the kernel knowing whether a given event channel is registered for
>> userspace handling, rather than fully handling the event channel.
>>
>> I suppose we are able to provide both options to the VMM anyway,
>> i.e. 1) letting it handle them entirely in userspace by intercepting
>> EVTCHNOP_send, or 2) through the irq route if we want the kernel to offload it.
> 
> Right. The kernel takes what it knows about and anything else goes up
> to userspace.
> 
> I do like the way you've handled the vcpu binding in userspace, and the
> kernel just knows that a given port goes to a given target CPU.
> 
>>
>>> For the VMM
>>> API I think we should follow the Xen model, mixing the domain-wide and
>>> per-vCPU configuration. It's the best way to faithfully model the
>>> behaviour a true Xen guest would experience.
>>>
>>> So KVM_XEN_ATTR_TYPE_CALLBACK_VIA can be used to set one of
>>>   • HVMIRQ_callback_vector, taking a vector#
>>>   • HVMIRQ_callback_gsi for the in-kernel irqchip, taking a GSI#
>>>
>>> And *maybe* in a later patch it could also handle
>>>   • HVMIRQ_callback_gsi for split-irqchip, taking an eventfd
>>>   • HVMIRQ_callback_pci_intx, taking an eventfd (or a pair, for EOI?)
>>>
>>
>> Most of the Xen versions we cared about had callback_vector and the
>> per-vCPU callback vector (despite Linux not using the latter). But if you're
>> dating back to Xen 3.2 and 4.1 (or certain Windows drivers), I suppose
>> gsi and pci-intx are must-haves.
> 
> Not sure about GSI, but PCI-INTX is definitely something I've seen in
> active use by customers recently. I think SLES10 will use that.
> 
>> I feel we could just accommodate it as a subtype in KVM_XEN_ATTR_TYPE_CALLBACK_VIA.
>> I don't see the advantage in having another xen attr type.
> 
> Yeah, fair enough.
> 
>> But I kinda have mixed feelings about having the kernel handle the whole event channel ABI,
>> as opposed to only the ones userspace asked to offload. It looks a tad unnecessary besides
>> the added gain for VMMs that don't need to care about the internals of event channels.
>> Performance-wise it wouldn't bring anything better. But maybe the former is reason
>> enough to consider it.
> 
> Yeah, we'll see. Especially when it comes to implementing FIFO event
> channels, I'd rather just do it in one place — and if the kernel does
> it anyway then it's hardly difficult to hook into that.

Sorry I'm late to this conversation. Not a whole lot to add to what Joao
said. I would only differ with him on how much to offload.

Given that we need the fast path in the kernel anyway, I think it's simpler
to do all the event-channel bitmap handling in the kernel.
This would also simplify using the kernel Xen drivers if someone eventually
decides to use them.


Ankur

> 
> But I've been about as coherent as I can be in email, and I think we're
> generally aligned on the direction. I'll do some more experiments and
> see what I can get working, and what it looks like.
> 
> I'm focusing on making the shinfo stuff all use kvm_map_gfn() first.
>
David Woodhouse Dec. 8, 2020, 4:08 p.m. UTC | #9
On Wed, 2020-12-02 at 19:02 +0000, David Woodhouse wrote:
> 
> > I feel we could just accommodate it as a subtype in KVM_XEN_ATTR_TYPE_CALLBACK_VIA.
> > I don't see the advantage in having another xen attr type.
> 
> Yeah, fair enough.
> 
> > But I kinda have mixed feelings about having the kernel handle the whole event channel ABI,
> > as opposed to only the ones userspace asked to offload. It looks a tad unnecessary besides
> > the added gain for VMMs that don't need to care about the internals of event channels.
> > Performance-wise it wouldn't bring anything better. But maybe the former is reason
> > enough to consider it.
> 
> Yeah, we'll see. Especially when it comes to implementing FIFO event
> channels, I'd rather just do it in one place — and if the kernel does
> it anyway then it's hardly difficult to hook into that.
> 
> But I've been about as coherent as I can be in email, and I think we're
> generally aligned on the direction. I'll do some more experiments and
> see what I can get working, and what it looks like.


So... I did some more typing, and revived our completely userspace
based implementation of event channels. I wanted to declare that such
was *possible*, and that using the kernel for IPI and VIRQ was just a
very desirable optimisation.

It looks like Linux doesn't use the per-vCPU upcall vector that you
called 'KVM_XEN_CALLBACK_VIA_EVTCHN'. So I'm delivering interrupts via
KVM_INTERRUPT as if they were ExtINT....

... except I'm not. Because the kernel really does expect that to be an
ExtINT from a legacy PIC, and kvm_apic_accept_pic_intr() only returns
true if LVT0 is set up for EXTINT and unmasked.

I messed around with this hack and increasingly desperate variations on
the theme (since this one doesn't cause the IRQ window to be opened to
userspace in the first place), but couldn't get anything working:

--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2380,6 +2380,9 @@ int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
        if ((lvt0 & APIC_LVT_MASKED) == 0 &&
            GET_APIC_DELIVERY_MODE(lvt0) == APIC_MODE_EXTINT)
                r = 1;
+       /* Shoot me. */
+       if (vcpu->arch.pending_external_vector == 243)
+               r = 1;
        return r;
 }
 

Eventually I resorted to delivering the interrupt through the lapic
*anyway* (through KVM_SIGNAL_MSI with an MSI message constructed for
the appropriate vCPU/vector) and the following hack to auto-EOI:

--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2416,7 +2419,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
         */
 
        apic_clear_irr(vector, apic);
-       if (test_bit(vector, vcpu_to_synic(vcpu)->auto_eoi_bitmap)) {
+       if (vector == 243 || test_bit(vector, vcpu_to_synic(vcpu)->auto_eoi_bitmap)) {
                /*
                 * For auto-EOI interrupts, there might be another pending
                 * interrupt above PPR, so check whether to raise another


That works, and now my guest finishes the SMP bringup (and gets as far
as waiting on the XenStore implementation that I haven't put back yet).

So I think we need at least a tiny amount of support in-kernel for
delivering event channel interrupt vectors, even if we wanted to allow
for a completely userspace implementation.

Unless I'm missing something?

I will get on with implementing the in-kernel handling with IRQ routing
entries targeting a given { port, vcpu }. And I'm kind of vacillating
about whether the mode/vector should be separately configured, or
whether they might as well be in the IRQ routing table too, even if
it's kind of redundant because it's specified the same for *every* port
targeting the same vCPU. I *think* I prefer that redundancy over having
a separate configuration mechanism to set the vector for each vCPU. But
we'll see what happens when my fingers do the typing...
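
As a rough sketch of what that redundancy would look like (every struct and field name below is a hypothetical illustration, not the eventual uapi): each routing entry carries { port, vcpu } plus a vector that is repeated for every port aimed at the same vCPU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical routing entry targeting { port, vcpu }. The per-entry
 * vector field is the "redundant" option discussed above: every entry
 * aimed at the same vCPU repeats the same vector. */
struct xen_evtchn_route {
    uint32_t port;   /* event channel port number */
    uint32_t vcpu;   /* vCPU *index*, not the arch-specific vcpu_id */
    uint8_t  vector; /* upcall vector, duplicated across entries */
};

/* Two ports targeting vCPU 0: the vector (243 here, echoing the hack
 * earlier in this mail) is specified identically in both entries. */
static const struct xen_evtchn_route routes[] = {
    { .port = 1, .vcpu = 0, .vector = 243 },
    { .port = 4, .vcpu = 0, .vector = 243 },
};
```

The alternative being weighed is a separate per-vCPU configuration that sets the vector once, leaving the routing entries carrying only { port, vcpu }.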
Ankur Arora Dec. 9, 2020, 6:35 a.m. UTC | #10
On 2020-12-08 8:08 a.m., David Woodhouse wrote:
> On Wed, 2020-12-02 at 19:02 +0000, David Woodhouse wrote:
>>
>>> I feel we could just accommodate it as subtype in KVM_XEN_ATTR_TYPE_CALLBACK_VIA.
>>> Don't see the adavantage in having another xen attr type.
>>
>> Yeah, fair enough.
>>
>>> But I kinda have mixed feelings about having the kernel handle all of the event channels ABI,
>>> as opposed to only the ones userspace asked to offload. It looks a tad unnecessary besides
>>> the added gain to VMMs that don't need to care about the internals of event channels.
>>> But performance-wise it wouldn't bring anything better. But maybe the former is reason
>>> enough to consider it.
>>
>> Yeah, we'll see. Especially when it comes to implementing FIFO event
>> channels, I'd rather just do it in one place — and if the kernel does
>> it anyway then it's hardly difficult to hook into that.
>>
>> But I've been about as coherent as I can be in email, and I think we're
>> generally aligned on the direction. I'll do some more experiments and
>> see what I can get working, and what it looks like.
> 
> 
> So... I did some more typing, and revived our completely userspace
> based implementation of event channels. I wanted to declare that such
> was *possible*, and that using the kernel for IPI and VIRQ was just a
> very desirable optimisation.
> 
> It looks like Linux doesn't use the per-vCPU upcall vector that you
> called 'KVM_XEN_CALLBACK_VIA_EVTCHN'. So I'm delivering interrupts via
> KVM_INTERRUPT as if they were ExtINT....
> 
> ... except I'm not. Because the kernel really does expect that to be an
> ExtINT from a legacy PIC, and kvm_apic_accept_pic_intr() only returns
> true if LVT0 is set up for EXTINT and unmasked.
> 
> I messed around with this hack and increasingly desperate variations on
> the theme (since this one doesn't cause the IRQ window to be opened to
> userspace in the first place), but couldn't get anything working:

"Increasingly desperate variations" about sums up my process as well while
trying to get the upcall vector working.

> 
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2380,6 +2380,9 @@ int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
>          if ((lvt0 & APIC_LVT_MASKED) == 0 &&
>              GET_APIC_DELIVERY_MODE(lvt0) == APIC_MODE_EXTINT)
>                  r = 1;
> +       /* Shoot me. */
> +       if (vcpu->arch.pending_external_vector == 243)
> +               r = 1;
>          return r;
>   }
>   
> 
> Eventually I resorted to delivering the interrupt through the lapic
> *anyway* (through KVM_SIGNAL_MSI with an MSI message constructed for
> the appropriate vCPU/vector) and the following hack to auto-EOI:
> 
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2416,7 +2419,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
>           */
>   
>          apic_clear_irr(vector, apic);
> -       if (test_bit(vector, vcpu_to_synic(vcpu)->auto_eoi_bitmap)) {
> +       if (vector == 243 || test_bit(vector, vcpu_to_synic(vcpu)->auto_eoi_bitmap)) {
>                  /*
>                   * For auto-EOI interrupts, there might be another pending
>                   * interrupt above PPR, so check whether to raise another
> 
> 
> That works, and now my guest finishes the SMP bringup (and gets as far
> as waiting on the XenStore implementation that I haven't put back yet).
> 
> So I think we need at least a tiny amount of support in-kernel for
> delivering event channel interrupt vectors, even if we wanted to allow
> for a completely userspace implementation.
> 
> Unless I'm missing something?

I did use the auto_eoi hack as well. So, yeah, I don't see any way of
getting around this.

Also, IIRC we had eventually gotten rid of the auto_eoi approach
because that wouldn't work with APICv. At that point we resorted to
direct queuing for vectored callbacks which was a hack that I never
grew fond of...
  
> I will get on with implementing the in-kernel handling with IRQ routing
> entries targeting a given { port, vcpu }. And I'm kind of vacillating
> about whether the mode/vector should be separately configured, or
> whether they might as well be in the IRQ routing table too, even if
> it's kind of redundant because it's specified the same for *every* port
> targeting the same vCPU. I *think* I prefer that redundancy over having
> a separate configuration mechanism to set the vector for each vCPU. But
> we'll see what happens when my fingers do the typing...
> 

Good luck to your fingers!

Ankur
David Woodhouse Dec. 9, 2020, 10:27 a.m. UTC | #11
On Tue, 2020-12-08 at 22:35 -0800, Ankur Arora wrote:
> > It looks like Linux doesn't use the per-vCPU upcall vector that you
> > called 'KVM_XEN_CALLBACK_VIA_EVTCHN'. So I'm delivering interrupts via
> > KVM_INTERRUPT as if they were ExtINT....
> > 
> > ... except I'm not. Because the kernel really does expect that to be an
> > ExtINT from a legacy PIC, and kvm_apic_accept_pic_intr() only returns
> > true if LVT0 is set up for EXTINT and unmasked.
> > 
> > I messed around with this hack and increasingly desperate variations on
> > the theme (since this one doesn't cause the IRQ window to be opened to
> > userspace in the first place), but couldn't get anything working:
> 
> "Increasingly desperate variations" about sums up my process as well while
> trying to get the upcall vector working.

:)

So this seems to work, and I think it's about as minimal as it can get.

All it does is implement a kvm_xen_has_interrupt() which checks the
vcpu_info->evtchn_upcall_pending flag, just like Xen does.

With this, my completely userspace implementation of event channels is
working. And of course this forms a basis for adding the minimal
acceleration of IPI/VIRQ that we need to the kernel, too.

My only slight concern is the bit in vcpu_enter_guest() where we have
to add the check for kvm_xen_has_interrupt(), because nothing is
setting KVM_REQ_EVENT. I suppose I could look at having something —
even an explicit ioctl from userspace — set that for us.... BUT...

I'm not sure that condition wasn't *already* broken in some cases for
KVM_INTERRUPT anyway. In kvm_vcpu_ioctl_interrupt() we set
vcpu->arch.pending_userspace_vector and we *do* request KVM_REQ_EVENT,
sure.

But what if vcpu_enter_guest() processes that the first time (and
clears KVM_REQ_EVENT), and then exits for some *other* reason with
interrupts still disabled? Next time through vcpu_enter_guest(), even
though kvm_cpu_has_injectable_intr() is still true, we don't enable the
IRQ window vmexit because KVM_REQ_EVENT got lost, so we don't even call
inject_pending_event().

So instead of just kvm_xen_has_interrupt() in my patch below, I wonder
if we need to use kvm_cpu_has_injectable_intr() to fix the existing
bug? Or am I missing something there and there isn't a bug after all?

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d8716ef27728..4a63f212fdfc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -902,6 +902,7 @@ struct msr_bitmap_range {
 /* Xen emulation context */
 struct kvm_xen {
 	bool long_mode;
+	u8 upcall_vector;
 	struct kvm_host_map shinfo_map;
 	void *shinfo;
 };
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 814698e5b152..24668b51b5c8 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -14,6 +14,7 @@
 #include "irq.h"
 #include "i8254.h"
 #include "x86.h"
+#include "xen.h"
 
 /*
  * check if there are pending timer events
@@ -56,6 +57,9 @@ int kvm_cpu_has_extint(struct kvm_vcpu *v)
 	if (!lapic_in_kernel(v))
 		return v->arch.interrupt.injected;
 
+	if (kvm_xen_has_interrupt(v))
+		return 1;
+
 	if (!kvm_apic_accept_pic_intr(v))
 		return 0;
 
@@ -110,6 +114,9 @@ static int kvm_cpu_get_extint(struct kvm_vcpu *v)
 	if (!lapic_in_kernel(v))
 		return v->arch.interrupt.nr;
 
+	if (kvm_xen_has_interrupt(v))
+		return v->kvm->arch.xen.upcall_vector;
+
 	if (irqchip_split(v->kvm)) {
 		int vector = v->arch.pending_external_vector;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ad9eea8f4f26..1711072b3616 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8891,7 +8891,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_x86_ops.msr_filter_changed(vcpu);
 	}
 
-	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
+	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
+	    kvm_xen_has_interrupt(vcpu)) {
 		++vcpu->stat.req_event;
 		kvm_apic_accept_events(vcpu);
 		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 4aa776c1ad57..116422d93d96 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -257,6 +257,22 @@ void kvm_xen_setup_pvclock_page(struct kvm_vcpu *v)
 	srcu_read_unlock(&v->kvm->srcu, idx);
 }
 
+int kvm_xen_has_interrupt(struct kvm_vcpu *v)
+{
+       int rc = 0;
+
+       if (v->kvm->arch.xen.upcall_vector) {
+               int idx = srcu_read_lock(&v->kvm->srcu);
+               struct vcpu_info *vcpu_info = xen_vcpu_info(v);
+
+               if (vcpu_info && READ_ONCE(vcpu_info->evtchn_upcall_pending))
+                       rc = 1;
+
+               srcu_read_unlock(&v->kvm->srcu, idx);
+       }
+       return rc;
+}
+
 static int vcpu_attr_loc(struct kvm_vcpu *vcpu, u16 type,
 			 struct kvm_host_map **map, void ***hva, size_t *sz)
 {
@@ -338,6 +354,14 @@ int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
 		break;
 	}
 
+	case KVM_XEN_ATTR_TYPE_UPCALL_VECTOR:
+		if (data->u.vector < 0x10)
+			return -EINVAL;
+
+		kvm->arch.xen.upcall_vector = data->u.vector;
+		r = 0;
+		break;
+
 	default:
 		break;
 	}
@@ -386,6 +410,11 @@ int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
 		break;
 	}
 
+	case KVM_XEN_ATTR_TYPE_UPCALL_VECTOR:
+		data->u.vector = kvm->arch.xen.upcall_vector;
+		r = 0;
+		break;
+
 	default:
 		break;
 	}
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index ccd6002f55bc..1f599342f02c 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -25,6 +25,7 @@ static inline struct kvm_vcpu *xen_vcpu_to_vcpu(struct kvm_vcpu_xen *xen_vcpu)
 void kvm_xen_setup_pvclock_page(struct kvm_vcpu *vcpu);
 void kvm_xen_setup_runstate_page(struct kvm_vcpu *vcpu);
 void kvm_xen_runstate_set_preempted(struct kvm_vcpu *vcpu);
+int kvm_xen_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
 int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
 int kvm_xen_hypercall(struct kvm_vcpu *vcpu);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 1047364d1adf..113279fa9e1e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1587,6 +1587,7 @@ struct kvm_xen_hvm_attr {
 
 	union {
 		__u8 long_mode;
+		__u8 vector;
 		struct {
 			__u64 gfn;
 		} shared_info;
@@ -1604,6 +1605,7 @@ struct kvm_xen_hvm_attr {
 #define KVM_XEN_ATTR_TYPE_VCPU_INFO		0x2
 #define KVM_XEN_ATTR_TYPE_VCPU_TIME_INFO	0x3
 #define KVM_XEN_ATTR_TYPE_VCPU_RUNSTATE		0x4
+#define KVM_XEN_ATTR_TYPE_UPCALL_VECTOR		0x5
 
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
Joao Martins Dec. 9, 2020, 10:51 a.m. UTC | #12
On 12/9/20 10:27 AM, David Woodhouse wrote:
> On Tue, 2020-12-08 at 22:35 -0800, Ankur Arora wrote:
>>> It looks like Linux doesn't use the per-vCPU upcall vector that you
>>> called 'KVM_XEN_CALLBACK_VIA_EVTCHN'. So I'm delivering interrupts via
>>> KVM_INTERRUPT as if they were ExtINT....
>>>
>>> ... except I'm not. Because the kernel really does expect that to be an
>>> ExtINT from a legacy PIC, and kvm_apic_accept_pic_intr() only returns
>>> true if LVT0 is set up for EXTINT and unmasked.
>>>
>>> I messed around with this hack and increasingly desperate variations on
>>> the theme (since this one doesn't cause the IRQ window to be opened to
>>> userspace in the first place), but couldn't get anything working:
>>
>> "Increasingly desperate variations" about sums up my process as well while
>> trying to get the upcall vector working.
> 
> :)
> 
> So this seems to work, and I think it's about as minimal as it can get.
> 
> All it does is implement a kvm_xen_has_interrupt() which checks the
> vcpu_info->evtchn_upcall_pending flag, just like Xen does.
> 
> With this, my completely userspace implementation of event channels is
> working. And of course this forms a basis for adding the minimal
> acceleration of IPI/VIRQ that we need to the kernel, too.
> 
> My only slight concern is the bit in vcpu_enter_guest() where we have
> to add the check for kvm_xen_has_interrupt(), because nothing is
> setting KVM_REQ_EVENT. I suppose I could look at having something —
> even an explicit ioctl from userspace — set that for us.... BUT...
> 
Isn't this what the first half of this patch was doing initially (minus the
irq routing)? Looks really similar:

https://lore.kernel.org/kvm/20190220201609.28290-11-joao.m.martins@oracle.com/

Albeit, I gotta say that after seeing the irq routing removed it ends up much simpler, if we just
replace the irq routing with a domain-wide upcall vector ;)

Albeit it wouldn't cover the Windows Xen open source drivers, which use the EVTCHN method
(which is a regular LAPIC vector) [to answer your question on what uses it]. For the EVTCHN
method you can just inject a regular vector through LAPIC delivery, and the guest acks it.
Sadly Linux doesn't use it; if it did, we would probably get faster upcalls with APICv/AVIC.

> I'm not sure that condition wasn't *already* broken in some cases for
> KVM_INTERRUPT anyway. In kvm_vcpu_ioctl_interrupt() we set
> vcpu->arch.pending_userspace_vector and we *do* request KVM_REQ_EVENT,
> sure.
> 
> But what if vcpu_enter_guest() processes that the first time (and
> clears KVM_REQ_EVENT), and then exits for some *other* reason with
> interrupts still disabled? Next time through vcpu_enter_guest(), even
> though kvm_cpu_has_injectable_intr() is still true, we don't enable the
> IRQ window vmexit because KVM_REQ_EVENT got lost, so we don't even call
> inject_pending_event().
> 
> So instead of just kvm_xen_has_interrupt() in my patch below, I wonder
> if we need to use kvm_cpu_has_injectable_intr() to fix the existing
> bug? Or am I missing something there and there isn't a bug after all?
> 
Given that the notion of an event channel being pending is Xen-specific handling, I am not sure
we can remove the kvm_xen_has_interrupt()/kvm_xen_get_interrupt() logic. That is much of the
reason why I ended up adding the check for pending event channels on vmenter.

That, or the auto-EOI hack Ankur and you were talking about.

> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d8716ef27728..4a63f212fdfc 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -902,6 +902,7 @@ struct msr_bitmap_range {
>  /* Xen emulation context */
>  struct kvm_xen {
>  	bool long_mode;
> +	u8 upcall_vector;
>  	struct kvm_host_map shinfo_map;
>  	void *shinfo;
>  };
> diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
> index 814698e5b152..24668b51b5c8 100644
> --- a/arch/x86/kvm/irq.c
> +++ b/arch/x86/kvm/irq.c
> @@ -14,6 +14,7 @@
>  #include "irq.h"
>  #include "i8254.h"
>  #include "x86.h"
> +#include "xen.h"
>  
>  /*
>   * check if there are pending timer events
> @@ -56,6 +57,9 @@ int kvm_cpu_has_extint(struct kvm_vcpu *v)
>  	if (!lapic_in_kernel(v))
>  		return v->arch.interrupt.injected;
>  
> +	if (kvm_xen_has_interrupt(v))
> +		return 1;
> +
>  	if (!kvm_apic_accept_pic_intr(v))
>  		return 0;
>  
> @@ -110,6 +114,9 @@ static int kvm_cpu_get_extint(struct kvm_vcpu *v)
>  	if (!lapic_in_kernel(v))
>  		return v->arch.interrupt.nr;
>  
> +	if (kvm_xen_has_interrupt(v))
> +		return v->kvm->arch.xen.upcall_vector;
> +
>  	if (irqchip_split(v->kvm)) {
>  		int vector = v->arch.pending_external_vector;
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ad9eea8f4f26..1711072b3616 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8891,7 +8891,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  			kvm_x86_ops.msr_filter_changed(vcpu);
>  	}
>  
> -	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
> +	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
> +	    kvm_xen_has_interrupt(vcpu)) {
>  		++vcpu->stat.req_event;
>  		kvm_apic_accept_events(vcpu);
>  		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 4aa776c1ad57..116422d93d96 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -257,6 +257,22 @@ void kvm_xen_setup_pvclock_page(struct kvm_vcpu *v)
>  	srcu_read_unlock(&v->kvm->srcu, idx);
>  }
>  
> +int kvm_xen_has_interrupt(struct kvm_vcpu *v)
> +{
> +       int rc = 0;
> +
> +       if (v->kvm->arch.xen.upcall_vector) {
> +               int idx = srcu_read_lock(&v->kvm->srcu);
> +               struct vcpu_info *vcpu_info = xen_vcpu_info(v);
> +
> +               if (vcpu_info && READ_ONCE(vcpu_info->evtchn_upcall_pending))
> +                       rc = 1;
> +
> +               srcu_read_unlock(&v->kvm->srcu, idx);
> +       }
> +       return rc;
> +}
> +
>  static int vcpu_attr_loc(struct kvm_vcpu *vcpu, u16 type,
>  			 struct kvm_host_map **map, void ***hva, size_t *sz)
>  {
> @@ -338,6 +354,14 @@ int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
>  		break;
>  	}
>  
> +	case KVM_XEN_ATTR_TYPE_UPCALL_VECTOR:
> +		if (data->u.vector < 0x10)
> +			return -EINVAL;
> +
> +		kvm->arch.xen.upcall_vector = data->u.vector;
> +		r = 0;
> +		break;
> +
>  	default:
>  		break;
>  	}
> @@ -386,6 +410,11 @@ int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
>  		break;
>  	}
>  
> +	case KVM_XEN_ATTR_TYPE_UPCALL_VECTOR:
> +		data->u.vector = kvm->arch.xen.upcall_vector;
> +		r = 0;
> +		break;
> +
>  	default:
>  		break;
>  	}
> diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
> index ccd6002f55bc..1f599342f02c 100644
> --- a/arch/x86/kvm/xen.h
> +++ b/arch/x86/kvm/xen.h
> @@ -25,6 +25,7 @@ static inline struct kvm_vcpu *xen_vcpu_to_vcpu(struct kvm_vcpu_xen *xen_vcpu)
>  void kvm_xen_setup_pvclock_page(struct kvm_vcpu *vcpu);
>  void kvm_xen_setup_runstate_page(struct kvm_vcpu *vcpu);
>  void kvm_xen_runstate_set_preempted(struct kvm_vcpu *vcpu);
> +int kvm_xen_has_interrupt(struct kvm_vcpu *vcpu);
>  int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
>  int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
>  int kvm_xen_hypercall(struct kvm_vcpu *vcpu);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 1047364d1adf..113279fa9e1e 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1587,6 +1587,7 @@ struct kvm_xen_hvm_attr {
>  
>  	union {
>  		__u8 long_mode;
> +		__u8 vector;
>  		struct {
>  			__u64 gfn;
>  		} shared_info;
> @@ -1604,6 +1605,7 @@ struct kvm_xen_hvm_attr {
>  #define KVM_XEN_ATTR_TYPE_VCPU_INFO		0x2
>  #define KVM_XEN_ATTR_TYPE_VCPU_TIME_INFO	0x3
>  #define KVM_XEN_ATTR_TYPE_VCPU_RUNSTATE		0x4
> +#define KVM_XEN_ATTR_TYPE_UPCALL_VECTOR		0x5
>  
>  /* Secure Encrypted Virtualization command */
>  enum sev_cmd_id {
>
David Woodhouse Dec. 9, 2020, 11:39 a.m. UTC | #13
On Wed, 2020-12-09 at 10:51 +0000, Joao Martins wrote:
> Isn't this what the first half of this patch was doing initially (minus the
> irq routing)? Looks really similar:
> 
> https://lore.kernel.org/kvm/20190220201609.28290-11-joao.m.martins@oracle.com/

Absolutely! This thread is in reply to your original posting of
precisely that patch, and I've had your tree open in gitk to crib from
for most of the last week.

There's a *reason* why my tree looks like yours, and why more than half
of the patches in it still show you as being the author :)

> Albeit, I gotta say that after seeing the irq routing removed it ends up much simpler, if we just
> replace the irq routing with a domain-wide upcall vector ;)

I like "simpler".

I also killed the ->cb.queued flag you had because I think it's
redundant with evtchn_upcall_pending anyway.

> Albeit it wouldn't cover the Windows Xen open source drivers, which use the EVTCHN method
> (which is a regular LAPIC vector) [to answer your question on what uses it]. For the EVTCHN
> method you can just inject a regular vector through LAPIC delivery, and the guest acks it.
> Sadly Linux doesn't use it; if it did, we would probably get faster upcalls with APICv/AVIC.

It doesn't need to, because those can just be injected with
KVM_SIGNAL_MSI.

At most, we just need to make sure that kvm_xen_has_interrupt() returns
false if the per-vCPU LAPIC vector is configured. But I didn't do that
because I checked Xen and it doesn't do it either.

As far as I can tell, Xen's hvm_vcpu_has_pending_irq() will still
return the domain-wide vector in preference to the one in the LAPIC, if
it actually gets invoked. And if the guest sets ->evtchn_upcall_pending
to reinject the IRQ (as Linux does on unmask) Xen won't use the per-
vCPU vector to inject that; it'll use the domain-wide vector.

> > I'm not sure that condition wasn't *already* broken in some cases for
> > KVM_INTERRUPT anyway. In kvm_vcpu_ioctl_interrupt() we set
> > vcpu->arch.pending_userspace_vector and we *do* request KVM_REQ_EVENT,
> > sure.
> > 
> > But what if vcpu_enter_guest() processes that the first time (and
> > clears KVM_REQ_EVENT), and then exits for some *other* reason with
> > interrupts still disabled? Next time through vcpu_enter_guest(), even
> > though kvm_cpu_has_injectable_intr() is still true, we don't enable the
> > IRQ window vmexit because KVM_REQ_EVENT got lost, so we don't even call
> > inject_pending_event().
> > 
> > So instead of just kvm_xen_has_interrupt() in my patch below, I wonder
> > if we need to use kvm_cpu_has_injectable_intr() to fix the existing
> > bug? Or am I missing something there and there isn't a bug after all?
> > 
> 
> Given that the notion of an event channel being pending is Xen-specific handling, I am not sure
> we can remove the kvm_xen_has_interrupt()/kvm_xen_get_interrupt() logic. That is much of the
> reason why I ended up adding the check for pending event channels on vmenter.

Sure, we definitely need the check I added in vcpu_enter_guest() for
Xen unless I'm going to come up with a way to set KVM_REQ_EVENT at the
appropriate time.

But I'm looking at the ExtINT handling and as far as I can tell it's
buggy. So I didn't want to use it as a model for setting KVM_REQ_EVENT
for Xen events.
Joao Martins Dec. 9, 2020, 1:26 p.m. UTC | #14
On 12/9/20 11:39 AM, David Woodhouse wrote:
> On Wed, 2020-12-09 at 10:51 +0000, Joao Martins wrote:
>> Isn't this what the first half of this patch was doing initially (minus the
>> irq routing)? Looks really similar:
>>
>> https://lore.kernel.org/kvm/20190220201609.28290-11-joao.m.martins@oracle.com/
> 
> Absolutely! This thread is in reply to your original posting of
> precisely that patch, and I've had your tree open in gitk to crib from
> for most of the last week.
> 
I had forgotten about this patch given all the discussion so far, and had to take another look
because your snippet resembled it. But I ended up being a little pedantic -- sorry about that.

> There's a *reason* why my tree looks like yours, and why more than half
> of the patches in it still show you as being the author :)
> 
Btw, in this patch's case the author would be Ankur.

More importantly, thanks a lot for picking it up and for all the awesome stuff you're
doing with it.

>> Albeit, I gotta say that after seeing the irq routing removed it ends up much simpler, if we just
>> replace the irq routing with a domain-wide upcall vector ;)
> 
> I like "simpler".
> 
> I also killed the ->cb.queued flag you had because I think it's
> redundant with evtchn_upcall_pending anyway.
> 
Yeap, indeed.

>> Albeit it wouldn't cover the Windows Xen open source drivers, which use the EVTCHN method
>> (which is a regular LAPIC vector) [to answer your question on what uses it]. For the EVTCHN
>> method you can just inject a regular vector through LAPIC delivery, and the guest acks it.
>> Sadly Linux doesn't use it; if it did, we would probably get faster upcalls with APICv/AVIC.
> 
> It doesn't need to, because those can just be injected with
> KVM_SIGNAL_MSI.
> 
/me nods

> At most, we just need to make sure that kvm_xen_has_interrupt() returns
> false if the per-vCPU LAPIC vector is configured. But I didn't do that
> because I checked Xen and it doesn't do it either.
> 
Oh! I have this strange recollection that it was, when we were looking at the Xen
implementation.

> As far as I can tell, Xen's hvm_vcpu_has_pending_irq() will still
> return the domain-wide vector in preference to the one in the LAPIC, if
> it actually gets invoked. 

Only if the callback installed is HVMIRQ_callback_vector IIUC.

Otherwise the vector would be pending like any other LAPIC vector.

> And if the guest sets ->evtchn_upcall_pending
> to reinject the IRQ (as Linux does on unmask) Xen won't use the per-
> vCPU vector to inject that; it'll use the domain-wide vector.
>

Right -- I don't think Linux even installs a per-CPU upcall LAPIC vector other than the
domain's callback irq.

>>> I'm not sure that condition wasn't *already* broken in some cases for
>>> KVM_INTERRUPT anyway. In kvm_vcpu_ioctl_interrupt() we set
>>> vcpu->arch.pending_userspace_vector and we *do* request KVM_REQ_EVENT,
>>> sure.
>>>
>>> But what if vcpu_enter_guest() processes that the first time (and
>>> clears KVM_REQ_EVENT), and then exits for some *other* reason with
>>> interrupts still disabled? Next time through vcpu_enter_guest(), even
>>> though kvm_cpu_has_injectable_intr() is still true, we don't enable the
>>> IRQ window vmexit because KVM_REQ_EVENT got lost, so we don't even call
>>> inject_pending_event().
>>>
>>> So instead of just kvm_xen_has_interrupt() in my patch below, I wonder
>>> if we need to use kvm_cpu_has_injectable_intr() to fix the existing
>>> bug? Or am I missing something there and there isn't a bug after all?
>>
>> Given that the notion of an event channel being pending is Xen-specific handling, I am not sure
>> we can remove the kvm_xen_has_interrupt()/kvm_xen_get_interrupt() logic. That is much of the
>> reason why we ended up adding the check for pending event channels on vmenter.
> 
> Sure, we definitely need the check I added in vcpu_enter_guest() for
> Xen unless I'm going to come up with a way to set KVM_REQ_EVENT at the
> appropriate time.
> 
> But I'm looking at the ExtINT handling and as far as I can tell it's
> buggy. So I didn't want to use it as a model for setting KVM_REQ_EVENT
> for Xen events.
> 
/me nods
David Woodhouse Dec. 9, 2020, 3:41 p.m. UTC | #15
On 9 December 2020 13:26:55 GMT, Joao Martins <joao.m.martins@oracle.com> wrote:
>On 12/9/20 11:39 AM, David Woodhouse wrote:
>> On Wed, 2020-12-09 at 10:51 +0000, Joao Martins wrote:
>>> Isn't this what the first half of this patch was doing initially (minus the
>>> irq routing)? Looks really similar:
>>>
>>> https://lore.kernel.org/kvm/20190220201609.28290-11-joao.m.martins@oracle.com/
>> 
>> Absolutely! This thread is in reply to your original posting of
>> precisely that patch, and I've had your tree open in gitk to crib from
>> for most of the last week.
>> 
>I had forgotten about this patch given all the discussion so far, and
>had to take another look because your snippet resembled it. But I ended
>up being a little pedantic -- sorry about that.

Nah, pedantry is good :)

>> At most, we just need to make sure that kvm_xen_has_interrupt() returns
>> false if the per-vCPU LAPIC vector is configured. But I didn't do that
>> because I checked Xen and it doesn't do it either.
>> 
>Oh! I have this strange recollection that it was, when we were looking
>at the Xen implementation.

Hm, maybe I missed it. Will stare at it harder, although looking at Xen code tends to make my brain hurt :)

>> As far as I can tell, Xen's hvm_vcpu_has_pending_irq() will still
>> return the domain-wide vector in preference to the one in the LAPIC, if
>> it actually gets invoked.
>
>Only if the callback installed is HVMIRQ_callback_vector IIUC.
>
>Otherwise the vector would be pending like any other LAPIC vector.

Ah, right.

For some reason I had it in my head that you could only set the per-vCPU lapic vector if the domain was set to HVMIRQ_callback_vector. If the domain is set to HVMIRQ_callback_none, that clearly makes more sense.

Still, my patch should do the same as Xen does in the case where a guest does set both, I think.

Faithful compatibility with odd Xen behaviour FTW :)
Joao Martins Dec. 9, 2020, 4:12 p.m. UTC | #16
On 12/9/20 3:41 PM, David Woodhouse wrote:
> On 9 December 2020 13:26:55 GMT, Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 12/9/20 11:39 AM, David Woodhouse wrote:
>>> As far as I can tell, Xen's hvm_vcpu_has_pending_irq() will still
>>> return the domain-wide vector in preference to the one in the LAPIC,
>> if
>>> it actually gets invoked. 
>>
>> Only if the callback installed is HVMIRQ_callback_vector IIUC.
>>
>> Otherwise the vector would be pending like any other LAPIC vector.
> 
> Ah, right.
> 
> For some reason I had it in my head that you could only set the per-vCPU lapic vector if the domain was set to HVMIRQ_callback_vector. If the domain is set to HVMIRQ_callback_none, that clearly makes more sense.
> 
> Still, my patch should do the same as Xen does in the case where a guest does set both, I think.
> 
> Faithful compatibility with odd Xen behaviour FTW :)
> 
Ah, yes. In that case, HVMIRQ_callback_vector does take precedence.

But it would be very weird for a guest to set up two callback vectors :)
David Woodhouse Jan. 1, 2021, 2:33 p.m. UTC | #17
On Wed, 2020-12-02 at 18:34 +0000, Joao Martins wrote:
> > But if the kernel is going to short-circuit the IPIs and VIRQs, then
> > it's already going to have to handle the evtchn_pending/evtchn_mask
> > bitmaps, and actually injecting interrupts.
> > 
> 
> Right. I was trying to point that out in the discussion we had
> on the next patch. But truth be told, it was more about touting the idea of the kernel
> knowing if a given event channel is registered for userspace handling,
> rather than fully handling the event channel.
> 
> I suppose we are able to provide both options to the VMM anyway,
> i.e. 1) letting them handle it entirely in userspace by intercepting
> EVTCHNOP_send, or 2) through the irq route if we want the kernel to offload it.
> 
> > Given that it has to have that functionality anyway, it seems saner to
> > let the kernel have full control over it and to just expose
> > 'evtchn_send' to userspace.
> > 
> > The alternative is to have userspace trying to play along with the
> > atomic handling of those bitmasks too 
> 
> That part is not particularly hard -- having it done already.

Right, for 2-level it works out fairly well. I like your model of
installing { vcpu_id, port } into the KVM IRQ routing table and that's
enough for the kernel to raise the event channels that it *needs* to
know about, while userspace can do the others for itself. It's just
atomic test-and-set bitop stuff with no real need for coordination.
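For 2-level, that fast path is essentially the following (a rough standalone sketch only; the struct layouts and helper names are simplified stand-ins for the real shared_info/vcpu_info ABI, and the kernel would use its own atomic bitops rather than the compiler builtins used here):

```c
#include <stdint.h>

#define BITS_PER_WORD 64
#define MAX_PORTS     4096		/* 2-level limit for 64-bit guests */

/* Simplified stand-ins for the 2-level shared_info/vcpu_info state */
struct shinfo_2l {
	uint64_t evtchn_pending[MAX_PORTS / BITS_PER_WORD];
	uint64_t evtchn_mask[MAX_PORTS / BITS_PER_WORD];
};

struct vcpu_info_2l {
	uint64_t evtchn_pending_sel;
	uint8_t  evtchn_upcall_pending;
};

static int test_and_set_bit64(uint64_t *word, unsigned int bit)
{
	uint64_t mask = 1ULL << bit;

	/* GCC/Clang builtin here; the kernel uses its own bitops */
	return (__atomic_fetch_or(word, mask, __ATOMIC_SEQ_CST) & mask) != 0;
}

/* Raise @port; returns 1 iff the target vCPU now needs an upcall */
int raise_evtchn_2l(struct shinfo_2l *s, struct vcpu_info_2l *v,
		    unsigned int port)
{
	unsigned int w = port / BITS_PER_WORD, b = port % BITS_PER_WORD;

	if (test_and_set_bit64(&s->evtchn_pending[w], b))
		return 0;			/* already pending */
	if (s->evtchn_mask[w] & (1ULL << b))
		return 0;			/* masked: pending, no upcall */
	test_and_set_bit64(&v->evtchn_pending_sel, w);
	if (__atomic_exchange_n(&v->evtchn_upcall_pending, 1,
				__ATOMIC_SEQ_CST))
		return 0;			/* upcall already signalled */
	return 1;
}
```

Each step is an independent atomic op, which is why kernel and userspace can both raise 2-level events without coordinating.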

For FIFO event channels it gets more fun, because the updates aren't
truly atomic — they require a spinlock around the three operations that
the host has to perform when linking an event into a queue: 

 • Set the new port's LINKED bit
 • Set the previous head's LINK to point to the new port
 • Store the new port# as the head.
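Under the per-queue lock, those three operations would look roughly like this (a sketch only: the LINKED bit position and 17-bit LINK mask follow the Xen FIFO ABI, but the queue bookkeeping is deliberately simplified and follows the bullet list above rather than Xen's full head/tail handling):

```c
#include <stdint.h>
#include <pthread.h>

#define EVTCHN_FIFO_LINKED	(1u << 29)
#define EVTCHN_FIFO_LINK_MASK	0x1ffffu	/* low 17 bits: LINK field */

struct fifo_queue {
	pthread_mutex_t lock;	/* stand-in for the per-queue spinlock */
	uint32_t head;		/* most recently linked port, 0 if empty */
};

/* Link @port into @q; @word is the guest-shared event-word array */
void fifo_link_event(struct fifo_queue *q, uint32_t *word, uint32_t port)
{
	pthread_mutex_lock(&q->lock);

	/* 1. Set the new port's LINKED bit */
	__atomic_fetch_or(&word[port], EVTCHN_FIFO_LINKED, __ATOMIC_SEQ_CST);

	/* 2. Set the previous head's LINK to point to the new port */
	if (q->head != 0) {
		uint32_t old = __atomic_load_n(&word[q->head],
					       __ATOMIC_SEQ_CST);
		__atomic_store_n(&word[q->head],
				 (old & ~EVTCHN_FIFO_LINK_MASK) | port,
				 __ATOMIC_SEQ_CST);
	}

	/* 3. Store the new port# as the head */
	q->head = port;

	pthread_mutex_unlock(&q->lock);
}
```

The point of the sketch is that steps 1–3 only make sense as a unit, hence the lock; no single atomic op covers all three.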

One option might be to declare that for FIFO, all updates for a given
queue *must* be handled either by the kernel, or by userspace, with no
sharing of control.

Or maybe there's a way to keep the kernel side simple by avoiding the
tail handling completely. Surely we only really care about kernel
handling of the *fast* path, where a single event channel is triggered
and handled immediately? In the high-latency case where we're gathering
multiple events in a queue before the guest ever gets to process them, 
we might as well let that be handled by userspace, right?

So in that case, perhaps the kernel part could forget all the horrid
nonsense with tracking the tail of the queue. It would handle the event
in-kernel *only* in the case where the event is the *first* in the
queue, and the head was previously zero?

But even that isn't a simple atomic operation though; we still have to
mark the event LINKED, then update the head pointer to reference it.
And what if we set the 'LINKED' bit but then find that userspace has
already linked another port so ->head is no longer zero?

Can we just clear the LINKED bit and then punt the whole event for
userspace to (re)handle? Or raise a special event to userspace so it
knows it needs to go ahead and link the port even though its LINKED bit
has already been set?

None of the available options really fill me with joy; I'm somewhat
inclined just to declare that the kernel won't support acceleration of
FIFO event channels at all.

None of which matters a *huge* amount right now if I was only going to
leave that as a future optimisation anyway.

What it does actually mean in the short term is that as I update your
KVM_IRQ_ROUTING_XEN_EVTCHN support, I probably *won't* bother to add a
'priority' field to struct kvm_irq_routing_xen_evtchn to make it
extensible to FIFO event channels. We can always add that later.

Does that seem reasonable?
Joao Martins Jan. 5, 2021, 12:11 p.m. UTC | #18
On 1/1/21 2:33 PM, David Woodhouse wrote:
> On Wed, 2020-12-02 at 18:34 +0000, Joao Martins wrote:
>>> But if the kernel is going to short-circuit the IPIs and VIRQs, then
>>> it's already going to have to handle the evtchn_pending/evtchn_mask
>>> bitmaps, and actually injecting interrupts.
>>>
>>
>> Right. I was trying to point that out in the discussion we had on the
>> next patch. But truth be told, it was more about floating the idea of
>> the kernel knowing whether a given event channel is registered for
>> userspace handling, rather than fully handling the event channel.
>>
>> I suppose we are able to provide both options to the VMM anyway,
>> i.e. 1) letting it handle everything entirely in userspace by
>> intercepting EVTCHNOP_send, or 2) going through the irq route if we
>> want the kernel to offload it.
>>
>>> Given that it has to have that functionality anyway, it seems saner to
>>> let the kernel have full control over it and to just expose
>>> 'evtchn_send' to userspace.
>>>
>>> The alternative is to have userspace trying to play along with the
>>> atomic handling of those bitmasks too 
>>
>> That part is not particularly hard -- having it done already.
> 
> Right, for 2-level it works out fairly well. I like your model of
> installing { vcpu_id, port } into the KVM IRQ routing table and that's
> enough for the kernel to raise the event channels that it *needs* to
> know about, while userspace can do the others for itself. It's just
> atomic test-and-set bitop stuff with no real need for coordination.
> 
> For FIFO event channels it gets more fun, because the updates aren't
> truly atomic — they require a spinlock around the three operations that
> the host has to perform when linking an event into a queue: 
> 
>  • Set the new port's LINKED bit
>  • Set the previous head's LINK to point to the new port
>  • Store the new port# as the head.
> 
> One option might be to declare that for FIFO, all updates for a given
> queue *must* be handled either by the kernel, or by userspace, with no
> sharing of control.
> 
> Or maybe there's a way to keep the kernel side simple by avoiding the
> tail handling completely. Surely we only really care about kernel
> handling of the *fast* path, where a single event channel is triggered
> and handled immediately? In the high-latency case where we're gathering
> multiple events in a queue before the guest ever gets to process them, 
> we might as well let that be handled by userspace, right?
> 
> So in that case, perhaps the kernel part could forget all the horrid
> nonsense with tracking the tail of the queue. It would handle the event
> in-kernel *only* in the case where the event is the *first* in the
> queue, and the head was previously zero?
> 
> But even that isn't a simple atomic operation though; we still have to
> mark the event LINKED, then update the head pointer to reference it.
> And what if we set the 'LINKED' bit but then find that userspace has
> already linked another port so ->head is no longer zero?
> 
> Can we just clear the LINKED bit and then punt the whole event for
> userspace to (re)handle? Or raise a special event to userspace so it
> knows it needs to go ahead and link the port even though its LINKED bit
> has already been set?
> 
> None of the available options really fill me with joy; I'm somewhat
> inclined just to declare that the kernel won't support acceleration of
> FIFO event channels at all.
> 
> None of which matters a *huge* amount right now if I was only going to
> leave that as a future optimisation anyway.
> 
ACK.

> What it does actually mean in the short term is that as I update your
> KVM_IRQ_ROUTING_XEN_EVTCHN support, I probably *won't* bother to add a
> 'priority' field to struct kvm_irq_routing_xen_evtchn to make it
> extensible to FIFO event channels. We can always add that later.
> 
> Does that seem reasonable?
> 
Yes, makes sense IMHO. Guests need to fall back to 2L anyway if the
EVTCHNOP_init_control hypercall fails, and the way we are handling
events doesn't warrant making FIFO event channel support mandatory.

Despite the many FIFO event features, IIRC the main driving motivation
was to go beyond the 1K/4K port limit for 32-bit/64-bit guests, raising
it to a maximum of 128K ports per guest. But that was mostly a limit for
Domain-0, as it hosts all the backend handling in the traditional (i.e.
without driver domains) deployment, thereby limiting how many vdevs one
could host in the system. In dense VM consolidation, where you host 1K
guests with e.g. 3 block devices and 1 network interface each, one would
quickly run out of interdomain event channel ports in Dom0. But that is
not the case here.

We are anyway ABI-limited to HVM_MAX_VCPUS (128), and if we assume 6
event channels per vCPU for the legacy guest (4 IPI evtchns, 1 timer,
1 debug), that still leaves 3328 ports for interdomain event channels
on a 128-vCPU HVM guest when using the 2L event channel ABI.
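(For the record, the arithmetic behind that 3328 figure, under the per-vCPU channel count assumed above:)

```c
/* 2-level ABI on a 64-bit guest: 64 * 64 = 4096 ports total */
enum {
	TOTAL_2L_PORTS	 = 64 * 64,
	HVM_MAX_VCPUS	 = 128,
	PER_VCPU_EVTCHNS = 6,	/* 4 IPI evtchns + 1 timer + 1 debug */
};

int free_interdomain_ports(void)
{
	return TOTAL_2L_PORTS - HVM_MAX_VCPUS * PER_VCPU_EVTCHNS; /* 3328 */
}
```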

	Joao
David Woodhouse Jan. 5, 2021, 1:23 p.m. UTC | #19
On Tue, 2021-01-05 at 12:11 +0000, Joao Martins wrote:
> On 1/1/21 2:33 PM, David Woodhouse wrote:
> > What it does actually mean in the short term is that as I update your
> > KVM_IRQ_ROUTING_XEN_EVTCHN support, I probably *won't* bother to add a
> > 'priority' field to struct kvm_irq_routing_xen_evtchn to make it
> > extensible to FIFO event channels. We can always add that later.
> > 
> > Does that seem reasonable?
> > 
> 
> Yes, makes sense IMHO. Guests need to anyway fallback to 2L if the
> evtchnop_init_control hypercall fails, and the way we are handling events,
> doesn't warrant FIFO event channel support as mandatory.

Right.

I think it's perfectly OK to declare that we aren't going to support
FIFO event channel acceleration in Linux/KVM, and anyone who really
wants to support them would have to do it entirely in userspace.

The only reason a VMM would really need to support FIFO event channels
is if we're insane enough to want to do live migration from actual
Xen, of guests which were previously using them.

While I make no comment about my mental state, I can happily observe
that I don't have guests which currently use FIFO support.
Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9d388ba0a05c..3305173bf10b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -534,6 +534,12 @@  struct kvm_vcpu_hv {
 	cpumask_t tlb_flush;
 };
 
+struct kvm_xen_callback {
+	u32 via;
+	u32 vector;
+	atomic_t queued;
+};
+
 /* Xen per vcpu emulation context */
 struct kvm_vcpu_xen {
 	struct kvm_xen_exit exit;
@@ -543,6 +549,7 @@  struct kvm_vcpu_xen {
 	struct pvclock_vcpu_time_info *pv_time;
 	gpa_t steal_time_addr;
 	struct vcpu_runstate_info *steal_time;
+	struct kvm_xen_callback cb;
 };
 
 struct kvm_vcpu_arch {
@@ -854,6 +861,13 @@  struct kvm_xen {
 	struct shared_info *shinfo;
 };
 
+enum kvm_xen_callback_via {
+	KVM_XEN_CALLBACK_VIA_GSI,
+	KVM_XEN_CALLBACK_VIA_PCI_INTX,
+	KVM_XEN_CALLBACK_VIA_VECTOR,
+	KVM_XEN_CALLBACK_VIA_EVTCHN,
+};
+
 enum kvm_irqchip_mode {
 	KVM_IRQCHIP_NONE,
 	KVM_IRQCHIP_KERNEL,       /* created with KVM_CREATE_IRQCHIP */
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index faa264822cee..cdb1dbfcc9b1 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -26,6 +26,7 @@ 
 #include "irq.h"
 #include "i8254.h"
 #include "x86.h"
+#include "xen.h"
 
 /*
  * check if there are pending timer events
@@ -61,7 +62,9 @@  static int kvm_cpu_has_extint(struct kvm_vcpu *v)
 			return pending_userspace_extint(v);
 		else
 			return v->kvm->arch.vpic->output;
-	} else
+	} else if (kvm_xen_has_interrupt(v) != -1)
+		return 1;
+	else
 		return 0;
 }
 
@@ -119,7 +122,7 @@  int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
 	if (kvm_cpu_has_extint(v))
 		return 1;
 
-	return kvm_apic_has_interrupt(v) != -1;	/* LAPIC */
+	return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
 }
 EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
 
@@ -135,8 +138,13 @@  static int kvm_cpu_get_extint(struct kvm_vcpu *v)
 
 			v->arch.pending_external_vector = -1;
 			return vector;
-		} else
+		} else {
+			int vector = kvm_xen_get_interrupt(v);
+
+			if (vector)
+				return vector;		 /* Xen */
 			return kvm_pic_read_irq(v->kvm); /* PIC */
+		}
 	} else
 		return -1;
 }
diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index 3cc3b2d130a0..3b5da18c9ce2 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -36,6 +36,7 @@ 
 #include "lapic.h"
 
 #include "hyperv.h"
+#include "xen.h"
 #include "x86.h"
 
 static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
@@ -176,6 +177,9 @@  int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
 	int r;
 
 	switch (e->type) {
+	case KVM_IRQ_ROUTING_XEN_EVTCHN:
+		return kvm_xen_set_evtchn(e, kvm, irq_source_id, level,
+				       line_status);
 	case KVM_IRQ_ROUTING_HV_SINT:
 		return kvm_hv_set_sint(e, kvm, irq_source_id, level,
 				       line_status);
@@ -325,6 +329,13 @@  int kvm_set_routing_entry(struct kvm *kvm,
 		e->hv_sint.vcpu = ue->u.hv_sint.vcpu;
 		e->hv_sint.sint = ue->u.hv_sint.sint;
 		break;
+	case KVM_IRQ_ROUTING_XEN_EVTCHN:
+		e->set = kvm_xen_set_evtchn;
+		e->evtchn.vcpu = ue->u.evtchn.vcpu;
+		e->evtchn.vector = ue->u.evtchn.vector;
+		e->evtchn.via = ue->u.evtchn.via;
+
+		return kvm_xen_setup_evtchn(kvm, e);
 	default:
 		return -EINVAL;
 	}
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 4fdc4c71245a..99a3722146d8 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -7,6 +7,7 @@ 
 
 #include "x86.h"
 #include "xen.h"
+#include "ioapic.h"
 
 #include <linux/kvm_host.h>
 #include <linux/sched/stat.h>
@@ -17,6 +18,111 @@ 
 
 #include "trace.h"
 
+static void *xen_vcpu_info(struct kvm_vcpu *v);
+
+int kvm_xen_has_interrupt(struct kvm_vcpu *vcpu)
+{
+	struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+	struct vcpu_info *vcpu_info = xen_vcpu_info(vcpu);
+
+	if (!!atomic_read(&vcpu_xen->cb.queued) || (vcpu_info &&
+	    test_bit(0, (unsigned long *) &vcpu_info->evtchn_upcall_pending)))
+		return 1;
+
+	return -1;
+}
+
+int kvm_xen_get_interrupt(struct kvm_vcpu *vcpu)
+{
+	struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+	u32 vector = vcpu_xen->cb.vector;
+
+	if (kvm_xen_has_interrupt(vcpu) == -1)
+		return 0;
+
+	atomic_set(&vcpu_xen->cb.queued, 0);
+	return vector;
+}
+
+static int kvm_xen_do_upcall(struct kvm *kvm, u32 dest_vcpu,
+			     u32 via, u32 vector, int level)
+{
+	struct kvm_vcpu_xen *vcpu_xen;
+	struct kvm_lapic_irq irq;
+	struct kvm_vcpu *vcpu;
+
+	if (vector > 0xff || vector < 0x10 || dest_vcpu >= KVM_MAX_VCPUS)
+		return -EINVAL;
+
+	vcpu = kvm_get_vcpu(kvm, dest_vcpu);
+	if (!vcpu)
+		return -EINVAL;
+
+	memset(&irq, 0, sizeof(irq));
+	if (via == KVM_XEN_CALLBACK_VIA_VECTOR) {
+		vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+		atomic_set(&vcpu_xen->cb.queued, 1);
+		kvm_make_request(KVM_REQ_EVENT, vcpu);
+		kvm_vcpu_kick(vcpu);
+	} else if (via == KVM_XEN_CALLBACK_VIA_EVTCHN) {
+		irq.shorthand = APIC_DEST_SELF;
+		irq.dest_mode = APIC_DEST_PHYSICAL;
+		irq.delivery_mode = APIC_DM_FIXED;
+		irq.vector = vector;
+		irq.level = level;
+
+		/* Deliver upcall to a vector on the destination vcpu */
+		kvm_irq_delivery_to_apic(kvm, vcpu->arch.apic, &irq, NULL);
+	} else {
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+int kvm_xen_set_evtchn(struct kvm_kernel_irq_routing_entry *e,
+		   struct kvm *kvm, int irq_source_id, int level,
+		   bool line_status)
+{
+	/*
+	 * The routing information for the kirq specifies the vector
+	 * on the destination vcpu.
+	 */
+	return kvm_xen_do_upcall(kvm, e->evtchn.vcpu, e->evtchn.via,
+				 e->evtchn.vector, level);
+}
+
+int kvm_xen_setup_evtchn(struct kvm *kvm,
+			 struct kvm_kernel_irq_routing_entry *e)
+{
+	struct kvm_vcpu_xen *vcpu_xen;
+	struct kvm_vcpu *vcpu = NULL;
+
+	if (e->evtchn.vector > 0xff || e->evtchn.vector < 0x10)
+		return -EINVAL;
+
+	/* Expect vcpu to be sane */
+	if (e->evtchn.vcpu >= KVM_MAX_VCPUS)
+		return -EINVAL;
+
+	vcpu = kvm_get_vcpu(kvm, e->evtchn.vcpu);
+	if (!vcpu)
+		return -EINVAL;
+
+	vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+	if (e->evtchn.via == KVM_XEN_CALLBACK_VIA_VECTOR) {
+		vcpu_xen->cb.via = KVM_XEN_CALLBACK_VIA_VECTOR;
+		vcpu_xen->cb.vector = e->evtchn.vector;
+	} else if (e->evtchn.via == KVM_XEN_CALLBACK_VIA_EVTCHN) {
+		vcpu_xen->cb.via = KVM_XEN_CALLBACK_VIA_EVTCHN;
+		vcpu_xen->cb.vector = e->evtchn.vector;
+	} else {
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static void set_vcpu_attr(struct kvm_vcpu *v, u16 type, gpa_t gpa, void *addr)
 {
 	struct kvm_vcpu_xen *vcpu_xen = vcpu_to_xen_vcpu(v);
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index 2feef68ee80f..6a42e134924a 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -25,6 +25,15 @@  bool kvm_xen_hypercall_enabled(struct kvm *kvm);
 bool kvm_xen_hypercall_set(struct kvm *kvm);
 int kvm_xen_hypercall(struct kvm_vcpu *vcpu);
 
+int kvm_xen_has_interrupt(struct kvm_vcpu *vcpu);
+int kvm_xen_get_interrupt(struct kvm_vcpu *vcpu);
+
+int kvm_xen_set_evtchn(struct kvm_kernel_irq_routing_entry *e,
+		       struct kvm *kvm, int irq_source_id, int level,
+		       bool line_status);
+int kvm_xen_setup_evtchn(struct kvm *kvm,
+			 struct kvm_kernel_irq_routing_entry *e);
+
 void kvm_xen_destroy_vm(struct kvm *kvm);
 void kvm_xen_vcpu_uninit(struct kvm_vcpu *vcpu);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d55c63db09b..af5e7455ff6a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -350,6 +350,29 @@  struct kvm_hv_sint {
 	u32 sint;
 };
 
+/*
+ * struct kvm_xen_evtchn: currently specifies the upcall vector setup to
+ * deliver the interrupt to the guest.
+ *
+ * via = XEN_PARAM_CALLBACK_VIA_TYPE_GSI|_PCI
+ *    vcpu: always deliver to vcpu-0
+ *    vector: is used as upcall-vector
+ *    EOI: none
+ * via = XEN_PARAM_CALLBACK_VIA_TYPE_VECTOR
+ *    vcpu: deliver to specified vcpu
+ *    vector: used as upcall-vector
+ *    EOI: none
+ * via = XEN_PARAM_CALLBACK_VIA_TYPE_EVTCHN
+ *    vcpu: deliver to specified vcpu (vector should be bound to the vcpu)
+ *    vector: used as upcall-vector
+ *    EOI: expected
+ */
+struct kvm_xen_evtchn {
+	u32 via;
+	u32 vcpu;
+	u32 vector;
+};
+
 struct kvm_kernel_irq_routing_entry {
 	u32 gsi;
 	u32 type;
@@ -370,6 +393,7 @@  struct kvm_kernel_irq_routing_entry {
 		} msi;
 		struct kvm_s390_adapter_int adapter;
 		struct kvm_hv_sint hv_sint;
+		struct kvm_xen_evtchn evtchn;
 	};
 	struct hlist_node link;
 };
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 682ea00abd58..49001f681cd1 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1035,11 +1035,18 @@  struct kvm_irq_routing_hv_sint {
 	__u32 sint;
 };
 
+struct kvm_irq_routing_xen_evtchn {
+	__u32 via;
+	__u32 vcpu;
+	__u32 vector;
+};
+
 /* gsi routing entry types */
 #define KVM_IRQ_ROUTING_IRQCHIP 1
 #define KVM_IRQ_ROUTING_MSI 2
 #define KVM_IRQ_ROUTING_S390_ADAPTER 3
 #define KVM_IRQ_ROUTING_HV_SINT 4
+#define KVM_IRQ_ROUTING_XEN_EVTCHN 5
 
 struct kvm_irq_routing_entry {
 	__u32 gsi;
@@ -1051,6 +1058,7 @@  struct kvm_irq_routing_entry {
 		struct kvm_irq_routing_msi msi;
 		struct kvm_irq_routing_s390_adapter adapter;
 		struct kvm_irq_routing_hv_sint hv_sint;
+		struct kvm_irq_routing_xen_evtchn evtchn;
 		__u32 pad[8];
 	} u;
 };
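For completeness, a VMM would consume this uAPI by installing a KVM_IRQ_ROUTING_XEN_EVTCHN entry via KVM_SET_GSI_ROUTING and raising the GSI with KVM_IRQ_LINE. The sketch below only fills in the routing entry; the struct layout and constants are mirrored locally from the diff above (they are not in released <linux/kvm.h> headers), and the simplified `xen_evtchn_route` stands in for the full `struct kvm_irq_routing_entry`:

```c
#include <stdint.h>
#include <string.h>

/* Mirrored from the diff above; not in released <linux/kvm.h> */
#define KVM_IRQ_ROUTING_XEN_EVTCHN	5
#define KVM_XEN_CALLBACK_VIA_VECTOR	2	/* enum kvm_xen_callback_via */

struct kvm_irq_routing_xen_evtchn {
	uint32_t via;
	uint32_t vcpu;
	uint32_t vector;
};

/* Simplified stand-in for struct kvm_irq_routing_entry */
struct xen_evtchn_route {
	uint32_t gsi;
	uint32_t type;
	struct kvm_irq_routing_xen_evtchn evtchn;
};

/*
 * Build a routing entry that upcalls @vector on @vcpu. Note @vcpu is
 * the vcpu *index*, not the (APIC) vcpu id, per the commit message;
 * kvm_xen_setup_evtchn() rejects vectors outside 0x10..0xff.
 */
struct xen_evtchn_route build_xen_evtchn_route(uint32_t gsi, uint32_t vcpu,
					       uint32_t vector, uint32_t via)
{
	struct xen_evtchn_route e;

	memset(&e, 0, sizeof(e));
	e.gsi = gsi;
	e.type = KVM_IRQ_ROUTING_XEN_EVTCHN;
	e.evtchn.via = via;
	e.evtchn.vcpu = vcpu;
	e.evtchn.vector = vector;
	return e;
}
```

With this series applied, the real entry would be placed in a `struct kvm_irq_routing` array and passed to `ioctl(vm_fd, KVM_SET_GSI_ROUTING, ...)`.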