
[RFC,11/39] KVM: x86/xen: evtchn signaling via eventfd

Message ID 20190220201609.28290-12-joao.m.martins@oracle.com (mailing list archive)
State New, archived
Series x86/KVM: Xen HVM guest support

Commit Message

Joao Martins Feb. 20, 2019, 8:15 p.m. UTC
Userspace registers a @port to an @eventfd, which is bound to a
@vcpu. This information is then used when the guest does an
EVTCHNOP_send with a port registered with the kernel.

EVTCHNOP_send short-circuiting happens by marking the event as pending
in the shared info and vcpu info pages and doing the upcall. For IPIs
and interdomain event channels, we do the upcall on the assigned vcpu.

After an event is bound, the guest or host may wish to bind it to a
particular vcpu. This is usually done for unbound and interdomain
events. Update requests are handled via the KVM_XEN_EVENTFD_UPDATE
flag.

Unregistered ports are handled by the emulator.
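
As a rough illustration, a VMM might register an IPI port along these
lines (a sketch only, assuming the KVM_XEN_HVM_SET_ATTR ioctl added
earlier in this series; setup and error handling elided):

	struct kvm_xen_hvm_attr ha = {
		.type = KVM_XEN_ATTR_TYPE_EVTCHN,
		.u.evtchn = {
			.type = XEN_EVTCHN_TYPE_IPI,
			.port = port,		/* guest evtchn port number */
			.vcpu = vcpu_id,	/* vcpu that receives the upcall */
			.fd   = -1,		/* no eventfd: upcall from the kernel */
		},
	};

	/* vm_fd is the KVM VM file descriptor */
	if (ioctl(vm_fd, KVM_XEN_HVM_SET_ATTR, &ha) < 0)
		perror("KVM_XEN_ATTR_TYPE_EVTCHN");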

Co-developed-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/kvm_host.h |   3 +
 arch/x86/kvm/x86.c              |   1 +
 arch/x86/kvm/xen.c              | 238 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/xen.h              |   2 +
 include/uapi/linux/kvm.h        |  20 ++++
 5 files changed, 264 insertions(+)

Comments

David Woodhouse Nov. 30, 2020, 9:41 a.m. UTC | #1
On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
> userspace registers a @port to an @eventfd, that is bound to a
>  @vcpu. This information is then used when the guest does an
> EVTCHNOP_send with a port registered with the kernel.

Why do I want this part?

> EVTCHNOP_send short-circuiting happens by marking the event as pending
> in the shared info and vcpu info pages and doing the upcall. For IPIs
> and interdomain event channels, we do the upcall on the assigned vcpu.

This part I understand, 'peeking' at the EVTCHNOP_send hypercall so
that we can short-circuit IPI delivery without it having to bounce
through userspace.

But why would I then want to short-circuit the short-circuit,
providing an eventfd for it to signal... so that I can then just
receive the event in userspace in a *different* form to the original
hypercall exit I would have got?
Joao Martins Nov. 30, 2020, 12:17 p.m. UTC | #2
On 11/30/20 9:41 AM, David Woodhouse wrote:
> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
>> userspace registers a @port to an @eventfd, that is bound to a
>>  @vcpu. This information is then used when the guest does an
>> EVTCHNOP_send with a port registered with the kernel.
> 
> Why do I want this part?
> 
It is unnecessary churn to support eventfd at this point.

The patch subject/name is also a tad misleading, as it implements
the event channel port offloading, with the optional fd being just a
small additional detail.

>> EVTCHNOP_send short-circuiting happens by marking the event as pending
>> in the shared info and vcpu info pages and doing the upcall. For IPIs
>> and interdomain event channels, we do the upcall on the assigned vcpu.
> 
> This part I understand, 'peeking' at the EVTCHNOP_send hypercall so
> that we can short-circuit IPI delivery without it having to bounce
> through userspace.
> 
> But why would I then want then short-circuit the short-circuit,
> providing an eventfd for it to signal... so that I can then just
> receive the event in userspace in a *different* form to the original
> hypercall exit I would have got?
> 

One thing I didn't quite do at the time is the whitelisting of unregistered
ports to userspace. Right now it's a blacklist, i.e. if a port is not handled
in the kernel (IPIs, timer vIRQ, etc) it goes back to userspace. Whereas the
only ones which should go to userspace are those explicitly requested as
such; everything else should return -ENOENT in the hypercall.

Perhaps eventfd could be a way to express this? Like if you register
without an eventfd it's offloaded, otherwise it's assigned to userspace,
or, if the port isn't registered at all, an error is returned without
bothering the VMM.

But still, eventfd is probably unnecessary complexity when another @type
(XEN_EVTCHN_TYPE_USER) would serve: just exit to userspace and let it
route its evtchn port handling to its own I/O handling thread.
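
Roughly the kind of dispatch I mean (a sketch only; XEN_EVTCHN_TYPE_USER
and the exit helper below are hypothetical, not part of this patch):

	evtchnfd = idr_find(&vcpu->kvm->arch.xen.port_to_evt, send.port);
	if (!evtchnfd)
		return -ENOENT;		/* unknown port: nack in the kernel */

	if (evtchnfd->type == XEN_EVTCHN_TYPE_USER)
		return kvm_xen_hcall_exit_to_user(vcpu);  /* VMM handles it */

	/* IPI/vIRQ registered with the kernel: mark pending and upcall */
	if (!kvm_xen_evtchn_set_pending(vcpu, evtchnfd))
		kvm_xen_evtchnfd_upcall(vcpu, evtchnfd);
	return 0;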

	Joao
David Woodhouse Nov. 30, 2020, 12:55 p.m. UTC | #3
On Mon, 2020-11-30 at 12:17 +0000, Joao Martins wrote:
> On 11/30/20 9:41 AM, David Woodhouse wrote:
> > On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
> > > userspace registers a @port to an @eventfd, that is bound to a
> > >  @vcpu. This information is then used when the guest does an
> > > EVTCHNOP_send with a port registered with the kernel.
> > 
> > Why do I want this part?
> > 
> 
> It is unnecessary churn to support eventfd at this point.
> 
> The patch subject/name is also a tad misleading, as
> it implements the event channel port offloading with the optional fd
> being just a small detail in addition.

Right, I'd got that the commit title overstated its importance, but was
wondering why the optional fd existed at all.

I looked through the later xen_shim parts, half expecting it to be used
there... but no, they add their own special case next to the place
where eventfd_signal() gets called, instead of hooking the shim up via
an eventfd.

> > > EVTCHNOP_send short-circuiting happens by marking the event as pending
> > > in the shared info and vcpu info pages and doing the upcall. For IPIs
> > > and interdomain event channels, we do the upcall on the assigned vcpu.
> > 
> > This part I understand, 'peeking' at the EVTCHNOP_send hypercall so
> > that we can short-circuit IPI delivery without it having to bounce
> > through userspace.
> > 
> > But why would I then want then short-circuit the short-circuit,
> > providing an eventfd for it to signal... so that I can then just
> > receive the event in userspace in a *different* form to the original
> > hypercall exit I would have got?
> > 
> 
> One thing I didn't quite do at the time, is the whitelisting of unregistered
> ports to userspace. Right now, it's a blacklist i.e. if it's not handled in
> the kernel (IPIs, timer vIRQ, etc) it goes back to userspace. When the only
> ones which go to userspace should be explicitly requested as such
> and otherwise return -ENOENT in the hypercall.

Hm, why would -ENOENT be a fast path which needs to be handled in the
kernel?

> Perhaps eventfd could be a way to express this? Like if you register
> without an eventfd it's offloaded, otherwise it's assigned to userspace,
> or if neither it's then returned an error without bothering the VMM.

I much prefer the simple model where the *only* event channels that the
kernel knows about are the ones it's expected to handle.

For any others, the bypass doesn't kick in, and userspace gets the
KVM_EXIT_HYPERCALL exit.

> But still, eventfd is probably unnecessary complexity when another @type
> (XEN_EVTCHN_TYPE_USER) would serve, and then just exiting to userspace
> and let it route its evtchn port handling to the its own I/O handling thread.

Hmm... so the benefit of the eventfd is that we can wake the I/O thread
directly instead of bouncing out to userspace on the vCPU thread only
for it to send a signal and return to the guest? Did you ever use that,
and is it worth the additional in-kernel code?

Is there any reason we'd want that for IPI or VIRQ event channels, or
can it be only for INTERDOM/UNBOUND event channels which come later?

I'm tempted to leave it out of the first patch, and *maybe* add it back
in a later patch, putting it in the union alongside .virq.type.


                struct kvm_xen_eventfd {
 
 #define XEN_EVTCHN_TYPE_VIRQ 0
 #define XEN_EVTCHN_TYPE_IPI  1
                        __u32 type;
                        __u32 port;
                        __u32 vcpu;
-                       __s32 fd;
 
 #define KVM_XEN_EVENTFD_DEASSIGN       (1 << 0)
 #define KVM_XEN_EVENTFD_UPDATE         (1 << 1)
                        __u32 flags;
                        union {
                                struct {
                                        __u8 type;
                                } virq;
+                              struct {
+                                      __s32 eventfd;
+                              } interdom; /* including unbound */
                                __u32 padding[2];
                        };
               } evtchn;

Does that make sense to you?
Joao Martins Nov. 30, 2020, 3:08 p.m. UTC | #4
On 11/30/20 12:55 PM, David Woodhouse wrote:
> On Mon, 2020-11-30 at 12:17 +0000, Joao Martins wrote:
>> On 11/30/20 9:41 AM, David Woodhouse wrote:
>>> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
>>>> EVTCHNOP_send short-circuiting happens by marking the event as pending
>>>> in the shared info and vcpu info pages and doing the upcall. For IPIs
>>>> and interdomain event channels, we do the upcall on the assigned vcpu.
>>>
>>> This part I understand, 'peeking' at the EVTCHNOP_send hypercall so
>>> that we can short-circuit IPI delivery without it having to bounce
>>> through userspace.
>>>
>>> But why would I then want then short-circuit the short-circuit,
>>> providing an eventfd for it to signal... so that I can then just
>>> receive the event in userspace in a *different* form to the original
>>> hypercall exit I would have got?
>>>
>>
>> One thing I didn't quite do at the time, is the whitelisting of unregistered
>> ports to userspace. Right now, it's a blacklist i.e. if it's not handled in
>> the kernel (IPIs, timer vIRQ, etc) it goes back to userspace. When the only
>> ones which go to userspace should be explicitly requested as such
>> and otherwise return -ENOENT in the hypercall.
> 
> Hm, why would -ENOENT be a fast path which needs to be handled in the
> kernel?
> 
It's not that it's a fast path.

Like sending an event to an unbound port, which now becomes a possible
vector to worry about in the userspace VMM, e.g. should that port lookup
logic be fragile.

So it's more along the lines of nacking the invalid port earlier, rather
than going to userspace to invalidate it, provided we do the lookup in the
kernel anyway.

>> Perhaps eventfd could be a way to express this? Like if you register
>> without an eventfd it's offloaded, otherwise it's assigned to userspace,
>> or if neither it's then returned an error without bothering the VMM.
> 
> I much prefer the simple model where the *only* event channels that the
> kernel knows about are the ones it's expected to handle.
> 
> For any others, the bypass doesn't kick in, and userspace gets the
> KVM_EXIT_HYPERCALL exit.
> 
/me nods

I should comment on your other patch but: if we're going to make the
userspace hypercall handling generic, we might as well move Hyper-V there
too. In this series I added KVM_EXIT_XEN, much like the existing
KVM_EXIT_HYPERV -- but with a generic version I wonder if a capability
could gate KVM_EXIT_HYPERCALL to handle both guest types, while disabling
KVM_EXIT_HYPERV. But this is probably the subject of its own separate patch :)

>> But still, eventfd is probably unnecessary complexity when another @type
>> (XEN_EVTCHN_TYPE_USER) would serve, and then just exiting to userspace
>> and let it route its evtchn port handling to the its own I/O handling thread.
> 
> Hmm... so the benefit of the eventfd is that we can wake the I/O thread
> directly instead of bouncing out to userspace on the vCPU thread only
> for it to send a signal and return to the guest? Did you ever use that,
> and it is worth the additional in-kernel code?
> 
This was my own preemptive optimization to the interface -- it's not worth
the added code for vIRQ and IPI at this point which is what *for sure* the
kernel will handle.

> Is there any reason we'd want that for IPI or VIRQ event channels, or
> can it be only for INTERDOM/UNBOUND event channels which come later?
> 
/me nods.

No reason to keep that for IPI/vIRQ.

> I'm tempted to leave it out of the first patch, and *maybe* add it back
> in a later patch, putting it in the union alongside .virq.type.
> 
> 
>                 struct kvm_xen_eventfd {
>  
>  #define XEN_EVTCHN_TYPE_VIRQ 0
>  #define XEN_EVTCHN_TYPE_IPI  1
>                         __u32 type;
>                         __u32 port;
>                         __u32 vcpu;
> -                       __s32 fd;
>  
>  #define KVM_XEN_EVENTFD_DEASSIGN       (1 << 0)
>  #define KVM_XEN_EVENTFD_UPDATE         (1 << 1)
>                         __u32 flags;
>                         union {
>                                 struct {
>                                         __u8 type;
>                                 } virq;
> +                              struct {
> +                                      __s32 eventfd;
> +                              } interdom; /* including unbound */
>                                 __u32 padding[2];
>                         };
>                } evtchn;
> 
> Does that make sense to you?
> 
Yeap! :)

	Joao
David Woodhouse Nov. 30, 2020, 4:48 p.m. UTC | #5
On Mon, 2020-11-30 at 15:08 +0000, Joao Martins wrote:
> On 11/30/20 12:55 PM, David Woodhouse wrote:
> > On Mon, 2020-11-30 at 12:17 +0000, Joao Martins wrote:
> > > On 11/30/20 9:41 AM, David Woodhouse wrote:
> > > > On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
> > > One thing I didn't quite do at the time, is the whitelisting of unregistered
> > > ports to userspace. Right now, it's a blacklist i.e. if it's not handled in
> > > the kernel (IPIs, timer vIRQ, etc) it goes back to userspace. When the only
> > > ones which go to userspace should be explicitly requested as such
> > > and otherwise return -ENOENT in the hypercall.
> > 
> > Hm, why would -ENOENT be a fast path which needs to be handled in the
> > kernel?
> > 
> 
> It's not that it's a fast path.
> 
> Like sending an event channel to an unbound vector, now becomes an possible vector to
> worry about in userspace VMM e.g. should that port lookup logic be fragile.
> 
> So it's more along the lines of Nack-ing the invalid port earlier to rather go
> to go userspace to invalidate it, provided we do the lookup anyway in the kernel.

If the port lookup logic is fragile, I *want* it in the sandboxed
userspace VMM and not in the kernel :)

And unless we're going to do *all* of the EVTCHNOP_bind*, EVTCHN_close,
etc. handling in the kernel, doesn't userspace have to have all that
logic for managing the port space anyway?

I think it's better to let userspace own it outright, and use the
kernel bypass purely for the fast paths. The VMM can even implement
IPI/VIRQ support in userspace, then use the kernel bypass if/when it's
available.

> > > Perhaps eventfd could be a way to express this? Like if you register
> > > without an eventfd it's offloaded, otherwise it's assigned to userspace,
> > > or if neither it's then returned an error without bothering the VMM.
> > 
> > I much prefer the simple model where the *only* event channels that the
> > kernel knows about are the ones it's expected to handle.
> > 
> > For any others, the bypass doesn't kick in, and userspace gets the
> > KVM_EXIT_HYPERCALL exit.
> > 
> 
> /me nods
> 
> I should comment on your other patch but: if we're going to make it generic for
> the userspace hypercall handling, might as well move hyper-v there too. In this series,
> I added KVM_EXIT_XEN, much like it exists KVM_EXIT_HYPERV -- but with a generic version
> I wonder if a capability could gate KVM_EXIT_HYPERCALL to handle both guest types, while
> disabling KVM_EXIT_HYPERV. But this is probably subject of its own separate patch :)

There's a limit to how much consolidation we can do because the ABI is
different; the args are in different registers.

I do suspect Hyper-V should have marshalled its arguments into the
existing kvm_run->arch.hypercall and used KVM_EXIT_HYPERCALL but I
don't think it makes sense to change it now since it's a user-facing
ABI. I don't want to follow its lead by inventing *another* gratuitous
exit type for Xen though.
Joao Martins Nov. 30, 2020, 5:15 p.m. UTC | #6
On 11/30/20 4:48 PM, David Woodhouse wrote:
> On Mon, 2020-11-30 at 15:08 +0000, Joao Martins wrote:
>> On 11/30/20 12:55 PM, David Woodhouse wrote:
>>> On Mon, 2020-11-30 at 12:17 +0000, Joao Martins wrote:
>>>> On 11/30/20 9:41 AM, David Woodhouse wrote:
>>>>> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
>>>> One thing I didn't quite do at the time, is the whitelisting of unregistered
>>>> ports to userspace. Right now, it's a blacklist i.e. if it's not handled in
>>>> the kernel (IPIs, timer vIRQ, etc) it goes back to userspace. When the only
>>>> ones which go to userspace should be explicitly requested as such
>>>> and otherwise return -ENOENT in the hypercall.
>>>
>>> Hm, why would -ENOENT be a fast path which needs to be handled in the
>>> kernel?
>>>
>>
>> It's not that it's a fast path.
>>
>> Like sending an event channel to an unbound vector, now becomes an possible vector to
>> worry about in userspace VMM e.g. should that port lookup logic be fragile.
>>
>> So it's more along the lines of Nack-ing the invalid port earlier to rather go
>> to go userspace to invalidate it, provided we do the lookup anyway in the kernel.
> 
> If the port lookup logic is fragile, I *want* it in the sandboxed
> userspace VMM and not in the kernel :)
> 
Yes definitely -- I think we are on the same page on that.

But it's just that we do the lookup *anyways* to check if the kernel has a given
evtchn port registered. That's the lookup I am talking about here, with just an
extra bit to tell that it's a userspace-handled port.

> And unless we're going to do *all* of the EVTCHNOP_bind*, EVTCHN_close,
> etc. handling in the kernel, doesn't userspace have to have all that
> logic for managing the port space anyway?
> 
Indeed.

> I think it's better to let userspace own it outright, and use the
> kernel bypass purely for the fast paths. The VMM can even implement
> IPI/VIRQ support in userspace, then use the kernel bypass if/when it's
> available.
> 
True, and it's pretty much how it's implemented today.

But I felt it was still worth having this discussion ... should this be
considered or discarded? I suppose we stick with the latter for now.

>>>> Perhaps eventfd could be a way to express this? Like if you register
>>>> without an eventfd it's offloaded, otherwise it's assigned to userspace,
>>>> or if neither it's then returned an error without bothering the VMM.
>>>
>>> I much prefer the simple model where the *only* event channels that the
>>> kernel knows about are the ones it's expected to handle.
>>>
>>> For any others, the bypass doesn't kick in, and userspace gets the
>>> KVM_EXIT_HYPERCALL exit.
>>>
>>
>> /me nods
>>
>> I should comment on your other patch but: if we're going to make it generic for
>> the userspace hypercall handling, might as well move hyper-v there too. In this series,
>> I added KVM_EXIT_XEN, much like it exists KVM_EXIT_HYPERV -- but with a generic version
>> I wonder if a capability could gate KVM_EXIT_HYPERCALL to handle both guest types, while
>> disabling KVM_EXIT_HYPERV. But this is probably subject of its own separate patch :)
> 
> There's a limit to how much consolidation we can do because the ABI is
> different; the args are in different registers.
> 
Yes. It would be optionally enabled of course, and the VMM would have to
adjust to the new ABI -- we surely wouldn't want to break current users of
KVM_EXIT_HYPERV.

> I do suspect Hyper-V should have marshalled its arguments into the
> existing kvm_run->arch.hypercall and used KVM_EXIT_HYPERCALL but I
> don't think it makes sense to change it now since it's a user-facing
> ABI. I don't want to follow its lead by inventing *another* gratuitous
> exit type for Xen though.
> 
I definitely like the KVM_EXIT_HYPERCALL better than a KVM_EXIT_XEN userspace
exit type ;)

But I guess you still need to correlate the type of hypercall (Xen guest cap
enabled?) to tell whether it's Xen or KVM, in order to specially enlighten
certain opcodes (EVTCHNOP_send).

	Joao
David Woodhouse Nov. 30, 2020, 6:01 p.m. UTC | #7
On Mon, 2020-11-30 at 17:15 +0000, Joao Martins wrote:
> On 11/30/20 4:48 PM, David Woodhouse wrote:
> > On Mon, 2020-11-30 at 15:08 +0000, Joao Martins wrote:
> > > On 11/30/20 12:55 PM, David Woodhouse wrote:
> > > > On Mon, 2020-11-30 at 12:17 +0000, Joao Martins wrote:
> > > > > On 11/30/20 9:41 AM, David Woodhouse wrote:
> > > > > > On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
> > > > > 
> > > > > One thing I didn't quite do at the time, is the whitelisting of unregistered
> > > > > ports to userspace.

...

> But felt it was still worth having this discussion ... should this be
> considered or discarded. I suppose we stick with the later for now.

Ack. Duly discarded :)

> > > > > Perhaps eventfd could be a way to express this? Like if you register
> > > > > without an eventfd it's offloaded, otherwise it's assigned to userspace,
> > > > > or if neither it's then returned an error without bothering the VMM.
> > > > 
> > > > I much prefer the simple model where the *only* event channels that the
> > > > kernel knows about are the ones it's expected to handle.
> > > > 
> > > > For any others, the bypass doesn't kick in, and userspace gets the
> > > > KVM_EXIT_HYPERCALL exit.
> > > > 
> > > 
> > > /me nods
> > > 
> > > I should comment on your other patch but: if we're going to make it generic for
> > > the userspace hypercall handling, might as well move hyper-v there too. In this series,
> > > I added KVM_EXIT_XEN, much like it exists KVM_EXIT_HYPERV -- but with a generic version
> > > I wonder if a capability could gate KVM_EXIT_HYPERCALL to handle both guest types, while
> > > disabling KVM_EXIT_HYPERV. But this is probably subject of its own separate patch :)
> > 
> > There's a limit to how much consolidation we can do because the ABI is
> > different; the args are in different registers.
> > 
> 
> Yes. It would be optionally enabled of course and VMM would have to adjust to the new ABI
> -- surely wouldn't want to break current users of KVM_EXIT_HYPERV.

True, but that means we'd have to keep KVM_EXIT_HYPERV around anyway,
and can't actually *remove* it. The "consolidation" gives us more
complexity, not less.

> > I do suspect Hyper-V should have marshalled its arguments into the
> > existing kvm_run->arch.hypercall and used KVM_EXIT_HYPERCALL but I
> > don't think it makes sense to change it now since it's a user-facing
> > ABI. I don't want to follow its lead by inventing *another* gratuitous
> > exit type for Xen though.
> > 
> 
> I definitely like the KVM_EXIT_HYPERCALL better than a KVM_EXIT_XEN userspace
> exit type ;)
> 
> But I guess you still need to co-relate a type of hypercall (Xen guest cap enabled?) to
> tell it's Xen or KVM to specially enlighten certain opcodes (EVTCHNOP_send).

Sure, but if the VMM doesn't know what kind of guest it's hosting, we
have bigger problems... :)
Joao Martins Nov. 30, 2020, 6:41 p.m. UTC | #8
On 11/30/20 6:01 PM, David Woodhouse wrote:
> On Mon, 2020-11-30 at 17:15 +0000, Joao Martins wrote:
>> On 11/30/20 4:48 PM, David Woodhouse wrote:
>>> On Mon, 2020-11-30 at 15:08 +0000, Joao Martins wrote:
>>>> On 11/30/20 12:55 PM, David Woodhouse wrote:
>>>>> On Mon, 2020-11-30 at 12:17 +0000, Joao Martins wrote:
>>>>>> On 11/30/20 9:41 AM, David Woodhouse wrote:
>>>>>>> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:

[...]

>>>> I should comment on your other patch but: if we're going to make it generic for
>>>> the userspace hypercall handling, might as well move hyper-v there too. In this series,
>>>> I added KVM_EXIT_XEN, much like it exists KVM_EXIT_HYPERV -- but with a generic version
>>>> I wonder if a capability could gate KVM_EXIT_HYPERCALL to handle both guest types, while
>>>> disabling KVM_EXIT_HYPERV. But this is probably subject of its own separate patch :)
>>>
>>> There's a limit to how much consolidation we can do because the ABI is
>>> different; the args are in different registers.
>>>
>>
>> Yes. It would be optionally enabled of course and VMM would have to adjust to the new ABI
>> -- surely wouldn't want to break current users of KVM_EXIT_HYPERV.
> 
> True, but that means we'd have to keep KVM_EXIT_HYPERV around anyway,
> and can't actually *remove* it. The "consolidation" gives us more
> complexity, not less.
>
Fair point.

>>> I do suspect Hyper-V should have marshalled its arguments into the
>>> existing kvm_run->arch.hypercall and used KVM_EXIT_HYPERCALL but I
>>> don't think it makes sense to change it now since it's a user-facing
>>> ABI. I don't want to follow its lead by inventing *another* gratuitous
>>> exit type for Xen though.
>>>
>>
>> I definitely like the KVM_EXIT_HYPERCALL better than a KVM_EXIT_XEN userspace
>> exit type ;)
>>
>> But I guess you still need to co-relate a type of hypercall (Xen guest cap enabled?) to
>> tell it's Xen or KVM to specially enlighten certain opcodes (EVTCHNOP_send).
> 
> Sure, but if the VMM doesn't know what kind of guest it's hosting, we
> have bigger problems... :)
> 
Right :)

I was referring to the kernel here.

Eventually we need to special-case things for a given guest type, e.g.:

int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
{
...
        if (kvm_hv_hypercall_enabled(vcpu->kvm))
                return kvm_hv_hypercall(...);

        if (kvm_xen_hypercall_enabled(vcpu->kvm))
                return kvm_xen_hypercall(...);
...
}

And in kvm_xen_hypercall(), for the cases the VMM offloads, demarshal what
the registers mean, e.g. for an event channel send from a 64-bit guest: RAX
for the opcode and RDI/RSI for cmd and port.

The kernel logic wouldn't be much different at the core, so I thought of this consolidation.
But the added complexity would have come from having to deal with two userspace exit types
-- indeed probably not worth the trouble as you pointed out.
David Woodhouse Nov. 30, 2020, 7:04 p.m. UTC | #9
On Mon, 2020-11-30 at 18:41 +0000, Joao Martins wrote:
> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> {
> ...
>         if (kvm_hv_hypercall_enabled(vcpu->kvm))
>                 return kvm_hv_hypercall(...);
> 
>         if (kvm_xen_hypercall_enabled(vcpu->kvm))
>                 return kvm_xen_hypercall(...);
> ...
> }
> 
> And on kvm_xen_hypercall() for the cases VMM offloads to demarshal what the registers mean
> e.g. for event channel send 64-bit guest: RAX for opcode and RDI/RSI for cmd and port.

Right, although it's a little more abstract than that: "RDI/RSI for
arg#0, arg#1 respectively".

And those are RDI/RSI for 64-bit Xen, EBX/ECX for 32-bit Xen, and
RBX/RDI for Hyper-V. (And Hyper-V seems to use only the two, while Xen
theoretically has up to 6).
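
Something like this on the KVM side for the Xen cases (a sketch only, using
the same kvm_register_read() style as the patch below; just the first two
args shown):

	if (longmode) {		/* 64-bit Xen ABI: args in RDI, RSI, ... */
		args[0] = kvm_register_read(vcpu, VCPU_REGS_RDI);
		args[1] = kvm_register_read(vcpu, VCPU_REGS_RSI);
	} else {		/* 32-bit Xen ABI: args in EBX, ECX, ... */
		args[0] = kvm_register_read(vcpu, VCPU_REGS_RBX);
		args[1] = kvm_register_read(vcpu, VCPU_REGS_RCX);
	}
	/* Hyper-V would instead pull its two args per the note above. */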

> The kernel logic wouldn't be much different at the core, so thought of tihs consolidation.
> But the added complexity would have come from having to deal with two userspace exit types
> -- indeed probably not worth the trouble as you pointed out.

Yeah, I think I'm just going to move the 'kvm_userspace_hypercall()'
from my patch to be 'kvm_xen_hypercall()' in a new xen.c but still
using KVM_EXIT_HYPERCALL. Then I can rebase your other patches on top
of that, with the evtchn bypass.
Joao Martins Nov. 30, 2020, 7:25 p.m. UTC | #10
On 11/30/20 7:04 PM, David Woodhouse wrote:
> On Mon, 2020-11-30 at 18:41 +0000, Joao Martins wrote:
>> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>> {
>> ...
>>         if (kvm_hv_hypercall_enabled(vcpu->kvm))
>>                 return kvm_hv_hypercall(...);
>>
>>         if (kvm_xen_hypercall_enabled(vcpu->kvm))
>>                 return kvm_xen_hypercall(...);
>> ...
>> }
>>
>> And on kvm_xen_hypercall() for the cases VMM offloads to demarshal what the registers mean
>> e.g. for event channel send 64-bit guest: RAX for opcode and RDI/RSI for cmd and port.
> 
> Right, although it's a little more abstract than that: "RDI/RSI for
> arg#0, arg#1 respectively".
> 
> And those are RDI/RSI for 64-bit Xen, EBX/ECX for 32-bit Xen, and
> RBX/RDI for Hyper-V. (And Hyper-V seems to use only the two, while Xen
> theoretically has up to 6).
> 
Indeed, it almost reminds me of my other patch for Xen hypercalls -- it was
handling 32-bit and 64-bit that way:

https://lore.kernel.org/kvm/20190220201609.28290-3-joao.m.martins@oracle.com/

>> The kernel logic wouldn't be much different at the core, so thought of tihs consolidation.
>> But the added complexity would have come from having to deal with two userspace exit types
>> -- indeed probably not worth the trouble as you pointed out.
> 
> Yeah, I think I'm just going to move the 'kvm_userspace_hypercall()'
> from my patch to be 'kvm_xen_hypercall()' in a new xen.c but still
> using KVM_EXIT_HYPERCALL. Then I can rebase your other patches on top
> of that, with the evtchn bypass.
> 
Yeap, makes sense.

	Joao
David Woodhouse Nov. 23, 2021, 1:15 p.m. UTC | #11
So... what we have in the kernel already is entirely sufficient to run
real Xen guests so they don't know they're running under KVM — with PV
netback and blkback, Xen timers, etc.

We can even run the PV Xen shim itself, to host PV guests.

There are some pieces we really want to move into the kernel though,
IPIs being top of the list — we really don't want them bouncing out to
the userspace VMM. In-kernel timers would be good, too.

I've finally got the basic 2l event channel delivery working in the
kernel, with a minor detour into fixing nesting UAF bugs. The first
simple use case for that is routing MSIs of assigned PCI devices to Xen
PIRQs. So I'm ready to stare at the patch in $SUBJECT again...

On Mon, 2020-11-30 at 15:08 +0000, Joao Martins wrote:
> On 11/30/20 12:55 PM, David Woodhouse wrote:
> > On Mon, 2020-11-30 at 12:17 +0000, Joao Martins wrote:
> > > But still, eventfd is probably unnecessary complexity when another @type
> > > (XEN_EVTCHN_TYPE_USER) would serve, and then just exiting to userspace
> > > and let it route its evtchn port handling to the its own I/O handling thread.
> > 
> > Hmm... so the benefit of the eventfd is that we can wake the I/O thread
> > directly instead of bouncing out to userspace on the vCPU thread only
> > for it to send a signal and return to the guest? Did you ever use that,
> > and it is worth the additional in-kernel code?
> > 
> This was my own preemptive optimization to the interface -- it's not worth
> the added code for vIRQ and IPI at this point which is what *for sure* the
> kernel will handle.
> 
> > Is there any reason we'd want that for IPI or VIRQ event channels, or
> > can it be only for INTERDOM/UNBOUND event channels which come later?
> > 
> /me nods.
> 
> No reason to keep that for IPI/vIRQ.
> 
> > I'm tempted to leave it out of the first patch, and *maybe* add it back
> > in a later patch, putting it in the union alongside .virq.type.
> > 
> > 
> >                 struct kvm_xen_eventfd {
> >  
> >  #define XEN_EVTCHN_TYPE_VIRQ 0
> >  #define XEN_EVTCHN_TYPE_IPI  1
> >                         __u32 type;
> >                         __u32 port;
> >                         __u32 vcpu;
> > -                       __s32 fd;
> >  
> >  #define KVM_XEN_EVENTFD_DEASSIGN       (1 << 0)
> >  #define KVM_XEN_EVENTFD_UPDATE         (1 << 1)
> >                         __u32 flags;
> >                         union {
> >                                 struct {
> >                                         __u8 type;
> >                                 } virq;
> > +                              struct {
> > +                                      __s32 eventfd;
> > +                              } interdom; /* including unbound */
> >                                 __u32 padding[2];
> >                         };
> >                } evtchn;
> > 
> > Does that make sense to you?
> > 
> Yeap! :)

It turns out there are actually two cases for interdom event channels.
As well as signalling "other domains" (i.e. userspace) they can also be
set up as loopback, so sending an event on one port actually raises an
event on another of the same guest's ports, a bit like an IPI.

So it isn't quite as simple as "if we handle interdom in the kernel it
has an eventfd". 

It also means we need to provide both *source* and *target* ports.
Which brings me to the KVM_XEN_EVENTFD_DEASSIGN/KVM_XEN_EVENTFD_UPDATE
flags that you used. I'm not sure I like that API; given that we now
need separate source and target ports, can we handle 'deassign' just by
setting the target to zero (and the eventfd to -1)?

I think your original patch leaked refcounts on the eventfd in
kvm_xen_eventfd_update(), because it would call eventfd_ctx_fdget() and
then neither use nor release the result?

So I'm thinking of making it just a 'set'. And in fact I'm thinking of
modelling it on the irq routing table by having a single call to
set/update all of them at once?
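
For illustration, something along these lines (names entirely made up,
analogous to how KVM_SET_GSI_ROUTING takes a whole routing table in one
call):

	struct kvm_xen_evtchn_entry {
		__u32 port;		/* source port the guest sends to */
		__u32 type;		/* IPI, VIRQ, interdom, ... */
		__u32 vcpu;		/* vcpu for the 2l upcall */
		__u32 target_port;	/* 0 = deassign / not routed */
		__s32 eventfd;		/* -1 = deliver in kernel only */
	};

	struct kvm_xen_evtchn_table {
		__u32 nr;
		__u32 pad;
		struct kvm_xen_evtchn_entry entries[];
	};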

Any objections?

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3305173bf10b..f31fcaf8fa7c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -859,6 +859,9 @@  struct kvm_xen {
 
 	gfn_t shinfo_addr;
 	struct shared_info *shinfo;
+
+	struct idr port_to_evt;
+	struct mutex xen_lock;
 };
 
 enum kvm_xen_callback_via {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 11b9ff2bd901..76bd23113ccd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9181,6 +9181,7 @@  int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
 
+	kvm_xen_init_vm(kvm);
 	kvm_hv_init_vm(kvm);
 	kvm_page_track_init(kvm);
 	kvm_mmu_init_vm(kvm);
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 99a3722146d8..1fbdfa7c4356 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -10,14 +10,28 @@ 
 #include "ioapic.h"
 
 #include <linux/kvm_host.h>
+#include <linux/eventfd.h>
 #include <linux/sched/stat.h>
 
 #include <trace/events/kvm.h>
 #include <xen/interface/xen.h>
 #include <xen/interface/vcpu.h>
+#include <xen/interface/event_channel.h>
 
 #include "trace.h"
 
+struct evtchnfd {
+	struct eventfd_ctx *ctx;
+	u32 vcpu;
+	u32 port;
+	u32 type;
+	union {
+		struct {
+			u8 type;
+		} virq;
+	};
+};
+
 static void *xen_vcpu_info(struct kvm_vcpu *v);
 
 int kvm_xen_has_interrupt(struct kvm_vcpu *vcpu)
@@ -80,6 +94,13 @@  static int kvm_xen_do_upcall(struct kvm *kvm, u32 dest_vcpu,
 	return 0;
 }
 
+static void kvm_xen_evtchnfd_upcall(struct kvm_vcpu *vcpu, struct evtchnfd *e)
+{
+	struct kvm_vcpu_xen *vx = vcpu_to_xen_vcpu(vcpu);
+
+	kvm_xen_do_upcall(vcpu->kvm, e->vcpu, vx->cb.via, vx->cb.vector, 0);
+}
+
 int kvm_xen_set_evtchn(struct kvm_kernel_irq_routing_entry *e,
 		   struct kvm *kvm, int irq_source_id, int level,
 		   bool line_status)
@@ -329,6 +350,12 @@  int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
 		r = 0;
 		break;
 	}
+	case KVM_XEN_ATTR_TYPE_EVTCHN: {
+		struct kvm_xen_eventfd xevfd = data->u.evtchn;
+
+		r = kvm_vm_ioctl_xen_eventfd(kvm, &xevfd);
+		break;
+	}
 	default:
 		break;
 	}
@@ -388,10 +415,96 @@  static int kvm_xen_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
 	return kvm_skip_emulated_instruction(vcpu);
 }
 
+static int kvm_xen_evtchn_2l_vcpu_set_pending(struct vcpu_info *v)
+{
+	return test_and_set_bit(0, (unsigned long *) &v->evtchn_upcall_pending);
+}
+
+#define BITS_PER_EVTCHN_WORD (sizeof(xen_ulong_t)*8)
+
+static int kvm_xen_evtchn_2l_set_pending(struct shared_info *shared_info,
+					 struct vcpu_info *vcpu_info,
+					 int p)
+{
+	if (test_and_set_bit(p, (unsigned long *) shared_info->evtchn_pending))
+		return 1;
+
+	if (!test_bit(p, (unsigned long *) shared_info->evtchn_mask) &&
+	    !test_and_set_bit(p / BITS_PER_EVTCHN_WORD,
+			      (unsigned long *) &vcpu_info->evtchn_pending_sel))
+		return kvm_xen_evtchn_2l_vcpu_set_pending(vcpu_info);
+
+	return 1;
+}
+
+#undef BITS_PER_EVTCHN_WORD
+
+static int kvm_xen_evtchn_set_pending(struct kvm_vcpu *svcpu,
+				      struct evtchnfd *evfd)
+{
+	struct kvm_vcpu_xen *vcpu_xen;
+	struct vcpu_info *vcpu_info;
+	struct shared_info *shared_info;
+	struct kvm_vcpu *vcpu;
+
+	vcpu = kvm_get_vcpu(svcpu->kvm, evfd->vcpu);
+	if (!vcpu)
+		return -ENOENT;
+
+	vcpu_xen = vcpu_to_xen_vcpu(vcpu);
+	shared_info = (struct shared_info *) vcpu->kvm->arch.xen.shinfo;
+	vcpu_info = (struct vcpu_info *) vcpu_xen->vcpu_info;
+
+	return kvm_xen_evtchn_2l_set_pending(shared_info, vcpu_info,
+					     evfd->port);
+}
+
+static int kvm_xen_evtchn_send(struct kvm_vcpu *vcpu, int port)
+{
+	struct eventfd_ctx *eventfd;
+	struct evtchnfd *evtchnfd;
+
+	/* conn_to_evt is protected by vcpu->kvm->srcu */
+	evtchnfd = idr_find(&vcpu->kvm->arch.xen.port_to_evt, port);
+	if (!evtchnfd)
+		return -ENOENT;
+
+	eventfd = evtchnfd->ctx;
+	if (!kvm_xen_evtchn_set_pending(vcpu, evtchnfd)) {
+		if (!eventfd)
+			kvm_xen_evtchnfd_upcall(vcpu, evtchnfd);
+		else
+			eventfd_signal(eventfd, 1);
+	}
+
+	return 0;
+}
+
+static int kvm_xen_hcall_evtchn_send(struct kvm_vcpu *vcpu, int cmd, u64 param)
+{
+	struct evtchn_send send;
+	gpa_t gpa;
+	int idx;
+
+	/* Port management is done in userspace */
+	if (cmd != EVTCHNOP_send)
+		return -EINVAL;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+	gpa = kvm_mmu_gva_to_gpa_system(vcpu, param, NULL);
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+	if (!gpa || kvm_vcpu_read_guest(vcpu, gpa, &send, sizeof(send)))
+		return -EFAULT;
+
+	return kvm_xen_evtchn_send(vcpu, send.port);
+}
+
 int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
 {
 	bool longmode;
 	u64 input, params[5];
+	int r;
 
 	input = (u64)kvm_register_read(vcpu, VCPU_REGS_RAX);
 
@@ -415,6 +528,19 @@  int kvm_xen_hypercall(struct kvm_vcpu *vcpu)
 	trace_kvm_xen_hypercall(input, params[0], params[1], params[2],
 				params[3], params[4]);
 
+	switch (input) {
+	case __HYPERVISOR_event_channel_op:
+		r = kvm_xen_hcall_evtchn_send(vcpu, params[0],
+					      params[1]);
+		if (!r) {
+			kvm_xen_hypercall_set_result(vcpu, r);
+			return kvm_skip_emulated_instruction(vcpu);
+		}
+	/* fallthrough */
+	default:
+		break;
+	}
+
 	vcpu->run->exit_reason = KVM_EXIT_XEN;
 	vcpu->run->xen.type = KVM_EXIT_XEN_HCALL;
 	vcpu->run->xen.u.hcall.input = input;
@@ -441,6 +567,12 @@  void kvm_xen_vcpu_uninit(struct kvm_vcpu *vcpu)
 		put_page(virt_to_page(vcpu_xen->steal_time));
 }
 
+void kvm_xen_init_vm(struct kvm *kvm)
+{
+	mutex_init(&kvm->arch.xen.xen_lock);
+	idr_init(&kvm->arch.xen.port_to_evt);
+}
+
 void kvm_xen_destroy_vm(struct kvm *kvm)
 {
 	struct kvm_xen *xen = &kvm->arch.xen;
@@ -448,3 +580,109 @@  void kvm_xen_destroy_vm(struct kvm *kvm)
 	if (xen->shinfo)
 		put_page(virt_to_page(xen->shinfo));
 }
+
+static int kvm_xen_eventfd_update(struct kvm *kvm, struct idr *port_to_evt,
+				  struct mutex *port_lock,
+				  struct kvm_xen_eventfd *args)
+{
+	struct eventfd_ctx *eventfd = NULL;
+	struct evtchnfd *evtchnfd;
+
+	mutex_lock(port_lock);
+	evtchnfd = idr_find(port_to_evt, args->port);
+	mutex_unlock(port_lock);
+
+	if (!evtchnfd)
+		return -ENOENT;
+
+	if (args->fd != -1) {
+		eventfd = eventfd_ctx_fdget(args->fd);
+		if (IS_ERR(eventfd))
+			return PTR_ERR(eventfd);
+	}
+
+	evtchnfd->vcpu = args->vcpu;
+	return 0;
+}
+
+static int kvm_xen_eventfd_assign(struct kvm *kvm, struct idr *port_to_evt,
+				  struct mutex *port_lock,
+				  struct kvm_xen_eventfd *args)
+{
+	struct eventfd_ctx *eventfd = NULL;
+	struct evtchnfd *evtchnfd;
+	u32 port = args->port;
+	int ret;
+
+	if (args->fd != -1) {
+		eventfd = eventfd_ctx_fdget(args->fd);
+		if (IS_ERR(eventfd))
+			return PTR_ERR(eventfd);
+	}
+
+	evtchnfd =  kzalloc(sizeof(struct evtchnfd), GFP_KERNEL);
+	if (!evtchnfd)
+		return -ENOMEM;
+
+	evtchnfd->ctx = eventfd;
+	evtchnfd->port = port;
+	evtchnfd->vcpu = args->vcpu;
+	evtchnfd->type = args->type;
+	if (evtchnfd->type == XEN_EVTCHN_TYPE_VIRQ)
+		evtchnfd->virq.type = args->virq.type;
+
+	mutex_lock(port_lock);
+	ret = idr_alloc(port_to_evt, evtchnfd, port, port + 1,
+			GFP_KERNEL);
+	mutex_unlock(port_lock);
+
+	if (ret >= 0)
+		return 0;
+
+	if (ret == -ENOSPC)
+		ret = -EEXIST;
+
+	if (eventfd)
+		eventfd_ctx_put(eventfd);
+	kfree(evtchnfd);
+	return ret;
+}
+
+static int kvm_xen_eventfd_deassign(struct kvm *kvm, struct idr *port_to_evt,
+				  struct mutex *port_lock, u32 port)
+{
+	struct evtchnfd *evtchnfd;
+
+	mutex_lock(port_lock);
+	evtchnfd = idr_remove(port_to_evt, port);
+	mutex_unlock(port_lock);
+
+	if (!evtchnfd)
+		return -ENOENT;
+
+	if (kvm)
+		synchronize_srcu(&kvm->srcu);
+	if (evtchnfd->ctx)
+		eventfd_ctx_put(evtchnfd->ctx);
+	kfree(evtchnfd);
+	return 0;
+}
+
+int kvm_vm_ioctl_xen_eventfd(struct kvm *kvm, struct kvm_xen_eventfd *args)
+{
+	struct kvm_xen *xen = &kvm->arch.xen;
+	int allowed_flags = (KVM_XEN_EVENTFD_DEASSIGN | KVM_XEN_EVENTFD_UPDATE);
+
+	if ((args->flags & (~allowed_flags)) ||
+	    (args->port <= 0))
+		return -EINVAL;
+
+	if (args->flags == KVM_XEN_EVENTFD_DEASSIGN)
+		return kvm_xen_eventfd_deassign(kvm, &xen->port_to_evt,
+						&xen->xen_lock, args->port);
+	if (args->flags == KVM_XEN_EVENTFD_UPDATE)
+		return kvm_xen_eventfd_update(kvm, &xen->port_to_evt,
+					      &xen->xen_lock, args);
+	return kvm_xen_eventfd_assign(kvm, &xen->port_to_evt,
+				      &xen->xen_lock, args);
+}
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index 6a42e134924a..8f26625564c8 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -34,7 +34,9 @@  int kvm_xen_set_evtchn(struct kvm_kernel_irq_routing_entry *e,
 int kvm_xen_setup_evtchn(struct kvm *kvm,
 			 struct kvm_kernel_irq_routing_entry *e);
 
+void kvm_xen_init_vm(struct kvm *kvm);
 void kvm_xen_destroy_vm(struct kvm *kvm);
+int kvm_vm_ioctl_xen_eventfd(struct kvm *kvm, struct kvm_xen_eventfd *args);
 void kvm_xen_vcpu_uninit(struct kvm_vcpu *vcpu);
 
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 49001f681cd1..4eae47a0ef63 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1479,6 +1479,25 @@  struct kvm_xen_hvm_attr {
 			__u32 vcpu;
 			__u64 gpa;
 		} vcpu_attr;
+		struct kvm_xen_eventfd {
+
+#define XEN_EVTCHN_TYPE_VIRQ 0
+#define XEN_EVTCHN_TYPE_IPI  1
+			__u32 type;
+			__u32 port;
+			__u32 vcpu;
+			__s32 fd;
+
+#define KVM_XEN_EVENTFD_DEASSIGN	(1 << 0)
+#define KVM_XEN_EVENTFD_UPDATE		(1 << 1)
+			__u32 flags;
+			union {
+				struct {
+					__u8 type;
+				} virq;
+				__u32 padding[2];
+			};
+		} evtchn;
 	} u;
 };
 
@@ -1487,6 +1506,7 @@  struct kvm_xen_hvm_attr {
 #define KVM_XEN_ATTR_TYPE_VCPU_INFO         0x1
 #define KVM_XEN_ATTR_TYPE_VCPU_TIME_INFO    0x2
 #define KVM_XEN_ATTR_TYPE_VCPU_RUNSTATE     0x3
+#define KVM_XEN_ATTR_TYPE_EVTCHN            0x4
 
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {