diff mbox

[v5,2/2] kvm: x86: hyperv: guest->host event signaling via eventfd

Message ID 20171212160742.793-3-rkagan@virtuozzo.com (mailing list archive)
State New, archived
Headers show

Commit Message

Roman Kagan Dec. 12, 2017, 4:07 p.m. UTC
In Hyper-V, the fast guest->host notification mechanism is the
SIGNAL_EVENT hypercall, with a single parameter of the connection ID to
signal.

Currently this hypercall incurs a user exit and requires the userspace
to decode the parameters and trigger the notification of the potentially
different I/O context.

To avoid the costly user exit, process this hypercall and signal the
corresponding eventfd in KVM, similar to ioeventfd.  The association
between the connection id and the eventfd is established via the newly
introduced KVM_HYPERV_EVENTFD ioctl, and maintained in an
(srcu-protected) IDR.

Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
---
v4 -> v5:
 - fix block comment formatting

v3 -> v4:
 - switch to kvm_vcpu_read_guest and take srcu_read_lock around it
 - rework and document the interpretation of the hypercall parameter
 - merge !fast version into kvm_hvcall_signal_event for brevity

v2 -> v3:
 - expand docs on allowed values and return codes
 - fix uninitialized return
 - style fixes

v1 -> v2:
 - make data types consistent
 - get by without the recently dropped struct hv_input_signal_event
 - fix subject prefix

 Documentation/virtual/kvm/api.txt |  31 +++++++++++
 arch/x86/include/asm/kvm_host.h   |   2 +
 arch/x86/kvm/hyperv.h             |   1 +
 include/uapi/linux/kvm.h          |  13 +++++
 arch/x86/kvm/hyperv.c             | 110 +++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c                |  10 ++++
 6 files changed, 166 insertions(+), 1 deletion(-)

Comments

Paolo Bonzini Dec. 12, 2017, 4:22 p.m. UTC | #1
On 12/12/2017 17:07, Roman Kagan wrote:
> +		idx = srcu_read_lock(&vcpu->kvm->srcu);
> +		ret = kvm_vcpu_read_guest(vcpu, gpa, &param, sizeof(param));
> +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +

The lock/unlock is not needed (vcpu_enter_guest -> vmx_handle_exit ->
handle_vmcall -> kvm_emulate_hypercall -> kvm_hv_hypercall ->
kvm_hvcall_signal_event).  I'll drop it.

Thanks,
David Hildenbrand Dec. 12, 2017, 4:29 p.m. UTC | #2
On 12.12.2017 17:22, Paolo Bonzini wrote:
> On 12/12/2017 17:07, Roman Kagan wrote:
>> +		idx = srcu_read_lock(&vcpu->kvm->srcu);
>> +		ret = kvm_vcpu_read_guest(vcpu, gpa, &param, sizeof(param));
>> +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
>> +
> 
> The lock/unlock is not needed (vcpu_enter_guest -> vmx_handle_exit ->
> handle_vmcall -> kvm_emulate_hypercall -> kvm_hv_hypercall ->
> kvm_hvcall_signal_event).  I'll drop it.
> 
> Thanks,
> 

Feel free to add my

Reviewed-by: David Hildenbrand <david@redhat.com>

(although it would be good to have another look at the hyperv specific
conn_id handling)
Roman Kagan Dec. 12, 2017, 5:03 p.m. UTC | #3
On Tue, Dec 12, 2017 at 05:22:20PM +0100, Paolo Bonzini wrote:
> On 12/12/2017 17:07, Roman Kagan wrote:
> > +		idx = srcu_read_lock(&vcpu->kvm->srcu);
> > +		ret = kvm_vcpu_read_guest(vcpu, gpa, &param, sizeof(param));
> > +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > +
> 
> The lock/unlock is not needed (vcpu_enter_guest -> vmx_handle_exit ->
> handle_vmcall -> kvm_emulate_hypercall -> kvm_hv_hypercall ->
> kvm_hvcall_signal_event).  I'll drop it.

Oh crap, indeed.  I'll respin soonish.

Thanks,
Roman.
Roman Kagan Dec. 12, 2017, 6:18 p.m. UTC | #4
On Tue, Dec 12, 2017 at 05:29:02PM +0100, David Hildenbrand wrote:
> On 12.12.2017 17:22, Paolo Bonzini wrote:
> > On 12/12/2017 17:07, Roman Kagan wrote:
> >> +		idx = srcu_read_lock(&vcpu->kvm->srcu);
> >> +		ret = kvm_vcpu_read_guest(vcpu, gpa, &param, sizeof(param));
> >> +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
> >> +
> > 
> > The lock/unlock is not needed (vcpu_enter_guest -> vmx_handle_exit ->
> > handle_vmcall -> kvm_emulate_hypercall -> kvm_hv_hypercall ->
> > kvm_hvcall_signal_event).  I'll drop it.
> > 
> > Thanks,
> > 
> 
> Feel free to add my
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>
> 
> (although it would be good to have another look at the hyperv specific
> conn_id handling)

Are you talking about this weird flag_no part of the hypercall parameter
which we add to the conn_id?

Yes this is something I'm not quite comfortable with: we've never seen
it being anything but 0, and Linux guest drivers ignore its existence
altogether (and there seem to be no fields in the vmbus protocol
structures to let the guest know of the allowed numbers there).

So I may be misinterpreting the spec; and it may be prudent just to
reject all flag_no != 0 hypercalls until we actually start seeing ones,
and then research how to handle them correctly.  What do you think?

Thanks,
Roman.
David Hildenbrand Dec. 12, 2017, 6:20 p.m. UTC | #5
On 12.12.2017 19:18, Roman Kagan wrote:
> On Tue, Dec 12, 2017 at 05:29:02PM +0100, David Hildenbrand wrote:
>> On 12.12.2017 17:22, Paolo Bonzini wrote:
>>> On 12/12/2017 17:07, Roman Kagan wrote:
>>>> +		idx = srcu_read_lock(&vcpu->kvm->srcu);
>>>> +		ret = kvm_vcpu_read_guest(vcpu, gpa, &param, sizeof(param));
>>>> +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
>>>> +
>>>
>>> The lock/unlock is not needed (vcpu_enter_guest -> vmx_handle_exit ->
>>> handle_vmcall -> kvm_emulate_hypercall -> kvm_hv_hypercall ->
>>> kvm_hvcall_signal_event).  I'll drop it.
>>>
>>> Thanks,
>>>
>>
>> Feel free to add my
>>
>> Reviewed-by: David Hildenbrand <david@redhat.com>
>>
>> (although it would be good to have another look at the hyperv specific
>> conn_id handling)
> 
> Are you talking about this weird flag_no part of the hypercall parameter
> which we add to the conn_id?
> 
> Yes this is something I'm not quite comfortable with: we've never seen
> it being anything but 0, and Linux guest drivers ignore its existence
> altogether (and there seem to be no fields in the vmbus protocol
> structures to let the guest know of the allowed numbers there).
> 
> So I may be misinterpreting the spec; and it may be prudent just to
> reject all flag_no != 0 hypercalls until we actually start seeing ones,
> and then research how to handle them correctly.  What do you think?

Could be done, but if it isn't too much trouble, let's try to get it
right from the beginning.

Can you point me at the spec so I can verify?

> 
> Thanks,
> Roman.
>
Roman Kagan Dec. 13, 2017, 8:41 a.m. UTC | #6
On Tue, Dec 12, 2017 at 07:20:52PM +0100, David Hildenbrand wrote:
> On 12.12.2017 19:18, Roman Kagan wrote:
> > On Tue, Dec 12, 2017 at 05:29:02PM +0100, David Hildenbrand wrote:
> >> On 12.12.2017 17:22, Paolo Bonzini wrote:
> >>> On 12/12/2017 17:07, Roman Kagan wrote:
> >> (although it would be good to have another look at the hyperv specific
> >> conn_id handling)
> > 
> > Are you talking about this weird flag_no part of the hypercall parameter
> > which we add to the conn_id?
> > 
> > Yes this is something I'm not quite comfortable with: we've never seen
> > it being anything but 0, and Linux guest drivers ignore its existence
> > altogether (and there seem to be no fields in the vmbus protocol
> > structures to let the guest know of the allowed numbers there).
> > 
> > So I may be misinterpreting the spec; and it may be prudent just to
> > reject all flag_no != 0 hypercalls until we actually start seeing ones,
> > and then research how to handle them correctly.  What do you think?
> 
> Could be done, but if it isn't too much trouble, let's try to get it
> right from the beginning.
> 
> Can you point me at the spec so I can verify?

Sure.  The Hyper-V spec is

https://github.com/Microsoft/Virtualization-Documentation/raw/master/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v5.0b.pdf

Hope you can read Microsoftish better than me.  The relevant part is
chapter 11, "Inter-Partition Communication"; the hypercall itself is
described in "11.11.2 HvSignalEvent".

However, the VMBus guest drivers in Linux use only a limited subset of
those.  In particular, the concept of ports seems completely unused, the
connections appear pre-established, and the parent partition informs the
guest of the connection to signal for a specific channel via a u32 field
in the channel offer message.
(See drivers/hv/channel_mgmt.c:vmbus_onoffer for how the connection_id
from the offer is recorded, and drivers/hv/connection.c:vmbus_set_event
for how it's used for signaling.)

So I'm tempted to think this extra flag number is just the usual
overdesign, and I'd better require it to be zero.  This is certainly the
case for all the VMBus devices currently supported by the Linux guest
drivers, so we should be good with it too.

Roman.
David Hildenbrand Dec. 13, 2017, 9:35 a.m. UTC | #7
On 13.12.2017 09:41, Roman Kagan wrote:
> On Tue, Dec 12, 2017 at 07:20:52PM +0100, David Hildenbrand wrote:
>> On 12.12.2017 19:18, Roman Kagan wrote:
>>> On Tue, Dec 12, 2017 at 05:29:02PM +0100, David Hildenbrand wrote:
>>>> On 12.12.2017 17:22, Paolo Bonzini wrote:
>>>>> On 12/12/2017 17:07, Roman Kagan wrote:
>>>> (although it would be good to have another look at the hyperv specific
>>>> conn_id handling)
>>>
>>> Are you talking about this weird flag_no part of the hypercall parameter
>>> which we add to the conn_id?
>>>
>>> Yes this is something I'm not quite comfortable with: we've never seen
>>> it being anything but 0, and Linux guest drivers ignore its existence
>>> altogether (and there seem to be no fields in the vmbus protocol
>>> structures to let the guest know of the allowed numbers there).
>>>
>>> So I may be misinterpreting the spec; and it may be prudent just to
>>> reject all flag_no != 0 hypercalls until we actually start seeing ones,
>>> and then research how to handle them correctly.  What do you think?
>>
>> Could be done, but if it isn't too much trouble, let's try to get it
>> right from the beginning.
>>
>> Can you point me at the spec so I can verify?
> 
> Sure.  The Hyper-V spec is
> 
> https://github.com/Microsoft/Virtualization-Documentation/raw/master/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v5.0b.pdf
> 
> Hope you can read Microsoftish better than me.  The relevant part is
> chapter 11, "Inter-Partition Communication"; the hypercall itself is
> described in "11.11.2 HvSignalEvent".
> 
> However, the VMBus guest drivers in Linux use only a limited subset of
> those.  In particular, the concept of ports seems completely unused, the
> connections appear pre-established, and the parent partition informs the
> guest of the connection to signal for a specific channel via a u32 field
> in the channel offer message.
> (See drivers/hv/channel_mgmt.c:vmbus_onoffer for how the connection_id
> from the offer is recorded, and drivers/hv/connection.c:vmbus_set_event
> for how it's used for signaling.)
> 
> So I'm tempted to think this extra flag number is just the usual
> overdesign, and I'd better require it to be zero.  This is certainly the
> case for all the VMBus devices currently supported by the Linux guest
> drivers, so we should be good with it too.
> 
> Roman.
> 

General question, what about Microsoft Windows guests?
David Hildenbrand Dec. 13, 2017, 9:51 a.m. UTC | #8
>  
> +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param)
> +{
> +	u16 ret;
> +	u32 conn_id, flag_no;
> +	int idx;
> +	struct eventfd_ctx *eventfd;
> +
> +	if (unlikely(!fast)) {
> +		gpa_t gpa = param;
> +
> +		if ((gpa & (__alignof__(param) - 1)) ||
> +		    offset_in_page(gpa) + sizeof(param) > PAGE_SIZE)
> +			return HV_STATUS_INVALID_ALIGNMENT;
> +
> +		idx = srcu_read_lock(&vcpu->kvm->srcu);
> +		ret = kvm_vcpu_read_guest(vcpu, gpa, &param, sizeof(param));
> +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +
> +		if (ret < 0)
> +			return HV_STATUS_INSUFFICIENT_MEMORY;
> +	}
> +
> +	/*
> +	 * the signaled event number is made up of a 24bit "connection id" and
> +	 * a 16bit "flag number"; on the hypervisor side it's only their sum
> +	 * that matters
> +	 */
> +	conn_id = param;
> +	flag_no = param >> 32;
> +	if ((conn_id & ~KVM_HYPERV_CONN_ID_MASK) || (flag_no & 0xffff0000))

1. You seem to check for RsvdZ here (flag_no & 0xffff0000),
  "HV_STATUS_INVALID_CONNECTION_ID: The specified connection ID is
   invalid."
   -> Shouldn't the reserved field be simply ignored?

2. KVM_HYPERV_CONN_ID_MASK. "ConnectionId (4 bytes)".

Why should it only be 3bytes?

(I am pretty sure I am missing something in the document here :) )


> +		return HV_STATUS_INVALID_CONNECTION_ID;
> +	conn_id += flag_no;

"FlagNumber specifies the relative index of the event flag that the
caller wants to set within the target"

I am not sure if we should simply change the connection id here. This
seems to be an additional parameter to be passed to the one connection id.


> +	if (conn_id & ~KVM_HYPERV_CONN_ID_MASK)
> +		return HV_STATUS_INVALID_CONNECTION_ID;
> +
> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> +	eventfd = idr_find(&vcpu->kvm->arch.hyperv.conn_to_evt, conn_id);
> +	if (eventfd)
> +		eventfd_signal(eventfd, 1);
> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +
> +	return eventfd ? HV_STATUS_SUCCESS : HV_STATUS_INVALID_CONNECTION_ID;
> +}
> +
David Hildenbrand Dec. 13, 2017, 9:55 a.m. UTC | #9
> +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param)
> +{
> +	u16 ret;
> +	u32 conn_id, flag_no;
> +	int idx;
> +	struct eventfd_ctx *eventfd;
> +
> +	if (unlikely(!fast)) {
> +		gpa_t gpa = param;
> +
> +		if ((gpa & (__alignof__(param) - 1)) ||
> +		    offset_in_page(gpa) + sizeof(param) > PAGE_SIZE)
> +			return HV_STATUS_INVALID_ALIGNMENT;
> +
> +		idx = srcu_read_lock(&vcpu->kvm->srcu);
> +		ret = kvm_vcpu_read_guest(vcpu, gpa, &param, sizeof(param));
> +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +
> +		if (ret < 0)
> +			return HV_STATUS_INSUFFICIENT_MEMORY;

"... event flags do not require any buffer allocation or queuing within
the hypervisor, so HvSignalEvent will never fail due to insufficient
resources."


> +	}
> +
> +	/*
> +	 * the signaled event number is made up of a 24bit "connection id" and
> +	 * a 16bit "flag number"; on the hypervisor side it's only their sum
> +	 * that matters
Roman Kagan Dec. 13, 2017, 10 a.m. UTC | #10
On Wed, Dec 13, 2017 at 10:35:40AM +0100, David Hildenbrand wrote:
> On 13.12.2017 09:41, Roman Kagan wrote:
> > On Tue, Dec 12, 2017 at 07:20:52PM +0100, David Hildenbrand wrote:
> >> On 12.12.2017 19:18, Roman Kagan wrote:
> >>> On Tue, Dec 12, 2017 at 05:29:02PM +0100, David Hildenbrand wrote:
> >>>> On 12.12.2017 17:22, Paolo Bonzini wrote:
> >>>>> On 12/12/2017 17:07, Roman Kagan wrote:
> >>>> (although it would be good to have another look at the hyperv specific
> >>>> conn_id handling)
> >>>
> >>> Are you talking about this weird flag_no part of the hypercall parameter
> >>> which we add to the conn_id?
> >>>
> >>> Yes this is something I'm not quite comfortable with: we've never seen
> >>> it being anything but 0, and Linux guest drivers ignore its existence
> >>> altogether (and there seem to be no fields in the vmbus protocol
> >>> structures to let the guest know of the allowed numbers there).
> >>>
> >>> So I may be misinterpreting the spec; and it may be prudent just to
> >>> reject all flag_no != 0 hypercalls until we actually start seeing ones,
> >>> and then research how to handle them correctly.  What do you think?
> >>
> >> Could be done, but if it isn't too much trouble, let's try to get it
> >> right from the beginning.
> >>
> >> Can you point me at the spec so I can verify?
> > 
> > Sure.  The Hyper-V spec is
> > 
> > https://github.com/Microsoft/Virtualization-Documentation/raw/master/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v5.0b.pdf
> > 
> > Hope you can read Microsoftish better than me.  The relevant part is
> > chapter 11, "Inter-Partition Communication"; the hypercall itself is
> > described in "11.11.2 HvSignalEvent".
> > 
> > However, the VMBus guest drivers in Linux use only a limited subset of
> > those.  In particular, the concept of ports seems completely unused, the
> > connections appear pre-established, and the parent partition informs the
> > guest of the connection to signal for a specific channel via a u32 field
> > in the channel offer message.
> > (See drivers/hv/channel_mgmt.c:vmbus_onoffer for how the connection_id
> > from the offer is recorded, and drivers/hv/connection.c:vmbus_set_event
> > for how it's used for signaling.)
> > 
> > So I'm tempted to think this extra flag number is just the usual
> > overdesign, and I'd better require it to be zero.  This is certainly the
> > case for all the VMBus devices currently supported by the Linux guest
> > drivers, so we should be good with it too.
> > 
> > Roman.
> > 
> 
> General question, what about Microsoft Windows guests?

Well, we haven't tried intercepting this hypercall in Windows guests on
the real Hyper-V to see if they ever use non-zero flag number.

On our implementation of VMBus in QEMU, based on the protocol deduced
from the Linux guest drivers, the Windows guests abide by the same
rules.  At least it was enough to get VMBus scsi, network, and util
devices working.

Roman.
Roman Kagan Dec. 13, 2017, 11:04 a.m. UTC | #11
On Wed, Dec 13, 2017 at 10:51:06AM +0100, David Hildenbrand wrote:
> > +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param)
> > +{
> > +	u16 ret;
> > +	u32 conn_id, flag_no;
> > +	int idx;
> > +	struct eventfd_ctx *eventfd;
> > +
> > +	if (unlikely(!fast)) {
> > +		gpa_t gpa = param;
> > +
> > +		if ((gpa & (__alignof__(param) - 1)) ||
> > +		    offset_in_page(gpa) + sizeof(param) > PAGE_SIZE)
> > +			return HV_STATUS_INVALID_ALIGNMENT;
> > +
> > +		idx = srcu_read_lock(&vcpu->kvm->srcu);
> > +		ret = kvm_vcpu_read_guest(vcpu, gpa, &param, sizeof(param));
> > +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > +
> > +		if (ret < 0)
> > +			return HV_STATUS_INSUFFICIENT_MEMORY;
> > +	}
> > +
> > +	/*
> > +	 * the signaled event number is made up of a 24bit "connection id" and
> > +	 * a 16bit "flag number"; on the hypervisor side it's only their sum
> > +	 * that matters
> > +	 */
> > +	conn_id = param;
> > +	flag_no = param >> 32;
> > +	if ((conn_id & ~KVM_HYPERV_CONN_ID_MASK) || (flag_no & 0xffff0000))
> 
> 1. You seem to check for RsvdZ here (flag_no & 0xffff0000),
>   "HV_STATUS_INVALID_CONNECTION_ID: The specified connection ID is
>    invalid."
>    -> Shouldn't the reserved field be simply ignored?

It's "reserved zero", so I think it's correct to reject non-zero values.

> 2. KVM_HYPERV_CONN_ID_MASK. "ConnectionId (4 bytes)".
> 
> Why should it only be 3bytes?
> 
> (I am pretty sure I am missing something in the document here :) )


"11.10.7 Connections

Connections are identified by 32-bit IDs. The high 8 bits are reserved
and must be zero.[...]"

> > +		return HV_STATUS_INVALID_CONNECTION_ID;
> > +	conn_id += flag_no;
> 
> "FlagNumber specifies the relative index of the event flag that the
> caller wants to set within the target"
> 
> I am not sure if we should simply change the connection id here. This
> seems to be an additional parameter to be passed to the one connection id.

Right.  I think I misinterpreted this part back when I implemented it in
QEMU (first submission was this summer).  We lived happily with that
code, because the FlagNuber was always zero, and then I copied that over
to this KVM patch.

I think requiring it to be zero is a better choice; I've prepared v7 to
that end.

Thanks,
Roman.
David Hildenbrand Dec. 13, 2017, 11:14 a.m. UTC | #12
On 13.12.2017 12:04, Roman Kagan wrote:
> On Wed, Dec 13, 2017 at 10:51:06AM +0100, David Hildenbrand wrote:
>>> +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param)
>>> +{
>>> +	u16 ret;
>>> +	u32 conn_id, flag_no;
>>> +	int idx;
>>> +	struct eventfd_ctx *eventfd;
>>> +
>>> +	if (unlikely(!fast)) {
>>> +		gpa_t gpa = param;
>>> +
>>> +		if ((gpa & (__alignof__(param) - 1)) ||
>>> +		    offset_in_page(gpa) + sizeof(param) > PAGE_SIZE)
>>> +			return HV_STATUS_INVALID_ALIGNMENT;
>>> +
>>> +		idx = srcu_read_lock(&vcpu->kvm->srcu);
>>> +		ret = kvm_vcpu_read_guest(vcpu, gpa, &param, sizeof(param));
>>> +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
>>> +
>>> +		if (ret < 0)
>>> +			return HV_STATUS_INSUFFICIENT_MEMORY;
>>> +	}
>>> +
>>> +	/*
>>> +	 * the signaled event number is made up of a 24bit "connection id" and
>>> +	 * a 16bit "flag number"; on the hypervisor side it's only their sum
>>> +	 * that matters
>>> +	 */
>>> +	conn_id = param;
>>> +	flag_no = param >> 32;
>>> +	if ((conn_id & ~KVM_HYPERV_CONN_ID_MASK) || (flag_no & 0xffff0000))
>>
>> 1. You seem to check for RsvdZ here (flag_no & 0xffff0000),
>>   "HV_STATUS_INVALID_CONNECTION_ID: The specified connection ID is
>>    invalid."
>>    -> Shouldn't the reserved field be simply ignored?
> 
> It's "reserved zero", so I think it's correct to reject non-zero values.

If it's not documented, guess it is ignored? At least
"HV_STATUS_INVALID_CONNECTION_ID" is wrong, as this is not the
connection id.

> 
>> 2. KVM_HYPERV_CONN_ID_MASK. "ConnectionId (4 bytes)".
>>
>> Why should it only be 3bytes?
>>
>> (I am pretty sure I am missing something in the document here :) )
> 
> 
> "11.10.7 Connections
> 
> Connections are identified by 32-bit IDs. The high 8 bits are reserved
> and must be zero.[...]"

This makes then sense indeed!

> 
>>> +		return HV_STATUS_INVALID_CONNECTION_ID;
>>> +	conn_id += flag_no;
>>
>> "FlagNumber specifies the relative index of the event flag that the
>> caller wants to set within the target"
>>
>> I am not sure if we should simply change the connection id here. This
>> seems to be an additional parameter to be passed to the one connection id.
> 
> Right.  I think I misinterpreted this part back when I implemented it in
> QEMU (first submission was this summer).  We lived happily with that
> code, because the FlagNuber was always zero, and then I copied that over
> to this KVM patch.
> 
> I think requiring it to be zero is a better choice; I've prepared v7 to
> that end.

But which return value to use? HV_STATUS_INVALID_PARAMETER?

> 
> Thanks,
> Roman.
>
Roman Kagan Dec. 13, 2017, 11:44 a.m. UTC | #13
On Wed, Dec 13, 2017 at 10:55:02AM +0100, David Hildenbrand wrote:
> 
> > +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param)
> > +{
> > +	u16 ret;
> > +	u32 conn_id, flag_no;
> > +	int idx;
> > +	struct eventfd_ctx *eventfd;
> > +
> > +	if (unlikely(!fast)) {
> > +		gpa_t gpa = param;
> > +
> > +		if ((gpa & (__alignof__(param) - 1)) ||
> > +		    offset_in_page(gpa) + sizeof(param) > PAGE_SIZE)
> > +			return HV_STATUS_INVALID_ALIGNMENT;
> > +
> > +		idx = srcu_read_lock(&vcpu->kvm->srcu);
> > +		ret = kvm_vcpu_read_guest(vcpu, gpa, &param, sizeof(param));
> > +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > +
> > +		if (ret < 0)
> > +			return HV_STATUS_INSUFFICIENT_MEMORY;
> 
> "... event flags do not require any buffer allocation or queuing within
> the hypervisor, so HvSignalEvent will never fail due to insufficient
> resources."

Ouch.  I wonder what else to map -EFAULT to then...  Looks like
HV_STATUS_INVALID_ALIGNMENT is the most appropriate (as in "The input or
output GPA pointer is not within the bounds of the GPA space.")

Thanks,
Roman.
David Hildenbrand Dec. 13, 2017, 11:46 a.m. UTC | #14
On 13.12.2017 12:44, Roman Kagan wrote:
> On Wed, Dec 13, 2017 at 10:55:02AM +0100, David Hildenbrand wrote:
>>
>>> +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param)
>>> +{
>>> +	u16 ret;
>>> +	u32 conn_id, flag_no;
>>> +	int idx;
>>> +	struct eventfd_ctx *eventfd;
>>> +
>>> +	if (unlikely(!fast)) {
>>> +		gpa_t gpa = param;
>>> +
>>> +		if ((gpa & (__alignof__(param) - 1)) ||
>>> +		    offset_in_page(gpa) + sizeof(param) > PAGE_SIZE)
>>> +			return HV_STATUS_INVALID_ALIGNMENT;
>>> +
>>> +		idx = srcu_read_lock(&vcpu->kvm->srcu);
>>> +		ret = kvm_vcpu_read_guest(vcpu, gpa, &param, sizeof(param));
>>> +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
>>> +
>>> +		if (ret < 0)
>>> +			return HV_STATUS_INSUFFICIENT_MEMORY;
>>
>> "... event flags do not require any buffer allocation or queuing within
>> the hypervisor, so HvSignalEvent will never fail due to insufficient
>> resources."
> 
> Ouch.  I wonder what else to map -EFAULT to then...  Looks like
> HV_STATUS_INVALID_ALIGNMENT is the most appropriate (as in "The input or
> output GPA pointer is not within the bounds of the GPA space.")
> 
> Thanks,
> Roman.
> 

It is likely happen rearelyn, but usually if we are out of memory we
should report to user space. (chances that we won't be able to continue
for longer either way are very high)
David Hildenbrand Dec. 13, 2017, 11:51 a.m. UTC | #15
On 13.12.2017 12:46, David Hildenbrand wrote:
> On 13.12.2017 12:44, Roman Kagan wrote:
>> On Wed, Dec 13, 2017 at 10:55:02AM +0100, David Hildenbrand wrote:
>>>
>>>> +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param)
>>>> +{
>>>> +	u16 ret;
>>>> +	u32 conn_id, flag_no;
>>>> +	int idx;
>>>> +	struct eventfd_ctx *eventfd;
>>>> +
>>>> +	if (unlikely(!fast)) {
>>>> +		gpa_t gpa = param;
>>>> +
>>>> +		if ((gpa & (__alignof__(param) - 1)) ||
>>>> +		    offset_in_page(gpa) + sizeof(param) > PAGE_SIZE)
>>>> +			return HV_STATUS_INVALID_ALIGNMENT;
>>>> +
>>>> +		idx = srcu_read_lock(&vcpu->kvm->srcu);
>>>> +		ret = kvm_vcpu_read_guest(vcpu, gpa, &param, sizeof(param));
>>>> +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
>>>> +
>>>> +		if (ret < 0)
>>>> +			return HV_STATUS_INSUFFICIENT_MEMORY;
>>>
>>> "... event flags do not require any buffer allocation or queuing within
>>> the hypervisor, so HvSignalEvent will never fail due to insufficient
>>> resources."
>>
>> Ouch.  I wonder what else to map -EFAULT to then...  Looks like
>> HV_STATUS_INVALID_ALIGNMENT is the most appropriate (as in "The input or
>> output GPA pointer is not within the bounds of the GPA space.")
>>
>> Thanks,
>> Roman.
>>
> 
> It is likely happen rearelyn, but usually if we are out of memory we
> should report to user space. (chances that we won't be able to continue
> for longer either way are very high)
> 
(I implied -ENOMEM is translated to -EFAULT) it can of course happen if
the address is wrong.

Also wonder how to deal with that. Could it be that we even have to
trigger a page fault in the guest?
David Hildenbrand Dec. 13, 2017, 11:57 a.m. UTC | #16
On 13.12.2017 12:51, David Hildenbrand wrote:
> On 13.12.2017 12:46, David Hildenbrand wrote:
>> On 13.12.2017 12:44, Roman Kagan wrote:
>>> On Wed, Dec 13, 2017 at 10:55:02AM +0100, David Hildenbrand wrote:
>>>>
>>>>> +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param)
>>>>> +{
>>>>> +	u16 ret;
>>>>> +	u32 conn_id, flag_no;
>>>>> +	int idx;
>>>>> +	struct eventfd_ctx *eventfd;
>>>>> +
>>>>> +	if (unlikely(!fast)) {
>>>>> +		gpa_t gpa = param;
>>>>> +
>>>>> +		if ((gpa & (__alignof__(param) - 1)) ||
>>>>> +		    offset_in_page(gpa) + sizeof(param) > PAGE_SIZE)
>>>>> +			return HV_STATUS_INVALID_ALIGNMENT;
>>>>> +
>>>>> +		idx = srcu_read_lock(&vcpu->kvm->srcu);
>>>>> +		ret = kvm_vcpu_read_guest(vcpu, gpa, &param, sizeof(param));
>>>>> +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
>>>>> +
>>>>> +		if (ret < 0)
>>>>> +			return HV_STATUS_INSUFFICIENT_MEMORY;
>>>>
>>>> "... event flags do not require any buffer allocation or queuing within
>>>> the hypervisor, so HvSignalEvent will never fail due to insufficient
>>>> resources."
>>>
>>> Ouch.  I wonder what else to map -EFAULT to then...  Looks like
>>> HV_STATUS_INVALID_ALIGNMENT is the most appropriate (as in "The input or
>>> output GPA pointer is not within the bounds of the GPA space.")
>>>
>>> Thanks,
>>> Roman.
>>>
>>
>> It is likely happen rearelyn, but usually if we are out of memory we
>> should report to user space. (chances that we won't be able to continue
>> for longer either way are very high)
>>
> (I implied -ENOMEM is translated to -EFAULT) it can of course happen if
> the address is wrong.
> 
> Also wonder how to deal with that. Could it be that we even have to
> trigger a page fault in the guest?
> 


The hypervisor will validate that the calling partition can read from
the input page before executing the
requested hypercall. This validation consists of two checks: the
specified GPA is mapped and the GPA is
marked readable. If either of these tests fails, the hypervisor
generates a memory intercept message.

So memory intercept message is the right thing to do I guess?
Roman Kagan Dec. 13, 2017, 12:16 p.m. UTC | #17
On Wed, Dec 13, 2017 at 12:14:48PM +0100, David Hildenbrand wrote:
> On 13.12.2017 12:04, Roman Kagan wrote:
> > On Wed, Dec 13, 2017 at 10:51:06AM +0100, David Hildenbrand wrote:
> >>> +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param)
> >>> +{
> >>> +	u16 ret;
> >>> +	u32 conn_id, flag_no;
> >>> +	int idx;
> >>> +	struct eventfd_ctx *eventfd;
> >>> +
> >>> +	if (unlikely(!fast)) {
> >>> +		gpa_t gpa = param;
> >>> +
> >>> +		if ((gpa & (__alignof__(param) - 1)) ||
> >>> +		    offset_in_page(gpa) + sizeof(param) > PAGE_SIZE)
> >>> +			return HV_STATUS_INVALID_ALIGNMENT;
> >>> +
> >>> +		idx = srcu_read_lock(&vcpu->kvm->srcu);
> >>> +		ret = kvm_vcpu_read_guest(vcpu, gpa, &param, sizeof(param));
> >>> +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
> >>> +
> >>> +		if (ret < 0)
> >>> +			return HV_STATUS_INSUFFICIENT_MEMORY;
> >>> +	}
> >>> +
> >>> +	/*
> >>> +	 * the signaled event number is made up of a 24bit "connection id" and
> >>> +	 * a 16bit "flag number"; on the hypervisor side it's only their sum
> >>> +	 * that matters
> >>> +	 */
> >>> +	conn_id = param;
> >>> +	flag_no = param >> 32;
> >>> +	if ((conn_id & ~KVM_HYPERV_CONN_ID_MASK) || (flag_no & 0xffff0000))
> >>
> >> 1. You seem to check for RsvdZ here (flag_no & 0xffff0000),
> >>   "HV_STATUS_INVALID_CONNECTION_ID: The specified connection ID is
> >>    invalid."
> >>    -> Shouldn't the reserved field be simply ignored?
> > 
> > It's "reserved zero", so I think it's correct to reject non-zero values.
> 
> If it's not documented, guess it is ignored?

It is (sort of) documented, per

"1.2 Reserved Values
[...]
Zero value (documented as RsvdZ in diagrams and ReservedZ in code
segments) – For maximum forward compatibility, clients should zero the
value within this field."

> At least "HV_STATUS_INVALID_CONNECTION_ID" is wrong, as this is not
> the connection id.

I (ab)used this return code here to fall through to userspace handling
of this hypercall.  But it may indeed make more sense to return
HV_STATUS_INVALID_HYPERCALL_INPUT, as in "A reserved bit in the
specified hypercall input value is non-zero."

> >>> +		return HV_STATUS_INVALID_CONNECTION_ID;
> >>> +	conn_id += flag_no;
> >>
> >> "FlagNumber specifies the relative index of the event flag that the
> >> caller wants to set within the target"
> >>
> >> I am not sure if we should simply change the connection id here. This
> >> seems to be an additional parameter to be passed to the one connection id.
> > 
> > Right.  I think I misinterpreted this part back when I implemented it in
> > QEMU (first submission was this summer).  We lived happily with that
> > code, because the FlagNuber was always zero, and then I copied that over
> > to this KVM patch.
> > 
> > I think requiring it to be zero is a better choice; I've prepared v7 to
> > that end.
> 
> But which return value to use? HV_STATUS_INVALID_PARAMETER?

I think we're better off resorting to the userspace for handling it (at
least to report the situation).  With the current patch this will be
HV_STATUS_INVALID_CONNECTION_ID, but I can use something else or define
a special code for that if desired.

Roman.
Roman Kagan Dec. 13, 2017, 12:58 p.m. UTC | #18
On Wed, Dec 13, 2017 at 12:57:11PM +0100, David Hildenbrand wrote:
> On 13.12.2017 12:51, David Hildenbrand wrote:
> > On 13.12.2017 12:46, David Hildenbrand wrote:
> >> On 13.12.2017 12:44, Roman Kagan wrote:
> >>> On Wed, Dec 13, 2017 at 10:55:02AM +0100, David Hildenbrand wrote:
> >>>>
> >>>>> +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param)
> >>>>> +{
> >>>>> +	u16 ret;
> >>>>> +	u32 conn_id, flag_no;
> >>>>> +	int idx;
> >>>>> +	struct eventfd_ctx *eventfd;
> >>>>> +
> >>>>> +	if (unlikely(!fast)) {
> >>>>> +		gpa_t gpa = param;
> >>>>> +
> >>>>> +		if ((gpa & (__alignof__(param) - 1)) ||
> >>>>> +		    offset_in_page(gpa) + sizeof(param) > PAGE_SIZE)
> >>>>> +			return HV_STATUS_INVALID_ALIGNMENT;
> >>>>> +
> >>>>> +		idx = srcu_read_lock(&vcpu->kvm->srcu);
> >>>>> +		ret = kvm_vcpu_read_guest(vcpu, gpa, &param, sizeof(param));
> >>>>> +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
> >>>>> +
> >>>>> +		if (ret < 0)
> >>>>> +			return HV_STATUS_INSUFFICIENT_MEMORY;
> >>>>
> >>>> "... event flags do not require any buffer allocation or queuing within
> >>>> the hypervisor, so HvSignalEvent will never fail due to insufficient
> >>>> resources."
> >>>
> >>> Ouch.  I wonder what else to map -EFAULT to then...  Looks like
> >>> HV_STATUS_INVALID_ALIGNMENT is the most appropriate (as in "The input or
> >>> output GPA pointer is not within the bounds of the GPA space.")
> >>>
> >>> Thanks,
> >>> Roman.
> >>>
> >>
> >> It is likely happen rearelyn, but usually if we are out of memory we
> >> should report to user space. (chances that we won't be able to continue
> >> for longer either way are very high)
> >>
> > (I implied -ENOMEM is translated to -EFAULT) it can of course happen if
> > the address is wrong.
> > 
> > Also wonder how to deal with that. Could it be that we even have to
> > trigger a page fault in the guest?
> > 
> 
> 
> The hypervisor will validate that the calling partition can read from
> the input page before executing the
> requested hypercall. This validation consists of two checks: the
> specified GPA is mapped and the GPA is
> marked readable. If either of these tests fails, the hypervisor
> generates a memory intercept message.
> 
> So memory intercept message is the right thing to do I guess?

The intercept messages are sent to the parent partition (whose role is
played by the hypervisor in our case) and that one is supposed to decide
what to do with it.

Triggering a pagefault in the guest doesn't look correct because the
checks you mention are related to the GPA and not to the guest virtual
address.

So I think the only option we have is to return an error code, and we
need to decide which one.  We also can fall through to the userspace,
but I can't see why the userspace would have better chances to read that
GPA.

Roman.
diff mbox

Patch

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 57d3ee9e4bde..3a959373fd97 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3403,6 +3403,37 @@  invalid, if invalid pages are written to (e.g. after the end of memory)
 or if no page table is present for the addresses (e.g. when using
 hugepages).
 
+4.109 KVM_HYPERV_EVENTFD
+
+Capability: KVM_CAP_HYPERV_EVENTFD
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_hyperv_eventfd (in)
+
+This ioctl (un)registers an eventfd to receive notifications from the guest on
+the specified Hyper-V connection id through the SIGNAL_EVENT hypercall, without
+causing a user exit.
+
+struct kvm_hyperv_eventfd {
+	__u32 conn_id;
+	__s32 fd;
+	__u32 flags;
+	__u32 padding[3];
+};
+
+The conn_id field should fit within 24 bits:
+
+#define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
+
+The acceptable values for the flags field are:
+
+#define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
+
+Returns: 0 on success,
+	-EINVAL if conn_id or flags is outside the allowed range
+	-ENOENT on deassign if the conn_id isn't registered
+	-EEXIST on assign if the conn_id is already registered
+
 5. The kvm_run structure
 ------------------------
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 516798431328..6a9914752a84 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -752,6 +752,8 @@  struct kvm_hv {
 	u64 hv_crash_ctl;
 
 	HV_REFERENCE_TSC_PAGE tsc_ref;
+
+	struct idr conn_to_evt;
 };
 
 enum kvm_irqchip_mode {
diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
index cc2468244ca2..837465d69c6d 100644
--- a/arch/x86/kvm/hyperv.h
+++ b/arch/x86/kvm/hyperv.h
@@ -90,5 +90,6 @@  void kvm_hv_setup_tsc_page(struct kvm *kvm,
 
 void kvm_hv_init_vm(struct kvm *kvm);
 void kvm_hv_destroy_vm(struct kvm *kvm);
+int kvm_vm_ioctl_hv_eventfd(struct kvm *kvm, struct kvm_hyperv_eventfd *args);
 
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 496e59a2738b..7a871e7fb5df 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -932,6 +932,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_HYPERV_SYNIC2 148
 #define KVM_CAP_HYPERV_VP_INDEX 149
 #define KVM_CAP_S390_AIS_MIGRATION 150
+#define KVM_CAP_HYPERV_EVENTFD 151
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1359,6 +1360,8 @@  struct kvm_s390_ucas_mapping {
 #define KVM_S390_GET_CMMA_BITS      _IOWR(KVMIO, 0xb8, struct kvm_s390_cmma_log)
 #define KVM_S390_SET_CMMA_BITS      _IOW(KVMIO, 0xb9, struct kvm_s390_cmma_log)
 
+#define KVM_HYPERV_EVENTFD	_IOW(KVMIO,  0xba, struct kvm_hyperv_eventfd)
+
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
 #define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
@@ -1419,4 +1422,14 @@  struct kvm_assigned_msix_entry {
 #define KVM_ARM_DEV_EL1_PTIMER		(1 << 1)
 #define KVM_ARM_DEV_PMU			(1 << 2)
 
+struct kvm_hyperv_eventfd {
+	__u32 conn_id;
+	__s32 fd;
+	__u32 flags;
+	__u32 padding[3];
+};
+
+#define KVM_HYPERV_CONN_ID_MASK		0x00ffffff
+#define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)
+
 #endif /* __LINUX_KVM_H */
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 015fb06c7522..c1541dccf14d 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -29,6 +29,7 @@ 
 #include <linux/kvm_host.h>
 #include <linux/highmem.h>
 #include <linux/sched/cputime.h>
+#include <linux/eventfd.h>
 
 #include <asm/apicdef.h>
 #include <trace/events/kvm.h>
@@ -1226,6 +1227,50 @@  static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param)
+{
+	u16 ret;
+	u32 conn_id, flag_no;
+	int idx;
+	struct eventfd_ctx *eventfd;
+
+	if (unlikely(!fast)) {
+		gpa_t gpa = param;
+
+		if ((gpa & (__alignof__(param) - 1)) ||
+		    offset_in_page(gpa) + sizeof(param) > PAGE_SIZE)
+			return HV_STATUS_INVALID_ALIGNMENT;
+
+		idx = srcu_read_lock(&vcpu->kvm->srcu);
+		ret = kvm_vcpu_read_guest(vcpu, gpa, &param, sizeof(param));
+		srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+		if (ret < 0)
+			return HV_STATUS_INSUFFICIENT_MEMORY;
+	}
+
+	/*
+	 * the signaled event number is made up of a 24bit "connection id" and
+	 * a 16bit "flag number"; on the hypervisor side it's only their sum
+	 * that matters
+	 */
+	conn_id = param;
+	flag_no = param >> 32;
+	if ((conn_id & ~KVM_HYPERV_CONN_ID_MASK) || (flag_no & 0xffff0000))
+		return HV_STATUS_INVALID_CONNECTION_ID;
+	conn_id += flag_no;
+	if (conn_id & ~KVM_HYPERV_CONN_ID_MASK)
+		return HV_STATUS_INVALID_CONNECTION_ID;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+	eventfd = idr_find(&vcpu->kvm->arch.hyperv.conn_to_evt, conn_id);
+	if (eventfd)
+		eventfd_signal(eventfd, 1);
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+	return eventfd ? HV_STATUS_SUCCESS : HV_STATUS_INVALID_CONNECTION_ID;
+}
+
 int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 {
 	u64 param, ingpa, outgpa, ret;
@@ -1276,8 +1321,12 @@  int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 	case HVCALL_NOTIFY_LONG_SPIN_WAIT:
 		kvm_vcpu_on_spin(vcpu, true);
 		break;
-	case HVCALL_POST_MESSAGE:
 	case HVCALL_SIGNAL_EVENT:
+		res = kvm_hvcall_signal_event(vcpu, fast, ingpa);
+		if (res != HV_STATUS_INVALID_CONNECTION_ID)
+			break;
+		/* maybe userspace knows this conn_id: fall through */
+	case HVCALL_POST_MESSAGE:
 		/* don't bother userspace if it has no way to handle it */
 		if (!vcpu_to_synic(vcpu)->active) {
 			res = HV_STATUS_INVALID_HYPERCALL_CODE;
@@ -1305,8 +1354,67 @@  int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
 void kvm_hv_init_vm(struct kvm *kvm)
 {
 	mutex_init(&kvm->arch.hyperv.hv_lock);
+	idr_init(&kvm->arch.hyperv.conn_to_evt);
 }
 
 void kvm_hv_destroy_vm(struct kvm *kvm)
 {
+	struct eventfd_ctx *eventfd;
+	int i;
+
+	idr_for_each_entry(&kvm->arch.hyperv.conn_to_evt, eventfd, i)
+		eventfd_ctx_put(eventfd);
+	idr_destroy(&kvm->arch.hyperv.conn_to_evt);
+}
+
+static int kvm_hv_eventfd_assign(struct kvm *kvm, u32 conn_id, int fd)
+{
+	struct kvm_hv *hv = &kvm->arch.hyperv;
+	struct eventfd_ctx *eventfd;
+	int ret;
+
+	eventfd = eventfd_ctx_fdget(fd);
+	if (IS_ERR(eventfd))
+		return PTR_ERR(eventfd);
+
+	mutex_lock(&hv->hv_lock);
+	ret = idr_alloc(&hv->conn_to_evt, eventfd, conn_id, conn_id + 1,
+			GFP_KERNEL);
+	mutex_unlock(&hv->hv_lock);
+
+	if (ret >= 0)
+		return 0;
+
+	if (ret == -ENOSPC)
+		ret = -EEXIST;
+	eventfd_ctx_put(eventfd);
+	return ret;
+}
+
+static int kvm_hv_eventfd_deassign(struct kvm *kvm, u32 conn_id)
+{
+	struct kvm_hv *hv = &kvm->arch.hyperv;
+	struct eventfd_ctx *eventfd;
+
+	mutex_lock(&hv->hv_lock);
+	eventfd = idr_remove(&hv->conn_to_evt, conn_id);
+	mutex_unlock(&hv->hv_lock);
+
+	if (!eventfd)
+		return -ENOENT;
+
+	synchronize_srcu(&kvm->srcu);
+	eventfd_ctx_put(eventfd);
+	return 0;
+}
+
+int kvm_vm_ioctl_hv_eventfd(struct kvm *kvm, struct kvm_hyperv_eventfd *args)
+{
+	if ((args->flags & ~KVM_HYPERV_EVENTFD_DEASSIGN) ||
+	    (args->conn_id & ~KVM_HYPERV_CONN_ID_MASK))
+		return -EINVAL;
+
+	if (args->flags == KVM_HYPERV_EVENTFD_DEASSIGN)
+		return kvm_hv_eventfd_deassign(kvm, args->conn_id);
+	return kvm_hv_eventfd_assign(kvm, args->conn_id, args->fd);
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d17cf7900138..1c43d262da14 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2701,6 +2701,7 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_HYPERV_SYNIC:
 	case KVM_CAP_HYPERV_SYNIC2:
 	case KVM_CAP_HYPERV_VP_INDEX:
+	case KVM_CAP_HYPERV_EVENTFD:
 	case KVM_CAP_PCI_SEGMENT:
 	case KVM_CAP_DEBUGREGS:
 	case KVM_CAP_X86_ROBUST_SINGLESTEP:
@@ -4295,6 +4296,15 @@  long kvm_arch_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_enable_cap(kvm, &cap);
 		break;
 	}
+	case KVM_HYPERV_EVENTFD: {
+		struct kvm_hyperv_eventfd hvevfd;
+
+		r = -EFAULT;
+		if (copy_from_user(&hvevfd, argp, sizeof(hvevfd)))
+			goto out;
+		r = kvm_vm_ioctl_hv_eventfd(kvm, &hvevfd);
+		break;
+	}
 	default:
 		r = -ENOTTY;
 	}