Message ID | 20171212160742.793-3-rkagan@virtuozzo.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 12/12/2017 17:07, Roman Kagan wrote: > + idx = srcu_read_lock(&vcpu->kvm->srcu); > + ret = kvm_vcpu_read_guest(vcpu, gpa, ¶m, sizeof(param)); > + srcu_read_unlock(&vcpu->kvm->srcu, idx); > + The lock/unlock is not needed (vcpu_enter_guest -> vmx_handle_exit -> handle_vmcall -> kvm_emulate_hypercall -> kvm_hv_hypercall -> kvm_hvcall_signal_event). I'll drop it. Thanks,
On 12.12.2017 17:22, Paolo Bonzini wrote: > On 12/12/2017 17:07, Roman Kagan wrote: >> + idx = srcu_read_lock(&vcpu->kvm->srcu); >> + ret = kvm_vcpu_read_guest(vcpu, gpa, ¶m, sizeof(param)); >> + srcu_read_unlock(&vcpu->kvm->srcu, idx); >> + > > The lock/unlock is not needed (vcpu_enter_guest -> vmx_handle_exit -> > handle_vmcall -> kvm_emulate_hypercall -> kvm_hv_hypercall -> > kvm_hvcall_signal_event). I'll drop it. > > Thanks, > Feel free to add my Reviewed-by: David Hildenbrand <david@redhat.com> (although it would be good to have another look at the hyperv specific conn_id handling)
On Tue, Dec 12, 2017 at 05:22:20PM +0100, Paolo Bonzini wrote: > On 12/12/2017 17:07, Roman Kagan wrote: > > + idx = srcu_read_lock(&vcpu->kvm->srcu); > > + ret = kvm_vcpu_read_guest(vcpu, gpa, ¶m, sizeof(param)); > > + srcu_read_unlock(&vcpu->kvm->srcu, idx); > > + > > The lock/unlock is not needed (vcpu_enter_guest -> vmx_handle_exit -> > handle_vmcall -> kvm_emulate_hypercall -> kvm_hv_hypercall -> > kvm_hvcall_signal_event). I'll drop it. Oh crap, indeed. I'll respin soonish. Thanks, Roman.
On Tue, Dec 12, 2017 at 05:29:02PM +0100, David Hildenbrand wrote: > On 12.12.2017 17:22, Paolo Bonzini wrote: > > On 12/12/2017 17:07, Roman Kagan wrote: > >> + idx = srcu_read_lock(&vcpu->kvm->srcu); > >> + ret = kvm_vcpu_read_guest(vcpu, gpa, ¶m, sizeof(param)); > >> + srcu_read_unlock(&vcpu->kvm->srcu, idx); > >> + > > > > The lock/unlock is not needed (vcpu_enter_guest -> vmx_handle_exit -> > > handle_vmcall -> kvm_emulate_hypercall -> kvm_hv_hypercall -> > > kvm_hvcall_signal_event). I'll drop it. > > > > Thanks, > > > > Feel free to add my > > Reviewed-by: David Hildenbrand <david@redhat.com> > > (although it would be good to have another look at the hyperv specific > conn_id handling) Are you talking about this weird flag_no part of the hypercall parameter which we add to the conn_id? Yes this is something I'm not quite comfortable with: we've never seen it being anything but 0, and Linux guest drivers ignore its existence altogether (and there seem to be no fields in the vmbus protocol structures to let the guest know of the allowed numbers there). So I may be misinterpreting the spec; and it may be prudent just to reject all flag_no != 0 hypercalls until we actually start seeing ones, and then research how to handle them correctly. What do you think? Thanks, Roman.
On 12.12.2017 19:18, Roman Kagan wrote: > On Tue, Dec 12, 2017 at 05:29:02PM +0100, David Hildenbrand wrote: >> On 12.12.2017 17:22, Paolo Bonzini wrote: >>> On 12/12/2017 17:07, Roman Kagan wrote: >>>> + idx = srcu_read_lock(&vcpu->kvm->srcu); >>>> + ret = kvm_vcpu_read_guest(vcpu, gpa, ¶m, sizeof(param)); >>>> + srcu_read_unlock(&vcpu->kvm->srcu, idx); >>>> + >>> >>> The lock/unlock is not needed (vcpu_enter_guest -> vmx_handle_exit -> >>> handle_vmcall -> kvm_emulate_hypercall -> kvm_hv_hypercall -> >>> kvm_hvcall_signal_event). I'll drop it. >>> >>> Thanks, >>> >> >> Feel free to add my >> >> Reviewed-by: David Hildenbrand <david@redhat.com> >> >> (although it would be good to have another look at the hyperv specific >> conn_id handling) > > Are you talking about this weird flag_no part of the hypercall parameter > which we add to the conn_id? > > Yes this is something I'm not quite comfortable with: we've never seen > it being anything but 0, and Linux guest drivers ignore its existence > altogether (and there seem to be no fields in the vmbus protocol > structures to let the guest know of the allowed numbers there). > > So I may be misinterpreting the spec; and it may be prudent just to > reject all flag_no != 0 hypercalls until we actually start seeing ones, > and then research how to handle them correctly. What do you think? Could be done, but if it isn't too much trouble, let's try to get it right from the beginning. Can you point me at the spec so I can verify? > > Thanks, > Roman. >
On Tue, Dec 12, 2017 at 07:20:52PM +0100, David Hildenbrand wrote: > On 12.12.2017 19:18, Roman Kagan wrote: > > On Tue, Dec 12, 2017 at 05:29:02PM +0100, David Hildenbrand wrote: > >> On 12.12.2017 17:22, Paolo Bonzini wrote: > >>> On 12/12/2017 17:07, Roman Kagan wrote: > >> (although it would be good to have another look at the hyperv specific > >> conn_id handling) > > > > Are you talking about this weird flag_no part of the hypercall parameter > > which we add to the conn_id? > > > > Yes this is something I'm not quite comfortable with: we've never seen > > it being anything but 0, and Linux guest drivers ignore its existence > > altogether (and there seem to be no fields in the vmbus protocol > > structures to let the guest know of the allowed numbers there). > > > > So I may be misinterpreting the spec; and it may be prudent just to > > reject all flag_no != 0 hypercalls until we actually start seeing ones, > > and then research how to handle them correctly. What do you think? > > Could be done, but if it isn't too much trouble, let's try to get it > right from the beginning. > > Can you point me at the spec so I can verify? Sure. The Hyper-V spec is https://github.com/Microsoft/Virtualization-Documentation/raw/master/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v5.0b.pdf Hope you can read Microsoftish better than me. The relevant part is chapter 11, "Inter-Partition Communication"; the hypercall itself is described in "11.11.2 HvSignalEvent". However, the VMBus guest drivers in Linux use only a limited subset of those. In particular, the concept of ports seems completely unused, the connections appear pre-established, and the parent partition informs the guest of the connection to signal for a specific channel via a u32 field in the channel offer message. (See drivers/hv/channel_mgmt.c:vmbus_onoffer for how the connection_id from the offer is recorded, and drivers/hv/connection.c:vmbus_set_event for how it's used for signaling.) So I'm tempted to think this extra flag number is just the usual overdesign, and I'd better require it to be zero. This is certainly the case for all the VMBus devices currently supported by the Linux guest drivers, so we should be good with it too. Roman.
On 13.12.2017 09:41, Roman Kagan wrote: > On Tue, Dec 12, 2017 at 07:20:52PM +0100, David Hildenbrand wrote: >> On 12.12.2017 19:18, Roman Kagan wrote: >>> On Tue, Dec 12, 2017 at 05:29:02PM +0100, David Hildenbrand wrote: >>>> On 12.12.2017 17:22, Paolo Bonzini wrote: >>>>> On 12/12/2017 17:07, Roman Kagan wrote: >>>> (although it would be good to have another look at the hyperv specific >>>> conn_id handling) >>> >>> Are you talking about this weird flag_no part of the hypercall parameter >>> which we add to the conn_id? >>> >>> Yes this is something I'm not quite comfortable with: we've never seen >>> it being anything but 0, and Linux guest drivers ignore its existence >>> altogether (and there seem to be no fields in the vmbus protocol >>> structures to let the guest know of the allowed numbers there). >>> >>> So I may be misinterpreting the spec; and it may be prudent just to >>> reject all flag_no != 0 hypercalls until we actually start seeing ones, >>> and then research how to handle them correctly. What do you think? >> >> Could be done, but if it isn't too much trouble, let's try to get it >> right from the beginning. >> >> Can you point me at the spec so I can verify? > > Sure. The Hyper-V spec is > > https://github.com/Microsoft/Virtualization-Documentation/raw/master/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v5.0b.pdf > > Hope you can read Microsoftish better than me. The relevant part is > chapter 11, "Inter-Partition Communication"; the hypercall itself is > described in "11.11.2 HvSignalEvent". > > However, the VMBus guest drivers in Linux use only a limited subset of > those. In particular, the concept of ports seems completely unused, the > connections appear pre-established, and the parent partition informs the > guest of the connection to signal for a specific channel via a u32 field > in the channel offer message. > (See drivers/hv/channel_mgmt.c:vmbus_onoffer for how the connection_id > from the offer is recorded, and drivers/hv/connection.c:vmbus_set_event > for how it's used for signaling.) > > So I'm tempted to think this extra flag number is just the usual > overdesign, and I'd better require it to be zero. This is certainly the > case for all the VMBus devices currently supported by the Linux guest > drivers, so we should be good with it too. > > Roman. > General question, what about Microsoft Windows guests?
> > +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param) > +{ > + u16 ret; > + u32 conn_id, flag_no; > + int idx; > + struct eventfd_ctx *eventfd; > + > + if (unlikely(!fast)) { > + gpa_t gpa = param; > + > + if ((gpa & (__alignof__(param) - 1)) || > + offset_in_page(gpa) + sizeof(param) > PAGE_SIZE) > + return HV_STATUS_INVALID_ALIGNMENT; > + > + idx = srcu_read_lock(&vcpu->kvm->srcu); > + ret = kvm_vcpu_read_guest(vcpu, gpa, ¶m, sizeof(param)); > + srcu_read_unlock(&vcpu->kvm->srcu, idx); > + > + if (ret < 0) > + return HV_STATUS_INSUFFICIENT_MEMORY; > + } > + > + /* > + * the signaled event number is made up of a 24bit "connection id" and > + * a 16bit "flag number"; on the hypervisor side it's only their sum > + * that matters > + */ > + conn_id = param; > + flag_no = param >> 32; > + if ((conn_id & ~KVM_HYPERV_CONN_ID_MASK) || (flag_no & 0xffff0000)) 1. You seem to check for RsvdZ here (flag_no & 0xffff0000), "HV_STATUS_INVALID_CONNECTION_ID: The specified connection ID is invalid." -> Shouldn't the reserved field be simply ignored? 2. KVM_HYPERV_CONN_ID_MASK. "ConnectionId (4 bytes)". Why should it only be 3bytes? (I am pretty sure I am missing something in the document here :) ) > + return HV_STATUS_INVALID_CONNECTION_ID; > + conn_id += flag_no; "FlagNumber specifies the relative index of the event flag that the caller wants to set within the target" I am not sure if we should simply change the connection id here. This seems to be an additional parameter to be passed to the one connection id. > + if (conn_id & ~KVM_HYPERV_CONN_ID_MASK) > + return HV_STATUS_INVALID_CONNECTION_ID; > + > + idx = srcu_read_lock(&vcpu->kvm->srcu); > + eventfd = idr_find(&vcpu->kvm->arch.hyperv.conn_to_evt, conn_id); > + if (eventfd) > + eventfd_signal(eventfd, 1); > + srcu_read_unlock(&vcpu->kvm->srcu, idx); > + > + return eventfd ? HV_STATUS_SUCCESS : HV_STATUS_INVALID_CONNECTION_ID; > +} > +
> +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param) > +{ > + u16 ret; > + u32 conn_id, flag_no; > + int idx; > + struct eventfd_ctx *eventfd; > + > + if (unlikely(!fast)) { > + gpa_t gpa = param; > + > + if ((gpa & (__alignof__(param) - 1)) || > + offset_in_page(gpa) + sizeof(param) > PAGE_SIZE) > + return HV_STATUS_INVALID_ALIGNMENT; > + > + idx = srcu_read_lock(&vcpu->kvm->srcu); > + ret = kvm_vcpu_read_guest(vcpu, gpa, ¶m, sizeof(param)); > + srcu_read_unlock(&vcpu->kvm->srcu, idx); > + > + if (ret < 0) > + return HV_STATUS_INSUFFICIENT_MEMORY; "... event flags do not require any buffer allocation or queuing within the hypervisor, so HvSignalEvent will never fail due to insufficient resources." > + } > + > + /* > + * the signaled event number is made up of a 24bit "connection id" and > + * a 16bit "flag number"; on the hypervisor side it's only their sum > + * that matters
On Wed, Dec 13, 2017 at 10:35:40AM +0100, David Hildenbrand wrote: > On 13.12.2017 09:41, Roman Kagan wrote: > > On Tue, Dec 12, 2017 at 07:20:52PM +0100, David Hildenbrand wrote: > >> On 12.12.2017 19:18, Roman Kagan wrote: > >>> On Tue, Dec 12, 2017 at 05:29:02PM +0100, David Hildenbrand wrote: > >>>> On 12.12.2017 17:22, Paolo Bonzini wrote: > >>>>> On 12/12/2017 17:07, Roman Kagan wrote: > >>>> (although it would be good to have another look at the hyperv specific > >>>> conn_id handling) > >>> > >>> Are you talking about this weird flag_no part of the hypercall parameter > >>> which we add to the conn_id? > >>> > >>> Yes this is something I'm not quite comfortable with: we've never seen > >>> it being anything but 0, and Linux guest drivers ignore its existence > >>> altogether (and there seem to be no fields in the vmbus protocol > >>> structures to let the guest know of the allowed numbers there). > >>> > >>> So I may be misinterpreting the spec; and it may be prudent just to > >>> reject all flag_no != 0 hypercalls until we actually start seeing ones, > >>> and then research how to handle them correctly. What do you think? > >> > >> Could be done, but if it isn't too much trouble, let's try to get it > >> right from the beginning. > >> > >> Can you point me at the spec so I can verify? > > > > Sure. The Hyper-V spec is > > > > https://github.com/Microsoft/Virtualization-Documentation/raw/master/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v5.0b.pdf > > > > Hope you can read Microsoftish better than me. The relevant part is > > chapter 11, "Inter-Partition Communication"; the hypercall itself is > > described in "11.11.2 HvSignalEvent". > > > > However, the VMBus guest drivers in Linux use only a limited subset of > > those. In particular, the concept of ports seems completely unused, the > > connections appear pre-established, and the parent partition informs the > > guest of the connection to signal for a specific channel via a u32 field > > in the channel offer message. > > (See drivers/hv/channel_mgmt.c:vmbus_onoffer for how the connection_id > > from the offer is recorded, and drivers/hv/connection.c:vmbus_set_event > > for how it's used for signaling.) > > > > So I'm tempted to think this extra flag number is just the usual > > overdesign, and I'd better require it to be zero. This is certainly the > > case for all the VMBus devices currently supported by the Linux guest > > drivers, so we should be good with it too. > > > > Roman. > > > > General question, what about Microsoft Windows guests? Well, we haven't tried intercepting this hypercall in Windows guests on the real Hyper-V to see if they ever use non-zero flag number. On our implementation of VMBus in QEMU, based on the protocol deduced from the Linux guest drivers, the Windows guests abide by the same rules. At least it was enough to get VMBus scsi, network, and util devices working. Roman.
On Wed, Dec 13, 2017 at 10:51:06AM +0100, David Hildenbrand wrote: > > +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param) > > +{ > > + u16 ret; > > + u32 conn_id, flag_no; > > + int idx; > > + struct eventfd_ctx *eventfd; > > + > > + if (unlikely(!fast)) { > > + gpa_t gpa = param; > > + > > + if ((gpa & (__alignof__(param) - 1)) || > > + offset_in_page(gpa) + sizeof(param) > PAGE_SIZE) > > + return HV_STATUS_INVALID_ALIGNMENT; > > + > > + idx = srcu_read_lock(&vcpu->kvm->srcu); > > + ret = kvm_vcpu_read_guest(vcpu, gpa, ¶m, sizeof(param)); > > + srcu_read_unlock(&vcpu->kvm->srcu, idx); > > + > > + if (ret < 0) > > + return HV_STATUS_INSUFFICIENT_MEMORY; > > + } > > + > > + /* > > + * the signaled event number is made up of a 24bit "connection id" and > > + * a 16bit "flag number"; on the hypervisor side it's only their sum > > + * that matters > > + */ > > + conn_id = param; > > + flag_no = param >> 32; > > + if ((conn_id & ~KVM_HYPERV_CONN_ID_MASK) || (flag_no & 0xffff0000)) > > 1. You seem to check for RsvdZ here (flag_no & 0xffff0000), > "HV_STATUS_INVALID_CONNECTION_ID: The specified connection ID is > invalid." > -> Shouldn't the reserved field be simply ignored? It's "reserved zero", so I think it's correct to reject non-zero values. > 2. KVM_HYPERV_CONN_ID_MASK. "ConnectionId (4 bytes)". > > Why should it only be 3bytes? > > (I am pretty sure I am missing something in the document here :) ) "11.10.7 Connections Connections are identified by 32-bit IDs. The high 8 bits are reserved and must be zero.[...]" > > + return HV_STATUS_INVALID_CONNECTION_ID; > > + conn_id += flag_no; > > "FlagNumber specifies the relative index of the event flag that the > caller wants to set within the target" > > I am not sure if we should simply change the connection id here. This > seems to be an additional parameter to be passed to the one connection id. Right. I think I misinterpreted this part back when I implemented it in QEMU (first submission was this summer). We lived happily with that code, because the FlagNuber was always zero, and then I copied that over to this KVM patch. I think requiring it to be zero is a better choice; I've prepared v7 to that end. Thanks, Roman.
On 13.12.2017 12:04, Roman Kagan wrote: > On Wed, Dec 13, 2017 at 10:51:06AM +0100, David Hildenbrand wrote: >>> +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param) >>> +{ >>> + u16 ret; >>> + u32 conn_id, flag_no; >>> + int idx; >>> + struct eventfd_ctx *eventfd; >>> + >>> + if (unlikely(!fast)) { >>> + gpa_t gpa = param; >>> + >>> + if ((gpa & (__alignof__(param) - 1)) || >>> + offset_in_page(gpa) + sizeof(param) > PAGE_SIZE) >>> + return HV_STATUS_INVALID_ALIGNMENT; >>> + >>> + idx = srcu_read_lock(&vcpu->kvm->srcu); >>> + ret = kvm_vcpu_read_guest(vcpu, gpa, ¶m, sizeof(param)); >>> + srcu_read_unlock(&vcpu->kvm->srcu, idx); >>> + >>> + if (ret < 0) >>> + return HV_STATUS_INSUFFICIENT_MEMORY; >>> + } >>> + >>> + /* >>> + * the signaled event number is made up of a 24bit "connection id" and >>> + * a 16bit "flag number"; on the hypervisor side it's only their sum >>> + * that matters >>> + */ >>> + conn_id = param; >>> + flag_no = param >> 32; >>> + if ((conn_id & ~KVM_HYPERV_CONN_ID_MASK) || (flag_no & 0xffff0000)) >> >> 1. You seem to check for RsvdZ here (flag_no & 0xffff0000), >> "HV_STATUS_INVALID_CONNECTION_ID: The specified connection ID is >> invalid." >> -> Shouldn't the reserved field be simply ignored? > > It's "reserved zero", so I think it's correct to reject non-zero values. If it's not documented, guess it is ignored? At least "HV_STATUS_INVALID_CONNECTION_ID" is wrong, as this is not the connection id. > >> 2. KVM_HYPERV_CONN_ID_MASK. "ConnectionId (4 bytes)". >> >> Why should it only be 3bytes? >> >> (I am pretty sure I am missing something in the document here :) ) > > > "11.10.7 Connections > > Connections are identified by 32-bit IDs. The high 8 bits are reserved > and must be zero.[...]" This makes then sense indeed! > >>> + return HV_STATUS_INVALID_CONNECTION_ID; >>> + conn_id += flag_no; >> >> "FlagNumber specifies the relative index of the event flag that the >> caller wants to set within the target" >> >> I am not sure if we should simply change the connection id here. This >> seems to be an additional parameter to be passed to the one connection id. > > Right. I think I misinterpreted this part back when I implemented it in > QEMU (first submission was this summer). We lived happily with that > code, because the FlagNuber was always zero, and then I copied that over > to this KVM patch. > > I think requiring it to be zero is a better choice; I've prepared v7 to > that end. But which return value to use? HV_STATUS_INVALID_PARAMETER? > > Thanks, > Roman. >
On Wed, Dec 13, 2017 at 10:55:02AM +0100, David Hildenbrand wrote: > > > +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param) > > +{ > > + u16 ret; > > + u32 conn_id, flag_no; > > + int idx; > > + struct eventfd_ctx *eventfd; > > + > > + if (unlikely(!fast)) { > > + gpa_t gpa = param; > > + > > + if ((gpa & (__alignof__(param) - 1)) || > > + offset_in_page(gpa) + sizeof(param) > PAGE_SIZE) > > + return HV_STATUS_INVALID_ALIGNMENT; > > + > > + idx = srcu_read_lock(&vcpu->kvm->srcu); > > + ret = kvm_vcpu_read_guest(vcpu, gpa, ¶m, sizeof(param)); > > + srcu_read_unlock(&vcpu->kvm->srcu, idx); > > + > > + if (ret < 0) > > + return HV_STATUS_INSUFFICIENT_MEMORY; > > "... event flags do not require any buffer allocation or queuing within > the hypervisor, so HvSignalEvent will never fail due to insufficient > resources." Ouch. I wonder what else to map -EFAULT to then... Looks like HV_STATUS_INVALID_ALIGNMENT is the most appropriate (as in "The input or output GPA pointer is not within the bounds of the GPA space.") Thanks, Roman.
On 13.12.2017 12:44, Roman Kagan wrote: > On Wed, Dec 13, 2017 at 10:55:02AM +0100, David Hildenbrand wrote: >> >>> +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param) >>> +{ >>> + u16 ret; >>> + u32 conn_id, flag_no; >>> + int idx; >>> + struct eventfd_ctx *eventfd; >>> + >>> + if (unlikely(!fast)) { >>> + gpa_t gpa = param; >>> + >>> + if ((gpa & (__alignof__(param) - 1)) || >>> + offset_in_page(gpa) + sizeof(param) > PAGE_SIZE) >>> + return HV_STATUS_INVALID_ALIGNMENT; >>> + >>> + idx = srcu_read_lock(&vcpu->kvm->srcu); >>> + ret = kvm_vcpu_read_guest(vcpu, gpa, ¶m, sizeof(param)); >>> + srcu_read_unlock(&vcpu->kvm->srcu, idx); >>> + >>> + if (ret < 0) >>> + return HV_STATUS_INSUFFICIENT_MEMORY; >> >> "... event flags do not require any buffer allocation or queuing within >> the hypervisor, so HvSignalEvent will never fail due to insufficient >> resources." > > Ouch. I wonder what else to map -EFAULT to then... Looks like > HV_STATUS_INVALID_ALIGNMENT is the most appropriate (as in "The input or > output GPA pointer is not within the bounds of the GPA space.") > > Thanks, > Roman. > It is likely happen rearelyn, but usually if we are out of memory we should report to user space. (chances that we won't be able to continue for longer either way are very high)
On 13.12.2017 12:46, David Hildenbrand wrote: > On 13.12.2017 12:44, Roman Kagan wrote: >> On Wed, Dec 13, 2017 at 10:55:02AM +0100, David Hildenbrand wrote: >>> >>>> +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param) >>>> +{ >>>> + u16 ret; >>>> + u32 conn_id, flag_no; >>>> + int idx; >>>> + struct eventfd_ctx *eventfd; >>>> + >>>> + if (unlikely(!fast)) { >>>> + gpa_t gpa = param; >>>> + >>>> + if ((gpa & (__alignof__(param) - 1)) || >>>> + offset_in_page(gpa) + sizeof(param) > PAGE_SIZE) >>>> + return HV_STATUS_INVALID_ALIGNMENT; >>>> + >>>> + idx = srcu_read_lock(&vcpu->kvm->srcu); >>>> + ret = kvm_vcpu_read_guest(vcpu, gpa, ¶m, sizeof(param)); >>>> + srcu_read_unlock(&vcpu->kvm->srcu, idx); >>>> + >>>> + if (ret < 0) >>>> + return HV_STATUS_INSUFFICIENT_MEMORY; >>> >>> "... event flags do not require any buffer allocation or queuing within >>> the hypervisor, so HvSignalEvent will never fail due to insufficient >>> resources." >> >> Ouch. I wonder what else to map -EFAULT to then... Looks like >> HV_STATUS_INVALID_ALIGNMENT is the most appropriate (as in "The input or >> output GPA pointer is not within the bounds of the GPA space.") >> >> Thanks, >> Roman. >> > > It is likely happen rearelyn, but usually if we are out of memory we > should report to user space. (chances that we won't be able to continue > for longer either way are very high) > (I implied -ENOMEM is translated to -EFAULT) it can of course happen if the address is wrong. Also wonder how to deal with that. Could it be that we even have to trigger a page fault in the guest?
On 13.12.2017 12:51, David Hildenbrand wrote: > On 13.12.2017 12:46, David Hildenbrand wrote: >> On 13.12.2017 12:44, Roman Kagan wrote: >>> On Wed, Dec 13, 2017 at 10:55:02AM +0100, David Hildenbrand wrote: >>>> >>>>> +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param) >>>>> +{ >>>>> + u16 ret; >>>>> + u32 conn_id, flag_no; >>>>> + int idx; >>>>> + struct eventfd_ctx *eventfd; >>>>> + >>>>> + if (unlikely(!fast)) { >>>>> + gpa_t gpa = param; >>>>> + >>>>> + if ((gpa & (__alignof__(param) - 1)) || >>>>> + offset_in_page(gpa) + sizeof(param) > PAGE_SIZE) >>>>> + return HV_STATUS_INVALID_ALIGNMENT; >>>>> + >>>>> + idx = srcu_read_lock(&vcpu->kvm->srcu); >>>>> + ret = kvm_vcpu_read_guest(vcpu, gpa, ¶m, sizeof(param)); >>>>> + srcu_read_unlock(&vcpu->kvm->srcu, idx); >>>>> + >>>>> + if (ret < 0) >>>>> + return HV_STATUS_INSUFFICIENT_MEMORY; >>>> >>>> "... event flags do not require any buffer allocation or queuing within >>>> the hypervisor, so HvSignalEvent will never fail due to insufficient >>>> resources." >>> >>> Ouch. I wonder what else to map -EFAULT to then... Looks like >>> HV_STATUS_INVALID_ALIGNMENT is the most appropriate (as in "The input or >>> output GPA pointer is not within the bounds of the GPA space.") >>> >>> Thanks, >>> Roman. >>> >> >> It is likely happen rearelyn, but usually if we are out of memory we >> should report to user space. (chances that we won't be able to continue >> for longer either way are very high) >> > (I implied -ENOMEM is translated to -EFAULT) it can of course happen if > the address is wrong. > > Also wonder how to deal with that. Could it be that we even have to > trigger a page fault in the guest? > The hypervisor will validate that the calling partition can read from the input page before executing the requested hypercall. This validation consists of two checks: the specified GPA is mapped and the GPA is marked readable. If either of these tests fails, the hypervisor generates a memory intercept message. So memory intercept message is the right thing to do I guess?
On Wed, Dec 13, 2017 at 12:14:48PM +0100, David Hildenbrand wrote: > On 13.12.2017 12:04, Roman Kagan wrote: > > On Wed, Dec 13, 2017 at 10:51:06AM +0100, David Hildenbrand wrote: > >>> +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param) > >>> +{ > >>> + u16 ret; > >>> + u32 conn_id, flag_no; > >>> + int idx; > >>> + struct eventfd_ctx *eventfd; > >>> + > >>> + if (unlikely(!fast)) { > >>> + gpa_t gpa = param; > >>> + > >>> + if ((gpa & (__alignof__(param) - 1)) || > >>> + offset_in_page(gpa) + sizeof(param) > PAGE_SIZE) > >>> + return HV_STATUS_INVALID_ALIGNMENT; > >>> + > >>> + idx = srcu_read_lock(&vcpu->kvm->srcu); > >>> + ret = kvm_vcpu_read_guest(vcpu, gpa, ¶m, sizeof(param)); > >>> + srcu_read_unlock(&vcpu->kvm->srcu, idx); > >>> + > >>> + if (ret < 0) > >>> + return HV_STATUS_INSUFFICIENT_MEMORY; > >>> + } > >>> + > >>> + /* > >>> + * the signaled event number is made up of a 24bit "connection id" and > >>> + * a 16bit "flag number"; on the hypervisor side it's only their sum > >>> + * that matters > >>> + */ > >>> + conn_id = param; > >>> + flag_no = param >> 32; > >>> + if ((conn_id & ~KVM_HYPERV_CONN_ID_MASK) || (flag_no & 0xffff0000)) > >> > >> 1. You seem to check for RsvdZ here (flag_no & 0xffff0000), > >> "HV_STATUS_INVALID_CONNECTION_ID: The specified connection ID is > >> invalid." > >> -> Shouldn't the reserved field be simply ignored? > > > > It's "reserved zero", so I think it's correct to reject non-zero values. > > If it's not documented, guess it is ignored? It is (sort of) documented, per "1.2 Reserved Values [...] Zero value (documented as RsvdZ in diagrams and ReservedZ in code segments) – For maximum forward compatibility, clients should zero the value within this field." > At least "HV_STATUS_INVALID_CONNECTION_ID" is wrong, as this is not > the connection id. I (ab)used this return code here to fall through to userspace handling of this hypercall. But it may indeed make more sense to return HV_STATUS_INVALID_HYPERCALL_INPUT, as in "A reserved bit in the specified hypercall input value is non-zero." > >>> + return HV_STATUS_INVALID_CONNECTION_ID; > >>> + conn_id += flag_no; > >> > >> "FlagNumber specifies the relative index of the event flag that the > >> caller wants to set within the target" > >> > >> I am not sure if we should simply change the connection id here. This > >> seems to be an additional parameter to be passed to the one connection id. > > > > Right. I think I misinterpreted this part back when I implemented it in > > QEMU (first submission was this summer). We lived happily with that > > code, because the FlagNuber was always zero, and then I copied that over > > to this KVM patch. > > > > I think requiring it to be zero is a better choice; I've prepared v7 to > > that end. > > But which return value to use? HV_STATUS_INVALID_PARAMETER? I think we're better off resorting to the userspace for handling it (at least to report the situation). With the current patch this will be HV_STATUS_INVALID_CONNECTION_ID, but I can use something else or define a special code for that if desired. Roman.
On Wed, Dec 13, 2017 at 12:57:11PM +0100, David Hildenbrand wrote: > On 13.12.2017 12:51, David Hildenbrand wrote: > > On 13.12.2017 12:46, David Hildenbrand wrote: > >> On 13.12.2017 12:44, Roman Kagan wrote: > >>> On Wed, Dec 13, 2017 at 10:55:02AM +0100, David Hildenbrand wrote: > >>>> > >>>>> +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param) > >>>>> +{ > >>>>> + u16 ret; > >>>>> + u32 conn_id, flag_no; > >>>>> + int idx; > >>>>> + struct eventfd_ctx *eventfd; > >>>>> + > >>>>> + if (unlikely(!fast)) { > >>>>> + gpa_t gpa = param; > >>>>> + > >>>>> + if ((gpa & (__alignof__(param) - 1)) || > >>>>> + offset_in_page(gpa) + sizeof(param) > PAGE_SIZE) > >>>>> + return HV_STATUS_INVALID_ALIGNMENT; > >>>>> + > >>>>> + idx = srcu_read_lock(&vcpu->kvm->srcu); > >>>>> + ret = kvm_vcpu_read_guest(vcpu, gpa, ¶m, sizeof(param)); > >>>>> + srcu_read_unlock(&vcpu->kvm->srcu, idx); > >>>>> + > >>>>> + if (ret < 0) > >>>>> + return HV_STATUS_INSUFFICIENT_MEMORY; > >>>> > >>>> "... event flags do not require any buffer allocation or queuing within > >>>> the hypervisor, so HvSignalEvent will never fail due to insufficient > >>>> resources." > >>> > >>> Ouch. I wonder what else to map -EFAULT to then... Looks like > >>> HV_STATUS_INVALID_ALIGNMENT is the most appropriate (as in "The input or > >>> output GPA pointer is not within the bounds of the GPA space.") > >>> > >>> Thanks, > >>> Roman. > >>> > >> > >> It is likely happen rearelyn, but usually if we are out of memory we > >> should report to user space. (chances that we won't be able to continue > >> for longer either way are very high) > >> > > (I implied -ENOMEM is translated to -EFAULT) it can of course happen if > > the address is wrong. > > > > Also wonder how to deal with that. Could it be that we even have to > > trigger a page fault in the guest? > > > > > The hypervisor will validate that the calling partition can read from > the input page before executing the > requested hypercall. This validation consists of two checks: the > specified GPA is mapped and the GPA is > marked readable. If either of these tests fails, the hypervisor > generates a memory intercept message. > > So memory intercept message is the right thing to do I guess? The intercept messages are sent to the parent partition (whose role is played by the hypervisor in our case) and that one is supposed to decide what to do with it. Triggering a pagefault in the guest doesn't look correct because the checks you mention are related to the GPA and not to the guest virtual address. So I think the only option we have is to return an error code, and we need to decide which one. We also can fall through to the userspace, but I can't see why the userspace would have better chances to read that GPA. Roman.
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 57d3ee9e4bde..3a959373fd97 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -3403,6 +3403,37 @@ invalid, if invalid pages are written to (e.g. after the end of memory) or if no page table is present for the addresses (e.g. when using hugepages). +4.109 KVM_HYPERV_EVENTFD + +Capability: KVM_CAP_HYPERV_EVENTFD +Architectures: x86 +Type: vm ioctl +Parameters: struct kvm_hyperv_eventfd (in) + +This ioctl (un)registers an eventfd to receive notifications from the guest on +the specified Hyper-V connection id through the SIGNAL_EVENT hypercall, without +causing a user exit. + +struct kvm_hyperv_eventfd { + __u32 conn_id; + __s32 fd; + __u32 flags; + __u32 padding[3]; +}; + +The conn_id field should fit within 24 bits: + +#define KVM_HYPERV_CONN_ID_MASK 0x00ffffff + +The acceptable values for the flags field are: + +#define KVM_HYPERV_EVENTFD_DEASSIGN (1 << 0) + +Returns: 0 on success, + -EINVAL if conn_id or flags is outside the allowed range + -ENOENT on deassign if the conn_id isn't registered + -EEXIST on assign if the conn_id is already registered + 5. The kvm_run structure ------------------------ diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 516798431328..6a9914752a84 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -752,6 +752,8 @@ struct kvm_hv { u64 hv_crash_ctl; HV_REFERENCE_TSC_PAGE tsc_ref; + + struct idr conn_to_evt; }; enum kvm_irqchip_mode { diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h index cc2468244ca2..837465d69c6d 100644 --- a/arch/x86/kvm/hyperv.h +++ b/arch/x86/kvm/hyperv.h @@ -90,5 +90,6 @@ void kvm_hv_setup_tsc_page(struct kvm *kvm, void kvm_hv_init_vm(struct kvm *kvm); void kvm_hv_destroy_vm(struct kvm *kvm); +int kvm_vm_ioctl_hv_eventfd(struct kvm *kvm, struct kvm_hyperv_eventfd *args); #endif diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 496e59a2738b..7a871e7fb5df 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -932,6 +932,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_HYPERV_SYNIC2 148 #define KVM_CAP_HYPERV_VP_INDEX 149 #define KVM_CAP_S390_AIS_MIGRATION 150 +#define KVM_CAP_HYPERV_EVENTFD 151 #ifdef KVM_CAP_IRQ_ROUTING @@ -1359,6 +1360,8 @@ struct kvm_s390_ucas_mapping { #define KVM_S390_GET_CMMA_BITS _IOWR(KVMIO, 0xb8, struct kvm_s390_cmma_log) #define KVM_S390_SET_CMMA_BITS _IOW(KVMIO, 0xb9, struct kvm_s390_cmma_log) +#define KVM_HYPERV_EVENTFD _IOW(KVMIO, 0xba, struct kvm_hyperv_eventfd) + #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0) #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1) #define KVM_DEV_ASSIGN_MASK_INTX (1 << 2) @@ -1419,4 +1422,14 @@ struct kvm_assigned_msix_entry { #define KVM_ARM_DEV_EL1_PTIMER (1 << 1) #define KVM_ARM_DEV_PMU (1 << 2) +struct kvm_hyperv_eventfd { + __u32 conn_id; + __s32 fd; + __u32 flags; + __u32 padding[3]; +}; + +#define KVM_HYPERV_CONN_ID_MASK 0x00ffffff +#define KVM_HYPERV_EVENTFD_DEASSIGN (1 << 0) + #endif /* __LINUX_KVM_H */ diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index 015fb06c7522..c1541dccf14d 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -29,6 +29,7 @@ #include <linux/kvm_host.h> #include <linux/highmem.h> #include <linux/sched/cputime.h> +#include <linux/eventfd.h> #include <asm/apicdef.h> #include <trace/events/kvm.h> @@ -1226,6 +1227,50 @@ static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu) return 1; } +static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, bool fast, u64 param) +{ + u16 ret; + u32 conn_id, flag_no; + int idx; + struct eventfd_ctx *eventfd; + + if (unlikely(!fast)) { + gpa_t gpa = param; + + if ((gpa & (__alignof__(param) - 1)) || + offset_in_page(gpa) + sizeof(param) > PAGE_SIZE) + return HV_STATUS_INVALID_ALIGNMENT; + + idx = srcu_read_lock(&vcpu->kvm->srcu); + ret = kvm_vcpu_read_guest(vcpu, gpa, ¶m, sizeof(param)); + srcu_read_unlock(&vcpu->kvm->srcu, idx); + + if (ret < 0) + return HV_STATUS_INSUFFICIENT_MEMORY; + } + + /* + * the signaled event number is made up of a 24bit "connection id" and + * a 16bit "flag number"; on the hypervisor side it's only their sum + * that matters + */ + conn_id = param; + flag_no = param >> 32; + if ((conn_id & ~KVM_HYPERV_CONN_ID_MASK) || (flag_no & 0xffff0000)) + return HV_STATUS_INVALID_CONNECTION_ID; + conn_id += flag_no; + if (conn_id & ~KVM_HYPERV_CONN_ID_MASK) + return HV_STATUS_INVALID_CONNECTION_ID; + + idx = srcu_read_lock(&vcpu->kvm->srcu); + eventfd = idr_find(&vcpu->kvm->arch.hyperv.conn_to_evt, conn_id); + if (eventfd) + eventfd_signal(eventfd, 1); + srcu_read_unlock(&vcpu->kvm->srcu, idx); + + return eventfd ? HV_STATUS_SUCCESS : HV_STATUS_INVALID_CONNECTION_ID; +} + int kvm_hv_hypercall(struct kvm_vcpu *vcpu) { u64 param, ingpa, outgpa, ret; @@ -1276,8 +1321,12 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu) case HVCALL_NOTIFY_LONG_SPIN_WAIT: kvm_vcpu_on_spin(vcpu, true); break; - case HVCALL_POST_MESSAGE: case HVCALL_SIGNAL_EVENT: + res = kvm_hvcall_signal_event(vcpu, fast, ingpa); + if (res != HV_STATUS_INVALID_CONNECTION_ID) + break; + /* maybe userspace knows this conn_id: fall through */ + case HVCALL_POST_MESSAGE: /* don't bother userspace if it has no way to handle it */ if (!vcpu_to_synic(vcpu)->active) { res = HV_STATUS_INVALID_HYPERCALL_CODE; @@ -1305,8 +1354,67 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu) void kvm_hv_init_vm(struct kvm *kvm) { mutex_init(&kvm->arch.hyperv.hv_lock); + idr_init(&kvm->arch.hyperv.conn_to_evt); } void kvm_hv_destroy_vm(struct kvm *kvm) { + struct eventfd_ctx *eventfd; + int i; + + idr_for_each_entry(&kvm->arch.hyperv.conn_to_evt, eventfd, i) + eventfd_ctx_put(eventfd); + idr_destroy(&kvm->arch.hyperv.conn_to_evt); +} + +static int kvm_hv_eventfd_assign(struct kvm *kvm, u32 conn_id, int fd) +{ + struct kvm_hv *hv = &kvm->arch.hyperv; + struct eventfd_ctx *eventfd; + int ret; + + eventfd = eventfd_ctx_fdget(fd); + if (IS_ERR(eventfd)) + return PTR_ERR(eventfd); + + mutex_lock(&hv->hv_lock); + ret = idr_alloc(&hv->conn_to_evt, eventfd, conn_id, conn_id + 1, + GFP_KERNEL); + mutex_unlock(&hv->hv_lock); + + if (ret >= 0) + return 0; + + if (ret == -ENOSPC) + ret = -EEXIST; + eventfd_ctx_put(eventfd); + return ret; +} + +static int kvm_hv_eventfd_deassign(struct kvm *kvm, u32 conn_id) +{ + struct kvm_hv *hv = &kvm->arch.hyperv; + struct eventfd_ctx *eventfd; + + mutex_lock(&hv->hv_lock); + eventfd = idr_remove(&hv->conn_to_evt, conn_id); + mutex_unlock(&hv->hv_lock); + + if (!eventfd) + return -ENOENT; + + synchronize_srcu(&kvm->srcu); + eventfd_ctx_put(eventfd); + return 0; +} + +int kvm_vm_ioctl_hv_eventfd(struct kvm *kvm, struct kvm_hyperv_eventfd *args) +{ + if ((args->flags & ~KVM_HYPERV_EVENTFD_DEASSIGN) || + (args->conn_id & ~KVM_HYPERV_CONN_ID_MASK)) + return -EINVAL; + + if (args->flags == KVM_HYPERV_EVENTFD_DEASSIGN) + return kvm_hv_eventfd_deassign(kvm, args->conn_id); + return kvm_hv_eventfd_assign(kvm, args->conn_id, args->fd); } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d17cf7900138..1c43d262da14 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2701,6 +2701,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_HYPERV_SYNIC: case KVM_CAP_HYPERV_SYNIC2: case KVM_CAP_HYPERV_VP_INDEX: + case KVM_CAP_HYPERV_EVENTFD: case KVM_CAP_PCI_SEGMENT: case KVM_CAP_DEBUGREGS: case KVM_CAP_X86_ROBUST_SINGLESTEP: @@ -4295,6 +4296,15 @@ long kvm_arch_vm_ioctl(struct file *filp, r = kvm_vm_ioctl_enable_cap(kvm, &cap); break; } + case KVM_HYPERV_EVENTFD: { + struct kvm_hyperv_eventfd hvevfd; + + r = -EFAULT; + if (copy_from_user(&hvevfd, argp, sizeof(hvevfd))) + goto out; + r = kvm_vm_ioctl_hv_eventfd(kvm, &hvevfd); + break; + } default: r = -ENOTTY; }
In Hyper-V, the fast guest->host notification mechanism is the SIGNAL_EVENT hypercall, with a single parameter of the connection ID to signal. Currently this hypercall incurs a user exit and requires the userspace to decode the parameters and trigger the notification of the potentially different I/O context. To avoid the costly user exit, process this hypercall and signal the corresponding eventfd in KVM, similar to ioeventfd. The association between the connection id and the eventfd is established via the newly introduced KVM_HYPERV_EVENTFD ioctl, and maintained in an (srcu-protected) IDR. Signed-off-by: Roman Kagan <rkagan@virtuozzo.com> --- v4 -> v5: - fix block comment formatting v3 -> v4: - switch to kvm_vcpu_read_guest and take srcu_read_lock around it - rework and document the interpretation of the hypercall parameter - merge !fast version into kvm_hvcall_signal_event for brevity v2 -> v3: - expand docs on allowed values and return codes - fix uninitialized return - style fixes v1 -> v2: - make data types consistent - get by without the recently dropped struct hv_input_signal_event - fix subject prefix Documentation/virtual/kvm/api.txt | 31 +++++++++++ arch/x86/include/asm/kvm_host.h | 2 + arch/x86/kvm/hyperv.h | 1 + include/uapi/linux/kvm.h | 13 +++++ arch/x86/kvm/hyperv.c | 110 +++++++++++++++++++++++++++++++++++++- arch/x86/kvm/x86.c | 10 ++++ 6 files changed, 166 insertions(+), 1 deletion(-)