mbox series

[v2,00/15] Introduce /dev/mshv drivers

Message ID 1692309711-5573-1-git-send-email-nunodasneves@linux.microsoft.com (mailing list archive)
Headers show
Series Introduce /dev/mshv drivers | expand

Message

Nuno Das Neves Aug. 17, 2023, 10:01 p.m. UTC
This series introduces support for creating and running guest machines
while running on the Microsoft Hypervisor. [0]
This is done via an IOCTL interface accessed through /dev/mshv, similar to
/dev/kvm. Another series introducing this support was previously posted.
[1]

These interfaces support VMMs running in:
1. The root patition - provided in the mshv_root module, and
2. VTL 2 - provided in the mshv_vtl module [2]

Patches breakdown
-----------------
The first 7 patches are refactoring and adding some helper functions.
They provide some benefit on their own and could be applied independently
as cleanup patches.

Patches 8-12 just set things up for the driver code to come. These are very
small. They come first so that the remaining patches are more self-contained.

The final 3 patches are the meat of the series:
- Patch 13 contains new header files used by the driver.
  These are designed to mirror the ABI headers exported by Hyper-V. This is
  done to avoid polluting hyperv-tlfs.h and help track changes to the ABIs
  that are still unstable. (See FAQ below).
- Patch 14 conditionally includes these new header files into mshyperv.h
  and linux/hyperv.h, in order to be able to use these files in the new
  drivers while remaining independent from hyperv-tlfs.h.
- Patch 15 contains the new driver code located in drivers/hv. This is a
  large amount of code and new files, but it is mostly self-contained and
  all within drivers/hv - apart from the IOCTL interface itself in uapi.

Patch 15 is rather big and has bounced back from some mailing lists. If you
did not get a copy in your inbox, you can view it here instead:
https://github.com/NunoDasNeves/linux/commit/4dd436f13d95a6a0daecd75da012a2caa6c57c8c

FAQ on include/uapi/hyperv/*.h
------------------------------
Q:
Why not just add these definitions to hyperv-tlfs.h?
A:
The intention of hyperv-tlfs.h is to contain stable definitions documented
in the public TLFS document. These new definitions don't fit that criteria,
so they should be separate.

Q:
Why are these files named hvgdk.h, hvgdk_mini.h, hvhdk.h and hvhdk_mini.h?
A:
The precise meaning of the names reflects conventions used internally at
Microsoft.
Naming them this way makes it easy to find where particular Hyper-V
definitions come from, and check their correctness.
It also facilitates the future work of automatically generating these files.

Q:
Why are they in uapi?
A:
In short, to keep things simple. There are many definitions needed in both
the kernel and the VMM in userspace. Separating them doesn't serve much
purpose, and makes it more laborious to import definitions from Hyper-V
code.

Q:
The new headers redefine many things that are already in hyperv-tlfs.h - why?
A:
Some definitions are extended compared to what is documented in the TLFS.
In order to avoid adding undocumented or unstable definitions to hyperv-tlfs.h,
the new headers must compile independently.
Therefore, the new headers must redefine many things in hyperv-tlfs.h in order
to compile.

--------------------------
[0] "Hyper-V" is more well-known, but it really refers to the whole stack
    including the hypervisor and other components that run in Windows
    kernel and userspace.
[1] Previous /dev/mshv patch series and discussion:
    https://lore.kernel.org/linux-hyperv/1632853875-20261-1-git-send-email-nunodasneves@linux.microsoft.com/
[2] Virtual Secure Mode (VSM) and Virtual Trust Levels (VTL):
    https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/vsm

--------------------------

Changes since v1:
    * Clean up formatting, capitalization in commit messages
    * Add detail to commit message for patch 15
    * Remove errant lines in Makefile and Kconfig in patch 15
    * Move a reference to CONFIG_MSHV_VTL from patch 9 to 15

Nuno Das Neves (15):
  hyperv-tlfs: Change shared HV_REGISTER_* defines to HV_MSR_*
  mshyperv: Introduce hv_get_hypervisor_version function
  mshyperv: Introduce numa_node_to_proximity_domain_info
  asm-generic/mshyperv: Introduce hv_recommend_using_aeoi()
  hyperv: Move hv_connection_id to hyperv-tlfs
  hyperv-tlfs: Introduce hv_status_to_string and hv_status_to_errno
  Drivers: hv: Move hv_call_deposit_pages and hv_call_create_vp to
    common code
  Drivers: hv: Introduce per-cpu event ring tail
  Drivers: hv: Introduce hv_output_arg_exists in hv_common.c
  x86: hyperv: Add mshv_handler irq handler and setup function
  Drivers: hv: export vmbus_isr, hv_context and hv_post_message
  Documentation: Reserve ioctl number for mshv driver
  uapi: hyperv: Add mshv driver headers hvhdk.h, hvhdk_mini.h, hvgdk.h,
    hvgdk_mini.h
  asm-generic: hyperv: Use mshv headers conditionally. Add
    asm-generic/hyperv-defs.h
  Drivers: hv: Add modules to expose /dev/mshv to VMMs running on
    Hyper-V

 .../userspace-api/ioctl/ioctl-number.rst      |    2 +
 arch/arm64/hyperv/mshyperv.c                  |   23 +-
 arch/arm64/include/asm/hyperv-tlfs.h          |   25 +
 arch/arm64/include/asm/mshyperv.h             |    2 +-
 arch/x86/hyperv/hv_init.c                     |    2 +-
 arch/x86/hyperv/hv_proc.c                     |  166 +-
 arch/x86/include/asm/hyperv-tlfs.h            |  137 +-
 arch/x86/include/asm/mshyperv.h               |   13 +-
 arch/x86/kernel/cpu/mshyperv.c                |   71 +-
 drivers/acpi/numa/srat.c                      |    1 +
 drivers/clocksource/hyperv_timer.c            |   24 +-
 drivers/hv/Kconfig                            |   50 +
 drivers/hv/Makefile                           |   20 +
 drivers/hv/hv.c                               |   46 +-
 drivers/hv/hv_call.c                          |  119 +
 drivers/hv/hv_common.c                        |  222 +-
 drivers/hv/hyperv_vmbus.h                     |    2 +-
 drivers/hv/mshv.h                             |  156 ++
 drivers/hv/mshv_eventfd.c                     |  758 +++++++
 drivers/hv/mshv_eventfd.h                     |   80 +
 drivers/hv/mshv_main.c                        |  208 ++
 drivers/hv/mshv_msi.c                         |  129 ++
 drivers/hv/mshv_portid_table.c                |   84 +
 drivers/hv/mshv_root.h                        |  194 ++
 drivers/hv/mshv_root_hv_call.c                | 1064 +++++++++
 drivers/hv/mshv_root_main.c                   | 1964 +++++++++++++++++
 drivers/hv/mshv_synic.c                       |  689 ++++++
 drivers/hv/mshv_vtl.h                         |   52 +
 drivers/hv/mshv_vtl_main.c                    | 1542 +++++++++++++
 drivers/hv/vmbus_drv.c                        |    3 +-
 drivers/hv/xfer_to_guest.c                    |   28 +
 include/asm-generic/hyperv-defs.h             |   26 +
 include/asm-generic/hyperv-tlfs.h             |   77 +-
 include/asm-generic/mshyperv.h                |   76 +-
 include/linux/hyperv.h                        |   11 +-
 include/uapi/hyperv/hvgdk.h                   |   41 +
 include/uapi/hyperv/hvgdk_mini.h              | 1077 +++++++++
 include/uapi/hyperv/hvhdk.h                   | 1352 ++++++++++++
 include/uapi/hyperv/hvhdk_mini.h              |  164 ++
 include/uapi/linux/mshv.h                     |  298 +++
 40 files changed, 10645 insertions(+), 353 deletions(-)
 create mode 100644 drivers/hv/hv_call.c
 create mode 100644 drivers/hv/mshv.h
 create mode 100644 drivers/hv/mshv_eventfd.c
 create mode 100644 drivers/hv/mshv_eventfd.h
 create mode 100644 drivers/hv/mshv_main.c
 create mode 100644 drivers/hv/mshv_msi.c
 create mode 100644 drivers/hv/mshv_portid_table.c
 create mode 100644 drivers/hv/mshv_root.h
 create mode 100644 drivers/hv/mshv_root_hv_call.c
 create mode 100644 drivers/hv/mshv_root_main.c
 create mode 100644 drivers/hv/mshv_synic.c
 create mode 100644 drivers/hv/mshv_vtl.h
 create mode 100644 drivers/hv/mshv_vtl_main.c
 create mode 100644 drivers/hv/xfer_to_guest.c
 create mode 100644 include/asm-generic/hyperv-defs.h
 create mode 100644 include/uapi/hyperv/hvgdk.h
 create mode 100644 include/uapi/hyperv/hvgdk_mini.h
 create mode 100644 include/uapi/hyperv/hvhdk.h
 create mode 100644 include/uapi/hyperv/hvhdk_mini.h
 create mode 100644 include/uapi/linux/mshv.h

Comments

Saurabh Singh Sengar Aug. 18, 2023, 1:08 p.m. UTC | #1
> -----Original Message-----
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> Sent: Friday, August 18, 2023 3:32 AM
> To: linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org;
> x86@kernel.org; linux-arm-kernel@lists.infradead.org; linux-
> arch@vger.kernel.org
> Cc: patches@lists.linux.dev; Michael Kelley (LINUX)
> <mikelley@microsoft.com>; KY Srinivasan <kys@microsoft.com>;
> wei.liu@kernel.org; Haiyang Zhang <haiyangz@microsoft.com>; Dexuan Cui
> <decui@microsoft.com>; apais@linux.microsoft.com; Tianyu Lan
> <Tianyu.Lan@microsoft.com>; ssengar@linux.microsoft.com; MUKESH
> RATHOR <mukeshrathor@microsoft.com>; stanislav.kinsburskiy@gmail.com;
> jinankjain@linux.microsoft.com; vkuznets <vkuznets@redhat.com>;
> tglx@linutronix.de; mingo@redhat.com; bp@alien8.de;
> dave.hansen@linux.intel.com; hpa@zytor.com; will@kernel.org;
> catalin.marinas@arm.com
> Subject: [PATCH v2 15/15] Drivers: hv: Add modules to expose /dev/mshv to
> VMMs running on Hyper-V
> 
> Add mshv, mshv_root, and mshv_vtl modules:
> 
> Module mshv is the parent module to the other two. It provides /dev/mshv,
> plus
> some common hypercall helper code. When one of the child modules is
> loaded, it
> is registered with the mshv module, which then provides entry point(s) to the
> child module via the IOCTLs defined in uapi/linux/mshv.h.
> 
> E.g. When the mshv_root module is loaded, it registers itself, and the
> MSHV_CREATE_PARTITION IOCTL becomes available in /dev/mshv. That is
> used to
> get a partition fd managed by mshv_root.
> 
> Similarly for mshv_vtl module, there is MSHV_CREATE_VTL, which creates
> an fd representing the lower vtl, managed by mshv_vtl.
> 
> Module mshv_root provides APIs for creating and managing child partitions.
> It
> defines abstractions for partitions (vms), vps (vcpus), and other things
> related to running a guest. It exposes the userspace interfaces for a VMM to
> manage the guest.
> 
> Module mshv_vtl provides VTL (Virtual Trust Level) support for VMMs. In
> this scenario, the host kernel and VMM run in a higher trust level than the
> guest, but within the same partition. This provides better isolation and
> performance.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> ---
>  drivers/hv/Kconfig             |   50 +
>  drivers/hv/Makefile            |   20 +
>  drivers/hv/hv_call.c           |  119 ++
>  drivers/hv/hv_common.c         |    4 +
>  drivers/hv/mshv.h              |  156 +++
>  drivers/hv/mshv_eventfd.c      |  758 ++++++++++++
>  drivers/hv/mshv_eventfd.h      |   80 ++
>  drivers/hv/mshv_main.c         |  208 ++++
>  drivers/hv/mshv_msi.c          |  129 +++
>  drivers/hv/mshv_portid_table.c |   84 ++
>  drivers/hv/mshv_root.h         |  194 ++++
>  drivers/hv/mshv_root_hv_call.c | 1064 +++++++++++++++++
>  drivers/hv/mshv_root_main.c    | 1964
> ++++++++++++++++++++++++++++++++
>  drivers/hv/mshv_synic.c        |  689 +++++++++++
>  drivers/hv/mshv_vtl.h          |   52 +
>  drivers/hv/mshv_vtl_main.c     | 1542 +++++++++++++++++++++++++
>  drivers/hv/xfer_to_guest.c     |   28 +
>  include/uapi/linux/mshv.h      |  298 +++++
>  18 files changed, 7439 insertions(+)
>  create mode 100644 drivers/hv/hv_call.c
>  create mode 100644 drivers/hv/mshv.h
>  create mode 100644 drivers/hv/mshv_eventfd.c
>  create mode 100644 drivers/hv/mshv_eventfd.h
>  create mode 100644 drivers/hv/mshv_main.c
>  create mode 100644 drivers/hv/mshv_msi.c
>  create mode 100644 drivers/hv/mshv_portid_table.c
>  create mode 100644 drivers/hv/mshv_root.h
>  create mode 100644 drivers/hv/mshv_root_hv_call.c
>  create mode 100644 drivers/hv/mshv_root_main.c
>  create mode 100644 drivers/hv/mshv_synic.c
>  create mode 100644 drivers/hv/mshv_vtl.h
>  create mode 100644 drivers/hv/mshv_vtl_main.c
>  create mode 100644 drivers/hv/xfer_to_guest.c
>  create mode 100644 include/uapi/linux/mshv.h
> 
> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index 00242107d62e..0d9aefc07b15 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -54,4 +54,54 @@ config HYPERV_BALLOON
>  	help
>  	  Select this option to enable Hyper-V Balloon driver.
> 
> +config MSHV
> +	tristate "Microsoft Hypervisor root partition interfaces: /dev/mshv"
> +	depends on X86_64 && HYPERV
> +	select EVENTFD
> +	select MSHV_XFER_TO_GUEST_WORK
> +	help
> +	  Select this option to enable core functionality for managing guest
> +	  virtual machines running under the Microsoft Hypervisor.
> +
> +	  The interfaces are provided via a device named /dev/mshv.
> +
> +	  To compile this as a module, choose M here.
> +
> +	  If unsure, say N.
> +
> +config MSHV_ROOT
> +	tristate "Microsoft Hyper-V root partition APIs driver"
> +	depends on MSHV
> +	help
> +	  Select this option to provide /dev/mshv interfaces specific to
> +	  running as the root partition on Microsoft Hypervisor.
> +
> +	  To compile this as a module, choose M here.
> +
> +	  If unsure, say N.
> +
> +config MSHV_VTL
> +	tristate "Microsoft Hyper-V VTL driver"
> +	depends on MSHV
> +	select HYPERV_VTL_MODE
> +	select TRANSPARENT_HUGEPAGE

TRANSPARENT_HUGEPAGE can be avoided for now.

> +	help
> +	  Select this option to enable Hyper-V VTL driver.
> +	  Virtual Secure Mode (VSM) is a set of hypervisor capabilities and
> +	  enlightenments offered to host and guest partitions which enables
> +	  the creation and management of new security boundaries within
> +	  operating system software.
> +
> +	  VSM achieves and maintains isolation through Virtual Trust Levels
> +	  (VTLs). Virtual Trust Levels are hierarchical, with higher levels
> +	  being more privileged than lower levels. VTL0 is the least privileged
> +	  level, and currently only other level supported is VTL2.
> +
> +	  To compile this as a module, choose M here.
> +
> +	  If unsure, say N.
> +
> +config MSHV_XFER_TO_GUEST_WORK
> +	bool
> +
>  endmenu
> diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
> index d76df5c8c2a9..da7aa7542b05 100644
> --- a/drivers/hv/Makefile
> +++ b/drivers/hv/Makefile
> @@ -2,10 +2,30 @@
>  obj-$(CONFIG_HYPERV)		+= hv_vmbus.o
>  obj-$(CONFIG_HYPERV_UTILS)	+= hv_utils.o
>  obj-$(CONFIG_HYPERV_BALLOON)	+= hv_balloon.o
> +obj-$(CONFIG_MSHV)			+= mshv.o
> +obj-$(CONFIG_MSHV_VTL)		+= mshv_vtl.o
> +obj-$(CONFIG_MSHV_ROOT)		+= mshv_root.o
> 
>  CFLAGS_hv_trace.o = -I$(src)
>  CFLAGS_hv_balloon.o = -I$(src)
> 
> +CFLAGS_mshv_main.o			= -DHV_HYPERV_DEFS
> +CFLAGS_hv_call.o			= -DHV_HYPERV_DEFS
> +CFLAGS_mshv_root_main.o		= -DHV_HYPERV_DEFS
> +CFLAGS_mshv_root_hv_call.o	= -DHV_HYPERV_DEFS
> +CFLAGS_mshv_synic.o			= -DHV_HYPERV_DEFS
> +CFLAGS_mshv_portid_table.o	= -DHV_HYPERV_DEFS
> +CFLAGS_mshv_eventfd.o		= -DHV_HYPERV_DEFS
> +CFLAGS_mshv_msi.o			= -DHV_HYPERV_DEFS
> +CFLAGS_mshv_vtl_main.o		= -DHV_HYPERV_DEFS
> +
> +mshv-y				+= mshv_main.o
> +mshv_root-y			:= mshv_root_main.o mshv_synic.o
> mshv_portid_table.o \
> +						mshv_eventfd.o mshv_msi.o
> mshv_root_hv_call.o hv_call.o
> +mshv_vtl-y			:= mshv_vtl_main.o hv_call.o
> +
> +obj-$(CONFIG_MSHV_XFER_TO_GUEST_WORK) += xfer_to_guest.o
> +
>  hv_vmbus-y := vmbus_drv.o \
>  		 hv.o connection.o channel.o \
>  		 channel_mgmt.o ring_buffer.o hv_trace.o
> diff --git a/drivers/hv/hv_call.c b/drivers/hv/hv_call.c
> new file mode 100644
> index 000000000000..4455001d8545
> --- /dev/null
> +++ b/drivers/hv/hv_call.c
> @@ -0,0 +1,119 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023, Microsoft Corporation.
> + *
> + * Hypercall helper functions shared between mshv modules.
> + *
> + * Authors:
> + *   Nuno Das Neves <nunodasneves@linux.microsoft.com>
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +#include <asm/mshyperv.h>
> +
> +#define HV_GET_REGISTER_BATCH_SIZE	\
> +	(HV_HYP_PAGE_SIZE / sizeof(union hv_register_value))
> +#define HV_SET_REGISTER_BATCH_SIZE	\
> +	((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_set_vp_registers)) \
> +		/ sizeof(struct hv_register_assoc))
> +
> +int hv_call_get_vp_registers(
> +		u32 vp_index,
> +		u64 partition_id,
> +		u16 count,
> +		union hv_input_vtl input_vtl,
> +		struct hv_register_assoc *registers)
> +{
> +	struct hv_input_get_vp_registers *input_page;
> +	union hv_register_value *output_page;
> +	u16 completed = 0;
> +	unsigned long remaining = count;
> +	int rep_count, i;
> +	u64 status;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +
> +	input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output_page = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> +	input_page->partition_id = partition_id;
> +	input_page->vp_index = vp_index;
> +	input_page->input_vtl.as_uint8 = input_vtl.as_uint8;
> +	input_page->rsvd_z8 = 0;
> +	input_page->rsvd_z16 = 0;
> +
> +	while (remaining) {
> +		rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);
> +		for (i = 0; i < rep_count; ++i)
> +			input_page->names[i] = registers[i].name;
> +
> +		status = hv_do_rep_hypercall(HVCALL_GET_VP_REGISTERS,
> rep_count,
> +					     0, input_page, output_page);

Is there any possibility that count value is passed 0 by mistake ? In that case
status will remain uninitialized. 

> +		if (!hv_result_success(status)) {
> +			pr_err("%s: completed %li out of %u, %s\n",
> +			       __func__,
> +			       count - remaining, count,
> +			       hv_status_to_string(status));
> +			break;
> +		}
> +		completed = hv_repcomp(status);
> +		for (i = 0; i < completed; ++i)
> +			registers[i].value = output_page[i];
> +
> +		registers += completed;
> +		remaining -= completed;
> +	}
> +	local_irq_restore(flags);
> +
> +	return hv_status_to_errno(status);
> +}
> +
> +int hv_call_set_vp_registers(
> +		u32 vp_index,
> +		u64 partition_id,
> +		u16 count,
> +		union hv_input_vtl input_vtl,
> +		struct hv_register_assoc *registers)
> +{
> +	struct hv_input_set_vp_registers *input_page;
> +	u16 completed = 0;
> +	unsigned long remaining = count;
> +	int rep_count;
> +	u64 status;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> +	input_page->partition_id = partition_id;
> +	input_page->vp_index = vp_index;
> +	input_page->input_vtl.as_uint8 = input_vtl.as_uint8;
> +	input_page->rsvd_z8 = 0;
> +	input_page->rsvd_z16 = 0;
> +
> +	while (remaining) {
> +		rep_count = min(remaining, HV_SET_REGISTER_BATCH_SIZE);
> +		memcpy(input_page->elements, registers,
> +			sizeof(struct hv_register_assoc) * rep_count);
> +
> +		status = hv_do_rep_hypercall(HVCALL_SET_VP_REGISTERS,
> rep_count,
> +					     0, input_page, NULL);
> +		if (!hv_result_success(status)) {
> +			pr_err("%s: completed %li out of %u, %s\n",
> +			       __func__,
> +			       count - remaining, count,
> +			       hv_status_to_string(status));
> +			break;
> +		}
> +		completed = hv_repcomp(status);
> +		registers += completed;
> +		remaining -= completed;
> +	}
> +
> +	local_irq_restore(flags);
> +
> +	return hv_status_to_errno(status);
> +}
> +
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 13f972e72375..ccd76f30a638 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -62,7 +62,11 @@ EXPORT_SYMBOL_GPL(hyperv_pcpu_output_arg);
>   */
>  static inline bool hv_output_arg_exists(void)
>  {
> +#ifdef CONFIG_MSHV_VTL

Although today both the option works together. But thinking
which is more accurate CONFIG_HYPERV_VTL_MODE or
CONFIG_MSHV_VTL here for scalability of VTL modules.

> +	return true;
> +#else
>  	return hv_root_partition ? true : false;
> +#endif
>  }
> 
>  static void hv_kmsg_dump_unregister(void);
> diff --git a/drivers/hv/mshv.h b/drivers/hv/mshv.h
> new file mode 100644
> index 000000000000..166480a73f3f
> --- /dev/null
> +++ b/drivers/hv/mshv.h
> @@ -0,0 +1,156 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2023, Microsoft Corporation.
> + */
> +
> +#ifndef _MSHV_H_
> +#define _MSHV_H_
> +
> +#include <linux/spinlock.h>
> +#include <linux/mutex.h>
> +#include <linux/semaphore.h>
> +#include <linux/sched.h>
> +#include <linux/srcu.h>
> +#include <linux/wait.h>
> +#include <uapi/linux/mshv.h>
> +
> +/*
> + * Hyper-V hypercalls
> + */
> +
> +int hv_call_withdraw_memory(u64 count, int node, u64 partition_id);
> +int hv_call_create_partition(
> +		u64 flags,
> +		struct hv_partition_creation_properties creation_properties,
> +		union hv_partition_isolation_properties isolation_properties,
> +		u64 *partition_id);
> +int hv_call_initialize_partition(u64 partition_id);
> +int hv_call_finalize_partition(u64 partition_id);
> +int hv_call_delete_partition(u64 partition_id);
> +int hv_call_map_gpa_pages(
> +		u64 partition_id,
> +		u64 gpa_target,
> +		u64 page_count, u32 flags,
> +		struct page **pages);
> +int hv_call_unmap_gpa_pages(
> +		u64 partition_id,
> +		u64 gpa_target,
> +		u64 page_count, u32 flags);
> +int hv_call_get_vp_registers(
> +		u32 vp_index,
> +		u64 partition_id,
> +		u16 count,
> +		union hv_input_vtl input_vtl,
> +		struct hv_register_assoc *registers);
> +int hv_call_get_gpa_access_states(
> +		u64 partition_id,
> +		u32 count,
> +		u64 gpa_base_pfn,
> +		u64 state_flags,
> +		int *written_total,
> +		union hv_gpa_page_access_state *states);
> +
> +int hv_call_set_vp_registers(
> +		u32 vp_index,
> +		u64 partition_id,
> +		u16 count,
> +		union hv_input_vtl input_vtl,
> +		struct hv_register_assoc *registers);

Nit: Opportunity to fix many of the checkpatch.pl related to line break here
and many other places.

> +int hv_call_install_intercept(u64 partition_id, u32 access_type,
> +		enum hv_intercept_type intercept_type,
> +		union hv_intercept_parameters intercept_parameter);
> +int hv_call_assert_virtual_interrupt(
> +		u64 partition_id,
> +		u32 vector,
> +		u64 dest_addr,
> +		union hv_interrupt_control control);
> +int hv_call_clear_virtual_interrupt(u64 partition_id);
> +
> +#ifdef HV_SUPPORTS_VP_STATE
> +int hv_call_get_vp_state(
> +		u32 vp_index,
> +		u64 partition_id,
> +		enum hv_get_set_vp_state_type type,
> +		struct hv_vp_state_data_xsave xsave,
> +		/* Choose between pages and ret_output */
> +		u64 page_count,
> +		struct page **pages,
> +		union hv_output_get_vp_state *ret_output);
> +int hv_call_set_vp_state(
> +		u32 vp_index,
> +		u64 partition_id,
> +		enum hv_get_set_vp_state_type type,
> +		struct hv_vp_state_data_xsave xsave,
> +		/* Choose between pages and bytes */
> +		u64 page_count,
> +		struct page **pages,
> +		u32 num_bytes,
> +		u8 *bytes);
> +#endif
> +
> +int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
> +				struct page **state_page);
> +int hv_call_unmap_vp_state_page(u64 partition_id, u32 vp_index, u32
> type);
> +int hv_call_get_partition_property(
> +		u64 partition_id,
> +		u64 property_code,
> +		u64 *property_value);
> +int hv_call_set_partition_property(
> +	u64 partition_id, u64 property_code, u64 property_value,
> +	void (*completion_handler)(void * /* data */, u64 * /* status */),
> +	void *completion_data);
> +int hv_call_translate_virtual_address(
> +		u32 vp_index,
> +		u64 partition_id,
> +		u64 flags,
> +		u64 gva,
> +		u64 *gpa,
> +		union hv_translate_gva_result *result);
> +int hv_call_get_vp_cpuid_values(
> +		u32 vp_index,
> +		u64 partition_id,
> +		union hv_get_vp_cpuid_values_flags values_flags,
> +		struct hv_cpuid_leaf_info *info,
> +		union hv_output_get_vp_cpuid_values *result);
> +
> +int hv_call_create_port(u64 port_partition_id, union hv_port_id port_id,
> +			u64 connection_partition_id, struct hv_port_info
> *port_info,
> +			u8 port_vtl, u8 min_connection_vtl, int node);
> +int hv_call_delete_port(u64 port_partition_id, union hv_port_id port_id);
> +int hv_call_connect_port(u64 port_partition_id, union hv_port_id port_id,
> +			 u64 connection_partition_id,
> +			 union hv_connection_id connection_id,
> +			 struct hv_connection_info *connection_info,
> +			 u8 connection_vtl, int node);
> +int hv_call_disconnect_port(u64 connection_partition_id,
> +			    union hv_connection_id connection_id);
> +int hv_call_notify_port_ring_empty(u32 sint_index);
> +#ifdef HV_SUPPORTS_REGISTER_INTERCEPT
> +int hv_call_register_intercept_result(u32 vp_index,
> +				  u64 partition_id,
> +				  enum hv_intercept_type intercept_type,
> +				  union
> hv_register_intercept_result_parameters *params);
> +#endif
> +int hv_call_signal_event_direct(u32 vp_index,
> +				u64 partition_id,
> +				u8 vtl,
> +				u8 sint,
> +				u16 flag_number,
> +				u8 *newly_signaled);
> +int hv_call_post_message_direct(u32 vp_index,
> +				u64 partition_id,
> +				u8 vtl,
> +				u32 sint_index,
> +				u8 *message);
> +
> +struct mshv_partition *mshv_partition_find(u64 partition_id)
> __must_hold(RCU);
> +
> +int mshv_xfer_to_guest_mode_handle_work(unsigned long ti_work);
> +
> +typedef long (*mshv_create_func_t)(void __user *user_arg);
> +typedef long (*mshv_check_ext_func_t)(u32 arg);
> +int mshv_setup_vtl_func(const mshv_create_func_t create_vtl,
> +			const mshv_check_ext_func_t check_ext);
> +int mshv_set_create_partition_func(const mshv_create_func_t func);
> +
> +#endif /* _MSHV_H */
> diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
> new file mode 100644
> index 000000000000..ddc64fe3920e
> --- /dev/null
> +++ b/drivers/hv/mshv_eventfd.c
> @@ -0,0 +1,758 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * eventfd support for mshv
> + *
> + * Heavily inspired from KVM implementation of irqfd/ioeventfd. The basic
> + * framework code is taken from the kvm implementation.
> + *
> + * All credits to kvm developers.
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/wait.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +#include <linux/list.h>
> +#include <linux/workqueue.h>
> +#include <linux/eventfd.h>
> +
> +#include "mshv_eventfd.h"
> +#include "mshv.h"
> +#include "mshv_root.h"
> +
> +static struct workqueue_struct *irqfd_cleanup_wq;
> +
> +void
> +mshv_register_irq_ack_notifier(struct mshv_partition *partition,
> +			       struct mshv_irq_ack_notifier *mian)
> +{
> +	mutex_lock(&partition->irq_lock);
> +	hlist_add_head_rcu(&mian->link, &partition->irq_ack_notifier_list);
> +	mutex_unlock(&partition->irq_lock);
> +}
> +
> +void
> +mshv_unregister_irq_ack_notifier(struct mshv_partition *partition,
> +				 struct mshv_irq_ack_notifier *mian)
> +{
> +	mutex_lock(&partition->irq_lock);
> +	hlist_del_init_rcu(&mian->link);
> +	mutex_unlock(&partition->irq_lock);
> +	synchronize_rcu();
> +}
> +
> +bool
> +mshv_notify_acked_gsi(struct mshv_partition *partition, int gsi)
> +{
> +	struct mshv_irq_ack_notifier *mian;
> +	bool acked = false;
> +
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(mian, &partition->irq_ack_notifier_list,
> +			link) {
> +		if (mian->gsi == gsi) {
> +			mian->irq_acked(mian);
> +			acked = true;
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	return acked;
> +}
> +
> +static inline bool hv_should_clear_interrupt(enum hv_interrupt_type type)
> +{
> +	return type == HV_X64_INTERRUPT_TYPE_EXTINT;
> +}
> +
> +static void
> +irqfd_resampler_ack(struct mshv_irq_ack_notifier *mian)
> +{
> +	struct mshv_kernel_irqfd_resampler *resampler;
> +	struct mshv_partition *partition;
> +	struct mshv_kernel_irqfd *irqfd;
> +	int idx;
> +
> +	resampler = container_of(mian,
> +			struct mshv_kernel_irqfd_resampler, notifier);
> +	partition = resampler->partition;
> +
> +	idx = srcu_read_lock(&partition->irq_srcu);
> +
> +	hlist_for_each_entry_rcu(irqfd, &resampler->irqfds_list,
> resampler_hnode) {
> +		if (hv_should_clear_interrupt(irqfd-
> >lapic_irq.control.interrupt_type))
> +			hv_call_clear_virtual_interrupt(partition->id);
> +
> +		eventfd_signal(irqfd->resamplefd, 1);
> +	}
> +
> +	srcu_read_unlock(&partition->irq_srcu, idx);
> +}
> +
> +static void
> +irqfd_assert(struct work_struct *work)
> +{
> +	struct mshv_kernel_irqfd *irqfd =
> +		container_of(work, struct mshv_kernel_irqfd, assert);
> +	struct mshv_lapic_irq *irq = &irqfd->lapic_irq;
> +
> +	hv_call_assert_virtual_interrupt(irqfd->partition->id,
> +					 irq->vector, irq->apic_id,
> +					 irq->control);
> +}
> +
> +static void
> +irqfd_inject(struct mshv_kernel_irqfd *irqfd)
> +{
> +	struct mshv_partition *partition = irqfd->partition;
> +	struct mshv_lapic_irq *irq = &irqfd->lapic_irq;
> +	unsigned int seq;
> +	int idx;
> +
> +	WARN_ON(irqfd->resampler &&
> +		!irq->control.level_triggered);
> +
> +	idx = srcu_read_lock(&partition->irq_srcu);
> +	if (irqfd->msi_entry.gsi) {
> +		if (!irqfd->msi_entry.entry_valid) {
> +			pr_warn("Invalid routing info for gsi %u",
> +				irqfd->msi_entry.gsi);
> +			srcu_read_unlock(&partition->irq_srcu, idx);
> +			return;
> +		}
> +
> +		do {
> +			seq = read_seqcount_begin(&irqfd->msi_entry_sc);
> +		} while (read_seqcount_retry(&irqfd->msi_entry_sc, seq));
> +	}
> +
> +	srcu_read_unlock(&partition->irq_srcu, idx);
> +
> +	schedule_work(&irqfd->assert);
> +}
> +
> +static void
> +irqfd_resampler_shutdown(struct mshv_kernel_irqfd *irqfd)
> +{
> +	struct mshv_kernel_irqfd_resampler *resampler = irqfd->resampler;
> +	struct mshv_partition *partition = resampler->partition;
> +
> +	mutex_lock(&partition->irqfds.resampler_lock);
> +
> +	hlist_del_rcu(&irqfd->resampler_hnode);
> +	synchronize_srcu(&partition->irq_srcu);
> +
> +	if (hlist_empty(&resampler->irqfds_list)) {
> +		hlist_del(&resampler->hnode);
> +		mshv_unregister_irq_ack_notifier(partition, &resampler-
> >notifier);
> +		kfree(resampler);
> +	}
> +
> +	mutex_unlock(&partition->irqfds.resampler_lock);
> +}
> +
> +/*
> + * Race-free decouple logic (ordering is critical)
> + */
> +static void
> +irqfd_shutdown(struct work_struct *work)
> +{
> +	struct mshv_kernel_irqfd *irqfd =
> +		container_of(work, struct mshv_kernel_irqfd, shutdown);
> +
> +	/*
> +	 * Synchronize with the wait-queue and unhook ourselves to prevent
> +	 * further events.
> +	 */
> +	remove_wait_queue(irqfd->wqh, &irqfd->wait);
> +
> +	if (irqfd->resampler) {
> +		irqfd_resampler_shutdown(irqfd);
> +		eventfd_ctx_put(irqfd->resamplefd);
> +	}
> +
> +	/*
> +	 * We know no new events will be scheduled at this point, so block
> +	 * until all previously outstanding events have completed
> +	 */
> +	flush_work(&irqfd->assert);
> +
> +	/*
> +	 * It is now safe to release the object's resources
> +	 */
> +	eventfd_ctx_put(irqfd->eventfd);
> +	kfree(irqfd);
> +}
> +
> +/* assumes partition->irqfds.lock is held */
> +static bool
> +irqfd_is_active(struct mshv_kernel_irqfd *irqfd)
> +{
> +	return !hlist_unhashed(&irqfd->hnode);
> +}
> +
> +/*
> + * Mark the irqfd as inactive and schedule it for removal
> + *
> + * assumes partition->irqfds.lock is held
> + */
> +static void
> +irqfd_deactivate(struct mshv_kernel_irqfd *irqfd)
> +{
> +	WARN_ON(!irqfd_is_active(irqfd));
> +
> +	hlist_del(&irqfd->hnode);
> +
> +	queue_work(irqfd_cleanup_wq, &irqfd->shutdown);
> +}
> +
> +/*
> + * Called with wqh->lock held and interrupts disabled
> + */
> +static int
> +irqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
> +		int sync, void *key)
> +{
> +	struct mshv_kernel_irqfd *irqfd =
> +		container_of(wait, struct mshv_kernel_irqfd, wait);
> +	unsigned long flags = (unsigned long)key;
> +	int idx;
> +	unsigned int seq;
> +	struct mshv_partition *partition = irqfd->partition;
> +	int ret = 0;
> +
> +	if (flags & POLLIN) {
> +		u64 cnt;
> +
> +		eventfd_ctx_do_read(irqfd->eventfd, &cnt);
> +		idx = srcu_read_lock(&partition->irq_srcu);
> +		do {
> +			seq = read_seqcount_begin(&irqfd->msi_entry_sc);
> +		} while (read_seqcount_retry(&irqfd->msi_entry_sc, seq));
> +
> +		/* An event has been signaled, inject an interrupt */
> +		irqfd_inject(irqfd);
> +		srcu_read_unlock(&partition->irq_srcu, idx);
> +
> +		ret = 1;
> +	}
> +
> +	if (flags & POLLHUP) {
> +		/* The eventfd is closing, detach from Partition */
> +		unsigned long flags;
> +
> +		spin_lock_irqsave(&partition->irqfds.lock, flags);
> +
> +		/*
> +		 * We must check if someone deactivated the irqfd before
> +		 * we could acquire the irqfds.lock since the item is
> +		 * deactivated from the mshv side before it is unhooked from
> +		 * the wait-queue.  If it is already deactivated, we can
> +		 * simply return knowing the other side will cleanup for us.
> +		 * We cannot race against the irqfd going away since the
> +		 * other side is required to acquire wqh->lock, which we hold
> +		 */
> +		if (irqfd_is_active(irqfd))
> +			irqfd_deactivate(irqfd);
> +
> +		spin_unlock_irqrestore(&partition->irqfds.lock, flags);
> +	}
> +
> +	return ret;
> +}
> +
> +/* Must be called under irqfds.lock */
> +static void irqfd_update(struct mshv_partition *partition,
> +			 struct mshv_kernel_irqfd *irqfd)
> +{
> +	write_seqcount_begin(&irqfd->msi_entry_sc);
> +	irqfd->msi_entry = mshv_msi_map_gsi(partition, irqfd->gsi);
> +	mshv_set_msi_irq(&irqfd->msi_entry, &irqfd->lapic_irq);
> +	write_seqcount_end(&irqfd->msi_entry_sc);
> +}
> +
> +void mshv_irqfd_routing_update(struct mshv_partition *partition)
> +{
> +	struct mshv_kernel_irqfd *irqfd;
> +
> +	spin_lock_irq(&partition->irqfds.lock);
> +	hlist_for_each_entry(irqfd, &partition->irqfds.items, hnode)
> +		irqfd_update(partition, irqfd);
> +	spin_unlock_irq(&partition->irqfds.lock);
> +}
> +
> +static void
> +irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
> +			poll_table *pt)
> +{
> +	struct mshv_kernel_irqfd *irqfd =
> +		container_of(pt, struct mshv_kernel_irqfd, pt);
> +
> +	irqfd->wqh = wqh;
> +	add_wait_queue_priority(wqh, &irqfd->wait);
> +}
> +
> +static int
> +mshv_irqfd_assign(struct mshv_partition *partition,
> +		  struct mshv_irqfd *args)
> +{
> +	struct eventfd_ctx *eventfd = NULL, *resamplefd = NULL;
> +	struct mshv_kernel_irqfd *irqfd, *tmp;
> +	unsigned int events;
> +	struct fd f;
> +	int ret;
> +	int idx;
> +
> +	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> +	if (!irqfd)
> +		return -ENOMEM;
> +
> +	irqfd->partition = partition;
> +	irqfd->gsi = args->gsi;
> +	INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
> +	INIT_WORK(&irqfd->assert, irqfd_assert);
> +	seqcount_spinlock_init(&irqfd->msi_entry_sc,
> +			       &partition->irqfds.lock);
> +
> +	f = fdget(args->fd);
> +	if (!f.file) {
> +		ret = -EBADF;
> +		goto out;
> +	}
> +
> +	eventfd = eventfd_ctx_fileget(f.file);
> +	if (IS_ERR(eventfd)) {
> +		ret = PTR_ERR(eventfd);
> +		goto fail;
> +	}
> +
> +	irqfd->eventfd = eventfd;
> +
> +	if (args->flags & MSHV_IRQFD_FLAG_RESAMPLE) {
> +		struct mshv_kernel_irqfd_resampler *resampler;
> +
> +		resamplefd = eventfd_ctx_fdget(args->resamplefd);
> +		if (IS_ERR(resamplefd)) {
> +			ret = PTR_ERR(resamplefd);
> +			goto fail;
> +		}
> +
> +		irqfd->resamplefd = resamplefd;
> +
> +		mutex_lock(&partition->irqfds.resampler_lock);
> +
> +		hlist_for_each_entry(resampler,
> +				    &partition->irqfds.resampler_list, hnode) {
> +			if (resampler->notifier.gsi == irqfd->gsi) {
> +				irqfd->resampler = resampler;
> +				break;
> +			}
> +		}
> +
> +		if (!irqfd->resampler) {
> +			resampler = kzalloc(sizeof(*resampler),
> +					    GFP_KERNEL_ACCOUNT);
> +			if (!resampler) {
> +				ret = -ENOMEM;
> +				mutex_unlock(&partition-
> >irqfds.resampler_lock);
> +				goto fail;
> +			}
> +
> +			resampler->partition = partition;
> +			INIT_HLIST_HEAD(&resampler->irqfds_list);
> +			resampler->notifier.gsi = irqfd->gsi;
> +			resampler->notifier.irq_acked = irqfd_resampler_ack;
> +
> +			hlist_add_head(&resampler->hnode, &partition-
> >irqfds.resampler_list);
> +			mshv_register_irq_ack_notifier(partition,
> +						      &resampler->notifier);
> +			irqfd->resampler = resampler;
> +		}
> +
> +		hlist_add_head_rcu(&irqfd->resampler_hnode, &irqfd-
> >resampler->irqfds_list);
> +
> +		mutex_unlock(&partition->irqfds.resampler_lock);
> +	}
> +
> +	/*
> +	 * Install our own custom wake-up handling so we are notified via
> +	 * a callback whenever someone signals the underlying eventfd
> +	 */
> +	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
> +	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
> +
> +	spin_lock_irq(&partition->irqfds.lock);
> +	if (args->flags & MSHV_IRQFD_FLAG_RESAMPLE &&
> +	    !irqfd->lapic_irq.control.level_triggered) {
> +		/*
> +		 * Resample Fd must be for level triggered interrupt
> +		 * Otherwise return with failure
> +		 */
> +		spin_unlock_irq(&partition->irqfds.lock);
> +		ret = -EINVAL;
> +		goto fail;
> +	}
> +	ret = 0;
> +	hlist_for_each_entry(tmp, &partition->irqfds.items, hnode) {
> +		if (irqfd->eventfd != tmp->eventfd)
> +			continue;
> +		/* This fd is used for another irq already. */
> +		ret = -EBUSY;
> +		spin_unlock_irq(&partition->irqfds.lock);
> +		goto fail;
> +	}
> +
> +	idx = srcu_read_lock(&partition->irq_srcu);
> +	irqfd_update(partition, irqfd);
> +	hlist_add_head(&irqfd->hnode, &partition->irqfds.items);
> +	spin_unlock_irq(&partition->irqfds.lock);
> +
> +	/*
> +	 * Check if there was an event already pending on the eventfd
> +	 * before we registered, and trigger it as if we didn't miss it.
> +	 */
> +	events = vfs_poll(f.file, &irqfd->pt);
> +
> +	if (events & POLLIN)
> +		irqfd_inject(irqfd);
> +
> +	srcu_read_unlock(&partition->irq_srcu, idx);
> +	/*
> +	 * do not drop the file until the irqfd is fully initialized, otherwise
> +	 * we might race against the POLLHUP
> +	 */
> +	fdput(f);
> +
> +	return 0;
> +
> +fail:
> +	if (irqfd->resampler)
> +		irqfd_resampler_shutdown(irqfd);
> +
> +	if (resamplefd && !IS_ERR(resamplefd))
> +		eventfd_ctx_put(resamplefd);
> +
> +	if (eventfd && !IS_ERR(eventfd))
> +		eventfd_ctx_put(eventfd);
> +
> +	fdput(f);
> +
> +out:
> +	kfree(irqfd);
> +	return ret;
> +}
> +
> +/*
> + * shutdown any irqfd's that match fd+gsi
> + */
> +static int
> +mshv_irqfd_deassign(struct mshv_partition *partition,
> +		    struct mshv_irqfd *args)
> +{
> +	struct mshv_kernel_irqfd *irqfd;
> +	struct hlist_node *n;
> +	struct eventfd_ctx *eventfd;
> +
> +	eventfd = eventfd_ctx_fdget(args->fd);
> +	if (IS_ERR(eventfd))
> +		return PTR_ERR(eventfd);
> +
> +	hlist_for_each_entry_safe(irqfd, n, &partition->irqfds.items, hnode) {
> +		if (irqfd->eventfd == eventfd && irqfd->gsi == args->gsi)
> +			irqfd_deactivate(irqfd);
> +	}
> +
> +	eventfd_ctx_put(eventfd);
> +
> +	/*
> +	 * Block until we know all outstanding shutdown jobs have completed
> +	 * so that we guarantee there will not be any more interrupts on this
> +	 * gsi once this deassign function returns.
> +	 */
> +	flush_workqueue(irqfd_cleanup_wq);
> +
> +	return 0;
> +}
> +
> +int
> +mshv_irqfd(struct mshv_partition *partition, struct mshv_irqfd *args)
> +{
> +	if (args->flags & MSHV_IRQFD_FLAG_DEASSIGN)
> +		return mshv_irqfd_deassign(partition, args);
> +
> +	return mshv_irqfd_assign(partition, args);
> +}
> +
> +/*
> + * This function is called as the mshv VM fd is being released.
> + * Shutdown all irqfds that still remain open
> + */
> +static void
> +mshv_irqfd_release(struct mshv_partition *partition)
> +{
> +	struct mshv_kernel_irqfd *irqfd;
> +	struct hlist_node *n;
> +
> +	spin_lock_irq(&partition->irqfds.lock);
> +
> +	hlist_for_each_entry_safe(irqfd, n, &partition->irqfds.items, hnode)
> +		irqfd_deactivate(irqfd);
> +
> +	spin_unlock_irq(&partition->irqfds.lock);
> +
> +	/*
> +	 * Block until we know all outstanding shutdown jobs have completed
> +	 * since we do not take a mshv_partition* reference.
> +	 */
> +	flush_workqueue(irqfd_cleanup_wq);
> +
> +}
> +
> +int mshv_irqfd_wq_init(void)
> +{
> +	irqfd_cleanup_wq = alloc_workqueue("mshv-irqfd-cleanup", 0, 0);
> +	if (!irqfd_cleanup_wq)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +void mshv_irqfd_wq_cleanup(void)
> +{
> +	destroy_workqueue(irqfd_cleanup_wq);
> +}
> +
> +/*
> + * --------------------------------------------------------------------
> + * ioeventfd: translate a MMIO memory write to an eventfd signal.
> + *
> + * userspace can register a MMIO address with an eventfd for receiving
> + * notification when the memory has been touched.
> + *
> + * TODO: Implement eventfd for PIO as well.
> + * --------------------------------------------------------------------
> + */
> +
> +static void
> +ioeventfd_release(struct kernel_mshv_ioeventfd *p, u64 partition_id)
> +{
> +	if (p->doorbell_id > 0)
> +		mshv_unregister_doorbell(partition_id, p->doorbell_id);
> +	eventfd_ctx_put(p->eventfd);
> +	kfree(p);
> +}
> +
> +/* MMIO writes trigger an event if the addr/val match */
> +static void
> +ioeventfd_mmio_write(int doorbell_id, void *data)
> +{
> +	struct mshv_partition *partition = (struct mshv_partition *)data;
> +	struct kernel_mshv_ioeventfd *p;
> +
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(p, &partition->ioeventfds.items, hnode) {
> +		if (p->doorbell_id == doorbell_id) {
> +			eventfd_signal(p->eventfd, 1);
> +			break;
> +		}
> +	}
> +	rcu_read_unlock();
> +}
> +
> +static bool
> +ioeventfd_check_collision(struct mshv_partition *partition,
> +			  struct kernel_mshv_ioeventfd *p)
> +	__must_hold(&partition->mutex)
> +{
> +	struct kernel_mshv_ioeventfd *_p;
> +
> +	hlist_for_each_entry(_p, &partition->ioeventfds.items, hnode)
> +		if (_p->addr == p->addr && _p->length == p->length &&
> +		    (_p->wildcard || p->wildcard ||
> +		     _p->datamatch == p->datamatch))
> +			return true;
> +
> +	return false;
> +}
> +
> +static int
> +mshv_assign_ioeventfd(struct mshv_partition *partition,
> +		      struct mshv_ioeventfd *args)
> +	__must_hold(&partition->mutex)
> +{
> +	struct kernel_mshv_ioeventfd *p;
> +	struct eventfd_ctx *eventfd;
> +	u64 doorbell_flags = 0;
> +	int ret;
> +
> +	/* This mutex is currently protecting ioeventfd.items list */
> +	WARN_ON_ONCE(!mutex_is_locked(&partition->mutex));
> +
> +	if (args->flags & MSHV_IOEVENTFD_FLAG_PIO)
> +		return -EOPNOTSUPP;
> +
> +	/* must be natural-word sized */
> +	switch (args->len) {
> +	case 0:
> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_ANY;
> +		break;
> +	case 1:
> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_BYTE;
> +		break;
> +	case 2:
> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_WORD;
> +		break;
> +	case 4:
> +		doorbell_flags =
> HV_DOORBELL_FLAG_TRIGGER_SIZE_DWORD;
> +		break;
> +	case 8:
> +		doorbell_flags =
> HV_DOORBELL_FLAG_TRIGGER_SIZE_QWORD;
> +		break;
> +	default:
> +		pr_warn("ioeventfd: invalid length specified\n");
> +		return -EINVAL;
> +	}
> +
> +	/* check for range overflow */
> +	if (args->addr + args->len < args->addr)
> +		return -EINVAL;
> +
> +	/* check for extra flags that we don't understand */
> +	if (args->flags & ~MSHV_IOEVENTFD_VALID_FLAG_MASK)
> +		return -EINVAL;
> +
> +	eventfd = eventfd_ctx_fdget(args->fd);
> +	if (IS_ERR(eventfd))
> +		return PTR_ERR(eventfd);
> +
> +	p = kzalloc(sizeof(*p), GFP_KERNEL);
> +	if (!p) {
> +		ret = -ENOMEM;
> +		goto fail;
> +	}
> +
> +	p->addr    = args->addr;
> +	p->length  = args->len;
> +	p->eventfd = eventfd;
> +
> +	/* The datamatch feature is optional, otherwise this is a wildcard */
> +	if (args->flags & MSHV_IOEVENTFD_FLAG_DATAMATCH)
> +		p->datamatch = args->datamatch;
> +	else {
> +		p->wildcard = true;
> +		doorbell_flags |=
> HV_DOORBELL_FLAG_TRIGGER_ANY_VALUE;
> +	}
> +
> +	if (ioeventfd_check_collision(partition, p)) {
> +		ret = -EEXIST;
> +		goto unlock_fail;
> +	}
> +
> +	ret = mshv_register_doorbell(partition->id, ioeventfd_mmio_write,
> +				     (void *)partition, p->addr,
> +				     p->datamatch, doorbell_flags);
> +	if (ret < 0) {
> +		pr_err("Failed to register ioeventfd doorbell!\n");

Nit: Do we like to print function name at the start of pr_err. 

- Saurabh
Nuno Das Neves Aug. 18, 2023, 6:59 p.m. UTC | #2
On 8/18/2023 6:08 AM, Saurabh Singh Sengar wrote:
>> +
>> +config MSHV_VTL
>> +	tristate "Microsoft Hyper-V VTL driver"
>> +	depends on MSHV
>> +	select HYPERV_VTL_MODE
>> +	select TRANSPARENT_HUGEPAGE
> 
> TRANSPARENT_HUGEPAGE can be avoided for now.
> 

I will remove it in the next version. Thanks.
>> +
>> +#define HV_GET_REGISTER_BATCH_SIZE	\
>> +	(HV_HYP_PAGE_SIZE / sizeof(union hv_register_value))
>> +#define HV_SET_REGISTER_BATCH_SIZE	\
>> +	((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_set_vp_registers)) \
>> +		/ sizeof(struct hv_register_assoc))
>> +
>> +int hv_call_get_vp_registers(
>> +		u32 vp_index,
>> +		u64 partition_id,
>> +		u16 count,
>> +		union hv_input_vtl input_vtl,
>> +		struct hv_register_assoc *registers)
>> +{
>> +	struct hv_input_get_vp_registers *input_page;
>> +	union hv_register_value *output_page;
>> +	u16 completed = 0;
>> +	unsigned long remaining = count;
>> +	int rep_count, i;
>> +	u64 status;
>> +	unsigned long flags;
>> +
>> +	local_irq_save(flags);
>> +
>> +	input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> +	output_page = *this_cpu_ptr(hyperv_pcpu_output_arg);
>> +
>> +	input_page->partition_id = partition_id;
>> +	input_page->vp_index = vp_index;
>> +	input_page->input_vtl.as_uint8 = input_vtl.as_uint8;
>> +	input_page->rsvd_z8 = 0;
>> +	input_page->rsvd_z16 = 0;
>> +
>> +	while (remaining) {
>> +		rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);
>> +		for (i = 0; i < rep_count; ++i)
>> +			input_page->names[i] = registers[i].name;
>> +
>> +		status = hv_do_rep_hypercall(HVCALL_GET_VP_REGISTERS,
>> rep_count,
>> +					     0, input_page, output_page);
> 
> Is there any possibility that count value is passed 0 by mistake ? In that case
> status will remain uninitialized. 
> 

These lines ensure rep_count is never 0 here:

	while (remaining) {
		rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);

Remaining can't be 0 or the loop would exit, and HV_GET_REGISTER_BATCH_SIZE
is not 0, or we would never get any registers.

>> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
>> index 13f972e72375..ccd76f30a638 100644
>> --- a/drivers/hv/hv_common.c
>> +++ b/drivers/hv/hv_common.c
>> @@ -62,7 +62,11 @@ EXPORT_SYMBOL_GPL(hyperv_pcpu_output_arg);
>>   */
>>  static inline bool hv_output_arg_exists(void)
>>  {
>> +#ifdef CONFIG_MSHV_VTL
> 
> Although today both the option works together. But thinking
> which is more accurate CONFIG_HYPERV_VTL_MODE or
> CONFIG_MSHV_VTL here for scalability of VTL modules.
> 

Good point. Though I'm not sure it matters too much right now,
since as you mention they will always be enabled together.

Does CONFIG_HYPERV_VTL_MODE use the output arg?

>> diff --git a/drivers/hv/mshv.h b/drivers/hv/mshv.h
>> new file mode 100644
>> index 000000000000..166480a73f3f
>> --- /dev/null
>> +++ b/drivers/hv/mshv.h
>> @@ -0,0 +1,156 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +/*
>> + * Copyright (c) 2023, Microsoft Corporation.
>> + */
>> +
>> +#ifndef _MSHV_H_
>> +#define _MSHV_H_
>> +
>> +#include <linux/spinlock.h>
>> +#include <linux/mutex.h>
>> +#include <linux/semaphore.h>
>> +#include <linux/sched.h>
>> +#include <linux/srcu.h>
>> +#include <linux/wait.h>
>> +#include <uapi/linux/mshv.h>
>> +
>> +/*
>> + * Hyper-V hypercalls
>> + */
>> +
>> +int hv_call_withdraw_memory(u64 count, int node, u64 partition_id);
>> +int hv_call_create_partition(
>> +		u64 flags,
>> +		struct hv_partition_creation_properties creation_properties,
>> +		union hv_partition_isolation_properties isolation_properties,
>> +		u64 *partition_id);
>> +int hv_call_initialize_partition(u64 partition_id);
>> +int hv_call_finalize_partition(u64 partition_id);
>> +int hv_call_delete_partition(u64 partition_id);
>> +int hv_call_map_gpa_pages(
>> +		u64 partition_id,
>> +		u64 gpa_target,
>> +		u64 page_count, u32 flags,
>> +		struct page **pages);
>> +int hv_call_unmap_gpa_pages(
>> +		u64 partition_id,
>> +		u64 gpa_target,
>> +		u64 page_count, u32 flags);
>> +int hv_call_get_vp_registers(
>> +		u32 vp_index,
>> +		u64 partition_id,
>> +		u16 count,
>> +		union hv_input_vtl input_vtl,
>> +		struct hv_register_assoc *registers);
>> +int hv_call_get_gpa_access_states(
>> +		u64 partition_id,
>> +		u32 count,
>> +		u64 gpa_base_pfn,
>> +		u64 state_flags,
>> +		int *written_total,
>> +		union hv_gpa_page_access_state *states);
>> +
>> +int hv_call_set_vp_registers(
>> +		u32 vp_index,
>> +		u64 partition_id,
>> +		u16 count,
>> +		union hv_input_vtl input_vtl,
>> +		struct hv_register_assoc *registers);
> 
> Nit: Opportunity to fix many of the checkpatch.pl related to line break here
> and many other places.
> 

checkpatch.pl doesn't complain about anything in this file.

>> +static int
>> +mshv_assign_ioeventfd(struct mshv_partition *partition,
>> +		      struct mshv_ioeventfd *args)
>> +	__must_hold(&partition->mutex)
>> +{
>> +	struct kernel_mshv_ioeventfd *p;
>> +	struct eventfd_ctx *eventfd;
>> +	u64 doorbell_flags = 0;
>> +	int ret;
>> +
>> +	/* This mutex is currently protecting ioeventfd.items list */
>> +	WARN_ON_ONCE(!mutex_is_locked(&partition->mutex));
>> +
>> +	if (args->flags & MSHV_IOEVENTFD_FLAG_PIO)
>> +		return -EOPNOTSUPP;
>> +
>> +	/* must be natural-word sized */
>> +	switch (args->len) {
>> +	case 0:
>> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_ANY;
>> +		break;
>> +	case 1:
>> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_BYTE;
>> +		break;
>> +	case 2:
>> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_WORD;
>> +		break;
>> +	case 4:
>> +		doorbell_flags =
>> HV_DOORBELL_FLAG_TRIGGER_SIZE_DWORD;
>> +		break;
>> +	case 8:
>> +		doorbell_flags =
>> HV_DOORBELL_FLAG_TRIGGER_SIZE_QWORD;
>> +		break;
>> +	default:
>> +		pr_warn("ioeventfd: invalid length specified\n");
>> +		return -EINVAL;
>> +	}
>> +
>> +	/* check for range overflow */
>> +	if (args->addr + args->len < args->addr)
>> +		return -EINVAL;
>> +
>> +	/* check for extra flags that we don't understand */
>> +	if (args->flags & ~MSHV_IOEVENTFD_VALID_FLAG_MASK)
>> +		return -EINVAL;
>> +
>> +	eventfd = eventfd_ctx_fdget(args->fd);
>> +	if (IS_ERR(eventfd))
>> +		return PTR_ERR(eventfd);
>> +
>> +	p = kzalloc(sizeof(*p), GFP_KERNEL);
>> +	if (!p) {
>> +		ret = -ENOMEM;
>> +		goto fail;
>> +	}
>> +
>> +	p->addr    = args->addr;
>> +	p->length  = args->len;
>> +	p->eventfd = eventfd;
>> +
>> +	/* The datamatch feature is optional, otherwise this is a wildcard */
>> +	if (args->flags & MSHV_IOEVENTFD_FLAG_DATAMATCH)
>> +		p->datamatch = args->datamatch;
>> +	else {
>> +		p->wildcard = true;
>> +		doorbell_flags |=
>> HV_DOORBELL_FLAG_TRIGGER_ANY_VALUE;
>> +	}
>> +
>> +	if (ioeventfd_check_collision(partition, p)) {
>> +		ret = -EEXIST;
>> +		goto unlock_fail;
>> +	}
>> +
>> +	ret = mshv_register_doorbell(partition->id, ioeventfd_mmio_write,
>> +				     (void *)partition, p->addr,
>> +				     p->datamatch, doorbell_flags);
>> +	if (ret < 0) {
>> +		pr_err("Failed to register ioeventfd doorbell!\n");
> 
> Nit: Do we like to print function name at the start of pr_err. 
> 

Yes, we should. I will fix it. Thanks!
Saurabh Singh Sengar Aug. 20, 2023, 5:19 a.m. UTC | #3
> -----Original Message-----
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> Sent: Saturday, August 19, 2023 12:30 AM
> To: Saurabh Singh Sengar <ssengar@microsoft.com>; linux-
> hyperv@vger.kernel.org; linux-kernel@vger.kernel.org; x86@kernel.org; linux-
> arm-kernel@lists.infradead.org; linux-arch@vger.kernel.org
> Cc: patches@lists.linux.dev; Michael Kelley (LINUX)
> <mikelley@microsoft.com>; KY Srinivasan <kys@microsoft.com>;
> wei.liu@kernel.org; Haiyang Zhang <haiyangz@microsoft.com>; Dexuan Cui
> <decui@microsoft.com>; apais@linux.microsoft.com; Tianyu Lan
> <Tianyu.Lan@microsoft.com>; ssengar@linux.microsoft.com; MUKESH
> RATHOR <mukeshrathor@microsoft.com>; stanislav.kinsburskiy@gmail.com;
> jinankjain@linux.microsoft.com; vkuznets <vkuznets@redhat.com>;
> tglx@linutronix.de; mingo@redhat.com; bp@alien8.de;
> dave.hansen@linux.intel.com; hpa@zytor.com; will@kernel.org;
> catalin.marinas@arm.com
> Subject: Re: [PATCH v2 15/15] Drivers: hv: Add modules to expose /dev/mshv
> to VMMs running on Hyper-V
> 
> On 8/18/2023 6:08 AM, Saurabh Singh Sengar wrote:
> >> +
> >> +config MSHV_VTL
> >> +	tristate "Microsoft Hyper-V VTL driver"
> >> +	depends on MSHV
> >> +	select HYPERV_VTL_MODE
> >> +	select TRANSPARENT_HUGEPAGE
> >
> > TRANSPARENT_HUGEPAGE can be avoided for now.
> >
> 
> I will remove it in the next version. Thanks.
> >> +
> >> +#define HV_GET_REGISTER_BATCH_SIZE	\
> >> +	(HV_HYP_PAGE_SIZE / sizeof(union hv_register_value))
> >> +#define HV_SET_REGISTER_BATCH_SIZE	\
> >> +	((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_set_vp_registers)) \
> >> +		/ sizeof(struct hv_register_assoc))
> >> +
> >> +int hv_call_get_vp_registers(
> >> +		u32 vp_index,
> >> +		u64 partition_id,
> >> +		u16 count,
> >> +		union hv_input_vtl input_vtl,
> >> +		struct hv_register_assoc *registers) {
> >> +	struct hv_input_get_vp_registers *input_page;
> >> +	union hv_register_value *output_page;
> >> +	u16 completed = 0;
> >> +	unsigned long remaining = count;
> >> +	int rep_count, i;
> >> +	u64 status;
> >> +	unsigned long flags;
> >> +
> >> +	local_irq_save(flags);
> >> +
> >> +	input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> +	output_page = *this_cpu_ptr(hyperv_pcpu_output_arg);
> >> +
> >> +	input_page->partition_id = partition_id;
> >> +	input_page->vp_index = vp_index;
> >> +	input_page->input_vtl.as_uint8 = input_vtl.as_uint8;
> >> +	input_page->rsvd_z8 = 0;
> >> +	input_page->rsvd_z16 = 0;
> >> +
> >> +	while (remaining) {
> >> +		rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);
> >> +		for (i = 0; i < rep_count; ++i)
> >> +			input_page->names[i] = registers[i].name;
> >> +
> >> +		status = hv_do_rep_hypercall(HVCALL_GET_VP_REGISTERS,
> >> rep_count,
> >> +					     0, input_page, output_page);
> >
> > Is there any possibility that count value is passed 0 by mistake ? In
> > that case status will remain uninitialized.
> >
> 
> These lines ensure rep_count is never 0 here:
> 
> 	while (remaining) {
> 		rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);
> 
> Remaining can't be 0 or the loop would exit, and
> HV_GET_REGISTER_BATCH_SIZE is not 0, or we would never get any registers.

There is a parameter in this function "count". I was checking if there is any possibility
that is passed as 0 by mistake ? this will lead to "remaining" value as 0 and loop will never
execute. Which results using "status" uninitialized later in the function.


> 
> >> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c index
> >> 13f972e72375..ccd76f30a638 100644
> >> --- a/drivers/hv/hv_common.c
> >> +++ b/drivers/hv/hv_common.c
> >> @@ -62,7 +62,11 @@ EXPORT_SYMBOL_GPL(hyperv_pcpu_output_arg);
> >>   */
> >>  static inline bool hv_output_arg_exists(void)  {
> >> +#ifdef CONFIG_MSHV_VTL
> >
> > Although today both the option works together. But thinking
> > which is more accurate CONFIG_HYPERV_VTL_MODE or
> > CONFIG_MSHV_VTL here for scalability of VTL modules.
> >
> 
> Good point. Though I'm not sure it matters too much right now,
> since as you mention they will always be enabled together.
> 
> Does CONFIG_HYPERV_VTL_MODE use the output arg?

Currently its not, I think MSHV_VTL is good for now. Thanks.

> 
> >> diff --git a/drivers/hv/mshv.h b/drivers/hv/mshv.h
> >> new file mode 100644
> >> index 000000000000..166480a73f3f
> >> --- /dev/null
> >> +++ b/drivers/hv/mshv.h
> >> @@ -0,0 +1,156 @@
> >> +/* SPDX-License-Identifier: GPL-2.0-only */
> >> +/*
> >> + * Copyright (c) 2023, Microsoft Corporation.
> >> + */
> >> +
> >> +#ifndef _MSHV_H_
> >> +#define _MSHV_H_
> >> +
> >> +#include <linux/spinlock.h>
> >> +#include <linux/mutex.h>
> >> +#include <linux/semaphore.h>
> >> +#include <linux/sched.h>
> >> +#include <linux/srcu.h>
> >> +#include <linux/wait.h>
> >> +#include <uapi/linux/mshv.h>
> >> +
> >> +/*
> >> + * Hyper-V hypercalls
> >> + */
> >> +
> >> +int hv_call_withdraw_memory(u64 count, int node, u64 partition_id);
> >> +int hv_call_create_partition(
> >> +		u64 flags,
> >> +		struct hv_partition_creation_properties creation_properties,
> >> +		union hv_partition_isolation_properties isolation_properties,
> >> +		u64 *partition_id);
> >> +int hv_call_initialize_partition(u64 partition_id);
> >> +int hv_call_finalize_partition(u64 partition_id);
> >> +int hv_call_delete_partition(u64 partition_id);
> >> +int hv_call_map_gpa_pages(
> >> +		u64 partition_id,
> >> +		u64 gpa_target,
> >> +		u64 page_count, u32 flags,
> >> +		struct page **pages);
> >> +int hv_call_unmap_gpa_pages(
> >> +		u64 partition_id,
> >> +		u64 gpa_target,
> >> +		u64 page_count, u32 flags);
> >> +int hv_call_get_vp_registers(
> >> +		u32 vp_index,
> >> +		u64 partition_id,
> >> +		u16 count,
> >> +		union hv_input_vtl input_vtl,
> >> +		struct hv_register_assoc *registers);
> >> +int hv_call_get_gpa_access_states(
> >> +		u64 partition_id,
> >> +		u32 count,
> >> +		u64 gpa_base_pfn,
> >> +		u64 state_flags,
> >> +		int *written_total,
> >> +		union hv_gpa_page_access_state *states);
> >> +
> >> +int hv_call_set_vp_registers(
> >> +		u32 vp_index,
> >> +		u64 partition_id,
> >> +		u16 count,
> >> +		union hv_input_vtl input_vtl,
> >> +		struct hv_register_assoc *registers);
> >
> > Nit: Opportunity to fix many of the checkpatch.pl related to line break here
> > and many other places.
> >
> 
> checkpatch.pl doesn't complain about anything in this file.

If we use --strict switch with checkpatch.pl we observe additional CHECK(s).
I observe 159 CHECK(s) with this patch overall.
(total: 1 errors, 7 warnings, 159 checks, 7460 lines checked)
Few examples:

CHECK: Lines should not end with a '('
#240: FILE: drivers/hv/hv_call.c:73:
+int hv_call_set_vp_registers(

CHECK: Alignment should match open parenthesis
#266: FILE: drivers/hv/hv_call.c:99:
+               memcpy(input_page->elements, registers,
+                       sizeof(struct hv_register_assoc) * rep_count);


I also see an error with flexible array, possibly we can fix that as well.

ERROR: Use C99 flexible arrays - see https://docs.kernel.org/process/deprecated.html#zero-length-and-one-element-arrays
#7468: FILE: include/uapi/linux/mshv.h:134:
+       struct mshv_msi_routing_entry entries[0];
+};

- Saurabh
Saurabh Singh Sengar Aug. 21, 2023, 6:18 p.m. UTC | #4
> -----Original Message-----
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> Sent: Friday, August 18, 2023 3:32 AM
> To: linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org;
> x86@kernel.org; linux-arm-kernel@lists.infradead.org; linux-
> arch@vger.kernel.org
> Cc: patches@lists.linux.dev; Michael Kelley (LINUX)
> <mikelley@microsoft.com>; KY Srinivasan <kys@microsoft.com>;
> wei.liu@kernel.org; Haiyang Zhang <haiyangz@microsoft.com>; Dexuan Cui
> <decui@microsoft.com>; apais@linux.microsoft.com; Tianyu Lan
> <Tianyu.Lan@microsoft.com>; ssengar@linux.microsoft.com; MUKESH
> RATHOR <mukeshrathor@microsoft.com>; stanislav.kinsburskiy@gmail.com;
> jinankjain@linux.microsoft.com; vkuznets <vkuznets@redhat.com>;
> tglx@linutronix.de; mingo@redhat.com; bp@alien8.de;
> dave.hansen@linux.intel.com; hpa@zytor.com; will@kernel.org;
> catalin.marinas@arm.com
> Subject: [PATCH v2 15/15] Drivers: hv: Add modules to expose /dev/mshv to
> VMMs running on Hyper-V
> 
> Add mshv, mshv_root, and mshv_vtl modules:
> 

<snip>

> +	ret = mshv_set_vp_registers(vp->index, vp->partition->id,
> +				    1, &dispatch_suspend);
> +	if (ret)
> +		pr_err("%s: failed to suspend partition %llu vp %u\n",
> +			__func__, vp->partition->id, vp->index);
> +
> +	return ret;
> +}
> +
> +static int
> +get_vp_signaled_count(struct mshv_vp *vp, u64 *count)
> +{
> +	int ret;
> +	struct hv_register_assoc root_signal_count = {
> +		.name = HV_REGISTER_VP_ROOT_SIGNAL_COUNT,
> +	};
> +
> +	ret = mshv_get_vp_registers(vp->index, vp->partition->id,
> +				    1, &root_signal_count);
> +
> +	if (ret) {
> +		pr_err("%s: failed to get root signal count for partition %llu vp
> %u",
> +			__func__, vp->partition->id, vp->index);
> +		*count = 0;

Have we missed a return here ?
Moreover, the return type of this function is never checked consider
checking it or change it to void.

<snip>

> +
> +/* Retrieve and stash the supported scheduler type */
> +static int __init mshv_retrieve_scheduler_type(void)
> +{
> +	struct hv_input_get_system_property *input;
> +	struct hv_output_get_system_property *output;
> +	unsigned long flags;
> +	u64 status;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> +	memset(input, 0, sizeof(*input));
> +	memset(output, 0, sizeof(*output));
> +	input->property_id = HV_SYSTEM_PROPERTY_SCHEDULER_TYPE;
> +
> +	status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input,
> output);
> +	if (!hv_result_success(status)) {
> +		local_irq_restore(flags);
> +		pr_err("%s: %s\n", __func__, hv_status_to_string(status));
> +		return hv_status_to_errno(status);
> +	}
> +
> +	hv_scheduler_type = output->scheduler_type;
> +	local_irq_restore(flags);
> +
> +	pr_info("mshv: hypervisor using %s\n",
> scheduler_type_to_string(hv_scheduler_type));

Nit: In this file we are using two styles of prints, few are appended with
"mshv:" and few with "__func__".  It's better to have a single style
for one module.
Nuno Das Neves Aug. 22, 2023, 8:18 p.m. UTC | #5
On 8/19/2023 10:19 PM, Saurabh Singh Sengar wrote:
> 
> 
>> -----Original Message-----
>> From: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> Sent: Saturday, August 19, 2023 12:30 AM
>> To: Saurabh Singh Sengar <ssengar@microsoft.com>; linux-
>> hyperv@vger.kernel.org; linux-kernel@vger.kernel.org; x86@kernel.org; linux-
>> arm-kernel@lists.infradead.org; linux-arch@vger.kernel.org
>> Cc: patches@lists.linux.dev; Michael Kelley (LINUX)
>> <mikelley@microsoft.com>; KY Srinivasan <kys@microsoft.com>;
>> wei.liu@kernel.org; Haiyang Zhang <haiyangz@microsoft.com>; Dexuan Cui
>> <decui@microsoft.com>; apais@linux.microsoft.com; Tianyu Lan
>> <Tianyu.Lan@microsoft.com>; ssengar@linux.microsoft.com; MUKESH
>> RATHOR <mukeshrathor@microsoft.com>; stanislav.kinsburskiy@gmail.com;
>> jinankjain@linux.microsoft.com; vkuznets <vkuznets@redhat.com>;
>> tglx@linutronix.de; mingo@redhat.com; bp@alien8.de;
>> dave.hansen@linux.intel.com; hpa@zytor.com; will@kernel.org;
>> catalin.marinas@arm.com
>> Subject: Re: [PATCH v2 15/15] Drivers: hv: Add modules to expose /dev/mshv
>> to VMMs running on Hyper-V
>>
>> On 8/18/2023 6:08 AM, Saurabh Singh Sengar wrote:
>>>> +
>>>> +config MSHV_VTL
>>>> +	tristate "Microsoft Hyper-V VTL driver"
>>>> +	depends on MSHV
>>>> +	select HYPERV_VTL_MODE
>>>> +	select TRANSPARENT_HUGEPAGE
>>>
>>> TRANSPARENT_HUGEPAGE can be avoided for now.
>>>
>>
>> I will remove it in the next version. Thanks.
>>>> +
>>>> +#define HV_GET_REGISTER_BATCH_SIZE	\
>>>> +	(HV_HYP_PAGE_SIZE / sizeof(union hv_register_value))
>>>> +#define HV_SET_REGISTER_BATCH_SIZE	\
>>>> +	((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_set_vp_registers)) \
>>>> +		/ sizeof(struct hv_register_assoc))
>>>> +
>>>> +int hv_call_get_vp_registers(
>>>> +		u32 vp_index,
>>>> +		u64 partition_id,
>>>> +		u16 count,
>>>> +		union hv_input_vtl input_vtl,
>>>> +		struct hv_register_assoc *registers) {
>>>> +	struct hv_input_get_vp_registers *input_page;
>>>> +	union hv_register_value *output_page;
>>>> +	u16 completed = 0;
>>>> +	unsigned long remaining = count;
>>>> +	int rep_count, i;
>>>> +	u64 status;
>>>> +	unsigned long flags;
>>>> +
>>>> +	local_irq_save(flags);
>>>> +
>>>> +	input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> +	output_page = *this_cpu_ptr(hyperv_pcpu_output_arg);
>>>> +
>>>> +	input_page->partition_id = partition_id;
>>>> +	input_page->vp_index = vp_index;
>>>> +	input_page->input_vtl.as_uint8 = input_vtl.as_uint8;
>>>> +	input_page->rsvd_z8 = 0;
>>>> +	input_page->rsvd_z16 = 0;
>>>> +
>>>> +	while (remaining) {
>>>> +		rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);
>>>> +		for (i = 0; i < rep_count; ++i)
>>>> +			input_page->names[i] = registers[i].name;
>>>> +
>>>> +		status = hv_do_rep_hypercall(HVCALL_GET_VP_REGISTERS,
>>>> rep_count,
>>>> +					     0, input_page, output_page);
>>>
>>> Is there any possibility that count value is passed 0 by mistake ? In
>>> that case status will remain uninitialized.
>>>
>>
>> These lines ensure rep_count is never 0 here:
>>
>> 	while (remaining) {
>> 		rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);
>>
>> Remaining can't be 0 or the loop would exit, and
>> HV_GET_REGISTER_BATCH_SIZE is not 0, or we would never get any registers.
> 
> There is a parameter in this function "count". I was checking if there is any possibility
> that is passed as 0 by mistake ? this will lead to "remaining" value as 0 and loop will never
> execute. Which results using "status" uninitialized later in the function.
> 
> 

Ah ok, thanks! I will change it to return immediately in case count is 0.
>>>> diff --git a/drivers/hv/mshv.h b/drivers/hv/mshv.h
>>>> new file mode 100644
>>>> index 000000000000..166480a73f3f
>>>> --- /dev/null
>>>> +++ b/drivers/hv/mshv.h
>>>> @@ -0,0 +1,156 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>>> +/*
>>>> + * Copyright (c) 2023, Microsoft Corporation.
>>>> + */
>>>> +
>>>> +#ifndef _MSHV_H_
>>>> +#define _MSHV_H_
>>>> +
>>>> +#include <linux/spinlock.h>
>>>> +#include <linux/mutex.h>
>>>> +#include <linux/semaphore.h>
>>>> +#include <linux/sched.h>
>>>> +#include <linux/srcu.h>
>>>> +#include <linux/wait.h>
>>>> +#include <uapi/linux/mshv.h>
>>>> +
>>>> +/*
>>>> + * Hyper-V hypercalls
>>>> + */
>>>> +
>>>> +int hv_call_withdraw_memory(u64 count, int node, u64 partition_id);
>>>> +int hv_call_create_partition(
>>>> +		u64 flags,
>>>> +		struct hv_partition_creation_properties creation_properties,
>>>> +		union hv_partition_isolation_properties isolation_properties,
>>>> +		u64 *partition_id);
>>>> +int hv_call_initialize_partition(u64 partition_id);
>>>> +int hv_call_finalize_partition(u64 partition_id);
>>>> +int hv_call_delete_partition(u64 partition_id);
>>>> +int hv_call_map_gpa_pages(
>>>> +		u64 partition_id,
>>>> +		u64 gpa_target,
>>>> +		u64 page_count, u32 flags,
>>>> +		struct page **pages);
>>>> +int hv_call_unmap_gpa_pages(
>>>> +		u64 partition_id,
>>>> +		u64 gpa_target,
>>>> +		u64 page_count, u32 flags);
>>>> +int hv_call_get_vp_registers(
>>>> +		u32 vp_index,
>>>> +		u64 partition_id,
>>>> +		u16 count,
>>>> +		union hv_input_vtl input_vtl,
>>>> +		struct hv_register_assoc *registers);
>>>> +int hv_call_get_gpa_access_states(
>>>> +		u64 partition_id,
>>>> +		u32 count,
>>>> +		u64 gpa_base_pfn,
>>>> +		u64 state_flags,
>>>> +		int *written_total,
>>>> +		union hv_gpa_page_access_state *states);
>>>> +
>>>> +int hv_call_set_vp_registers(
>>>> +		u32 vp_index,
>>>> +		u64 partition_id,
>>>> +		u16 count,
>>>> +		union hv_input_vtl input_vtl,
>>>> +		struct hv_register_assoc *registers);
>>>
>>> Nit: Opportunity to fix many of the checkpatch.pl related to line break here
>>> and many other places.
>>>
>>
>> checkpatch.pl doesn't complain about anything in this file.
> 
> If we use --strict switch with checkpatch.pl we observe additional CHECK(s).
> I observe 159 CHECK(s) with this patch overall.
> (total: 1 errors, 7 warnings, 159 checks, 7460 lines checked)
> Few examples:
> 
> CHECK: Lines should not end with a '('
> #240: FILE: drivers/hv/hv_call.c:73:
> +int hv_call_set_vp_registers(
> 
> CHECK: Alignment should match open parenthesis
> #266: FILE: drivers/hv/hv_call.c:99:
> +               memcpy(input_page->elements, registers,
> +                       sizeof(struct hv_register_assoc) * rep_count);
> 
> 

I didn't try --strict before. Thanks, I'll fix up the formatting.

> I also see an error with flexible array, possibly we can fix that as well.
> 
> ERROR: Use C99 flexible arrays - see https://docs.kernel.org/process/deprecated.html#zero-length-and-one-element-arrays
> #7468: FILE: include/uapi/linux/mshv.h:134:
> +       struct mshv_msi_routing_entry entries[0];
> +};
> 

Indeed, I'll fix it too.
Nuno Das Neves Aug. 22, 2023, 9 p.m. UTC | #6
On 8/21/2023 11:18 AM, Saurabh Singh Sengar wrote:
> 
> 
>> -----Original Message-----
>> From: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> Sent: Friday, August 18, 2023 3:32 AM
>> To: linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org;
>> x86@kernel.org; linux-arm-kernel@lists.infradead.org; linux-
>> arch@vger.kernel.org
>> Cc: patches@lists.linux.dev; Michael Kelley (LINUX)
>> <mikelley@microsoft.com>; KY Srinivasan <kys@microsoft.com>;
>> wei.liu@kernel.org; Haiyang Zhang <haiyangz@microsoft.com>; Dexuan Cui
>> <decui@microsoft.com>; apais@linux.microsoft.com; Tianyu Lan
>> <Tianyu.Lan@microsoft.com>; ssengar@linux.microsoft.com; MUKESH
>> RATHOR <mukeshrathor@microsoft.com>; stanislav.kinsburskiy@gmail.com;
>> jinankjain@linux.microsoft.com; vkuznets <vkuznets@redhat.com>;
>> tglx@linutronix.de; mingo@redhat.com; bp@alien8.de;
>> dave.hansen@linux.intel.com; hpa@zytor.com; will@kernel.org;
>> catalin.marinas@arm.com
>> Subject: [PATCH v2 15/15] Drivers: hv: Add modules to expose /dev/mshv to
>> VMMs running on Hyper-V
>>
>> Add mshv, mshv_root, and mshv_vtl modules:
>>
> 
> <snip>
> 
>> +	ret = mshv_set_vp_registers(vp->index, vp->partition->id,
>> +				    1, &dispatch_suspend);
>> +	if (ret)
>> +		pr_err("%s: failed to suspend partition %llu vp %u\n",
>> +			__func__, vp->partition->id, vp->index);
>> +
>> +	return ret;
>> +}
>> +
>> +static int
>> +get_vp_signaled_count(struct mshv_vp *vp, u64 *count)
>> +{
>> +	int ret;
>> +	struct hv_register_assoc root_signal_count = {
>> +		.name = HV_REGISTER_VP_ROOT_SIGNAL_COUNT,
>> +	};
>> +
>> +	ret = mshv_get_vp_registers(vp->index, vp->partition->id,
>> +				    1, &root_signal_count);
>> +
>> +	if (ret) {
>> +		pr_err("%s: failed to get root signal count for partition %llu vp
>> %u",
>> +			__func__, vp->partition->id, vp->index);
>> +		*count = 0;
> 
> Have we missed a return here ?
> Moreover, the return type of this function is never checked consider
> checking it or change it to void.
> 

Thanks, we do need to return here.

This function is called on a cleanup path (deleting the guest VM), so if it fails
something is wrong. Instead of returning void, I think we should check the return
value with WARN_ON(), and abort the cleanup if it failed.

>> +
>> +/* Retrieve and stash the supported scheduler type */
>> +static int __init mshv_retrieve_scheduler_type(void)
>> +{
>> +	struct hv_input_get_system_property *input;
>> +	struct hv_output_get_system_property *output;
>> +	unsigned long flags;
>> +	u64 status;
>> +
>> +	local_irq_save(flags);
>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> +	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
>> +
>> +	memset(input, 0, sizeof(*input));
>> +	memset(output, 0, sizeof(*output));
>> +	input->property_id = HV_SYSTEM_PROPERTY_SCHEDULER_TYPE;
>> +
>> +	status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input,
>> output);
>> +	if (!hv_result_success(status)) {
>> +		local_irq_restore(flags);
>> +		pr_err("%s: %s\n", __func__, hv_status_to_string(status));
>> +		return hv_status_to_errno(status);
>> +	}
>> +
>> +	hv_scheduler_type = output->scheduler_type;
>> +	local_irq_restore(flags);
>> +
>> +	pr_info("mshv: hypervisor using %s\n",
>> scheduler_type_to_string(hv_scheduler_type));
> 
> Nit: In this file we are using two styles of prints, few are appended with
> "mshv:" and few with "__func__".  It's better to have a single style
> for one module.

Thanks, I'll switch them all to __func__
Saurabh Singh Sengar Aug. 23, 2023, 7:40 a.m. UTC | #7
> -----Original Message-----
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> Sent: Wednesday, August 23, 2023 1:49 AM
> To: Saurabh Singh Sengar <ssengar@microsoft.com>; linux-
> hyperv@vger.kernel.org; linux-kernel@vger.kernel.org; x86@kernel.org; linux-
> arm-kernel@lists.infradead.org; linux-arch@vger.kernel.org
> Cc: patches@lists.linux.dev; Michael Kelley (LINUX)
> <mikelley@microsoft.com>; KY Srinivasan <kys@microsoft.com>;
> wei.liu@kernel.org; Haiyang Zhang <haiyangz@microsoft.com>; Dexuan Cui
> <decui@microsoft.com>; apais@linux.microsoft.com; Tianyu Lan
> <Tianyu.Lan@microsoft.com>; ssengar@linux.microsoft.com; MUKESH
> RATHOR <mukeshrathor@microsoft.com>; stanislav.kinsburskiy@gmail.com;
> jinankjain@linux.microsoft.com; vkuznets <vkuznets@redhat.com>;
> tglx@linutronix.de; mingo@redhat.com; bp@alien8.de;
> dave.hansen@linux.intel.com; hpa@zytor.com; will@kernel.org;
> catalin.marinas@arm.com
> Subject: Re: [PATCH v2 15/15] Drivers: hv: Add modules to expose /dev/mshv
> to VMMs running on Hyper-V
> 
> On 8/19/2023 10:19 PM, Saurabh Singh Sengar wrote:
> >
> >
> >> -----Original Message-----
> >> From: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> >> Sent: Saturday, August 19, 2023 12:30 AM
> >> To: Saurabh Singh Sengar <ssengar@microsoft.com>; linux-
> >> hyperv@vger.kernel.org; linux-kernel@vger.kernel.org; x86@kernel.org;
> >> linux- arm-kernel@lists.infradead.org; linux-arch@vger.kernel.org
> >> Cc: patches@lists.linux.dev; Michael Kelley (LINUX)
> >> <mikelley@microsoft.com>; KY Srinivasan <kys@microsoft.com>;
> >> wei.liu@kernel.org; Haiyang Zhang <haiyangz@microsoft.com>; Dexuan
> >> Cui <decui@microsoft.com>; apais@linux.microsoft.com; Tianyu Lan
> >> <Tianyu.Lan@microsoft.com>; ssengar@linux.microsoft.com; MUKESH
> >> RATHOR <mukeshrathor@microsoft.com>;
> stanislav.kinsburskiy@gmail.com;
> >> jinankjain@linux.microsoft.com; vkuznets <vkuznets@redhat.com>;
> >> tglx@linutronix.de; mingo@redhat.com; bp@alien8.de;
> >> dave.hansen@linux.intel.com; hpa@zytor.com; will@kernel.org;
> >> catalin.marinas@arm.com
> >> Subject: Re: [PATCH v2 15/15] Drivers: hv: Add modules to expose
> >> /dev/mshv to VMMs running on Hyper-V
> >>
> >> On 8/18/2023 6:08 AM, Saurabh Singh Sengar wrote:
> >>>> +
> >>>> +config MSHV_VTL
> >>>> +	tristate "Microsoft Hyper-V VTL driver"
> >>>> +	depends on MSHV
> >>>> +	select HYPERV_VTL_MODE
> >>>> +	select TRANSPARENT_HUGEPAGE
> >>>
> >>> TRANSPARENT_HUGEPAGE can be avoided for now.
> >>>
> >>
> >> I will remove it in the next version. Thanks.
> >>>> +
> >>>> +#define HV_GET_REGISTER_BATCH_SIZE	\
> >>>> +	(HV_HYP_PAGE_SIZE / sizeof(union hv_register_value))
> >>>> +#define HV_SET_REGISTER_BATCH_SIZE	\
> >>>> +	((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_set_vp_registers)) \
> >>>> +		/ sizeof(struct hv_register_assoc))
> >>>> +
> >>>> +int hv_call_get_vp_registers(
> >>>> +		u32 vp_index,
> >>>> +		u64 partition_id,
> >>>> +		u16 count,
> >>>> +		union hv_input_vtl input_vtl,
> >>>> +		struct hv_register_assoc *registers) {
> >>>> +	struct hv_input_get_vp_registers *input_page;
> >>>> +	union hv_register_value *output_page;
> >>>> +	u16 completed = 0;
> >>>> +	unsigned long remaining = count;
> >>>> +	int rep_count, i;
> >>>> +	u64 status;
> >>>> +	unsigned long flags;
> >>>> +
> >>>> +	local_irq_save(flags);
> >>>> +
> >>>> +	input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >>>> +	output_page = *this_cpu_ptr(hyperv_pcpu_output_arg);
> >>>> +
> >>>> +	input_page->partition_id = partition_id;
> >>>> +	input_page->vp_index = vp_index;
> >>>> +	input_page->input_vtl.as_uint8 = input_vtl.as_uint8;
> >>>> +	input_page->rsvd_z8 = 0;
> >>>> +	input_page->rsvd_z16 = 0;
> >>>> +
> >>>> +	while (remaining) {
> >>>> +		rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);
> >>>> +		for (i = 0; i < rep_count; ++i)
> >>>> +			input_page->names[i] = registers[i].name;
> >>>> +
> >>>> +		status = hv_do_rep_hypercall(HVCALL_GET_VP_REGISTERS,
> >>>> rep_count,
> >>>> +					     0, input_page, output_page);
> >>>
> >>> Is there any possibility that count value is passed 0 by mistake ?
> >>> In that case status will remain uninitialized.
> >>>
> >>
> >> These lines ensure rep_count is never 0 here:
> >>
> >> 	while (remaining) {
> >> 		rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);
> >>
> >> Remaining can't be 0 or the loop would exit, and
> >> HV_GET_REGISTER_BATCH_SIZE is not 0, or we would never get any
> registers.
> >
> > There is a parameter in this function "count". I was checking if there
> > is any possibility that is passed as 0 by mistake ? this will lead to
> > "remaining" value as 0 and loop will never execute. Which results using
> "status" uninitialized later in the function.
> >
> >
> 
> Ah ok, thanks! I will change it to return immediately in case count is 0.

Or you can initialize status with appropriate error value, either way is fine.
Please consider fixing the same issue in hv_call_set_vp_registers as well.

- Saurabh
Boqun Feng Aug. 24, 2023, 6:31 p.m. UTC | #8
On Thu, Aug 17, 2023 at 03:01:51PM -0700, Nuno Das Neves wrote:
[...]
> +static long
> +mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
> +{
> +	switch (ioctl) {
> +	case MSHV_CHECK_EXTENSION:
> +		return mshv_ioctl_check_extension((void __user *)arg);
> +	case MSHV_CREATE_PARTITION:
> +		return mshv.create_partition((void __user *)arg);
> +	case MSHV_CREATE_VTL:
> +		return mshv.create_vtl((void __user *)arg);
> +	}
> +

Shouldn't we also have a MSHV_GET_API_VERSION ioctl similar as KVM? So
that in the future when we change these ioctl interfaces or semantics,
we can bump up the API version to avoid breaking userspace?

Regards,
Boqun
Nuno Das Neves Aug. 25, 2023, 6:26 p.m. UTC | #9
On 8/23/2023 12:40 AM, Saurabh Singh Sengar wrote:
> 
> 
>> -----Original Message-----
>> From: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>> Sent: Wednesday, August 23, 2023 1:49 AM
>> To: Saurabh Singh Sengar <ssengar@microsoft.com>; linux-
>> hyperv@vger.kernel.org; linux-kernel@vger.kernel.org; x86@kernel.org; linux-
>> arm-kernel@lists.infradead.org; linux-arch@vger.kernel.org
>> Cc: patches@lists.linux.dev; Michael Kelley (LINUX)
>> <mikelley@microsoft.com>; KY Srinivasan <kys@microsoft.com>;
>> wei.liu@kernel.org; Haiyang Zhang <haiyangz@microsoft.com>; Dexuan Cui
>> <decui@microsoft.com>; apais@linux.microsoft.com; Tianyu Lan
>> <Tianyu.Lan@microsoft.com>; ssengar@linux.microsoft.com; MUKESH
>> RATHOR <mukeshrathor@microsoft.com>; stanislav.kinsburskiy@gmail.com;
>> jinankjain@linux.microsoft.com; vkuznets <vkuznets@redhat.com>;
>> tglx@linutronix.de; mingo@redhat.com; bp@alien8.de;
>> dave.hansen@linux.intel.com; hpa@zytor.com; will@kernel.org;
>> catalin.marinas@arm.com
>> Subject: Re: [PATCH v2 15/15] Drivers: hv: Add modules to expose /dev/mshv
>> to VMMs running on Hyper-V
>>
>> On 8/19/2023 10:19 PM, Saurabh Singh Sengar wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>>>> Sent: Saturday, August 19, 2023 12:30 AM
>>>> To: Saurabh Singh Sengar <ssengar@microsoft.com>; linux-
>>>> hyperv@vger.kernel.org; linux-kernel@vger.kernel.org; x86@kernel.org;
>>>> linux- arm-kernel@lists.infradead.org; linux-arch@vger.kernel.org
>>>> Cc: patches@lists.linux.dev; Michael Kelley (LINUX)
>>>> <mikelley@microsoft.com>; KY Srinivasan <kys@microsoft.com>;
>>>> wei.liu@kernel.org; Haiyang Zhang <haiyangz@microsoft.com>; Dexuan
>>>> Cui <decui@microsoft.com>; apais@linux.microsoft.com; Tianyu Lan
>>>> <Tianyu.Lan@microsoft.com>; ssengar@linux.microsoft.com; MUKESH
>>>> RATHOR <mukeshrathor@microsoft.com>;
>> stanislav.kinsburskiy@gmail.com;
>>>> jinankjain@linux.microsoft.com; vkuznets <vkuznets@redhat.com>;
>>>> tglx@linutronix.de; mingo@redhat.com; bp@alien8.de;
>>>> dave.hansen@linux.intel.com; hpa@zytor.com; will@kernel.org;
>>>> catalin.marinas@arm.com
>>>> Subject: Re: [PATCH v2 15/15] Drivers: hv: Add modules to expose
>>>> /dev/mshv to VMMs running on Hyper-V
>>>>
>>>> On 8/18/2023 6:08 AM, Saurabh Singh Sengar wrote:
>>>>>> +
>>>>>> +config MSHV_VTL
>>>>>> +	tristate "Microsoft Hyper-V VTL driver"
>>>>>> +	depends on MSHV
>>>>>> +	select HYPERV_VTL_MODE
>>>>>> +	select TRANSPARENT_HUGEPAGE
>>>>>
>>>>> TRANSPARENT_HUGEPAGE can be avoided for now.
>>>>>
>>>>
>>>> I will remove it in the next version. Thanks.
>>>>>> +
>>>>>> +#define HV_GET_REGISTER_BATCH_SIZE	\
>>>>>> +	(HV_HYP_PAGE_SIZE / sizeof(union hv_register_value))
>>>>>> +#define HV_SET_REGISTER_BATCH_SIZE	\
>>>>>> +	((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_set_vp_registers)) \
>>>>>> +		/ sizeof(struct hv_register_assoc))
>>>>>> +
>>>>>> +int hv_call_get_vp_registers(
>>>>>> +		u32 vp_index,
>>>>>> +		u64 partition_id,
>>>>>> +		u16 count,
>>>>>> +		union hv_input_vtl input_vtl,
>>>>>> +		struct hv_register_assoc *registers) {
>>>>>> +	struct hv_input_get_vp_registers *input_page;
>>>>>> +	union hv_register_value *output_page;
>>>>>> +	u16 completed = 0;
>>>>>> +	unsigned long remaining = count;
>>>>>> +	int rep_count, i;
>>>>>> +	u64 status;
>>>>>> +	unsigned long flags;
>>>>>> +
>>>>>> +	local_irq_save(flags);
>>>>>> +
>>>>>> +	input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>>>> +	output_page = *this_cpu_ptr(hyperv_pcpu_output_arg);
>>>>>> +
>>>>>> +	input_page->partition_id = partition_id;
>>>>>> +	input_page->vp_index = vp_index;
>>>>>> +	input_page->input_vtl.as_uint8 = input_vtl.as_uint8;
>>>>>> +	input_page->rsvd_z8 = 0;
>>>>>> +	input_page->rsvd_z16 = 0;
>>>>>> +
>>>>>> +	while (remaining) {
>>>>>> +		rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);
>>>>>> +		for (i = 0; i < rep_count; ++i)
>>>>>> +			input_page->names[i] = registers[i].name;
>>>>>> +
>>>>>> +		status = hv_do_rep_hypercall(HVCALL_GET_VP_REGISTERS,
>>>>>> rep_count,
>>>>>> +					     0, input_page, output_page);
>>>>>
>>>>> Is there any possibility that count value is passed 0 by mistake ?
>>>>> In that case status will remain uninitialized.
>>>>>
>>>>
>>>> These lines ensure rep_count is never 0 here:
>>>>
>>>> 	while (remaining) {
>>>> 		rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);
>>>>
>>>> Remaining can't be 0 or the loop would exit, and
>>>> HV_GET_REGISTER_BATCH_SIZE is not 0, or we would never get any
>> registers.
>>>
>>> There is a parameter in this function "count". I was checking if there
>>> is any possibility that is passed as 0 by mistake ? this will lead to
>>> "remaining" value as 0 and loop will never execute. Which results using
>> "status" uninitialized later in the function.
>>>
>>>
>>
>> Ah ok, thanks! I will change it to return immediately in case count is 0.
> 
> Or you can initialize status with appropriate error value, either way is fine.
> Please consider fixing the same issue in hv_call_set_vp_registers as well.
> 

Thanks again - noted.
Nuno Das Neves Aug. 25, 2023, 6:41 p.m. UTC | #10
On 8/24/2023 11:31 AM, Boqun Feng wrote:
> On Thu, Aug 17, 2023 at 03:01:51PM -0700, Nuno Das Neves wrote:
> [...]
>> +static long
>> +mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
>> +{
>> +	switch (ioctl) {
>> +	case MSHV_CHECK_EXTENSION:
>> +		return mshv_ioctl_check_extension((void __user *)arg);
>> +	case MSHV_CREATE_PARTITION:
>> +		return mshv.create_partition((void __user *)arg);
>> +	case MSHV_CREATE_VTL:
>> +		return mshv.create_vtl((void __user *)arg);
>> +	}
>> +
> 
> Shouldn't we also have a MSHV_GET_API_VERSION ioctl similar as KVM? So
> that in the future when we change these ioctl interfaces or semantics,
> we can bump up the API version to avoid breaking userspace?
> 

Note that we contribute and maintain support for this driver in
Cloud Hypervisor, so we control both sides of this API - if we break
userspace we can fix it ourselves.

For now the MSHV_CHECK_EXTENSION ioctl is sufficient - we can pass it
MSHV_CAP_CORE_API_STABLE, and it will return 0 to indicate that the API
is not yet stable.

A version check may be useful in the future, but is not needed right now.

Thanks
Nuno Das Neves
Boqun Feng Aug. 25, 2023, 8:15 p.m. UTC | #11
On Fri, Aug 25, 2023 at 11:41:35AM -0700, Nuno Das Neves wrote:
> On 8/24/2023 11:31 AM, Boqun Feng wrote:
> > On Thu, Aug 17, 2023 at 03:01:51PM -0700, Nuno Das Neves wrote:
> > [...]
> >> +static long
> >> +mshv_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
> >> +{
> >> +	switch (ioctl) {
> >> +	case MSHV_CHECK_EXTENSION:
> >> +		return mshv_ioctl_check_extension((void __user *)arg);
> >> +	case MSHV_CREATE_PARTITION:
> >> +		return mshv.create_partition((void __user *)arg);
> >> +	case MSHV_CREATE_VTL:
> >> +		return mshv.create_vtl((void __user *)arg);
> >> +	}
> >> +
> > 
> > Shouldn't we also have a MSHV_GET_API_VERSION ioctl similar as KVM? So
> > that in the future when we change these ioctl interfaces or semantics,
> > we can bump up the API version to avoid breaking userspace?
> > 
> 
> Note that we contribute and maintain support for this driver in
> Cloud Hypervisor, so we control both sides of this API - if we break
> userspace we can fix it ourselves.
> 

This actually doesn't always work, e.g. old Clould Hypervisor + new
kernel, but..

> For now the MSHV_CHECK_EXTENSION ioctl is sufficient - we can pass it
> MSHV_CAP_CORE_API_STABLE, and it will return 0 to indicate that the API
> is not yet stable.
> 

.. I missed that we are using this to report API is not stable, so I
agree, the API version is not needed for now.

> A version check may be useful in the future, but is not needed right now.
> 

Thanks for the explanation.

Regards,
Boqun

> Thanks
> Nuno Das Neves