diff mbox

[RFC,v2,1/1] kvm: Add documentation and ABI/API header for VM introspection

Message ID 20170707143416.11195-2-alazar@bitdefender.com (mailing list archive)
State New, archived
Headers show

Commit Message

Adalbert Lazăr July 7, 2017, 2:34 p.m. UTC
Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazar <alazar@bitdefender.com>
---
 Documentation/virtual/kvm/kvmi.rst | 985 +++++++++++++++++++++++++++++++++++++
 include/uapi/linux/kvmi.h          | 310 ++++++++++++
 2 files changed, 1295 insertions(+)
 create mode 100644 Documentation/virtual/kvm/kvmi.rst
 create mode 100644 include/uapi/linux/kvmi.h

Comments

Paolo Bonzini July 7, 2017, 4:52 p.m. UTC | #1
I have a lot of comments, but it's a very good start.  Thanks for
writing down the specification.

On 07/07/2017 16:34, Adalbert Lazar wrote:
> +1. KVMI_GET_VERSION
> +-------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: {}
> +:Returns: ↴

What is this character? :)

> +::
> +
> +	struct kvmi_get_version_reply {
> +		__s32 err;
> +		__u32 version;
> +	};
> +
> +Returns the introspection API version (the KVMI_VERSION constant) and the
> +error code (zero). In case of an unlikely error, the version will have an
> +undefined value.

Would it make sense to return a few more information fields, for example
the set of supported KVMI_CONTROL_EVENTS events?

> +2. KVMI_GET_GUEST_INFO
> +----------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: {}
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_get_guest_info_reply {
> +		__s32 err;
> +		__u16 vcpu_count;
> +		__u16 padding;
> +		__u64 tsc_speed;
> +	};
> +
> +Returns the number of online vcpus, and the TSC frequency in HZ, if supported
> +by the architecture (otherwise is 0).

Can the TSC frequency be specified only if the guest is using TSC
scaling?  Defining the TSC frequency on older hosts is a bit tricky.  0
would be the host.

Maybe define the second padding to be

	__u16 arch;

(0 = x86) followed by an arch-specific payload.

> +3. KVMI_PAUSE_GUEST
> +-------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: {}
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +This command will pause all vcpus threads, by getting them out of guest mode
> +and put them in the "waiting introspection commands" state.

Pausing all threads synchronously is a tricky concept.  The main issue
is that the guest could be paused at the userspace level.  Then KVM
would not be running (userspace is stuck in pthread_cond_wait,
typically) and could not acknowledge the pause.

Worse, KVM is not able to distinguish userspace that has paused the VM
from userspace that is doing MMIO or userspace that has a bug and hung
somewhere.  And even worse, there are cases where userspace wants to
modify registers while doing port I/O (the awful VMware RPCI port).  So
I'd rather avoid this.

> +4. KVMI_UNPAUSE_GUEST
> +---------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: {}
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +Resume the vcpu threads, or at least get them out of "waiting introspection
> +commands" state.

And this as well, of course.

> +5. KVMI_SHUTDOWN_GUEST
> +----------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: {}
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +Ungracefully shutdown the guest.

Note that all KVM can do here is pass down a request asynchronously to
userspace, failing further KVM_RUN ioctls.  I'm suggesting below an
alternative action.

> +6. KVMI_GET_REGISTERS
> +---------------------
> +
> +:Architectures: x86 (could be all, but with different input/output)
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_get_registers_x86 {
> +		__u16 vcpu;
> +		__u16 nmsrs;
> +		__u32 msrs_idx[0];
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_get_registers_x86_reply {
> +		__s32 err;
> +		__u32 mode;
> +		struct kvm_regs  regs;
> +		struct kvm_sregs sregs;
> +		struct kvm_msrs  msrs;
> +	};
> +
> +For the given vcpu_id and the nmsrs sized array of MSRs registers, returns
> +the vCPU mode (in bytes: 2, 4 or 8), the general purpose registers,
> +the special registers and the requested set of MSR-s.
> +
> +7. KVMI_SET_REGISTERS
> +---------------------
> +
> +:Architectures: x86 (could be all, but with different input)
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_set_registers_x86 {
> +		__u16 vcpu;
> +		__u16 padding[3];
> +		struct kvm_regs regs;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +Sets the general purpose registers for the given vcpu_id.
> +
> +8. KVMI_GET_MTRR_TYPE
> +---------------------
> +
> +:Architectures: x86
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_mtrr_type {
> +		__u64 gpa;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_mtrr_type_reply {
> +		__s32 err;
> +		__u32 type;
> +	};
> +
> +Returns the guest memory type for a specific physical address.

What is this used for?  KVM ignores the guest MTRRs, so if possible I'd
rather avoid it.

> +9. KVMI_GET_MTRRS
> +-----------------
> +
> +:Architectures: x86
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_mtrrs {
> +		__u16 vcpu;
> +		__u16 padding[3];
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_mtrrs_reply {
> +		__s32 err;
> +		__u32 padding;
> +		__u64 pat;
> +		__u64 cap;
> +		__u64 type;
> +	};
> +
> +Returns MSR_IA32_CR_PAT, MSR_MTRRcap and MSR_MTRRdefType for the specified
> +vCPU.

This could use KVM_GET_REGISTERS, couldn't it?

> +10. KVMI_GET_XSAVE_INFO
> +-----------------------
> +
> +:Architectures: x86
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_xsave_info {
> +		__u16 vcpu;
> +		__u16 padding[3];
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_xsave_info_reply {
> +		__s32 err;
> +		__u32 size;
> +	};
> +
> +Returns the xstate size for the specified vCPU.

Could this be replaced by a generic CPUID accessor?

> +11. KVMI_GET_PAGE_ACCESS
> +------------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_get_page_access {
> +		__u16 vcpu;
> +		__u16 padding[3];
> +		__u64 gpa;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_get_page_access_reply {
> +		__s32 err;
> +		__u32 access;
> +	};
> +
> +Returns the spte flags (rwx - present, write & user) for the specified
> +vCPU and guest physical address.
> +
> +12. KVMI_SET_PAGE_ACCESS
> +------------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_set_page_access {
> +		__u16 vcpu;
> +		__u16 padding;
> +		__u32 access;
> +		__u64 gpa;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +Sets the spte flags (rwx - present, write & user) - access - for the specified
> +vCPU and guest physical address.

rwx or pwu?  I suppose RWX.  Maybe #define the constants in the
documentation.

Also, it should be noted here that the spte flags are ANDed with
whatever is provided by userspace, because there could be readonly
memslots, and that some access combinations could fail (--x) or will
surely fail (-wx for example).

> +13. KVMI_INJECT_PAGE_FAULT
> +--------------------------
> +
> +:Architectures: x86
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_page_fault {
> +		__u16 vcpu;
> +		__u16 padding;
> +		__u32 error;

error_code, I guess?  Why not a generic inject exception message?

> +		__u64 gva;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +Injects a vCPU page fault with the specified guest virtual address and
> +error code.
> +
> +14. KVMI_INJECT_BREAKPOINT
> +--------------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_inject_breakpoint {
> +		__u16 vcpu;
> +		__u16 padding[3];
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +Injects a breakpoint for the specified vCPU. This command is usually sent in
> +response to an event and as such the proper GPR-s will be set with the reply.

What is a "breakpoint" in this context?

> +15. KVMI_MAP_PHYSICAL_PAGE_TO_GUEST
> +-----------------------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_map_physical_page_to_guest {
> +		__u64 gpa_src;
> +		__u64 gfn_dest;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +Maps a page from an introspected guest memory (gpa_src) to the guest running
> +the introspection tool. 'gfn_dest' points to an anonymous, locked mapping one
> +page in size.
> +
> +This command is used to "read" the introspected guest memory and potentially
> +place patches (eg. INT3-s).

Two problems here:

1) The gpa->hva mapping can change as the guest runs.  I'd prefer a
read/write API to map/unmap, where read replies and events provide a
"cookie" (generation count).  Write (including breakpoints?  it's not
clear to me who physically writes the 0xCC) then return an error if the
incoming generation count doesn't match the current one.

2) There can be multiple address spaces, notably SMM and non-SMM.

> +17. KVMI_CONTROL_EVENTS
> +-----------------------
> +
> +:Architectures: all
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_control_events {
> +		__u16 vcpu;
> +		__u16 padding;
> +		__u32 events;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +Enables/disables vCPU introspection events, by setting/clearing one or more
> +of the following bits (see 'Events' below) :
> +
> +	KVMI_EVENT_CR
> +	KVMI_EVENT_MSR
> +	KVMI_EVENT_XSETBV
> +	KVMI_EVENT_BREAKPOINT
> +	KVMI_EVENT_USER_CALL
> +	KVMI_EVENT_PAGE_FAULT
> +	KVMI_EVENT_TRAP
> +
> +Trying to enable unsupported events (~KVMI_KNOWN_EVENTS) by the current
> +architecture would fail and -EINVAL will be returned.

I would prefer the interface to allow enable (set bits), disable (clear
bits) in addition to set (what you have here) the events.

The return value could indicate the set of enabled events.


> +19. KVMI_MSR_CONTROL
> +--------------------
> +
> +:Architectures: x86
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_msr_control {
> +		__u8 enable;
> +		__u8 padding[3];
> +		__u32 msr;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_error_code {
> +		__s32 err;
> +		__u32 padding;
> +	};
> +
> +Enables/disables introspection for a specific MSR, and must be used
> +in addition to KVMI_CONTROL_EVENTS with the KVMI_EVENT_MSR bit flag set.
> +
> +Currently, only MSRs within the following 3 ranges are supported. Trying
> +to control any other register will return -EINVAL. ::
> +
> +	0          ... 0x00001fff
> +	0x40000000 ... 0x40001fff
> +	0xc0000000 ... 0xc0001fff

Note that KVM custom MSRs fall in the range 0x4b564d00-0x4b564dff.

> +Events
> +------
> +
> +All vcpu events are sent using the KVMI_EVENT_VCPU message id. No event will
> +be sent unless enabled with a KVMI_CONTROL_EVENTS command.
> +
> +For x86, the message data starts with a common structure::
> +
> +	struct kvmi_event_x86 {
> +		__u16 vcpu;
> +		__u8 mode;
> +		__u8 padding1;
> +		__u32 event;
> +		struct kvm_regs regs;
> +		struct kvm_sregs sregs;
> +		struct {
> +			__u64 sysenter_cs;
> +			__u64 sysenter_esp;
> +			__u64 sysenter_eip;
> +			__u64 efer;
> +			__u64 star;
> +			__u64 lstar;

cstar too, for AMD CPUs.

> +		} msrs;
> +	};
> +
> +In order to help the introspection tool with the event analysis while
> +avoiding unnecessary introspection commands, the message data holds some
> +registers (kvm_regs, kvm_sregs and a couple of MSR-s) beside
> +the vCPU id, its mode (in bytes) and the event (one of the flags set
> +with the KVMI_CONTROL_EVENTS command).
> +
> +The replies to events also start with a common structure, having the
> +KVMI_EVENT_VCPU_REPLY message id::
> +
> +	struct kvmi_event_x86_reply {
> +		struct kvm_regs regs;

Put regs last?

> +		__u32 actions;
> +		__u32 padding;
> +	};
> +
> +The 'actions' member holds one or more flags. For example, if
> +KVMI_EVENT_ACTION_SET_REGS is set, the general purpose registers will
> +be overwritten with the new values (regs) from introspector.
> +
> +Specific data can follow these common structures.
> +
> +1. KVMI_EVENT_CR
> +----------------
> +
> +:Architectures: x86
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_event_x86;
> +	struct kvmi_event_cr {
> +		__u16 cr;
> +		__u16 padding[3];
> +		__u64 old_value;
> +		__u64 new_value;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_event_x86_reply;
> +	struct kvmi_event_cr_reply {
> +		__u64 new_val;
> +	};
> +
> +This event is sent when a CR register was modified and the introspection
> +has already been enabled for this kind of event (KVMI_CONTROL_EVENTS)
> +and for this specific register (KVMI_CR_CONTROL).
> +
> +kvmi_event_x86, the CR number, the old value and the new value are
> +sent to the introspector, which can respond with one or more action flags:
> +
> +   KVMI_EVENT_ACTION_SET_REGS - override the general purpose registers
> +   using the values from introspector (regs)

I think the action should be KVMI_EVENT_ACTION_SKIP.  Setting registers
can be done separately with a command.

> +   KVMI_EVENT_ACTION_ALLOW - allow the register modification with the
> +   value from introspector (new_val), otherwise deny the modification
> +   but allow the guest to proceed as if the register has been loaded
> +   with the desired value.

Maybe add an extra action for exit to userspace (which would presumably
shutdown the guest)?  This would apply to all events.

Also, a retry action can be useful in case e.g. the write cookie becomes
stale.

> +5. KVMI_EVENT_USER_CALL
> +-----------------------

Please rename this to KVMI_EVENT_HCALL or HYPERCALL or VMCALL.

> +
> +:Architectures: x86
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_event_x86;
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_event_x86_reply;
> +
> +This event is sent on a user hypercall and the introspection has already
> +already been enabled for this kind of event (KVMI_CONTROL_EVENTS).
> +
> +kvmi_event_x86 is sent to the introspector, which can respond with the
> +KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing the host
> +kernel to override the general purpose registers using the values from
> +introspector (regs).

Does KVMI_EVENT_ACTION_SET_REGS bypass the hypercall?  Why is
KVMI_EVENT_ACTION_ALLOW not allowed?  As before, I'd prefer
SKIP/RETRY/ALLOW/CRASH as the actions.

> +6. KVMI_EVENT_PAGE_FAULT
> +------------------------
> +
> +:Architectures: x86
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_event_x86;
> +	struct kvmi_event_page_fault {
> +		__u64 gva;
> +		__u64 gpa;
> +		__u32 mode;
> +		__u32 padding;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_event_x86_reply;
> +	struct kvmi_event_page_fault_reply {
> +		__u32 ctx_size;
> +		__u8 ctx_data[256];
> +	};
> +
> +This event is sent if a hypervisor page fault was encountered, the
> +introspection has already enabled the reports for this kind of event
> +(KVMI_CONTROL_EVENTS), and it was generated for a page for which the
> +introspector has shown interest (ie. has previously touched it by
> +adjusting the permissions).

If the introspector sets permissions to r-- and you get a read page
fault because the page is swapped out, should the introspector get the
event?

> +kvmi_event_x86, guest virtual address, guest physical address and
> +the exit qualification (mode) are sent to the introspector, which
> +can respond with one or more action flags:
> +
> +   KVMI_EVENT_ACTION_SET_REGS - override the general purpose registers
> +   using the values from introspector (regs)
> +
> +   (KVMI_EVENT_ALLOW | KVMI_EVENT_NOEMU) - let the guest re-trigger
> +   the page fault

"re-execute the instruction"?  Maybe call it KVMI_EVENT_RETRY?

> +   (KVMI_EVENT_ALLOW | KVMI_EVENT_SET_CTX) - allow the page fault
> +   via emulation but with custom input (ctx_data, ctx_size). This is
> +   used to trick the guest software into believing it has read
> +   certain data. In practice it is used to hide the contents of certain
> +   memory areas

This is tricky.  The instruction could be, for example, a string read.
I think there should be a separate event for I/O from a trapping page
(so perhaps return KVMI_EVENT_ALLOW | KVMI_EVENT_TRAP_ACCESS).

> +   KVMI_EVENT_ALLOW - allow the page fault via emulation
>
> +If KVMI_EVENT_ALLOW is not set, it will fall back to the page fault handler
> +which usually implies overwriting any spte page access changes made before.
> +An introspection tool will always set this flag and prevent unwanted changes
> +to memory by skipping the instruction. It is up to the tool to adjust the
> +program counter in order to achieve this result.

I don't like this.  It should be what KVMI_EVENT_RETRY is for.  But
maybe I'm missing the difference between "0" and KVMI_EVENT_RETRY.

Here, the actions would be SKIP/RETRY/ALLOW/CRASH, with an optional
KVMI_EVENT_TRAP_ACCESS flag for "allow".


> +7. KVMI_EVENT_TRAP
> +------------------
> +
> +:Architectures: x86
> +:Versions: >= 1
> +:Parameters: ↴
> +
> +::
> +
> +	struct kvmi_event_x86;
> +	struct kvmi_event_trap {
> +		__u32 vector;
> +		__u32 type;
> +		__u32 err;

error_code?

> +		__u32 padding;
> +		__u64 cr2;
> +	};
> +
> +:Returns: ↴
> +
> +::
> +
> +	struct kvmi_event_x86_reply;
> +
> +This event is sent if a trap will be delivered to the guest (page fault,
> +breakpoint, etc.) and the introspection has already enabled the reports
> +for this kind of event (KVMI_CONTROL_EVENTS).
> +
> +This is used to inform the introspector of all pending traps giving it
> +a chance to determine if it should try again later in case a previous
> +KVMI_INJECT_PAGE_FAULT/KVMI_INJECT_BREAKPOINT command has been overwritten
> +by an interrupt picked up during guest reentry.
> +
> +kvmi_event_x86, exception/interrupt number (vector), exception/interrupt
> +type, exception code (err) and CR2 are sent to the introspector, which can
> +respond with the KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing
> +the host kernel to override the general purpose registers using the values
> +from introspector (regs).

Here I think the actions could be RETRY/ALLOW/CRASH only, with "set
regs" done as a separate command.

Some x86 exceptions are faults, others are traps.  Traps should not
allow RETRY as an action.  There should be an input telling the
introspector if retrying is allowed.

How are dr6/dr7 handled around breakpoints?

Thanks,

Paolo

> diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
> new file mode 100644
> index 000000000000..54a2d8ebf649
> --- /dev/null
> +++ b/include/uapi/linux/kvmi.h
> @@ -0,0 +1,310 @@
> +/*
> + * Copyright (C) 2017 Bitdefender S.R.L.
> + *
> + * The KVMI Library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the named License, or any later version.
> + *
> + * The KVMI Library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with the KVMI Library; if not, see <http://www.gnu.org/licenses/>
> + */
> +#ifndef __KVMI_H_INCLUDED__
> +#define __KVMI_H_INCLUDED__
> +
> +#include "asm/kvm.h"
> +#include <linux/types.h>
> +
> +#define KVMI_VERSION 0x00000001
> +
> +#define KVMI_EVENT_CR         (1 << 1)	/* control register was modified */
> +#define KVMI_EVENT_MSR        (1 << 2)	/* model specific reg. was modified */
> +#define KVMI_EVENT_XSETBV     (1 << 3)	/* ext. control register was modified */
> +#define KVMI_EVENT_BREAKPOINT (1 << 4)	/* breakpoint was reached */
> +#define KVMI_EVENT_USER_CALL  (1 << 5)	/* user hypercall */
> +#define KVMI_EVENT_PAGE_FAULT (1 << 6)	/* hyp. page fault was encountered */
> +#define KVMI_EVENT_TRAP       (1 << 7)	/* trap was injected */
> +
> +#define KVMI_KNOWN_EVENTS (KVMI_EVENT_CR | \
> +			   KVMI_EVENT_MSR | \
> +			   KVMI_EVENT_XSETBV | \
> +			   KVMI_EVENT_BREAKPOINT | \
> +			   KVMI_EVENT_USER_CALL | \
> +			   KVMI_EVENT_PAGE_FAULT | \
> +			   KVMI_EVENT_TRAP)
> +
> +#define KVMI_EVENT_ACTION_ALLOW      (1 << 0)	/* used in replies */
> +#define KVMI_EVENT_ACTION_SET_REGS   (1 << 1)	/* registers need to be written back */
> +#define KVMI_EVENT_ACTION_SET_CTX    (1 << 2)	/* set the emulation context */
> +#define KVMI_EVENT_ACTION_NOEMU      (1 << 3)	/* return to guest without emulation */
> +
> +#define KVMI_GET_VERSION                    1
> +#define KVMI_GET_GUESTS                     2 /* TODO: remove me */
> +#define KVMI_GET_GUEST_INFO                 3
> +#define KVMI_PAUSE_GUEST                    4
> +#define KVMI_UNPAUSE_GUEST                  5
> +#define KVMI_GET_REGISTERS                  6
> +#define KVMI_SET_REGISTERS                  7
> +#define KVMI_SHUTDOWN_GUEST                 8
> +#define KVMI_GET_MTRR_TYPE                  9
> +#define KVMI_GET_MTRRS                      10
> +#define KVMI_GET_XSAVE_INFO                 11
> +#define KVMI_GET_PAGE_ACCESS                12
> +#define KVMI_SET_PAGE_ACCESS                13
> +#define KVMI_INJECT_PAGE_FAULT              14
> +#define KVMI_READ_PHYSICAL                  15 /* TODO: remove me */
> +#define KVMI_WRITE_PHYSICAL                 16 /* TODO: remove me */
> +#define KVMI_MAP_PHYSICAL_PAGE_TO_GUEST     17
> +#define KVMI_UNMAP_PHYSICAL_PAGE_FROM_GUEST 18
> +#define KVMI_CONTROL_EVENTS                 19
> +#define KVMI_CR_CONTROL                     20
> +#define KVMI_MSR_CONTROL                    21
> +#define KVMI_INJECT_BREAKPOINT              22
> +#define KVMI_EVENT_GUEST_ON                 23 /* TODO: remove me */
> +#define KVMI_EVENT_GUEST_OFF                24 /* TODO: remove me */
> +#define KVMI_EVENT_VCPU                     25
> +#define KVMI_EVENT_VCPU_REPLY               26

Errnos values are not portable, I'd prefer to have them defined
explicitly in the header.

Paolo

> +/* TODO: remove me */
> +struct kvmi_guest {
> +	__u8 uuid[16];
> +};
> +
> +/* TODO: remove me */
> +struct kvmi_guests {
> +	__u32 size;		/* in: the size of the entire structure */
> +	struct kvmi_guest guests[1];
> +};
> +
> +/* TODO: remove me */
> +struct kvmi_read_physical {
> +	__u64 gpa;
> +	__u64 size;
> +};
> +
> +/* TODO: remove me */
> +struct kvmi_read_physical_reply {
> +	__s32 err;
> +	__u8 bytes[0];
> +};
> +
> +/* TODO: remove me */
> +struct kvmi_write_physical {
> +	__u64 gpa;
> +	__u64 size;
> +	__u8 bytes[0];
> +};
> +
> +
> +struct kvmi_socket_hdr {
> +	__u16 msg_id;
> +	__u16 size;
> +	__u32 seq;
> +};
> +
> +struct kvmi_error_code {
> +	__s32 err;
> +	__u32 padding;
> +};
> +
> +struct kvmi_get_version_reply {
> +	__s32 err;
> +	__u32 version;
> +};
> +
> +struct kvmi_get_guest_info_reply {
> +	__s32 err;
> +	__u16 vcpu_count;
> +	__u16 padding;
> +	__u64 tsc_speed;
> +};
> +
> +struct kvmi_get_registers_x86 {
> +	__u16 vcpu;
> +	__u16 nmsrs;
> +	__u32 msrs_idx[0];
> +};
> +
> +struct kvmi_get_registers_x86_reply {
> +	__s32 err;
> +	__u32 mode;
> +	struct kvm_regs regs;
> +	struct kvm_sregs sregs;
> +	struct kvm_msrs msrs;
> +};
> +
> +struct kvmi_set_registers_x86 {
> +	__u16 vcpu;
> +	__u16 padding[3];
> +	struct kvm_regs regs;
> +};
> +
> +struct kvmi_mtrr_type {
> +	__u64 gpa;
> +};
> +
> +struct kvmi_mtrr_type_reply {
> +	__s32 err;
> +	__u32 padding;
> +	__u64 type;
> +};
> +
> +struct kvmi_mtrrs {
> +	__u16 vcpu;
> +	__u16 padding[3];
> +};
> +
> +struct kvmi_mtrrs_reply {
> +	__s32 err;
> +	__u32 padding;
> +	__u64 pat;
> +	__u64 cap;
> +	__u64 type;
> +};
> +
> +struct kvmi_xsave_info {
> +	__u16 vcpu;
> +	__u16 padding[3];
> +};
> +
> +struct kvmi_xsave_info_reply {
> +	__s32 err;
> +	__u32 size;
> +};
> +
> +struct kvmi_get_page_access {
> +	__u16 vcpu;
> +	__u16 padding[3];
> +	__u64 gpa;
> +};
> +
> +struct kvmi_get_page_access_reply {
> +	__s32 err;
> +	__u32 access;
> +};
> +
> +struct kvmi_set_page_access {
> +	__u16 vcpu;
> +	__u16 padding;
> +	__u32 access;
> +	__u64 gpa;
> +};
> +
> +struct kvmi_page_fault {
> +	__u16 vcpu;
> +	__u16 padding;
> +	__u32 error;
> +	__u64 gva;
> +};
> +
> +struct kvmi_inject_breakpoint {
> +	__u16 vcpu;
> +	__u16 padding[3];
> +};
> +
> +struct kvmi_map_physical_page_to_guest {
> +	__u64 gpa_src;
> +	__u64 gfn_dest;
> +};
> +
> +struct kvmi_unmap_physical_page_from_guest {
> +	__u64 gfn_dest;
> +};
> +
> +struct kvmi_control_events {
> +	__u16 vcpu;
> +	__u16 padding;
> +	__u32 events;
> +};
> +
> +struct kvmi_cr_control {
> +	__u8 enable;
> +	__u8 padding[3];
> +	__u32 cr;
> +};
> +
> +struct kvmi_msr_control {
> +	__u8 enable;
> +	__u8 padding[3];
> +	__u32 msr;
> +};
> +
> +struct kvmi_event_x86 {
> +	__u16 vcpu;
> +	__u8 mode;
> +	__u8 padding1;
> +	__u32 event;
> +	struct kvm_regs regs;
> +	struct kvm_sregs sregs;
> +	struct {
> +		__u64 sysenter_cs;
> +		__u64 sysenter_esp;
> +		__u64 sysenter_eip;
> +		__u64 efer;
> +		__u64 star;
> +		__u64 lstar;
> +	} msrs;
> +};
> +
> +struct kvmi_event_x86_reply {
> +	struct kvm_regs regs;
> +	__u32 actions;
> +	__u32 padding;
> +};
> +
> +struct kvmi_event_cr {
> +	__u16 cr;
> +	__u16 padding[3];
> +	__u64 old_value;
> +	__u64 new_value;
> +};
> +
> +struct kvmi_event_cr_reply {
> +	__u64 new_val;
> +};
> +
> +struct kvmi_event_msr {
> +	__u32 msr;
> +	__u32 padding;
> +	__u64 old_value;
> +	__u64 new_value;
> +};
> +
> +struct kvmi_event_msr_reply {
> +	__u64 new_val;
> +};
> +
> +struct kvmi_event_xsetbv {
> +	__u64 xcr0;
> +};
> +
> +struct kvmi_event_breakpoint {
> +	__u64 gpa;
> +};
> +
> +struct kvmi_event_page_fault {
> +	__u64 gva;
> +	__u64 gpa;
> +	__u32 mode;
> +	__u32 padding;
> +};
> +
> +struct kvmi_event_page_fault_reply {
> +	__u32 ctx_size;
> +	__u8 ctx_data[256];
> +};
> +
> +struct kvmi_event_trap {
> +	__u32 vector;
> +	__u32 type;
> +	__u32 err;
> +	__u32 padding;
> +	__u64 cr2;
> +};
> +
> +#endif /* __KVMI_H_INCLUDED__ */
Adalbert Lazăr July 10, 2017, 3:32 p.m. UTC | #2
Thanks for your review, Paolo.
Below are some answers.
Will have to chew on the others.

On Fri, 7 Jul 2017 18:52:45 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 07/07/2017 16:34, Adalbert Lazar wrote:
> > +:Returns: ↴
> 
> What is this character? :)

I thought using a nice right-down arrow instead of a boring/inexpressive
ASCII char in order to avoid trailing spaces will be cool. At least it
was fun :)

> > +	struct kvmi_get_version_reply {
> > +		__s32 err;
> > +		__u32 version;
> > +	};
> > +
> > +Returns the introspection API version (the KVMI_VERSION constant) and the
> > +error code (zero). In case of an unlikely error, the version will have an
> > +undefined value.
> 
> Would it make sense to return a few more information fields, for example
> the set of supported KVMI_CONTROL_EVENTS events?

Sure.

> > +	struct kvmi_get_guest_info_reply {
> > +		__s32 err;
> > +		__u16 vcpu_count;
> > +		__u16 padding;
> > +		__u64 tsc_speed;
> > +	};
> > +
> > +Returns the number of online vcpus, and the TSC frequency in HZ, if supported
> > +by the architecture (otherwise is 0).
> 
> Can the TSC frequency be specified only if the guest is using TSC
> scaling?  Defining the TSC frequency on older hosts is a bit tricky.  0
> would be the host.
> 
> Maybe define the second padding to be
> 
> 	__u16 arch;
> 
> (0 = x86) followed by an arch-specific payload.

Sure.

> > +5. KVMI_SHUTDOWN_GUEST
> > +----------------------
> > +
> > +Ungracefully shutdown the guest.
> 
> Note that all KVM can do here is pass down a request asynchronously to
> userspace, failing further KVM_RUN ioctls.  I'm suggesting below an
> alternative action.

We were thinking about sending an "ungraceful" TERM signal :D

Adding another reponse (exit-to-userspace-with-shutdown) on events,
as you suggested below, makes sense.

> > +12. KVMI_SET_PAGE_ACCESS
> > +------------------------
> > +
> > +Sets the spte flags (rwx - present, write & user) - access - for the specified
> > +vCPU and guest physical address.
> 
> rwx or pwu?  I suppose RWX.  Maybe #define the constants in the
> documentation.

RWX.
Will do.

> > +17. KVMI_CONTROL_EVENTS
> > +-----------------------
> > +
> > +:Architectures: all
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_control_events {
> > +		__u16 vcpu;
> > +		__u16 padding;
> > +		__u32 events;
> > +	};
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_error_code {
> > +		__s32 err;
> > +		__u32 padding;
> > +	};
> > +
> > +Enables/disables vCPU introspection events, by setting/clearing one or more
> > +of the following bits (see 'Events' below) :
> > +
> > +	KVMI_EVENT_CR
> > +	KVMI_EVENT_MSR
> > +	KVMI_EVENT_XSETBV
> > +	KVMI_EVENT_BREAKPOINT
> > +	KVMI_EVENT_USER_CALL
> > +	KVMI_EVENT_PAGE_FAULT
> > +	KVMI_EVENT_TRAP
> > +
> > +Trying to enable unsupported events (~KVMI_KNOWN_EVENTS) by the current
> > +architecture would fail and -EINVAL will be returned.
> 
> I would prefer the interface to allow enable (set bits), disable (clear
> bits) in addition to set (what you have here) the events.

Calling KVMI_CONTROL_EVENTS with event=KVMI_EVENT_CR|KVMI_EVENT_MSR will
enable CR and MSR events and disable all the other events. It functions like
a set and clear, in the same time.

It will be better to have KVMI_ENABLE_CONTROL_EVENTS and
KVMI_DISABLE_CONTROL_EVENTS instead?

> The return value could indicate the set of enabled events.

Indeed.

> > +Events
> > +------
> > +
> > +All vcpu events are sent using the KVMI_EVENT_VCPU message id. No event will
> > +be sent unless enabled with a KVMI_CONTROL_EVENTS command.
> > +
> > +For x86, the message data starts with a common structure::
> > +
> > +	struct kvmi_event_x86 {
> > +		__u16 vcpu;
> > +		__u8 mode;
> > +		__u8 padding1;
> > +		__u32 event;
> > +		struct kvm_regs regs;
> > +		struct kvm_sregs sregs;
> > +		struct {
> > +			__u64 sysenter_cs;
> > +			__u64 sysenter_esp;
> > +			__u64 sysenter_eip;
> > +			__u64 efer;
> > +			__u64 star;
> > +			__u64 lstar;
> 
> cstar too, for AMD CPUs.

Thanks. Will do.

> > +		} msrs;
> > +	};
> > +
> > +In order to help the introspection tool with the event analysis while
> > +avoiding unnecessary introspection commands, the message data holds some
> > +registers (kvm_regs, kvm_sregs and a couple of MSR-s) beside
> > +the vCPU id, its mode (in bytes) and the event (one of the flags set
> > +with the KVMI_CONTROL_EVENTS command).
> > +
> > +The replies to events also start with a common structure, having the
> > +KVMI_EVENT_VCPU_REPLY message id::
> > +
> > +	struct kvmi_event_x86_reply {
> > +		struct kvm_regs regs;
> 
> Put regs last?
> 
> > +		__u32 actions;
> > +		__u32 padding;
> > +	};

Right.

> > +5. KVMI_EVENT_USER_CALL
> > +-----------------------
> 
> Please rename this to KVMI_EVENT_HCALL or HYPERCALL or VMCALL.

Sure.
 
> > +
> > +:Architectures: x86
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_event_x86;
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_event_x86_reply;
> > +
> > +This event is sent on a user hypercall and the introspection has already
> > +already been enabled for this kind of event (KVMI_CONTROL_EVENTS).
> > +
> > +kvmi_event_x86 is sent to the introspector, which can respond with the
> > +KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing the host
> > +kernel to override the general purpose registers using the values from
> > +introspector (regs).
> 
> Does KVMI_EVENT_ACTION_SET_REGS bypass the hypercall?  Why is
> KVMI_EVENT_ACTION_ALLOW not allowed?  As before, I'd prefer
> SKIP/RETRY/ALLOW/CRASH as the actions.

We've used this as way for guest code to communicate with the introspector.
KVMI_EVENT_ACTION_ALLOW is implied here.

> > --- /dev/null
> > +++ b/include/uapi/linux/kvmi.h
...
> 
> Errnos values are not portable, I'd prefer to have them defined
> explicitly in the header.

We were thinking that the introspection tool will take care of this,
as with the endianness.

Surely, it will be much more clear to have the errnos defined here.

Thanks again,

Adalbert
Paolo Bonzini July 10, 2017, 5:03 p.m. UTC | #3
On 10/07/2017 17:32, alazar@bitdefender.com wrote:
> Thanks for your review, Paolo.
> Below are some answers.
> Will have to chew on the others.

I'm not sure what you think of removing KVMI_EVENT_ACTION_SET_REGS and
more or less standardizing on actions SKIP/RETRY/ALLOW/CRASH.

The main remaining issue seems to be map/unmap.

Paolo

> On Fri, 7 Jul 2017 18:52:45 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 07/07/2017 16:34, Adalbert Lazar wrote:
>>> +:Returns: ↴
>>
>> What is this character? :)
> 
> I thought using a nice right-down arrow instead of a boring/inexpressive
> ASCII char in order to avoid trailing spaces will be cool. At least it
> was fun :)
> 
>>> +	struct kvmi_get_version_reply {
>>> +		__s32 err;
>>> +		__u32 version;
>>> +	};
>>> +
>>> +Returns the introspection API version (the KVMI_VERSION constant) and the
>>> +error code (zero). In case of an unlikely error, the version will have an
>>> +undefined value.
>>
>> Would it make sense to return a few more information fields, for example
>> the set of supported KVMI_CONTROL_EVENTS events?
> 
> Sure.
> 
>>> +	struct kvmi_get_guest_info_reply {
>>> +		__s32 err;
>>> +		__u16 vcpu_count;
>>> +		__u16 padding;
>>> +		__u64 tsc_speed;
>>> +	};
>>> +
>>> +Returns the number of online vcpus, and the TSC frequency in HZ, if supported
>>> +by the architecture (otherwise is 0).
>>
>> Can the TSC frequency be specified only if the guest is using TSC
>> scaling?  Defining the TSC frequency on older hosts is a bit tricky.  0
>> would be the host.
>>
>> Maybe define the second padding to be
>>
>> 	__u16 arch;
>>
>> (0 = x86) followed by an arch-specific payload.
> 
> Sure.
> 
>>> +5. KVMI_SHUTDOWN_GUEST
>>> +----------------------
>>> +
>>> +Ungracefully shutdown the guest.
>>
>> Note that all KVM can do here is pass down a request asynchronously to
>> userspace, failing further KVM_RUN ioctls.  I'm suggesting below an
>> alternative action.
> 
> We were thinking about sending an "ungraceful" TERM signal :D
> 
> Adding another reponse (exit-to-userspace-with-shutdown) on events,
> as you suggested below, makes sense.
> 
>>> +12. KVMI_SET_PAGE_ACCESS
>>> +------------------------
>>> +
>>> +Sets the spte flags (rwx - present, write & user) - access - for the specified
>>> +vCPU and guest physical address.
>>
>> rwx or pwu?  I suppose RWX.  Maybe #define the constants in the
>> documentation.
> 
> RWX.
> Will do.
> 
>>> +17. KVMI_CONTROL_EVENTS
>>> +-----------------------
>>> +
>>> +:Architectures: all
>>> +:Versions: >= 1
>>> +:Parameters: ↴
>>> +
>>> +::
>>> +
>>> +	struct kvmi_control_events {
>>> +		__u16 vcpu;
>>> +		__u16 padding;
>>> +		__u32 events;
>>> +	};
>>> +
>>> +:Returns: ↴
>>> +
>>> +::
>>> +
>>> +	struct kvmi_error_code {
>>> +		__s32 err;
>>> +		__u32 padding;
>>> +	};
>>> +
>>> +Enables/disables vCPU introspection events, by setting/clearing one or more
>>> +of the following bits (see 'Events' below) :
>>> +
>>> +	KVMI_EVENT_CR
>>> +	KVMI_EVENT_MSR
>>> +	KVMI_EVENT_XSETBV
>>> +	KVMI_EVENT_BREAKPOINT
>>> +	KVMI_EVENT_USER_CALL
>>> +	KVMI_EVENT_PAGE_FAULT
>>> +	KVMI_EVENT_TRAP
>>> +
>>> +Trying to enable unsupported events (~KVMI_KNOWN_EVENTS) by the current
>>> +architecture would fail and -EINVAL will be returned.
>>
>> I would prefer the interface to allow enable (set bits), disable (clear
>> bits) in addition to set (what you have here) the events.
> 
> Calling KVMI_CONTROL_EVENTS with event=KVMI_EVENT_CR|KVMI_EVENT_MSR will
> enable CR and MSR events and disable all the other events. It functions like
> a set and clear, in the same time.
> 
> It will be better to have KVMI_ENABLE_CONTROL_EVENTS and
> KVMI_DISABLE_CONTROL_EVENTS instead?
> 
>> The return value could indicate the set of enabled events.
> 
> Indeed.
> 
>>> +Events
>>> +------
>>> +
>>> +All vcpu events are sent using the KVMI_EVENT_VCPU message id. No event will
>>> +be sent unless enabled with a KVMI_CONTROL_EVENTS command.
>>> +
>>> +For x86, the message data starts with a common structure::
>>> +
>>> +	struct kvmi_event_x86 {
>>> +		__u16 vcpu;
>>> +		__u8 mode;
>>> +		__u8 padding1;
>>> +		__u32 event;
>>> +		struct kvm_regs regs;
>>> +		struct kvm_sregs sregs;
>>> +		struct {
>>> +			__u64 sysenter_cs;
>>> +			__u64 sysenter_esp;
>>> +			__u64 sysenter_eip;
>>> +			__u64 efer;
>>> +			__u64 star;
>>> +			__u64 lstar;
>>
>> cstar too, for AMD CPUs.
> 
> Thanks. Will do.
> 
>>> +		} msrs;
>>> +	};
>>> +
>>> +In order to help the introspection tool with the event analysis while
>>> +avoiding unnecessary introspection commands, the message data holds some
>>> +registers (kvm_regs, kvm_sregs and a couple of MSR-s) beside
>>> +the vCPU id, its mode (in bytes) and the event (one of the flags set
>>> +with the KVMI_CONTROL_EVENTS command).
>>> +
>>> +The replies to events also start with a common structure, having the
>>> +KVMI_EVENT_VCPU_REPLY message id::
>>> +
>>> +	struct kvmi_event_x86_reply {
>>> +		struct kvm_regs regs;
>>
>> Put regs last?
>>
>>> +		__u32 actions;
>>> +		__u32 padding;
>>> +	};
> 
> Right.
> 
>>> +5. KVMI_EVENT_USER_CALL
>>> +-----------------------
>>
>> Please rename this to KVMI_EVENT_HCALL or HYPERCALL or VMCALL.
> 
> Sure.
>  
>>> +
>>> +:Architectures: x86
>>> +:Versions: >= 1
>>> +:Parameters: ↴
>>> +
>>> +::
>>> +
>>> +	struct kvmi_event_x86;
>>> +
>>> +:Returns: ↴
>>> +
>>> +::
>>> +
>>> +	struct kvmi_event_x86_reply;
>>> +
>>> +This event is sent on a user hypercall and the introspection has already
>>> +already been enabled for this kind of event (KVMI_CONTROL_EVENTS).
>>> +
>>> +kvmi_event_x86 is sent to the introspector, which can respond with the
>>> +KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing the host
>>> +kernel to override the general purpose registers using the values from
>>> +introspector (regs).
>>
>> Does KVMI_EVENT_ACTION_SET_REGS bypass the hypercall?  Why is
>> KVMI_EVENT_ACTION_ALLOW not allowed?  As before, I'd prefer
>> SKIP/RETRY/ALLOW/CRASH as the actions.
> 
> We've used this as way for guest code to communicate with the introspector.
> KVMI_EVENT_ACTION_ALLOW is implied here.
> 
>>> --- /dev/null
>>> +++ b/include/uapi/linux/kvmi.h
> ...
>>
>> Errnos values are not portable, I'd prefer to have them defined
>> explicitly in the header.
> 
> We were thinking that the introspection tool will take care of this,
> as with the endianness.
> 
> Surely, it will be much more clear to have the errnos defined here.
> 
> Thanks again,
> 
> Adalbert
>
Adalbert Lazăr July 11, 2017, 4:48 p.m. UTC | #4
On Mon, 10 Jul 2017 19:03:06 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote:
> I'm not sure what you think of removing KVMI_EVENT_ACTION_SET_REGS and
> more or less standardizing on actions SKIP/RETRY/ALLOW/CRASH.
> 
> The main remaining issue seems to be map/unmap.

Definitely, SKIP/RETRY/ALLOW/CRASH looks better, but SET_REGS helps
performance wise. Maybe we could have it as an optional flag for ALLOW? Or
at least for the hot paths?

Summarily, on events, besides CRASH (vs SHUTDOWN cmd) and any other
additional flag:

  * CR, MSR - ALLOW with untouched new_value will let the guest continue,
              but with "new_value = old_value" is a "deny"

  * xsetbv - ALLOW is implied

  * breakpoint - SKIP means the BP is processed by the introspector,
                 ALLOW means let the guest handle it

  * hypercall - ALLOW is implied

  * page_fault - ALLOW means emulate, RETRY means let guest re-trigger the PF,
                 ALLOW with adjusted PC is a "skip" (done by the tool
                 for the moment).

Adalbert
Paolo Bonzini July 11, 2017, 4:51 p.m. UTC | #5
On 11/07/2017 18:48, Adalbert Lazar wrote:
> On Mon, 10 Jul 2017 19:03:06 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> I'm not sure what you think of removing KVMI_EVENT_ACTION_SET_REGS and
>> more or less standardizing on actions SKIP/RETRY/ALLOW/CRASH.
>>
>> The main remaining issue seems to be map/unmap.
> 
> Definitely, SKIP/RETRY/ALLOW/CRASH looks better, but SET_REGS helps
> performance wise. Maybe we could have it as an optional flag for ALLOW?

Actually it makes more sense for SKIP, I think, where the introspector
is actually doing emulation?

But why is KVMI_SET_REGS slower than a set regs command followed by an
action?

> Or at least for the hot paths?
> 
> Summarily, on events, besides CRASH (vs SHUTDOWN cmd) and any other
> additional flag:
> 
>   * CR, MSR - ALLOW with untouched new_value will let the guest continue,
>               but with "new_value = old_value" is a "deny"
> 
>   * xsetbv - ALLOW is implied
> 
>   * breakpoint - SKIP means the BP is processed by the introspector,
>                  ALLOW means let the guest handle it
> 
>   * hypercall - ALLOW is implied
> 
>   * page_fault - ALLOW means emulate, RETRY means let guest re-trigger the PF,
>                  ALLOW with adjusted PC is a "skip" (done by the tool
>                  for the moment).

This is more complicated than necessary.  I would just make it simple:

- SKIP, adjust RIP to point to the next instruction and enter the guest

- RETRY, enter the guest

- ALLOW, emulate the instruction with information coming from the
response packet

- CRASH, self-explanatory

Paolo
Mihai Donțu July 13, 2017, 5:57 a.m. UTC | #6
On Tue, 2017-07-11 at 18:51 +0200, Paolo Bonzini wrote:
> On 11/07/2017 18:48, Adalbert Lazar wrote:
> > On Mon, 10 Jul 2017 19:03:06 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > > I'm not sure what you think of removing KVMI_EVENT_ACTION_SET_REGS and
> > > more or less standardizing on actions SKIP/RETRY/ALLOW/CRASH.
> > > 
> > > The main remaining issue seems to be map/unmap.
> > 
> > Definitely, SKIP/RETRY/ALLOW/CRASH looks better, but SET_REGS helps
> > performance wise. Maybe we could have it as an optional flag for ALLOW?
> 
> Actually it makes more sense for SKIP, I think, where the introspector
> is actually doing emulation?

I'm afraid I don't undestand your question, however we were looking at
using KVM's x86 emulator rather than putting together our own, as such
software might be fun to write but they take a very long time to get
right. I'd argue that KVM's emulator has already seen a lot of
coverage.

In the future we are looking at maybe moving away from it on Intel-s,
by way of VMFUNC and #VE.

> But why is KVMI_SET_REGS slower than a set regs command followed by an
> action?

To be honest, we just looked at the Xen implementation which gates
writing back the registers to VMCS on them actually having been
changed. I just figured the less VMWRITE-s, the better. I'm also
looking over some older stats that show the registers being written
back quite often. Allow me 1-2 days to gather new ones and followup on
this thread.

> > Or at least for the hot paths?
> > 
> > Summarily, on events, besides CRASH (vs SHUTDOWN cmd) and any other
> > additional flag:
> > 
> >   * CR, MSR - ALLOW with untouched new_value will let the guest continue,
> >               but with "new_value = old_value" is a "deny"
> > 
> >   * xsetbv - ALLOW is implied
> > 
> >   * breakpoint - SKIP means the BP is processed by the introspector,
> >                  ALLOW means let the guest handle it
> > 
> >   * hypercall - ALLOW is implied
> > 
> >   * page_fault - ALLOW means emulate, RETRY means let guest re-trigger the PF,
> >                  ALLOW with adjusted PC is a "skip" (done by the tool
> >                  for the moment).
> 
> This is more complicated than necessary.  I would just make it simple:
> 
> - SKIP, adjust RIP to point to the next instruction and enter the guest
> 
> - RETRY, enter the guest
> 
> - ALLOW, emulate the instruction with information coming from the
> response packet
> 
> - CRASH, self-explanatory

I see no major problem with your version. The reason we removed SKIP
from the #PF description is that the RIP adjustment is done by the
introspection tool after it decodes the instruction and computes its
opcode length. So when the event response reaches KVM, it all looks
like ALLOW was requested as the host-side code would be oblivous to the
RIP change.
Paolo Bonzini July 13, 2017, 7:32 a.m. UTC | #7
On 13/07/2017 07:57, Mihai Donțu wrote:
>> Actually it makes more sense for SKIP, I think, where the introspector
>> is actually doing emulation?
>
> I'm afraid I don't undestand your question, however we were looking at
> using KVM's x86 emulator rather than putting together our own, as such
> software might be fun to write but they take a very long time to get
> right. I'd argue that KVM's emulator has already seen a lot of
> coverage.

Of course!  But there could be some special cases (e.g. hypercalls)
where you do emulation on your own.  In that case, KVMI_SET_REGS + SKIP
is the right thing to do.

> In the future we are looking at maybe moving away from it on Intel-s,
> by way of VMFUNC and #VE.
> 
>> But why is KVMI_SET_REGS slower than a set regs command followed by an
>> action?
> To be honest, we just looked at the Xen implementation which gates
> writing back the registers to VMCS on them actually having been
> changed.

That would be possible on KVMI too.  Just don't do the KVMI_SET_REGS
unless the registers have changed.

Paolo
Mihai Donțu July 13, 2017, 8:36 a.m. UTC | #8
On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote:
> > +3. KVMI_PAUSE_GUEST
> > +-------------------
> > +
> > +:Architectures: all
> > +:Versions: >= 1
> > +:Parameters: {}
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_error_code {
> > +		__s32 err;
> > +		__u32 padding;
> > +	};
> > +
> > +This command will pause all vcpus threads, by getting them out of guest mode
> > +and put them in the "waiting introspection commands" state.
> 
> Pausing all threads synchronously is a tricky concept.  The main issue
> is that the guest could be paused at the userspace level.  Then KVM
> would not be running (userspace is stuck in pthread_cond_wait,
> typically) and could not acknowledge the pause.
> 
> Worse, KVM is not able to distinguish userspace that has paused the VM
> from userspace that is doing MMIO or userspace that has a bug and hung
> somewhere.  And even worse, there are cases where userspace wants to
> modify registers while doing port I/O (the awful VMware RPCI port).  So
> I'd rather avoid this.

I should give more details here: we don't need to pause the vCPU-s in
the sense widely understood but just prevent them from entering the
guest for a short period of time. In our particular case, we need this
when we start introspecting a VM that's already running. For this we
kick the vCPU-s out of the guest so that our scan of the memory does
not race with the guest kernel/applications.

Another use case is when we inject applications into a running guest.
We also kick the vCPU-s out while we atomically make changes to kernel
specific structures.

> > +8. KVMI_GET_MTRR_TYPE
> > +---------------------
> > +
> > +:Architectures: x86
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_mtrr_type {
> > +		__u64 gpa;
> > +	};
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_mtrr_type_reply {
> > +		__s32 err;
> > +		__u32 type;
> > +	};
> > +
> > +Returns the guest memory type for a specific physical address.
> 
> What is this used for?  KVM ignores the guest MTRRs, so if possible I'd
> rather avoid it.

We use it do identify cacheable memory which usually indicates device
memory, something we don't want to touch. We are also looking into
making use of the page attributes (PAT) or other PTE-bits instead of
MTRR, but for the time being MTRR-s are still being relied upon.

> > +9. KVMI_GET_MTRRS
> > +-----------------
> > +
> > +:Architectures: x86
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_mtrrs {
> > +		__u16 vcpu;
> > +		__u16 padding[3];
> > +	};
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_mtrrs_reply {
> > +		__s32 err;
> > +		__u32 padding;
> > +		__u64 pat;
> > +		__u64 cap;
> > +		__u64 type;
> > +	};
> > +
> > +Returns MSR_IA32_CR_PAT, MSR_MTRRcap and MSR_MTRRdefType for the specified
> > +vCPU.
> 
> This could use KVM_GET_REGISTERS, couldn't it?

Yes, I belive it can be folded into that command.

> > +14. KVMI_INJECT_BREAKPOINT
> > +--------------------------
> > +
> > +:Architectures: all
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_inject_breakpoint {
> > +		__u16 vcpu;
> > +		__u16 padding[3];
> > +	};
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_error_code {
> > +		__s32 err;
> > +		__u32 padding;
> > +	};
> > +
> > +Injects a breakpoint for the specified vCPU. This command is usually sent in
> > +response to an event and as such the proper GPR-s will be set with the reply.
> 
> What is a "breakpoint" in this context?

A simple INT3. It's what most of our patches consist of. We keep track
of them and handle the ones we own while pass (reinject) the ones used
by the guest (debuggers or whatnot).
Paolo Bonzini July 13, 2017, 9:15 a.m. UTC | #9
On 13/07/2017 10:36, Mihai Donțu wrote:
> On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote:
>> Worse, KVM is not able to distinguish userspace that has paused the VM
>> from userspace that is doing MMIO or userspace that has a bug and hung
>> somewhere.  And even worse, there are cases where userspace wants to
>> modify registers while doing port I/O (the awful VMware RPCI port).  So
>> I'd rather avoid this.
> 
> I should give more details here: we don't need to pause the vCPU-s in
> the sense widely understood but just prevent them from entering the
> guest for a short period of time. In our particular case, we need this
> when we start introspecting a VM that's already running. For this we
> kick the vCPU-s out of the guest so that our scan of the memory does
> not race with the guest kernel/applications.
> 
> Another use case is when we inject applications into a running guest.
> We also kick the vCPU-s out while we atomically make changes to kernel
> specific structures.

This is not possible to do in KVM, because KVM doesn't control what
happens to the memory outside KVM_RUN (and of course it doesn't control
devices doing DMA).  You need to talk to QEMU in order to do this.

To do atomic changes to kernel specific structures, I would change the
page tables to inaccessible instead, but that also doesn't protect them
from devices doing DMA into them.

Another issue: say a VM is waiting for a reply from the introspector,
and the reply is delayed so the VM gets a signal and needs to get out to
QEMU with EINTR.  I don't think it is always possible to retry the
instruction on the next KVM_RUN, because the introspector might be
making partial changes.  Add live migration to the mix if you want to
make things even more complicated. :)

I think we need a way to mark a set of commands for atomic application.
That is, the structure of the command stream needs to be

    command 1
    command 2
    event reply 1
    transaction end marker
    command 3
    transaction end marker
    command 4
    event reply 2
    transaction end marker

>>> +8. KVMI_GET_MTRR_TYPE
>>> +---------------------
>>
>> What is this used for?  KVM ignores the guest MTRRs, so if possible I'd
>> rather avoid it.
> 
> We use it do identify cacheable memory which usually indicates device
> memory, something we don't want to touch. We are also looking into
> making use of the page attributes (PAT) or other PTE-bits instead of
> MTRR, but for the time being MTRR-s are still being relied upon.

Fair enough.  But you can compute it yourself from the MTRRs, can't you?
 A separate command is just adding attack surface in the hypervisor.

>>> +14. KVMI_INJECT_BREAKPOINT
>>> +--------------------------
>>> +
>>> +:Architectures: all
>>> +:Versions: >= 1
>>> +:Parameters: ↴
>>> +
>>> +::
>>> +
>>> +	struct kvmi_inject_breakpoint {
>>> +		__u16 vcpu;
>>> +		__u16 padding[3];
>>> +	};
>>> +
>>> +:Returns: ↴
>>> +
>>> +::
>>> +
>>> +	struct kvmi_error_code {
>>> +		__s32 err;
>>> +		__u32 padding;
>>> +	};
>>> +
>>> +Injects a breakpoint for the specified vCPU. This command is usually sent in
>>> +response to an event and as such the proper GPR-s will be set with the reply.
>>
>> What is a "breakpoint" in this context?
> 
> A simple INT3. It's what most of our patches consist of. We keep track
> of them and handle the ones we own while pass (reinject) the ones used
> by the guest (debuggers or whatnot).

Why can't they be written with KVMI_READ/WRITE_PHYSICAL?  (I would keep
those two as they provide a more direct interface than map/unmap, and
they work even with introspectors that are not sibling guests of the
introspected VM).

Paolo
Mihai Donțu July 18, 2017, 11:51 a.m. UTC | #10
On Thu, 2017-07-13 at 09:32 +0200, Paolo Bonzini wrote:
> On 13/07/2017 07:57, Mihai Donțu wrote:
> > > Actually it makes more sense for SKIP, I think, where the introspector
> > > is actually doing emulation?
> > 
> > I'm afraid I don't undestand your question, however we were looking at
> > using KVM's x86 emulator rather than putting together our own, as such
> > software might be fun to write but they take a very long time to get
> > right. I'd argue that KVM's emulator has already seen a lot of
> > coverage.
> 
> Of course!  But there could be some special cases (e.g. hypercalls)
> where you do emulation on your own.  In that case, KVMI_SET_REGS + SKIP
> is the right thing to do.

I think I finally understand what you're saying. That SKIP would tell
the introspection subsystem to just write back the registers and enter
the guest, no in-host emulation needed. So, to reiterate, the possible
actions would be:

 * SKIP - re-enter the guest (the introspection tool has adjusted all
   registers)
 * RETRY - re-enter the guest
 * ALLOW - use the emulator
 * CRASH - kill the guest process

It seems that SKIP requires a variant of KVMI_SET_REGS (_FULL?) that
sets all registers that might have been affected by the emulation
(control, MSR-s, MMX/SSE/AVX). I guess there can be an usecase for
that. It also looks like its identical with KVMI_SET_REGS_FULL + RETRY.

> > In the future we are looking at maybe moving away from it on Intel-s,
> > by way of VMFUNC and #VE.
> > 
> > > But why is KVMI_SET_REGS slower than a set regs command followed by an
> > > action?
> > 
> > To be honest, we just looked at the Xen implementation which gates
> > writing back the registers to VMCS on them actually having been
> > changed.
> 
> That would be possible on KVMI too.  Just don't do the KVMI_SET_REGS
> unless the registers have changed.
Mihai Donțu July 18, 2017, 12:02 p.m. UTC | #11
On Tue, 2017-07-18 at 14:51 +0300, Mihai Donțu wrote:
> On Thu, 2017-07-13 at 09:32 +0200, Paolo Bonzini wrote:
> > On 13/07/2017 07:57, Mihai Donțu wrote:
> > > > Actually it makes more sense for SKIP, I think, where the introspector
> > > > is actually doing emulation?
> > > 
> > > I'm afraid I don't undestand your question, however we were looking at
> > > using KVM's x86 emulator rather than putting together our own, as such
> > > software might be fun to write but they take a very long time to get
> > > right. I'd argue that KVM's emulator has already seen a lot of
> > > coverage.
> > 
> > Of course!  But there could be some special cases (e.g. hypercalls)
> > where you do emulation on your own.  In that case, KVMI_SET_REGS + SKIP
> > is the right thing to do.
> 
> I think I finally understand what you're saying. That SKIP would tell
> the introspection subsystem to just write back the registers and enter
> the guest, no in-host emulation needed. So, to reiterate, the possible
> actions would be:
> 
>  * SKIP - re-enter the guest (the introspection tool has adjusted all
>    registers)
>  * RETRY - re-enter the guest
>  * ALLOW - use the emulator
>  * CRASH - kill the guest process
> 
> It seems that SKIP requires a variant of KVMI_SET_REGS (_FULL?) that
> sets all registers that might have been affected by the emulation
> (control, MSR-s, MMX/SSE/AVX). I guess there can be an usecase for
> that. It also looks like its identical with KVMI_SET_REGS_FULL + RETRY.

I mentioned in a previous email something about statistics. The
following are a sort of that:

eventCount: 13.2 eventsMemAccess: 12.4 eventsWriteCtrlReg: 0.8 partialContext: 0.2 partialCpu: 0.2 xcGetMemAccess: 116 xcMapPage: 191.4 xcSetMemAccessMulti: 1.6 
eventCount: 4261.8 eventsBreakPoint: 4170.2 eventsMemAccess: 84.4 eventsWriteCtrlReg: 7.2 setRegisters: 4170.2 xcMapPage: 3.4 xcSetMemAccessMulti: 0.2 
eventCount: 8098.6 eventsBreakPoint: 4783.4 eventsMemAccess: 3302.4 eventsWriteCtrlReg: 12.8 setRegisters: 4783.4 xcMapPage: 2.6 
eventCount: 3495 eventsBreakPoint: 2631.6 eventsMemAccess: 846.6 eventsWriteCtrlReg: 16.8 setRegisters: 2631.6 xcMapPage: 2.6 xcSetMemAccessMulti: 0.4 
eventCount: 538.4 eventsBreakPoint: 248 eventsMemAccess: 279.2 eventsWriteCtrlReg: 11.2 setRegisters: 248 xcMapPage: 1.2 
eventCount: 4876.2 eventsBreakPoint: 3683.4 eventsMemAccess: 1172.8 eventsWriteCtrlReg: 20 setRegisters: 3683.4 xcMapPage: 4.8 
eventCount: 5937.4 eventsBreakPoint: 4403.8 eventsMemAccess: 1507.2 eventsWriteCtrlReg: 26.4 setRegisters: 4403.8 xcMapPage: 4.8 
eventCount: 9992.4 eventsBreakPoint: 7948.6 eventsMemAccess: 2019.8 eventsWriteCtrlReg: 24 setRegisters: 7948.6 xcMapPage: 1.4 xcSetMemAccessMulti: 5 
eventCount: 5150.6 eventsBreakPoint: 2175 eventsMemAccess: 2902.8 eventsWriteCtrlReg: 72.8 setRegisters: 2175 xcGetMemAccess: 0.4 xcMapPage: 8 xcSetMemAccessMulti: 1 
eventCount: 5422.2 eventsBreakPoint: 4362.2 eventsMemAccess: 1012.8 eventsWriteCtrlReg: 47.2 setRegisters: 4362.2 xcGetMemAccess: 2.8 xcMapPage: 10.2 xcSetMemAccessMulti: 3.4 
eventCount: 1910.2 eventsBreakPoint: 1665.6 eventsMemAccess: 231.8 eventsWriteCtrlReg: 12.8 setRegisters: 1665.6 xcGetMemAccess: 0.2 xcMapPage: 2.2 xcSetMemAccessMulti: 0.6 
eventCount: 1834.4 eventsBreakPoint: 1357.6 eventsMemAccess: 462.4 eventsWriteCtrlReg: 14.4 setRegisters: 1357.6 xcGetMemAccess: 0.2 xcMapPage: 4.8 xcSetMemAccessMulti: 0.4 
eventCount: 6081.2 eventsBreakPoint: 4855.6 eventsMemAccess: 1208.8 eventsWriteCtrlReg: 16.8 setRegisters: 4855.6 xcMapPage: 4 
eventCount: 1105.4 eventsBreakPoint: 855 eventsMemAccess: 226.4 eventsWriteCtrlReg: 24 setRegisters: 855 xcMapPage: 1.6 
eventCount: 8362.8 eventsBreakPoint: 4409.2 eventsMemAccess: 3917.6 eventsWriteCtrlReg: 36 setRegisters: 4409.2 xcGetMemAccess: 117.4 xcMapPage: 9 xcSetMemAccessMulti: 254.6 
eventCount: 2222.2 eventsBreakPoint: 32.2 eventsMemAccess: 2169.2 eventsWriteCtrlReg: 20.8 setRegisters: 32.2 xcGetMemAccess: 2.8 xcMapPage: 5 xcSetMemAccessMulti: 104.4 
eventCount: 2889.2 eventsBreakPoint: 1447.8 eventsMemAccess: 1419.8 eventsWriteCtrlReg: 21.6 partialContext: 0.8 partialCpu: 0.8 setRegisters: 1447.8 xcGetMemAccess: 1.4 xcMapPage: 16.6 xcSetMemAccessMulti: 2.2 
eventCount: 1698.8 eventsBreakPoint: 1031.8 eventsMemAccess: 655 eventsWriteCtrlReg: 12 setRegisters: 1031.8 xcGetMemAccess: 0.6 xcMapPage: 5 xcSetMemAccessMulti: 0.8 
eventCount: 691.8 eventsBreakPoint: 250.4 eventsMemAccess: 435.8 eventsWriteCtrlReg: 5.6 setRegisters: 250.4 xcGetMemAccess: 0.4 xcMapPage: 4 xcSetMemAccessMulti: 0.4 
eventCount: 756.8 eventsBreakPoint: 293.2 eventsMemAccess: 454.8 eventsWriteCtrlReg: 8.8 setRegisters: 293.2 xcGetMemAccess: 1.4 xcMapPage: 3.4 xcSetMemAccessMulti: 1.4 

These represent the number events/calls per second during a normal
introspection session (start Windows 10 x64, open Edge). The breakpoint
events correspond to the various API calls invoked and, as it can be
seen, for each of them we do a 'setRegisters'. We were hoping we can
reduce the overhead by a bit by bundling KVMI_SET_REGISTERS with the
event response.

If I have not managed to convince you, I think we can go ahead and keep
them separate, have an initial implementation and see some actual
performance numbers. Should be no hassle. :-)

> > > In the future we are looking at maybe moving away from it on Intel-s,
> > > by way of VMFUNC and #VE.
> > > 
> > > > But why is KVMI_SET_REGS slower than a set regs command followed by an
> > > > action?
> > > 
> > > To be honest, we just looked at the Xen implementation which gates
> > > writing back the registers to VMCS on them actually having been
> > > changed.
> > 
> > That would be possible on KVMI too.  Just don't do the KVMI_SET_REGS
> > unless the registers have changed.
Paolo Bonzini July 23, 2017, 1:02 p.m. UTC | #12
On 18/07/2017 13:51, Mihai Donțu wrote:
> On Thu, 2017-07-13 at 09:32 +0200, Paolo Bonzini wrote:
>> On 13/07/2017 07:57, Mihai Donțu wrote:
>>>> Actually it makes more sense for SKIP, I think, where the introspector
>>>> is actually doing emulation?
>>>
>>> I'm afraid I don't undestand your question, however we were looking at
>>> using KVM's x86 emulator rather than putting together our own, as such
>>> software might be fun to write but they take a very long time to get
>>> right. I'd argue that KVM's emulator has already seen a lot of
>>> coverage.
>>
>> Of course!  But there could be some special cases (e.g. hypercalls)
>> where you do emulation on your own.  In that case, KVMI_SET_REGS + SKIP
>> is the right thing to do.
> 
> I think I finally understand what you're saying. That SKIP would tell
> the introspection subsystem to just write back the registers and enter
> the guest, no in-host emulation needed. So, to reiterate, the possible
> actions would be:
> 
>  * SKIP - re-enter the guest (the introspection tool has adjusted all
>    registers)
>  * RETRY - re-enter the guest
>  * ALLOW - use the emulator
>  * CRASH - kill the guest process
> 
> It seems that SKIP requires a variant of KVMI_SET_REGS (_FULL?) that
> sets all registers that might have been affected by the emulation
> (control, MSR-s, MMX/SSE/AVX). I guess there can be an usecase for
> that. It also looks like its identical with KVMI_SET_REGS_FULL + RETRY.

One difference that comes to mind between SKIP and RETRY is that SKIP
would inject a singlestep exception if TF was 1 on entry, and clear the
interrupt shadow.  RETRY would not do either of those.

(The name for SKIP comes from the KVM function
kvm_skip_emulated_instruction).

> We were hoping we can
> reduce the overhead by a bit by bundling KVMI_SET_REGISTERS with the
> event response.
> 
> If I have not managed to convince you, I think we can go ahead and keep
> them separate, have an initial implementation and see some actual
> performance numbers. Should be no hassle. 

I think you should implement transactions in the protocol, so
effectively KVMI_SET_REGISTERS would be bundled with the event response
anyway.

Paolo

>>> In the future we are looking at maybe moving away from it on Intel-s,
>>> by way of VMFUNC and #VE.
>>>
>>>> But why is KVMI_SET_REGS slower than a set regs command followed by an
>>>> action?
>>>
>>> To be honest, we just looked at the Xen implementation which gates
>>> writing back the registers to VMCS on them actually having been
>>> changed.
>>
>> That would be possible on KVMI too.  Just don't do the KVMI_SET_REGS
>> unless the registers have changed.
>
Mihai Donțu July 26, 2017, 5:04 p.m. UTC | #13
On Sun, 2017-07-23 at 15:02 +0200, Paolo Bonzini wrote:
> On 18/07/2017 13:51, Mihai Donțu wrote:
> > On Thu, 2017-07-13 at 09:32 +0200, Paolo Bonzini wrote:
> > > On 13/07/2017 07:57, Mihai Donțu wrote:
> > > > > Actually it makes more sense for SKIP, I think, where the introspector
> > > > > is actually doing emulation?
> > > > 
> > > > I'm afraid I don't undestand your question, however we were looking at
> > > > using KVM's x86 emulator rather than putting together our own, as such
> > > > software might be fun to write but they take a very long time to get
> > > > right. I'd argue that KVM's emulator has already seen a lot of
> > > > coverage.
> > > 
> > > Of course!  But there could be some special cases (e.g. hypercalls)
> > > where you do emulation on your own.  In that case, KVMI_SET_REGS + SKIP
> > > is the right thing to do.
> > 
> > I think I finally understand what you're saying. That SKIP would tell
> > the introspection subsystem to just write back the registers and enter
> > the guest, no in-host emulation needed. So, to reiterate, the possible
> > actions would be:
> > 
> >  * SKIP - re-enter the guest (the introspection tool has adjusted all
> >    registers)
> >  * RETRY - re-enter the guest
> >  * ALLOW - use the emulator
> >  * CRASH - kill the guest process
> > 
> > It seems that SKIP requires a variant of KVMI_SET_REGS (_FULL?) that
> > sets all registers that might have been affected by the emulation
> > (control, MSR-s, MMX/SSE/AVX). I guess there can be an usecase for
> > that. It also looks like its identical with KVMI_SET_REGS_FULL + RETRY.
> 
> One difference that comes to mind between SKIP and RETRY is that SKIP
> would inject a singlestep exception if TF was 1 on entry, and clear the
> interrupt shadow.  RETRY would not do either of those.
> 
> (The name for SKIP comes from the KVM function
> kvm_skip_emulated_instruction).

OK.

> > We were hoping we can
> > reduce the overhead by a bit by bundling KVMI_SET_REGISTERS with the
> > event response.
> > 
> > If I have not managed to convince you, I think we can go ahead and keep
> > them separate, have an initial implementation and see some actual
> > performance numbers. Should be no hassle. 
> 
> I think you should implement transactions in the protocol, so
> effectively KVMI_SET_REGISTERS would be bundled with the event response
> anyway.

I see. Then maybe we should provide a way for commands to specify an
event ID. If zero, then the command is satisfied using data straight
from the vCPU (when making changes), otherwise a structure associated
with the event will be used as cache for all get-s/set-s and apply them
all in one go when the event reply arrives. This should work nicely
since we read a good deal of the register set anyway when sending the
event.

> > > > In the future we are looking at maybe moving away from it on Intel-s,
> > > > by way of VMFUNC and #VE.
> > > > 
> > > > > But why is KVMI_SET_REGS slower than a set regs command followed by an
> > > > > action?
> > > > 
> > > > To be honest, we just looked at the Xen implementation which gates
> > > > writing back the registers to VMCS on them actually having been
> > > > changed.
> > > 
> > > That would be possible on KVMI too.  Just don't do the KVMI_SET_REGS
> > > unless the registers have changed.
Tamas K Lengyel July 26, 2017, 5:25 p.m. UTC | #14
Hi Mihai,
this series was just brought to my attention (and this is the first
message in my inbox I can reply to) and I'm very happy to see you guys
working on this! I'm still reviewing the series and the only question
I have at this point is whether you will be also adding support for
setting the MTF through KVMI?

Thanks,
Tamas
Paolo Bonzini July 27, 2017, 1:33 p.m. UTC | #15
On 26/07/2017 19:04, Mihai Donțu wrote:
>> I think you should implement transactions in the protocol, so
>> effectively KVMI_SET_REGISTERS would be bundled with the event response
>> anyway.
> 
> I see. Then maybe we should provide a way for commands to specify an
> event ID. If zero, then the command is satisfied using data straight
> from the vCPU (when making changes), otherwise a structure associated
> with the event will be used as cache for all get-s/set-s and apply them
> all in one go when the event reply arrives. This should work nicely
> since we read a good deal of the register set anyway when sending the
> event.

Yes, that is okay.  Just a question, why would the event ID provide more
information than just the vCPU id?  How can there be more than one event
active on the same vCPU?

Paolo
Mihai Donțu July 27, 2017, 2:41 p.m. UTC | #16
Hi Tamas,

On Wed, 2017-07-26 at 11:25 -0600, Tamas K Lengyel wrote:
> this series was just brought to my attention (and this is the first
> message in my inbox I can reply to) and I'm very happy to see you guys
> working on this! I'm still reviewing the series and the only question
> I have at this point is whether you will be also adding support for
> setting the MTF through KVMI?

Thanks for taking a look over this.

Currently we have no code in place for guest single stepping, but since
we're looking into making use of Intel's VMFUNC and #VE, I'm sure a
need for a feature like this will arrise. Or _maybe_ we can make it
part of the #PF handling code and have the introspection tool respond
with ALLOW and an EPT view ID and have KVM automatically switch, single
step, switch back and resume, if that makes sense for everybody.

Regards,
Mihai Donțu July 27, 2017, 2:46 p.m. UTC | #17
On Thu, 2017-07-27 at 15:33 +0200, Paolo Bonzini wrote:
> On 26/07/2017 19:04, Mihai Donțu wrote:
> > > I think you should implement transactions in the protocol, so
> > > effectively KVMI_SET_REGISTERS would be bundled with the event response
> > > anyway.
> > 
> > I see. Then maybe we should provide a way for commands to specify an
> > event ID. If zero, then the command is satisfied using data straight
> > from the vCPU (when making changes), otherwise a structure associated
> > with the event will be used as cache for all get-s/set-s and apply them
> > all in one go when the event reply arrives. This should work nicely
> > since we read a good deal of the register set anyway when sending the
> > event.
> 
> Yes, that is okay.  Just a question, why would the event ID provide more
> information than just the vCPU id?  How can there be more than one event
> active on the same vCPU?

You're right. There can't. We'll just add some logic that says if the
vCPU has an event pending (or currently being handled), to use that
cache to satisfy all commands.

Thanks,
Mihai Donțu July 27, 2017, 4:23 p.m. UTC | #18
On Thu, 2017-07-13 at 11:15 +0200, Paolo Bonzini wrote:
> On 13/07/2017 10:36, Mihai Donțu wrote:
> > On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote:
> > > Worse, KVM is not able to distinguish userspace that has paused the VM
> > > from userspace that is doing MMIO or userspace that has a bug and hung
> > > somewhere.  And even worse, there are cases where userspace wants to
> > > modify registers while doing port I/O (the awful VMware RPCI port).  So
> > > I'd rather avoid this.
> > 
> > I should give more details here: we don't need to pause the vCPU-s in
> > the sense widely understood but just prevent them from entering the
> > guest for a short period of time. In our particular case, we need this
> > when we start introspecting a VM that's already running. For this we
> > kick the vCPU-s out of the guest so that our scan of the memory does
> > not race with the guest kernel/applications.
> > 
> > Another use case is when we inject applications into a running guest.
> > We also kick the vCPU-s out while we atomically make changes to kernel
> > specific structures.
> 
> This is not possible to do in KVM, because KVM doesn't control what
> happens to the memory outside KVM_RUN (and of course it doesn't control
> devices doing DMA).  You need to talk to QEMU in order to do this.

Maybe add a new exit reason (eg. KVM_EXIT_PAUSE) and have qemu wait on
the already opened file descriptor to /dev/kvm for an event?

> To do atomic changes to kernel specific structures, I would change the
> page tables to inaccessible instead, but that also doesn't protect them
> from devices doing DMA into them.

If we have qemu pull out of the guest all vCPU-s and wait for a sign
from the KVMI subsystem, then that looks sufficient. Devices acessing
the memory (passedthrough devices, I assume) should be no problem as
we're never interested in device memory.

> Another issue: say a VM is waiting for a reply from the introspector,
> and the reply is delayed so the VM gets a signal and needs to get out to
> QEMU with EINTR.  I don't think it is always possible to retry the
> instruction on the next KVM_RUN, because the introspector might be
> making partial changes.  Add live migration to the mix if you want to
> make things even more complicated. :)
> 
> I think we need a way to mark a set of commands for atomic application.
> That is, the structure of the command stream needs to be
> 
>     command 1
>     command 2
>     event reply 1
>     transaction end marker
>     command 3
>     transaction end marker
>     command 4
>     event reply 2
>     transaction end marker

This should be covered by a previous email exchange. Commands targeting
vCPU-s for which events are pending or are currently being handled
should be done via a structure that acts like a cache, so that when the
event reply reaches KVM, all potential modifications are applied in one
go. This affects all register manipulations. Commands targeting memory
are unaffected by the state of the vCPU, though this might change when
we factor EPT views. But, alas, we have no clear view on this last
topic as of yet.

> > > > +8. KVMI_GET_MTRR_TYPE
> > > > +---------------------
> > > 
> > > What is this used for?  KVM ignores the guest MTRRs, so if possible I'd
> > > rather avoid it.
> > 
> > We use it do identify cacheable memory which usually indicates device
> > memory, something we don't want to touch. We are also looking into
> > making use of the page attributes (PAT) or other PTE-bits instead of
> > MTRR, but for the time being MTRR-s are still being relied upon.
> 
> Fair enough.  But you can compute it yourself from the MTRRs, can't you?
> A separate command is just adding attack surface in the hypervisor.

I think we can make some basic MTRR info available via GET_REGISTERS
and do the rest in the introspection tool.

> > > > +14. KVMI_INJECT_BREAKPOINT
> > > > +--------------------------
> > > > +
> > > > +:Architectures: all
> > > > +:Versions: >= 1
> > > > +:Parameters: ↴
> > > > +
> > > > +::
> > > > +
> > > > +	struct kvmi_inject_breakpoint {
> > > > +		__u16 vcpu;
> > > > +		__u16 padding[3];
> > > > +	};
> > > > +
> > > > +:Returns: ↴
> > > > +
> > > > +::
> > > > +
> > > > +	struct kvmi_error_code {
> > > > +		__s32 err;
> > > > +		__u32 padding;
> > > > +	};
> > > > +
> > > > +Injects a breakpoint for the specified vCPU. This command is usually sent in
> > > > +response to an event and as such the proper GPR-s will be set with the reply.
> > > 
> > > What is a "breakpoint" in this context?
> > 
> > A simple INT3. It's what most of our patches consist of. We keep track
> > of them and handle the ones we own while pass (reinject) the ones used
> > by the guest (debuggers or whatnot).
> 
> Why can't they be written with KVMI_READ/WRITE_PHYSICAL?  (I would keep
> those two as they provide a more direct interface than map/unmap, and
> they work even with introspectors that are not sibling guests of the
> introspected VM).

They can, nothing is stopping that. Also, we can keep the plain
read/write interfaces around. It just seemed easier to implement them
on top of an eventual mmap/munmap interface.
Paolo Bonzini July 27, 2017, 4:52 p.m. UTC | #19
On 27/07/2017 18:23, Mihai Donțu wrote:
> On Thu, 2017-07-13 at 11:15 +0200, Paolo Bonzini wrote:
>> On 13/07/2017 10:36, Mihai Donțu wrote:
>>> On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote:
>>>> Worse, KVM is not able to distinguish userspace that has paused the VM
>>>> from userspace that is doing MMIO or userspace that has a bug and hung
>>>> somewhere.  And even worse, there are cases where userspace wants to
>>>> modify registers while doing port I/O (the awful VMware RPCI port).  So
>>>> I'd rather avoid this.
>>>
>>> I should give more details here: we don't need to pause the vCPU-s in
>>> the sense widely understood but just prevent them from entering the
>>> guest for a short period of time. In our particular case, we need this
>>> when we start introspecting a VM that's already running. For this we
>>> kick the vCPU-s out of the guest so that our scan of the memory does
>>> not race with the guest kernel/applications.
>>>
>>> Another use case is when we inject applications into a running guest.
>>> We also kick the vCPU-s out while we atomically make changes to kernel
>>> specific structures.
>>
>> This is not possible to do in KVM, because KVM doesn't control what
>> happens to the memory outside KVM_RUN (and of course it doesn't control
>> devices doing DMA).  You need to talk to QEMU in order to do this.
> 
> Maybe add a new exit reason (eg. KVM_EXIT_PAUSE) and have qemu wait on
> the already opened file descriptor to /dev/kvm for an event?

Nope.  QEMU might be running and writing to memory in another thread.  I
don't see how this can be reliable on other hypervisors too, actually.

>> To do atomic changes to kernel specific structures, I would change the
>> page tables to inaccessible instead, but that also doesn't protect them
>> from devices doing DMA into them.
> 
> If we have qemu pull out of the guest all vCPU-s and wait for a sign
> from the KVMI subsystem, then that looks sufficient. Devices acessing
> the memory (passedthrough devices, I assume) should be no problem as
> we're never interested in device memory.

You're certainly interested in bus-master DMA from those devices though.

>> Another issue: say a VM is waiting for a reply from the introspector,
>> and the reply is delayed so the VM gets a signal and needs to get out to
>> QEMU with EINTR.  I don't think it is always possible to retry the
>> instruction on the next KVM_RUN, because the introspector might be
>> making partial changes.  Add live migration to the mix if you want to
>> make things even more complicated. :)
>>
>> I think we need a way to mark a set of commands for atomic application.
>> That is, the structure of the command stream needs to be
>>
>>     command 1
>>     command 2
>>     event reply 1
>>     transaction end marker
>>     command 3
>>     transaction end marker
>>     command 4
>>     event reply 2
>>     transaction end marker
> 
> This should be covered by a previous email exchange.

Correct.

>>>>> +8. KVMI_GET_MTRR_TYPE
>>>>> +---------------------
>>>>
>>>> What is this used for?  KVM ignores the guest MTRRs, so if possible I'd
>>>> rather avoid it.
>>>
>>> We use it do identify cacheable memory which usually indicates device
>>> memory, something we don't want to touch. We are also looking into
>>> making use of the page attributes (PAT) or other PTE-bits instead of
>>> MTRR, but for the time being MTRR-s are still being relied upon.
>>
>> Fair enough.  But you can compute it yourself from the MTRRs, can't you?
>> A separate command is just adding attack surface in the hypervisor.
> 
> I think we can make some basic MTRR info available via GET_REGISTERS
> and do the rest in the introspection tool.

Ok.

>>>>> +Injects a breakpoint for the specified vCPU. This command is usually sent in
>>>>> +response to an event and as such the proper GPR-s will be set with the reply.
>>>>
>>>> What is a "breakpoint" in this context?
>>>
>>> A simple INT3. It's what most of our patches consist of. We keep track
>>> of them and handle the ones we own while pass (reinject) the ones used
>>> by the guest (debuggers or whatnot).
>>
>> Why can't they be written with KVMI_READ/WRITE_PHYSICAL?  (I would keep
>> those two as they provide a more direct interface than map/unmap, and
>> they work even with introspectors that are not sibling guests of the
>> introspected VM).
> 
> They can, nothing is stopping that. Also, we can keep the plain
> read/write interfaces around. It just seemed easier to implement them
> on top of an eventual mmap/munmap interface.

I prefer to keep the simple interface and drop the breakpoint one.

Paolo
Mihai Donțu July 27, 2017, 5:06 p.m. UTC | #20
On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote:
> > +2. KVMI_GET_GUEST_INFO
> > +----------------------
> > +
> > +:Architectures: all
> > +:Versions: >= 1
> > +:Parameters: {}
> > +:Returns: ↴
> > +
> > +::
> > +
> > > > +	struct kvmi_get_guest_info_reply {
> > > > +		__s32 err;
> > > > +		__u16 vcpu_count;
> > > > +		__u16 padding;
> > > > +		__u64 tsc_speed;
> > > > +	};
> > +
> > +Returns the number of online vcpus, and the TSC frequency in HZ, if supported
> > +by the architecture (otherwise is 0).
> 
> Can the TSC frequency be specified only if the guest is using TSC
> scaling?  Defining the TSC frequency on older hosts is a bit tricky.  0
> would be the host.
> 
> Maybe define the second padding to be
> 
> 	__u16 arch;
> 
> (0 = x86) followed by an arch-specific payload.

We use the TSC to compute some performance numbers, but only when we're
debugging reported issues. Should be OK if the TSC information is not
available. We'll manage in some other way.

> > +10. KVMI_GET_XSAVE_INFO
> > +-----------------------
> > +
> > +:Architectures: x86
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_xsave_info {
> > +		__u16 vcpu;
> > +		__u16 padding[3];
> > +	};
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_xsave_info_reply {
> > +		__s32 err;
> > +		__u32 size;
> > +	};
> > +
> > +Returns the xstate size for the specified vCPU.
> 
> Could this be replaced by a generic CPUID accessor?

Absolutely. I wonder if we should only made certain leafs acessible,
part of the whole "make the least amount of information available"
security philosophy, though I'm not aware of any attacks that can be
mounted on the host just by knowing too many things about a guest.

> > +11. KVMI_GET_PAGE_ACCESS
> > +------------------------
> > +
> > +:Architectures: all
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_get_page_access {
> > +		__u16 vcpu;
> > +		__u16 padding[3];
> > +		__u64 gpa;
> > +	};
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_get_page_access_reply {
> > +		__s32 err;
> > +		__u32 access;
> > +	};
> > +
> > +Returns the spte flags (rwx - present, write & user) for the specified
> > +vCPU and guest physical address.
> > +
> > +12. KVMI_SET_PAGE_ACCESS
> > +------------------------
> > +
> > +:Architectures: all
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_set_page_access {
> > +		__u16 vcpu;
> > +		__u16 padding;
> > +		__u32 access;
> > +		__u64 gpa;
> > +	};
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_error_code {
> > +		__s32 err;
> > +		__u32 padding;
> > +	};
> > +
> > +Sets the spte flags (rwx - present, write & user) - access - for the specified
> > +vCPU and guest physical address.
> 
> rwx or pwu?  I suppose RWX.  Maybe #define the constants in the
> documentation.
> 
> Also, it should be noted here that the spte flags are ANDed with
> whatever is provided by userspace, because there could be readonly
> memslots, and that some access combinations could fail (--x) or will
> surely fail (-wx for example).

We are closely looking into how KVM's MMU works and see how we'd go
about extending it so as to make everyone happy (qemu and a possible
introspection tool).

As for the permitted page access flags, we have (thus far) kept away
from invalid combinations on older arch-es (--x on pre-SandyBridge, for
example). However, it would be very nice to have arch-specific code
that does a validation ahead of time (ie. before the guest is re-
entered) so that an introspection tool can use an alternative approach.

> > +13. KVMI_INJECT_PAGE_FAULT
> > +--------------------------
> > +
> > +:Architectures: x86
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_page_fault {
> > +		__u16 vcpu;
> > +		__u16 padding;
> > +		__u32 error;
> 
> error_code, I guess?  Why not a generic inject exception message?

OK, we will rework the interface to be used for generic exception
injection. Maybe we can fold the breakpoint injection into it too.

> > +		__u64 gva;
> > +	};
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_error_code {
> > +		__s32 err;
> > +		__u32 padding;
> > +	};
> > +
> > +Injects a vCPU page fault with the specified guest virtual address and
> > +error code.
> > +
> > +5. KVMI_EVENT_USER_CALL
> > +-----------------------
> 
> Please rename this to KVMI_EVENT_HCALL or HYPERCALL or VMCALL.
> 
> > +
> > +:Architectures: x86
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_event_x86;
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_event_x86_reply;
> > +
> > +This event is sent on a user hypercall and the introspection has already
> > +already been enabled for this kind of event (KVMI_CONTROL_EVENTS).
> > +
> > +kvmi_event_x86 is sent to the introspector, which can respond with the
> > +KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing the host
> > +kernel to override the general purpose registers using the values from
> > +introspector (regs).
> 
> Does KVMI_EVENT_ACTION_SET_REGS bypass the hypercall?  Why is
> KVMI_EVENT_ACTION_ALLOW not allowed?  As before, I'd prefer
> SKIP/RETRY/ALLOW/CRASH as the actions.

No, we use SET_REGS only as a means to communicate with the guest code
that did the hypercall. We don't specify an action because, in our
specific case, saw no use for it.

Apropos: if possible, we'd like to keep the Xen convention for this
hypercall. It doesn't appear to interfere with either KVM or Hyper-V
(rax = 34 [__HYPERVISOR_hvm_op], rbx/rdi = 24
[HVMOP_guest_request_vm_event] - depends on wether long mode is
enabled).

> > +7. KVMI_EVENT_TRAP
> > +------------------
> > +
> > +:Architectures: x86
> > +:Versions: >= 1
> > +:Parameters: ↴
> > +
> > +::
> > +
> > +	struct kvmi_event_x86;
> > +	struct kvmi_event_trap {
> > +		__u32 vector;
> > +		__u32 type;
> > +		__u32 err;
> 
> error_code?
> 
> > +		__u32 padding;
> > +		__u64 cr2;
> > +	};
> > +
> > +:Returns: ↴
> > +
> > +::
> > +
> > +	struct kvmi_event_x86_reply;
> > +
> > +This event is sent if a trap will be delivered to the guest (page fault,
> > +breakpoint, etc.) and the introspection has already enabled the reports
> > +for this kind of event (KVMI_CONTROL_EVENTS).
> > +
> > +This is used to inform the introspector of all pending traps giving it
> > +a chance to determine if it should try again later in case a previous
> > +KVMI_INJECT_PAGE_FAULT/KVMI_INJECT_BREAKPOINT command has been overwritten
> > +by an interrupt picked up during guest reentry.
> > +
> > +kvmi_event_x86, exception/interrupt number (vector), exception/interrupt
> > +type, exception code (err) and CR2 are sent to the introspector, which can
> > +respond with the KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing
> > +the host kernel to override the general purpose registers using the values
> > +from introspector (regs).
> 
> Here I think the actions could be RETRY/ALLOW/CRASH only, with "set
> regs" done as a separate command.
> 
> Some x86 exceptions are faults, others are traps.  Traps should not
> allow RETRY as an action.  There should be an input telling the
> introspector if retrying is allowed.
> 
> How are dr6/dr7 handled around breakpoints?

Afaik, the debug registers should remain untouched, but I will ask my
colleagues with better knowledge of this.

Thanks,
Paolo Bonzini July 27, 2017, 5:18 p.m. UTC | #21
On 27/07/2017 19:06, Mihai Donțu wrote:
>>> +
>>> +	struct kvmi_xsave_info_reply {
>>> +		__s32 err;
>>> +		__u32 size;
>>> +	};
>>> +
>>> +Returns the xstate size for the specified vCPU.
>> Could this be replaced by a generic CPUID accessor?
> 
> Absolutely. I wonder if we should only made certain leafs acessible,
> part of the whole "make the least amount of information available"
> security philosophy, though I'm not aware of any attacks that can be
> mounted on the host just by knowing too many things about a guest.

That should be safe.

>> Also, it should be noted here that the spte flags are ANDed with
>> whatever is provided by userspace, because there could be readonly
>> memslots, and that some access combinations could fail (--x) or will
>> surely fail (-wx for example).
> We are closely looking into how KVM's MMU works and see how we'd go
> about extending it so as to make everyone happy (qemu and a possible
> introspection tool).
> 
> As for the permitted page access flags, we have (thus far) kept away
> from invalid combinations on older arch-es (--x on pre-SandyBridge, for
> example). However, it would be very nice to have arch-specific code
> that does a validation ahead of time (ie. before the guest is re-
> entered) so that an introspection tool can use an alternative approach.

We can make the command fail if you end up with --x (and it is not
supported) or -wx.

>> Does KVMI_EVENT_ACTION_SET_REGS bypass the hypercall?  Why is
>> KVMI_EVENT_ACTION_ALLOW not allowed?  As before, I'd prefer
>> SKIP/RETRY/ALLOW/CRASH as the actions.
>
> No, we use SET_REGS only as a means to communicate with the guest code
> that did the hypercall. We don't specify an action because, in our
> specific case, saw no use for it.

Ok, I'd prefer the usual set of actions just for consistency.  ALLOW
would fall through to KVM's handling of the hypercall.

> Apropos: if possible, we'd like to keep the Xen convention for this
> hypercall. It doesn't appear to interfere with either KVM or Hyper-V
> (rax = 34 [__HYPERVISOR_hvm_op], rbx/rdi = 24
> [HVMOP_guest_request_vm_event] - depends on wether long mode is
> enabled).

I was actually thinking that you'd trap all hypercalls to the
introspector, hence SKIP/RETRY/ALLOW/CRASH.  Reserving RAX = 34 for you
is fine (but please document the ABI and subfunctions).

Paolo
Mihai Donțu July 27, 2017, 5:19 p.m. UTC | #22
On Thu, 2017-07-27 at 18:52 +0200, Paolo Bonzini wrote:
> On 27/07/2017 18:23, Mihai Donțu wrote:
> > On Thu, 2017-07-13 at 11:15 +0200, Paolo Bonzini wrote:
> > > On 13/07/2017 10:36, Mihai Donțu wrote:
> > > > On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote:
> > > > > Worse, KVM is not able to distinguish userspace that has paused the VM
> > > > > from userspace that is doing MMIO or userspace that has a bug and hung
> > > > > somewhere.  And even worse, there are cases where userspace wants to
> > > > > modify registers while doing port I/O (the awful VMware RPCI port).  So
> > > > > I'd rather avoid this.
> > > > 
> > > > I should give more details here: we don't need to pause the vCPU-s in
> > > > the sense widely understood but just prevent them from entering the
> > > > guest for a short period of time. In our particular case, we need this
> > > > when we start introspecting a VM that's already running. For this we
> > > > kick the vCPU-s out of the guest so that our scan of the memory does
> > > > not race with the guest kernel/applications.
> > > > 
> > > > Another use case is when we inject applications into a running guest.
> > > > We also kick the vCPU-s out while we atomically make changes to kernel
> > > > specific structures.
> > > 
> > > This is not possible to do in KVM, because KVM doesn't control what
> > > happens to the memory outside KVM_RUN (and of course it doesn't control
> > > devices doing DMA).  You need to talk to QEMU in order to do this.
> > 
> > Maybe add a new exit reason (eg. KVM_EXIT_PAUSE) and have qemu wait on
> > the already opened file descriptor to /dev/kvm for an event?
> 
> Nope.  QEMU might be running and writing to memory in another thread.  I
> don't see how this can be reliable on other hypervisors too, actually.

I assume it largely depends on knowing what's possible to do and what
not with the guest memory even while the vCPU-s are suspended. The
price of breaking this rule will be something any KVMI user will have
to be very aware of.

We did quite a bit of fine-tunning to our application to avoid
interferring with the device manager (or passedthrough devices) and
also it helps when the hypervisor refuses to grant access to certain
memory areas that it knows are not safe to touch (even read) from
remote code.
Paolo Bonzini Aug. 1, 2017, 10:40 a.m. UTC | #23
On 27/07/2017 19:19, Mihai Donțu wrote:
> On Thu, 2017-07-27 at 18:52 +0200, Paolo Bonzini wrote:
>> On 27/07/2017 18:23, Mihai Donțu wrote:
>>> On Thu, 2017-07-13 at 11:15 +0200, Paolo Bonzini wrote:
>>>> On 13/07/2017 10:36, Mihai Donțu wrote:
>>>>> On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote:
>>>>>> Worse, KVM is not able to distinguish userspace that has paused the VM
>>>>>> from userspace that is doing MMIO or userspace that has a bug and hung
>>>>>> somewhere.  And even worse, there are cases where userspace wants to
>>>>>> modify registers while doing port I/O (the awful VMware RPCI port).  So
>>>>>> I'd rather avoid this.
>>>>>
>>>>> I should give more details here: we don't need to pause the vCPU-s in
>>>>> the sense widely understood but just prevent them from entering the
>>>>> guest for a short period of time. In our particular case, we need this
>>>>> when we start introspecting a VM that's already running. For this we
>>>>> kick the vCPU-s out of the guest so that our scan of the memory does
>>>>> not race with the guest kernel/applications.
>>>>>
>>>>> Another use case is when we inject applications into a running guest.
>>>>> We also kick the vCPU-s out while we atomically make changes to kernel
>>>>> specific structures.
>>>>
>>>> This is not possible to do in KVM, because KVM doesn't control what
>>>> happens to the memory outside KVM_RUN (and of course it doesn't control
>>>> devices doing DMA).  You need to talk to QEMU in order to do this.
>>>
>>> Maybe add a new exit reason (eg. KVM_EXIT_PAUSE) and have qemu wait on
>>> the already opened file descriptor to /dev/kvm for an event?
>>
>> Nope.  QEMU might be running and writing to memory in another thread.  I
>> don't see how this can be reliable on other hypervisors too, actually.
> 
> I assume it largely depends on knowing what's possible to do and what
> not with the guest memory even while the vCPU-s are suspended. The
> price of breaking this rule will be something any KVMI user will have
> to be very aware of.

If you actually pause the whole VM (through QEMU's monitor commands
"stop" and "cont") everything should be safe.  Of course there can be
bugs and PCI passthrough devices should be problematic, but in general
the device emulation is quiescent.  This however is not the case when
only the VCPUs are paused.

Paolo

> We did quite a bit of fine-tunning to our application to avoid
> interferring with the device manager (or passedthrough devices) and
> also it helps when the hypervisor refuses to grant access to certain
> memory areas that it knows are not safe to touch (even read) from
> remote code.
Tamas K Lengyel Aug. 1, 2017, 4:33 p.m. UTC | #24
On Tue, Aug 1, 2017 at 4:40 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 27/07/2017 19:19, Mihai Donțu wrote:
>> On Thu, 2017-07-27 at 18:52 +0200, Paolo Bonzini wrote:
>>> On 27/07/2017 18:23, Mihai Donțu wrote:
>>>> On Thu, 2017-07-13 at 11:15 +0200, Paolo Bonzini wrote:
>>>>> On 13/07/2017 10:36, Mihai Donțu wrote:
>>>>>> On Fri, 2017-07-07 at 18:52 +0200, Paolo Bonzini wrote:
>>>>>>> Worse, KVM is not able to distinguish userspace that has paused the VM
>>>>>>> from userspace that is doing MMIO or userspace that has a bug and hung
>>>>>>> somewhere.  And even worse, there are cases where userspace wants to
>>>>>>> modify registers while doing port I/O (the awful VMware RPCI port).  So
>>>>>>> I'd rather avoid this.
>>>>>>
>>>>>> I should give more details here: we don't need to pause the vCPU-s in
>>>>>> the sense widely understood but just prevent them from entering the
>>>>>> guest for a short period of time. In our particular case, we need this
>>>>>> when we start introspecting a VM that's already running. For this we
>>>>>> kick the vCPU-s out of the guest so that our scan of the memory does
>>>>>> not race with the guest kernel/applications.
>>>>>>
>>>>>> Another use case is when we inject applications into a running guest.
>>>>>> We also kick the vCPU-s out while we atomically make changes to kernel
>>>>>> specific structures.
>>>>>
>>>>> This is not possible to do in KVM, because KVM doesn't control what
>>>>> happens to the memory outside KVM_RUN (and of course it doesn't control
>>>>> devices doing DMA).  You need to talk to QEMU in order to do this.
>>>>
>>>> Maybe add a new exit reason (eg. KVM_EXIT_PAUSE) and have qemu wait on
>>>> the already opened file descriptor to /dev/kvm for an event?
>>>
>>> Nope.  QEMU might be running and writing to memory in another thread.  I
>>> don't see how this can be reliable on other hypervisors too, actually.
>>
>> I assume it largely depends on knowing what's possible to do and what
>> not with the guest memory even while the vCPU-s are suspended. The
>> price of breaking this rule will be something any KVMI user will have
>> to be very aware of.
>
> If you actually pause the whole VM (through QEMU's monitor commands
> "stop" and "cont") everything should be safe.  Of course there can be
> bugs and PCI passthrough devices should be problematic, but in general
> the device emulation is quiescent.  This however is not the case when
> only the VCPUs are paused.

IMHO for some use-cases it is sufficient to have the guest itself be
limited in the modifications it makes to memory. So for example if
just a vCPU is paused there are areas of memory that you can interact
with without having to worry about it changing underneath the
introspecting application (ie. thread-specific datastructures like the
KPCR, etc..). If the introspecting application needs access to areas
that non-paused vCPUs may touch, or QEMU, or a pass-through device,
then it should be a decision for the introspecting app whether to
pause the VM completely. It may still choose to instead do some
error-detection on reads/writes to detect inconsistent accesses and
perhaps just re-try the operation till it succeeds. This may have less
of an impact on the performance of the VM as no full VM pause had to
be performed. It is all very application specific, so having options
is always a good thing.

Tamas
Paolo Bonzini Aug. 1, 2017, 8:47 p.m. UTC | #25
On 01/08/2017 18:33, Tamas K Lengyel wrote:
>> If you actually pause the whole VM (through QEMU's monitor commands
>> "stop" and "cont") everything should be safe.  Of course there can be
>> bugs and PCI passthrough devices should be problematic, but in general
>> the device emulation is quiescent.  This however is not the case when
>> only the VCPUs are paused.
> IMHO for some use-cases it is sufficient to have the guest itself be
> limited in the modifications it makes to memory. So for example if
> just a vCPU is paused there are areas of memory that you can interact
> with without having to worry about it changing underneath the
> introspecting application (ie. thread-specific datastructures like the
> KPCR, etc..). If the introspecting application needs access to areas
> that non-paused vCPUs may touch, or QEMU, or a pass-through device,
> then it should be a decision for the introspecting app whether to
> pause the VM completely. It may still choose to instead do some
> error-detection on reads/writes to detect inconsistent accesses and
> perhaps just re-try the operation till it succeeds. This may have less
> of an impact on the performance of the VM as no full VM pause had to
> be performed. It is all very application specific, so having options
> is always a good thing.

Fair enough.  There is another issue however.

If a guest is runnnig in the kernel, it can be easily paused while KVMI
processes events and the like.

While a guest is outside the kernel, it could be running or paused.

If running, the value of a register might change before the VM reenters
execution (due to a reset, or due to ugly features such as the VMware
magic I/O port 0x5658).  So the introspector would probably prefer
anyway to do any changes while the guest is in the kernel: one idea I
had was a KVMI_PAUSE_VCPU command that replies with a KVMI_VCPU_PAUSED
event---then the introspector can send commands that do the required
patching and then restart the guest by replying to the event.

But if the guest is paused, KVMI_PAUSE_VCPU would never be processed.
So how could the introspector distinguish the two cases and avoid the
KVMI_PAUSE_VCPU if the guest is paused?

(There is another complication: the guest could be running with the APIC
emulated in userspace.  In that case, a VCPU doing "cli;hlt" spends
infinite time in userspace even though it's running, and KVM has no idea
why.  This is less common, but it's worth mentioning too).

Paolo
Mihai Donțu Aug. 2, 2017, 11:52 a.m. UTC | #26
On Tue, 2017-08-01 at 22:47 +0200, Paolo Bonzini wrote:
> On 01/08/2017 18:33, Tamas K Lengyel wrote:
> > > If you actually pause the whole VM (through QEMU's monitor commands
> > > "stop" and "cont") everything should be safe.  Of course there can be
> > > bugs and PCI passthrough devices should be problematic, but in general
> > > the device emulation is quiescent.  This however is not the case when
> > > only the VCPUs are paused.
> > 
> > IMHO for some use-cases it is sufficient to have the guest itself be
> > limited in the modifications it makes to memory. So for example if
> > just a vCPU is paused there are areas of memory that you can interact
> > with without having to worry about it changing underneath the
> > introspecting application (ie. thread-specific datastructures like the
> > KPCR, etc..). If the introspecting application needs access to areas
> > that non-paused vCPUs may touch, or QEMU, or a pass-through device,
> > then it should be a decision for the introspecting app whether to
> > pause the VM completely. It may still choose to instead do some
> > error-detection on reads/writes to detect inconsistent accesses and
> > perhaps just re-try the operation till it succeeds. This may have less
> > of an impact on the performance of the VM as no full VM pause had to
> > be performed. It is all very application specific, so having options
> > is always a good thing.
> 
> Fair enough.  There is another issue however.
> 
> If a guest is runnnig in the kernel, it can be easily paused while KVMI
> processes events and the like.
> 
> While a guest is outside the kernel, it could be running or paused.
> 
> If running, the value of a register might change before the VM reenters
> execution (due to a reset, or due to ugly features such as the VMware
> magic I/O port 0x5658).  So the introspector would probably prefer
> anyway to do any changes while the guest is in the kernel: one idea I
> had was a KVMI_PAUSE_VCPU command that replies with a KVMI_VCPU_PAUSED
> event---then the introspector can send commands that do the required
> patching and then restart the guest by replying to the event.
> 
> But if the guest is paused, KVMI_PAUSE_VCPU would never be processed.
> So how could the introspector distinguish the two cases and avoid the
> KVMI_PAUSE_VCPU if the guest is paused?
> 
> (There is another complication: the guest could be running with the APIC
> emulated in userspace.  In that case, a VCPU doing "cli;hlt" spends
> infinite time in userspace even though it's running, and KVM has no idea
> why.  This is less common, but it's worth mentioning too).

I think it might help to distinguish two situations in which we require
the guest _or_ a single vCPU to be paused. Our initial KVMI_PAUSE_GUEST
command can be translated into a qemu pause. In our particular usecase
we made special arrangements to call it as few times as possible
assuming it's very costly. The other is needed only by the internal KVM
code for situations similar to:

  kvm_pause_vcpu(vcpu);
  vcpu_load(vcpu);
  kvm_arch_vcpu_ioctl_get_regs(vcpu, regs);
  vcpu_put(vcpu);
  kvm_unpause_vcpu(vcpu);

or more generally put, for accesses that involve the vCPU state
(registers, MSR-s, exceptions etc.), no guest memory involved.

Here kvm_pause_vcpu() will only pull the vCPU out of the guest and, if
so, make it somehow available for quick re-entry with
kvm_unpause_vcpu(). If said vCPU is already out, then the function will
be a no-op. Obviously, kvm_{pause,unpause}_vcpu() will do nothing if
we're currently handling an event or one is pending.

I hope this narrows down further the exact requirements.

One exception that might have a better solution is:

  kvm_pause_all_vcpus(kvm);
  kvm_set_page_access(kvm, gfn); /* pause for get too? */
  kvm_unpause_all_vcpus(kvm);

There might be a way to make the change and then IPI all vCPU-s without
pulling them out of the guest.

Regards,
Paolo Bonzini Aug. 2, 2017, 12:27 p.m. UTC | #27
On 02/08/2017 13:52, Mihai Donțu wrote:
> I think it might help to distinguish two situations in which we require
> the guest _or_ a single vCPU to be paused. Our initial KVMI_PAUSE_GUEST
> command can be translated into a qemu pause. In our particular usecase
> we made special arrangements to call it as few times as possible
> assuming it's very costly. The other is needed only by the internal KVM
> code for situations similar to:
> 
>   kvm_pause_vcpu(vcpu);
>   vcpu_load(vcpu);
>   kvm_arch_vcpu_ioctl_get_regs(vcpu, regs);
>   vcpu_put(vcpu);
>   kvm_unpause_vcpu(vcpu);
> 
> or more generally put, for accesses that involve the vCPU state
> (registers, MSR-s, exceptions etc.), no guest memory involved.
> 
> Here kvm_pause_vcpu() will only pull the vCPU out of the guest and, if
> so, make it somehow available for quick re-entry with
> kvm_unpause_vcpu(). If said vCPU is already out, then the function will
> be a no-op. Obviously, kvm_{pause,unpause}_vcpu() will do nothing if
> we're currently handling an event or one is pending.

Understood.  The issue is that there is an inherent race between
anything userspace is doing and get_regs.  What are the cases where you
need to get regs or similar outside an event?

> One exception that might have a better solution is:
> 
>   kvm_pause_all_vcpus(kvm);
>   kvm_set_page_access(kvm, gfn); /* pause for get too? */
>   kvm_unpause_all_vcpus(kvm);
> 
> There might be a way to make the change and then IPI all vCPU-s without
> pulling them out of the guest.

For that I think KVMI should define a VM-wide "mask" layered over the
actual memory map permissions.  Such a command can be implemented
relatively easily by hooking into the callers of __gfn_to_pfn_memslot
and kvm_vcpu_gfn_to_hva_prot.

Paolo
Mihai Donțu Aug. 2, 2017, 1:32 p.m. UTC | #28
On Wed, 2017-08-02 at 14:27 +0200, Paolo Bonzini wrote:
> On 02/08/2017 13:52, Mihai Donțu wrote:
> > I think it might help to distinguish two situations in which we require
> > the guest _or_ a single vCPU to be paused. Our initial KVMI_PAUSE_GUEST
> > command can be translated into a qemu pause. In our particular usecase
> > we made special arrangements to call it as few times as possible
> > assuming it's very costly. The other is needed only by the internal KVM
> > code for situations similar to:
> > 
> >   kvm_pause_vcpu(vcpu);
> >   vcpu_load(vcpu);
> >   kvm_arch_vcpu_ioctl_get_regs(vcpu, regs);
> >   vcpu_put(vcpu);
> >   kvm_unpause_vcpu(vcpu);
> > 
> > or more generally put, for accesses that involve the vCPU state
> > (registers, MSR-s, exceptions etc.), no guest memory involved.
> > 
> > Here kvm_pause_vcpu() will only pull the vCPU out of the guest and, if
> > so, make it somehow available for quick re-entry with
> > kvm_unpause_vcpu(). If said vCPU is already out, then the function will
> > be a no-op. Obviously, kvm_{pause,unpause}_vcpu() will do nothing if
> > we're currently handling an event or one is pending.
> 
> Understood.  The issue is that there is an inherent race between
> anything userspace is doing and get_regs.  What are the cases where you
> need to get regs or similar outside an event?

We have currently identified three cases:

 * initial hooking of a guest
 * periodically checking the integrity of data that is not properly
   placed into a page and thus cannot be efficiently tracked via SPT
 * injecting processes

> > One exception that might have a better solution is:
> > 
> >   kvm_pause_all_vcpus(kvm);
> >   kvm_set_page_access(kvm, gfn); /* pause for get too? */
> >   kvm_unpause_all_vcpus(kvm);
> > 
> > There might be a way to make the change and then IPI all vCPU-s without
> > pulling them out of the guest.
> 
> For that I think KVMI should define a VM-wide "mask" layered over the
> actual memory map permissions.  Such a command can be implemented
> relatively easily by hooking into the callers of __gfn_to_pfn_memslot
> and kvm_vcpu_gfn_to_hva_prot.
Paolo Bonzini Aug. 2, 2017, 1:51 p.m. UTC | #29
On 02/08/2017 15:32, Mihai Donțu wrote:
> We have currently identified three cases:
> 
>  * initial hooking of a guest

What triggers the initial hooking, and how is it done?

>  * periodically checking the integrity of data that is not properly
>    placed into a page and thus cannot be efficiently tracked via SPT

This only needs read memory (and it's okay for it to race against DMA
because it's periodic).

>  * injecting processes

This also doesn't need pause.  IIRC you put a breakpoint somewhere, or
make a page non-executable, to ensure the guest doesn't get in the way.
DMA can still get in the way, but that can happen anyway right after
process injection so it's not an issue.

Have you thought about monitoring hardware registers, for example in
order to check that IOMMU page tables protect from overwriting the kernel?

Paolo
Mihai Donțu Aug. 2, 2017, 1:53 p.m. UTC | #30
On Wed, 2017-08-02 at 16:32 +0300, Mihai Donțu wrote:
> On Wed, 2017-08-02 at 14:27 +0200, Paolo Bonzini wrote:
> > On 02/08/2017 13:52, Mihai Donțu wrote:
> > > I think it might help to distinguish two situations in which we require
> > > the guest _or_ a single vCPU to be paused. Our initial KVMI_PAUSE_GUEST
> > > command can be translated into a qemu pause. In our particular usecase
> > > we made special arrangements to call it as few times as possible
> > > assuming it's very costly. The other is needed only by the internal KVM
> > > code for situations similar to:
> > > 
> > >   kvm_pause_vcpu(vcpu);
> > >   vcpu_load(vcpu);
> > >   kvm_arch_vcpu_ioctl_get_regs(vcpu, regs);
> > >   vcpu_put(vcpu);
> > >   kvm_unpause_vcpu(vcpu);
> > > 
> > > or more generally put, for accesses that involve the vCPU state
> > > (registers, MSR-s, exceptions etc.), no guest memory involved.
> > > 
> > > Here kvm_pause_vcpu() will only pull the vCPU out of the guest and, if
> > > so, make it somehow available for quick re-entry with
> > > kvm_unpause_vcpu(). If said vCPU is already out, then the function will
> > > be a no-op. Obviously, kvm_{pause,unpause}_vcpu() will do nothing if
> > > we're currently handling an event or one is pending.
> > 
> > Understood.  The issue is that there is an inherent race between
> > anything userspace is doing and get_regs.  What are the cases where you
> > need to get regs or similar outside an event?
> 
> We have currently identified three cases:
> 
>  * initial hooking of a guest
>  * periodically checking the integrity of data that is not properly
>    placed into a page and thus cannot be efficiently tracked via SPT
>  * injecting processes

A few more details are required here, taken from our current
application: cases (a) and (c) involve poking the vCPU-s a couple of
times before doing KVMI_PAUSE_GUEST() after which everything should be
OK, though we should make sure qemu sync-s the states before entering
the paused state.

Case (b), as far as I can see, is implemented only via memory maps (eg.
the IDT is mapped and kept around for quick checking), but a need
_might_ arrise to poke the vCPU a few times before concluding that an
area has really been corrupted.

> > > One exception that might have a better solution is:
> > > 
> > >   kvm_pause_all_vcpus(kvm);
> > >   kvm_set_page_access(kvm, gfn); /* pause for get too? */
> > >   kvm_unpause_all_vcpus(kvm);
> > > 
> > > There might be a way to make the change and then IPI all vCPU-s without
> > > pulling them out of the guest.
> > 
> > For that I think KVMI should define a VM-wide "mask" layered over the
> > actual memory map permissions.  Such a command can be implemented
> > relatively easily by hooking into the callers of __gfn_to_pfn_memslot
> > and kvm_vcpu_gfn_to_hva_prot.
Mihai Donțu Aug. 2, 2017, 2:17 p.m. UTC | #31
On Wed, 2017-08-02 at 15:51 +0200, Paolo Bonzini wrote:
> On 02/08/2017 15:32, Mihai Donțu wrote:
> > We have currently identified three cases:
> > 
> >  * initial hooking of a guest
> 
> What triggers the initial hooking, and how is it done?

There are two types o hooks: dynamic (the guest is hooked as it boots)
and static (a fully booted guest is being hooked). They both start with
a notification from qemu or some other application that a guest is
available for introspection. After that we poke its vCPU-s a few times
to determine what type of hooking to continue with. I belive the
syscall entry point MSR-s are key here.

> >  * periodically checking the integrity of data that is not properly
> >    placed into a page and thus cannot be efficiently tracked via SPT
> 
> This only needs read memory (and it's okay for it to race against DMA
> because it's periodic).

I just looked through some traces (the logic changed quite a bit since
I last checked) and looks entirely based on memory reads right now.

> >  * injecting processes
> 
> This also doesn't need pause.  IIRC you put a breakpoint somewhere, or
> make a page non-executable, to ensure the guest doesn't get in the way.
> DMA can still get in the way, but that can happen anyway right after
> process injection so it's not an issue.

That might be a very possible approach. The code we have in place now
pauses the guest entirely, though.

I have added in CC a colleague of mine, Andrei Luțaș. He leads the
development of the introspection technology, irrespective of the
hypervisor. Adalbert and I only work on bridging it with KVM. :-) I'll
kindly ask him to help with more details wherever you feel my
explanations were not sufficient.

> Have you thought about monitoring hardware registers, for example in
> order to check that IOMMU page tables protect from overwriting the kernel?

Sorry, but I'm not sure I understand: are you thinking at a way to make
sure none of the IOMMU grups are configured with a "too generous" DMA
window?

Regards,
Paolo Bonzini Aug. 4, 2017, 8:35 a.m. UTC | #32
On 02/08/2017 16:17, Mihai Donțu wrote:
> On Wed, 2017-08-02 at 15:51 +0200, Paolo Bonzini wrote:
>> On 02/08/2017 15:32, Mihai Donțu wrote:
>>> We have currently identified three cases:
>>>
>>>  * initial hooking of a guest
>>
>> What triggers the initial hooking, and how is it done?
> 
> There are two types o hooks: dynamic (the guest is hooked as it boots)
> and static (a fully booted guest is being hooked). They both start with
> a notification from qemu or some other application that a guest is
> available for introspection. After that we poke its vCPU-s a few times
> to determine what type of hooking to continue with. I belive the
> syscall entry point MSR-s are key here.

Reads need not be transactional here, and the syscall entry point MSRs
are generally immutable so I think it is fine not to pause.

I'm curious how the introspector decides that the guest is ready to be
hooked in, that is, that it's far enough in the boot process.

I think a command to "pause" KVM_RUN is okay to add.  That is, if in
KVM_RUN it would e.g. return 1, trigger a 'paused' event immediately and
halt the vCPU.  If not in KVM_RUN, the command would return 0 and
trigger the 'paused' event as soon as the hypervisor invokes KVM_RUN.

The introspector then can decide whether to wait if the commands return
0.  There is no need for an "unpause" command, as that is achieved
simply by completing the 'paused' event.

>> Have you thought about monitoring hardware registers, for example in
>> order to check that IOMMU page tables protect from overwriting the kernel?
> 
> Sorry, but I'm not sure I understand: are you thinking at a way to make
> sure none of the IOMMU grups are configured with a "too generous" DMA
> window?

Yes, exactly.  Optional, of course.

Paolo
Mihai Donțu Aug. 4, 2017, 3:29 p.m. UTC | #33
On Fri, 2017-08-04 at 10:35 +0200, Paolo Bonzini wrote:
> On 02/08/2017 16:17, Mihai Donțu wrote:
> > On Wed, 2017-08-02 at 15:51 +0200, Paolo Bonzini wrote:
> > > On 02/08/2017 15:32, Mihai Donțu wrote:
> > > > We have currently identified three cases:
> > > > 
> > > >  * initial hooking of a guest
> > > 
> > > What triggers the initial hooking, and how is it done?
> > 
> > There are two types o hooks: dynamic (the guest is hooked as it boots)
> > and static (a fully booted guest is being hooked). They both start with
> > a notification from qemu or some other application that a guest is
> > available for introspection. After that we poke its vCPU-s a few times
> > to determine what type of hooking to continue with. I belive the
> > syscall entry point MSR-s are key here.
> 
> Reads need not be transactional here, and the syscall entry point MSRs
> are generally immutable so I think it is fine not to pause.

I might be misunderstanding. Entry point MSR-s (and maybe others) are
generally immutable in well behaving guests. We are, however, looking
for signs that something is breaking this pattern.

> I'm curious how the introspector decides that the guest is ready to be
> hooked in, that is, that it's far enough in the boot process.

I will let Andrei add some details here.

> I think a command to "pause" KVM_RUN is okay to add.  That is, if in
> KVM_RUN it would e.g. return 1, trigger a 'paused' event immediately and
> halt the vCPU.  If not in KVM_RUN, the command would return 0 and
> trigger the 'paused' event as soon as the hypervisor invokes KVM_RUN.
> 
> The introspector then can decide whether to wait if the commands return
> 0.  There is no need for an "unpause" command, as that is achieved
> simply by completing the 'paused' event.

This mechanism will allow exposing a KVMI_PAUSE_VCPU to introspectors,
something that maybe some future users can leverage. I, however, was
trying to justify a "slow" KVMI_PAUSE_VM and "fast" kvm_pause_vcpu(),
the latter existing only in KVM (kernel). The event-based mecanism for
pausing a vCPU feels like it has too much overhead for what one usually
needs it (read some registers).

> > > Have you thought about monitoring hardware registers, for example in
> > > order to check that IOMMU page tables protect from overwriting the kernel?
> > 
> > Sorry, but I'm not sure I understand: are you thinking at a way to make
> > sure none of the IOMMU grups are configured with a "too generous" DMA
> > window?
> 
> Yes, exactly.  Optional, of course.

We could add a command that returns the list of DMA ranges which can
then be used by an introspector to check if the OS has made a mistake
and placed sensible data in that rage.

Also, I belive we have refined a number of ideas on this thread and
more remain to clarify. I would like to update the design document with
what we have agreed on so far, add some more details to what felt
'under explained' and continue again from there. Is it OK for you?

Regards,
Paolo Bonzini Aug. 4, 2017, 3:37 p.m. UTC | #34
On 04/08/2017 17:29, Mihai Donțu wrote:
> On Fri, 2017-08-04 at 10:35 +0200, Paolo Bonzini wrote:
>> On 02/08/2017 16:17, Mihai Donțu wrote:
>>> On Wed, 2017-08-02 at 15:51 +0200, Paolo Bonzini wrote:
>>>> On 02/08/2017 15:32, Mihai Donțu wrote:
>>>>> We have currently identified three cases:
>>>>>
>>>>>  * initial hooking of a guest
>>>>
>>>> What triggers the initial hooking, and how is it done?
>>>
>>> There are two types o hooks: dynamic (the guest is hooked as it boots)
>>> and static (a fully booted guest is being hooked). They both start with
>>> a notification from qemu or some other application that a guest is
>>> available for introspection. After that we poke its vCPU-s a few times
>>> to determine what type of hooking to continue with. I belive the
>>> syscall entry point MSR-s are key here.
>>
>> Reads need not be transactional here, and the syscall entry point MSRs
>> are generally immutable so I think it is fine not to pause.
> 
> I might be misunderstanding. Entry point MSR-s (and maybe others) are
> generally immutable in well behaving guests. We are, however, looking
> for signs that something is breaking this pattern.

Right but you can always set up an MSR write intercept even before the
guest starts.  The MSRs are written rarely-enough that the intercept
doesn't add any noticeable overhead.

Reading the MSR while the guest runs is okay.  If races are important,
the introspectors can detect them through the intercepts it sets up.  In
many cases races are not an issue even (for example, in MTRRs).

>> I think a command to "pause" KVM_RUN is okay to add.  That is, if in
>> KVM_RUN it would e.g. return 1, trigger a 'paused' event immediately and
>> halt the vCPU.  If not in KVM_RUN, the command would return 0 and
>> trigger the 'paused' event as soon as the hypervisor invokes KVM_RUN.
>>
>> The introspector then can decide whether to wait if the commands return
>> 0.  There is no need for an "unpause" command, as that is achieved
>> simply by completing the 'paused' event.
> 
> This mechanism will allow exposing a KVMI_PAUSE_VCPU to introspectors,
> something that maybe some future users can leverage. I, however, was
> trying to justify a "slow" KVMI_PAUSE_VM and "fast" kvm_pause_vcpu(),
> the latter existing only in KVM (kernel). The event-based mecanism for
> pausing a vCPU feels like it has too much overhead for what one usually
> needs it (read some registers).

My point is that usually even register reads (actually MSRs) do not
require a pause.  The event-based mechanism would be used only for the
initial hooking.

KVMI_PAUSE_VM would require QEMU communication, wouldn't it?

> Also, I belive we have refined a number of ideas on this thread and
> more remain to clarify. I would like to update the design document with
> what we have agreed on so far, add some more details to what felt
> 'under explained' and continue again from there. Is it OK for you?

Yes.  Pausing is the last issue that needs to be clarified.

Paolo
Andrei Vlad LUTAS Aug. 5, 2017, 8 a.m. UTC | #35
> -----Original Message-----

> From: Mihai Donțu [mailto:mdontu@bitdefender.com]

> Sent: 4 August, 2017 18:30

> To: Paolo Bonzini <pbonzini@redhat.com>

> Cc: Radim Krčmář <rkrcmar@redhat.com>; Jan Kiszka

> <jan.kiszka@siemens.com>; Stefan Hajnoczi <stefanha@redhat.com>;

> Adalbert Lazar <alazar@bitdefender.com>; kvm@vger.kernel.org; Tamas K

> Lengyel <tamas.k.lengyel@gmail.com>; Andrei Vlad LUTAS

> <vlutas@bitdefender.com>

> Subject: Re: [RFC PATCH v2 1/1] kvm: Add documentation and ABI/API

> header for VM introspection

>

> On Fri, 2017-08-04 at 10:35 +0200, Paolo Bonzini wrote:

> > On 02/08/2017 16:17, Mihai Donțu wrote:

> > > On Wed, 2017-08-02 at 15:51 +0200, Paolo Bonzini wrote:

> > > > On 02/08/2017 15:32, Mihai Donțu wrote:

> > > > > We have currently identified three cases:

> > > > >

> > > > >  * initial hooking of a guest

> > > >

> > > > What triggers the initial hooking, and how is it done?

> > >

> > > There are two types o hooks: dynamic (the guest is hooked as it

> > > boots) and static (a fully booted guest is being hooked). They both

> > > start with a notification from qemu or some other application that a

> > > guest is available for introspection. After that we poke its vCPU-s

> > > a few times to determine what type of hooking to continue with. I

> > > belive the syscall entry point MSR-s are key here.

> >

> > Reads need not be transactional here, and the syscall entry point MSRs

> > are generally immutable so I think it is fine not to pause.

>

> I might be misunderstanding. Entry point MSR-s (and maybe others) are

> generally immutable in well behaving guests. We are, however, looking for

> signs that something is breaking this pattern.

>

> > I'm curious how the introspector decides that the guest is ready to be

> > hooked in, that is, that it's far enough in the boot process.

>

> I will let Andrei add some details here.

>

The first steps of enabling the protection for a guest depends on hardware events (like Mihai said, MSR writes, to give one example). Depending on what MSRs are written & the values they are written with, HVI will be able to identify, based on analyzing the guest memory, the type  of the running OS (Windows x86, Windows x64, Linux, etc.) - of course, how exactly the protection is enabled from that point on (analyzing the SYSCALL/interrupt handlers, other events such as XSETBV, etc.) is OS dependent, and there isn't a single correct answer.

The necessity to pause the running VCPUs for short intervals of time comes in several cases:
- when receiving the first relevant events, such as the SYSCALL MSR writes, we want to ensure no other VCPU could be running that might interfere with memory that we analyse (that might include not-yet-known writable pages)
- when establishing certain intercepts inside the guest memory, we need to briefly modify it, and to do so safely, we pause all the VCPUs, to make sure we don't race with the guest
- when ensuring the integrity of certain structures that already lie inside writable pages (so intercepting writes via an EPT/NPT is not doable, because the induced performance overhead would be huge), so that no other VCPU modifies the said pages while we make sure they haven't been tampered with

Of course, just how Paolo suggested, we can place finer-grained intercepts (such as execute-protect a page in order to ensure no VCPU runs code from it while we modify it), but this is a more complicated solution and we've never had to think for something other than simply pausing the VCPUs, since that was always available so far.

Hope this piece of info helps.

> > I think a command to "pause" KVM_RUN is okay to add.  That is, if in

> > KVM_RUN it would e.g. return 1, trigger a 'paused' event immediately

> > and halt the vCPU.  If not in KVM_RUN, the command would return 0 and

> > trigger the 'paused' event as soon as the hypervisor invokes KVM_RUN.

> >

> > The introspector then can decide whether to wait if the commands

> > return 0.  There is no need for an "unpause" command, as that is

> > achieved simply by completing the 'paused' event.

>

> This mechanism will allow exposing a KVMI_PAUSE_VCPU to introspectors,

> something that maybe some future users can leverage. I, however, was

> trying to justify a "slow" KVMI_PAUSE_VM and "fast" kvm_pause_vcpu(),

> the latter existing only in KVM (kernel). The event-based mecanism for

> pausing a vCPU feels like it has too much overhead for what one usually

> needs it (read some registers).

>

> > > > Have you thought about monitoring hardware registers, for example

> > > > in order to check that IOMMU page tables protect from overwriting the

> kernel?

> > >

> > > Sorry, but I'm not sure I understand: are you thinking at a way to

> > > make sure none of the IOMMU grups are configured with a "too

> > > generous" DMA window?

> >

> > Yes, exactly.  Optional, of course.

>

> We could add a command that returns the list of DMA ranges which can then

> be used by an introspector to check if the OS has made a mistake and placed

> sensible data in that rage.

>

> Also, I belive we have refined a number of ideas on this thread and more

> remain to clarify. I would like to update the design document with what we

> have agreed on so far, add some more details to what felt 'under explained'

> and continue again from there. Is it OK for you?

>

> Regards,

>

> --

> Mihai Donțu

>

>

> ________________________

> This email was scanned by Bitdefender


Best regards,
Andrei.

________________________
This email was scanned by Bitdefender
Paolo Bonzini Aug. 7, 2017, 12:18 p.m. UTC | #36
On 05/08/2017 10:00, Andrei Vlad LUTAS wrote:
> Of course, just how Paolo suggested, we can place finer-grained
> intercepts (such as execute-protect a page in order to ensure no VCPU
> runs code from it while we modify it), but this is a more complicated
> solution and we've never had to think for something other than simply
> pausing the VCPUs, since that was always available so far.
> 
> Hope this piece of info helps.

We can certainly add a "pause the VCPU with a given id" command.  The
command reports its success with an event, and replying to the event
restarts the VCPU.  If the VCPU is currently in userspace, the event
would be delayed until the next time KVM is re-entered, but this should
not be an issue in general.  The introspector can operate as if the VCPU
was paused.

"Pause all VCPUs and stop all DMA" would definitely be a layering
violation, so it cannot be added.

"Pause all VCPUs" is basically a shortcut for many "pause the VCPU with
a given id" commands.  I lean towards omitting it.

However, now that I'm thinking of it, we need a new event for "new VCPU
created".  When the event is enabled, newly-created VCPUs should be in
paused mode.

Paolo
Mihai Donțu Aug. 7, 2017, 1:25 p.m. UTC | #37
On Mon, 2017-08-07 at 14:18 +0200, Paolo Bonzini wrote:
> On 05/08/2017 10:00, Andrei Vlad LUTAS wrote:
> > Of course, just how Paolo suggested, we can place finer-grained
> > intercepts (such as execute-protect a page in order to ensure no VCPU
> > runs code from it while we modify it), but this is a more complicated
> > solution and we've never had to think for something other than simply
> > pausing the VCPUs, since that was always available so far.
> > 
> > Hope this piece of info helps.
> 
> We can certainly add a "pause the VCPU with a given id" command.  The
> command reports its success with an event, and replying to the event
> restarts the VCPU.  If the VCPU is currently in userspace, the event
> would be delayed until the next time KVM is re-entered, but this should
> not be an issue in general.  The introspector can operate as if the VCPU
> was paused.

I have a plan to modify our application a bit and see how often we
query a vCPU outside an event handler. If it's seldom enough, a command
that pauses the vCPU and triggers an event should be just fine.

> "Pause all VCPUs and stop all DMA" would definitely be a layering
> violation, so it cannot be added.
> 
> "Pause all VCPUs" is basically a shortcut for many "pause the VCPU with
> a given id" commands.  I lean towards omitting it.

The case where the introspector wants to scan the guest memory needs a
KVMI_PAUSE_VM, which as discussed in a previous email, can be the
actual qemu 'pause' command. However, we would like to limit the
communication channels we have with the host and not use qmp (or
libvirt/etc. if qmp is not exposed). Instead, have a command that
triggers a KVM_RUN exit to qemu which in turn will call the underlying
pause function used by qmp. Would that be OK with you?

> However, now that I'm thinking of it, we need a new event for "new VCPU
> created".  When the event is enabled, newly-created VCPUs should be in
> paused mode.

I assume you are thinking about vCPU hotplug here. If so, yes, an event
that gives the introspector the chance to update its internal
bookkeeping would be useful.
Paolo Bonzini Aug. 7, 2017, 1:49 p.m. UTC | #38
On 07/08/2017 15:25, Mihai Donțu wrote:
>> "Pause all VCPUs and stop all DMA" would definitely be a layering
>> violation, so it cannot be added.
>>
>> "Pause all VCPUs" is basically a shortcut for many "pause the VCPU with
>> a given id" commands.  I lean towards omitting it.
> 
> The case where the introspector wants to scan the guest memory needs a
> KVMI_PAUSE_VM, which as discussed in a previous email, can be the
> actual qemu 'pause' command.

Do you mean it needs to stop DMA as well?

> However, we would like to limit the
> communication channels we have with the host and not use qmp (or
> libvirt/etc. if qmp is not exposed). Instead, have a command that
> triggers a KVM_RUN exit to qemu which in turn will call the underlying
> pause function used by qmp. Would that be OK with you?

You would have to send back something on completion, and then I am
worried of races and deadlocks.  Plus, pausing a VM at the QEMU level is
a really expensive operation, so I don't think it's a good idea to let
the introspector do this.  You can pause all VCPUs, or use memory page
permissions.

>> However, now that I'm thinking of it, we need a new event for "new VCPU
>> created".  When the event is enabled, newly-created VCPUs should be in
>> paused mode.
> 
> I assume you are thinking about vCPU hotplug here. If so, yes, an event
> that gives the introspector the chance to update its internal
> bookkeeping would be useful.

Yes, exactly.

Paolo
Mihai Donțu Aug. 7, 2017, 2:12 p.m. UTC | #39
On Mon, 2017-08-07 at 15:49 +0200, Paolo Bonzini wrote:
> On 07/08/2017 15:25, Mihai Donțu wrote:
> > > "Pause all VCPUs and stop all DMA" would definitely be a layering
> > > violation, so it cannot be added.
> > > 
> > > "Pause all VCPUs" is basically a shortcut for many "pause the VCPU with
> > > a given id" commands.  I lean towards omitting it.
> > 
> > The case where the introspector wants to scan the guest memory needs a
> > KVMI_PAUSE_VM, which as discussed in a previous email, can be the
> > actual qemu 'pause' command.
> 
> Do you mean it needs to stop DMA as well?

No, DMA can proceed normally. I remain of the opinion that KVMI users
must know what guest memory ranges are OK to access by looking at MTRR-
s, PAT or guest kernel structures, or a combination of all three.

> > However, we would like to limit the
> > communication channels we have with the host and not use qmp (or
> > libvirt/etc. if qmp is not exposed). Instead, have a command that
> > triggers a KVM_RUN exit to qemu which in turn will call the underlying
> > pause function used by qmp. Would that be OK with you?
> 
> You would have to send back something on completion, and then I am
> worried of races and deadlocks.  Plus, pausing a VM at the QEMU level is
> a really expensive operation, so I don't think it's a good idea to let
> the introspector do this.  You can pause all VCPUs, or use memory page
> permissions.

Pausing all vCPU-s was my first thought, I was just trying to follow
your statement: "I lean towards omitting it". :-)

It will take a bit of user-space-fu, in that after issuing N vCPU pause
commands in a row we will have to wait for N events, which might race
with other events (MSR, CRx etc.) which need handling otherwise the
pause ones will not arrive ... I wonder if there's a way to do this
cleanly in kernel (piggyback on the code for pausing a single vCPU and
then somehow 'coalesce' all pause events into a single
KVMI_EVENT_VCPUS_PAUSED).

> > > However, now that I'm thinking of it, we need a new event for "new VCPU
> > > created".  When the event is enabled, newly-created VCPUs should be in
> > > paused mode.
> > 
> > I assume you are thinking about vCPU hotplug here. If so, yes, an event
> > that gives the introspector the chance to update its internal
> > bookkeeping would be useful.
> 
> Yes, exactly.

OK. We will update the design document.
Paolo Bonzini Aug. 7, 2017, 3:56 p.m. UTC | #40
On 07/08/2017 16:12, Mihai Donțu wrote:
> On Mon, 2017-08-07 at 15:49 +0200, Paolo Bonzini wrote:
>> On 07/08/2017 15:25, Mihai Donțu wrote:
>>>> "Pause all VCPUs and stop all DMA" would definitely be a layering
>>>> violation, so it cannot be added.
>>>>
>>>> "Pause all VCPUs" is basically a shortcut for many "pause the VCPU with
>>>> a given id" commands.  I lean towards omitting it.
>>>
>>> The case where the introspector wants to scan the guest memory needs a
>>> KVMI_PAUSE_VM, which as discussed in a previous email, can be the
>>> actual qemu 'pause' command.
>>
>> Do you mean it needs to stop DMA as well?
> 
> No, DMA can proceed normally. I remain of the opinion that KVMI users
> must know what guest memory ranges are OK to access by looking at MTRR-
> s, PAT or guest kernel structures, or a combination of all three.

Ok, good.  Sorry if I am dense on the DMA/no-DMA cases.  (But, I don't
understand your remark about guest memory ranges; the point of
bus-master DMA is that it works on any memory, and cache snooping makes
it even easier for hypothetical malware to do memory writes via
bus-master DMA).

>>> However, we would like to limit the
>>> communication channels we have with the host and not use qmp (or
>>> libvirt/etc. if qmp is not exposed). Instead, have a command that
>>> triggers a KVM_RUN exit to qemu which in turn will call the underlying
>>> pause function used by qmp. Would that be OK with you?
>>
>> You would have to send back something on completion, and then I am
>> worried of races and deadlocks.  Plus, pausing a VM at the QEMU level is
>> a really expensive operation, so I don't think it's a good idea to let
>> the introspector do this.  You can pause all VCPUs, or use memory page
>> permissions.
> 
> Pausing all vCPU-s was my first thought, I was just trying to follow
> your statement: "I lean towards omitting it". :-)

Yes, and I still do because a hypothetical "pause all VCPUs" command
still has the issue that you could get other events before the command
completes.  So I am not convinced that a specialized command actually
makes the introspector code much simpler.

I hope you understand that I want to keep the trusted base (not just the
code I maintain, though that is a secondary benefit ;)) as simple as
possible.

> It will take a bit of user-space-fu, in that after issuing N vCPU pause
> commands in a row we will have to wait for N events, which might race
> with other events (MSR, CRx etc.) which need handling otherwise the
> pause ones will not arrive

The same issue would be there in QEMU or KVM though.

If you can always request "pause all vCPUs" from an event handler,
avoiding deadlocks is relatively easy.  If you cannot ensure that, for
example because of work that is scheduled periodically, you can send a
KVM_PAUSE command to ensure the work is done in a safe condition.

Then you get the following pseudocode algorithm:

    // a vCPU is not executing guest code, and it's going to check
    // num_pause_vm_requests before going back to guest code
    vcpu_not_running(id) {
        unmark vCPU "id" as running
        if (num vcpus running == 0)
            cond_broadcast(no_running_cpus)
    }

    pause_vcpu(id) {
        mark vCPU "id" as being-paused
        send KVMI_PAUSED for the vcpu
    }

    // return only when no vCPU is in KVM_RUN
    pause_vm() {
        if this vCPU is running
            if not in an event handler
                // caller should do pause_vcpu and defer the work
                return

            // we know this vCPU is not KVM_RUN
            vcpu_not_running()

        num_pause_vm_requests++
        if (num vcpus running > 0)
            for each vCPU that is running and not being-paused
                pause_vcpu(id)
            while (num vcpus running > 0)
                cond_wait(no_running_vcpus)
    }

    // tell paused vCPUs that they can resume
    resume_vm() {
        num_pause_vm_requests--
        if (num_pause_all_requests == 0)
            cond_broadcast(no_pending_pause_vm_requests)
        // either we're in an event handler, or a "pause" command was
        // sent for this vCPU.  in any case we're guaranteed to do an
        // event_reply sooner or later, which will again mark the vCPU
        // as running
    }

    // after an event reply, the vCPU goes back to KVM_RUN.  therefore
    // an event reply can act as a synchronization point for pause-vm
    // requests: delay the reply if there's such a request
    event_reply(id, data) {
        if (num_pause_vm_requests > 0) {
            if vCPU "id" is running
                vcpu_not_running(id)
            while (num_pause_vm_requests > 0)
                cond_wait(no_pending_pause_vm_requests)
        }
        mark vCPU "id" as running
        send event reply on KVMI socket
    }

    // this is what you do when KVM tells you that the guest is either
    // in userspace, or waiting to be woken up ("paused" event).  from
    // the introspector POV the two are the same.
    vcpu_ack_pause(id) {
        vcpu_not_running(id)
        unmark vCPU "id" as being-paused

        // deferred work presumably calls pause_vm/resumve_vm, and this
        // vCPU is not running now, so this is a nice point to flush it
        if any deferred work exists, do it now
    }

and on the KVMI read handler:

    on reply to "pause" command:
        if reply says the vCPU is currently in userspace
            // we'll get a KVMI_PAUSED event as soon as the host
            // reenters KVM with KVM_RUN, but we can already say the
            // CPU is not running
            vcpu_ack_pause()

    on "paused" event:
        vcpu_ack_pause()
        event_reply()

Paolo
Mihai Donțu Aug. 7, 2017, 4:44 p.m. UTC | #41
On Mon, 2017-08-07 at 17:56 +0200, Paolo Bonzini wrote:
> On 07/08/2017 16:12, Mihai Donțu wrote:
> > On Mon, 2017-08-07 at 15:49 +0200, Paolo Bonzini wrote:
> > > On 07/08/2017 15:25, Mihai Donțu wrote:
> > > > > "Pause all VCPUs and stop all DMA" would definitely be a layering
> > > > > violation, so it cannot be added.
> > > > > 
> > > > > "Pause all VCPUs" is basically a shortcut for many "pause the VCPU with
> > > > > a given id" commands.  I lean towards omitting it.
> > > > 
> > > > The case where the introspector wants to scan the guest memory needs a
> > > > KVMI_PAUSE_VM, which as discussed in a previous email, can be the
> > > > actual qemu 'pause' command.
> > > 
> > > Do you mean it needs to stop DMA as well?
> > 
> > No, DMA can proceed normally. I remain of the opinion that KVMI users
> > must know what guest memory ranges are OK to access by looking at MTRR-
> > s, PAT or guest kernel structures, or a combination of all three.
> 
> Ok, good.  Sorry if I am dense on the DMA/no-DMA cases.

I think it's OK to restate things, especially since my (our) view on
these matters might not match the KVM reality that you know far
better.

> (But, I don't understand your remark about guest memory ranges; the point of
> bus-master DMA is that it works on any memory, and cache snooping makes
> it even easier for hypothetical malware to do memory writes via
> bus-master DMA).

This is where I separated things in my head: I merely limited myself to
accessing memory, while leaving the reality of DMA-based attacks a
problem to be solved separately. There is some reasearch work being
tested internally on that, but I have yet to get in touch with the
people involved in it. As soon as I get some details maybe we can
connect something in KVM.

> > > > However, we would like to limit the
> > > > communication channels we have with the host and not use qmp (or
> > > > libvirt/etc. if qmp is not exposed). Instead, have a command that
> > > > triggers a KVM_RUN exit to qemu which in turn will call the underlying
> > > > pause function used by qmp. Would that be OK with you?
> > > 
> > > You would have to send back something on completion, and then I am
> > > worried of races and deadlocks.  Plus, pausing a VM at the QEMU level is
> > > a really expensive operation, so I don't think it's a good idea to let
> > > the introspector do this.  You can pause all VCPUs, or use memory page
> > > permissions.
> > 
> > Pausing all vCPU-s was my first thought, I was just trying to follow
> > your statement: "I lean towards omitting it". :-)
> 
> Yes, and I still do because a hypothetical "pause all VCPUs" command
> still has the issue that you could get other events before the command
> completes.  So I am not convinced that a specialized command actually
> makes the introspector code much simpler.
> 
> I hope you understand that I want to keep the trusted base (not just the
> code I maintain, though that is a secondary benefit ;)) as simple as
> possible.
> 
> > It will take a bit of user-space-fu, in that after issuing N vCPU pause
> > commands in a row we will have to wait for N events, which might race
> > with other events (MSR, CRx etc.) which need handling otherwise the
> > pause ones will not arrive
> 
> The same issue would be there in QEMU or KVM though.
> 
> If you can always request "pause all vCPUs" from an event handler,
> avoiding deadlocks is relatively easy.  If you cannot ensure that, for
> example because of work that is scheduled periodically, you can send a
> KVM_PAUSE command to ensure the work is done in a safe condition.
> 
> Then you get the following pseudocode algorithm:
> 
>     // a vCPU is not executing guest code, and it's going to check
>     // num_pause_vm_requests before going back to guest code
>     vcpu_not_running(id) {
>         unmark vCPU "id" as running
>         if (num vcpus running == 0)
>             cond_broadcast(no_running_cpus)
>     }
> 
>     pause_vcpu(id) {
>         mark vCPU "id" as being-paused
>         send KVMI_PAUSED for the vcpu
>     }
> 
>     // return only when no vCPU is in KVM_RUN
>     pause_vm() {
>         if this vCPU is running
>             if not in an event handler
>                 // caller should do pause_vcpu and defer the work
>                 return
> 
>             // we know this vCPU is not KVM_RUN
>             vcpu_not_running()
> 
>         num_pause_vm_requests++
>         if (num vcpus running > 0)
>             for each vCPU that is running and not being-paused
>                 pause_vcpu(id)
>             while (num vcpus running > 0)
>                 cond_wait(no_running_vcpus)
>     }
> 
>     // tell paused vCPUs that they can resume
>     resume_vm() {
>         num_pause_vm_requests--
>         if (num_pause_all_requests == 0)
>             cond_broadcast(no_pending_pause_vm_requests)
>         // either we're in an event handler, or a "pause" command was
>         // sent for this vCPU.  in any case we're guaranteed to do an
>         // event_reply sooner or later, which will again mark the vCPU
>         // as running
>     }
> 
>     // after an event reply, the vCPU goes back to KVM_RUN.  therefore
>     // an event reply can act as a synchronization point for pause-vm
>     // requests: delay the reply if there's such a request
>     event_reply(id, data) {
>         if (num_pause_vm_requests > 0) {
>             if vCPU "id" is running
>                 vcpu_not_running(id)
>             while (num_pause_vm_requests > 0)
>                 cond_wait(no_pending_pause_vm_requests)
>         }
>         mark vCPU "id" as running
>         send event reply on KVMI socket
>     }
> 
>     // this is what you do when KVM tells you that the guest is either
>     // in userspace, or waiting to be woken up ("paused" event).  from
>     // the introspector POV the two are the same.
>     vcpu_ack_pause(id) {
>         vcpu_not_running(id)
>         unmark vCPU "id" as being-paused
> 
>         // deferred work presumably calls pause_vm/resumve_vm, and this
>         // vCPU is not running now, so this is a nice point to flush it
>         if any deferred work exists, do it now
>     }
> 
> and on the KVMI read handler:
> 
>     on reply to "pause" command:
>         if reply says the vCPU is currently in userspace
>             // we'll get a KVMI_PAUSED event as soon as the host
>             // reenters KVM with KVM_RUN, but we can already say the
>             // CPU is not running
>             vcpu_ack_pause()
> 
>     on "paused" event:
>         vcpu_ack_pause()
>         event_reply()

Thank you for this!
diff mbox

Patch

diff --git a/Documentation/virtual/kvm/kvmi.rst b/Documentation/virtual/kvm/kvmi.rst
new file mode 100644
index 000000000000..63d3a75d5ffc
--- /dev/null
+++ b/Documentation/virtual/kvm/kvmi.rst
@@ -0,0 +1,985 @@ 
+=========================================================
+KVMi - the kernel virtual machine introspection subsystem
+=========================================================
+
+The KVM introspection subsystem provides a facility for applications running
+on the host or in a separate VM, to control the execution of other VM-s
+(pause, resume, shutdown), query the state of the vCPU-s (GPR-s, MSR-s etc.),
+alter the page access bits in the shadow page tables (only for the hardware
+backed ones, eg. Intel's EPT) and receive notifications when events of
+interest have taken place (shadow page table level faults, key MSR writes,
+hypercalls etc.). Some notifications can be responded to with an action
+(like preveting an MSR from being written), others are mere informative
+(like breakpoint events which are used for execution tracing), though the
+option to alter the GPR-s is common to each of them (usually the program
+counter is advanced past the instruction that triggered the guest exit).
+All events are optional. An application using this subsystem will explicitly
+register for them.
+
+The use case that gave way for the creation of this subsystem is to monitor
+the guest OS and as such the ABI/API is higly influenced by how the guest
+software (kernel, applications) see the world. For example, some events
+provide information specific for the host CPU architecture
+(eg. MSR_IA32_SYSENTER_EIP) merely because its leveraged by guest software
+to implement a critical feature (fast system calls).
+
+At the moment, the target audience for VMI are security software authors
+that wish to perform forensics on newly discovered threats (exploits) or
+to implement another layer of security like preventing a large set of
+kernel rootkits simply by "locking" the kernel image in the shadow page
+tables (ie. enforce .text r-x, .rodata rw- etc.). It's the latter case that
+made VMI a separate subsystem, even though many of these features are
+available in the device manager (eg. qemu). The ability to build a security
+application that does not interfere (in terms of performance) with the
+guest software asks for a specialized interface that is designed for minimum
+overhead.
+
+API/ABI
+=======
+
+This chapter describes the VMI interface used to monitor and control local
+guests from an user application.
+
+Overview
+--------
+
+The interface is socket based, one connection for every VM. One end is in the
+host kernel while the other is held by the user application (introspection
+tool).
+
+The initial connection is established by an application running on the host
+(eg. qemu) that connects to the introspection tool and after a handshake the
+socket is passed to the host kernel making all further communication take
+place between it and the introspection tool. The initiating party (qemu) can
+close its end so that any potential exploits cannot take a hold of it.
+
+The socket protocol allows for commands and events to be multiplexed over
+the same connection. A such, it is possible for the introspection tool to
+receive an event while waiting for the result of a command. Also, it can
+send a command while the host kernel is waiting for a reply to an event.
+
+The kernel side of the socket communication is blocking and will wait for
+an answer from its peer indefinitely or until the guest is powered off
+(killed) at which point it will wake up and properly cleanup. If the peer
+goes away KVM will exit to user space and the device manager will try and
+reconnect. If it fails, the device manager will inform KVM to cleanup and
+continue normal guest execution as if the introspection subsystem has never
+been used on that guest.
+
+All events have a common header::
+
+	struct kvmi_socket_hdr {
+		__u16 msg_id;
+		__u16 size;
+		__u32 seq;
+	};
+
+and all need a reply with the same kind of header, having the same
+sequence number (seq) and the same message id (msg_id).
+
+Because events from different vCPU threads can send messages at the same
+time and the replies can come in any order, the receiver loop uses the
+sequence number (seq) to identify which reply belongs to which vCPU, in
+order to dispatch the message to the right thread waiting for it.
+
+After 'kvmi_socket_hdr', 'msg_id' specific data of 'kvmi_socket_hdr.size'
+bytes will follow.
+
+The message header and its data must be sent with one write() call
+to the socket (as a whole). This simplifies the receiver loop and avoids
+the recontruction of messages on the other side.
+
+The wire protocol uses the host native byte-order. The introspection tool
+must check this during the handshake and do the necessary conversion.
+
+Replies to commands have an error code (__s32) at offset 0 in the message
+data. Specific message data will follow this. If the error code is not
+zero, all the other data members will have undefined content (not random
+heap or stack data, but valid results at the time of the failure), unless
+otherwise specified.
+
+In case of an unsupported command, the message data will contain only
+the error code (-ENOSYS).
+
+The error code is related to the processing of the corresponding
+message. For all the other errors (socket errrors, incomplete messages,
+wrong sequence numbers etc.) the socket must be closed and the connection
+can be retried.
+
+While all commands will have a reply as soon as possible, the replies
+to events will probably be delayed until a set of (new) commands will
+complete::
+
+   Host kernel               Tool
+   -----------               --------
+   event 1 ->
+                             <- command 1
+   command 1 reply ->
+                             <- command 2
+   command 2 reply ->
+                             <- event 1 reply
+
+If both ends send a message "in the same time"::
+
+   KVMi                      Userland
+   ----                     --------
+   event X ->               <- command X
+
+the host kernel should reply to 'command X', regardless of the receive time
+(before or after the 'event X' was sent).
+
+As it can be seen below, the wire protocol specifies occasional padding. This
+is to permit working with the data by directly using C structures. The
+members should have the 'natural' alignment of the host.
+
+To describe the commands/events, we reuse some conventions from api.txt:
+
+  - Architectures: which instruction set architectures providing this command/event
+
+  - Versions: which versions provide this command/event
+
+  - Parameters: incoming message data
+
+  - Returns: outgoing/reply message data
+
+Handshake
+---------
+
+Allthough this falls out of the scope of the introspection subsystem, below
+is a proposal of a handshake that can be used by implementors.
+
+The device manager will connect to the introspection tool and wait for a
+cryptographic hash of a cookie that should be known by both peers. If the
+hash is correct (the destination has been "authenticated"), the device
+manager will send another cryptographic hash and random salt. The peer
+recomputes the hash of the cookie bytes including the salt and if they match,
+the device manager has been "authenticated" too. This is a rather crude
+system that makes it difficult for device manager exploits to trick the
+introspection tool into believing its working OK.
+
+The cookie would normally be generated by a management tool (eg. libvirt)
+and make it available to the device manager and to a properly authenticated
+client. It is the job of a third party to retrieve the cookie from the
+management application and pass it over a secure channel to the introspection
+tool.
+
+Once the basic "authentication" has taken place, the introspection tool
+can receive information on the guest (its UUID) and other flags (endianness
+or features supported by the host kernel).
+
+Introspection capabilities
+--------------------------
+
+TODO
+
+Commands
+--------
+
+The following C structures are meant to be used directly when communicating
+over the wire. The peer that detects any size mismatch should simply close
+the connection and report the error.
+
+1. KVMI_GET_VERSION
+-------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: {}
+:Returns: ↴
+
+::
+
+	struct kvmi_get_version_reply {
+		__s32 err;
+		__u32 version;
+	};
+
+Returns the introspection API version (the KVMI_VERSION constant) and the
+error code (zero). In case of an unlikely error, the version will have an
+undefined value.
+
+2. KVMI_GET_GUEST_INFO
+----------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: {}
+:Returns: ↴
+
+::
+
+	struct kvmi_get_guest_info_reply {
+		__s32 err;
+		__u16 vcpu_count;
+		__u16 padding;
+		__u64 tsc_speed;
+	};
+
+Returns the number of online vcpus, and the TSC frequency in HZ, if supported
+by the architecture (otherwise is 0).
+
+3. KVMI_PAUSE_GUEST
+-------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: {}
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+This command will pause all vcpus threads, by getting them out of guest mode
+and put them in the "waiting introspection commands" state.
+
+4. KVMI_UNPAUSE_GUEST
+---------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: {}
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Resume the vcpu threads, or at least get them out of "waiting introspection
+commands" state.
+
+5. KVMI_SHUTDOWN_GUEST
+----------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: {}
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Ungracefully shutdown the guest.
+
+6. KVMI_GET_REGISTERS
+---------------------
+
+:Architectures: x86 (could be all, but with different input/output)
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_get_registers_x86 {
+		__u16 vcpu;
+		__u16 nmsrs;
+		__u32 msrs_idx[0];
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_get_registers_x86_reply {
+		__s32 err;
+		__u32 mode;
+		struct kvm_regs  regs;
+		struct kvm_sregs sregs;
+		struct kvm_msrs  msrs;
+	};
+
+For the given vcpu_id and the nmsrs sized array of MSRs registers, returns
+the vCPU mode (in bytes: 2, 4 or 8), the general purpose registers,
+the special registers and the requested set of MSR-s.
+
+7. KVMI_SET_REGISTERS
+---------------------
+
+:Architectures: x86 (could be all, but with different input)
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_set_registers_x86 {
+		__u16 vcpu;
+		__u16 padding[3];
+		struct kvm_regs regs;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Sets the general purpose registers for the given vcpu_id.
+
+8. KVMI_GET_MTRR_TYPE
+---------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_mtrr_type {
+		__u64 gpa;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_mtrr_type_reply {
+		__s32 err;
+		__u32 type;
+	};
+
+Returns the guest memory type for a specific physical address.
+
+9. KVMI_GET_MTRRS
+-----------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_mtrrs {
+		__u16 vcpu;
+		__u16 padding[3];
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_mtrrs_reply {
+		__s32 err;
+		__u32 padding;
+		__u64 pat;
+		__u64 cap;
+		__u64 type;
+	};
+
+Returns MSR_IA32_CR_PAT, MSR_MTRRcap and MSR_MTRRdefType for the specified
+vCPU.
+
+10. KVMI_GET_XSAVE_INFO
+-----------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_xsave_info {
+		__u16 vcpu;
+		__u16 padding[3];
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_xsave_info_reply {
+		__s32 err;
+		__u32 size;
+	};
+
+Returns the xstate size for the specified vCPU.
+
+11. KVMI_GET_PAGE_ACCESS
+------------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_get_page_access {
+		__u16 vcpu;
+		__u16 padding[3];
+		__u64 gpa;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_get_page_access_reply {
+		__s32 err;
+		__u32 access;
+	};
+
+Returns the spte flags (rwx - present, write & user) for the specified
+vCPU and guest physical address.
+
+12. KVMI_SET_PAGE_ACCESS
+------------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_set_page_access {
+		__u16 vcpu;
+		__u16 padding;
+		__u32 access;
+		__u64 gpa;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Sets the spte flags (rwx - present, write & user) - access - for the specified
+vCPU and guest physical address.
+
+13. KVMI_INJECT_PAGE_FAULT
+--------------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_page_fault {
+		__u16 vcpu;
+		__u16 padding;
+		__u32 error;
+		__u64 gva;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Injects a vCPU page fault with the specified guest virtual address and
+error code.
+
+14. KVMI_INJECT_BREAKPOINT
+--------------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_inject_breakpoint {
+		__u16 vcpu;
+		__u16 padding[3];
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Injects a breakpoint for the specified vCPU. This command is usually sent in
+response to an event and as such the proper GPR-s will be set with the reply.
+
+15. KVMI_MAP_PHYSICAL_PAGE_TO_GUEST
+-----------------------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_map_physical_page_to_guest {
+		__u64 gpa_src;
+		__u64 gfn_dest;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Maps a page from an introspected guest memory (gpa_src) to the guest running
+the introspection tool. 'gfn_dest' points to an anonymous, locked mapping one
+page in size.
+
+This command is used to "read" the introspected guest memory and potentially
+place patches (eg. INT3-s).
+
+16. KVMI_UNMAP_PHYSICAL_PAGE_FROM_GUEST
+---------------------------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_unmap_physical_page_from_guest {
+		__u64 gfn_dest;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Unmaps a previously mapped page.
+
+17. KVMI_CONTROL_EVENTS
+-----------------------
+
+:Architectures: all
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_control_events {
+		__u16 vcpu;
+		__u16 padding;
+		__u32 events;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Enables/disables vCPU introspection events, by setting/clearing one or more
+of the following bits (see 'Events' below) :
+
+	KVMI_EVENT_CR
+	KVMI_EVENT_MSR
+	KVMI_EVENT_XSETBV
+	KVMI_EVENT_BREAKPOINT
+	KVMI_EVENT_USER_CALL
+	KVMI_EVENT_PAGE_FAULT
+	KVMI_EVENT_TRAP
+
+Trying to enable unsupported events (~KVMI_KNOWN_EVENTS) by the current
+architecture would fail and -EINVAL will be returned.
+
+18. KVMI_CR_CONTROL
+-------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_cr_control {
+		__u8 enable;
+		__u8 padding[3];
+		__u32 cr;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Enables/disables introspection for a specific CR register and must
+be used in addition to KVMI_CONTROL_EVENTS with the KVMI_EVENT_CR bit
+flag set.
+
+Eg. kvmi_cr_control { .enable=1, .cr=3 } will enable introspection
+for CR3.
+
+Currently, trying to set any register but CR0, CR3 and CR4 will return
+-EINVAL.
+
+19. KVMI_MSR_CONTROL
+--------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_msr_control {
+		__u8 enable;
+		__u8 padding[3];
+		__u32 msr;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_error_code {
+		__s32 err;
+		__u32 padding;
+	};
+
+Enables/disables introspection for a specific MSR, and must be used
+in addition to KVMI_CONTROL_EVENTS with the KVMI_EVENT_MSR bit flag set.
+
+Currently, only MSRs within the following 3 ranges are supported. Trying
+to control any other register will return -EINVAL. ::
+
+	0          ... 0x00001fff
+	0x40000000 ... 0x40001fff
+	0xc0000000 ... 0xc0001fff
+
+Events
+------
+
+All vcpu events are sent using the KVMI_EVENT_VCPU message id. No event will
+be sent unless enabled with a KVMI_CONTROL_EVENTS command.
+
+For x86, the message data starts with a common structure::
+
+	struct kvmi_event_x86 {
+		__u16 vcpu;
+		__u8 mode;
+		__u8 padding1;
+		__u32 event;
+		struct kvm_regs regs;
+		struct kvm_sregs sregs;
+		struct {
+			__u64 sysenter_cs;
+			__u64 sysenter_esp;
+			__u64 sysenter_eip;
+			__u64 efer;
+			__u64 star;
+			__u64 lstar;
+		} msrs;
+	};
+
+In order to help the introspection tool with the event analysis while
+avoiding unnecessary introspection commands, the message data holds some
+registers (kvm_regs, kvm_sregs and a couple of MSR-s) beside
+the vCPU id, its mode (in bytes) and the event (one of the flags set
+with the KVMI_CONTROL_EVENTS command).
+
+The replies to events also start with a common structure, having the
+KVMI_EVENT_VCPU_REPLY message id::
+
+	struct kvmi_event_x86_reply {
+		struct kvm_regs regs;
+		__u32 actions;
+		__u32 padding;
+	};
+
+The 'actions' member holds one or more flags. For example, if
+KVMI_EVENT_ACTION_SET_REGS is set, the general purpose registers will
+be overwritten with the new values (regs) from introspector.
+
+Specific data can follow these common structures.
+
+1. KVMI_EVENT_CR
+----------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_event_x86;
+	struct kvmi_event_cr {
+		__u16 cr;
+		__u16 padding[3];
+		__u64 old_value;
+		__u64 new_value;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_event_x86_reply;
+	struct kvmi_event_cr_reply {
+		__u64 new_val;
+	};
+
+This event is sent when a CR register was modified and the introspection
+has already been enabled for this kind of event (KVMI_CONTROL_EVENTS)
+and for this specific register (KVMI_CR_CONTROL).
+
+kvmi_event_x86, the CR number, the old value and the new value are
+sent to the introspector, which can respond with one or more action flags:
+
+   KVMI_EVENT_ACTION_SET_REGS - override the general purpose registers
+   using the values from introspector (regs)
+
+   KVMI_EVENT_ACTION_ALLOW - allow the register modification with the
+   value from introspector (new_val), otherwise deny the modification
+   but allow the guest to proceed as if the register has been loaded
+   with the desired value.
+
+2. KVMI_EVENT_MSR
+-----------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_event_x86;
+	struct kvmi_event_msr {
+		__u32 msr;
+		__u32 padding;
+		__u64 old_value;
+		__u64 new_value;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_event_x86_reply;
+	struct kvmi_event_msr_reply {
+		__u64 new_val;
+	};
+
+This event is sent when a MSR was modified and the introspection has already
+been enabled for this kind of event (KVMI_CONTROL_EVENTS) and for this
+specific register (KVMI_MSR_CONTROL).
+
+kvmi_event_x86, the MSR number, the old value and the new value are
+sent to the introspector, which can respond with one or more action flags:
+
+   KVMI_EVENT_ACTION_SET_REGS - override the general purpose registers
+   using the values from introspector (regs)
+
+   KVMI_EVENT_ACTION_ALLOW - allow the register modification with the
+   value from introspector (new_val), otherwise deny the modification
+   but allow the guest to proceed as if the register has been loaded
+   with the desired value.
+
+3. KVMI_EVENT_XSETBV
+--------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_event_x86;
+	struct kvmi_event_xsetbv {
+		__u64 xcr0;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_event_x86_reply;
+
+This event is sent when the extended control register XCR0 was modified
+and the introspection has already been enabled for this kind of event
+(KVMI_CONTROL_EVENTS).
+
+kvmi_event_x86 and the new value are sent to the introspector, which
+can respond with the KVMI_EVENT_ACTION_SET_REGS bit set in 'actions',
+instructing KVMi to override the general purpose registers using the
+values from introspector (regs).
+
+4. KVMI_EVENT_BREAKPOINT
+------------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_event_x86;
+	struct kvmi_event_breakpoint {
+		__u64 gpa;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_event_x86_reply;
+
+This event is sent when a breakpoint was reached and the introspection has
+already been enabled for this kind of event (KVMI_CONTROL_EVENTS).
+
+kvmi_event_x86 and the guest physical address are sent to the introspector,
+which can respond with one or more action flags:
+
+   KVMI_EVENT_ACTION_SET_REGS - override the general purpose registers
+   using the values from introspector (regs)
+
+   KVMI_EVENT_ACTION_ALLOW - is implied if not specified
+
+5. KVMI_EVENT_USER_CALL
+-----------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_event_x86;
+
+:Returns: ↴
+
+::
+
+	struct kvmi_event_x86_reply;
+
+This event is sent on a user hypercall and the introspection has already
+already been enabled for this kind of event (KVMI_CONTROL_EVENTS).
+
+kvmi_event_x86 is sent to the introspector, which can respond with the
+KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing the host
+kernel to override the general purpose registers using the values from
+introspector (regs).
+
+6. KVMI_EVENT_PAGE_FAULT
+------------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_event_x86;
+	struct kvmi_event_page_fault {
+		__u64 gva;
+		__u64 gpa;
+		__u32 mode;
+		__u32 padding;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_event_x86_reply;
+	struct kvmi_event_page_fault_reply {
+		__u32 ctx_size;
+		__u8 ctx_data[256];
+	};
+
+This event is sent if a hypervisor page fault was encountered, the
+introspection has already enabled the reports for this kind of event
+(KVMI_CONTROL_EVENTS), and it was generated for a page for which the
+introspector has shown interest (ie. has previously touched it by
+adjusting the permissions).
+
+kvmi_event_x86, guest virtual address, guest physical address and
+the exit qualification (mode) are sent to the introspector, which
+can respond with one or more action flags:
+
+   KVMI_EVENT_ACTION_SET_REGS - override the general purpose registers
+   using the values from introspector (regs)
+
+   (KVMI_EVENT_ALLOW | KVMI_EVENT_NOEMU) - let the guest re-trigger
+   the page fault
+
+   (KVMI_EVENT_ALLOW | KVMI_EVENT_SET_CTX) - allow the page fault
+   via emulation but with custom input (ctx_data, ctx_size). This is
+   used to trick the guest software into believing it has read
+   certain data. In practice it is used to hide the contents of certain
+   memory areas
+
+   KVMI_EVENT_ALLOW - allow the page fault via emulation
+
+If KVMI_EVENT_ALLOW is not set, it will fall back to the page fault handler
+which usually implies overwriting any spte page access changes made before.
+An introspection tool will always set this flag and prevent unwanted changes
+to memory by skipping the instruction. It is up to the tool to adjust the
+program counter in order to achieve this result.
+
+7. KVMI_EVENT_TRAP
+------------------
+
+:Architectures: x86
+:Versions: >= 1
+:Parameters: ↴
+
+::
+
+	struct kvmi_event_x86;
+	struct kvmi_event_trap {
+		__u32 vector;
+		__u32 type;
+		__u32 err;
+		__u32 padding;
+		__u64 cr2;
+	};
+
+:Returns: ↴
+
+::
+
+	struct kvmi_event_x86_reply;
+
+This event is sent if a trap will be delivered to the guest (page fault,
+breakpoint, etc.) and the introspection has already enabled the reports
+for this kind of event (KVMI_CONTROL_EVENTS).
+
+This is used to inform the introspector of all pending traps giving it
+a chance to determine if it should try again later in case a previous
+KVMI_INJECT_PAGE_FAULT/KVMI_INJECT_BREAKPOINT command has been overwritten
+by an interrupt picked up during guest reentry.
+
+kvmi_event_x86, exception/interrupt number (vector), exception/interrupt
+type, exception code (err) and CR2 are sent to the introspector, which can
+respond with the KVMI_EVENT_ACTION_SET_REGS bit set in 'actions', instructing
+the host kernel to override the general purpose registers using the values
+from introspector (regs).
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
new file mode 100644
index 000000000000..54a2d8ebf649
--- /dev/null
+++ b/include/uapi/linux/kvmi.h
@@ -0,0 +1,310 @@ 
+/*
+ * Copyright (C) 2017 Bitdefender S.R.L.
+ *
+ * The KVMI Library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the named License, or any later version.
+ *
+ * The KVMI Library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with the KVMI Library; if not, see <http://www.gnu.org/licenses/>
+ */
+#ifndef __KVMI_H_INCLUDED__
+#define __KVMI_H_INCLUDED__
+
+#include "asm/kvm.h"
+#include <linux/types.h>
+
+#define KVMI_VERSION 0x00000001
+
+#define KVMI_EVENT_CR         (1 << 1)	/* control register was modified */
+#define KVMI_EVENT_MSR        (1 << 2)	/* model specific reg. was modified */
+#define KVMI_EVENT_XSETBV     (1 << 3)	/* ext. control register was modified */
+#define KVMI_EVENT_BREAKPOINT (1 << 4)	/* breakpoint was reached */
+#define KVMI_EVENT_USER_CALL  (1 << 5)	/* user hypercall */
+#define KVMI_EVENT_PAGE_FAULT (1 << 6)	/* hyp. page fault was encountered */
+#define KVMI_EVENT_TRAP       (1 << 7)	/* trap was injected */
+
+#define KVMI_KNOWN_EVENTS (KVMI_EVENT_CR | \
+			   KVMI_EVENT_MSR | \
+			   KVMI_EVENT_XSETBV | \
+			   KVMI_EVENT_BREAKPOINT | \
+			   KVMI_EVENT_USER_CALL | \
+			   KVMI_EVENT_PAGE_FAULT | \
+			   KVMI_EVENT_TRAP)
+
+#define KVMI_EVENT_ACTION_ALLOW      (1 << 0)	/* used in replies */
+#define KVMI_EVENT_ACTION_SET_REGS   (1 << 1)	/* registers need to be written back */
+#define KVMI_EVENT_ACTION_SET_CTX    (1 << 2)	/* set the emulation context */
+#define KVMI_EVENT_ACTION_NOEMU      (1 << 3)	/* return to guest without emulation */
+
+#define KVMI_GET_VERSION                    1
+#define KVMI_GET_GUESTS                     2 /* TODO: remove me */
+#define KVMI_GET_GUEST_INFO                 3
+#define KVMI_PAUSE_GUEST                    4
+#define KVMI_UNPAUSE_GUEST                  5
+#define KVMI_GET_REGISTERS                  6
+#define KVMI_SET_REGISTERS                  7
+#define KVMI_SHUTDOWN_GUEST                 8
+#define KVMI_GET_MTRR_TYPE                  9
+#define KVMI_GET_MTRRS                      10
+#define KVMI_GET_XSAVE_INFO                 11
+#define KVMI_GET_PAGE_ACCESS                12
+#define KVMI_SET_PAGE_ACCESS                13
+#define KVMI_INJECT_PAGE_FAULT              14
+#define KVMI_READ_PHYSICAL                  15 /* TODO: remove me */
+#define KVMI_WRITE_PHYSICAL                 16 /* TODO: remove me */
+#define KVMI_MAP_PHYSICAL_PAGE_TO_GUEST     17
+#define KVMI_UNMAP_PHYSICAL_PAGE_FROM_GUEST 18
+#define KVMI_CONTROL_EVENTS                 19
+#define KVMI_CR_CONTROL                     20
+#define KVMI_MSR_CONTROL                    21
+#define KVMI_INJECT_BREAKPOINT              22
+#define KVMI_EVENT_GUEST_ON                 23 /* TODO: remove me */
+#define KVMI_EVENT_GUEST_OFF                24 /* TODO: remove me */
+#define KVMI_EVENT_VCPU                     25
+#define KVMI_EVENT_VCPU_REPLY               26
+
+/* TODO: remove me */
+struct kvmi_guest {
+	__u8 uuid[16];
+};
+
+/* TODO: remove me */
+struct kvmi_guests {
+	__u32 size;		/* in: the size of the entire structure */
+	struct kvmi_guest guests[1];
+};
+
+/* TODO: remove me */
+struct kvmi_read_physical {
+	__u64 gpa;
+	__u64 size;
+};
+
+/* TODO: remove me */
+struct kvmi_read_physical_reply {
+	__s32 err;
+	__u8 bytes[0];
+};
+
+/* TODO: remove me */
+struct kvmi_write_physical {
+	__u64 gpa;
+	__u64 size;
+	__u8 bytes[0];
+};
+
+
+struct kvmi_socket_hdr {
+	__u16 msg_id;
+	__u16 size;
+	__u32 seq;
+};
+
+struct kvmi_error_code {
+	__s32 err;
+	__u32 padding;
+};
+
+struct kvmi_get_version_reply {
+	__s32 err;
+	__u32 version;
+};
+
+struct kvmi_get_guest_info_reply {
+	__s32 err;
+	__u16 vcpu_count;
+	__u16 padding;
+	__u64 tsc_speed;
+};
+
+struct kvmi_get_registers_x86 {
+	__u16 vcpu;
+	__u16 nmsrs;
+	__u32 msrs_idx[0];
+};
+
+struct kvmi_get_registers_x86_reply {
+	__s32 err;
+	__u32 mode;
+	struct kvm_regs regs;
+	struct kvm_sregs sregs;
+	struct kvm_msrs msrs;
+};
+
+struct kvmi_set_registers_x86 {
+	__u16 vcpu;
+	__u16 padding[3];
+	struct kvm_regs regs;
+};
+
+struct kvmi_mtrr_type {
+	__u64 gpa;
+};
+
+struct kvmi_mtrr_type_reply {
+	__s32 err;
+	__u32 padding;
+	__u64 type;
+};
+
+struct kvmi_mtrrs {
+	__u16 vcpu;
+	__u16 padding[3];
+};
+
+struct kvmi_mtrrs_reply {
+	__s32 err;
+	__u32 padding;
+	__u64 pat;
+	__u64 cap;
+	__u64 type;
+};
+
+struct kvmi_xsave_info {
+	__u16 vcpu;
+	__u16 padding[3];
+};
+
+struct kvmi_xsave_info_reply {
+	__s32 err;
+	__u32 size;
+};
+
+struct kvmi_get_page_access {
+	__u16 vcpu;
+	__u16 padding[3];
+	__u64 gpa;
+};
+
+struct kvmi_get_page_access_reply {
+	__s32 err;
+	__u32 access;
+};
+
+struct kvmi_set_page_access {
+	__u16 vcpu;
+	__u16 padding;
+	__u32 access;
+	__u64 gpa;
+};
+
+struct kvmi_page_fault {
+	__u16 vcpu;
+	__u16 padding;
+	__u32 error;
+	__u64 gva;
+};
+
+struct kvmi_inject_breakpoint {
+	__u16 vcpu;
+	__u16 padding[3];
+};
+
+struct kvmi_map_physical_page_to_guest {
+	__u64 gpa_src;
+	__u64 gfn_dest;
+};
+
+struct kvmi_unmap_physical_page_from_guest {
+	__u64 gfn_dest;
+};
+
+struct kvmi_control_events {
+	__u16 vcpu;
+	__u16 padding;
+	__u32 events;
+};
+
+struct kvmi_cr_control {
+	__u8 enable;
+	__u8 padding[3];
+	__u32 cr;
+};
+
+struct kvmi_msr_control {
+	__u8 enable;
+	__u8 padding[3];
+	__u32 msr;
+};
+
+struct kvmi_event_x86 {
+	__u16 vcpu;
+	__u8 mode;
+	__u8 padding1;
+	__u32 event;
+	struct kvm_regs regs;
+	struct kvm_sregs sregs;
+	struct {
+		__u64 sysenter_cs;
+		__u64 sysenter_esp;
+		__u64 sysenter_eip;
+		__u64 efer;
+		__u64 star;
+		__u64 lstar;
+	} msrs;
+};
+
+struct kvmi_event_x86_reply {
+	struct kvm_regs regs;
+	__u32 actions;
+	__u32 padding;
+};
+
+struct kvmi_event_cr {
+	__u16 cr;
+	__u16 padding[3];
+	__u64 old_value;
+	__u64 new_value;
+};
+
+struct kvmi_event_cr_reply {
+	__u64 new_val;
+};
+
+struct kvmi_event_msr {
+	__u32 msr;
+	__u32 padding;
+	__u64 old_value;
+	__u64 new_value;
+};
+
+struct kvmi_event_msr_reply {
+	__u64 new_val;
+};
+
+struct kvmi_event_xsetbv {
+	__u64 xcr0;
+};
+
+struct kvmi_event_breakpoint {
+	__u64 gpa;
+};
+
+struct kvmi_event_page_fault {
+	__u64 gva;
+	__u64 gpa;
+	__u32 mode;
+	__u32 padding;
+};
+
+struct kvmi_event_page_fault_reply {
+	__u32 ctx_size;
+	__u8 ctx_data[256];
+};
+
+struct kvmi_event_trap {
+	__u32 vector;
+	__u32 type;
+	__u32 err;
+	__u32 padding;
+	__u64 cr2;
+};
+
+#endif /* __KVMI_H_INCLUDED__ */