diff mbox series

[v10,2/9] KVM: Introduce per-page memory attributes

Message ID 20221202061347.1070246-3-chao.p.peng@linux.intel.com (mailing list archive)
State New
Headers show
Series KVM: mm: fd-based approach for supporting KVM | expand

Commit Message

Chao Peng Dec. 2, 2022, 6:13 a.m. UTC
In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.

Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
userspace to operate on the per-page memory attributes.
  - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
    a guest memory range.
  - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
    memory attributes.

KVM internally uses xarray to store the per-page memory attributes.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
---
 Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
 arch/x86/kvm/Kconfig           |  1 +
 include/linux/kvm_host.h       |  3 ++
 include/uapi/linux/kvm.h       | 17 ++++++++
 virt/kvm/Kconfig               |  3 ++
 virt/kvm/kvm_main.c            | 76 ++++++++++++++++++++++++++++++++++
 6 files changed, 163 insertions(+)

Comments

Fabiano Rosas Dec. 6, 2022, 1:34 p.m. UTC | #1
Chao Peng <chao.p.peng@linux.intel.com> writes:

> In confidential computing usages, whether a page is private or shared is
> necessary information for KVM to perform operations like page fault
> handling, page zapping etc. There are other potential use cases for
> per-page memory attributes, e.g. to make memory read-only (or no-exec,
> or exec-only, etc.) without having to modify memslots.
>
> Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> userspace to operate on the per-page memory attributes.
>   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
>     a guest memory range.
>   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
>     memory attributes.
>
> KVM internally uses xarray to store the per-page memory attributes.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> ---
>  Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
>  arch/x86/kvm/Kconfig           |  1 +
>  include/linux/kvm_host.h       |  3 ++
>  include/uapi/linux/kvm.h       | 17 ++++++++
>  virt/kvm/Kconfig               |  3 ++
>  virt/kvm/kvm_main.c            | 76 ++++++++++++++++++++++++++++++++++
>  6 files changed, 163 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 5617bc4f899f..bb2f709c0900 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
>  The "pad" and "reserved" fields may be used for future extensions and should be
>  set to 0s by userspace.
>  
> +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: u64 memory attributes bitmask(out)
> +:Returns: 0 on success, <0 on error
> +
> +Returns supported memory attributes bitmask. Supported memory attributes will
> +have the corresponding bits set in u64 memory attributes bitmask.
> +
> +The following memory attributes are defined::
> +
> +  #define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> +  #define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> +  #define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> +  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> +
> +4.139 KVM_SET_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct kvm_memory_attributes(in/out)
> +:Returns: 0 on success, <0 on error
> +
> +Sets memory attributes for pages in a guest memory range. Parameters are
> +specified via the following structure::
> +
> +  struct kvm_memory_attributes {
> +	__u64 address;
> +	__u64 size;
> +	__u64 attributes;
> +	__u64 flags;
> +  };
> +
> +The user sets the per-page memory attributes to a guest memory range indicated
> +by address/size, and in return KVM adjusts address and size to reflect the
> +actual pages of the memory range have been successfully set to the attributes.

This wording could cause some confusion, what about a simpler:

"reflect the range of pages that had its attributes successfully set"

> +If the call returns 0, "address" is updated to the last successful address + 1
> +and "size" is updated to the remaining address size that has not been set
> +successfully.

"address + 1 page" or "subsequent page" perhaps.

In fact, wouldn't this all become simpler if size were number of pages instead?

> The user should check the return value as well as the size to
> +decide if the operation succeeded for the whole range or not. The user may want
> +to retry the operation with the returned address/size if the previous range was
> +partially successful.
> +
> +Both address and size should be page aligned and the supported attributes can be
> +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> +
> +The "flags" field may be used for future extensions and should be set to 0s.
> +

...

> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> +					   struct kvm_memory_attributes *attrs)
> +{
> +	gfn_t start, end;
> +	unsigned long i;
> +	void *entry;
> +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> +	/* flags is currently not used. */
> +	if (attrs->flags)
> +		return -EINVAL;
> +	if (attrs->attributes & ~supported_attrs)
> +		return -EINVAL;
> +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> +		return -EINVAL;
> +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> +		return -EINVAL;
> +
> +	start = attrs->address >> PAGE_SHIFT;
> +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;

Here PAGE_SIZE and -1 cancel out.

Consider using gpa_to_gfn as well.

> +
> +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
> +	mutex_lock(&kvm->lock);
> +	for (i = start; i < end; i++)
> +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> +				    GFP_KERNEL_ACCOUNT)))
> +			break;
> +	mutex_unlock(&kvm->lock);
> +
> +	attrs->address = i << PAGE_SHIFT;
> +	attrs->size = (end - i) << PAGE_SHIFT;
> +
> +	return 0;
> +}
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
>  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
>  {
>  	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> @@ -4459,6 +4508,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>  #ifdef CONFIG_HAVE_KVM_MSI
>  	case KVM_CAP_SIGNAL_MSI:
>  #endif
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	case KVM_CAP_MEMORY_ATTRIBUTES:
> +#endif
>  #ifdef CONFIG_HAVE_KVM_IRQFD
>  	case KVM_CAP_IRQFD:
>  	case KVM_CAP_IRQFD_RESAMPLE:
> @@ -4804,6 +4856,30 @@ static long kvm_vm_ioctl(struct file *filp,
>  		break;
>  	}
>  #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	case KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES: {
> +		u64 attrs = kvm_supported_mem_attributes(kvm);
> +
> +		r = -EFAULT;
> +		if (copy_to_user(argp, &attrs, sizeof(attrs)))
> +			goto out;
> +		r = 0;
> +		break;
> +	}
> +	case KVM_SET_MEMORY_ATTRIBUTES: {
> +		struct kvm_memory_attributes attrs;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&attrs, argp, sizeof(attrs)))
> +			goto out;
> +
> +		r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
> +
> +		if (!r && copy_to_user(argp, &attrs, sizeof(attrs)))
> +			r = -EFAULT;
> +		break;
> +	}
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
>  	case KVM_CREATE_DEVICE: {
>  		struct kvm_create_device cd;
Fuad Tabba Dec. 6, 2022, 3:07 p.m. UTC | #2
Hi,

On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
>
> In confidential computing usages, whether a page is private or shared is
> necessary information for KVM to perform operations like page fault
> handling, page zapping etc. There are other potential use cases for
> per-page memory attributes, e.g. to make memory read-only (or no-exec,
> or exec-only, etc.) without having to modify memslots.
>
> Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> userspace to operate on the per-page memory attributes.
>   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
>     a guest memory range.
>   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
>     memory attributes.
>
> KVM internally uses xarray to store the per-page memory attributes.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> ---
>  Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
>  arch/x86/kvm/Kconfig           |  1 +
>  include/linux/kvm_host.h       |  3 ++
>  include/uapi/linux/kvm.h       | 17 ++++++++
>  virt/kvm/Kconfig               |  3 ++
>  virt/kvm/kvm_main.c            | 76 ++++++++++++++++++++++++++++++++++
>  6 files changed, 163 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 5617bc4f899f..bb2f709c0900 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
>  The "pad" and "reserved" fields may be used for future extensions and should be
>  set to 0s by userspace.
>
> +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: u64 memory attributes bitmask(out)
> +:Returns: 0 on success, <0 on error
> +
> +Returns supported memory attributes bitmask. Supported memory attributes will
> +have the corresponding bits set in u64 memory attributes bitmask.
> +
> +The following memory attributes are defined::
> +
> +  #define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> +  #define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> +  #define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> +  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> +
> +4.139 KVM_SET_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct kvm_memory_attributes(in/out)
> +:Returns: 0 on success, <0 on error
> +
> +Sets memory attributes for pages in a guest memory range. Parameters are
> +specified via the following structure::
> +
> +  struct kvm_memory_attributes {
> +       __u64 address;
> +       __u64 size;
> +       __u64 attributes;
> +       __u64 flags;
> +  };
> +
> +The user sets the per-page memory attributes to a guest memory range indicated
> +by address/size, and in return KVM adjusts address and size to reflect the
> +actual pages of the memory range have been successfully set to the attributes.
> +If the call returns 0, "address" is updated to the last successful address + 1
> +and "size" is updated to the remaining address size that has not been set
> +successfully. The user should check the return value as well as the size to
> +decide if the operation succeeded for the whole range or not. The user may want
> +to retry the operation with the returned address/size if the previous range was
> +partially successful.
> +
> +Both address and size should be page aligned and the supported attributes can be
> +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> +
> +The "flags" field may be used for future extensions and should be set to 0s.
> +
>  5. The kvm_run structure
>  ========================
>
> @@ -8270,6 +8323,16 @@ structure.
>  When getting the Modified Change Topology Report value, the attr->addr
>  must point to a byte where the value will be stored or retrieved from.
>
> +8.40 KVM_CAP_MEMORY_ATTRIBUTES
> +------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm
> +
> +This capability indicates KVM supports per-page memory attributes and ioctls
> +KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
> +
>  9. Known KVM API problems
>  =========================
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index fbeaa9ddef59..a8e379a3afee 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -49,6 +49,7 @@ config KVM
>         select SRCU
>         select INTERVAL_TREE
>         select HAVE_KVM_PM_NOTIFIER if PM
> +       select HAVE_KVM_MEMORY_ATTRIBUTES
>         help
>           Support hosting fully virtualized guest machines using hardware
>           virtualization extensions.  You will need a fairly recent
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8f874a964313..a784e2b06625 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -800,6 +800,9 @@ struct kvm {
>
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>         struct notifier_block pm_notifier;
> +#endif
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +       struct xarray mem_attr_array;
>  #endif
>         char stats_id[KVM_STATS_NAME_SIZE];
>  };
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 64dfe9c07c87..5d0941acb5bb 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1182,6 +1182,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_S390_CPU_TOPOLOGY 222
>  #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
>  #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
> +#define KVM_CAP_MEMORY_ATTRIBUTES 225
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -2238,4 +2239,20 @@ struct kvm_s390_zpci_op {
>  /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
>  #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
>
> +/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
> +#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES    _IOR(KVMIO,  0xd2, __u64)
> +#define KVM_SET_MEMORY_ATTRIBUTES              _IOWR(KVMIO,  0xd3, struct kvm_memory_attributes)
> +
> +struct kvm_memory_attributes {
> +       __u64 address;
> +       __u64 size;
> +       __u64 attributes;
> +       __u64 flags;
> +};
> +
> +#define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> +#define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> +#define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> +#define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)

nit: how about using the BIT() macro for these?

> +
>  #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 800f9470e36b..effdea5dd4f0 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -19,6 +19,9 @@ config HAVE_KVM_IRQ_ROUTING
>  config HAVE_KVM_DIRTY_RING
>         bool
>
> +config HAVE_KVM_MEMORY_ATTRIBUTES
> +       bool
> +
>  # Only strongly ordered architectures can select this, as it doesn't
>  # put any explicit constraint on userspace ordering. They can also
>  # select the _ACQ_REL version.
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1782c4555d94..7f0f5e9f2406 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>         spin_lock_init(&kvm->mn_invalidate_lock);
>         rcuwait_init(&kvm->mn_memslots_update_rcuwait);
>         xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +       xa_init(&kvm->mem_attr_array);
> +#endif
>
>         INIT_LIST_HEAD(&kvm->gpc_list);
>         spin_lock_init(&kvm->gpc_lock);
> @@ -1323,6 +1326,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
>                 kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
>                 kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
>         }
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +       xa_destroy(&kvm->mem_attr_array);
> +#endif
>         cleanup_srcu_struct(&kvm->irq_srcu);
>         cleanup_srcu_struct(&kvm->srcu);
>         kvm_arch_free_vm(kvm);
> @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>  }
>  #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> +{
> +       return 0;
> +}
> +
> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> +                                          struct kvm_memory_attributes *attrs)
> +{
> +       gfn_t start, end;
> +       unsigned long i;
> +       void *entry;
> +       u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> +       /* flags is currently not used. */

nit: "is reserved"? I think it makes it a bit clearer what its purpose is.

> +       if (attrs->flags)
> +               return -EINVAL;
> +       if (attrs->attributes & ~supported_attrs)
> +               return -EINVAL;
> +       if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> +               return -EINVAL;
> +       if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> +               return -EINVAL;
> +
> +       start = attrs->address >> PAGE_SHIFT;
> +       end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;

Would using existing helpers be better for getting the frame numbers?
Also, the code checks that the address and size are page aligned, so
the end rounding up seems redundant, and might even be wrong if the
address+size-1 is close to the gfn_t limit (which this code tries to
avoid in an earlier check).

> +       entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
> +       mutex_lock(&kvm->lock);
> +       for (i = start; i < end; i++)
> +               if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> +                                   GFP_KERNEL_ACCOUNT)))
> +                       break;
> +       mutex_unlock(&kvm->lock);
> +
> +       attrs->address = i << PAGE_SHIFT;
> +       attrs->size = (end - i) << PAGE_SHIFT;

nit: helpers for these too?

With the end calculation fixed,

Reviewed-by: Fuad Tabba <tabba@google.com>
After adding the necessary configs for arm64 (on qemu/arm64):
Tested-by: Fuad Tabba <tabba@google.com>

Cheers,
/fuad

> +
> +       return 0;
> +}
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
>  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
>  {
>         return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> @@ -4459,6 +4508,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>  #ifdef CONFIG_HAVE_KVM_MSI
>         case KVM_CAP_SIGNAL_MSI:
>  #endif
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +       case KVM_CAP_MEMORY_ATTRIBUTES:
> +#endif
>  #ifdef CONFIG_HAVE_KVM_IRQFD
>         case KVM_CAP_IRQFD:
>         case KVM_CAP_IRQFD_RESAMPLE:
> @@ -4804,6 +4856,30 @@ static long kvm_vm_ioctl(struct file *filp,
>                 break;
>         }
>  #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +       case KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES: {
> +               u64 attrs = kvm_supported_mem_attributes(kvm);
> +
> +               r = -EFAULT;
> +               if (copy_to_user(argp, &attrs, sizeof(attrs)))
> +                       goto out;
> +               r = 0;
> +               break;
> +       }
> +       case KVM_SET_MEMORY_ATTRIBUTES: {
> +               struct kvm_memory_attributes attrs;
> +
> +               r = -EFAULT;
> +               if (copy_from_user(&attrs, argp, sizeof(attrs)))
> +                       goto out;
> +
> +               r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
> +
> +               if (!r && copy_to_user(argp, &attrs, sizeof(attrs)))
> +                       r = -EFAULT;
> +               break;
> +       }
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
>         case KVM_CREATE_DEVICE: {
>                 struct kvm_create_device cd;
>
> --
> 2.25.1
>
Chao Peng Dec. 7, 2022, 2:31 p.m. UTC | #3
On Tue, Dec 06, 2022 at 10:34:32AM -0300, Fabiano Rosas wrote:
> Chao Peng <chao.p.peng@linux.intel.com> writes:
> 
> > In confidential computing usages, whether a page is private or shared is
> > necessary information for KVM to perform operations like page fault
> > handling, page zapping etc. There are other potential use cases for
> > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > or exec-only, etc.) without having to modify memslots.
> >
> > Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > userspace to operate on the per-page memory attributes.
> >   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> >     a guest memory range.
> >   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> >     memory attributes.
> >
> > KVM internally uses xarray to store the per-page memory attributes.
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> > ---
> >  Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
> >  arch/x86/kvm/Kconfig           |  1 +
> >  include/linux/kvm_host.h       |  3 ++
> >  include/uapi/linux/kvm.h       | 17 ++++++++
> >  virt/kvm/Kconfig               |  3 ++
> >  virt/kvm/kvm_main.c            | 76 ++++++++++++++++++++++++++++++++++
> >  6 files changed, 163 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 5617bc4f899f..bb2f709c0900 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
> >  The "pad" and "reserved" fields may be used for future extensions and should be
> >  set to 0s by userspace.
> >  
> > +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> > +-----------------------------------------
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: u64 memory attributes bitmask(out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Returns supported memory attributes bitmask. Supported memory attributes will
> > +have the corresponding bits set in u64 memory attributes bitmask.
> > +
> > +The following memory attributes are defined::
> > +
> > +  #define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> > +  #define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> > +  #define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> > +  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> > +
> > +4.139 KVM_SET_MEMORY_ATTRIBUTES
> > +-----------------------------------------
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: struct kvm_memory_attributes(in/out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Sets memory attributes for pages in a guest memory range. Parameters are
> > +specified via the following structure::
> > +
> > +  struct kvm_memory_attributes {
> > +	__u64 address;
> > +	__u64 size;
> > +	__u64 attributes;
> > +	__u64 flags;
> > +  };
> > +
> > +The user sets the per-page memory attributes to a guest memory range indicated
> > +by address/size, and in return KVM adjusts address and size to reflect the
> > +actual pages of the memory range have been successfully set to the attributes.
> 
> This wording could cause some confusion, what about a simpler:
> 
> "reflect the range of pages that had its attributes successfully set"

Thanks, this is much better.

> 
> > +If the call returns 0, "address" is updated to the last successful address + 1
> > +and "size" is updated to the remaining address size that has not been set
> > +successfully.
> 
> "address + 1 page" or "subsequent page" perhaps.
> 
> In fact, wouldn't this all become simpler if size were number of pages instead?

It indeed becomes better if the size is number of pages and the address
is gfn, but I think we don't want to imply that the page size is 4K to
userspace.

> 
> > The user should check the return value as well as the size to
> > +decide if the operation succeeded for the whole range or not. The user may want
> > +to retry the operation with the returned address/size if the previous range was
> > +partially successful.
> > +
> > +Both address and size should be page aligned and the supported attributes can be
> > +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> > +
> > +The "flags" field may be used for future extensions and should be set to 0s.
> > +
> 
> ...
> 
> > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > +					   struct kvm_memory_attributes *attrs)
> > +{
> > +	gfn_t start, end;
> > +	unsigned long i;
> > +	void *entry;
> > +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > +
> > +	/* flags is currently not used. */
> > +	if (attrs->flags)
> > +		return -EINVAL;
> > +	if (attrs->attributes & ~supported_attrs)
> > +		return -EINVAL;
> > +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > +		return -EINVAL;
> > +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > +		return -EINVAL;
> > +
> > +	start = attrs->address >> PAGE_SHIFT;
> > +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> 
> Here PAGE_SIZE and -1 cancel out.

Correct!

> 
> Consider using gpa_to_gfn as well.

Yes using gpa_to_gfn is appropriate.

Thanks,
Chao
> 
> > +
> > +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > +
> > +	mutex_lock(&kvm->lock);
> > +	for (i = start; i < end; i++)
> > +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > +				    GFP_KERNEL_ACCOUNT)))
> > +			break;
> > +	mutex_unlock(&kvm->lock);
> > +
> > +	attrs->address = i << PAGE_SHIFT;
> > +	attrs->size = (end - i) << PAGE_SHIFT;
> > +
> > +	return 0;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > +
> >  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> >  {
> >  	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> > @@ -4459,6 +4508,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> >  #ifdef CONFIG_HAVE_KVM_MSI
> >  	case KVM_CAP_SIGNAL_MSI:
> >  #endif
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +	case KVM_CAP_MEMORY_ATTRIBUTES:
> > +#endif
> >  #ifdef CONFIG_HAVE_KVM_IRQFD
> >  	case KVM_CAP_IRQFD:
> >  	case KVM_CAP_IRQFD_RESAMPLE:
> > @@ -4804,6 +4856,30 @@ static long kvm_vm_ioctl(struct file *filp,
> >  		break;
> >  	}
> >  #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +	case KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES: {
> > +		u64 attrs = kvm_supported_mem_attributes(kvm);
> > +
> > +		r = -EFAULT;
> > +		if (copy_to_user(argp, &attrs, sizeof(attrs)))
> > +			goto out;
> > +		r = 0;
> > +		break;
> > +	}
> > +	case KVM_SET_MEMORY_ATTRIBUTES: {
> > +		struct kvm_memory_attributes attrs;
> > +
> > +		r = -EFAULT;
> > +		if (copy_from_user(&attrs, argp, sizeof(attrs)))
> > +			goto out;
> > +
> > +		r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
> > +
> > +		if (!r && copy_to_user(argp, &attrs, sizeof(attrs)))
> > +			r = -EFAULT;
> > +		break;
> > +	}
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> >  	case KVM_CREATE_DEVICE: {
> >  		struct kvm_create_device cd;
Chao Peng Dec. 7, 2022, 2:51 p.m. UTC | #4
On Tue, Dec 06, 2022 at 03:07:27PM +0000, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Dec 2, 2022 at 6:18 AM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> >
> > In confidential computing usages, whether a page is private or shared is
> > necessary information for KVM to perform operations like page fault
> > handling, page zapping etc. There are other potential use cases for
> > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > or exec-only, etc.) without having to modify memslots.
> >
> > Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > userspace to operate on the per-page memory attributes.
> >   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> >     a guest memory range.
> >   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> >     memory attributes.
> >
> > KVM internally uses xarray to store the per-page memory attributes.
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> > ---
> >  Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
> >  arch/x86/kvm/Kconfig           |  1 +
> >  include/linux/kvm_host.h       |  3 ++
> >  include/uapi/linux/kvm.h       | 17 ++++++++
> >  virt/kvm/Kconfig               |  3 ++
> >  virt/kvm/kvm_main.c            | 76 ++++++++++++++++++++++++++++++++++
> >  6 files changed, 163 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 5617bc4f899f..bb2f709c0900 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
> >  The "pad" and "reserved" fields may be used for future extensions and should be
> >  set to 0s by userspace.
> >
> > +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> > +-----------------------------------------
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: u64 memory attributes bitmask(out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Returns supported memory attributes bitmask. Supported memory attributes will
> > +have the corresponding bits set in u64 memory attributes bitmask.
> > +
> > +The following memory attributes are defined::
> > +
> > +  #define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> > +  #define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> > +  #define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> > +  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> > +
> > +4.139 KVM_SET_MEMORY_ATTRIBUTES
> > +-----------------------------------------
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: struct kvm_memory_attributes(in/out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Sets memory attributes for pages in a guest memory range. Parameters are
> > +specified via the following structure::
> > +
> > +  struct kvm_memory_attributes {
> > +       __u64 address;
> > +       __u64 size;
> > +       __u64 attributes;
> > +       __u64 flags;
> > +  };
> > +
> > +The user sets the per-page memory attributes to a guest memory range indicated
> > +by address/size, and in return KVM adjusts address and size to reflect the
> > +actual pages of the memory range have been successfully set to the attributes.
> > +If the call returns 0, "address" is updated to the last successful address + 1
> > +and "size" is updated to the remaining address size that has not been set
> > +successfully. The user should check the return value as well as the size to
> > +decide if the operation succeeded for the whole range or not. The user may want
> > +to retry the operation with the returned address/size if the previous range was
> > +partially successful.
> > +
> > +Both address and size should be page aligned and the supported attributes can be
> > +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> > +
> > +The "flags" field may be used for future extensions and should be set to 0s.
> > +
> >  5. The kvm_run structure
> >  ========================
> >
> > @@ -8270,6 +8323,16 @@ structure.
> >  When getting the Modified Change Topology Report value, the attr->addr
> >  must point to a byte where the value will be stored or retrieved from.
> >
> > +8.40 KVM_CAP_MEMORY_ATTRIBUTES
> > +------------------------------
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm
> > +
> > +This capability indicates KVM supports per-page memory attributes and ioctls
> > +KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
> > +
> >  9. Known KVM API problems
> >  =========================
> >
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index fbeaa9ddef59..a8e379a3afee 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -49,6 +49,7 @@ config KVM
> >         select SRCU
> >         select INTERVAL_TREE
> >         select HAVE_KVM_PM_NOTIFIER if PM
> > +       select HAVE_KVM_MEMORY_ATTRIBUTES
> >         help
> >           Support hosting fully virtualized guest machines using hardware
> >           virtualization extensions.  You will need a fairly recent
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 8f874a964313..a784e2b06625 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -800,6 +800,9 @@ struct kvm {
> >
> >  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
> >         struct notifier_block pm_notifier;
> > +#endif
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +       struct xarray mem_attr_array;
> >  #endif
> >         char stats_id[KVM_STATS_NAME_SIZE];
> >  };
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 64dfe9c07c87..5d0941acb5bb 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -1182,6 +1182,7 @@ struct kvm_ppc_resize_hpt {
> >  #define KVM_CAP_S390_CPU_TOPOLOGY 222
> >  #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
> >  #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
> > +#define KVM_CAP_MEMORY_ATTRIBUTES 225
> >
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >
> > @@ -2238,4 +2239,20 @@ struct kvm_s390_zpci_op {
> >  /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
> >  #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
> >
> > +/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
> > +#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES    _IOR(KVMIO,  0xd2, __u64)
> > +#define KVM_SET_MEMORY_ATTRIBUTES              _IOWR(KVMIO,  0xd3, struct kvm_memory_attributes)
> > +
> > +struct kvm_memory_attributes {
> > +       __u64 address;
> > +       __u64 size;
> > +       __u64 attributes;
> > +       __u64 flags;
> > +};
> > +
> > +#define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> > +#define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> > +#define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> > +#define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> 
> nit: how about using the BIT() macro for these?

Might be the _BITULL() in include/uapi/linux/const.h since it will be
used by userspace also.

> 
> > +
> >  #endif /* __LINUX_KVM_H */
> > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > index 800f9470e36b..effdea5dd4f0 100644
> > --- a/virt/kvm/Kconfig
> > +++ b/virt/kvm/Kconfig
> > @@ -19,6 +19,9 @@ config HAVE_KVM_IRQ_ROUTING
> >  config HAVE_KVM_DIRTY_RING
> >         bool
> >
> > +config HAVE_KVM_MEMORY_ATTRIBUTES
> > +       bool
> > +
> >  # Only strongly ordered architectures can select this, as it doesn't
> >  # put any explicit constraint on userspace ordering. They can also
> >  # select the _ACQ_REL version.
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 1782c4555d94..7f0f5e9f2406 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> >         spin_lock_init(&kvm->mn_invalidate_lock);
> >         rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> >         xa_init(&kvm->vcpu_array);
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +       xa_init(&kvm->mem_attr_array);
> > +#endif
> >
> >         INIT_LIST_HEAD(&kvm->gpc_list);
> >         spin_lock_init(&kvm->gpc_lock);
> > @@ -1323,6 +1326,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
> >                 kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
> >                 kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
> >         }
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +       xa_destroy(&kvm->mem_attr_array);
> > +#endif
> >         cleanup_srcu_struct(&kvm->irq_srcu);
> >         cleanup_srcu_struct(&kvm->srcu);
> >         kvm_arch_free_vm(kvm);
> > @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> >  }
> >  #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
> >
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> > +{
> > +       return 0;
> > +}
> > +
> > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > +                                          struct kvm_memory_attributes *attrs)
> > +{
> > +       gfn_t start, end;
> > +       unsigned long i;
> > +       void *entry;
> > +       u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > +
> > +       /* flags is currently not used. */
> 
> nit: "is reserved"? I think it makes it a bit clearer what its purpose is.
OK, then:
  flags is reserved for future extention and currently is not used.

> 
> > +       if (attrs->flags)
> > +               return -EINVAL;
> > +       if (attrs->attributes & ~supported_attrs)
> > +               return -EINVAL;
> > +       if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > +               return -EINVAL;
> > +       if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > +               return -EINVAL;
> > +
> > +       start = attrs->address >> PAGE_SHIFT;
> > +       end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> 
> Would using existing helpers be better for getting the frame numbers?

Yes, gpa_to_gfn() can be used.

> Also, the code checks that the address and size are page aligned, so
> the end rounding up seems redundant, and might even be wrong if the
> address+size-1 is close to the gfn_t limit (which this code tries to
> avoid in an earlier check).

That's right.

> 
> > +       entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > +
> > +       mutex_lock(&kvm->lock);
> > +       for (i = start; i < end; i++)
> > +               if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > +                                   GFP_KERNEL_ACCOUNT)))
> > +                       break;
> > +       mutex_unlock(&kvm->lock);
> > +
> > +       attrs->address = i << PAGE_SHIFT;
> > +       attrs->size = (end - i) << PAGE_SHIFT;
> 
> nit: helpers for these too?

Similarly, gfn_to_gpa() will be used.

> 
> With the end calculation fixed,
> 
> Reviewed-by: Fuad Tabba <tabba@google.com>
> After adding the necessary configs for arm64 (on qemu/arm64):
> Tested-by: Fuad Tabba <tabba@google.com>

Thanks.
Chao
> 
> Cheers,
> /fuad
> 
> > +
> > +       return 0;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > +
> >  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> >  {
> >         return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> > @@ -4459,6 +4508,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> >  #ifdef CONFIG_HAVE_KVM_MSI
> >         case KVM_CAP_SIGNAL_MSI:
> >  #endif
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +       case KVM_CAP_MEMORY_ATTRIBUTES:
> > +#endif
> >  #ifdef CONFIG_HAVE_KVM_IRQFD
> >         case KVM_CAP_IRQFD:
> >         case KVM_CAP_IRQFD_RESAMPLE:
> > @@ -4804,6 +4856,30 @@ static long kvm_vm_ioctl(struct file *filp,
> >                 break;
> >         }
> >  #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +       case KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES: {
> > +               u64 attrs = kvm_supported_mem_attributes(kvm);
> > +
> > +               r = -EFAULT;
> > +               if (copy_to_user(argp, &attrs, sizeof(attrs)))
> > +                       goto out;
> > +               r = 0;
> > +               break;
> > +       }
> > +       case KVM_SET_MEMORY_ATTRIBUTES: {
> > +               struct kvm_memory_attributes attrs;
> > +
> > +               r = -EFAULT;
> > +               if (copy_from_user(&attrs, argp, sizeof(attrs)))
> > +                       goto out;
> > +
> > +               r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
> > +
> > +               if (!r && copy_to_user(argp, &attrs, sizeof(attrs)))
> > +                       r = -EFAULT;
> > +               break;
> > +       }
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> >         case KVM_CREATE_DEVICE: {
> >                 struct kvm_create_device cd;
> >
> > --
> > 2.25.1
> >
Borislav Petkov Dec. 16, 2022, 3:09 p.m. UTC | #5
On Fri, Dec 02, 2022 at 02:13:40PM +0800, Chao Peng wrote:
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1782c4555d94..7f0f5e9f2406 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>  	spin_lock_init(&kvm->mn_invalidate_lock);
>  	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
>  	xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	xa_init(&kvm->mem_attr_array);
> +#endif

	if (IS_ENABLED(CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES))
		...

would at least remove the ugly ifdeffery.

Or you could create wrapper functions for that xa_init() and
xa_destroy() and put the ifdeffery in there.

> @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>  }
>  #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>  
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static u64 kvm_supported_mem_attributes(struct kvm *kvm)

I guess that function should have a verb in the name:

kvm_get_supported_mem_attributes()

> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> +					   struct kvm_memory_attributes *attrs)
> +{
> +	gfn_t start, end;
> +	unsigned long i;
> +	void *entry;
> +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> +	/* flags is currently not used. */
> +	if (attrs->flags)
> +		return -EINVAL;
> +	if (attrs->attributes & ~supported_attrs)
> +		return -EINVAL;
> +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> +		return -EINVAL;
> +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> +		return -EINVAL;

Dunno, shouldn't those issue some sort of an error message so that the
caller knows where it failed? Or at least return different retvals which
signal what the problem is?

> +	start = attrs->address >> PAGE_SHIFT;
> +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
> +	mutex_lock(&kvm->lock);
> +	for (i = start; i < end; i++)
> +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> +				    GFP_KERNEL_ACCOUNT)))
> +			break;
> +	mutex_unlock(&kvm->lock);
> +
> +	attrs->address = i << PAGE_SHIFT;
> +	attrs->size = (end - i) << PAGE_SHIFT;
> +
> +	return 0;
> +}
Chao Peng Dec. 19, 2022, 8:15 a.m. UTC | #6
On Fri, Dec 16, 2022 at 04:09:06PM +0100, Borislav Petkov wrote:
> On Fri, Dec 02, 2022 at 02:13:40PM +0800, Chao Peng wrote:
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 1782c4555d94..7f0f5e9f2406 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> >  	spin_lock_init(&kvm->mn_invalidate_lock);
> >  	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> >  	xa_init(&kvm->vcpu_array);
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +	xa_init(&kvm->mem_attr_array);
> > +#endif
> 
> 	if (IS_ENABLED(CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES))
> 		...
> 
> would at least remove the ugly ifdeffery.
> 
> Or you could create wrapper functions for that xa_init() and
> xa_destroy() and put the ifdeffery in there.

Agreed.

> 
> > @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
> >  }
> >  #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
> >  
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> 
> I guess that function should have a verb in the name:
> 
> kvm_get_supported_mem_attributes()

Right!
> 
> > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > +					   struct kvm_memory_attributes *attrs)
> > +{
> > +	gfn_t start, end;
> > +	unsigned long i;
> > +	void *entry;
> > +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > +
> > +	/* flags is currently not used. */
> > +	if (attrs->flags)
> > +		return -EINVAL;
> > +	if (attrs->attributes & ~supported_attrs)
> > +		return -EINVAL;
> > +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > +		return -EINVAL;
> > +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > +		return -EINVAL;
> 
> Dunno, shouldn't those issue some sort of an error message so that the
> caller knows where it failed? Or at least return different retvals which
> signal what the problem is?

Tamping down with error number a bit:

        if (attrs->flags)
                return -ENXIO;
        if (attrs->attributes & ~supported_attrs)
                return -EOPNOTSUPP;
        if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size) ||
            attrs->size == 0)
                return -EINVAL;
        if (attrs->address + attrs->size < attrs->address)
                return -E2BIG;

Chao
> 
> > +	start = attrs->address >> PAGE_SHIFT;
> > +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > +
> > +	mutex_lock(&kvm->lock);
> > +	for (i = start; i < end; i++)
> > +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > +				    GFP_KERNEL_ACCOUNT)))
> > +			break;
> > +	mutex_unlock(&kvm->lock);
> > +
> > +	attrs->address = i << PAGE_SHIFT;
> > +	attrs->size = (end - i) << PAGE_SHIFT;
> > +
> > +	return 0;
> > +}
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
Borislav Petkov Dec. 19, 2022, 10:17 a.m. UTC | #7
On Mon, Dec 19, 2022 at 04:15:32PM +0800, Chao Peng wrote:
> Tamping down with error number a bit:
> 
>         if (attrs->flags)
>                 return -ENXIO;
>         if (attrs->attributes & ~supported_attrs)
>                 return -EOPNOTSUPP;
>         if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size) ||
>             attrs->size == 0)
>                 return -EINVAL;
>         if (attrs->address + attrs->size < attrs->address)
>                 return -E2BIG;

Yap, better.

I guess you should add those to the documentation of the ioctl too
so that people can find out why it fails. Or, well, they can look
at the code directly too but still... imagine some blurb about
user-friendliness here...

:-)
Chao Peng Dec. 20, 2022, 7:24 a.m. UTC | #8
On Mon, Dec 19, 2022 at 11:17:22AM +0100, Borislav Petkov wrote:
> On Mon, Dec 19, 2022 at 04:15:32PM +0800, Chao Peng wrote:
> > Tamping down with error number a bit:
> > 
> >         if (attrs->flags)
> >                 return -ENXIO;
> >         if (attrs->attributes & ~supported_attrs)
> >                 return -EOPNOTSUPP;
> >         if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size) ||
> >             attrs->size == 0)
> >                 return -EINVAL;
> >         if (attrs->address + attrs->size < attrs->address)
> >                 return -E2BIG;
> 
> Yap, better.
> 
> I guess you should add those to the documentation of the ioctl too
> so that people can find out why it fails. Or, well, they can look
> at the code directly too but still... imagine some blurb about
> user-friendliness here...

Thanks for reminding. Yes KVM api doc is the right place to put these
documentation in.

Thanks,
Chao
> 
> :-)
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
Chenyi Qiang Dec. 28, 2022, 8:28 a.m. UTC | #9
On 12/2/2022 2:13 PM, Chao Peng wrote:
> In confidential computing usages, whether a page is private or shared is
> necessary information for KVM to perform operations like page fault
> handling, page zapping etc. There are other potential use cases for
> per-page memory attributes, e.g. to make memory read-only (or no-exec,
> or exec-only, etc.) without having to modify memslots.
> 
> Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> userspace to operate on the per-page memory attributes.
>   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
>     a guest memory range.
>   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
>     memory attributes.
> 
> KVM internally uses xarray to store the per-page memory attributes.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> ---
>  Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
>  arch/x86/kvm/Kconfig           |  1 +
>  include/linux/kvm_host.h       |  3 ++
>  include/uapi/linux/kvm.h       | 17 ++++++++
>  virt/kvm/Kconfig               |  3 ++
>  virt/kvm/kvm_main.c            | 76 ++++++++++++++++++++++++++++++++++
>  6 files changed, 163 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 5617bc4f899f..bb2f709c0900 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
>  The "pad" and "reserved" fields may be used for future extensions and should be
>  set to 0s by userspace.
>  
> +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: u64 memory attributes bitmask(out)
> +:Returns: 0 on success, <0 on error
> +
> +Returns supported memory attributes bitmask. Supported memory attributes will
> +have the corresponding bits set in u64 memory attributes bitmask.
> +
> +The following memory attributes are defined::
> +
> +  #define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> +  #define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> +  #define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> +  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> +
> +4.139 KVM_SET_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct kvm_memory_attributes(in/out)
> +:Returns: 0 on success, <0 on error
> +
> +Sets memory attributes for pages in a guest memory range. Parameters are
> +specified via the following structure::
> +
> +  struct kvm_memory_attributes {
> +	__u64 address;
> +	__u64 size;
> +	__u64 attributes;
> +	__u64 flags;
> +  };
> +
> +The user sets the per-page memory attributes to a guest memory range indicated
> +by address/size, and in return KVM adjusts address and size to reflect the
> +actual pages of the memory range have been successfully set to the attributes.
> +If the call returns 0, "address" is updated to the last successful address + 1
> +and "size" is updated to the remaining address size that has not been set
> +successfully. The user should check the return value as well as the size to
> +decide if the operation succeeded for the whole range or not. The user may want
> +to retry the operation with the returned address/size if the previous range was
> +partially successful.
> +
> +Both address and size should be page aligned and the supported attributes can be
> +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> +
> +The "flags" field may be used for future extensions and should be set to 0s.
> +
>  5. The kvm_run structure
>  ========================
>  
> @@ -8270,6 +8323,16 @@ structure.
>  When getting the Modified Change Topology Report value, the attr->addr
>  must point to a byte where the value will be stored or retrieved from.
>  
> +8.40 KVM_CAP_MEMORY_ATTRIBUTES
> +------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm
> +
> +This capability indicates KVM supports per-page memory attributes and ioctls
> +KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
> +
>  9. Known KVM API problems
>  =========================
>  
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index fbeaa9ddef59..a8e379a3afee 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -49,6 +49,7 @@ config KVM
>  	select SRCU
>  	select INTERVAL_TREE
>  	select HAVE_KVM_PM_NOTIFIER if PM
> +	select HAVE_KVM_MEMORY_ATTRIBUTES
>  	help
>  	  Support hosting fully virtualized guest machines using hardware
>  	  virtualization extensions.  You will need a fairly recent
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8f874a964313..a784e2b06625 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -800,6 +800,9 @@ struct kvm {
>  
>  #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>  	struct notifier_block pm_notifier;
> +#endif
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	struct xarray mem_attr_array;
>  #endif
>  	char stats_id[KVM_STATS_NAME_SIZE];
>  };
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 64dfe9c07c87..5d0941acb5bb 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1182,6 +1182,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_S390_CPU_TOPOLOGY 222
>  #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
>  #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
> +#define KVM_CAP_MEMORY_ATTRIBUTES 225
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -2238,4 +2239,20 @@ struct kvm_s390_zpci_op {
>  /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
>  #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
>  
> +/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
> +#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES    _IOR(KVMIO,  0xd2, __u64)
> +#define KVM_SET_MEMORY_ATTRIBUTES              _IOWR(KVMIO,  0xd3, struct kvm_memory_attributes)
> +
> +struct kvm_memory_attributes {
> +	__u64 address;
> +	__u64 size;
> +	__u64 attributes;
> +	__u64 flags;
> +};
> +
> +#define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> +#define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> +#define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> +#define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> +
>  #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 800f9470e36b..effdea5dd4f0 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -19,6 +19,9 @@ config HAVE_KVM_IRQ_ROUTING
>  config HAVE_KVM_DIRTY_RING
>         bool
>  
> +config HAVE_KVM_MEMORY_ATTRIBUTES
> +       bool
> +
>  # Only strongly ordered architectures can select this, as it doesn't
>  # put any explicit constraint on userspace ordering. They can also
>  # select the _ACQ_REL version.
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1782c4555d94..7f0f5e9f2406 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>  	spin_lock_init(&kvm->mn_invalidate_lock);
>  	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
>  	xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	xa_init(&kvm->mem_attr_array);
> +#endif
>  
>  	INIT_LIST_HEAD(&kvm->gpc_list);
>  	spin_lock_init(&kvm->gpc_lock);
> @@ -1323,6 +1326,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  		kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
>  		kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
>  	}
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	xa_destroy(&kvm->mem_attr_array);
> +#endif
>  	cleanup_srcu_struct(&kvm->irq_srcu);
>  	cleanup_srcu_struct(&kvm->srcu);
>  	kvm_arch_free_vm(kvm);
> @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>  }
>  #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>  
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> +{
> +	return 0;
> +}
> +
> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> +					   struct kvm_memory_attributes *attrs)
> +{
> +	gfn_t start, end;
> +	unsigned long i;
> +	void *entry;
> +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> +	/* flags is currently not used. */
> +	if (attrs->flags)
> +		return -EINVAL;
> +	if (attrs->attributes & ~supported_attrs)
> +		return -EINVAL;
> +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> +		return -EINVAL;
> +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> +		return -EINVAL;
> +
> +	start = attrs->address >> PAGE_SHIFT;
> +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +

Because guest memory defaults to private, and now this patch stores the
attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of _SHARED, it
would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of boot
time. Maybe it can be optimized somehow in other places? e.g. set mem
attr in advance.

> +	mutex_lock(&kvm->lock);
> +	for (i = start; i < end; i++)
> +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> +				    GFP_KERNEL_ACCOUNT)))
> +			break;
> +	mutex_unlock(&kvm->lock);
> +
> +	attrs->address = i << PAGE_SHIFT;
> +	attrs->size = (end - i) << PAGE_SHIFT;
> +
> +	return 0;
> +}
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
>  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
>  {
>  	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
Chao Peng Jan. 3, 2023, 1:39 a.m. UTC | #10
On Wed, Dec 28, 2022 at 04:28:01PM +0800, Chenyi Qiang wrote:
...
> > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > +					   struct kvm_memory_attributes *attrs)
> > +{
> > +	gfn_t start, end;
> > +	unsigned long i;
> > +	void *entry;
> > +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > +
> > +	/* flags is currently not used. */
> > +	if (attrs->flags)
> > +		return -EINVAL;
> > +	if (attrs->attributes & ~supported_attrs)
> > +		return -EINVAL;
> > +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > +		return -EINVAL;
> > +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > +		return -EINVAL;
> > +
> > +	start = attrs->address >> PAGE_SHIFT;
> > +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > +
> 
> Because guest memory defaults to private, and now this patch stores the
> attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of _SHARED, it
> would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of boot
> time. Maybe it can be optimized somehow in other places? e.g. set mem
> attr in advance.

KVM defaults to 'shared' because this ioctl can also be potentially used
by normal VMs and 'shared' sounds a value meaningful for both normal VMs
and confidential VMs. As for more KVM_EXIT_MEMORY_FAULT exits during the
booting time, yes, setting all memory to 'private' for confidential VMs
through this ioctl in userspace before guest launch is an approach for
KVM userspace to 'override' the KVM default and reduce the number of
implicit conversions.

Thanks,
Chao
> 
> > +	mutex_lock(&kvm->lock);
> > +	for (i = start; i < end; i++)
> > +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > +				    GFP_KERNEL_ACCOUNT)))
> > +			break;
> > +	mutex_unlock(&kvm->lock);
> > +
> > +	attrs->address = i << PAGE_SHIFT;
> > +	attrs->size = (end - i) << PAGE_SHIFT;
> > +
> > +	return 0;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > +
> >  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> >  {
> >  	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
Wang, Wei W Jan. 3, 2023, 3:32 a.m. UTC | #11
On Tuesday, January 3, 2023 9:40 AM, Chao Peng wrote:
> > Because guest memory defaults to private, and now this patch stores
> > the attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of
> _SHARED,
> > it would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of
> > boot time. Maybe it can be optimized somehow in other places? e.g. set
> > mem attr in advance.
> 
> KVM defaults to 'shared' because this ioctl can also be potentially used by
> normal VMs and 'shared' sounds a value meaningful for both normal VMs and
> confidential VMs. 

Do you mean a normal VM could have pages marked private? What's the usage?
(If all the pages are just marked shared for normal VMs, then why do we need it)

> As for more KVM_EXIT_MEMORY_FAULT exits during the
> booting time, yes, setting all memory to 'private' for confidential VMs through
> this ioctl in userspace before guest launch is an approach for KVM userspace to
> 'override' the KVM default and reduce the number of implicit conversions.

Most pages of a confidential VM are likely to be private pages. It seems more efficient
(and not difficult to check vm_type) to have KVM defaults to "private" for confidential VMs
and defaults to "shared" for normal VMs.
Sean Christopherson Jan. 3, 2023, 11:06 p.m. UTC | #12
On Tue, Jan 03, 2023, Wang, Wei W wrote:
> On Tuesday, January 3, 2023 9:40 AM, Chao Peng wrote:
> > > Because guest memory defaults to private, and now this patch stores
> > > the attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of
> > _SHARED,
> > > it would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of
> > > boot time. Maybe it can be optimized somehow in other places? e.g. set
> > > mem attr in advance.
> > 
> > KVM defaults to 'shared' because this ioctl can also be potentially used by
> > normal VMs and 'shared' sounds a value meaningful for both normal VMs and
> > confidential VMs. 
> 
> Do you mean a normal VM could have pages marked private? What's the usage?
> (If all the pages are just marked shared for normal VMs, then why do we need it)

No, there are potential use cases for per-page attribute/permissions, e.g. to
make select pages read-only, exec-only, no-exec, etc...

> > As for more KVM_EXIT_MEMORY_FAULT exits during the
> > booting time, yes, setting all memory to 'private' for confidential VMs through
> > this ioctl in userspace before guest launch is an approach for KVM userspace to
> > 'override' the KVM default and reduce the number of implicit conversions.
> 
> Most pages of a confidential VM are likely to be private pages. It seems more efficient
> (and not difficult to check vm_type) to have KVM defaults to "private" for confidential VMs
> and defaults to "shared" for normal VMs.

If done right, the default shouldn't matter all that much for efficiency.  KVM
needs to be able to effeciently track large ranges regardless of the default,
otherwise the memory overhead and the presumably cost of lookups will be painful.
E.g. converting a 1GiB chunk to shared should ideally require one entry, not 256k
entries.

Looks like that behavior was changed in v8 in response to feedback[*] that doing
xa_store_range() on a subset of an existing range (entry) would overwrite the
entire existing range (entry), not just the smaller subset.  xa_store_range() does
appear to be too simplistic for this use case, but looking at __filemap_add_folio(),
splitting an existing entry isn't super complex.

Using xa_store() for the very initial implementation is ok, and probably a good
idea since it's more obviously correct and will give us a bisection point.  But
we definitely want a more performant implementation sooner than later.  The hardest
part will likely be merging existing entries, but that can be done separately too,
and is probably lower priority.

E.g. (1) use xa_store() and always track at 4KiB granularity, (2) support storing
metadata in multi-index entries, and finally (3) support merging adjacent entries
with identical values.

[*] https://lore.kernel.org/all/CAGtprH9xyw6bt4=RBWF6-v2CSpabOCpKq5rPz+e-9co7EisoVQ@mail.gmail.com
Chao Peng Jan. 5, 2023, 4:39 a.m. UTC | #13
On Tue, Jan 03, 2023 at 11:06:37PM +0000, Sean Christopherson wrote:
> On Tue, Jan 03, 2023, Wang, Wei W wrote:
> > On Tuesday, January 3, 2023 9:40 AM, Chao Peng wrote:
> > > > Because guest memory defaults to private, and now this patch stores
> > > > the attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of
> > > _SHARED,
> > > > it would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of
> > > > boot time. Maybe it can be optimized somehow in other places? e.g. set
> > > > mem attr in advance.
> > > 
> > > KVM defaults to 'shared' because this ioctl can also be potentially used by
> > > normal VMs and 'shared' sounds a value meaningful for both normal VMs and
> > > confidential VMs. 
> > 
> > Do you mean a normal VM could have pages marked private? What's the usage?
> > (If all the pages are just marked shared for normal VMs, then why do we need it)
> 
> No, there are potential use cases for per-page attribute/permissions, e.g. to
> make select pages read-only, exec-only, no-exec, etc...

Right, normal VMs are not likely use private/shared bit. Not sure pKVM,
but perhaps not call it 'normal' VMs in this context. But since the
ioctl can be used by normal VMs for other bits (read-only, exec-only,
no-exec, etc), a default 'private' looks strange for them. That's why I
default it to 'shared' and for confidential guest, we can issue another
call to this ioctl to set all the memory to 'private' before guest
booting, if default 'private' is needed for guest.

Like Wei mentioned, it's also possible to make the default dependents on
vm_type, but that looks awkward to me from the API definition as well as
the implementation, also the vm_type has not been introduced at this time.

> 
> > > As for more KVM_EXIT_MEMORY_FAULT exits during the
> > > booting time, yes, setting all memory to 'private' for confidential VMs through
> > > this ioctl in userspace before guest launch is an approach for KVM userspace to
> > > 'override' the KVM default and reduce the number of implicit conversions.
> > 
> > Most pages of a confidential VM are likely to be private pages. It seems more efficient
> > (and not difficult to check vm_type) to have KVM defaults to "private" for confidential VMs
> > and defaults to "shared" for normal VMs.
> 
> If done right, the default shouldn't matter all that much for efficiency.  KVM
> needs to be able to effeciently track large ranges regardless of the default,
> otherwise the memory overhead and the presumably cost of lookups will be painful.
> E.g. converting a 1GiB chunk to shared should ideally require one entry, not 256k
> entries.

I agree, KVM should have the ability to track large ranges efficiently.

> 
> Looks like that behavior was changed in v8 in response to feedback[*] that doing
> xa_store_range() on a subset of an existing range (entry) would overwrite the
> entire existing range (entry), not just the smaller subset.  xa_store_range() does
> appear to be too simplistic for this use case, but looking at __filemap_add_folio(),
> splitting an existing entry isn't super complex.

Yes, xa_store_range() looks a perfect match for us initially but the
'overwriting the entire entry' behavior makes it incorrect for us when
storing a subset on an existing large entry. xarray lib has utilities
for splitting, the hard part is merging existing entries, as you also
said below. Thanks for pointing out the __filemap_add_folio() example,
it does look not too complex for splitting.

> 
> Using xa_store() for the very initial implementation is ok, and probably a good
> idea since it's more obviously correct and will give us a bisection point.  But
> we definitely want a more performant implementation sooner than later.  The hardest
> part will likely be merging existing entries, but that can be done separately too,
> and is probably lower priority.
> 
> E.g. (1) use xa_store() and always track at 4KiB granularity, (2) support storing
> metadata in multi-index entries, and finally (3) support merging adjacent entries
> with identical values.

This path looks good to me.

Thanks,
Chao
> 
> [*] https://lore.kernel.org/all/CAGtprH9xyw6bt4=RBWF6-v2CSpabOCpKq5rPz+e-9co7EisoVQ@mail.gmail.com
Sean Christopherson Jan. 13, 2023, 10:02 p.m. UTC | #14
On Fri, Dec 02, 2022, Chao Peng wrote:
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index fbeaa9ddef59..a8e379a3afee 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -49,6 +49,7 @@ config KVM
>  	select SRCU
>  	select INTERVAL_TREE
>  	select HAVE_KVM_PM_NOTIFIER if PM
> +	select HAVE_KVM_MEMORY_ATTRIBUTES

I would prefer to call this KVM_GENERIC_MEMORY_ATTRIBUTES.  Similar to
KVM_GENERIC_HARDWARE_ENABLING, ARM does need/have hardware enabling, it just
doesn't want KVM's generic implementation.  In this case, pKVM does support memory
attributes, but uses stage-2 tables to track ownership and doesn't need/want the
overhead of the generic implementation.

>  	help

...

> +#define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> +#define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> +#define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> +#define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)

I think we should carve out bits 0-2 for RWX, but I don't think we should actually
define them until they're actually accepted by KVM.

> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> +					   struct kvm_memory_attributes *attrs)
> +{
> +	gfn_t start, end;
> +	unsigned long i;
> +	void *entry;
> +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> +	/* flags is currently not used. */
> +	if (attrs->flags)
> +		return -EINVAL;
> +	if (attrs->attributes & ~supported_attrs)

Nit, no need for "supported_attrs", just consume kvm_supported_mem_attributes()
directly.

> +		return -EINVAL;
> +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> +		return -EINVAL;
> +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> +		return -EINVAL;
> +
> +	start = attrs->address >> PAGE_SHIFT;
> +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
> +	mutex_lock(&kvm->lock);

Peeking forward multiple patches, this needs to take kvm->slots_lock, not kvm->lock.
There's a bug in the lpage_disallowed patch that I believe can most easily be
solved by making this mutually exclusive with memslot changes.

When a memslot is created, KVM needs to walk through the attributes to detect
whether or not the attributes are identical for the entire slot.  To avoid races,
that means taking slots_lock.

The alternative would be to query the attributes when adjusting the hugepage level
and avoid lpage_disallowed entirely, but in the (very brief) time I've thought
about this I haven't come up with a way to do that in a performant manner.

> +	for (i = start; i < end; i++)

Curly braces needed on the for-loop.
Binbin Wu Jan. 17, 2023, 3:21 a.m. UTC | #15
On 12/2/2022 2:13 PM, Chao Peng wrote:
> In confidential computing usages, whether a page is private or shared is
> necessary information for KVM to perform operations like page fault
> handling, page zapping etc. There are other potential use cases for
> per-page memory attributes, e.g. to make memory read-only (or no-exec,
> or exec-only, etc.) without having to modify memslots.
>
> Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> userspace to operate on the per-page memory attributes.
>    - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
>      a guest memory range.
>    - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
>      memory attributes.
>
> KVM internally uses xarray to store the per-page memory attributes.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> ---
>   Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
>   arch/x86/kvm/Kconfig           |  1 +
>   include/linux/kvm_host.h       |  3 ++
>   include/uapi/linux/kvm.h       | 17 ++++++++

Should the changes introduced in this file also need to be added in 
tools/include/uapi/linux/kvm.h ?



>   virt/kvm/Kconfig               |  3 ++
>   virt/kvm/kvm_main.c            | 76 ++++++++++++++++++++++++++++++++++
>   6 files changed, 163 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 5617bc4f899f..bb2f709c0900 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
>   The "pad" and "reserved" fields may be used for future extensions and should be
>   set to 0s by userspace.
>   
> +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: u64 memory attributes bitmask(out)
> +:Returns: 0 on success, <0 on error
> +
> +Returns supported memory attributes bitmask. Supported memory attributes will
> +have the corresponding bits set in u64 memory attributes bitmask.
> +
> +The following memory attributes are defined::
> +
> +  #define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> +  #define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> +  #define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> +  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> +
> +4.139 KVM_SET_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct kvm_memory_attributes(in/out)
> +:Returns: 0 on success, <0 on error
> +
> +Sets memory attributes for pages in a guest memory range. Parameters are
> +specified via the following structure::
> +
> +  struct kvm_memory_attributes {
> +	__u64 address;
> +	__u64 size;
> +	__u64 attributes;
> +	__u64 flags;
> +  };
> +
> +The user sets the per-page memory attributes to a guest memory range indicated
> +by address/size, and in return KVM adjusts address and size to reflect the
> +actual pages of the memory range have been successfully set to the attributes.
> +If the call returns 0, "address" is updated to the last successful address + 1
> +and "size" is updated to the remaining address size that has not been set
> +successfully. The user should check the return value as well as the size to
> +decide if the operation succeeded for the whole range or not. The user may want
> +to retry the operation with the returned address/size if the previous range was
> +partially successful.
> +
> +Both address and size should be page aligned and the supported attributes can be
> +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> +
> +The "flags" field may be used for future extensions and should be set to 0s.
> +
>   5. The kvm_run structure
>   ========================
>   
> @@ -8270,6 +8323,16 @@ structure.
>   When getting the Modified Change Topology Report value, the attr->addr
>   must point to a byte where the value will be stored or retrieved from.
>   
> +8.40 KVM_CAP_MEMORY_ATTRIBUTES
> +------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm
> +
> +This capability indicates KVM supports per-page memory attributes and ioctls
> +KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
> +
>   9. Known KVM API problems
>   =========================
>   
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index fbeaa9ddef59..a8e379a3afee 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -49,6 +49,7 @@ config KVM
>   	select SRCU
>   	select INTERVAL_TREE
>   	select HAVE_KVM_PM_NOTIFIER if PM
> +	select HAVE_KVM_MEMORY_ATTRIBUTES
>   	help
>   	  Support hosting fully virtualized guest machines using hardware
>   	  virtualization extensions.  You will need a fairly recent
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8f874a964313..a784e2b06625 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -800,6 +800,9 @@ struct kvm {
>   
>   #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
>   	struct notifier_block pm_notifier;
> +#endif
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	struct xarray mem_attr_array;
>   #endif
>   	char stats_id[KVM_STATS_NAME_SIZE];
>   };
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 64dfe9c07c87..5d0941acb5bb 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1182,6 +1182,7 @@ struct kvm_ppc_resize_hpt {
>   #define KVM_CAP_S390_CPU_TOPOLOGY 222
>   #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
>   #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
> +#define KVM_CAP_MEMORY_ATTRIBUTES 225
>   
>   #ifdef KVM_CAP_IRQ_ROUTING
>   
> @@ -2238,4 +2239,20 @@ struct kvm_s390_zpci_op {
>   /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
>   #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
>   
> +/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
> +#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES    _IOR(KVMIO,  0xd2, __u64)
> +#define KVM_SET_MEMORY_ATTRIBUTES              _IOWR(KVMIO,  0xd3, struct kvm_memory_attributes)
> +
> +struct kvm_memory_attributes {
> +	__u64 address;
> +	__u64 size;
> +	__u64 attributes;
> +	__u64 flags;
> +};
> +
> +#define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> +#define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> +#define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> +#define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> +
>   #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 800f9470e36b..effdea5dd4f0 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -19,6 +19,9 @@ config HAVE_KVM_IRQ_ROUTING
>   config HAVE_KVM_DIRTY_RING
>          bool
>   
> +config HAVE_KVM_MEMORY_ATTRIBUTES
> +       bool
> +
>   # Only strongly ordered architectures can select this, as it doesn't
>   # put any explicit constraint on userspace ordering. They can also
>   # select the _ACQ_REL version.
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1782c4555d94..7f0f5e9f2406 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>   	spin_lock_init(&kvm->mn_invalidate_lock);
>   	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
>   	xa_init(&kvm->vcpu_array);
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	xa_init(&kvm->mem_attr_array);
> +#endif
>   
>   	INIT_LIST_HEAD(&kvm->gpc_list);
>   	spin_lock_init(&kvm->gpc_lock);
> @@ -1323,6 +1326,9 @@ static void kvm_destroy_vm(struct kvm *kvm)
>   		kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
>   		kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
>   	}
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	xa_destroy(&kvm->mem_attr_array);
> +#endif
>   	cleanup_srcu_struct(&kvm->irq_srcu);
>   	cleanup_srcu_struct(&kvm->srcu);
>   	kvm_arch_free_vm(kvm);
> @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
>   }
>   #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
>   
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> +{
> +	return 0;
> +}
> +
> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> +					   struct kvm_memory_attributes *attrs)
> +{
> +	gfn_t start, end;
> +	unsigned long i;
> +	void *entry;
> +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> +	/* flags is currently not used. */
> +	if (attrs->flags)
> +		return -EINVAL;
> +	if (attrs->attributes & ~supported_attrs)
> +		return -EINVAL;
> +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> +		return -EINVAL;
> +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> +		return -EINVAL;
> +
> +	start = attrs->address >> PAGE_SHIFT;
> +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
> +	mutex_lock(&kvm->lock);
> +	for (i = start; i < end; i++)
> +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> +				    GFP_KERNEL_ACCOUNT)))
> +			break;
> +	mutex_unlock(&kvm->lock);
> +
> +	attrs->address = i << PAGE_SHIFT;
> +	attrs->size = (end - i) << PAGE_SHIFT;
> +
> +	return 0;
> +}
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
>   struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
>   {
>   	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
> @@ -4459,6 +4508,9 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>   #ifdef CONFIG_HAVE_KVM_MSI
>   	case KVM_CAP_SIGNAL_MSI:
>   #endif
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	case KVM_CAP_MEMORY_ATTRIBUTES:
> +#endif
>   #ifdef CONFIG_HAVE_KVM_IRQFD
>   	case KVM_CAP_IRQFD:
>   	case KVM_CAP_IRQFD_RESAMPLE:
> @@ -4804,6 +4856,30 @@ static long kvm_vm_ioctl(struct file *filp,
>   		break;
>   	}
>   #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +	case KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES: {
> +		u64 attrs = kvm_supported_mem_attributes(kvm);
> +
> +		r = -EFAULT;
> +		if (copy_to_user(argp, &attrs, sizeof(attrs)))
> +			goto out;
> +		r = 0;
> +		break;
> +	}
> +	case KVM_SET_MEMORY_ATTRIBUTES: {
> +		struct kvm_memory_attributes attrs;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&attrs, argp, sizeof(attrs)))
> +			goto out;
> +
> +		r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
> +
> +		if (!r && copy_to_user(argp, &attrs, sizeof(attrs)))
> +			r = -EFAULT;
> +		break;
> +	}
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
>   	case KVM_CREATE_DEVICE: {
>   		struct kvm_create_device cd;
>
Chao Peng Jan. 17, 2023, 1:30 p.m. UTC | #16
On Tue, Jan 17, 2023 at 11:21:10AM +0800, Binbin Wu wrote:
> 
> On 12/2/2022 2:13 PM, Chao Peng wrote:
> > In confidential computing usages, whether a page is private or shared is
> > necessary information for KVM to perform operations like page fault
> > handling, page zapping etc. There are other potential use cases for
> > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > or exec-only, etc.) without having to modify memslots.
> > 
> > Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > userspace to operate on the per-page memory attributes.
> >    - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> >      a guest memory range.
> >    - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> >      memory attributes.
> > 
> > KVM internally uses xarray to store the per-page memory attributes.
> > 
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> > ---
> >   Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
> >   arch/x86/kvm/Kconfig           |  1 +
> >   include/linux/kvm_host.h       |  3 ++
> >   include/uapi/linux/kvm.h       | 17 ++++++++
> 
> Should the changes introduced in this file also need to be added in
> tools/include/uapi/linux/kvm.h ?

Yes I think. But I'm hesitate to include in this patch or not. I see
many commits sync kernel kvm.h to tools's copy. Looks that is done
periodically and with a 'pull' model.

Chao
Sean Christopherson Jan. 17, 2023, 5:25 p.m. UTC | #17
On Tue, Jan 17, 2023, Chao Peng wrote:
> On Tue, Jan 17, 2023 at 11:21:10AM +0800, Binbin Wu wrote:
> > 
> > On 12/2/2022 2:13 PM, Chao Peng wrote:
> > > In confidential computing usages, whether a page is private or shared is
> > > necessary information for KVM to perform operations like page fault
> > > handling, page zapping etc. There are other potential use cases for
> > > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > > or exec-only, etc.) without having to modify memslots.
> > > 
> > > Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > > userspace to operate on the per-page memory attributes.
> > >    - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> > >      a guest memory range.
> > >    - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> > >      memory attributes.
> > > 
> > > KVM internally uses xarray to store the per-page memory attributes.
> > > 
> > > Suggested-by: Sean Christopherson <seanjc@google.com>
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com/
> > > ---
> > >   Documentation/virt/kvm/api.rst | 63 ++++++++++++++++++++++++++++
> > >   arch/x86/kvm/Kconfig           |  1 +
> > >   include/linux/kvm_host.h       |  3 ++
> > >   include/uapi/linux/kvm.h       | 17 ++++++++
> > 
> > Should the changes introduced in this file also need to be added in
> > tools/include/uapi/linux/kvm.h ?
> 
> Yes I think.

I'm not sure how Paolo or others feel, but my preference is to never update KVM's
uapi headers in tools/ in KVM's tree.  Nothing KVM-related in tools/ actually
relies on the headers being copied into tools/, e.g. KVM selftests pulls KVM's
headers from the .../usr/include/ directory that's populated by `make headers_install`.

Perf's tooling is what actually "needs" the headers to be copied into tools/, so
my preference is to let the tools/perf maintainers deal with the headache of keeping
everything up-to-date.

> But I'm hesitate to include in this patch or not. I see many commits sync
> kernel kvm.h to tools's copy. Looks that is done periodically and with a
> 'pull' model.
Isaku Yamahata Feb. 9, 2023, 7:25 a.m. UTC | #18
On Fri, Dec 02, 2022 at 02:13:40PM +0800,
Chao Peng <chao.p.peng@linux.intel.com> wrote:

> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> +					   struct kvm_memory_attributes *attrs)
> +{
> +	gfn_t start, end;
> +	unsigned long i;
> +	void *entry;
> +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> +	/* flags is currently not used. */
> +	if (attrs->flags)
> +		return -EINVAL;
> +	if (attrs->attributes & ~supported_attrs)
> +		return -EINVAL;
> +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> +		return -EINVAL;
> +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> +		return -EINVAL;
> +
> +	start = attrs->address >> PAGE_SHIFT;
> +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
> +	mutex_lock(&kvm->lock);
> +	for (i = start; i < end; i++)
> +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> +				    GFP_KERNEL_ACCOUNT)))
> +			break;
> +	mutex_unlock(&kvm->lock);
> +
> +	attrs->address = i << PAGE_SHIFT;
> +	attrs->size = (end - i) << PAGE_SHIFT;
> +
> +	return 0;
> +}
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +

If memslot isn't private, it should return error if private attribute is set.
Something like following check is needed.

+       if (attrs->flags & KVM_MEM_PRIVATE) {
+               /* non-private memory slot doesn't allow KVM_MEM_PRIVATE */
+               for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
+                       struct kvm_memslot_iter iter;
+                       struct kvm_memslots *slots;
+
+                       slots = __kvm_memslots(kvm, i);
+                       kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
+                               if (!kvm_slot_can_be_private(iter.slot)) {
+                                       mutex_unlock(&kvm->slots_lock);
+                                       return -EINVAL;
+                               }
+                       }
+               }
+       }
+
Sean Christopherson Feb. 10, 2023, 12:35 a.m. UTC | #19
On Wed, Feb 08, 2023, Isaku Yamahata wrote:
> On Fri, Dec 02, 2022 at 02:13:40PM +0800,
> Chao Peng <chao.p.peng@linux.intel.com> wrote:
> 
> > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > +					   struct kvm_memory_attributes *attrs)
> > +{
> > +	gfn_t start, end;
> > +	unsigned long i;
> > +	void *entry;
> > +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > +
> > +	/* flags is currently not used. */
> > +	if (attrs->flags)
> > +		return -EINVAL;
> > +	if (attrs->attributes & ~supported_attrs)
> > +		return -EINVAL;
> > +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > +		return -EINVAL;
> > +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > +		return -EINVAL;
> > +
> > +	start = attrs->address >> PAGE_SHIFT;
> > +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > +
> > +	mutex_lock(&kvm->lock);
> > +	for (i = start; i < end; i++)
> > +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > +				    GFP_KERNEL_ACCOUNT)))
> > +			break;
> > +	mutex_unlock(&kvm->lock);
> > +
> > +	attrs->address = i << PAGE_SHIFT;
> > +	attrs->size = (end - i) << PAGE_SHIFT;
> > +
> > +	return 0;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > +
> 
> If memslot isn't private, it should return error if private attribute is set.

Why?  I'd rather keep the two things separate.  If we enforce this sort of thing
at KVM_SET_MEMORY_ATTRIBUTES, then we also have to enforce it at
KVM_SET_USER_MEMORY_REGION.
Isaku Yamahata Feb. 13, 2023, 11:53 p.m. UTC | #20
On Fri, Feb 10, 2023 at 12:35:30AM +0000,
Sean Christopherson <seanjc@google.com> wrote:

> On Wed, Feb 08, 2023, Isaku Yamahata wrote:
> > On Fri, Dec 02, 2022 at 02:13:40PM +0800,
> > Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > 
> > > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > +					   struct kvm_memory_attributes *attrs)
> > > +{
> > > +	gfn_t start, end;
> > > +	unsigned long i;
> > > +	void *entry;
> > > +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > > +
> > > +	/* flags is currently not used. */
> > > +	if (attrs->flags)
> > > +		return -EINVAL;
> > > +	if (attrs->attributes & ~supported_attrs)
> > > +		return -EINVAL;
> > > +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > > +		return -EINVAL;
> > > +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > > +		return -EINVAL;
> > > +
> > > +	start = attrs->address >> PAGE_SHIFT;
> > > +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > > +
> > > +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > > +
> > > +	mutex_lock(&kvm->lock);
> > > +	for (i = start; i < end; i++)
> > > +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > > +				    GFP_KERNEL_ACCOUNT)))
> > > +			break;
> > > +	mutex_unlock(&kvm->lock);
> > > +
> > > +	attrs->address = i << PAGE_SHIFT;
> > > +	attrs->size = (end - i) << PAGE_SHIFT;
> > > +
> > > +	return 0;
> > > +}
> > > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > > +
> > 
> > If memslot isn't private, it should return error if private attribute is set.
> 
> Why?  I'd rather keep the two things separate.  If we enforce this sort of thing
> at KVM_SET_MEMORY_ATTRIBUTES, then we also have to enforce it at
> KVM_SET_USER_MEMORY_REGION.

For device assignment via shared GPA, non-private memory slot needs to be
allowed.
Sean Christopherson Feb. 14, 2023, 6:07 p.m. UTC | #21
On Mon, Feb 13, 2023, Isaku Yamahata wrote:
> On Fri, Feb 10, 2023 at 12:35:30AM +0000,
> Sean Christopherson <seanjc@google.com> wrote:
> 
> > On Wed, Feb 08, 2023, Isaku Yamahata wrote:
> > > On Fri, Dec 02, 2022 at 02:13:40PM +0800,
> > > Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > > 
> > > > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > > +					   struct kvm_memory_attributes *attrs)
> > > > +{
> > > > +	gfn_t start, end;
> > > > +	unsigned long i;
> > > > +	void *entry;
> > > > +	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > > > +
> > > > +	/* flags is currently not used. */
> > > > +	if (attrs->flags)
> > > > +		return -EINVAL;
> > > > +	if (attrs->attributes & ~supported_attrs)
> > > > +		return -EINVAL;
> > > > +	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > > > +		return -EINVAL;
> > > > +	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > > > +		return -EINVAL;
> > > > +
> > > > +	start = attrs->address >> PAGE_SHIFT;
> > > > +	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > > > +
> > > > +	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > > > +
> > > > +	mutex_lock(&kvm->lock);
> > > > +	for (i = start; i < end; i++)
> > > > +		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > > > +				    GFP_KERNEL_ACCOUNT)))
> > > > +			break;
> > > > +	mutex_unlock(&kvm->lock);
> > > > +
> > > > +	attrs->address = i << PAGE_SHIFT;
> > > > +	attrs->size = (end - i) << PAGE_SHIFT;
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > > > +
> > > 
> > > If memslot isn't private, it should return error if private attribute is set.
> > 
> > Why?  I'd rather keep the two things separate.  If we enforce this sort of thing
> > at KVM_SET_MEMORY_ATTRIBUTES, then we also have to enforce it at
> > KVM_SET_USER_MEMORY_REGION.
> 
> For device assignment via shared GPA, non-private memory slot needs to be
> allowed.

That doesn't say anything about why setting attributes needs to poke into the
memslot.  The page fault path already kicks out to userspace if there's a
discrepancy between the attributes and the memslot, why is that insufficient?
Nicolas Saenz Julienne May 19, 2023, 5:32 p.m. UTC | #22
Hi,

On Fri Dec 2, 2022 at 6:13 AM UTC, Chao Peng wrote:

[...]

> +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: u64 memory attributes bitmask(out)
> +:Returns: 0 on success, <0 on error
> +
> +Returns supported memory attributes bitmask. Supported memory attributes will
> +have the corresponding bits set in u64 memory attributes bitmask.
> +
> +The following memory attributes are defined::
> +
> +  #define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
> +  #define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> +  #define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
> +  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> +
> +4.139 KVM_SET_MEMORY_ATTRIBUTES
> +-----------------------------------------
> +
> +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> +:Architectures: x86
> +:Type: vm ioctl
> +:Parameters: struct kvm_memory_attributes(in/out)
> +:Returns: 0 on success, <0 on error
> +
> +Sets memory attributes for pages in a guest memory range. Parameters are
> +specified via the following structure::
> +
> +  struct kvm_memory_attributes {
> +	__u64 address;
> +	__u64 size;
> +	__u64 attributes;
> +	__u64 flags;
> +  };
> +
> +The user sets the per-page memory attributes to a guest memory range indicated
> +by address/size, and in return KVM adjusts address and size to reflect the
> +actual pages of the memory range have been successfully set to the attributes.
> +If the call returns 0, "address" is updated to the last successful address + 1
> +and "size" is updated to the remaining address size that has not been set
> +successfully. The user should check the return value as well as the size to
> +decide if the operation succeeded for the whole range or not. The user may want
> +to retry the operation with the returned address/size if the previous range was
> +partially successful.
> +
> +Both address and size should be page aligned and the supported attributes can be
> +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> +
> +The "flags" field may be used for future extensions and should be set to 0s.

We have been looking into adding support for the Hyper-V VSM extensions
which Windows uses to implement Credential Guard. This interface seems
like a good fit for one of its underlying features. I just wanted to
share a bit about it, and see if we can expand it to fit this use-case.
Note that this was already briefly discussed between Sean and Alex some
time ago[1].

VSM introduces isolated guest execution contexts called Virtual Trust
Levels (VTL) [2]. Each VTL has its own memory access protections,
virtual processors states, interrupt controllers and overlay pages. VTLs
are hierarchical and might enforce memory protections on less privileged
VTLs. Memory protections are enforced on a per-GPA granularity.

The list of possible protections is:
- No access -- This needs a new memory attribute, I think.
- Read-only, no execute
- Read-only, execute
- Read/write, no execute
- Read/write, execute

We implemented this in the past by using a separate address space per
VTL and updating memory regions on protection changes. But having to
update the memory slot layout for every permission change scales poorly,
especially as we have to perform 100.000s of these operations at boot
(see [1] for a little more context).

I believe the biggest barrier for us to use memory attributes is not
having the ability to target specific address spaces, or to the very
least having some mechanism to maintain multiple independent layers of
attributes.

Also sorry for not posting our VSM patches. They are not ready for
upstream review yet, but we're working on it.

Nicolas

[1] https://patchwork.kernel.org/comment/25054908/
[2] See Chapter 15 of Microsoft's 'Hypervisor Top Level Functional Specification':
    https://raw.githubusercontent.com/MicrosoftDocs/Virtualization-Documentation/main/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf
Sean Christopherson May 19, 2023, 6:23 p.m. UTC | #23
On Fri, May 19, 2023, Nicolas Saenz Julienne wrote:
> Hi,
> 
> On Fri Dec 2, 2022 at 6:13 AM UTC, Chao Peng wrote:
> 
> [...]
> > +The user sets the per-page memory attributes to a guest memory range indicated
> > +by address/size, and in return KVM adjusts address and size to reflect the
> > +actual pages of the memory range have been successfully set to the attributes.
> > +If the call returns 0, "address" is updated to the last successful address + 1
> > +and "size" is updated to the remaining address size that has not been set
> > +successfully. The user should check the return value as well as the size to
> > +decide if the operation succeeded for the whole range or not. The user may want
> > +to retry the operation with the returned address/size if the previous range was
> > +partially successful.
> > +
> > +Both address and size should be page aligned and the supported attributes can be
> > +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> > +
> > +The "flags" field may be used for future extensions and should be set to 0s.
> 
> We have been looking into adding support for the Hyper-V VSM extensions
> which Windows uses to implement Credential Guard. This interface seems
> like a good fit for one of its underlying features. I just wanted to
> share a bit about it, and see if we can expand it to fit this use-case.
> Note that this was already briefly discussed between Sean and Alex some
> time ago[1].
> 
> VSM introduces isolated guest execution contexts called Virtual Trust
> Levels (VTL) [2]. Each VTL has its own memory access protections,
> virtual processors states, interrupt controllers and overlay pages. VTLs
> are hierarchical and might enforce memory protections on less privileged
> VTLs. Memory protections are enforced on a per-GPA granularity.
> 
> The list of possible protections is:
> - No access -- This needs a new memory attribute, I think.

No, if KVM provides three bits for READ, WRITE, and EXECUTE, then userspace can
get all the possible combinations.  E.g. this is RWX=000b

> - Read-only, no execute

RWX=100b (using my completely arbitrary ordering of RWX bits :-) )

> - Read-only, execute

RWX=101b

> - Read/write, no execute

RWX=110b

> - Read/write, execute

RWX=111b

> We implemented this in the past by using a separate address space per
> VTL and updating memory regions on protection changes. But having to
> update the memory slot layout for every permission change scales poorly,
> especially as we have to perform 100.000s of these operations at boot
> (see [1] for a little more context).
> 
> I believe the biggest barrier for us to use memory attributes is not
> having the ability to target specific address spaces, or to the very
> least having some mechanism to maintain multiple independent layers of
> attributes.

Can you elaborate on "specific address spaces"?  In KVM, that usually means SMM,
but the VTL comment above makes me think you're talking about something entirely
different.  E.g. can you provide a brief summary of the requirements/expectations?

> Also sorry for not posting our VSM patches. They are not ready for
> upstream review yet, but we're working on it.
> 
> Nicolas
> 
> [1] https://patchwork.kernel.org/comment/25054908/
> [2] See Chapter 15 of Microsoft's 'Hypervisor Top Level Functional Specification':
>     https://raw.githubusercontent.com/MicrosoftDocs/Virtualization-Documentation/main/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf
Nicolas Saenz Julienne May 19, 2023, 7:49 p.m. UTC | #24
Hi Sean,

On Fri May 19, 2023 at 6:23 PM UTC, Sean Christopherson wrote:
> On Fri, May 19, 2023, Nicolas Saenz Julienne wrote:
> > Hi,
> >
> > On Fri Dec 2, 2022 at 6:13 AM UTC, Chao Peng wrote:
> >
> > [...]
> > > +The user sets the per-page memory attributes to a guest memory range indicated
> > > +by address/size, and in return KVM adjusts address and size to reflect the
> > > +actual pages of the memory range have been successfully set to the attributes.
> > > +If the call returns 0, "address" is updated to the last successful address + 1
> > > +and "size" is updated to the remaining address size that has not been set
> > > +successfully. The user should check the return value as well as the size to
> > > +decide if the operation succeeded for the whole range or not. The user may want
> > > +to retry the operation with the returned address/size if the previous range was
> > > +partially successful.
> > > +
> > > +Both address and size should be page aligned and the supported attributes can be
> > > +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> > > +
> > > +The "flags" field may be used for future extensions and should be set to 0s.
> >
> > We have been looking into adding support for the Hyper-V VSM extensions
> > which Windows uses to implement Credential Guard. This interface seems
> > like a good fit for one of its underlying features. I just wanted to
> > share a bit about it, and see if we can expand it to fit this use-case.
> > Note that this was already briefly discussed between Sean and Alex some
> > time ago[1].
> >
> > VSM introduces isolated guest execution contexts called Virtual Trust
> > Levels (VTL) [2]. Each VTL has its own memory access protections,
> > virtual processors states, interrupt controllers and overlay pages. VTLs
> > are hierarchical and might enforce memory protections on less privileged
> > VTLs. Memory protections are enforced on a per-GPA granularity.
> >
> > The list of possible protections is:
> > - No access -- This needs a new memory attribute, I think.
>
> No, if KVM provides three bits for READ, WRITE, and EXECUTE, then userspace can
> get all the possible combinations.  E.g. this is RWX=000b

That's not what the current implementation does, when attributes is
equal 0 it clears the entries from the xarray:

static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
					   struct kvm_memory_attributes *attrs)
{

    entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
[...]
    for (i = start; i < end; i++)
    	if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
    			    GFP_KERNEL_ACCOUNT)))
        		break;
}

From Documentation/core-api/xarray.rst:

"There is no difference between an entry that has never
been stored to, one that has been erased and one that has most recently
had ``NULL`` stored to it."

The way I understood the series, there needs to be a differentiation
between no attributes (regular page fault) and no-access.

> > We implemented this in the past by using a separate address space per
> > VTL and updating memory regions on protection changes. But having to
> > update the memory slot layout for every permission change scales poorly,
> > especially as we have to perform 100.000s of these operations at boot
> > (see [1] for a little more context).
> >
> > I believe the biggest barrier for us to use memory attributes is not
> > having the ability to target specific address spaces, or to the very
> > least having some mechanism to maintain multiple independent layers of
> > attributes.
>
> Can you elaborate on "specific address spaces"?  In KVM, that usually means SMM,
> but the VTL comment above makes me think you're talking about something entirely
> different.  E.g. can you provide a brief summary of the requirements/expectations?

I'll do so with a clear head on Monday. :)

Thanks!
Nicolas
Sean Christopherson May 19, 2023, 7:57 p.m. UTC | #25
On Fri, May 19, 2023, Nicolas Saenz Julienne wrote:
> Hi Sean,
> 
> On Fri May 19, 2023 at 6:23 PM UTC, Sean Christopherson wrote:
> > On Fri, May 19, 2023, Nicolas Saenz Julienne wrote:
> > > Hi,
> > >
> > > On Fri Dec 2, 2022 at 6:13 AM UTC, Chao Peng wrote:
> > >
> > > [...]
> > > > +The user sets the per-page memory attributes to a guest memory range indicated
> > > > +by address/size, and in return KVM adjusts address and size to reflect the
> > > > +actual pages of the memory range have been successfully set to the attributes.
> > > > +If the call returns 0, "address" is updated to the last successful address + 1
> > > > +and "size" is updated to the remaining address size that has not been set
> > > > +successfully. The user should check the return value as well as the size to
> > > > +decide if the operation succeeded for the whole range or not. The user may want
> > > > +to retry the operation with the returned address/size if the previous range was
> > > > +partially successful.
> > > > +
> > > > +Both address and size should be page aligned and the supported attributes can be
> > > > +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> > > > +
> > > > +The "flags" field may be used for future extensions and should be set to 0s.
> > >
> > > We have been looking into adding support for the Hyper-V VSM extensions
> > > which Windows uses to implement Credential Guard. This interface seems
> > > like a good fit for one of its underlying features. I just wanted to
> > > share a bit about it, and see if we can expand it to fit this use-case.
> > > Note that this was already briefly discussed between Sean and Alex some
> > > time ago[1].
> > >
> > > VSM introduces isolated guest execution contexts called Virtual Trust
> > > Levels (VTL) [2]. Each VTL has its own memory access protections,
> > > virtual processors states, interrupt controllers and overlay pages. VTLs
> > > are hierarchical and might enforce memory protections on less privileged
> > > VTLs. Memory protections are enforced on a per-GPA granularity.
> > >
> > > The list of possible protections is:
> > > - No access -- This needs a new memory attribute, I think.
> >
> > No, if KVM provides three bits for READ, WRITE, and EXECUTE, then userspace can
> > get all the possible combinations.  E.g. this is RWX=000b
> 
> That's not what the current implementation does, when attributes is
> equal 0 it clears the entries from the xarray:
> 
> static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> 					   struct kvm_memory_attributes *attrs)
> {
> 
>     entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> [...]
>     for (i = start; i < end; i++)
>     	if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
>     			    GFP_KERNEL_ACCOUNT)))
>         		break;
> }
> 
> >From Documentation/core-api/xarray.rst:
> 
> "There is no difference between an entry that has never
> been stored to, one that has been erased and one that has most recently
> had ``NULL`` stored to it."
> 
> The way I understood the series, there needs to be a differentiation
> between no attributes (regular page fault) and no-access.

Ah, I see what you're saying.  There are multiple ways to solve things without a
"no access" flag while still maintaining an empty xarray for the default case.
E.g. invert the flags to be DENY flags[*], have an internal-only "entry valid" flag,
etc.

[*] I vaguely recall suggesting a "deny" approach somewhere, but I may just be
    making things up to make it look like I thought deeply about this ;-)
Nicolas Saenz Julienne May 23, 2023, 6:59 p.m. UTC | #26
Hi Sean,

On Fri May 19, 2023 at 6:23 PM UTC, Sean Christopherson wrote:
> On Fri, May 19, 2023, Nicolas Saenz Julienne wrote:
> > Hi,
> > On Fri Dec 2, 2022 at 6:13 AM UTC, Chao Peng wrote:

[...]

> > VSM introduces isolated guest execution contexts called Virtual Trust
> > Levels (VTL) [2]. Each VTL has its own memory access protections,
> > virtual processors states, interrupt controllers and overlay pages. VTLs
> > are hierarchical and might enforce memory protections on less privileged
> > VTLs. Memory protections are enforced on a per-GPA granularity.
> >
> > We implemented this in the past by using a separate address space per
> > VTL and updating memory regions on protection changes. But having to
> > update the memory slot layout for every permission change scales poorly,
> > especially as we have to perform 100.000s of these operations at boot
> > (see [1] for a little more context).
> >
> > I believe the biggest barrier for us to use memory attributes is not
> > having the ability to target specific address spaces, or to the very
> > least having some mechanism to maintain multiple independent layers of
> > attributes.
>
> Can you elaborate on "specific address spaces"?  In KVM, that usually means SMM,
> but the VTL comment above makes me think you're talking about something entirely
> different.  E.g. can you provide a brief summary of the requirements/expectations?

Let me refresh some concepts first. VTLs are vCPU modes implemented by
the hypervisor. Lower VTLs switch into higher VTLs [1] through a
hypercall or asynchronously through interrupts. Each VTL has its own CPU
architectural state, lapic and MSR state (applies to only some MSRs).
These are saved/restored when switching VTLS [2]. Additionally, VTLs
share a common GPA->HPA mapping, but protection bits differ depending on
which VTL the CPU is on. Privileged VTLs might revoke R/W/X(+MBEC,
optional) access bits from lower VTLs on a per-GPA basis.

In order to deal with the per-VTL memory protection bits, we extended
the number of KVM address spaces and assigned one to each VTL. The
hypervisor initializes all VTLs address spaces with the same mappings
and protections, they are expected to diverge during runtime. Operations
that rely on memory slots for GPA->HPA/HVA translations (including page
faults) are already address space aware, so adding VTL support was
fairly simple.

Ultimately, when a privileged VTL enforces memory protections on lower
VTLs we update that VTL's address space memory regions to reflect them.
Protection changes are requested through a hypercall, which expects the
new protection to be visible system wide upon returning from it. These
hypercalls happen around 100000+ times during boot, so we introduced an
"atomic memory slot update" API similar to Emanuele's [3] that allows
splitting memory regions/changing permissions concurrent with other
vCPUs.

Now, if we had a way to map memory attributes to specific VTLs, we could
use that instead. Actually, we wouldn't need to extend address spaces at
all to support this (we might still need them to support Overlay Pages,
but that's another story).

Hope it makes a little more sense now. :)

Nicolas

[1] In practice we've only seen VTL0 and VTL1 being used. The spec
    supports up to 16 VTLs.

[2] One can draw an analogy with arm's TrustZone. The hypervisor plays
    the role of EL3. Windows (VTL0) runs in Non-Secure (EL0/EL1) and the
    secure kernel (VTL1) in Secure World (EL1s/EL0s).

[3] https://lore.kernel.org/all/20220909104506.738478-1-eesposit@redhat.com/
diff mbox series

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 5617bc4f899f..bb2f709c0900 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5952,6 +5952,59 @@  delivery must be provided via the "reg_aen" struct.
 The "pad" and "reserved" fields may be used for future extensions and should be
 set to 0s by userspace.
 
+4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
+-----------------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: u64 memory attributes bitmask(out)
+:Returns: 0 on success, <0 on error
+
+Returns supported memory attributes bitmask. Supported memory attributes will
+have the corresponding bits set in u64 memory attributes bitmask.
+
+The following memory attributes are defined::
+
+  #define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
+  #define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
+  #define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
+  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
+
+4.139 KVM_SET_MEMORY_ATTRIBUTES
+-----------------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_memory_attributes(in/out)
+:Returns: 0 on success, <0 on error
+
+Sets memory attributes for pages in a guest memory range. Parameters are
+specified via the following structure::
+
+  struct kvm_memory_attributes {
+	__u64 address;
+	__u64 size;
+	__u64 attributes;
+	__u64 flags;
+  };
+
+The user sets the per-page memory attributes to a guest memory range indicated
+by address/size, and in return KVM adjusts address and size to reflect the
+actual pages of the memory range have been successfully set to the attributes.
+If the call returns 0, "address" is updated to the last successful address + 1
+and "size" is updated to the remaining address size that has not been set
+successfully. The user should check the return value as well as the size to
+decide if the operation succeeded for the whole range or not. The user may want
+to retry the operation with the returned address/size if the previous range was
+partially successful.
+
+Both address and size should be page aligned and the supported attributes can be
+retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
+
+The "flags" field may be used for future extensions and should be set to 0s.
+
 5. The kvm_run structure
 ========================
 
@@ -8270,6 +8323,16 @@  structure.
 When getting the Modified Change Topology Report value, the attr->addr
 must point to a byte where the value will be stored or retrieved from.
 
+8.40 KVM_CAP_MEMORY_ATTRIBUTES
+------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm
+
+This capability indicates KVM supports per-page memory attributes and ioctls
+KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
+
 9. Known KVM API problems
 =========================
 
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index fbeaa9ddef59..a8e379a3afee 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -49,6 +49,7 @@  config KVM
 	select SRCU
 	select INTERVAL_TREE
 	select HAVE_KVM_PM_NOTIFIER if PM
+	select HAVE_KVM_MEMORY_ATTRIBUTES
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8f874a964313..a784e2b06625 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -800,6 +800,9 @@  struct kvm {
 
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 	struct notifier_block pm_notifier;
+#endif
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+	struct xarray mem_attr_array;
 #endif
 	char stats_id[KVM_STATS_NAME_SIZE];
 };
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 64dfe9c07c87..5d0941acb5bb 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1182,6 +1182,7 @@  struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_CPU_TOPOLOGY 222
 #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
 #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
+#define KVM_CAP_MEMORY_ATTRIBUTES 225
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2238,4 +2239,20 @@  struct kvm_s390_zpci_op {
 /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
 #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
 
+/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
+#define KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES    _IOR(KVMIO,  0xd2, __u64)
+#define KVM_SET_MEMORY_ATTRIBUTES              _IOWR(KVMIO,  0xd3, struct kvm_memory_attributes)
+
+struct kvm_memory_attributes {
+	__u64 address;
+	__u64 size;
+	__u64 attributes;
+	__u64 flags;
+};
+
+#define KVM_MEMORY_ATTRIBUTE_READ              (1ULL << 0)
+#define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
+#define KVM_MEMORY_ATTRIBUTE_EXECUTE           (1ULL << 2)
+#define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
+
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 800f9470e36b..effdea5dd4f0 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -19,6 +19,9 @@  config HAVE_KVM_IRQ_ROUTING
 config HAVE_KVM_DIRTY_RING
        bool
 
+config HAVE_KVM_MEMORY_ATTRIBUTES
+       bool
+
 # Only strongly ordered architectures can select this, as it doesn't
 # put any explicit constraint on userspace ordering. They can also
 # select the _ACQ_REL version.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1782c4555d94..7f0f5e9f2406 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1150,6 +1150,9 @@  static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	spin_lock_init(&kvm->mn_invalidate_lock);
 	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
 	xa_init(&kvm->vcpu_array);
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+	xa_init(&kvm->mem_attr_array);
+#endif
 
 	INIT_LIST_HEAD(&kvm->gpc_list);
 	spin_lock_init(&kvm->gpc_lock);
@@ -1323,6 +1326,9 @@  static void kvm_destroy_vm(struct kvm *kvm)
 		kvm_free_memslots(kvm, &kvm->__memslots[i][0]);
 		kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
 	}
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+	xa_destroy(&kvm->mem_attr_array);
+#endif
 	cleanup_srcu_struct(&kvm->irq_srcu);
 	cleanup_srcu_struct(&kvm->srcu);
 	kvm_arch_free_vm(kvm);
@@ -2323,6 +2329,49 @@  static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
 
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+static u64 kvm_supported_mem_attributes(struct kvm *kvm)
+{
+	return 0;
+}
+
+static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
+					   struct kvm_memory_attributes *attrs)
+{
+	gfn_t start, end;
+	unsigned long i;
+	void *entry;
+	u64 supported_attrs = kvm_supported_mem_attributes(kvm);
+
+	/* flags is currently not used. */
+	if (attrs->flags)
+		return -EINVAL;
+	if (attrs->attributes & ~supported_attrs)
+		return -EINVAL;
+	if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
+		return -EINVAL;
+	if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
+		return -EINVAL;
+
+	start = attrs->address >> PAGE_SHIFT;
+	end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
+
+	entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
+
+	mutex_lock(&kvm->lock);
+	for (i = start; i < end; i++)
+		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
+				    GFP_KERNEL_ACCOUNT)))
+			break;
+	mutex_unlock(&kvm->lock);
+
+	attrs->address = i << PAGE_SHIFT;
+	attrs->size = (end - i) << PAGE_SHIFT;
+
+	return 0;
+}
+#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
+
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
 {
 	return __gfn_to_memslot(kvm_memslots(kvm), gfn);
@@ -4459,6 +4508,9 @@  static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #ifdef CONFIG_HAVE_KVM_MSI
 	case KVM_CAP_SIGNAL_MSI:
 #endif
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+	case KVM_CAP_MEMORY_ATTRIBUTES:
+#endif
 #ifdef CONFIG_HAVE_KVM_IRQFD
 	case KVM_CAP_IRQFD:
 	case KVM_CAP_IRQFD_RESAMPLE:
@@ -4804,6 +4856,30 @@  static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
+#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
+	case KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES: {
+		u64 attrs = kvm_supported_mem_attributes(kvm);
+
+		r = -EFAULT;
+		if (copy_to_user(argp, &attrs, sizeof(attrs)))
+			goto out;
+		r = 0;
+		break;
+	}
+	case KVM_SET_MEMORY_ATTRIBUTES: {
+		struct kvm_memory_attributes attrs;
+
+		r = -EFAULT;
+		if (copy_from_user(&attrs, argp, sizeof(attrs)))
+			goto out;
+
+		r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
+
+		if (!r && copy_to_user(argp, &attrs, sizeof(attrs)))
+			r = -EFAULT;
+		break;
+	}
+#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
 	case KVM_CREATE_DEVICE: {
 		struct kvm_create_device cd;