[v3,02/70] RAMBlock: Add support of KVM private guest memfd

Message ID	20231115071519.2864957-3-xiaoyao.li@intel.com (mailing list archive)
State	New, archived
Headers	From: Xiaoyao Li <xiaoyao.li@intel.com>
To: Paolo Bonzini <pbonzini@redhat.com>, David Hildenbrand <david@redhat.com>, Igor Mammedov <imammedo@redhat.com>, "Michael S . Tsirkin" <mst@redhat.com>, Marcel Apfelbaum <marcel.apfelbaum@gmail.com>, Richard Henderson <richard.henderson@linaro.org>, Peter Xu <peterx@redhat.com>, Philippe Mathieu-Daudé <philmd@linaro.org>, Cornelia Huck <cohuck@redhat.com>, Daniel P. Berrangé <berrange@redhat.com>, Eric Blake <eblake@redhat.com>, Markus Armbruster <armbru@redhat.com>, Marcelo Tosatti <mtosatti@redhat.com>
Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org, xiaoyao.li@intel.com, Michael Roth <michael.roth@amd.com>, Sean Christopherson <seanjc@google.com>, Claudio Fontana <cfontana@suse.de>, Gerd Hoffmann <kraxel@redhat.com>, Isaku Yamahata <isaku.yamahata@gmail.com>, Chenyi Qiang <chenyi.qiang@intel.com>
Subject: [PATCH v3 02/70] RAMBlock: Add support of KVM private guest memfd
Date: Wed, 15 Nov 2023 02:14:11 -0500
Message-Id: <20231115071519.2864957-3-xiaoyao.li@intel.com>
In-Reply-To: <20231115071519.2864957-1-xiaoyao.li@intel.com>
References: <20231115071519.2864957-1-xiaoyao.li@intel.com>
Series	QEMU Guest memfd + QEMU TDX support

Xiaoyao Li Nov. 15, 2023, 7:14 a.m. UTC

Add KVM guest_memfd support to RAMBlock so that both normal HVA-based
memory and KVM guest_memfd-based private memory can be associated with
one RAMBlock.

Introduce a new flag, RAM_GUEST_MEMFD. When it is set, a KVM ioctl is
called to create a private guest_memfd during RAMBlock setup.

Note that RAM_GUEST_MEMFD is supposed to be set for the memory backends
of confidential guests, such as TDX VMs. How and when to set it for
memory backends will be implemented in the following patches.

Introduce memory_region_has_guest_memfd() to query whether a MemoryRegion
has a KVM guest_memfd allocated.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
Changes in v3:
- rename gmem to guest_memfd;
- close(guest_memfd) when RAMBlock is released; (Daniel P. Berrangé)
- Squash the patch that introduces memory_region_has_guest_memfd().
---
 accel/kvm/kvm-all.c     | 24 ++++++++++++++++++++++++
 include/exec/memory.h   | 13 +++++++++++++
 include/exec/ramblock.h |  1 +
 include/sysemu/kvm.h    |  2 ++
 system/memory.c         |  5 +++++
 system/physmem.c        | 27 ++++++++++++++++++++++++---
 6 files changed, 69 insertions(+), 3 deletions(-)
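
The lifecycle described in the commit message (the fd field starts at -1,
is filled in only when the flag is set, and is closed when the block is
released) can be sketched in isolation. This is a minimal sketch, not the
patch itself: SketchBlock and the sketch_* functions are hypothetical
names, and an fd on /dev/null stands in for the one that the
KVM_CREATE_GUEST_MEMFD ioctl would return, so that it runs without KVM.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Same bit as RAM_GUEST_MEMFD in the patch; the name is hypothetical. */
#define SKETCH_RAM_GUEST_MEMFD (1u << 12)

/* Hypothetical, heavily reduced stand-in for RAMBlock. */
typedef struct SketchBlock {
    uint32_t flags;
    int guest_memfd;    /* -1 means "no guest_memfd allocated" */
} SketchBlock;

SketchBlock *sketch_block_new(uint32_t flags)
{
    SketchBlock *b = calloc(1, sizeof(*b));

    if (!b) {
        return NULL;
    }
    b->flags = flags;
    b->guest_memfd = -1;    /* initialise before any allocation attempt */

    if (flags & SKETCH_RAM_GUEST_MEMFD) {
        /* Assumption: /dev/null replaces the real KVM ioctl here. */
        b->guest_memfd = open("/dev/null", O_RDONLY);
        if (b->guest_memfd < 0) {
            free(b);
            return NULL;
        }
    }
    return b;
}

/* Mirrors the check in memory_region_has_guest_memfd(). */
bool sketch_block_has_guest_memfd(const SketchBlock *b)
{
    return b && b->guest_memfd >= 0;
}

void sketch_block_free(SketchBlock *b)
{
    if (!b) {
        return;
    }
    if (b->guest_memfd >= 0) {
        close(b->guest_memfd);  /* release the fd together with the block */
    }
    free(b);
}
```

Using -1 as the "not allocated" value is what lets the query helper and
the release path share one simple test (guest_memfd >= 0).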

Daniel P. Berrangé Nov. 15, 2023, 10:20 a.m. UTC | #1

On Wed, Nov 15, 2023 at 02:14:11AM -0500, Xiaoyao Li wrote:
> Add KVM guest_memfd support to RAMBlock so both normal hva based memory
> and kvm guest memfd based private memory can be associated in one RAMBlock.
> 
> Introduce new flag RAM_GUEST_MEMFD. When it's set, it calls KVM ioctl to
> create private guest_memfd during RAMBlock setup.
> 
> Note, RAM_GUEST_MEMFD is supposed to be set for memory backends of
> confidential guests, such as TDX VM. How and when to set it for memory
> backends will be implemented in the following patches.
> 
> Introduce memory_region_has_guest_memfd() to query if the MemoryRegion has
> KVM guest_memfd allocated.
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> Changes in v3:
> - rename gmem to guest_memfd;
> - close(guest_memfd) when RAMBlock is released; (Daniel P. Berrangé)
> - Squash the patch that introduces memory_region_has_guest_memfd().
> ---
>  accel/kvm/kvm-all.c     | 24 ++++++++++++++++++++++++
>  include/exec/memory.h   | 13 +++++++++++++
>  include/exec/ramblock.h |  1 +
>  include/sysemu/kvm.h    |  2 ++
>  system/memory.c         |  5 +++++
>  system/physmem.c        | 27 ++++++++++++++++++++++++---
>  6 files changed, 69 insertions(+), 3 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index c1b40e873531..9f751d4971f8 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -101,6 +101,7 @@ bool kvm_msi_use_devid;
>  bool kvm_has_guest_debug;
>  static int kvm_sstep_flags;
>  static bool kvm_immediate_exit;
> +static bool kvm_guest_memfd_supported;
>  static hwaddr kvm_max_slot_size = ~0;
>  
>  static const KVMCapabilityInfo kvm_required_capabilites[] = {
> @@ -2397,6 +2398,8 @@ static int kvm_init(MachineState *ms)
>      }
>      s->as = g_new0(struct KVMAs, s->nr_as);
>  
> +    kvm_guest_memfd_supported = kvm_check_extension(s, KVM_CAP_GUEST_MEMFD);
> +
>      if (object_property_find(OBJECT(current_machine), "kvm-type")) {
>          g_autofree char *kvm_type = object_property_get_str(OBJECT(current_machine),
>                                                              "kvm-type",
> @@ -4078,3 +4081,24 @@ void query_stats_schemas_cb(StatsSchemaList **result, Error **errp)
>          query_stats_schema_vcpu(first_cpu, &stats_args);
>      }
>  }
> +
> +int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp)
> +{
> +    int fd;
> +    struct kvm_create_guest_memfd guest_memfd = {
> +        .size = size,
> +        .flags = flags,
> +    };
> +
> +    if (!kvm_guest_memfd_supported) {
> +        error_setg(errp, "KVM doesn't support guest memfd\n");
> +        return -EOPNOTSUPP;

Returning an errno value is unusual when we have an 'Error **errp' parameter
for reporting, and the following codepath merely returns -1, so this is
inconsistent. Just return -1 here too.

> +    }
> +
> +    fd = kvm_vm_ioctl(kvm_state, KVM_CREATE_GUEST_MEMFD, &guest_memfd);
> +    if (fd < 0) {
> +        error_setg_errno(errp, errno, "%s: error creating kvm guest memfd\n", __func__);

I'd prefer an explicit 'return -1' here, even though 'fd' is technically going
to be -1 already.

Also including __func__ in the error message is not really needed IMHO

> +    }
> +
> +    return fd;
> +}
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 831f7c996d9d..f780367ab1bd 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -243,6 +243,9 @@ typedef struct IOMMUTLBEvent {
>  /* RAM FD is opened read-only */
>  #define RAM_READONLY_FD (1 << 11)
>  
> +/* RAM can be private that has kvm gmem backend */
> +#define RAM_GUEST_MEMFD   (1 << 12)
> +
>  static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
>                                         IOMMUNotifierFlag flags,
>                                         hwaddr start, hwaddr end,
> @@ -1702,6 +1705,16 @@ static inline bool memory_region_is_romd(MemoryRegion *mr)
>   */
>  bool memory_region_is_protected(MemoryRegion *mr);
>  
> +/**
> + * memory_region_has_guest_memfd: check whether a memory region has guest_memfd
> + *     associated
> + *
> + * Returns %true if a memory region's ram_block has valid guest_memfd assigned.
> + *
> + * @mr: the memory region being queried
> + */
> +bool memory_region_has_guest_memfd(MemoryRegion *mr);
> +
>  /**
>   * memory_region_get_iommu: check whether a memory region is an iommu
>   *
> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
> index 69c6a5390293..0a17ba882729 100644
> --- a/include/exec/ramblock.h
> +++ b/include/exec/ramblock.h
> @@ -41,6 +41,7 @@ struct RAMBlock {
>      QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
>      int fd;
>      uint64_t fd_offset;
> +    int guest_memfd;
>      size_t page_size;
>      /* dirty bitmap used during migration */
>      unsigned long *bmap;
> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> index d61487816421..fedc28c7d17f 100644
> --- a/include/sysemu/kvm.h
> +++ b/include/sysemu/kvm.h
> @@ -538,4 +538,6 @@ bool kvm_arch_cpu_check_are_resettable(void);
>  bool kvm_dirty_ring_enabled(void);
>  
>  uint32_t kvm_dirty_ring_size(void);
> +
> +int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp);
>  #endif
> diff --git a/system/memory.c b/system/memory.c
> index 304fa843ea12..69741d91bbb7 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -1862,6 +1862,11 @@ bool memory_region_is_protected(MemoryRegion *mr)
>      return mr->ram && (mr->ram_block->flags & RAM_PROTECTED);
>  }
>  
> +bool memory_region_has_guest_memfd(MemoryRegion *mr)
> +{
> +    return mr->ram_block && mr->ram_block->guest_memfd >= 0;
> +}
> +
>  uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
>  {
>      uint8_t mask = mr->dirty_log_mask;
> diff --git a/system/physmem.c b/system/physmem.c
> index fc2b0fee0188..0af2213cbd9c 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -1841,6 +1841,20 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>          }
>      }
>  
> +#ifdef CONFIG_KVM
> +    if (kvm_enabled() && new_block->flags & RAM_GUEST_MEMFD &&
> +        new_block->guest_memfd < 0) {
> +        /* TODO: to decide if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is supported */
> +        uint64_t flags = 0;
> +        new_block->guest_memfd = kvm_create_guest_memfd(new_block->max_length,
> +                                                        flags, errp);
> +        if (new_block->guest_memfd < 0) {
> +            qemu_mutex_unlock_ramlist();
> +            return;
> +        }
> +    }
> +#endif
> +
>      new_ram_size = MAX(old_ram_size,
>                (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS);
>      if (new_ram_size > old_ram_size) {
> @@ -1903,7 +1917,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>      /* Just support these ram flags by now. */
>      assert((ram_flags & ~(RAM_SHARED | RAM_PMEM | RAM_NORESERVE |
>                            RAM_PROTECTED | RAM_NAMED_FILE | RAM_READONLY |
> -                          RAM_READONLY_FD)) == 0);
> +                          RAM_READONLY_FD | RAM_GUEST_MEMFD)) == 0);
>  
>      if (xen_enabled()) {
>          error_setg(errp, "-mem-path not supported with Xen");
> @@ -1938,6 +1952,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>      new_block->used_length = size;
>      new_block->max_length = size;
>      new_block->flags = ram_flags;
> +    new_block->guest_memfd = -1;
>      new_block->host = file_ram_alloc(new_block, size, fd, !file_size, offset,
>                                       errp);
>      if (!new_block->host) {
> @@ -2016,7 +2031,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
>      Error *local_err = NULL;
>  
>      assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE | RAM_PREALLOC |
> -                          RAM_NORESERVE)) == 0);
> +                          RAM_NORESERVE| RAM_GUEST_MEMFD)) == 0);
>      assert(!host ^ (ram_flags & RAM_PREALLOC));
>  
>      size = HOST_PAGE_ALIGN(size);
> @@ -2028,6 +2043,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
>      new_block->max_length = max_size;
>      assert(max_size >= size);
>      new_block->fd = -1;
> +    new_block->guest_memfd = -1;
>      new_block->page_size = qemu_real_host_page_size();
>      new_block->host = host;
>      new_block->flags = ram_flags;
> @@ -2050,7 +2066,7 @@ RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
>  RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags,
>                           MemoryRegion *mr, Error **errp)
>  {
> -    assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE)) == 0);
> +    assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE | RAM_GUEST_MEMFD)) == 0);
>      return qemu_ram_alloc_internal(size, size, NULL, NULL, ram_flags, mr, errp);
>  }
>  
> @@ -2078,6 +2094,11 @@ static void reclaim_ramblock(RAMBlock *block)
>      } else {
>          qemu_anon_ram_free(block->host, block->max_length);
>      }
> +
> +    if (block->guest_memfd >= 0) {
> +        close(block->guest_memfd);
> +    }
> +
>      g_free(block);
>  }
>  
> -- 
> 2.34.1
> 

With regards,
Daniel
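
The return convention asked for in this review (report the details through
the error object, and return a plain -1 on every failure path) might look
roughly like the sketch below. It is only an illustration under stated
assumptions: SketchError and sketch_error_set() are hypothetical stand-ins
for QEMU's Error API, and the capability check and ioctl result are passed
in as parameters so that the code runs without KVM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for QEMU's Error object; not the real API. */
typedef struct SketchError {
    char msg[128];
} SketchError;

/* Record a message if the caller asked for one (err may be NULL). */
void sketch_error_set(SketchError *err, const char *msg)
{
    if (err) {
        snprintf(err->msg, sizeof(err->msg), "%s", msg);
    }
}

/*
 * 'supported' models the cached capability check and 'ioctl_ret' models
 * the result of the create ioctl (negative on failure); both are
 * assumptions made for this sketch.
 */
int sketch_create_guest_memfd(bool supported, int ioctl_ret,
                              SketchError *err)
{
    if (!supported) {
        sketch_error_set(err, "KVM does not support guest memfd");
        return -1;      /* same value as every other failure path */
    }

    if (ioctl_ret < 0) {
        sketch_error_set(err, "Error creating KVM guest memfd");
        return -1;      /* explicit, rather than returning ioctl_ret */
    }

    return ioctl_ret;   /* a valid file descriptor */
}
```

With this shape a caller only has to test for a negative return value and
can take all detail from the error object.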

David Hildenbrand Nov. 15, 2023, 5:54 p.m. UTC | #2

On 15.11.23 08:14, Xiaoyao Li wrote:
> Add KVM guest_memfd support to RAMBlock so both normal hva based memory
> and kvm guest memfd based private memory can be associated in one RAMBlock.
> 
> Introduce new flag RAM_GUEST_MEMFD. When it's set, it calls KVM ioctl to
> create private guest_memfd during RAMBlock setup.
> 
> Note, RAM_GUEST_MEMFD is supposed to be set for memory backends of
> confidential guests, such as TDX VM. How and when to set it for memory
> backends will be implemented in the following patches.

Can you elaborate (and add to the patch description if there is good 
reason) why we need that flag and why we cannot simply rely on the VM 
type instead to decide whether to allocate a guest_memfd or not?

Xiaoyao Li Nov. 16, 2023, 2:45 a.m. UTC | #3

On 11/16/2023 1:54 AM, David Hildenbrand wrote:
> On 15.11.23 08:14, Xiaoyao Li wrote:
>> Add KVM guest_memfd support to RAMBlock so both normal hva based memory
>> and kvm guest memfd based private memory can be associated in one 
>> RAMBlock.
>>
>> Introduce new flag RAM_GUEST_MEMFD. When it's set, it calls KVM ioctl to
>> create private guest_memfd during RAMBlock setup.
>>
>> Note, RAM_GUEST_MEMFD is supposed to be set for memory backends of
>> confidential guests, such as TDX VM. How and when to set it for memory
>> backends will be implemented in the following patches.
> 
> Can you elaborate (and add to the patch description if there is good 
> reason) why we need that flag and why we cannot simply rely on the VM 
> type instead to decide whether to allocate a guest_memfd or not?
> 

The reason is that relying on the VM type is something of a hack: we
would need to get the MachineState instance and retrieve the VM type info
from it. I think it's better not to couple them.

More importantly, it's not flexible or extensible for a future case where
not all memory needs a guest memfd.
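
The flexibility argument can be illustrated with a small sketch: because
the decision is keyed on a per-block flag, two blocks in the same VM can
make different choices, which a single VM-wide type could not express. The
names below (SketchBlock, sketch_wants_guest_memfd) are hypothetical and
the struct is a heavily reduced stand-in for RAMBlock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bit as RAM_GUEST_MEMFD in the patch; the name is hypothetical. */
#define SKETCH_RAM_GUEST_MEMFD (1u << 12)

/* Hypothetical, heavily reduced stand-in for RAMBlock. */
typedef struct SketchBlock {
    uint32_t flags;
    int guest_memfd;    /* -1 until a guest_memfd has been created */
} SketchBlock;

/*
 * Decide per block, not per VM: a guest_memfd is wanted only if KVM is in
 * use, the block opted in through its flag, and none exists yet.
 */
bool sketch_wants_guest_memfd(const SketchBlock *b, bool kvm_on)
{
    return kvm_on &&
           (b->flags & SKETCH_RAM_GUEST_MEMFD) &&
           b->guest_memfd < 0;
}
```

For example, a block created with the flag and guest_memfd == -1 yields
true, while a block without the flag in the same VM yields false.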

Xiaoyao Li Nov. 16, 2023, 3:34 a.m. UTC | #4

On 11/15/2023 6:20 PM, Daniel P. Berrangé wrote:
> On Wed, Nov 15, 2023 at 02:14:11AM -0500, Xiaoyao Li wrote:
>> Add KVM guest_memfd support to RAMBlock so both normal hva based memory
>> and kvm guest memfd based private memory can be associated in one RAMBlock.
>>
>> Introduce new flag RAM_GUEST_MEMFD. When it's set, it calls KVM ioctl to
>> create private guest_memfd during RAMBlock setup.
>>
>> Note, RAM_GUEST_MEMFD is supposed to be set for memory backends of
>> confidential guests, such as TDX VM. How and when to set it for memory
>> backends will be implemented in the following patches.
>>
>> Introduce memory_region_has_guest_memfd() to query if the MemoryRegion has
>> KVM guest_memfd allocated.
>>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> ---
>> Changes in v3:
>> - rename gmem to guest_memfd;
>> - close(guest_memfd) when RAMBlock is released; (Daniel P. Berrangé)
>> - Squash the patch that introduces memory_region_has_guest_memfd().
>> ---
>>   accel/kvm/kvm-all.c     | 24 ++++++++++++++++++++++++
>>   include/exec/memory.h   | 13 +++++++++++++
>>   include/exec/ramblock.h |  1 +
>>   include/sysemu/kvm.h    |  2 ++
>>   system/memory.c         |  5 +++++
>>   system/physmem.c        | 27 ++++++++++++++++++++++++---
>>   6 files changed, 69 insertions(+), 3 deletions(-)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index c1b40e873531..9f751d4971f8 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -101,6 +101,7 @@ bool kvm_msi_use_devid;
>>   bool kvm_has_guest_debug;
>>   static int kvm_sstep_flags;
>>   static bool kvm_immediate_exit;
>> +static bool kvm_guest_memfd_supported;
>>   static hwaddr kvm_max_slot_size = ~0;
>>   
>>   static const KVMCapabilityInfo kvm_required_capabilites[] = {
>> @@ -2397,6 +2398,8 @@ static int kvm_init(MachineState *ms)
>>       }
>>       s->as = g_new0(struct KVMAs, s->nr_as);
>>   
>> +    kvm_guest_memfd_supported = kvm_check_extension(s, KVM_CAP_GUEST_MEMFD);
>> +
>>       if (object_property_find(OBJECT(current_machine), "kvm-type")) {
>>           g_autofree char *kvm_type = object_property_get_str(OBJECT(current_machine),
>>                                                               "kvm-type",
>> @@ -4078,3 +4081,24 @@ void query_stats_schemas_cb(StatsSchemaList **result, Error **errp)
>>           query_stats_schema_vcpu(first_cpu, &stats_args);
>>       }
>>   }
>> +
>> +int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp)
>> +{
>> +    int fd;
>> +    struct kvm_create_guest_memfd guest_memfd = {
>> +        .size = size,
>> +        .flags = flags,
>> +    };
>> +
>> +    if (!kvm_guest_memfd_supported) {
>> +        error_setg(errp, "KVM doesn't support guest memfd\n");
>> +        return -EOPNOTSUPP;
> 
> Returning an errno value is unusual when we have an 'Error **errp' parameter
> for reporting, and the following codepath merely returns -1, so this is
> inconsistent. Just return -1 here too.

OK.

>> +    }
>> +
>> +    fd = kvm_vm_ioctl(kvm_state, KVM_CREATE_GUEST_MEMFD, &guest_memfd);
>> +    if (fd < 0) {
>> +        error_setg_errno(errp, errno, "%s: error creating kvm guest memfd\n", __func__);
> 
> I'd prefer an explicit 'return -1' here, even though 'fd' is technically going
> to be -1 already.
> 
> Also including __func__ in the error message is not really needed IMHO

OK

>> +    }
>> +
>> +    return fd;
>> +}
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index 831f7c996d9d..f780367ab1bd 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -243,6 +243,9 @@ typedef struct IOMMUTLBEvent {
>>   /* RAM FD is opened read-only */
>>   #define RAM_READONLY_FD (1 << 11)
>>   
>> +/* RAM can be private that has kvm gmem backend */
>> +#define RAM_GUEST_MEMFD   (1 << 12)
>> +
>>   static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
>>                                          IOMMUNotifierFlag flags,
>>                                          hwaddr start, hwaddr end,
>> @@ -1702,6 +1705,16 @@ static inline bool memory_region_is_romd(MemoryRegion *mr)
>>    */
>>   bool memory_region_is_protected(MemoryRegion *mr);
>>   
>> +/**
>> + * memory_region_has_guest_memfd: check whether a memory region has guest_memfd
>> + *     associated
>> + *
>> + * Returns %true if a memory region's ram_block has valid guest_memfd assigned.
>> + *
>> + * @mr: the memory region being queried
>> + */
>> +bool memory_region_has_guest_memfd(MemoryRegion *mr);
>> +
>>   /**
>>    * memory_region_get_iommu: check whether a memory region is an iommu
>>    *
>> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
>> index 69c6a5390293..0a17ba882729 100644
>> --- a/include/exec/ramblock.h
>> +++ b/include/exec/ramblock.h
>> @@ -41,6 +41,7 @@ struct RAMBlock {
>>       QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
>>       int fd;
>>       uint64_t fd_offset;
>> +    int guest_memfd;
>>       size_t page_size;
>>       /* dirty bitmap used during migration */
>>       unsigned long *bmap;
>> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
>> index d61487816421..fedc28c7d17f 100644
>> --- a/include/sysemu/kvm.h
>> +++ b/include/sysemu/kvm.h
>> @@ -538,4 +538,6 @@ bool kvm_arch_cpu_check_are_resettable(void);
>>   bool kvm_dirty_ring_enabled(void);
>>   
>>   uint32_t kvm_dirty_ring_size(void);
>> +
>> +int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp);
>>   #endif
>> diff --git a/system/memory.c b/system/memory.c
>> index 304fa843ea12..69741d91bbb7 100644
>> --- a/system/memory.c
>> +++ b/system/memory.c
>> @@ -1862,6 +1862,11 @@ bool memory_region_is_protected(MemoryRegion *mr)
>>       return mr->ram && (mr->ram_block->flags & RAM_PROTECTED);
>>   }
>>   
>> +bool memory_region_has_guest_memfd(MemoryRegion *mr)
>> +{
>> +    return mr->ram_block && mr->ram_block->guest_memfd >= 0;
>> +}
>> +
>>   uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
>>   {
>>       uint8_t mask = mr->dirty_log_mask;
>> diff --git a/system/physmem.c b/system/physmem.c
>> index fc2b0fee0188..0af2213cbd9c 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -1841,6 +1841,20 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>           }
>>       }
>>   
>> +#ifdef CONFIG_KVM
>> +    if (kvm_enabled() && new_block->flags & RAM_GUEST_MEMFD &&
>> +        new_block->guest_memfd < 0) {
>> +        /* TODO: to decide if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is supported */
>> +        uint64_t flags = 0;
>> +        new_block->guest_memfd = kvm_create_guest_memfd(new_block->max_length,
>> +                                                        flags, errp);
>> +        if (new_block->guest_memfd < 0) {
>> +            qemu_mutex_unlock_ramlist();
>> +            return;
>> +        }
>> +    }
>> +#endif
>> +
>>       new_ram_size = MAX(old_ram_size,
>>                 (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS);
>>       if (new_ram_size > old_ram_size) {
>> @@ -1903,7 +1917,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>>       /* Just support these ram flags by now. */
>>       assert((ram_flags & ~(RAM_SHARED | RAM_PMEM | RAM_NORESERVE |
>>                             RAM_PROTECTED | RAM_NAMED_FILE | RAM_READONLY |
>> -                          RAM_READONLY_FD)) == 0);
>> +                          RAM_READONLY_FD | RAM_GUEST_MEMFD)) == 0);
>>   
>>       if (xen_enabled()) {
>>           error_setg(errp, "-mem-path not supported with Xen");
>> @@ -1938,6 +1952,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>>       new_block->used_length = size;
>>       new_block->max_length = size;
>>       new_block->flags = ram_flags;
>> +    new_block->guest_memfd = -1;
>>       new_block->host = file_ram_alloc(new_block, size, fd, !file_size, offset,
>>                                        errp);
>>       if (!new_block->host) {
>> @@ -2016,7 +2031,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
>>       Error *local_err = NULL;
>>   
>>       assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE | RAM_PREALLOC |
>> -                          RAM_NORESERVE)) == 0);
>> +                          RAM_NORESERVE| RAM_GUEST_MEMFD)) == 0);
>>       assert(!host ^ (ram_flags & RAM_PREALLOC));
>>   
>>       size = HOST_PAGE_ALIGN(size);
>> @@ -2028,6 +2043,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
>>       new_block->max_length = max_size;
>>       assert(max_size >= size);
>>       new_block->fd = -1;
>> +    new_block->guest_memfd = -1;
>>       new_block->page_size = qemu_real_host_page_size();
>>       new_block->host = host;
>>       new_block->flags = ram_flags;
>> @@ -2050,7 +2066,7 @@ RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
>>   RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags,
>>                            MemoryRegion *mr, Error **errp)
>>   {
>> -    assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE)) == 0);
>> +    assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE | RAM_GUEST_MEMFD)) == 0);
>>       return qemu_ram_alloc_internal(size, size, NULL, NULL, ram_flags, mr, errp);
>>   }
>>   
>> @@ -2078,6 +2094,11 @@ static void reclaim_ramblock(RAMBlock *block)
>>       } else {
>>           qemu_anon_ram_free(block->host, block->max_length);
>>       }
>> +
>> +    if (block->guest_memfd >= 0) {
>> +        close(block->guest_memfd);
>> +    }
>> +
>>       g_free(block);
>>   }
>>   
>> -- 
>> 2.34.1
>>
> 
> With regards,
> Daniel

Isaku Yamahata Nov. 17, 2023, 8:35 p.m. UTC | #5

On Wed, Nov 15, 2023 at 02:14:11AM -0500,
Xiaoyao Li <xiaoyao.li@intel.com> wrote:

> diff --git a/system/physmem.c b/system/physmem.c
> index fc2b0fee0188..0af2213cbd9c 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -1841,6 +1841,20 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>          }
>      }
>  
> +#ifdef CONFIG_KVM
> +    if (kvm_enabled() && new_block->flags & RAM_GUEST_MEMFD &&
> +        new_block->guest_memfd < 0) {
> +        /* TODO: to decide if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is supported */
> +        uint64_t flags = 0;
> +        new_block->guest_memfd = kvm_create_guest_memfd(new_block->max_length,
> +                                                        flags, errp);
> +        if (new_block->guest_memfd < 0) {
> +            qemu_mutex_unlock_ramlist();
> +            return;
> +        }
> +    }
> +#endif
> +

We should define a kvm_create_guest_memfd() stub in accel/stubs/kvm-stub.c.
Then we can remove this #ifdef.
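
The stub approach can be sketched as follows: when a non-KVM build still
provides a function with the same signature, the caller no longer needs a
preprocessor guard, because the runtime "is KVM enabled" check keeps the
stub from ever being reached. All names are hypothetical, and the stub's
return value (-ENOSYS) is an assumption for this sketch, not something
stated in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bit as RAM_GUEST_MEMFD in the patch; the name is hypothetical. */
#define SKETCH_RAM_GUEST_MEMFD (1u << 12)

/* What a non-KVM build would link in: KVM is never enabled ... */
bool sketch_kvm_enabled(void)
{
    return false;
}

/* ... and the stub only has to exist so that callers compile and link. */
int sketch_create_guest_memfd(uint64_t size, uint64_t flags)
{
    (void)size;
    (void)flags;
    return -ENOSYS;     /* assumed value; never reached via the caller */
}

/* Caller without any #ifdef: the runtime check guards the call. */
int sketch_ram_block_setup(uint32_t block_flags, uint64_t max_length,
                           int *guest_memfd)
{
    if (sketch_kvm_enabled() &&
        (block_flags & SKETCH_RAM_GUEST_MEMFD) &&
        *guest_memfd < 0) {
        *guest_memfd = sketch_create_guest_memfd(max_length, 0);
        if (*guest_memfd < 0) {
            return -1;
        }
    }
    return 0;
}
```

The trade-off is one extra trivial function in the stub file in exchange
for call sites that read the same in every build configuration.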

David Hildenbrand Nov. 20, 2023, 9:19 a.m. UTC | #6

On 16.11.23 03:45, Xiaoyao Li wrote:
> On 11/16/2023 1:54 AM, David Hildenbrand wrote:
>> On 15.11.23 08:14, Xiaoyao Li wrote:
>>> Add KVM guest_memfd support to RAMBlock so both normal hva based memory
>>> and kvm guest memfd based private memory can be associated in one
>>> RAMBlock.
>>>
>>> Introduce new flag RAM_GUEST_MEMFD. When it's set, it calls KVM ioctl to
>>> create private guest_memfd during RAMBlock setup.
>>>
>>> Note, RAM_GUEST_MEMFD is supposed to be set for memory backends of
>>> confidential guests, such as TDX VM. How and when to set it for memory
>>> backends will be implemented in the following patches.
>>
>> Can you elaborate (and add to the patch description if there is good
>> reason) why we need that flag and why we cannot simply rely on the VM
>> type instead to decide whether to allocate a guest_memfd or not?
>>
> 
> The reason is, relying on the VM type is sort of hack that we need to
> get the MachineState instance and retrieve the vm type info. I think
> it's better not to couple them.
> 
> More importantly, it's not flexible and extensible for future case that
> not all the memory need guest memfd.
> 

Okay. In that case, please update the documentation of all functions 
where we are allowed to pass in RAM_GUEST_MEMFD. There are a couple of 
them in include/exec/memory.h

I'll note that the name/terminology of "RAM_GUEST_MEMFD" is extremely 
Linux+kvm specific. But I cannot really come up with something better 
right now.

David Hildenbrand Nov. 20, 2023, 9:24 a.m. UTC | #7

>   uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
>   {
>       uint8_t mask = mr->dirty_log_mask;
> diff --git a/system/physmem.c b/system/physmem.c
> index fc2b0fee0188..0af2213cbd9c 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -1841,6 +1841,20 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>           }
>       }
>   
> +#ifdef CONFIG_KVM
> +    if (kvm_enabled() && new_block->flags & RAM_GUEST_MEMFD &&


I recall that we prefer to write this as

	if (kvm_enabled() && (new_block->flags & RAM_GUEST_MEMFD) &&

> +        new_block->guest_memfd < 0) {
> +        /* TODO: to decide if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is supported */
> +        uint64_t flags = 0;
> +        new_block->guest_memfd = kvm_create_guest_memfd(new_block->max_length,
> +                                                        flags, errp);

Get rid of "flags" and just pass 0. Whatever code wants to pass flags 
later can decide how to do that.
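
On the first point: in C, '&' binds more tightly than '&&', so the
unparenthesized condition already parses as intended and the parentheses
are a readability aid rather than a behavior change. A minimal sketch with
hypothetical function names shows the two spellings side by side:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bit as RAM_GUEST_MEMFD in the patch; the name is hypothetical. */
#define SKETCH_RAM_GUEST_MEMFD (1u << 12)

/* Spelling used in the patch: relies on '&' binding tighter than '&&'. */
bool sketch_check_bare(bool kvm_on, uint32_t flags, int guest_memfd)
{
    return kvm_on && flags & SKETCH_RAM_GUEST_MEMFD && guest_memfd < 0;
}

/* Spelling suggested in the review: the grouping is visible at a glance. */
bool sketch_check_parens(bool kvm_on, uint32_t flags, int guest_memfd)
{
    return kvm_on && (flags & SKETCH_RAM_GUEST_MEMFD) && guest_memfd < 0;
}
```

The explicit form is the safer habit because '&' binds less tightly than
'==' and '<': an expression such as "flags & MASK == 0" groups as
"flags & (MASK == 0)", which is rarely what was meant.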

Xiaoyao Li Nov. 30, 2023, 7:35 a.m. UTC | #8

On 11/20/2023 5:19 PM, David Hildenbrand wrote:
> On 16.11.23 03:45, Xiaoyao Li wrote:
>> On 11/16/2023 1:54 AM, David Hildenbrand wrote:
>>> On 15.11.23 08:14, Xiaoyao Li wrote:
>>>> Add KVM guest_memfd support to RAMBlock so both normal hva based memory
>>>> and kvm guest memfd based private memory can be associated in one
>>>> RAMBlock.
>>>>
>>>> Introduce new flag RAM_GUEST_MEMFD. When it's set, it calls KVM ioctl to
>>>> create private guest_memfd during RAMBlock setup.
>>>>
>>>> Note, RAM_GUEST_MEMFD is supposed to be set for memory backends of
>>>> confidential guests, such as TDX VM. How and when to set it for memory
>>>> backends will be implemented in the following patches.
>>>
>>> Can you elaborate (and add to the patch description if there is good
>>> reason) why we need that flag and why we cannot simply rely on the VM
>>> type instead to decide whether to allocate a guest_memfd or not?
>>>
>>
>> The reason is that relying on the VM type is something of a hack: we
>> would need to get the MachineState instance and retrieve the VM type
>> info. I think it's better not to couple them.
>>
>> More importantly, it's neither flexible nor extensible for future cases
>> where not all memory needs guest memfd.
>>
> 
> Okay. In that case, please update the documentation of all functions 
> where we are allowed to pass in RAM_GUEST_MEMFD. There are a couple of 
> them in include/exec/memory.h

Sure, thanks!

> I'll note that the name/terminology of "RAM_GUEST_MEMFD" is extremely 
> Linux+kvm specific. But I cannot really come up with something better 
> right now.
>

Xiaoyao Li Nov. 30, 2023, 7:37 a.m. UTC | #9

On 11/20/2023 5:24 PM, David Hildenbrand wrote:
>>   uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
>>   {
>>       uint8_t mask = mr->dirty_log_mask;
>> diff --git a/system/physmem.c b/system/physmem.c
>> index fc2b0fee0188..0af2213cbd9c 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -1841,6 +1841,20 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>           }
>>       }
>> +#ifdef CONFIG_KVM
>> +    if (kvm_enabled() && new_block->flags & RAM_GUEST_MEMFD &&
> 
> 
> I recall that we prefer to write this as
> 
>      if (kvm_enabled() && (new_block->flags & RAM_GUEST_MEMFD) &&

Got it.

Thanks!

>> +        new_block->guest_memfd < 0) {
>> +        /* TODO: to decide if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is supported */
>> +        uint64_t flags = 0;
>> +        new_block->guest_memfd = kvm_create_guest_memfd(new_block->max_length,
>> +                                                        flags, errp);
> 
> Get rid of "flags" and just pass 0. Whatever code wants to pass flags 
> later can decide how to do that.


For how to handle this, please see my reply to patch 3.

Xiaoyao Li Nov. 30, 2023, 8:31 a.m. UTC | #10

On 11/18/2023 4:35 AM, Isaku Yamahata wrote:
> On Wed, Nov 15, 2023 at 02:14:11AM -0500,
> Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> 
>> diff --git a/system/physmem.c b/system/physmem.c
>> index fc2b0fee0188..0af2213cbd9c 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -1841,6 +1841,20 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>           }
>>       }
>>   
>> +#ifdef CONFIG_KVM
>> +    if (kvm_enabled() && new_block->flags & RAM_GUEST_MEMFD &&
>> +        new_block->guest_memfd < 0) {
>> +        /* TODO: to decide if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is supported */
>> +        uint64_t flags = 0;
>> +        new_block->guest_memfd = kvm_create_guest_memfd(new_block->max_length,
>> +                                                        flags, errp);
>> +        if (new_block->guest_memfd < 0) {
>> +            qemu_mutex_unlock_ramlist();
>> +            return;
>> +        }
>> +    }
>> +#endif
>> +
> 
> We should define a kvm_create_guest_memfd() stub in accel/stubs/kvm-stub.c.
> Then we can remove this #ifdef.

Nice suggestion! Will use a stub.
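
For readers unfamiliar with the stub pattern, a minimal standalone sketch follows. It is not the code from any later revision of this series: the stub's reduced signature (no Error ** parameter), the -ENOSYS return value, and the kvm_enabled_sketch() helper are all assumptions made so the snippet compiles on its own. The point it illustrates is only that, once a stub with the same name exists for non-KVM builds, the caller links without an #ifdef CONFIG_KVM, and the kvm-enabled check keeps the stub from being reached at runtime.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for kvm_enabled() in a build without KVM. */
static bool kvm_enabled_sketch(void)
{
    return false;
}

/*
 * Hypothetical stub: same role as the real helper, but it always fails.
 * The -ENOSYS return value is an assumption for this sketch.
 */
static int kvm_create_guest_memfd_stub(uint64_t size, uint64_t flags)
{
    (void)size;
    (void)flags;
    return -ENOSYS;
}

/*
 * Caller written without any #ifdef: the stub keeps it linkable, and the
 * kvm_enabled_sketch() check short-circuits before the stub is called.
 * Returns 0 on success (including "nothing to do"), negative on failure.
 */
static int setup_guest_memfd(bool wants_guest_memfd, uint64_t size, int *fd)
{
    if (kvm_enabled_sketch() && wants_guest_memfd && *fd < 0) {
        *fd = kvm_create_guest_memfd_stub(size, 0);
        if (*fd < 0) {
            return *fd;
        }
    }
    return 0;
}
```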

David Hildenbrand Nov. 30, 2023, 11:01 a.m. UTC | #11

On 30.11.23 08:37, Xiaoyao Li wrote:
> On 11/20/2023 5:24 PM, David Hildenbrand wrote:
>>>    uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
>>>    {
>>>        uint8_t mask = mr->dirty_log_mask;
>>> diff --git a/system/physmem.c b/system/physmem.c
>>> index fc2b0fee0188..0af2213cbd9c 100644
>>> --- a/system/physmem.c
>>> +++ b/system/physmem.c
>>> @@ -1841,6 +1841,20 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>>            }
>>>        }
>>> +#ifdef CONFIG_KVM
>>> +    if (kvm_enabled() && new_block->flags & RAM_GUEST_MEMFD &&
>>
>>
>> I recall that we prefer to write this as
>>
>>       if (kvm_enabled() && (new_block->flags & RAM_GUEST_MEMFD) &&
> 
> Got it.
> 
> Thanks!
> 
>>> +        new_block->guest_memfd < 0) {
>>> +        /* TODO: to decide if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is supported */
>>> +        uint64_t flags = 0;
>>> +        new_block->guest_memfd = kvm_create_guest_memfd(new_block->max_length,
>>> +                                                        flags, errp);
>>
>> Get rid of "flags" and just pass 0. Whatever code wants to pass flags
>> later can decide how to do that.
> 
> 
> For how to handle this, please see my reply to patch 3.

If patch #3 cannot go in now and has to be deferred, then please clean 
this up here. Otherwise, as suggested, squash it with #3.

Depending on KVM_GUEST_MEMFD_ALLOW_HUGEPAGE support :)
