From patchwork Fri Dec 13 07:08:43 2024
X-Patchwork-Submitter: Chenyi Qiang
X-Patchwork-Id: 13906620
From: Chenyi Qiang
To: David Hildenbrand, Paolo Bonzini, Peter Xu, Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel@nongnu.org, kvm@vger.kernel.org, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
Subject: [PATCH 1/7] memory: Export a helper to get intersection of a MemoryRegionSection with a given range
Date: Fri, 13 Dec 2024 15:08:43 +0800
Message-ID: <20241213070852.106092-2-chenyi.qiang@intel.com>
In-Reply-To: <20241213070852.106092-1-chenyi.qiang@intel.com>
References: <20241213070852.106092-1-chenyi.qiang@intel.com>

Rename the helper to memory_region_section_intersect_range() to make
it more generic.

Signed-off-by: Chenyi Qiang
---
 hw/virtio/virtio-mem.c | 32 +++++---------------------------
 include/exec/memory.h  | 13 +++++++++++++
 system/memory.c        | 17 +++++++++++++++++
 3 files changed, 35 insertions(+), 27 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 80ada89551..e3d1ccaeeb 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -242,28 +242,6 @@ static int virtio_mem_for_each_plugged_range(VirtIOMEM *vmem, void *arg,
     return ret;
 }
 
-/*
- * Adjust the memory section to cover the intersection with the given range.
- *
- * Returns false if the intersection is empty, otherwise returns true.
- */
-static bool virtio_mem_intersect_memory_section(MemoryRegionSection *s,
-                                                uint64_t offset, uint64_t size)
-{
-    uint64_t start = MAX(s->offset_within_region, offset);
-    uint64_t end = MIN(s->offset_within_region + int128_get64(s->size),
-                       offset + size);
-
-    if (end <= start) {
-        return false;
-    }
-
-    s->offset_within_address_space += start - s->offset_within_region;
-    s->offset_within_region = start;
-    s->size = int128_make64(end - start);
-    return true;
-}
-
 typedef int (*virtio_mem_section_cb)(MemoryRegionSection *s, void *arg);
 
 static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
@@ -285,7 +263,7 @@ static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
                                   first_bit + 1) - 1;
         size = (last_bit - first_bit + 1) * vmem->block_size;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             break;
         }
         ret = cb(&tmp, arg);
@@ -317,7 +295,7 @@ static int virtio_mem_for_each_unplugged_section(const VirtIOMEM *vmem,
                                   first_bit + 1) - 1;
         size = (last_bit - first_bit + 1) * vmem->block_size;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             break;
         }
         ret = cb(&tmp, arg);
@@ -353,7 +331,7 @@ static void virtio_mem_notify_unplug(VirtIOMEM *vmem, uint64_t offset,
     QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
         MemoryRegionSection tmp = *rdl->section;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
         rdl->notify_discard(rdl, &tmp);
@@ -369,7 +347,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
     QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
         MemoryRegionSection tmp = *rdl->section;
 
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
         ret = rdl->notify_populate(rdl, &tmp);
@@ -386,7 +364,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
         if (rdl2 == rdl) {
             break;
         }
-        if (!virtio_mem_intersect_memory_section(&tmp, offset, size)) {
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
         rdl2->notify_discard(rdl2, &tmp);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index e5e865d1a9..ec7bc641e8 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1196,6 +1196,19 @@ MemoryRegionSection *memory_region_section_new_copy(MemoryRegionSection *s);
  */
 void memory_region_section_free_copy(MemoryRegionSection *s);
 
+/**
+ * memory_region_section_intersect_range: Adjust the memory section to cover
+ * the intersection with the given range.
+ *
+ * @s: the #MemoryRegionSection to be adjusted
+ * @offset: the offset of the given range in the memory region
+ * @size: the size of the given range
+ *
+ * Returns false if the intersection is empty, otherwise returns true.
+ */
+bool memory_region_section_intersect_range(MemoryRegionSection *s,
+                                           uint64_t offset, uint64_t size);
+
 /**
  * memory_region_init: Initialize a memory region
  *
diff --git a/system/memory.c b/system/memory.c
index 85f6834cb3..ddcec90f5e 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2898,6 +2898,23 @@ void memory_region_section_free_copy(MemoryRegionSection *s)
     g_free(s);
 }
 
+bool memory_region_section_intersect_range(MemoryRegionSection *s,
+                                           uint64_t offset, uint64_t size)
+{
+    uint64_t start = MAX(s->offset_within_region, offset);
+    uint64_t end = MIN(s->offset_within_region + int128_get64(s->size),
+                       offset + size);
+
+    if (end <= start) {
+        return false;
+    }
+
+    s->offset_within_address_space += start - s->offset_within_region;
+    s->offset_within_region = start;
+    s->size = int128_make64(end - start);
+    return true;
+}
+
 bool memory_region_present(MemoryRegion *container, hwaddr addr)
 {
     MemoryRegion *mr;

From patchwork Fri Dec 13 07:08:44 2024
X-Patchwork-Submitter: Chenyi Qiang
X-Patchwork-Id: 13906621
From: Chenyi Qiang
To: David Hildenbrand, Paolo Bonzini, Peter Xu, Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel@nongnu.org, kvm@vger.kernel.org, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
Subject: [PATCH 2/7] guest_memfd: Introduce an object to manage the guest-memfd with RamDiscardManager
Date: Fri, 13 Dec 2024 15:08:44 +0800
Message-ID: <20241213070852.106092-3-chenyi.qiang@intel.com>
In-Reply-To: <20241213070852.106092-1-chenyi.qiang@intel.com>

As commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
discard") highlighted, some subsystems like VFIO may disable RAM block
discard. However, guest_memfd relies on the discard operation to perform
page conversion between private and shared memory. This can lead to
stale IOMMU mappings when assigning a hardware device to a confidential
VM via shared memory (unprotected memory pages). Blocking shared page
discard can solve this problem, but it could cause guests to consume
twice the memory with VFIO, which is not acceptable in some cases. An
alternative solution is to notify other systems like VFIO so that they
refresh their outdated IOMMU mappings.

RamDiscardManager is an existing concept (used by virtio-mem) to adjust
VFIO mappings in relation to VM page assignment. A page conversion is
effectively a hot-remove of the page in one mode followed by adding it
back in the other, so the same work that happens in response to
virtio-mem changes needs to happen for page conversion events.
Introduce RamDiscardManager to guest_memfd to achieve this.
However, guest_memfd is not an object, so it cannot directly implement
the RamDiscardManager interface. One option is to implement the
interface in HostMemoryBackend: any guest_memfd-backed host memory
backend could register itself in the target MemoryRegion. However, that
doesn't cover the scenario where a guest_memfd MemoryRegion doesn't
belong to a HostMemoryBackend, e.g. the virtual BIOS MemoryRegion.
Thus, choose the second option: define an object type named
guest_memfd_manager that implements the RamDiscardManager interface.
Upon creation of a guest_memfd, a new guest_memfd_manager object is
instantiated and registered with the managed guest_memfd MemoryRegion
to handle the page conversion events.

In the context of guest_memfd, the discarded state signifies that the
page is private, while the populated state indicates that the page is
shared. The state of the memory is tracked at the granularity of the
host page size (i.e. block_size), as the minimum conversion size can be
one page per request. In addition, VFIO expects the DMA mapping for a
specific iova to be mapped and unmapped with the same granularity.
However, confidential VMs may do partial conversions, e.g. a conversion
that happens on a small region within a larger region. To prevent such
invalid cases, and until a potential optimization emerges, all
operations are performed with 4K granularity.
Signed-off-by: Chenyi Qiang
---
 include/sysemu/guest-memfd-manager.h |  46 +++++
 system/guest-memfd-manager.c         | 250 +++++++++++++++++++++++++++
 system/meson.build                   |   1 +
 3 files changed, 297 insertions(+)
 create mode 100644 include/sysemu/guest-memfd-manager.h
 create mode 100644 system/guest-memfd-manager.c

diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/guest-memfd-manager.h
new file mode 100644
index 0000000000..ba4a99b614
--- /dev/null
+++ b/include/sysemu/guest-memfd-manager.h
@@ -0,0 +1,46 @@
+/*
+ * QEMU guest memfd manager
+ *
+ * Copyright Intel
+ *
+ * Author:
+ *      Chenyi Qiang
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory
+ *
+ */
+
+#ifndef SYSEMU_GUEST_MEMFD_MANAGER_H
+#define SYSEMU_GUEST_MEMFD_MANAGER_H
+
+#include "sysemu/hostmem.h"
+
+#define TYPE_GUEST_MEMFD_MANAGER "guest-memfd-manager"
+
+OBJECT_DECLARE_TYPE(GuestMemfdManager, GuestMemfdManagerClass, GUEST_MEMFD_MANAGER)
+
+struct GuestMemfdManager {
+    Object parent;
+
+    /* Managed memory region. */
+    MemoryRegion *mr;
+
+    /*
+     * 1-setting of the bit represents the memory is populated (shared).
+     */
+    int32_t bitmap_size;
+    unsigned long *bitmap;
+
+    /* block size and alignment */
+    uint64_t block_size;
+
+    /* listeners to notify on populate/discard activity. */
+    QLIST_HEAD(, RamDiscardListener) rdl_list;
+};
+
+struct GuestMemfdManagerClass {
+    ObjectClass parent_class;
+};
+
+#endif
diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
new file mode 100644
index 0000000000..d7e105fead
--- /dev/null
+++ b/system/guest-memfd-manager.c
@@ -0,0 +1,250 @@
+/*
+ * QEMU guest memfd manager
+ *
+ * Copyright Intel
+ *
+ * Author:
+ *      Chenyi Qiang
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "sysemu/guest-memfd-manager.h"
+
+OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(GuestMemfdManager,
+                                          guest_memfd_manager,
+                                          GUEST_MEMFD_MANAGER,
+                                          OBJECT,
+                                          { TYPE_RAM_DISCARD_MANAGER },
+                                          { })
+
+static bool guest_memfd_rdm_is_populated(const RamDiscardManager *rdm,
+                                         const MemoryRegionSection *section)
+{
+    const GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+    uint64_t first_bit = section->offset_within_region / gmm->block_size;
+    uint64_t last_bit = first_bit + int128_get64(section->size) / gmm->block_size - 1;
+    unsigned long first_discard_bit;
+
+    first_discard_bit = find_next_zero_bit(gmm->bitmap, last_bit + 1, first_bit);
+    return first_discard_bit > last_bit;
+}
+
+typedef int (*guest_memfd_section_cb)(MemoryRegionSection *s, void *arg);
+
+static int guest_memfd_notify_populate_cb(MemoryRegionSection *section, void *arg)
+{
+    RamDiscardListener *rdl = arg;
+
+    return rdl->notify_populate(rdl, section);
+}
+
+static int guest_memfd_notify_discard_cb(MemoryRegionSection *section, void *arg)
+{
+    RamDiscardListener *rdl = arg;
+
+    rdl->notify_discard(rdl, section);
+
+    return 0;
+}
+
+static int guest_memfd_for_each_populated_section(const GuestMemfdManager *gmm,
+                                                  MemoryRegionSection *section,
+                                                  void *arg,
+                                                  guest_memfd_section_cb cb)
+{
+    unsigned long first_one_bit, last_one_bit;
+    uint64_t offset, size;
+    int ret = 0;
+
+    first_one_bit = section->offset_within_region / gmm->block_size;
+    first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size, first_one_bit);
+
+    while (first_one_bit < gmm->bitmap_size) {
+        MemoryRegionSection tmp = *section;
+
+        offset = first_one_bit * gmm->block_size;
+        last_one_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
+                                          first_one_bit + 1) - 1;
+        size = (last_one_bit - first_one_bit + 1) * gmm->block_size;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            break;
+        }
+
+        ret = cb(&tmp, arg);
+        if (ret) {
+            break;
+        }
+
+        first_one_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
+                                      last_one_bit + 2);
+    }
+
+    return ret;
+}
+
+static int guest_memfd_for_each_discarded_section(const GuestMemfdManager *gmm,
+                                                  MemoryRegionSection *section,
+                                                  void *arg,
+                                                  guest_memfd_section_cb cb)
+{
+    unsigned long first_zero_bit, last_zero_bit;
+    uint64_t offset, size;
+    int ret = 0;
+
+    first_zero_bit = section->offset_within_region / gmm->block_size;
+    first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
+                                        first_zero_bit);
+
+    while (first_zero_bit < gmm->bitmap_size) {
+        MemoryRegionSection tmp = *section;
+
+        offset = first_zero_bit * gmm->block_size;
+        last_zero_bit = find_next_bit(gmm->bitmap, gmm->bitmap_size,
+                                      first_zero_bit + 1) - 1;
+        size = (last_zero_bit - first_zero_bit + 1) * gmm->block_size;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            break;
+        }
+
+        ret = cb(&tmp, arg);
+        if (ret) {
+            break;
+        }
+
+        first_zero_bit = find_next_zero_bit(gmm->bitmap, gmm->bitmap_size,
+                                            last_zero_bit + 2);
+    }
+
+    return ret;
+}
+
+static uint64_t guest_memfd_rdm_get_min_granularity(const RamDiscardManager *rdm,
+                                                    const MemoryRegion *mr)
+{
+    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+
+    g_assert(mr == gmm->mr);
+    return gmm->block_size;
+}
+
+static void guest_memfd_rdm_register_listener(RamDiscardManager *rdm,
+                                              RamDiscardListener *rdl,
+                                              MemoryRegionSection *section)
+{
+    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+    int ret;
+
+    g_assert(section->mr == gmm->mr);
+    rdl->section = memory_region_section_new_copy(section);
+
+    QLIST_INSERT_HEAD(&gmm->rdl_list, rdl, next);
+
+    ret = guest_memfd_for_each_populated_section(gmm, section, rdl,
+                                                 guest_memfd_notify_populate_cb);
+    if (ret) {
+        error_report("%s: Failed to register RAM discard listener: %s", __func__,
+                     strerror(-ret));
+    }
+}
+
+static void guest_memfd_rdm_unregister_listener(RamDiscardManager *rdm,
+                                                RamDiscardListener *rdl)
+{
+    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+    int ret;
+
+    g_assert(rdl->section);
+    g_assert(rdl->section->mr == gmm->mr);
+
+    ret = guest_memfd_for_each_populated_section(gmm, rdl->section, rdl,
+                                                 guest_memfd_notify_discard_cb);
+    if (ret) {
+        error_report("%s: Failed to unregister RAM discard listener: %s", __func__,
+                     strerror(-ret));
+    }
+
+    memory_region_section_free_copy(rdl->section);
+    rdl->section = NULL;
+    QLIST_REMOVE(rdl, next);
+}
+
+typedef struct GuestMemfdReplayData {
+    void *fn;
+    void *opaque;
+} GuestMemfdReplayData;
+
+static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, void *arg)
+{
+    struct GuestMemfdReplayData *data = arg;
+    ReplayRamPopulate replay_fn = data->fn;
+
+    return replay_fn(section, data->opaque);
+}
+
+static int guest_memfd_rdm_replay_populated(const RamDiscardManager *rdm,
+                                            MemoryRegionSection *section,
+                                            ReplayRamPopulate replay_fn,
+                                            void *opaque)
+{
+    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
+
+    g_assert(section->mr == gmm->mr);
+    return guest_memfd_for_each_populated_section(gmm, section, &data,
+                                                  guest_memfd_rdm_replay_populated_cb);
+}
+
+static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection *section, void *arg)
+{
+    struct GuestMemfdReplayData *data = arg;
+    ReplayRamDiscard replay_fn = data->fn;
+
+    replay_fn(section, data->opaque);
+
+    return 0;
+}
+
+static void guest_memfd_rdm_replay_discarded(const RamDiscardManager *rdm,
+                                             MemoryRegionSection *section,
+                                             ReplayRamDiscard replay_fn,
+                                             void *opaque)
+{
+    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
+    struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque };
+
+    g_assert(section->mr == gmm->mr);
+    guest_memfd_for_each_discarded_section(gmm, section, &data,
+                                           guest_memfd_rdm_replay_discarded_cb);
+}
+
+static void guest_memfd_manager_init(Object *obj)
+{
+    GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
+
+    QLIST_INIT(&gmm->rdl_list);
+}
+
+static void guest_memfd_manager_finalize(Object *obj)
+{
+    g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
+}
+
+static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
+{
+    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
+
+    rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
+    rdmc->register_listener = guest_memfd_rdm_register_listener;
+    rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;
+    rdmc->is_populated = guest_memfd_rdm_is_populated;
+    rdmc->replay_populated = guest_memfd_rdm_replay_populated;
+    rdmc->replay_discarded = guest_memfd_rdm_replay_discarded;
+}
diff --git a/system/meson.build b/system/meson.build
index 4952f4b2c7..ed4e1137bd 100644
--- a/system/meson.build
+++ b/system/meson.build
@@ -15,6 +15,7 @@ system_ss.add(files(
   'dirtylimit.c',
   'dma-helpers.c',
   'globals.c',
+  'guest-memfd-manager.c',
   'memory_mapping.c',
   'qdev-monitor.c',
   'qtest.c',

From patchwork Fri Dec 13 07:08:45 2024
X-Patchwork-Submitter: Chenyi Qiang
X-Patchwork-Id: 13906622
From: Chenyi Qiang
To: David Hildenbrand, Paolo Bonzini, Peter Xu, Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel@nongnu.org, kvm@vger.kernel.org, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
Subject: [PATCH 3/7] guest_memfd: Introduce a callback to notify the shared/private state change
Date: Fri, 13 Dec 2024 15:08:45 +0800
Message-ID: <20241213070852.106092-4-chenyi.qiang@intel.com>
In-Reply-To: <20241213070852.106092-1-chenyi.qiang@intel.com>

Introduce a new state_change() callback in GuestMemfdManagerClass to
efficiently notify all registered RamDiscardListeners, including VFIO
listeners, about memory conversion events in guest_memfd. The existing
VFIO listener can dynamically DMA map/unmap the shared pages based on
the conversion type:
- For conversions from shared to private, the VFIO system ensures the
  shared mapping is discarded from the IOMMU.
- For conversions from private to shared, it triggers the population of
  the shared mapping into the IOMMU.

Additionally, there are some special conversion requests:
- When a conversion request is made for a page already in the desired
  state, the helper simply returns success.
- For requests involving a range only partially in the desired state,
  only the necessary segments are converted, so that the entire range
  ends up compliant with the request.
- When a conversion request is declined by another system, such as a
  failure from VFIO during notify_populate(), the helper rolls back the
  request, maintaining consistency.
Signed-off-by: Chenyi Qiang
---
 include/sysemu/guest-memfd-manager.h |   3 +
 system/guest-memfd-manager.c         | 144 +++++++++++++++++++++++++++
 2 files changed, 147 insertions(+)

diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/guest-memfd-manager.h
index ba4a99b614..f4b175529b 100644
--- a/include/sysemu/guest-memfd-manager.h
+++ b/include/sysemu/guest-memfd-manager.h
@@ -41,6 +41,9 @@ struct GuestMemfdManager {
 
 struct GuestMemfdManagerClass {
     ObjectClass parent_class;
+
+    int (*state_change)(GuestMemfdManager *gmm, uint64_t offset, uint64_t size,
+                        bool shared_to_private);
 };
 
 #endif
diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
index d7e105fead..6601df5f3f 100644
--- a/system/guest-memfd-manager.c
+++ b/system/guest-memfd-manager.c
@@ -225,6 +225,147 @@ static void guest_memfd_rdm_replay_discarded(const RamDiscardManager *rdm,
                                            guest_memfd_rdm_replay_discarded_cb);
 }
 
+static bool guest_memfd_is_valid_range(GuestMemfdManager *gmm,
+                                       uint64_t offset, uint64_t size)
+{
+    MemoryRegion *mr = gmm->mr;
+
+    g_assert(mr);
+
+    uint64_t region_size = memory_region_size(mr);
+    if (!QEMU_IS_ALIGNED(offset, gmm->block_size)) {
+        return false;
+    }
+    if (offset + size < offset || !size) {
+        return false;
+    }
+    if (offset >= region_size || offset + size > region_size) {
+        return false;
+    }
+    return true;
+}
+
+static void guest_memfd_notify_discard(GuestMemfdManager *gmm,
+                                       uint64_t offset, uint64_t size)
+{
+    RamDiscardListener *rdl;
+
+    QLIST_FOREACH(rdl, &gmm->rdl_list, next) {
+        MemoryRegionSection tmp = *rdl->section;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            continue;
+        }
+
+        guest_memfd_for_each_populated_section(gmm, &tmp, rdl,
+                                               guest_memfd_notify_discard_cb);
+    }
+}
+
+static int guest_memfd_notify_populate(GuestMemfdManager *gmm,
+                                       uint64_t offset, uint64_t size)
+{
+    RamDiscardListener *rdl, *rdl2;
+    int ret = 0;
+
+    QLIST_FOREACH(rdl, &gmm->rdl_list, next) {
+        MemoryRegionSection tmp =
+            *rdl->section;
+
+        if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+            continue;
+        }
+
+        ret = guest_memfd_for_each_discarded_section(gmm, &tmp, rdl,
+                                                     guest_memfd_notify_populate_cb);
+        if (ret) {
+            break;
+        }
+    }
+
+    if (ret) {
+        /* Notify all already-notified listeners. */
+        QLIST_FOREACH(rdl2, &gmm->rdl_list, next) {
+            MemoryRegionSection tmp = *rdl2->section;
+
+            if (rdl2 == rdl) {
+                break;
+            }
+            if (!memory_region_section_intersect_range(&tmp, offset, size)) {
+                continue;
+            }
+
+            guest_memfd_for_each_discarded_section(gmm, &tmp, rdl2,
+                                                   guest_memfd_notify_discard_cb);
+        }
+    }
+    return ret;
+}
+
+static bool guest_memfd_is_range_populated(GuestMemfdManager *gmm,
+                                           uint64_t offset, uint64_t size)
+{
+    const unsigned long first_bit = offset / gmm->block_size;
+    const unsigned long last_bit = first_bit + (size / gmm->block_size) - 1;
+    unsigned long found_bit;
+
+    /* We fake a shorter bitmap to avoid searching too far. */
+    found_bit = find_next_zero_bit(gmm->bitmap, last_bit + 1, first_bit);
+    return found_bit > last_bit;
+}
+
+static bool guest_memfd_is_range_discarded(GuestMemfdManager *gmm,
+                                           uint64_t offset, uint64_t size)
+{
+    const unsigned long first_bit = offset / gmm->block_size;
+    const unsigned long last_bit = first_bit + (size / gmm->block_size) - 1;
+    unsigned long found_bit;
+
+    /* We fake a shorter bitmap to avoid searching too far.
+     */
+    found_bit = find_next_bit(gmm->bitmap, last_bit + 1, first_bit);
+    return found_bit > last_bit;
+}
+
+static int guest_memfd_state_change(GuestMemfdManager *gmm, uint64_t offset,
+                                    uint64_t size, bool shared_to_private)
+{
+    int ret = 0;
+
+    if (!guest_memfd_is_valid_range(gmm, offset, size)) {
+        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
+                     __func__, offset, size);
+        return -1;
+    }
+
+    if ((shared_to_private && guest_memfd_is_range_discarded(gmm, offset, size)) ||
+        (!shared_to_private && guest_memfd_is_range_populated(gmm, offset, size))) {
+        return 0;
+    }
+
+    if (shared_to_private) {
+        guest_memfd_notify_discard(gmm, offset, size);
+    } else {
+        ret = guest_memfd_notify_populate(gmm, offset, size);
+    }
+
+    if (!ret) {
+        unsigned long first_bit = offset / gmm->block_size;
+        unsigned long nbits = size / gmm->block_size;
+
+        g_assert((first_bit + nbits) <= gmm->bitmap_size);
+
+        if (shared_to_private) {
+            bitmap_clear(gmm->bitmap, first_bit, nbits);
+        } else {
+            bitmap_set(gmm->bitmap, first_bit, nbits);
+        }
+
+        return 0;
+    }
+
+    return ret;
+}
+
 static void guest_memfd_manager_init(Object *obj)
 {
     GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
@@ -239,8 +380,11 @@ static void guest_memfd_manager_finalize(Object *obj)
 
 static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
 {
+    GuestMemfdManagerClass *gmmc = GUEST_MEMFD_MANAGER_CLASS(oc);
     RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
 
+    gmmc->state_change = guest_memfd_state_change;
+
     rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
     rdmc->register_listener = guest_memfd_rdm_register_listener;
     rdmc->unregister_listener = guest_memfd_rdm_unregister_listener;

From patchwork Fri Dec 13 07:08:46 2024
X-Patchwork-Submitter: Chenyi Qiang
X-Patchwork-Id: 13906623
ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F060218FDB2 for ; Fri, 13 Dec 2024 07:09:28 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.10 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734073770; cv=none; b=gFoWIZ8Nlbcay1wlyUqB4CscXAr2UuM3i2gDO3T+V29XFyqnT8956yQEgJVtsb1Y3sOJB0bS8yj8c67gAwIgziB9ul8rVKtSP3qkFerW13MKdg29415Mg9reYZLFxybj+ruQ8do2PVYVy07d3O17qcUemix61mjba0cY3b0OPSk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734073770; c=relaxed/simple; bh=vCYpc1Vq5+niHLJ2a/1T5SIiwGkK2cfUHuX+/Tv813o=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=mQ+6akhQbfhiWvcTE8HftlZTPQNq7Y20drim/PAB9Hr4qdMvY/KHNkXuUv9pK5HLsW6wvadFW2JIdGyXml1o6ciDsF7waaNNnAuT9Pphyrr82KCVqxqycD5Xi4Mg50JxiW+BGHYVOezxFmQ2v9mnQUDTRhjJycx0YaKSa0KFfJA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=g0EvWFj7; arc=none smtp.client-ip=198.175.65.10 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="g0EvWFj7" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1734073769; x=1765609769; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=vCYpc1Vq5+niHLJ2a/1T5SIiwGkK2cfUHuX+/Tv813o=; b=g0EvWFj7BUgGvlgeiPOxFr/XPShO8MGMiqSz061LtdAfee2+6On27uDS hMOtM9MdQsfEA3ifXaiYVR0XJyz4hXMHwihyBErVEt5SFMnVTtnW+EJOL Xa430H4zBbFOUdv/sulpqcAcfzRRsGDqzASl5tMQCcel+JW6Wrw2rfWT+ 
3tL7NimRJiIGiEP8hePpx0ljyOBcoU94tsrrotEl1ZnQscb6VN3ARf/Fv kyzPVD1/2PLH/rmSTnttYTgzGC7j8NcY4ee5qlUG26PAFoF3o/7NKIsu8 mH8o8mIbdKR2v6Pz8/oTpBdIv6DlJkMvcLY65Eb5fQkRpFab60IpPMOnk A==; X-CSE-ConnectionGUID: 2aDDxuxURnum7lT9tXO4/Q== X-CSE-MsgGUID: 4gWqIqgbStysNeIPO0PBHg== X-IronPort-AV: E=McAfee;i="6700,10204,11284"; a="51937088" X-IronPort-AV: E=Sophos;i="6.12,230,1728975600"; d="scan'208";a="51937088" Received: from orviesa009.jf.intel.com ([10.64.159.149]) by orvoesa102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Dec 2024 23:09:29 -0800 X-CSE-ConnectionGUID: PkxAlT9xTnSNCm03DXVOsw== X-CSE-MsgGUID: Z1jMnQXCQzuVd9eD3SkzhA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.12,230,1728975600"; d="scan'208";a="96365565" Received: from emr-bkc.sh.intel.com ([10.112.230.82]) by orviesa009-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Dec 2024 23:09:26 -0800 From: Chenyi Qiang To: David Hildenbrand , Paolo Bonzini , Peter Xu , =?utf-8?q?Philippe_Mathieu-Daud=C3=A9?= , Michael Roth Cc: Chenyi Qiang , qemu-devel@nongnu.org, kvm@vger.kernel.org, Williams Dan J , Peng Chao P , Gao Chao , Xu Yilun Subject: [PATCH 4/7] KVM: Notify the state change event during shared/private conversion Date: Fri, 13 Dec 2024 15:08:46 +0800 Message-ID: <20241213070852.106092-5-chenyi.qiang@intel.com> X-Mailer: git-send-email 2.43.5 In-Reply-To: <20241213070852.106092-1-chenyi.qiang@intel.com> References: <20241213070852.106092-1-chenyi.qiang@intel.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Introduce a helper to trigger the state_change() callback of the class. Once exit to userspace to convert the page from private to shared or vice versa at runtime, notify the event via the helper so that other registered subsystems like VFIO can be notified. 
Signed-off-by: Chenyi Qiang
---
 accel/kvm/kvm-all.c                  |  4 ++++
 include/sysemu/guest-memfd-manager.h | 15 +++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 52425af534..38f41a98a5 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -48,6 +48,7 @@
 #include "kvm-cpus.h"
 #include "sysemu/dirtylimit.h"
 #include "qemu/range.h"
+#include "sysemu/guest-memfd-manager.h"
 #include "hw/boards.h"
 #include "sysemu/stats.h"
 
@@ -3080,6 +3081,9 @@ int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
     addr = memory_region_get_ram_ptr(mr) + section.offset_within_region;
     rb = qemu_ram_block_from_host(addr, false, &offset);
 
+    guest_memfd_manager_state_change(GUEST_MEMFD_MANAGER(mr->rdm), offset,
+                                     size, to_private);
+
     if (to_private) {
         if (rb->page_size != qemu_real_host_page_size()) {
             /*
diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/guest-memfd-manager.h
index f4b175529b..9dc4e0346d 100644
--- a/include/sysemu/guest-memfd-manager.h
+++ b/include/sysemu/guest-memfd-manager.h
@@ -46,4 +46,19 @@ struct GuestMemfdManagerClass {
                         bool shared_to_private);
 };
 
+static inline int guest_memfd_manager_state_change(GuestMemfdManager *gmm, uint64_t offset,
+                                                   uint64_t size, bool shared_to_private)
+{
+    GuestMemfdManagerClass *klass;
+
+    g_assert(gmm);
+    klass = GUEST_MEMFD_MANAGER_GET_CLASS(gmm);
+
+    if (klass->state_change) {
+        return klass->state_change(gmm, offset, size, shared_to_private);
+    }
+
+    return 0;
+}
+
 #endif

From patchwork Fri Dec 13 07:08:47 2024
X-Patchwork-Submitter: Chenyi Qiang
X-Patchwork-Id: 13906624
From: Chenyi Qiang
To: David Hildenbrand, Paolo Bonzini, Peter Xu, Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel@nongnu.org, kvm@vger.kernel.org, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
Subject: [PATCH 5/7] memory: Register the RamDiscardManager instance upon guest_memfd creation
Date: Fri, 13 Dec 2024 15:08:47 +0800
Message-ID: <20241213070852.106092-6-chenyi.qiang@intel.com>
In-Reply-To: <20241213070852.106092-1-chenyi.qiang@intel.com>

Introduce realize()/unrealize() callbacks to initialize/uninitialize the
new guest_memfd_manager object and register/unregister it in the target
MemoryRegion.

guest_memfd was initially set to shared until commit bd3bcf6962
("kvm/memory: Make memory type private by default if it has guest memfd
backend"). To align with that change, the default state in
guest_memfd_manager is set to private (the bitmap is cleared to 0).
Defaulting to private also reduces the overhead of VFIO mapping shared
pages into the IOMMU during the bootup stage.
Signed-off-by: Chenyi Qiang
---
 include/sysemu/guest-memfd-manager.h | 27 +++++++++++++++++++++++++++
 system/guest-memfd-manager.c         | 28 +++++++++++++++++++++++++++-
 system/physmem.c                     |  7 +++++++
 3 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/include/sysemu/guest-memfd-manager.h b/include/sysemu/guest-memfd-manager.h
index 9dc4e0346d..d1e7f698e8 100644
--- a/include/sysemu/guest-memfd-manager.h
+++ b/include/sysemu/guest-memfd-manager.h
@@ -42,6 +42,8 @@ struct GuestMemfdManager {
 struct GuestMemfdManagerClass {
     ObjectClass parent_class;
 
+    void (*realize)(GuestMemfdManager *gmm, MemoryRegion *mr, uint64_t region_size);
+    void (*unrealize)(GuestMemfdManager *gmm);
     int (*state_change)(GuestMemfdManager *gmm, uint64_t offset, uint64_t size,
                         bool shared_to_private);
 };
@@ -61,4 +63,29 @@ static inline int guest_memfd_manager_state_change(GuestMemfdManager *gmm, uint6
     return 0;
 }
 
+static inline void guest_memfd_manager_realize(GuestMemfdManager *gmm,
+                                               MemoryRegion *mr, uint64_t region_size)
+{
+    GuestMemfdManagerClass *klass;
+
+    g_assert(gmm);
+    klass = GUEST_MEMFD_MANAGER_GET_CLASS(gmm);
+
+    if (klass->realize) {
+        klass->realize(gmm, mr, region_size);
+    }
+}
+
+static inline void guest_memfd_manager_unrealize(GuestMemfdManager *gmm)
+{
+    GuestMemfdManagerClass *klass;
+
+    g_assert(gmm);
+    klass = GUEST_MEMFD_MANAGER_GET_CLASS(gmm);
+
+    if (klass->unrealize) {
+        klass->unrealize(gmm);
+    }
+}
+
 #endif
diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
index 6601df5f3f..b6a32f0bfb 100644
--- a/system/guest-memfd-manager.c
+++ b/system/guest-memfd-manager.c
@@ -366,6 +366,31 @@ static int guest_memfd_state_change(GuestMemfdManager *gmm, uint64_t offset,
     return ret;
 }
 
+static void guest_memfd_manager_realizefn(GuestMemfdManager *gmm, MemoryRegion *mr,
+                                          uint64_t region_size)
+{
+    uint64_t bitmap_size;
+
+    gmm->block_size = qemu_real_host_page_size();
+    bitmap_size = ROUND_UP(region_size, gmm->block_size) / gmm->block_size;
+
+    gmm->mr = mr;
+    gmm->bitmap_size = bitmap_size;
+    gmm->bitmap = bitmap_new(bitmap_size);
+
+    memory_region_set_ram_discard_manager(gmm->mr, RAM_DISCARD_MANAGER(gmm));
+}
+
+static void guest_memfd_manager_unrealizefn(GuestMemfdManager *gmm)
+{
+    memory_region_set_ram_discard_manager(gmm->mr, NULL);
+
+    g_free(gmm->bitmap);
+    gmm->bitmap = NULL;
+    gmm->bitmap_size = 0;
+    gmm->mr = NULL;
+}
+
 static void guest_memfd_manager_init(Object *obj)
 {
     GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(obj);
@@ -375,7 +400,6 @@ static void guest_memfd_manager_init(Object *obj)
 
 static void guest_memfd_manager_finalize(Object *obj)
 {
-    g_free(GUEST_MEMFD_MANAGER(obj)->bitmap);
 }
 
 static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
@@ -384,6 +408,8 @@ static void guest_memfd_manager_class_init(ObjectClass *oc, void *data)
     RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_CLASS(oc);
 
     gmmc->state_change = guest_memfd_state_change;
+    gmmc->realize = guest_memfd_manager_realizefn;
+    gmmc->unrealize = guest_memfd_manager_unrealizefn;
 
     rdmc->get_min_granularity = guest_memfd_rdm_get_min_granularity;
     rdmc->register_listener = guest_memfd_rdm_register_listener;
diff --git a/system/physmem.c b/system/physmem.c
index dc1db3a384..532182a6dd 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -53,6 +53,7 @@
 #include "sysemu/hostmem.h"
 #include "sysemu/hw_accel.h"
 #include "sysemu/xen-mapcache.h"
+#include "sysemu/guest-memfd-manager.h"
 #include "trace.h"
 
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
@@ -1885,6 +1886,9 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
             qemu_mutex_unlock_ramlist();
             goto out_free;
         }
+
+        GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(object_new(TYPE_GUEST_MEMFD_MANAGER));
+        guest_memfd_manager_realize(gmm, new_block->mr, new_block->mr->size);
     }
 
     ram_size = (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS;
@@ -2139,6 +2143,9 @@ static void reclaim_ramblock(RAMBlock *block)
 
     if (block->guest_memfd >= 0) {
         close(block->guest_memfd);
+        GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(block->mr->rdm);
+        guest_memfd_manager_unrealize(gmm);
+        object_unref(OBJECT(gmm));
         ram_block_discard_require(false);
     }

From patchwork Fri Dec 13 07:08:48 2024
X-Patchwork-Submitter: Chenyi Qiang
X-Patchwork-Id: 13906625
From: Chenyi Qiang
To: David Hildenbrand, Paolo Bonzini, Peter Xu, Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel@nongnu.org, kvm@vger.kernel.org, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
Subject: [PATCH 6/7] RAMBlock: make guest_memfd require coordinated discard
Date: Fri, 13 Dec 2024 15:08:48 +0800
Message-ID: <20241213070852.106092-7-chenyi.qiang@intel.com>
In-Reply-To: <20241213070852.106092-1-chenyi.qiang@intel.com>
Now that guest_memfd is managed by guest_memfd_manager with
RamDiscardManager, only block uncoordinated discard.

Signed-off-by: Chenyi Qiang
---
 system/physmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/system/physmem.c b/system/physmem.c
index 532182a6dd..585090b063 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1872,7 +1872,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
         assert(kvm_enabled());
         assert(new_block->guest_memfd < 0);
 
-        ret = ram_block_discard_require(true);
+        ret = ram_block_coordinated_discard_require(true);
         if (ret < 0) {
             error_setg_errno(errp, -ret,
                              "cannot set up private guest memory: discard currently blocked");

From patchwork Fri Dec 13 07:08:49 2024
X-Patchwork-Submitter: Chenyi Qiang
X-Patchwork-Id: 13906626
From: Chenyi Qiang
To: David Hildenbrand, Paolo Bonzini, Peter Xu, Philippe Mathieu-Daudé, Michael Roth
Cc: Chenyi Qiang, qemu-devel@nongnu.org, kvm@vger.kernel.org, Williams Dan J, Peng Chao P, Gao Chao, Xu Yilun
Subject: [RFC PATCH 7/7] memory: Add a new argument to indicate the request attribute in RamDiscardManager helpers
Date: Fri, 13 Dec 2024 15:08:49 +0800
Message-ID: <20241213070852.106092-8-chenyi.qiang@intel.com>
In-Reply-To: <20241213070852.106092-1-chenyi.qiang@intel.com>

For each RamDiscardManager helper, add a new argument 'is_private' to
indicate the requested attribute. If is_private is true, the operation
targets the private range of the section. For example,
replay_populated(true) replays the populate operation on the private part
of the MemoryRegionSection, while replay_populated(false) replays it on
the shared part.

This helps to distinguish the private/shared state from the
discarded/populated state. The distinction is essential for
guest_memfd_manager, which uses the RamDiscardManager interface but
cannot treat private memory as discarded memory, because that does not
align with the expectations of current RamDiscardManager users (e.g.
live migration), which assume that discarded memory is hot-removed and
can be skipped when processing guest memory. Treating private memory as
discarded also won't work in the future if live migration needs to
handle (i.e. migrate) private memory.

The user of each helper needs to figure out which attribute to
manipulate. For the legacy VM case, use is_private=true by default; the
private attribute is only valid in a guest_memfd-based VM.
Opportunistically rename guest_memfd_for_each_{discarded,populated}_section()
to guest_memfd_for_each_{private,shared}_section() to distinguish between
private/shared and discarded/populated at the same time.

Signed-off-by: Chenyi Qiang
---
 hw/vfio/common.c             |  22 ++++++--
 hw/virtio/virtio-mem.c       |  23 ++++----
 include/exec/memory.h        |  23 ++++++--
 migration/ram.c              |  14 ++---
 system/guest-memfd-manager.c | 106 +++++++++++++++++++++++------------
 system/memory.c              |  13 +++--
 system/memory_mapping.c      |   4 +-
 7 files changed, 135 insertions(+), 70 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index dcef44fe55..a6f49e6450 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -345,7 +345,8 @@ out:
 }
 
 static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
-                                            MemoryRegionSection *section)
+                                            MemoryRegionSection *section,
+                                            bool is_private)
 {
     VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
                                                 listener);
@@ -354,6 +355,11 @@ static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
     const hwaddr iova = section->offset_within_address_space;
     int ret;
 
+    if (is_private) {
+        /* Not support discard private memory yet. */
+        return;
+    }
+
     /* Unmap with a single call. */
     ret = vfio_container_dma_unmap(bcontainer, iova, size, NULL);
     if (ret) {
@@ -363,7 +369,8 @@ static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
 }
 
 static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
-                                            MemoryRegionSection *section)
+                                            MemoryRegionSection *section,
+                                            bool is_private)
 {
     VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
                                                 listener);
@@ -374,6 +381,11 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
     void *vaddr;
     int ret;
 
+    if (is_private) {
+        /* Not support discard private memory yet. */
+        return 0;
+    }
+
     /*
      * Map in (aligned within memory region) minimum granularity, so we can
      * unmap in minimum granularity later.
@@ -390,7 +402,7 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
                                vaddr, section->readonly);
         if (ret) {
             /* Rollback */
-            vfio_ram_discard_notify_discard(rdl, section);
+            vfio_ram_discard_notify_discard(rdl, section, false);
             return ret;
         }
     }
@@ -1248,7 +1260,7 @@ out:
 }
 
 static int vfio_ram_discard_get_dirty_bitmap(MemoryRegionSection *section,
-                                             void *opaque)
+                                             bool is_private, void *opaque)
 {
     const hwaddr size = int128_get64(section->size);
     const hwaddr iova = section->offset_within_address_space;
@@ -1293,7 +1305,7 @@ vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainerBase *bcontainer,
      * We only want/can synchronize the bitmap for actually mapped parts -
      * which correspond to populated parts. Replay all populated parts.
      */
-    return ram_discard_manager_replay_populated(rdm, section,
+    return ram_discard_manager_replay_populated(rdm, section, false,
                                                 vfio_ram_discard_get_dirty_bitmap,
                                                 &vrdl);
 }
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index e3d1ccaeeb..e7304c7e47 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -312,14 +312,14 @@ static int virtio_mem_notify_populate_cb(MemoryRegionSection *s, void *arg)
 {
     RamDiscardListener *rdl = arg;
 
-    return rdl->notify_populate(rdl, s);
+    return rdl->notify_populate(rdl, s, false);
 }
 
 static int virtio_mem_notify_discard_cb(MemoryRegionSection *s, void *arg)
 {
     RamDiscardListener *rdl = arg;
 
-    rdl->notify_discard(rdl, s);
+    rdl->notify_discard(rdl, s, false);
     return 0;
 }
 
@@ -334,7 +334,7 @@ static void virtio_mem_notify_unplug(VirtIOMEM *vmem, uint64_t offset,
         if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
-        rdl->notify_discard(rdl, &tmp);
+        rdl->notify_discard(rdl, &tmp, false);
     }
 }
 
@@ -350,7 +350,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
         if (!memory_region_section_intersect_range(&tmp, offset, size)) {
             continue;
         }
-        ret = rdl->notify_populate(rdl, &tmp);
+        ret = rdl->notify_populate(rdl, &tmp, false);
         if (ret) {
             break;
         }
@@ -367,7 +367,7 @@ static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset,
             if (!memory_region_section_intersect_range(&tmp, offset, size)) {
                 continue;
             }
-            rdl2->notify_discard(rdl2, &tmp);
+            rdl2->notify_discard(rdl2, &tmp, false);
         }
     }
     return ret;
@@ -383,7 +383,7 @@ static void virtio_mem_notify_unplug_all(VirtIOMEM *vmem)
 
     QLIST_FOREACH(rdl, &vmem->rdl_list, next) {
         if (rdl->double_discard_supported) {
-            rdl->notify_discard(rdl, rdl->section);
+            rdl->notify_discard(rdl, rdl->section, false);
         } else {
             virtio_mem_for_each_plugged_section(vmem, rdl->section, rdl,
                                                 virtio_mem_notify_discard_cb);
@@ -1685,7 +1685,8 @@ static uint64_t virtio_mem_rdm_get_min_granularity(const RamDiscardManager *rdm,
 }
 
 static bool virtio_mem_rdm_is_populated(const RamDiscardManager *rdm,
-                                        const MemoryRegionSection *s)
+                                        const MemoryRegionSection *s,
+                                        bool is_private)
 {
     const VirtIOMEM *vmem = VIRTIO_MEM(rdm);
     uint64_t start_gpa = vmem->addr + s->offset_within_region;
@@ -1712,11 +1713,12 @@ static int virtio_mem_rdm_replay_populated_cb(MemoryRegionSection *s, void *arg)
 {
     struct VirtIOMEMReplayData *data = arg;
 
-    return ((ReplayRamPopulate)data->fn)(s, data->opaque);
+    return ((ReplayRamPopulate)data->fn)(s, false, data->opaque);
 }
 
 static int virtio_mem_rdm_replay_populated(const RamDiscardManager *rdm,
                                            MemoryRegionSection *s,
+                                           bool is_private,
                                            ReplayRamPopulate replay_fn,
                                            void *opaque)
 {
@@ -1736,12 +1738,13 @@ static int virtio_mem_rdm_replay_discarded_cb(MemoryRegionSection *s,
 {
     struct VirtIOMEMReplayData *data = arg;
 
-    ((ReplayRamDiscard)data->fn)(s, data->opaque);
+    ((ReplayRamDiscard)data->fn)(s, false, data->opaque);
     return 0;
 }
 
 static void virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
                                             MemoryRegionSection *s,
+                                            bool is_private,
                                             ReplayRamDiscard replay_fn,
                                             void *opaque)
 {
@@ -1783,7 +1786,7 @@ static void virtio_mem_rdm_unregister_listener(RamDiscardManager *rdm,
     g_assert(rdl->section->mr == &vmem->memdev->mr);
     if (vmem->size) {
         if (rdl->double_discard_supported) {
-            rdl->notify_discard(rdl, rdl->section);
+            rdl->notify_discard(rdl, rdl->section, false);
         } else {
             virtio_mem_for_each_plugged_section(vmem, rdl->section, rdl,
                                                 virtio_mem_notify_discard_cb);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index ec7bc641e8..8aac61af08 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -508,9 +508,11 @@ struct IOMMUMemoryRegionClass {
 typedef struct RamDiscardListener RamDiscardListener;
 
 typedef int (*NotifyRamPopulate)(RamDiscardListener *rdl,
-                                 MemoryRegionSection *section);
+                                 MemoryRegionSection *section,
+                                 bool is_private);
 typedef void (*NotifyRamDiscard)(RamDiscardListener *rdl,
-                                 MemoryRegionSection *section);
+                                 MemoryRegionSection *section,
+                                 bool is_private);
 
 struct RamDiscardListener {
     /*
@@ -566,8 +568,8 @@ static inline void ram_discard_listener_init(RamDiscardListener *rdl,
     rdl->double_discard_supported = double_discard_supported;
 }
 
-typedef int (*ReplayRamPopulate)(MemoryRegionSection *section, void *opaque);
-typedef void (*ReplayRamDiscard)(MemoryRegionSection *section, void *opaque);
+typedef int (*ReplayRamPopulate)(MemoryRegionSection *section, bool is_private, void *opaque);
+typedef void (*ReplayRamDiscard)(MemoryRegionSection *section, bool is_private, void *opaque);
 
 /*
  * RamDiscardManagerClass:
@@ -632,11 +634,13 @@ struct RamDiscardManagerClass {
      *
      * @rdm: the #RamDiscardManager
      * @section: the #MemoryRegionSection
+     * @is_private: the attribute of the request section
      *
      * Returns whether the given range is completely populated.
      */
     bool (*is_populated)(const RamDiscardManager *rdm,
-                         const MemoryRegionSection *section);
+                         const MemoryRegionSection *section,
+                         bool is_private);
 
     /**
      * @replay_populated:
@@ -648,6 +652,7 @@ struct RamDiscardManagerClass {
      *
      * @rdm: the #RamDiscardManager
      * @section: the #MemoryRegionSection
+     * @is_private: the attribute of the populated parts
      * @replay_fn: the #ReplayRamPopulate callback
      * @opaque: pointer to forward to the callback
      *
@@ -655,6 +660,7 @@ struct RamDiscardManagerClass {
      */
     int (*replay_populated)(const RamDiscardManager *rdm,
                             MemoryRegionSection *section,
+                            bool is_private,
                             ReplayRamPopulate replay_fn,
                             void *opaque);
 
     /**
@@ -665,11 +671,13 @@ struct RamDiscardManagerClass {
      *
      * @rdm: the #RamDiscardManager
      * @section: the #MemoryRegionSection
+     * @is_private: the attribute of the discarded parts
      * @replay_fn: the #ReplayRamDiscard callback
      * @opaque: pointer to forward to the callback
      */
     void (*replay_discarded)(const RamDiscardManager *rdm,
                              MemoryRegionSection *section,
+                             bool is_private,
                              ReplayRamDiscard replay_fn,
                              void *opaque);
 
     /**
@@ -709,15 +717,18 @@ uint64_t ram_discard_manager_get_min_granularity(const RamDiscardManager *rdm,
                                                  const MemoryRegion *mr);
 
 bool ram_discard_manager_is_populated(const RamDiscardManager *rdm,
-                                      const MemoryRegionSection *section);
+                                      const MemoryRegionSection *section,
+                                      bool is_private);
 
 int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
                                          MemoryRegionSection *section,
+                                         bool is_private,
                                          ReplayRamPopulate replay_fn,
                                          void *opaque);
 
 void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
                                           MemoryRegionSection *section,
+                                          bool is_private,
                                           ReplayRamDiscard replay_fn,
                                           void *opaque);
 
diff --git a/migration/ram.c b/migration/ram.c
index 05ff9eb328..b9efba1d14 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -838,7 +838,7 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
 }
 
 static void dirty_bitmap_clear_section(MemoryRegionSection *section,
-                                       void *opaque)
+                                       bool is_private, void *opaque)
 {
     const hwaddr offset = section->offset_within_region;
     const hwaddr size = int128_get64(section->size);
@@ -884,7 +884,7 @@ static uint64_t ramblock_dirty_bitmap_clear_discarded_pages(RAMBlock *rb)
             .size = int128_make64(qemu_ram_get_used_length(rb)),
         };
 
-        ram_discard_manager_replay_discarded(rdm, &section,
+        ram_discard_manager_replay_discarded(rdm, &section, false,
                                              dirty_bitmap_clear_section,
                                              &cleared_bits);
     }
@@ -907,7 +907,7 @@ bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start)
             .size = int128_make64(qemu_ram_pagesize(rb)),
         };
 
-        return !ram_discard_manager_is_populated(rdm, &section);
+        return !ram_discard_manager_is_populated(rdm, &section, false);
     }
     return false;
 }
@@ -1539,7 +1539,7 @@ static inline void populate_read_range(RAMBlock *block, ram_addr_t offset,
 }
 
 static inline int populate_read_section(MemoryRegionSection *section,
-                                        void *opaque)
+                                        bool is_private, void *opaque)
 {
     const hwaddr size = int128_get64(section->size);
     hwaddr offset = section->offset_within_region;
@@ -1579,7 +1579,7 @@ static void ram_block_populate_read(RAMBlock *rb)
             .size = rb->mr->size,
         };
 
-        ram_discard_manager_replay_populated(rdm, &section,
+        ram_discard_manager_replay_populated(rdm, &section, false,
                                              populate_read_section, NULL);
     } else {
         populate_read_range(rb, 0, rb->used_length);
@@ -1614,7 +1614,7 @@ void ram_write_tracking_prepare(void)
 }
 
 static inline int uffd_protect_section(MemoryRegionSection *section,
-                                       void *opaque)
+                                       bool is_private, void *opaque)
 {
     const hwaddr size = int128_get64(section->size);
     const hwaddr offset = section->offset_within_region;
@@ -1638,7 +1638,7 @@ static int ram_block_uffd_protect(RAMBlock *rb, int uffd_fd)
             .size = rb->mr->size,
         };
 
-        return ram_discard_manager_replay_populated(rdm, &section,
+        return ram_discard_manager_replay_populated(rdm, &section, false,
                                                     uffd_protect_section,
                                                     (void *)(uintptr_t)uffd_fd);
     }
diff --git a/system/guest-memfd-manager.c b/system/guest-memfd-manager.c
index b6a32f0bfb..50802b34d7 100644
--- a/system/guest-memfd-manager.c
+++ b/system/guest-memfd-manager.c
@@ -23,39 +23,51 @@ OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(GuestMemfdManager,
                                           { })
 
 static bool guest_memfd_rdm_is_populated(const RamDiscardManager *rdm,
-                                         const MemoryRegionSection *section)
+                                         const MemoryRegionSection *section,
+                                         bool is_private)
 {
     const GuestMemfdManager *gmm = GUEST_MEMFD_MANAGER(rdm);
     uint64_t first_bit = section->offset_within_region / gmm->block_size;
     uint64_t last_bit = first_bit + int128_get64(section->size) / gmm->block_size - 1;
     unsigned long first_discard_bit;
 
-    first_discard_bit = find_next_zero_bit(gmm->bitmap, last_bit + 1, first_bit);
+    if (is_private) {
+        /* Check if the private section is populated */
+        first_discard_bit = find_next_bit(gmm->bitmap, last_bit + 1, first_bit);
+    } else {
+        /* Check if the shared section is populated */
+        first_discard_bit = find_next_zero_bit(gmm->bitmap, last_bit + 1, first_bit);
+    }
+
     return first_discard_bit > last_bit;
 }
 
-typedef int (*guest_memfd_section_cb)(MemoryRegionSection *s, void *arg);
+typedef int (*guest_memfd_section_cb)(MemoryRegionSection *s, bool is_private,
+                                      void *arg);
 
-static int guest_memfd_notify_populate_cb(MemoryRegionSection *section, void *arg)
+static int guest_memfd_notify_populate_cb(MemoryRegionSection *section, bool is_private,
+                                          void *arg)
 {
     RamDiscardListener *rdl = arg;
 
-    return rdl->notify_populate(rdl, section);
+    return rdl->notify_populate(rdl, section, is_private);
 }
 
-static int guest_memfd_notify_discard_cb(MemoryRegionSection *section, void *arg)
+static int guest_memfd_notify_discard_cb(MemoryRegionSection *section, bool is_private,
+                                         void *arg)
 {
     RamDiscardListener *rdl = arg;
 
-    rdl->notify_discard(rdl, section);
+    rdl->notify_discard(rdl, section, is_private);
 
     return 0;
 }
 
-static int guest_memfd_for_each_populated_section(const GuestMemfdManager *gmm,
-                                                  MemoryRegionSection *section,
-                                                  void *arg,
-                                                  guest_memfd_section_cb cb)
+static int guest_memfd_for_each_shared_section(const
GuestMemfdManager *gmm, + MemoryRegionSection *section, + bool is_private, + void *arg, + guest_memfd_section_cb cb) { unsigned long first_one_bit, last_one_bit; uint64_t offset, size; @@ -76,7 +88,7 @@ static int guest_memfd_for_each_populated_section(const GuestMemfdManager *gmm, break; } - ret = cb(&tmp, arg); + ret = cb(&tmp, is_private, arg); if (ret) { break; } @@ -88,10 +100,11 @@ static int guest_memfd_for_each_populated_section(const GuestMemfdManager *gmm, return ret; } -static int guest_memfd_for_each_discarded_section(const GuestMemfdManager *gmm, - MemoryRegionSection *section, - void *arg, - guest_memfd_section_cb cb) +static int guest_memfd_for_each_private_section(const GuestMemfdManager *gmm, + MemoryRegionSection *section, + bool is_private, + void *arg, + guest_memfd_section_cb cb) { unsigned long first_zero_bit, last_zero_bit; uint64_t offset, size; @@ -113,7 +126,7 @@ static int guest_memfd_for_each_discarded_section(const GuestMemfdManager *gmm, break; } - ret = cb(&tmp, arg); + ret = cb(&tmp, is_private, arg); if (ret) { break; } @@ -146,8 +159,9 @@ static void guest_memfd_rdm_register_listener(RamDiscardManager *rdm, QLIST_INSERT_HEAD(&gmm->rdl_list, rdl, next); - ret = guest_memfd_for_each_populated_section(gmm, section, rdl, - guest_memfd_notify_populate_cb); + /* Populate shared part */ + ret = guest_memfd_for_each_shared_section(gmm, section, false, rdl, + guest_memfd_notify_populate_cb); if (ret) { error_report("%s: Failed to register RAM discard listener: %s", __func__, strerror(-ret)); @@ -163,8 +177,9 @@ static void guest_memfd_rdm_unregister_listener(RamDiscardManager *rdm, g_assert(rdl->section); g_assert(rdl->section->mr == gmm->mr); - ret = guest_memfd_for_each_populated_section(gmm, rdl->section, rdl, - guest_memfd_notify_discard_cb); + /* Discard shared part */ + ret = guest_memfd_for_each_shared_section(gmm, rdl->section, false, rdl, + guest_memfd_notify_discard_cb); if (ret) { error_report("%s: Failed to unregister RAM 
discard listener: %s", __func__, strerror(-ret)); @@ -181,16 +196,18 @@ typedef struct GuestMemfdReplayData { void *opaque; } GuestMemfdReplayData; -static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, void *arg) +static int guest_memfd_rdm_replay_populated_cb(MemoryRegionSection *section, + bool is_private, void *arg) { struct GuestMemfdReplayData *data = arg; ReplayRamPopulate replay_fn = data->fn; - return replay_fn(section, data->opaque); + return replay_fn(section, is_private, data->opaque); } static int guest_memfd_rdm_replay_populated(const RamDiscardManager *rdm, MemoryRegionSection *section, + bool is_private, ReplayRamPopulate replay_fn, void *opaque) { @@ -198,22 +215,31 @@ static int guest_memfd_rdm_replay_populated(const RamDiscardManager *rdm, struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque }; g_assert(section->mr == gmm->mr); - return guest_memfd_for_each_populated_section(gmm, section, &data, - guest_memfd_rdm_replay_populated_cb); + if (is_private) { + /* Replay populate on private section */ + return guest_memfd_for_each_private_section(gmm, section, is_private, &data, + guest_memfd_rdm_replay_populated_cb); + } else { + /* Replay populate on shared section */ + return guest_memfd_for_each_shared_section(gmm, section, is_private, &data, + guest_memfd_rdm_replay_populated_cb); + } } -static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection *section, void *arg) +static int guest_memfd_rdm_replay_discarded_cb(MemoryRegionSection *section, + bool is_private, void *arg) { struct GuestMemfdReplayData *data = arg; ReplayRamDiscard replay_fn = data->fn; - replay_fn(section, data->opaque); + replay_fn(section, is_private, data->opaque); return 0; } static void guest_memfd_rdm_replay_discarded(const RamDiscardManager *rdm, MemoryRegionSection *section, + bool is_private, ReplayRamDiscard replay_fn, void *opaque) { @@ -221,8 +247,16 @@ static void guest_memfd_rdm_replay_discarded(const 
RamDiscardManager *rdm, struct GuestMemfdReplayData data = { .fn = replay_fn, .opaque = opaque }; g_assert(section->mr == gmm->mr); - guest_memfd_for_each_discarded_section(gmm, section, &data, - guest_memfd_rdm_replay_discarded_cb); + + if (is_private) { + /* Replay discard on private section */ + guest_memfd_for_each_private_section(gmm, section, is_private, &data, + guest_memfd_rdm_replay_discarded_cb); + } else { + /* Replay discard on shared section */ + guest_memfd_for_each_shared_section(gmm, section, is_private, &data, + guest_memfd_rdm_replay_discarded_cb); + } } static bool guest_memfd_is_valid_range(GuestMemfdManager *gmm, @@ -257,8 +291,9 @@ static void guest_memfd_notify_discard(GuestMemfdManager *gmm, continue; } - guest_memfd_for_each_populated_section(gmm, &tmp, rdl, - guest_memfd_notify_discard_cb); + /* For current shared section, notify to discard shared parts */ + guest_memfd_for_each_shared_section(gmm, &tmp, false, rdl, + guest_memfd_notify_discard_cb); } } @@ -276,8 +311,9 @@ static int guest_memfd_notify_populate(GuestMemfdManager *gmm, continue; } - ret = guest_memfd_for_each_discarded_section(gmm, &tmp, rdl, - guest_memfd_notify_populate_cb); + /* For current private section, notify to populate the shared parts */ + ret = guest_memfd_for_each_private_section(gmm, &tmp, false, rdl, + guest_memfd_notify_populate_cb); if (ret) { break; } @@ -295,8 +331,8 @@ static int guest_memfd_notify_populate(GuestMemfdManager *gmm, continue; } - guest_memfd_for_each_discarded_section(gmm, &tmp, rdl2, - guest_memfd_notify_discard_cb); + guest_memfd_for_each_private_section(gmm, &tmp, false, rdl2, + guest_memfd_notify_discard_cb); } } return ret; diff --git a/system/memory.c b/system/memory.c index ddcec90f5e..d3d5a04f98 100644 --- a/system/memory.c +++ b/system/memory.c @@ -2133,34 +2133,37 @@ uint64_t ram_discard_manager_get_min_granularity(const RamDiscardManager *rdm, } bool ram_discard_manager_is_populated(const RamDiscardManager *rdm, - const 
MemoryRegionSection *section) + const MemoryRegionSection *section, + bool is_private) { RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm); g_assert(rdmc->is_populated); - return rdmc->is_populated(rdm, section); + return rdmc->is_populated(rdm, section, is_private); } int ram_discard_manager_replay_populated(const RamDiscardManager *rdm, MemoryRegionSection *section, + bool is_private, ReplayRamPopulate replay_fn, void *opaque) { RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm); g_assert(rdmc->replay_populated); - return rdmc->replay_populated(rdm, section, replay_fn, opaque); + return rdmc->replay_populated(rdm, section, is_private, replay_fn, opaque); } void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm, MemoryRegionSection *section, + bool is_private, ReplayRamDiscard replay_fn, void *opaque) { RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm); g_assert(rdmc->replay_discarded); - rdmc->replay_discarded(rdm, section, replay_fn, opaque); + rdmc->replay_discarded(rdm, section, is_private, replay_fn, opaque); } void ram_discard_manager_register_listener(RamDiscardManager *rdm, @@ -2221,7 +2224,7 @@ bool memory_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr, * Disallow that. vmstate priorities make sure any RamDiscardManager * were already restored before IOMMUs are restored. 
*/ - if (!ram_discard_manager_is_populated(rdm, &tmp)) { + if (!ram_discard_manager_is_populated(rdm, &tmp, false)) { error_setg(errp, "iommu map to discarded memory (e.g., unplugged" " via virtio-mem): %" HWADDR_PRIx "", iotlb->translated_addr); diff --git a/system/memory_mapping.c b/system/memory_mapping.c index ca2390eb80..c55c0c0c93 100644 --- a/system/memory_mapping.c +++ b/system/memory_mapping.c @@ -249,7 +249,7 @@ static void guest_phys_block_add_section(GuestPhysListener *g, } static int guest_phys_ram_populate_cb(MemoryRegionSection *section, - void *opaque) + bool is_private, void *opaque) { GuestPhysListener *g = opaque; @@ -274,7 +274,7 @@ static void guest_phys_blocks_region_add(MemoryListener *listener, RamDiscardManager *rdm; rdm = memory_region_get_ram_discard_manager(section->mr); - ram_discard_manager_replay_populated(rdm, section, + ram_discard_manager_replay_populated(rdm, section, false, guest_phys_ram_populate_cb, g); return; }