
migration: Fix a potential guest memory corruption

Message ID 20220919093237.2219892-1-zhenzhong.duan@intel.com (mailing list archive)
State New, archived
Series migration: Fix a potential guest memory corruption

Commit Message

Duan, Zhenzhong Sept. 19, 2022, 9:32 a.m. UTC
Imagine a rare case: right after a dirty page is sent to the compression
threads' ring, a dirty bitmap sync triggers, marks the same page dirty
again, and the page is sent once more. The new copy may then be
overwritten on the destination by the stale copy still sitting in the
compression threads' ring.

So we need to ensure there is only one in-flight copy of any dirty page,
either by flushing the ring after every bitmap sync or by avoiding
processing the same dirty page twice in a row.

I chose the second option, which avoids the time-consuming flush operation.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 migration/ram.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
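
The -1/+1 pair preserves the scan semantics: pss->page and rs->last_page
are unsigned long, so initializing them to -1 yields ULONG_MAX, and the
first search start, pss->page + 1, wraps back to 0. Below is a minimal
standalone sketch of that behaviour (not QEMU code; find_next_dirty() is
a hypothetical stand-in for the find_next_bit() search that
migration_bitmap_find_dirty() performs):

#include <stdio.h>

/* stand-in for find_next_bit(): first set bit at or after 'start' */
static unsigned long find_next_dirty(unsigned long bitmap,
                                     unsigned long nbits,
                                     unsigned long start)
{
    for (unsigned long i = start; i < nbits; i++) {
        if (bitmap & (1UL << i)) {
            return i;
        }
    }
    return nbits;   /* no dirty page found */
}

int main(void)
{
    unsigned long bitmap = 0xBUL;   /* pages 0, 1 and 3 dirty */
    unsigned long page = -1;        /* new initial value: ULONG_MAX */

    /* first search: ULONG_MAX + 1 wraps to 0, so bit 0 is still seen */
    page = find_next_dirty(bitmap, 4, page + 1);
    printf("first dirty page: %lu\n", page);   /* prints 0 */

    /* later searches start strictly after the page just sent */
    page = find_next_dirty(bitmap, 4, page + 1);
    printf("next dirty page: %lu\n", page);    /* prints 1 */
    return 0;
}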

Comments

Dr. David Alan Gilbert Oct. 11, 2022, 11:05 a.m. UTC | #1
* Zhenzhong Duan (zhenzhong.duan@intel.com) wrote:

Hi,

> Imagine a rare case: right after a dirty page is sent to the compression
> threads' ring, a dirty bitmap sync triggers, marks the same page dirty
> again, and the page is sent once more. The new copy may then be
> overwritten on the destination by the stale copy still sitting in the
> compression threads' ring.

Yes, I think we had a similar problem in multifd.

> So we need to ensure there is only one in-flight copy of any dirty page,
> either by flushing the ring after every bitmap sync or by avoiding
> processing the same dirty page twice in a row.
> 
> I chose the second option, which avoids the time-consuming flush operation.

I'm not sure this guarantees it; it makes it much less likely, but if
only a few pages are dirtied and you have lots of threads, I think the
same thing could still happen.

I think you're going to need to flush the ring after each sync.

Dave
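
For reference, a minimal standalone model of the flush-on-sync
alternative suggested above (not QEMU code; in migration/ram.c the drain
step would roughly map onto the existing flush_compressed_data() helper,
though its placement at sync time is only an assumption):

#include <stdio.h>

#define RING_CAP 16

/* toy stand-in for the compression threads' ring of queued pages */
static int ring[RING_CAP];
static int ring_len;

static void queue_page(int page)
{
    ring[ring_len++] = page;   /* page handed to a compression worker */
}

static void flush_ring(void)
{
    /* drain: every queued page reaches the destination, in order */
    for (int i = 0; i < ring_len; i++) {
        printf("page %d written on destination\n", ring[i]);
    }
    ring_len = 0;
}

static void bitmap_sync(void)
{
    /* the key step: drain stale copies before re-reading the dirty log */
    flush_ring();
}

int main(void)
{
    queue_page(0);
    queue_page(3);
    bitmap_sync();   /* guest re-dirties page 3 around this sync */
    queue_page(3);   /* safe: the old copy of page 3 has already left */
    flush_ring();
    return 0;
}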

> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  migration/ram.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index dc1de9ddbc68..67b2035586bd 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1551,7 +1551,7 @@ static bool find_dirty_block(RAMState *rs, PageSearchStatus *pss, bool *again)
>      pss->postcopy_requested = false;
>      pss->postcopy_target_channel = RAM_CHANNEL_PRECOPY;
>  
> -    pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page);
> +    pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page + 1);
>      if (pss->complete_round && pss->block == rs->last_seen_block &&
>          pss->page >= rs->last_page) {
>          /*
> @@ -1564,7 +1564,7 @@ static bool find_dirty_block(RAMState *rs, PageSearchStatus *pss, bool *again)
>      if (!offset_in_ramblock(pss->block,
>                              ((ram_addr_t)pss->page) << TARGET_PAGE_BITS)) {
>          /* Didn't find anything in this RAM Block */
> -        pss->page = 0;
> +        pss->page = -1;
>          pss->block = QLIST_NEXT_RCU(pss->block, next);
>          if (!pss->block) {
>              /*
> @@ -2694,7 +2694,7 @@ static void ram_state_reset(RAMState *rs)
>  {
>      rs->last_seen_block = NULL;
>      rs->last_sent_block = NULL;
> -    rs->last_page = 0;
> +    rs->last_page = -1;
>      rs->last_version = ram_list.version;
>      rs->xbzrle_enabled = false;
>      postcopy_preempt_reset(rs);
> @@ -2889,7 +2889,7 @@ void ram_postcopy_send_discard_bitmap(MigrationState *ms)
>      /* Easiest way to make sure we don't resume in the middle of a host-page */
>      rs->last_seen_block = NULL;
>      rs->last_sent_block = NULL;
> -    rs->last_page = 0;
> +    rs->last_page = -1;
>  
>      postcopy_each_ram_send_discard(ms);
>  
> -- 
> 2.25.1
>
Duan, Zhenzhong Oct. 17, 2022, 2:17 a.m. UTC | #2
>-----Original Message-----
>From: Dr. David Alan Gilbert <dgilbert@redhat.com>
>Sent: Tuesday, October 11, 2022 7:06 PM
>To: Duan, Zhenzhong <zhenzhong.duan@intel.com>
>Cc: qemu-devel@nongnu.org; quintela@redhat.com
>Subject: Re: [PATCH] migration: Fix a potential guest memory corruption
>
>* Zhenzhong Duan (zhenzhong.duan@intel.com) wrote:
>
>Hi,
Hi,
Sorry for the late response, I just got back from vacation.

>
>> Imagine a rare case: right after a dirty page is sent to the
>> compression threads' ring, a dirty bitmap sync triggers, marks the
>> same page dirty again, and the page is sent once more. The new copy
>> may then be overwritten on the destination by the stale copy still
>> sitting in the compression threads' ring.
>
>Yes, I think we had a similar problem in multifd.
The multifd flush operation multifd_send_sync_main() is called in every
memory iteration, which is more aggressive than what compression does.
So I think this is not an issue in multifd?

>
>> So we need to ensure there is only one in-flight copy of any dirty
>> page, either by flushing the ring after every bitmap sync or by
>> avoiding processing the same dirty page twice in a row.
>>
>> I chose the second option, which avoids the time-consuming flush
>> operation.
>
>I'm not sure this guarantees it; it makes it much less likely, but if
>only a few pages are dirtied and you have lots of threads, I think the
>same thing could still happen.
I didn't get it. Imagine there are dirty pages "A B C D E F G" in the current RAMBlock.
1. Pages "A B C D" are sent to the compression threads.
2. A dirty bitmap sync triggers, updating the dirty map to "A B D E F G".
3. Page D is checked and sent to the compression threads again, so there may be two copies of page D in the compression threads' ring: corruption!
4. Pages "E F G" are sent to the compression threads.
5. The flush operation is triggered at the end of the current RAMBlock.
6. In the next iteration, pages "A B" are sent.

After the patch (a toy model contrasting both orderings follows below):
1. Pages "A B C D" are sent to the compression threads.
2. A dirty bitmap sync triggers, updating the dirty map to "A B D E F G".
3. The pages after page D are checked and sent to the compression threads, i.e. pages "E F G".
4. The flush operation is triggered at the end of the current RAMBlock, ensuring the first copy of page D is flushed.
5. In the next iteration, pages "A B D" are sent.
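
The difference between the two orderings can be reproduced with a small
standalone model (not QEMU code; pages A..G are indices 0..6 and the
compression ring itself is elided, only the queueing order matters):

#include <stdbool.h>
#include <stdio.h>

#define NPAGES 7   /* pages A..G modelled as 0..6 */

static int find_dirty(const bool *dirty, int start)
{
    for (int i = start; i < NPAGES; i++) {
        if (dirty[i]) {
            return i;
        }
    }
    return -1;   /* end of the round */
}

static void run_round(bool skip_current)
{
    bool dirty[NPAGES] = { true, true, true, true, true, true, true };
    int page = -1, sent = 0, next;

    while (true) {
        /* pre-patch: restart at the page just sent; post-patch: +1 */
        next = find_dirty(dirty, skip_current ? page + 1
                                              : (page < 0 ? 0 : page));
        if (next < 0) {
            break;
        }
        page = next;
        dirty[page] = false;
        printf("%c ", 'A' + page);   /* page enters the ring */
        if (++sent == 4) {
            dirty[3] = true;   /* bitmap sync re-dirties page D */
        }
    }
    printf("\n");
}

int main(void)
{
    run_round(false);   /* pre-patch:  A B C D D E F G  (D queued twice) */
    run_round(true);    /* post-patch: A B C D E F G    (D deferred)     */
    return 0;
}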

Thanks
Zhenzhong
>
>I think you're going to need to flush the ring after each sync.
>
>Dave
>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  migration/ram.c | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index dc1de9ddbc68..67b2035586bd 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -1551,7 +1551,7 @@ static bool find_dirty_block(RAMState *rs, PageSearchStatus *pss, bool *again)
>>      pss->postcopy_requested = false;
>>      pss->postcopy_target_channel = RAM_CHANNEL_PRECOPY;
>>
>> -    pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page);
>> +    pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page + 1);
>>      if (pss->complete_round && pss->block == rs->last_seen_block &&
>>          pss->page >= rs->last_page) {
>>          /*
>> @@ -1564,7 +1564,7 @@ static bool find_dirty_block(RAMState *rs, PageSearchStatus *pss, bool *again)
>>      if (!offset_in_ramblock(pss->block,
>>                              ((ram_addr_t)pss->page) << TARGET_PAGE_BITS)) {
>>          /* Didn't find anything in this RAM Block */
>> -        pss->page = 0;
>> +        pss->page = -1;
>>          pss->block = QLIST_NEXT_RCU(pss->block, next);
>>          if (!pss->block) {
>>              /*
>> @@ -2694,7 +2694,7 @@ static void ram_state_reset(RAMState *rs)
>>  {
>>      rs->last_seen_block = NULL;
>>      rs->last_sent_block = NULL;
>> -    rs->last_page = 0;
>> +    rs->last_page = -1;
>>      rs->last_version = ram_list.version;
>>      rs->xbzrle_enabled = false;
>>      postcopy_preempt_reset(rs);
>> @@ -2889,7 +2889,7 @@ void ram_postcopy_send_discard_bitmap(MigrationState *ms)
>>      /* Easiest way to make sure we don't resume in the middle of a host-page */
>>      rs->last_seen_block = NULL;
>>      rs->last_sent_block = NULL;
>> -    rs->last_page = 0;
>> +    rs->last_page = -1;
>>
>>      postcopy_each_ram_send_discard(ms);
>>
>> --
>> 2.25.1
>>
>--
>Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Patch

diff --git a/migration/ram.c b/migration/ram.c
index dc1de9ddbc68..67b2035586bd 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1551,7 +1551,7 @@  static bool find_dirty_block(RAMState *rs, PageSearchStatus *pss, bool *again)
     pss->postcopy_requested = false;
     pss->postcopy_target_channel = RAM_CHANNEL_PRECOPY;
 
-    pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page);
+    pss->page = migration_bitmap_find_dirty(rs, pss->block, pss->page + 1);
     if (pss->complete_round && pss->block == rs->last_seen_block &&
         pss->page >= rs->last_page) {
         /*
@@ -1564,7 +1564,7 @@  static bool find_dirty_block(RAMState *rs, PageSearchStatus *pss, bool *again)
     if (!offset_in_ramblock(pss->block,
                             ((ram_addr_t)pss->page) << TARGET_PAGE_BITS)) {
         /* Didn't find anything in this RAM Block */
-        pss->page = 0;
+        pss->page = -1;
         pss->block = QLIST_NEXT_RCU(pss->block, next);
         if (!pss->block) {
             /*
@@ -2694,7 +2694,7 @@  static void ram_state_reset(RAMState *rs)
 {
     rs->last_seen_block = NULL;
     rs->last_sent_block = NULL;
-    rs->last_page = 0;
+    rs->last_page = -1;
     rs->last_version = ram_list.version;
     rs->xbzrle_enabled = false;
     postcopy_preempt_reset(rs);
@@ -2889,7 +2889,7 @@  void ram_postcopy_send_discard_bitmap(MigrationState *ms)
     /* Easiest way to make sure we don't resume in the middle of a host-page */
     rs->last_seen_block = NULL;
     rs->last_sent_block = NULL;
-    rs->last_page = 0;
+    rs->last_page = -1;
 
     postcopy_each_ram_send_discard(ms);