Message ID: 20210121152458.193248-1-andrey.gruzdev@virtuozzo.com
Series: UFFD write-tracking migration/snapshots
On 21.01.21 16:24, andrey.gruzdev--- via wrote:
> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he
> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.
>
> Currently the only way to make an (external) live VM snapshot is to use the
> existing dirty page logging migration mechanism. The main problem is that it
> tends to produce a lot of page duplicates while the running VM goes on
> updating already saved pages. As a result, the vmstate image size is commonly
> several times bigger than the non-zero part of the virtual machine's RSS.
> The time required to converge RAM migration and the size of the snapshot
> image depend heavily on the guest memory write rate, sometimes resulting in
> unacceptably long snapshot creation times and huge image sizes.
>
> This series proposes a way to solve the aforementioned problems. This is done
> by using a different RAM migration mechanism based on the UFFD write
> protection management introduced in the v5.7 kernel. The migration strategy
> is to 'freeze' guest RAM content using write-protection and iteratively
> release protection for memory ranges that have already been saved to the
> migration stream. At the same time we read in pending UFFD write fault
> events and save those pages out-of-order with higher priority.

Hi,

just stumbled over this, quick question:

I recently played with UFFD_WP and noticed that write protection is only effective on pages/ranges that already have pages populated (IOW: !pte_none() in the kernel).

In case memory was never populated (or was discarded using e.g. madvise(DONTNEED)), write-protection will be skipped silently and you won't get WP events for applicable pages.

So if someone writes to a yet unpopulated page ("zero"), you won't get WP events.

I can spot that you do a single uffd_change_protection() on the whole RAMBlock.

How are you handling that scenario, or why don't you have to handle that scenario?
On 09.02.2021 15:37, David Hildenbrand wrote:
> On 21.01.21 16:24, andrey.gruzdev--- via wrote:
>> [cover letter snipped]
>
> Hi,
>
> just stumbled over this, quick question:
>
> I recently played with UFFD_WP and noticed that write protection is only
> effective on pages/ranges that already have pages populated (IOW:
> !pte_none() in the kernel).
>
> In case memory was never populated (or was discarded using e.g.
> madvise(DONTNEED)), write-protection will be skipped silently and you
> won't get WP events for applicable pages.
>
> So if someone writes to a yet unpopulated page ("zero"), you won't get WP
> events.
>
> I can spot that you do a single uffd_change_protection() on the whole
> RAMBlock.
>
> How are you handling that scenario, or why don't you have to handle that
> scenario?

Hi David,

I really wonder if such a problem exists.. If we are talking about a write to an unpopulated page, we should get a first page fault on the non-present page and populate it with protection bits from the respective vma. For a UFFD_WP vma the page will be populated non-writable. So we'll get another page fault on the present-but-read-only page and go to handle_userfault.
>> [David's question snipped]
>
> Hi David,
>
> I really wonder if such a problem exists.. If we are talking about a

I immediately ran into this issue with my simplest test cases. :)

> write to an unpopulated page, we should get a first page fault on the
> non-present page and populate it with protection bits from the respective
> vma. For a UFFD_WP vma the page will be populated non-writable. So we'll
> get another page fault on the present-but-read-only page and go to
> handle_userfault.

See the attached test program. Triggers for me on 5.11.0-rc6+ and 5.9.13-200.fc33:

gcc -lpthread uffdio_wp.c -o uffdio_wp
./uffdio_wp
WP did not fire

Uncomment the placement of the zeropage just before registering to make the WP actually trigger. If there is no PTE, there is nothing to protect.

And it makes sense: how should the fault handler know which ranges you wp-ed if there is no place to store that information (the PTEs!)? The VMA cannot tell that story; it only knows that someone registered UFFD_WP to selectively wp some parts.

You might have to register also for MISSING faults and place zero pages.
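The attachment itself is not preserved in this archive. Below is a minimal reconstruction of the kind of test described (my sketch, not David's actual uffdio_wp.c; assumes Linux >= 5.7 for UFFDIO_WRITEPROTECT):

/* Write-protect an anonymous range that was never populated, write to it,
 * and observe that no WP event arrives. Error handling trimmed. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int uffd;
static long page;
static volatile int fired;

/* Fault-handler thread: waits for one WP event, then removes protection
 * so the faulting writer can make progress. */
static void *handler(void *arg)
{
    struct uffd_msg msg;

    if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
        return NULL;
    if (msg.event == UFFD_EVENT_PAGEFAULT &&
        (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
        fired = 1;
        struct uffdio_writeprotect wp = {
            .range = { .start = msg.arg.pagefault.address &
                                ~((unsigned long)page - 1),
                       .len = page },
            .mode = 0,                     /* un-write-protect and wake */
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
    }
    return NULL;
}

int main(void)
{
    page = sysconf(_SC_PAGESIZE);
    uffd = syscall(__NR_userfaultfd, O_CLOEXEC);

    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    char *area = mmap(NULL, page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* Uncomment to populate the PTE first -- then the WP event fires: */
    /* area[0] = 0; */

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)area, .len = page },
        .mode  = UFFDIO_REGISTER_MODE_WP,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    struct uffdio_writeprotect wp = {
        .range = { .start = (unsigned long)area, .len = page },
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

    pthread_t t;
    pthread_create(&t, NULL, handler, NULL);

    area[0] = 1;        /* write to the (never populated) protected range */
    usleep(100 * 1000); /* give the handler a moment */

    printf(fired ? "WP fired\n" : "WP did not fire\n");
    return 0;
}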
Hi, David, Andrey,

On Tue, Feb 09, 2021 at 08:06:58PM +0100, David Hildenbrand wrote:
> > > [David's question snipped]

Good catch.. Indeed I overlooked that as well when reviewing the code.

> > Hi David,
> >
> > I really wonder if such a problem exists.. If we are talking about a
>
> I immediately ran into this issue with my simplest test cases. :)
>
> > write to an unpopulated page, we should get a first page fault on the
> > non-present page and populate it with protection bits from the
> > respective vma. For a UFFD_WP vma the page will be populated
> > non-writable. So we'll get another page fault on the
> > present-but-read-only page and go to handle_userfault.

The problem is even if the page is read-only, it does not yet have the uffd-wp bit set, so it won't really trigger the handle_userfault() path.

> You might have to register also for MISSING faults and place zero pages.

So I think what's missing for live snapshot is indeed to register with both missing & wp mode.

Then we'll receive two messages: for wp, we do like before. For missing, we do UFFDIO_ZEROPAGE and at the same time dump this page as a zero page.

I bet live snapshot didn't encounter this issue simply because normal live snapshots would still work, especially when there's a guest OS. Say, the worst case is we could have migrated some zero pages with some random data filled in along with the snapshot; however, if all these pages were zero pages and not used by the guest OS after all, then when we load the snapshot we won't easily notice either..

Thanks,
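A sketch of what that combined registration and fault loop could look like (not code from the series; save_page()/record_zero_page() are hypothetical snapshot hooks, and a 4k page size is assumed):

/* Track MISSING and WP faults together; resolve MISSING faults with
 * UFFDIO_ZEROPAGE and record the page as zero in the snapshot stream.
 * Error handling omitted for brevity. */
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define PAGE_SIZE_ASSUMED 4096UL          /* assume 4k pages for the sketch */

static void save_page(unsigned long addr);        /* hypothetical hook */
static void record_zero_page(unsigned long addr); /* hypothetical hook */

static void track_and_serve(int uffd, unsigned long start, unsigned long len)
{
    struct uffdio_register reg = {
        .range = { .start = start, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    struct uffdio_writeprotect wp = {
        .range = { .start = start, .len = len },
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

    struct uffd_msg msg;
    while (read(uffd, &msg, sizeof(msg)) == (ssize_t)sizeof(msg)) {
        if (msg.event != UFFD_EVENT_PAGEFAULT)
            continue;
        unsigned long addr =
            msg.arg.pagefault.address & ~(PAGE_SIZE_ASSUMED - 1);

        if (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) {
            /* "for wp, we do like before": save with priority, un-protect */
            save_page(addr);
            struct uffdio_writeprotect unwp = {
                .range = { .start = addr, .len = PAGE_SIZE_ASSUMED },
                .mode  = 0,
            };
            ioctl(uffd, UFFDIO_WRITEPROTECT, &unwp);
        } else {
            /* MISSING fault: place a zero page, dump it as a zero page */
            struct uffdio_zeropage zp = {
                .range = { .start = addr, .len = PAGE_SIZE_ASSUMED },
            };
            ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
            record_zero_page(addr);
        }
    }
}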
On Tue, Feb 09, 2021 at 03:09:28PM -0500, Peter Xu wrote:
> [previous message snipped]
>
> I bet live snapshot didn't encounter this issue simply because normal
> live snapshots would still work, especially when there's a guest OS. [...]

I'm thinking of some way to verify this from the live snapshot pov, and I've got an idea so I'll just share it out... Maybe we need a guest application that does something like below:

- mmap() a huge lot of memory

- call mlockall(), so that pages will be provisioned in the guest but without data written. IIUC on the host these pages should be backed by missing pages as long as the guest app doesn't write. Then...

- the app starts to read input from the user:

  - If the user inputs "dirty" and enter: it'll start to dirty the whole range. Writing non-zero to the 1st byte of each page would suffice.

  - If the user inputs "check" and enter: it'll read the whole memory chunk to see whether all the pages are zero pages. If it reads any non-zero page, it should bail out and print an error.

With the help of the above program, we can do below to verify that the live snapshot worked as expected on zero pages:

- Guest: start the above program, don't input yet (so it's waiting to read either the "dirty" or "check" command)

- Host: start live snapshot

- Guest: input the "dirty" command, so it starts quickly dirtying the ram

- Host: live snapshot completes

Then to verify the snapshot image, we do:

- Host: load the snapshot we've got

- Guest: (should still be in the state of waiting for a cmd) this time we enter "check"

Thanks,
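A sketch of such a guest application, with details not given in the thread filled in by assumption (1 GiB region, 4k pages, commands read from stdin):

/* Guest-side checker: provision a large region without writing data, then
 * "dirty" or "check" it on command. "check" is meant to be run after the
 * snapshot has been loaded back, and should find only zero pages.
 * Note: mlockall() may need a raised RLIMIT_MEMLOCK or privileges. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SZ   (1UL << 30)    /* 1 GiB, adjust to taste */
#define PAGE 4096UL

int main(void)
{
    unsigned char *mem = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }
    if (mlockall(MCL_CURRENT | MCL_FUTURE)) { perror("mlockall"); return 1; }

    char cmd[16];
    while (scanf("%15s", cmd) == 1) {
        if (!strcmp(cmd, "dirty")) {
            for (unsigned long off = 0; off < SZ; off += PAGE)
                mem[off] = 0xaa;        /* non-zero first byte of each page */
            printf("dirtied\n");
        } else if (!strcmp(cmd, "check")) {
            for (unsigned long off = 0; off < SZ; off += PAGE)
                if (mem[off] != 0) {
                    printf("error: non-zero page at offset %lu\n", off);
                    return 1;
                }
            printf("all zero\n");
        }
    }
    return 0;
}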
On 09.02.2021 23:31, Peter Xu wrote:
> On Tue, Feb 09, 2021 at 03:09:28PM -0500, Peter Xu wrote:
>> [previous discussion snipped]
>
> I'm thinking of some way to verify this from the live snapshot pov, and
> I've got an idea so I'll just share it out... Maybe we need a guest
> application that does something like below:
>
> [test program and procedure snipped]
>
> Thanks,

Hi David, Peter,

A little unexpected behavior, from my point of view, for UFFD write-protection. So, does that mean that UFFD_WP protection/events work only for locked memory? I'm now looking at the kernel implementation, to understand..
On 09.02.2021 22:06, David Hildenbrand wrote:
> [earlier exchange snipped]
>
> See the attached test program. Triggers for me on 5.11.0-rc6+ and
> 5.9.13-200.fc33:
>
> gcc -lpthread uffdio_wp.c -o uffdio_wp
> ./uffdio_wp
> WP did not fire
>
> Uncomment the placement of the zeropage just before registering to make
> the WP actually trigger. If there is no PTE, there is nothing to protect.
>
> And it makes sense: how should the fault handler know which ranges you
> wp-ed if there is no place to store that information (the PTEs!)? The VMA
> cannot tell that story; it only knows that someone registered UFFD_WP to
> selectively wp some parts.
>
> You might have to register also for MISSING faults and place zero pages.

Looked at the kernel code; agree that we miss WP events for unpopulated pages, since the UFFD_WP soft bit won't be set in this case. But it doesn't make the saved snapshot inconsistent or introduce security issues. The only side effect is that we may save an updated page instead of a zeroed one, just increasing the snapshot size. However, this guest-physical page has never been touched from the point of view of the saved vCPU/device state, so it is not a concern.

Often (at least on desktop Windows guests) only a small part of RAM has ever been allocated by the guest. Migration code needs to read each guest-physical page, so we'd get a lot of additional UFFD events, many more MISSING events than WP faults.

And the main problem is that adding a MISSING handler is impossible in the current single-threaded snapshot code. We'd get an immediate deadlock on the iterative page read.
On Thu, Feb 11, 2021 at 12:21:51PM +0300, Andrey Gruzdev wrote:
> [quoted discussion snipped]
>
> Hi David, Peter,
>
> A little unexpected behavior, from my point of view, for UFFD
> write-protection. So, does that mean that UFFD_WP protection/events work
> only for locked memory? I'm now looking at the kernel implementation, to
> understand..

Not really; it definitely works for all memory that we've touched. My previous example wanted to let the guest app use a not-yet-allocated page. I figured mlockall() might achieve that, hence I proposed such an example assuming it may verify the zero page issue on live snapshot. So if my understanding is correct, if we run the above scenario, the current live snapshot might fail that app when we do the "check" command at last, by finding non-zero pages.

Thanks,
On Thu, Feb 11, 2021 at 07:19:47PM +0300, Andrey Gruzdev wrote:
> On 09.02.2021 22:06, David Hildenbrand wrote:
>> [quoted discussion snipped]
>
> Looked at the kernel code; agree that we miss WP events for unpopulated
> pages, since the UFFD_WP soft bit won't be set in this case. But it
> doesn't make the saved snapshot inconsistent or introduce security
> issues. The only side effect is that we may save an updated page instead
> of a zeroed one, just increasing the snapshot size. However, this
> guest-physical page has never been touched from the point of view of the
> saved vCPU/device state, so it is not a concern.

Oh, I just remembered one thing: Linux should be zeroing pages when allocating, so even if the page has legacy content it'll be cleared with __GFP_ZERO allocations. So yeah, it would be harder to hit an issue, at least with a sensible OS. I'm not sure about Windows or others, but it could be the common case. Then the only overhead is the extra pages we keep in the live snapshot, which take some more disk space.

Or there could be firmware running without an OS at all, but it should really not read unallocated pages assuming they must be zero. That's not sane behavior even for firmware.

> Often (at least on desktop Windows guests) only a small part of RAM has
> ever been allocated by the guest. Migration code needs to read each
> guest-physical page, so we'd get a lot of additional UFFD events, many
> more MISSING events than WP faults.
>
> And the main problem is that adding a MISSING handler is impossible in
> the current single-threaded snapshot code. We'd get an immediate deadlock
> on the iterative page read.

Right. We'd need to rework the design just to save a bunch of snapshot image disk size. So now I agree with you; let's keep this in mind, but maybe it isn't worth a fix for now, at least until we find something really broken.

Andrey, do you think we should still mention this issue in the todo list of the wiki page for live snapshots?

Thanks,
On 11.02.2021 20:18, Peter Xu wrote:
> On Thu, Feb 11, 2021 at 12:21:51PM +0300, Andrey Gruzdev wrote:
>> [quoted discussion snipped]
>>
>> Hi David, Peter,
>>
>> A little unexpected behavior, from my point of view, for UFFD
>> write-protection. So, does that mean that UFFD_WP protection/events work
>> only for locked memory? I'm now looking at the kernel implementation, to
>> understand..
>
> Not really; it definitely works for all memory that we've touched. My
> previous example wanted to let the guest app use a not-yet-allocated
> page. I figured mlockall() might achieve that, hence I proposed such an
> example assuming it may verify the zero page issue on live snapshot. So
> if my understanding is correct, if we run the above scenario, the current
> live snapshot might fail that app when we do the "check" command at last,
> by finding non-zero pages.
>
> Thanks,

Yes, I understand the limitations with vma's, which lead to the fact that we can write-protect with PTE soft bits only. I think mlockall() is not required, though; mmap() with MAP_POPULATE is enough, since the problem is related only to pte_none() entries.
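A one-function sketch of the mmap() Andrey means (MAP_POPULATE pre-faults the pages at map time, so none of the entries are pte_none() afterwards):

#include <stddef.h>
#include <sys/mman.h>

/* Pre-fault all PTEs at mmap() time; a later UFFDIO_WRITEPROTECT over the
 * range then has entries to mark, without needing mlockall(). */
void *map_populated(size_t size)
{
    return mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
}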
On 11.02.2021 20:32, Peter Xu wrote:
> On Thu, Feb 11, 2021 at 07:19:47PM +0300, Andrey Gruzdev wrote:
>> [quoted discussion snipped]
>
> Right. We'd need to rework the design just to save a bunch of snapshot
> image disk size. So now I agree with you; let's keep this in mind, but
> maybe it isn't worth a fix for now, at least until we find something
> really broken.
>
> Andrey, do you think we should still mention this issue in the todo list
> of the wiki page for live snapshots?
>
> Thanks,

Yes, even if the page happens to be overwritten, it's overwritten by the same VM, so no security boundaries are crossed. And no machine code can assume that RAM content is zeroed on power-on or reset, so our snapshot state stays quite consistent.

Agree we should keep it in mind, but IMHO adding a MISSING handler and running a separate thread would make performance worse.. So I doubt it's worth adding this to the TODO list..

Thanks,
On 11.02.21 19:28, Andrey Gruzdev wrote:
> On 11.02.2021 20:32, Peter Xu wrote:
>> [quoted discussion snipped]
>
> Yes, even if the page happens to be overwritten, it's overwritten by the
> same VM, so no security boundaries are crossed. And no machine code can
> assume that RAM content is zeroed on power-on or reset, so our snapshot
> state stays quite consistent.
>
> Agree we should keep it in mind, but IMHO adding a MISSING handler and
> running a separate thread would make performance worse.. So I doubt it's
> worth adding this to the TODO list..

So, here is what happens: your snapshot will contain garbage at places where it should contain zero.

This happens when your guest starts using an unpopulated page after snapshotting started and the page has not been copied to the snapshot yet. You won't get a WP event, therefore you cannot copy "zero" to the snapshot before the content gets overridden.

If you load your snapshot, it contains garbage at places that are supposed to contain zero.

There is a feature in virtio-balloon that *depends* on previously discarded pages staying unmodified in some cases: free page reporting.

The guest will report pages (that might have been initialized to zero) to the hypervisor. The hypervisor will discard the page content if the content was initialized to zero. After that, the guest is free to reuse the pages anytime with the guarantee that the content has not been modified (-> still is zero).

See QEMU hw/virtio/virtio-balloon.c: virtio_balloon_handle_report()

"When we discard the page it has the effect of removing the page from the hypervisor itself and causing it to be zeroed when it is returned to us. So we must not discard the page [...] if the guest is expecting it to retain a non-zero value."

And Linux drivers/virtio/virtio_balloon.c: virtballoon_validate()

"Inform the hypervisor that our pages are poisoned or initialized. If we cannot do that then we should disable page reporting as it could potentially change the contents of our free pages."

It's as simple as having a Linux guest with init_on_free and free page reporting active via virtio-balloon.

Long story short, your feature might break guests (when continuing from the snapshot) when free page reporting is active. I agree that MISSING events might be too heavy, so you should disable snapshots if free page reporting is active.
On 09.02.21 19:38, Andrey Gruzdev wrote:
> On 09.02.2021 15:37, David Hildenbrand wrote:
>> [quoted discussion snipped]
>
> Hi David,
>
> I really wonder if such a problem exists.. If we are talking about a
> write to an unpopulated page, we should get a first page fault on the
> non-present page and populate it with protection bits from the respective
> vma. For a UFFD_WP vma the page will be populated non-writable. So we'll
> get another page fault on the present-but-read-only page and go to
> handle_userfault.

Hi,

here is another fun issue. Assume you

1. Have a populated page, with some valuable content
2. WP-protect the page
3. madvise(DONTNEED) that page
4. Write to the page

On write access, you won't get a WP event! Instead, you will get a UFFD_EVENT_REMOVE during 3. But you cannot stop that event (there is no dont-wake for it), so you cannot simply defer it as you can with WP events.

So if the guest inflates the balloon (including balloon page migration in Linux) or free page reporting reports a free page while snapshotting is active, you won't be able to save the old content before it is zapped, and your snapshot misses pages with actual content.

Something similar would happen with virtio-mem when unplugging blocks; however, it does not discard any pages while migration is active.

Snapshotting seems to be incompatible with concurrent discards via virtio-balloon.
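A minimal sketch of the four-step sequence above (my reconstruction, not code from the thread): UFFD_FEATURE_EVENT_REMOVE must be requested at UFFDIO_API time to receive the event, and a separate monitor thread is needed because the madvise() call blocks until the REMOVE event is read.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int uffd;

/* Monitor thread: the REMOVE event must be read here, and by the time it
 * arrives the old page content is already being zapped. */
static void *monitor(void *arg)
{
    struct uffd_msg msg;

    while (read(uffd, &msg, sizeof(msg)) == (ssize_t)sizeof(msg)) {
        if (msg.event == UFFD_EVENT_REMOVE)
            printf("UFFD_EVENT_REMOVE: content already gone\n");
        else if (msg.event == UFFD_EVENT_PAGEFAULT)
            printf("pagefault event\n");  /* never seen for step 4 */
    }
    return NULL;
}

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    uffd = syscall(__NR_userfaultfd, O_CLOEXEC);

    struct uffdio_api api = { .api = UFFD_API,
                              .features = UFFD_FEATURE_EVENT_REMOVE };
    ioctl(uffd, UFFDIO_API, &api);

    char *area = mmap(NULL, page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    area[0] = 1;                                 /* 1. populate */

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)area, .len = page },
        .mode  = UFFDIO_REGISTER_MODE_WP,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    struct uffdio_writeprotect wp = {
        .range = { .start = (unsigned long)area, .len = page },
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);       /* 2. wp-protect */

    pthread_t t;
    pthread_create(&t, NULL, monitor, NULL);

    madvise(area, page, MADV_DONTNEED);          /* 3. discard */
    area[0] = 2;                                 /* 4. write: no WP event */

    sleep(1);
    return 0;
}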
You might want to inhibit ballooning while snapshotting is active in hw/virtio/virtio-balloon.c:virtio_balloon_inhibited() just as we do for postcopy.
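One possible shape for that inhibition, sketched under the assumption of QEMU's ram_block_discard_disable() interface (present since QEMU 5.1 and consulted by virtio_balloon_inhibited()); the snapshot_* function names are hypothetical, and this is not the actual patch:

/* Make background snapshots mutually exclusive with RAM discards
 * (balloon inflation, free page reporting). */
#include "qemu/osdep.h"
#include "exec/memory.h"
#include "qapi/error.h"

static int snapshot_start_protect_ram(Error **errp)
{
    /*
     * While discards are disabled, virtio-balloon cannot zap pages behind
     * the write-protection; fails if a discard is already required.
     */
    if (ram_block_discard_disable(true)) {
        error_setg(errp, "RAM discard (e.g. virtio-balloon) is active; "
                   "cannot start background snapshot");
        return -EBUSY;
    }
    /* ... register UFFD WP on RAM blocks and start saving ... */
    return 0;
}

static void snapshot_stop_protect_ram(void)
{
    /* Re-allow balloon deflation/reporting once the snapshot completes. */
    ram_block_discard_disable(false);
}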
On Thu, Feb 11, 2021 at 08:01:29PM +0100, David Hildenbrand wrote:
> [quoted discussion snipped]
>
> There is a feature in virtio-balloon that *depends* on previously
> discarded pages staying unmodified in some cases: free page reporting.
>
> [...]
>
> Long story short, your feature might break guests (when continuing from
> the snapshot) when free page reporting is active. I agree that MISSING
> events might be too heavy, so you should disable snapshots if free page
> reporting is active.

When the page is reported with poison_val set, it is written already by the guest, right (either all zeros, or some special marks)? If so, why wouldn't it be wr-protected by uffd?

Thanks,
> Am 11.02.2021 um 21:31 schrieb Peter Xu <peterx@redhat.com>: > > On Thu, Feb 11, 2021 at 08:01:29PM +0100, David Hildenbrand wrote: >>> On 11.02.21 19:28, Andrey Gruzdev wrote: >>> On 11.02.2021 20:32, Peter Xu wrote: >>>> On Thu, Feb 11, 2021 at 07:19:47PM +0300, Andrey Gruzdev wrote: >>>>> On 09.02.2021 22:06, David Hildenbrand wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> just stumbled over this, quick question: >>>>>>>> >>>>>>>> I recently played with UFFD_WP and notices that write protection is >>>>>>>> only effective on pages/ranges that have already pages populated (IOW: >>>>>>>> !pte_none() in the kernel). >>>>>>>> >>>>>>>> In case memory was never populated (or was discarded using e.g., >>>>>>>> madvice(DONTNEED)), write-protection will be skipped silently and you >>>>>>>> won't get WP events for applicable pages. >>>>>>>> >>>>>>>> So if someone writes to a yet unpoupulated page ("zero"), you won't >>>>>>>> get WP events. >>>>>>>> >>>>>>>> I can spot that you do a single uffd_change_protection() on the whole >>>>>>>> RAMBlock. >>>>>>>> >>>>>>>> How are you handling that scenario, or why don't you have to handle >>>>>>>> that scenario? >>>>>>>> >>>>>>> Hi David, >>>>>>> >>>>>>> I really wonder if such a problem exists.. If we are talking about a >>>>>> I immediately ran into this issue with my simplest test cases. :) >>>>>> >>>>>>> write to an unpopulated page, we should get first page fault on >>>>>>> non-present page and populate it with protection bits from >>>>>>> respective vma. >>>>>>> For UFFD_WP vma's page will be populated non-writable. So we'll get >>>>>>> another page fault on present but read-only page and go to >>>>>>> handle_userfault. >>>>>> See the attached test program. Triggers for me on 5.11.0-rc6+ and >>>>>> 5.9.13-200.fc33 >>>>>> >>>>>> gcc -lpthread uffdio_wp.c -o uffdio_wp >>>>>> ./uffdio_wp >>>>>> WP did not fire >>>>>> >>>>>> Uncomment the placement of the zeropage just before registering to make >>>>>> the WP actually trigger. If there is no PTE, there is nothing to >>>>>> protect. >>>>>> >>>>>> >>>>>> And it makes sense: How should the fault handler know which ranges you >>>>>> wp-ed, if there is no place to store that information (the PTEs!). The >>>>>> VMA cannot tell that story, it only knows that someone registered >>>>>> UFFD_WP to selectively wp some parts. >>>>>> >>>>>> You might have to register also for MISSING faults and place zero pages. >>>>>> >>>>> Looked at the kernel code, agree that we miss WP events for unpopulated >>>>> pages, UFFD_WP softbit won't be set in this case. But it doesn't make saved >>>>> snapshot inconsistent or introduce security issues. The only side effect is >>>>> that we may save updated page instead of zeroed, just increasing snapshot >>>>> size. However this guest-physical page has never been touched from the point >>>>> of view of saved vCPU/device state and is not a concern. >>>> Oh I just remembered one thing, that Linux should be zeroing pages when >>>> allocating, so even if the page has legacy content it'll be cleared with >>>> __GFP_ZERO allocations. So yeah it would be harder to have issue at least with >>>> a sensible OS. I'm not sure about Windows or others, but it could be a common >>>> case. Then the only overhead is the extra pages we kept in the live snapshot, >>>> which takes some more disk space. >>>> >>>> Or there could be firmware running without OS at all, but it should really not >>>> read unallocated pages assuming there must be zero. It's not a sane behavior >>>> even for a firmware. 
>>>> >>>>> Often (at least on desktop Windows guests) only a small part of RAM has ever >>>>> been allocated by guest. Migration code needs to read each guest-physical >>>>> page, so we'll have a lot of additional UFFD events, much more MISSING >>>>> events then WP-faults. >>>>> >>>>> And the main problem is that adding MISSING handler is impossible in current >>>>> single-threaded snapshot code. We'll get an immediate deadlock on iterative >>>>> page read. >>>> Right. We'll need to rework the design but just for saving a bunch of snapshot >>>> image disk size. So now I agree with you, let's keep this in mind, but maybe >>>> it isn't worth a fix for now, at least until we figure something really broken. >>>> >>>> Andrey, do you think we should still mention this issue into the todo list of >>>> the wiki page of live snapshot? >>>> >>>> Thanks, >>>> >>> Yes, even if the page happens to be overwritten, it's overwritten by the same VM so >>> no security boundaries are crossed. And no machine code can assume that RAM content >>> is zeroed on power-on or reset so our snapshot state stays quite consistent. >>> >>> Agree we should keep it in mind, but IMHO adding MISSING handler and running separate >>> thread would make performance worse.. So I doubt it's worth adding this to TODO list.. >>> >> >> So, here is what happens: your snapshot will contain garbage at places where >> it should contain zero. >> >> This happens when your guest starts using an unpopulated page after >> snapshotting started and the page has not been copied to the snapshot yet. >> You won't get a WP event, therefore you cannot copy "zero" to the snapshot >> before content gets overridden. >> >> If you load your snapshot, it contains garbage at places that are supposed >> to contain zero. >> >> >> There is a feature in virtio-balloon that *depends* on previously discarded >> pages from staying unmodified in some cases: free page reporting. >> >> The guest will report pages (that might have been initialized to zero) to >> the hypervisor). The hypervisor will discard page content if the content was >> initialized to zero. After that, the guest is free to reuse the pages >> anytime with the guarantee that content has not been modified (-> still is >> zero). >> >> >> See QEMU hw/virtio/virtio-balloon.c: virtio_balloon_handle_report() >> >> "When we discard the page it has the effect of removing the page from the >> hypervisor itself and causing it to be zeroed when it is returned to us. So >> we must not discard the page [...] if the guest is expecting it to retain a >> non-zero value." >> >> And Linux drivers/virtio/virtio_balloon.c: virtballoon_validate() >> >> "Inform the hypervisor that our pages are poisoned or initialized. If we >> cannot do that then we should disable page reporting as it could potentially >> change the contents of our free pages." >> >> >> It's as simple as having a Linux guest with init_on_free and >> free-page-reporting active via virtio-balloon. >> >> Long story short, your feature might break guests (when continuing the >> snapshot) when free page reporting is active. I agree that MISSING events >> might be too heavy, so you should disable snapshots if free page reporting >> is active. > > When the page is reported with poison_val set, it is written already by the > guest, right (either all zeros, or some special marks)? If so, why it won't be > wr-protected by uffd? Let‘s take a look at init-on-free. The guest zeroes a page and puts it onto a buddy freelist. 
Free page reporting code takes it off that list and reports it to the hypervisor. The hypervisor discards the physical page and tells the guest it has finished processing the page. The guest re-places the page onto the free page list. From that point on, the page can be re-allocated inside the guest and is assumed to be zero. On access, a fresh (zeroed) page is populated by the hypervisor. The guest won't re-zero the page, as it has the guarantee (from free page reporting) that the page remained zero. Write-protecting the unpopulated page won't work as discussed. > > -- > Peter Xu >
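For reference, below is a minimal C sketch in the spirit of the uffdio_wp.c test program mentioned above (reconstructed, not the original attachment; assumes an x86-64 Linux of the v5.7..v5.11 era with uffd-wp support and 4k pages; error handling elided). It registers one anonymous page in UFFDIO_REGISTER_MODE_WP, write-protects it, and writes to it. With the memset() left commented out, the page has no PTE yet, so UFFDIO_WRITEPROTECT silently skips it and no WP event fires; uncommenting the memset() populates the PTE first, and the same write then triggers the handler.

/*
 * uffd_wp_sketch.c - reconstructed sketch, not the original attachment.
 * Build: gcc -lpthread uffd_wp_sketch.c -o uffd_wp_sketch
 */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PAGE 4096UL

static int uffd;

/* Resolve WP faults by dropping the protection, which wakes the writer. */
static void *wp_handler(void *arg)
{
    struct uffd_msg msg;

    (void)arg;
    while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
        if (msg.event == UFFD_EVENT_PAGEFAULT &&
            (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
            struct uffdio_writeprotect wp = {
                .range = { .start = msg.arg.pagefault.address & ~(PAGE - 1),
                           .len   = PAGE },
                .mode  = 0, /* remove protection and wake the faulter */
            };
            printf("WP event fired\n");
            ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
        }
    }
    return NULL;
}

int main(void)
{
    struct uffdio_api api = { .api = UFFD_API };
    pthread_t thread;
    char *area;

    uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
    ioctl(uffd, UFFDIO_API, &api);

    area = mmap(NULL, PAGE, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* memset(area, 0, PAGE);  <- uncomment: populate the PTE first */

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)area, .len = PAGE },
        .mode  = UFFDIO_REGISTER_MODE_WP,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    struct uffdio_writeprotect wp = {
        .range = { .start = (unsigned long)area, .len = PAGE },
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
    };
    /* Silently skips pte_none() pages: nothing gets protected below
     * unless the memset() above populated the page. */
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

    pthread_create(&thread, NULL, wp_handler, NULL);

    area[0] = 1; /* unpopulated case: plain anon fault, no WP event */
    sleep(1);
    printf("done (no \"WP event fired\" line means the WP was skipped)\n");
    return 0;
}

(Much later kernels grew UFFD_FEATURE_WP_UNPOPULATED to address exactly this class of problem, but none of the kernels discussed in this thread have it.)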
On Thu, Feb 11, 2021 at 09:44:07PM +0100, David Hildenbrand wrote: > Let's take a look at init-on-free. > > The guest zeroes a page and puts it onto a buddy freelist. Free page reporting code takes it off that list and reports it to the hypervisor. The hypervisor discards the physical page and tells the guest it has finished processing the page. The guest re-places the page onto the free page list. > > From that point on, the page can be re-allocated inside the guest and is assumed to be zero. On access, a fresh (zeroed) page is populated by the hypervisor. The guest won't re-zero the page, as it has the guarantee (from free page reporting) that the page remained zero. > > Write-protecting the unpopulated page won't work as discussed. IMHO no matter if it's init_on_alloc or init_on_free or both, as long as it's inited in some way, it means the guest OS wrote to it. Then wr-protect will work.. MADV_DONTNEED during live snapshot seems to be a separate topic as you mentioned in the other thread. For that, I agree we'd better simply let virtio_balloon_inhibited() return true for live snapshot too, just like postcopy. Thanks,
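For context, postcopy already stops the balloon from discarding RAM while it runs; the suggestion is to extend the same check to live snapshots. A hedged sketch of what that could look like on the QEMU side - the helper names follow the thread and QEMU's capability naming, but treat the exact signatures as assumptions rather than the upstream API:

#include <stdbool.h>

/* Provided elsewhere in QEMU's migration code (names assumed here). */
bool migration_in_incoming_postcopy(void);
bool migrate_background_snapshot(void);

/*
 * Ballooning and free page reporting madvise(MADV_DONTNEED) guest RAM
 * behind migration's back, so they must be inhibited while incoming
 * postcopy runs - and, per this discussion, while a background (live)
 * snapshot is in progress as well.
 */
bool virtio_balloon_inhibited(void)
{
    return migration_in_incoming_postcopy() || migrate_background_snapshot();
}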
> Am 11.02.2021 um 22:05 schrieb Peter Xu <peterx@redhat.com>: > > On Thu, Feb 11, 2021 at 09:44:07PM +0100, David Hildenbrand wrote: >> Let's take a look at init-on-free. >> >> The guest zeroes a page and puts it onto a buddy freelist. Free page reporting code takes it off that list and reports it to the hypervisor. The hypervisor discards the physical page and tells the guest it has finished processing the page. The guest re-places the page onto the free page list. >> >> From that point on, the page can be re-allocated inside the guest and is assumed to be zero. On access, a fresh (zeroed) page is populated by the hypervisor. The guest won't re-zero the page, as it has the guarantee (from free page reporting) that the page remained zero. >> >> Write-protecting the unpopulated page won't work as discussed. > > IMHO no matter if it's init_on_alloc or init_on_free or both, as long as it's > inited in some way, it means the guest OS wrote to it. Then wr-protect > will work.. The issue is when the discard happened before starting the snapshot. Write-protection won't work and the zeroed content won't be retained in the snapshot. > > MADV_DONTNEED during live snapshot seems to be a separate topic as you > mentioned in the other thread. For that, I agree we'd better simply let > virtio_balloon_inhibited() return true for live snapshot too just like > postcopy. Yes, but that's a separate issue. > > Thanks, > > -- > Peter Xu >
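The host-side mechanism at the center of this exchange is easy to demonstrate: after madvise(MADV_DONTNEED), anonymous memory loses its PTE and reads back as zeroes on the next access, so a later UFFDIO_WRITEPROTECT over the range has nothing to protect at that offset. A tiny self-contained demo:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    const size_t len = 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    p[0] = 42;                       /* populate the page */
    madvise(p, len, MADV_DONTNEED);  /* discard it, as the balloon does */
    assert(p[0] == 0);               /* reads back as a fresh zero page */
    printf("discarded page reads back as %d\n", p[0]);
    return 0;
}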
On Thu, Feb 11, 2021 at 10:09:58PM +0100, David Hildenbrand wrote:
> The issue is when the discard happened before starting the snapshot. Write-protection won‘t work and the zeroed content won‘t be retained in the snapshot.
I see what you mean now, and iiuc it will only be a problem if init_on_free=1.
I think CONFIG_INIT_ON_FREE_DEFAULT_ON should be off for most distros, so the
impact should be small. I thought about it, but indeed I didn't see a
good way to fix this without fixing the zero page copy for live snapshot.
On 12.02.21 04:06, Peter Xu wrote: > On Thu, Feb 11, 2021 at 10:09:58PM +0100, David Hildenbrand wrote: >> The issue is when the discard happened before starting the snapshot. Write-protection won‘t work and the zeroed content won‘t be retained in the snapshot. > > I see what you mean now, and iiuc it will only be a problem if init_on_free=1. > I think CONFIG_INIT_ON_FREE_DEFAULT_ON should be off for most distros, so the Yes, some distros seem to enable init_on_alloc instead. Looking at the introducing commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options") there are security use cases and it might become important with memory tagging. Note that in Linux, there was also the option to poison pages with 0, removed via f289041ed4cf ("mm, page_poison: remove CONFIG_PAGE_POISONING_ZERO"), available in some kernels that supported free page reporting. It got removed and use cases got told to use init_on_free. > impact should be small, I think. I thought about it, but indeed I didn't see a > good way to fix this if without fixing the zero page copy for live snapshot. We should really document this (unexpected) behavior of snapshotting. Otherwise, the next feature comes around that relies on pages that were discarded to remain zeroed (I even have one in mind ;) ) and forgets to disable snapshots.
On Fri, Feb 12, 2021 at 09:52:52AM +0100, David Hildenbrand wrote: > On 12.02.21 04:06, Peter Xu wrote: > > On Thu, Feb 11, 2021 at 10:09:58PM +0100, David Hildenbrand wrote: > > > The issue is when the discard happened before starting the snapshot. Write-protection won't work and the zeroed content won't be retained in the snapshot. > > > > I see what you mean now, and iiuc it will only be a problem if init_on_free=1. > > I think CONFIG_INIT_ON_FREE_DEFAULT_ON should be off for most distros, so the > > Yes, some distros seem to enable init_on_alloc instead. Looking at the > introducing commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 > and init_on_free=1 boot options") there are security use cases and it might > become important with memory tagging. > > Note that in Linux, there was also the option to poison pages with 0, > removed via f289041ed4cf ("mm, page_poison: remove > CONFIG_PAGE_POISONING_ZERO"), available in some kernels that supported free > page reporting. > > It got removed and use cases got told to use init_on_free. > > > impact should be small, I think. I thought about it, but indeed I didn't see a > > > good way to fix this without fixing the zero page copy for live snapshot. > > We should really document this (unexpected) behavior of snapshotting. > Otherwise, the next feature comes around that relies on pages that were > discarded to remain zeroed (I even have one in mind ;) ) and forgets to > disable snapshots. Agreed. I'll see whether Andrey would have any idea to work around this, or further comments. Or I can draft a patch to document this next week (unless Andrey beats me to it :).
On 12.02.2021 19:11, Peter Xu wrote: > On Fri, Feb 12, 2021 at 09:52:52AM +0100, David Hildenbrand wrote: >> On 12.02.21 04:06, Peter Xu wrote: >>> On Thu, Feb 11, 2021 at 10:09:58PM +0100, David Hildenbrand wrote: >>>> The issue is when the discard happened before starting the snapshot. Write-protection won't work and the zeroed content won't be retained in the snapshot. >>> I see what you mean now, and iiuc it will only be a problem if init_on_free=1. >>> I think CONFIG_INIT_ON_FREE_DEFAULT_ON should be off for most distros, so the >> Yes, some distros seem to enable init_on_alloc instead. Looking at the >> introducing commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 >> and init_on_free=1 boot options") there are security use cases and it might >> become important with memory tagging. >> >> Note that in Linux, there was also the option to poison pages with 0, >> removed via f289041ed4cf ("mm, page_poison: remove >> CONFIG_PAGE_POISONING_ZERO"), available in some kernels that supported free >> page reporting. >> >> It got removed and use cases got told to use init_on_free. I think we're talking about init_on_free()/init_on_alloc() on the guest side, right? Still can't get how it relates to the host's unpopulated pages.. Try to look at it from the hardware side. Is untouched SDRAM in hardware required to contain zeroes somehow? No. These 'trash' pages in the migration stream are like never-written physical memory pages: they are not really needed in the snapshot, but they do no harm, just as there's no harm in a never-written physical page being full of garbage. Do these 'trash' pages in the snapshot contain sensitive information that the same VM is not allowed to access? I think not. Otherwise we need a good example of how it could be exploited. The only issue that I see is madvise(MADV_DONTNEED) for RAM blocks during snapshotting. And free page reporting or the memory balloon is secondary - the point is that the UFFD_WP snapshot is incompatible with madvise(MADV_DONTNEED) on the hypervisor side, no matter which guest functionality induces it. >>> impact should be small, I think. I thought about it, but indeed I didn't see a >>> good way to fix this without fixing the zero page copy for live snapshot. >> We should really document this (unexpected) behavior of snapshotting. >> Otherwise, the next feature comes around that relies on pages that were >> discarded to remain zeroed (I even have one in mind ;) ) and forgets to >> disable snapshots. > Agreed. I'll see whether Andrey would have any idea to work around this, or > further comments. Or I can draft a patch to document this next week (unless > Andrey beats me to it :). > It's really better to document this specific behaviour, but also to clarify that the saved state remains consistent and secure, of course only if you agree with my arguments.
> Am 13.02.2021 um 10:34 schrieb Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>: > > >> On 12.02.2021 19:11, Peter Xu wrote: >>>> On Fri, Feb 12, 2021 at 09:52:52AM +0100, David Hildenbrand wrote: >>>>> On 12.02.21 04:06, Peter Xu wrote: >>>>>> On Thu, Feb 11, 2021 at 10:09:58PM +0100, David Hildenbrand wrote: >>>>>> The issue is when the discard happened before starting the snapshot. Write-protection won‘t work and the zeroed content won‘t be retained in the snapshot. >>>>> I see what you mean now, and iiuc it will only be a problem if init_on_free=1. >>>>> I think CONFIG_INIT_ON_FREE_DEFAULT_ON should be off for most distros, so the >>>> Yes, some distros seem to enable init_on_alloc instead. Looking at the >>>> introducing commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 >>>> and init_on_free=1 boot options") there are security use cases and it might >>>> become important with memory tagging. >>>> >>>> Note that in Linux, there was also the option to poison pages with 0, >>>> removed via f289041ed4cf ("mm, page_poison: remove >>>> CONFIG_PAGE_POISONING_ZERO"), available in some kernels that supported free >>>> page reporting. >>>> >>>> It got removed and use cases got told to use init_on_free. >> I think we talk about init_on_free()/init_on_alloc() on guest side, right? >> Still can't get how it relates to host's unpopulated pages.. >> >> Try to look from hardware side. Untouched SDRAM in hardware is required to contain zeroes somehow? No. >> These 'trash' pages in migration stream are like never written physical memory pages, they are really >> not needed in snapshot but they don't do any harm as well as there's no harm in that never-written physical >> page is full of garbage. >> >> Do these 'trash' pages in snapshot contain sensitive information not allowed to be accessed by the same VM? >> I think no. Or we need a good example how it can be potentially exploited. I tried to explain how your implementation breaks *the guest inside the snapshot* (I have no idea why you talk about sensitive information). If you run the snapshot, the guest will run into trouble, because the snapshot contains different data than the guest expects: 1. with discards before snapshotting started and free page reporting is running 2. with discards after snapshotting started Maybe Peter can enlighten you, or the links I shared.
Hi, Andrey, On Sat, Feb 13, 2021 at 12:34:07PM +0300, Andrey Gruzdev wrote: > On 12.02.2021 19:11, Peter Xu wrote: > > On Fri, Feb 12, 2021 at 09:52:52AM +0100, David Hildenbrand wrote: > > > On 12.02.21 04:06, Peter Xu wrote: > > > > On Thu, Feb 11, 2021 at 10:09:58PM +0100, David Hildenbrand wrote: > > > > > The issue is when the discard happened before starting the snapshot. Write-protection won't work and the zeroed content won't be retained in the snapshot. > > > > I see what you mean now, and iiuc it will only be a problem if init_on_free=1. > > > > I think CONFIG_INIT_ON_FREE_DEFAULT_ON should be off for most distros, so the > > > Yes, some distros seem to enable init_on_alloc instead. Looking at the > > > introducing commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 > > > and init_on_free=1 boot options") there are security use cases and it might > > > become important with memory tagging. > > > > > > Note that in Linux, there was also the option to poison pages with 0, > > > removed via f289041ed4cf ("mm, page_poison: remove > > > CONFIG_PAGE_POISONING_ZERO"), available in some kernels that supported free > > > page reporting. > > > > > > It got removed and use cases got told to use init_on_free. > > I think we talk about init_on_free()/init_on_alloc() on guest side, right? Right. IIUC it's the init_on_free() that matters. We'll have no issue if init_on_alloc=1 && init_on_free=0, since in that case all pages will be zeroed before the new page is returned to the caller. Then we're safe, I think. > Still can't get how it relates to host's unpopulated pages.. > Try to look from hardware side. Untouched SDRAM in hardware is required to contain zeroes somehow? No. > These 'trash' pages in migration stream are like never written physical memory pages, they are really > not needed in snapshot but they don't do any harm as well as there's no harm in that never-written physical > page is full of garbage. > > Do these 'trash' pages in snapshot contain sensitive information not allowed to be accessed by the same VM? > I think no. Or we need a good example how it can be potentially exploited. > > The only issue that I see is madvise(MADV_DONTNEED) for RAM blocks during snapshotting. And free page reporting > or memory balloon is secondary - the point is that UFFD_WP snapshot is incompatible with madvise(MADV_DONTNEED) on > hypervisor side. No matter which guest functionality can induce it. I think the problem is that with init_on_free=1, the kernel will assume that all the pages that got freed have been zeroed beforehand, so it thinks it's a waste of time to zero them again when the page is reused/reallocated. As a reference see kernel prep_new_page() where there's: if (!free_pages_prezeroed() && want_init_on_alloc(gfp_flags)) kernel_init_free_pages(page, 1 << order); In this case I believe free_pages_prezeroed() will return true, so we don't even need to check want_init_on_alloc() at all. Note that it'll cover all the cases where the kernel allocates with __GFP_ZERO: it means that even if the guest kernel tries alloc_page(__GFP_ZERO), it may get a page with random data after the live snapshot is loaded. So it's not about any hardware, it's an optimization of the guest kernel instead. It is actually reasonable and efficient, since if we *know* that a page is a zero page, we shouldn't bother zeroing it again. However it brought us a bit of trouble on live snapshot that the current solution might not work for all guest OS configurations.
> > > > > impact should be small, I think. I thought about it, but indeed I didn't see a > > > > good way to fix this without fixing the zero page copy for live snapshot. > > > We should really document this (unexpected) behavior of snapshotting. > > > Otherwise, the next feature comes around that relies on pages that were > > > discarded to remain zeroed (I even have one in mind ;) ) and forgets to > > > disable snapshots. > > Agreed. I'll see whether Andrey would have any idea to work around this, or > > further comments. Or I can draft a patch to document this next week (unless > > Andrey beats me to it :). > > > It's really better to document this specific behaviour, but also to clarify that the saved state remains > consistent and secure, of course only if you agree with my arguments. Yes, no rush on anything yet, let's reach a consensus on understanding the problem before trying to even document anything. If you agree with the above, feel free to draft a documentation update patch if you'd like, that could also contain the code to prohibit virtio-balloon page reporting when doing live snapshots. IMHO if the above issue exists, it'll indeed be better to implement the MISSING tracking as well with UFFDIO_ZEROPAGE - keeping trash pages in the live snapshot is still not optimal, since for a high dirty rate guest the number of trash pages could still be huge. It would definitely be more challenging than only doing wr-protect, so we may need multi-threading at least, but it'd be a pity if live snapshots randomly failed for some guests, even if rarely. The thing is the host cannot easily detect that guest config, so we can't simply block users from taking snapshots even though, in the end, the snapshot could be silently broken. Thanks,
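To make the allocation path quoted above concrete, here is a condensed userspace model of it (paraphrased from mm/page_alloc.c of that era; simplified, not the verbatim kernel code). With init_on_free enabled, free_pages_prezeroed() is true, so the page is not re-zeroed at allocation time even for __GFP_ZERO - the kernel trusts the free-time zeroing that, in the snapshot scenario, the host has silently undone:

#include <stdbool.h>
#include <string.h>

static bool init_on_free  = true;   /* models the init_on_free=1 boot option */
static bool init_on_alloc = false;

static bool want_init_on_free(void) { return init_on_free; }
static bool want_init_on_alloc(bool gfp_zero) { return init_on_alloc || gfp_zero; }

/* Kernel assumption: pages were zeroed when freed, so zeroing them again
 * at allocation time would be redundant work. */
static bool free_pages_prezeroed(void) { return want_init_on_free(); }

static void prep_new_page(char *page, size_t size, bool gfp_zero)
{
    if (!free_pages_prezeroed() && want_init_on_alloc(gfp_zero))
        memset(page, 0, size);  /* models kernel_init_free_pages() */
    /* else: page handed out as-is. If the host replaced its content
     * behind the guest's back (the snapshot case), __GFP_ZERO is
     * silently violated. */
}

int main(void)
{
    char page[4096];

    memset(page, 0xAA, sizeof(page)); /* "random" stale content */
    prep_new_page(page, sizeof(page), true /* __GFP_ZERO */);
    /* With init_on_free=1 the page still holds 0xAA here: returns 1. */
    return page[0] == 0 ? 0 : 1;
}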
On 17.02.21 00:35, Peter Xu wrote: > Hi, Andrey, > > On Sat, Feb 13, 2021 at 12:34:07PM +0300, Andrey Gruzdev wrote: >> On 12.02.2021 19:11, Peter Xu wrote: >>> On Fri, Feb 12, 2021 at 09:52:52AM +0100, David Hildenbrand wrote: >>>> On 12.02.21 04:06, Peter Xu wrote: >>>>> On Thu, Feb 11, 2021 at 10:09:58PM +0100, David Hildenbrand wrote: >>>>>> The issue is when the discard happened before starting the snapshot. Write-protection won‘t work and the zeroed content won‘t be retained in the snapshot. >>>>> I see what you mean now, and iiuc it will only be a problem if init_on_free=1. >>>>> I think CONFIG_INIT_ON_FREE_DEFAULT_ON should be off for most distros, so the >>>> Yes, some distros seem to enable init_on_alloc instead. Looking at the >>>> introducing commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 >>>> and init_on_free=1 boot options") there are security use cases and it might >>>> become important with memory tagging. >>>> >>>> Note that in Linux, there was also the option to poison pages with 0, >>>> removed via f289041ed4cf ("mm, page_poison: remove >>>> CONFIG_PAGE_POISONING_ZERO"), available in some kernels that supported free >>>> page reporting. >>>> >>>> It got removed and use cases got told to use init_on_free. >> >> I think we talk about init_on_free()/init_on_alloc() on guest side, right? > > Right. IIUC it's the init_on_free() that matters. > > We'll have no issue if init_on_alloc=1 && init_on_free=0, since in that case > all pages will be zeroed after all before the new page returned to the caller > to allocate the page. Then we're safe, I think. > >> Still can't get how it relates to host's unpopulated pages.. >> Try to look from hardware side. Untouched SDRAM in hardware is required to contain zeroes somehow? No. >> These 'trash' pages in migration stream are like never written physical memory pages, they are really >> not needed in snapshot but they don't do any harm as well as there's no harm in that never-written physical >> page is full of garbage. >> >> Do these 'trash' pages in snapshot contain sensitive information not allowed to be accessed by the same VM? >> I think no. Or we need a good example how it can be potentially exploited. >> >> The only issue that I see is madvise(MADV_DONTNEED) for RAM blocks during snapshotting. And free page reporting >> or memory balloon is secondary - the point is that UFFD_WP snapshot is incompatible with madvise(MADV_DONTNEED) on >> hypervisor side. No matter which guest functionality can induce it. > > I think the problem is if with init_on_free=1, the kernel will assume that > all the pages that got freed has been zeroed before-hand so it thinks that it's > a waste of time to zero it again when the page is reused/reallocated. As a > reference see kernel prep_new_page() where there's: > > if (!free_pages_prezeroed() && want_init_on_alloc(gfp_flags)) > kernel_init_free_pages(page, 1 << order); > > In this case I believe free_pages_prezeroed() will return true, then we don't > even need to check want_init_on_alloc() at all. Note that it'll cover all the > cases where kernel allocates with __GFP_ZERO: it means it could happen that > even the guest kernel tries to alloc_page(__GFP_ZERO) it may got a page with > random data after the live snapshot is loaded. So it's not about any hardware, > it's the optimization of guest kernel instead. It is actually reasonable and > efficient since if we *know* that page is zero page then we shouldn't bother > zeroing it again. 
However it brought us a bit of trouble on live snapshot that > the current solution might not work for all guest OS configurations. Adding to that, we are so far talking about how Linux *currently* implements it, but that is just an instance of the problem where it could happen in practice. Free page reporting documents in the spec that with the right configuration, previously reported free pages are guaranteed to retain a certain value (e.g., 0) when re-accessed. So any future guest changes that rely on the virtio spec (e.g., Windows support) would be problematic - as these pages in the snapshot don't actually keep the value.
On 17.02.2021 02:35, Peter Xu wrote: > Hi, Andrey, > > On Sat, Feb 13, 2021 at 12:34:07PM +0300, Andrey Gruzdev wrote: >> On 12.02.2021 19:11, Peter Xu wrote: >>> On Fri, Feb 12, 2021 at 09:52:52AM +0100, David Hildenbrand wrote: >>>> On 12.02.21 04:06, Peter Xu wrote: >>>>> On Thu, Feb 11, 2021 at 10:09:58PM +0100, David Hildenbrand wrote: >>>>>> The issue is when the discard happened before starting the snapshot. Write-protection won‘t work and the zeroed content won‘t be retained in the snapshot. >>>>> I see what you mean now, and iiuc it will only be a problem if init_on_free=1. >>>>> I think CONFIG_INIT_ON_FREE_DEFAULT_ON should be off for most distros, so the >>>> Yes, some distros seem to enable init_on_alloc instead. Looking at the >>>> introducing commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 >>>> and init_on_free=1 boot options") there are security use cases and it might >>>> become important with memory tagging. >>>> >>>> Note that in Linux, there was also the option to poison pages with 0, >>>> removed via f289041ed4cf ("mm, page_poison: remove >>>> CONFIG_PAGE_POISONING_ZERO"), available in some kernels that supported free >>>> page reporting. >>>> >>>> It got removed and use cases got told to use init_on_free. >> I think we talk about init_on_free()/init_on_alloc() on guest side, right? > Right. IIUC it's the init_on_free() that matters. > > We'll have no issue if init_on_alloc=1 && init_on_free=0, since in that case > all pages will be zeroed after all before the new page returned to the caller > to allocate the page. Then we're safe, I think. > >> Still can't get how it relates to host's unpopulated pages.. >> Try to look from hardware side. Untouched SDRAM in hardware is required to contain zeroes somehow? No. >> These 'trash' pages in migration stream are like never written physical memory pages, they are really >> not needed in snapshot but they don't do any harm as well as there's no harm in that never-written physical >> page is full of garbage. >> >> Do these 'trash' pages in snapshot contain sensitive information not allowed to be accessed by the same VM? >> I think no. Or we need a good example how it can be potentially exploited. >> >> The only issue that I see is madvise(MADV_DONTNEED) for RAM blocks during snapshotting. And free page reporting >> or memory balloon is secondary - the point is that UFFD_WP snapshot is incompatible with madvise(MADV_DONTNEED) on >> hypervisor side. No matter which guest functionality can induce it. > I think the problem is if with init_on_free=1, the kernel will assume that > all the pages that got freed has been zeroed before-hand so it thinks that it's > a waste of time to zero it again when the page is reused/reallocated. As a > reference see kernel prep_new_page() where there's: > > if (!free_pages_prezeroed() && want_init_on_alloc(gfp_flags)) > kernel_init_free_pages(page, 1 << order); > > In this case I believe free_pages_prezeroed() will return true, then we don't > even need to check want_init_on_alloc() at all. Note that it'll cover all the > cases where kernel allocates with __GFP_ZERO: it means it could happen that > even the guest kernel tries to alloc_page(__GFP_ZERO) it may got a page with > random data after the live snapshot is loaded. So it's not about any hardware, > it's the optimization of guest kernel instead. It is actually reasonable and > efficient since if we *know* that page is zero page then we shouldn't bother > zeroing it again. 
However it brought us a bit of trouble on live snapshot that > the current solution might not work for all guest OS configurations. So I'd like to clarify that the guest's init_on_alloc/init_on_free cannot bring any problems if we don't consider virtio-balloon here.. If these pages have been zeroed inside the guest, they have certainly been populated on the host, and UFFD_WP should work as expected. >>>>> impact should be small, I think. I thought about it, but indeed I didn't see a >>>>> good way to fix this without fixing the zero page copy for live snapshot. >>>> We should really document this (unexpected) behavior of snapshotting. >>>> Otherwise, the next feature comes around that relies on pages that were >>>> discarded to remain zeroed (I even have one in mind ;) ) and forgets to >>>> disable snapshots. >>> Agreed. I'll see whether Andrey would have any idea to work around this, or >>> further comments. Or I can draft a patch to document this next week (unless >>> Andrey beats me to it :). >>> >> It's really better to document this specific behaviour, but also to clarify that the saved state remains >> consistent and secure, of course only if you agree with my arguments. > Yes, no rush on anything yet, let's reach a consensus on understanding the > problem before trying to even document anything. If you agree with the above, feel > free to draft a documentation update patch if you'd like, that could also > contain the code to prohibit virtio-balloon page reporting when doing live snapshots. > > IMHO if the above issue exists, it'll indeed be better to implement the MISSING > tracking as well with UFFDIO_ZEROPAGE - keeping trash > pages in the live snapshot is still not optimal, since for a high dirty rate guest the number of > trash pages could still be huge. It would definitely be more challenging than > only doing wr-protect, so we may need multi-threading at least, but it'd be a pity > if live snapshots randomly failed for some guests, even if rarely. The thing is > the host cannot easily detect that guest config, so we can't simply block > users from taking snapshots even though, in the end, the snapshot could be silently broken. > > Thanks, > Agree, let's reach consensus before doing anything. For the concurrent RAM block discards it seems clear enough that they should be disabled for the duration of the snapshot, like it's done when incoming postcopy is active. For the discards that happen before the snapshot is started, I need to dig into the Linux and QEMU virtio-balloon code more to get clear on it. Anyhow I'm quite sure that adding a global MISSING handler for snapshotting is too heavy and not really needed. Thanks,
On 19.02.21 07:57, Andrey Gruzdev wrote: > On 17.02.2021 02:35, Peter Xu wrote: >> Hi, Andrey, >> >> On Sat, Feb 13, 2021 at 12:34:07PM +0300, Andrey Gruzdev wrote: >>> On 12.02.2021 19:11, Peter Xu wrote: >>>> On Fri, Feb 12, 2021 at 09:52:52AM +0100, David Hildenbrand wrote: >>>>> On 12.02.21 04:06, Peter Xu wrote: >>>>>> On Thu, Feb 11, 2021 at 10:09:58PM +0100, David Hildenbrand wrote: >>>>>>> The issue is when the discard happened before starting the snapshot. Write-protection won‘t work and the zeroed content won‘t be retained in the snapshot. >>>>>> I see what you mean now, and iiuc it will only be a problem if init_on_free=1. >>>>>> I think CONFIG_INIT_ON_FREE_DEFAULT_ON should be off for most distros, so the >>>>> Yes, some distros seem to enable init_on_alloc instead. Looking at the >>>>> introducing commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 >>>>> and init_on_free=1 boot options") there are security use cases and it might >>>>> become important with memory tagging. >>>>> >>>>> Note that in Linux, there was also the option to poison pages with 0, >>>>> removed via f289041ed4cf ("mm, page_poison: remove >>>>> CONFIG_PAGE_POISONING_ZERO"), available in some kernels that supported free >>>>> page reporting. >>>>> >>>>> It got removed and use cases got told to use init_on_free. >>> I think we talk about init_on_free()/init_on_alloc() on guest side, right? >> Right. IIUC it's the init_on_free() that matters. >> >> We'll have no issue if init_on_alloc=1 && init_on_free=0, since in that case >> all pages will be zeroed after all before the new page returned to the caller >> to allocate the page. Then we're safe, I think. >> >>> Still can't get how it relates to host's unpopulated pages.. >>> Try to look from hardware side. Untouched SDRAM in hardware is required to contain zeroes somehow? No. >>> These 'trash' pages in migration stream are like never written physical memory pages, they are really >>> not needed in snapshot but they don't do any harm as well as there's no harm in that never-written physical >>> page is full of garbage. >>> >>> Do these 'trash' pages in snapshot contain sensitive information not allowed to be accessed by the same VM? >>> I think no. Or we need a good example how it can be potentially exploited. >>> >>> The only issue that I see is madvise(MADV_DONTNEED) for RAM blocks during snapshotting. And free page reporting >>> or memory balloon is secondary - the point is that UFFD_WP snapshot is incompatible with madvise(MADV_DONTNEED) on >>> hypervisor side. No matter which guest functionality can induce it. >> I think the problem is if with init_on_free=1, the kernel will assume that >> all the pages that got freed has been zeroed before-hand so it thinks that it's >> a waste of time to zero it again when the page is reused/reallocated. As a >> reference see kernel prep_new_page() where there's: >> >> if (!free_pages_prezeroed() && want_init_on_alloc(gfp_flags)) >> kernel_init_free_pages(page, 1 << order); >> >> In this case I believe free_pages_prezeroed() will return true, then we don't >> even need to check want_init_on_alloc() at all. Note that it'll cover all the >> cases where kernel allocates with __GFP_ZERO: it means it could happen that >> even the guest kernel tries to alloc_page(__GFP_ZERO) it may got a page with >> random data after the live snapshot is loaded. So it's not about any hardware, >> it's the optimization of guest kernel instead. 
It is actually reasonable and >> efficient since if we *know* that page is zero page then we shouldn't bother >> zeroing it again. However it brought us a bit of trouble on live snapshot that >> the current solution might not work for all guest OS configurations. > > So I'd like to clarify that guest's init_on_alloc/init_on_free cannot bring any problems if we don't > consider virtio-baloon here.. If these pages have been zeroed inside the guest they certainly have got > populated on the host. And UFFD_WP should work as expected. Yes, no discards, no problem.
Andrey, On Fri, Feb 19, 2021 at 09:57:37AM +0300, Andrey Gruzdev wrote: > For the discards that happen before the snapshot is started, I need to dig into the Linux and QEMU virtio-balloon > code more to get clear on it. Yes, it's very tricky how the error could trigger. Let's think of the below sequence: - Start a guest with init_on_free=1 set and also a virtio-balloon device - Guest frees a page P and zeroes it (since init_on_free=1). Now P contains all zeros. - Virtio-balloon reports this page to the host, MADV_DONTNEED sent, then this page is dropped on the host. - Start live snapshot, wr-protect all pages (but not including page P because it's currently missing). Let's call it $SNAPSHOT1. - Guest does alloc_page(__GFP_ZERO), accidentally fetching this page P and getting it returned - So far, page P is still all zero (which is good!), then the guest uses page P and writes data to it (say, now P has data P1 rather than all zeros). - Live snapshot saves page P, whose content is P1 rather than all zeros. - Live snapshot completed. Saved as $SNAPSHOT1. Then when loading snapshot $SNAPSHOT1, we'll have P containing data P1. After the snapshot is loaded, when the guest allocates this page P again with alloc_page(__GFP_ZERO), the guest kernel "thinks" this page is all-zero already, so memzero() is skipped even if __GFP_ZERO is provided. Then this page P (with content P1) gets returned for the alloc_page(__GFP_ZERO) even with __GFP_ZERO set. That could break the caller of alloc_page(). > Anyhow I'm quite sure that adding global MISSING handler for snapshotting > is too heavy and not really needed. UFFDIO_ZEROPAGE installs a zero pfn and that should be all of it. There'll definitely be overhead, but it may not be as huge as imagined. Live snapshot is great in that we have a point-in-time image of the guest without stopping it, so taking slightly longer won't be a huge loss either. Actually we can also think of other ways to work around it. One way is we can pre-fault all guest pages before wr-protect. Note that we don't need to write to the guest page because a read would suffice, since uffd-wp also works with the zero pfn. It's just that this workaround won't help with saving snapshot disk space, but it seems to work. It would be great if you have other workarounds; maybe, as you said, UFFDIO_ZEROPAGE is not the only route. Thanks,
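For reference, resolving a MISSING fault with a zero pfn is a one-ioctl affair once a region is registered with UFFDIO_REGISTER_MODE_MISSING; the uapi name is UFFDIO_ZEROPAGE. A sketch of just the resolution step, where fault_addr and page_size are assumed inputs coming from the fault-handling thread:

#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>

static int resolve_missing_with_zeropage(int uffd, uint64_t fault_addr,
                                         uint64_t page_size)
{
    struct uffdio_zeropage zp = {
        .range = {
            .start = fault_addr & ~(page_size - 1),
            .len   = page_size,
        },
        .mode = 0, /* wake the faulting thread when done */
    };

    /* Installs the shared zero page at the faulting address instead of
     * copying data in, so the guest reads zeroes as it expects. */
    return ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
}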
On Fri, Feb 19, 2021 at 03:50:52PM -0500, Peter Xu wrote: > Andrey, > > On Fri, Feb 19, 2021 at 09:57:37AM +0300, Andrey Gruzdev wrote: > > For the discards that happen before snapshot is started, I need to dig into Linux and QEMU virtio-balloon > > code more to get clear on it. > > Yes it's very tricky on how the error could trigger. > > Let's think of below sequence: > > - Start a guest with init_on_free=1 set and also a virtio-balloon device > > - Guest frees a page P and zeroed it (since init_on_free=1). Now P contains > all zeros. > > - Virtio-balloon reports this page to host, MADV_DONTNEED sent, then this > page is dropped on the host. > > - Start live snapshot, wr-protect all pages (but not including page P because > it's currently missing). Let's call it $SNAPSHOT1. > > - Guest does alloc_page(__GFP_ZERO), accidentally fetching this page P and > returned > > - So far, page P is still all zero (which is good!), then guest uses page P > and writes data to it (say, now P has data P1 rather than all zeros). > > - Live snapshot saves page P, which content P1 rather than all zeros. > > - Live snapshot completed. Saved as $SNAPSHOT1. > > Then when load snapshot $SNAPSHOT1, we'll have P contains data P1. After > snapshot loaded, when guest allocate again with alloc_page(__GFP_ZERO) on this > page P, since guest kernel "thought" this page is all-zero already so memzero() > is skipped even if __GFP_ZERO is provided. Then this page P (with content P1) > got returned for the alloc_page(__GFP_ZERO) even if __GFP_ZERO set. That could > break the caller of alloc_page(). > > > Anyhow I'm quite sure that adding global MISSING handler for snapshotting > > is too heavy and not really needed. > > UFFDIO_ZEROPAGE installs a zero pfn and that should be all of it. There'll > definitely be overhead, but it may not be that huge as imagined. Live snapshot > is great in that we have point-in-time image of guest without stopping the > guest, so taking slightly longer time won't be a huge loss to us too. > > Actually we can also think of other ways to work around it. One way is we can > pre-fault all guest pages before wr-protect. Note that we don't need to write > to the guest page because read would suffice, since uffd-wp would also work > with zero pfn. It's just that this workaround won't help on saving snapshot > disk space, but it seems working. It would be great if you have other > workarounds, maybe as you said UFFDIO_ZEROPAGE is not the only route. Wait.. it actually seems to also solve the disk usage issue.. :) We should just need to make sure to prohibit the balloon before starting to pre-fault reads on all guest RAM. Seems awkward, but also seems working.. Hmm..
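A sketch of that pre-fault pass, under the assumptions above (the balloon is inhibited first; host_addr/size are assumed to describe the RAMBlock's host mapping). One read per page is enough to populate the shared zero pfn, which both gives uffd-wp something to protect and lets the snapshot writer detect the page as zero instead of storing garbage:

#include <stddef.h>
#include <stdint.h>

static void prefault_range(volatile const uint8_t *host_addr, size_t size,
                           size_t page_size)
{
    uint64_t sink = 0;

    /* A read fault on an unpopulated anonymous page maps the shared
     * zero pfn; no write is needed, and uffd-wp works on it afterwards. */
    for (size_t off = 0; off < size; off += page_size) {
        sink += host_addr[off];
    }
    (void)sink; /* keep the loop from being optimized away */
}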
> Am 19.02.2021 um 22:10 schrieb Peter Xu <peterx@redhat.com>: > > On Fri, Feb 19, 2021 at 03:50:52PM -0500, Peter Xu wrote: >> Andrey, >> >>> On Fri, Feb 19, 2021 at 09:57:37AM +0300, Andrey Gruzdev wrote: >>> For the discards that happen before snapshot is started, I need to dig into Linux and QEMU virtio-baloon >>> code more to get clear with it. >> >> Yes it's very tricky on how the error could trigger. >> >> Let's think of below sequence: >> >> - Start a guest with init_on_free=1 set and also a virtio-balloon device >> >> - Guest frees a page P and zeroed it (since init_on_free=1). Now P contains >> all zeros. >> >> - Virtio-balloon reports this page to host, MADV_DONTNEED sent, then this >> page is dropped on the host. >> >> - Start live snapshot, wr-protect all pages (but not including page P because >> it's currently missing). Let's call it $SNAPSHOT1. >> >> - Guest does alloc_page(__GFP_ZERO), accidentally fetching this page P and >> returned >> >> - So far, page P is still all zero (which is good!), then guest uses page P >> and writes data to it (say, now P has data P1 rather than all zeros). >> >> - Live snapshot saves page P, which content P1 rather than all zeros. >> >> - Live snapshot completed. Saved as $SNAPSHOT1. >> >> Then when load snapshot $SNAPSHOT1, we'll have P contains data P1. After >> snapshot loaded, when guest allocate again with alloc_page(__GFP_ZERO) on this >> page P, since guest kernel "thought" this page is all-zero already so memzero() >> is skipped even if __GFP_ZERO is provided. Then this page P (with content P1) >> got returned for the alloc_page(__GFP_ZERO) even if __GFP_ZERO set. That could >> break the caller of alloc_page(). >> >>> Anyhow I'm quite sure that adding global MISSING handler for snapshotting >>> is too heavy and not really needed. >> >> UFFDIO_ZEROCOPY installs a zero pfn and that should be all of it. There'll >> definitely be overhead, but it may not be that huge as imagined. Live snapshot >> is great in that we have point-in-time image of guest without stopping the >> guest, so taking slightly longer time won't be a huge loss to us too. >> >> Actually we can also think of other ways to work around it. One way is we can >> pre-fault all guest pages before wr-protect. Note that we don't need to write >> to the guest page because read would suffice, since uffd-wp would also work >> with zero pfn. It's just that this workaround won't help on saving snapshot >> disk space, but it seems working. It would be great if you have other >> workarounds, maybe as you said UFFDIO_ZEROCOPY is not the only route. > > Wait.. it actually seems to also solve the disk usage issue.. :) > > We should just need to make sure to prohibit balloon before staring to > pre-fault read on all guest ram. Seems awkward, but also seems working.. Hmm.. A shiver just went down my spine. Please don‘t just for the sake of creating a snapshot. (Just imagine you don‘t have a shared zeropage...) > -- > Peter Xu >
> Am 19.02.2021 um 22:14 schrieb David Hildenbrand <dhildenb@redhat.com>: > > >>> Am 19.02.2021 um 22:10 schrieb Peter Xu <peterx@redhat.com>: >>> >>> On Fri, Feb 19, 2021 at 03:50:52PM -0500, Peter Xu wrote: >>> Andrey, >>> >>>> On Fri, Feb 19, 2021 at 09:57:37AM +0300, Andrey Gruzdev wrote: >>>> For the discards that happen before snapshot is started, I need to dig into Linux and QEMU virtio-baloon >>>> code more to get clear with it. >>> >>> Yes it's very tricky on how the error could trigger. >>> >>> Let's think of below sequence: >>> >>> - Start a guest with init_on_free=1 set and also a virtio-balloon device >>> >>> - Guest frees a page P and zeroed it (since init_on_free=1). Now P contains >>> all zeros. >>> >>> - Virtio-balloon reports this page to host, MADV_DONTNEED sent, then this >>> page is dropped on the host. >>> >>> - Start live snapshot, wr-protect all pages (but not including page P because >>> it's currently missing). Let's call it $SNAPSHOT1. >>> >>> - Guest does alloc_page(__GFP_ZERO), accidentally fetching this page P and >>> returned >>> >>> - So far, page P is still all zero (which is good!), then guest uses page P >>> and writes data to it (say, now P has data P1 rather than all zeros). >>> >>> - Live snapshot saves page P, which content P1 rather than all zeros. >>> >>> - Live snapshot completed. Saved as $SNAPSHOT1. >>> >>> Then when load snapshot $SNAPSHOT1, we'll have P contains data P1. After >>> snapshot loaded, when guest allocate again with alloc_page(__GFP_ZERO) on this >>> page P, since guest kernel "thought" this page is all-zero already so memzero() >>> is skipped even if __GFP_ZERO is provided. Then this page P (with content P1) >>> got returned for the alloc_page(__GFP_ZERO) even if __GFP_ZERO set. That could >>> break the caller of alloc_page(). >>> >>>> Anyhow I'm quite sure that adding global MISSING handler for snapshotting >>>> is too heavy and not really needed. >>> >>> UFFDIO_ZEROCOPY installs a zero pfn and that should be all of it. There'll >>> definitely be overhead, but it may not be that huge as imagined. Live snapshot >>> is great in that we have point-in-time image of guest without stopping the >>> guest, so taking slightly longer time won't be a huge loss to us too. >>> >>> Actually we can also think of other ways to work around it. One way is we can >>> pre-fault all guest pages before wr-protect. Note that we don't need to write >>> to the guest page because read would suffice, since uffd-wp would also work >>> with zero pfn. It's just that this workaround won't help on saving snapshot >>> disk space, but it seems working. It would be great if you have other >>> workarounds, maybe as you said UFFDIO_ZEROCOPY is not the only route. >> >> Wait.. it actually seems to also solve the disk usage issue.. :) >> >> We should just need to make sure to prohibit balloon before staring to >> pre-fault read on all guest ram. Seems awkward, but also seems working.. Hmm.. > > A shiver just went down my spine. Please don‘t just for the sake of creating a snapshot. > > (Just imagine you don‘t have a shared zeropage...) ... and I just remembered we read all memory either way. Gah. I have some patches to make snapshots fly with virtio-mem so exactly that won‘t happen. But they depend on vfio support, so it might take a while.
On Fri, Feb 19, 2021 at 10:20:42PM +0100, David Hildenbrand wrote: > > A shiver just went down my spine. Please don't just for the sake of creating a snapshot. > > > > (Just imagine you don't have a shared zeropage...) > > ... and I just remembered we read all memory either way. Gah. > > I have some patches to make snapshots fly with virtio-mem so exactly that won't happen. But they depend on vfio support, so it might take a while. Sorry I can't really follow. It'll be great if virtio-mem won't have a similar problem with live snapshot in the end. Is that idea applicable to balloon too, then?
> Am 19.02.2021 um 23:47 schrieb Peter Xu <peterx@redhat.com>: > > On Fri, Feb 19, 2021 at 10:20:42PM +0100, David Hildenbrand wrote: >>> A shiver just went down my spine. Please don‘t just for the sake of creating a snapshot. >>> >>> (Just imagine you don‘t have a shared zeropage...) >> >> ... and I just remembered we read all memory either way. Gah. >> >> I have some patches to make snapshots fly with virtio-mem so exactly that won‘t happen. But they depend on vfio support, so it might take a while. > > Sorry I can't really follow. > Live snapshotting ends up reading all guest memory (dirty bitmap starts with all 1s), which is not what we want for virtio-mem - we don’t want to read and migrate memory that has been discarded and has no stable content. For ordinary migration we use the guest page hint API to clear bits in the dirty bitmap after dirty bitmap sync. Well, if we don‘t do bitmap syncs we‘ll never clear any dirty bits. That‘s the problem. > It'll be great if virtio-mem won't have similar problem with live snapshot > finally. Is that idea applicable to balloon too, then? The idea is to reuse the RamDiscardMgr as to be introduced with vfio support to teach migration code which parts of specific RAMBlock never to migrate. It avoids the hack with reusing the guest page hint API. As virtio-balloon is not applicable to RamDiscardMgr (and virtio-balloon has no memory about what has been inflated) it won‘t affect virtio-balloon. > > -- > Peter Xu >
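A hedged sketch of the interface shape David hints at - a per-RAMBlock discard manager that migration queries up front, rather than encoding "never migrate this" into the dirty bitmap. All names here are illustrative, since the real interface was still being developed at the time:

#include <stdint.h>

typedef struct RAMBlockStub RAMBlockStub; /* stand-in for QEMU's RAMBlock */

typedef int (*PopulatedSectionFn)(uint64_t offset, uint64_t len, void *opaque);

typedef struct RamDiscardMgrSketch {
    /* Invoke fn for every currently-populated range of the block, so the
     * migration code can seed its bitmap with 1s only where content is
     * present and stable. */
    int (*replay_populated)(const RAMBlockStub *rb, PopulatedSectionFn fn,
                            void *opaque);
} RamDiscardMgrSketch;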
On Sat, Feb 20, 2021 at 02:59:42AM -0500, David Hildenbrand wrote: > Live snapshotting ends up reading all guest memory (dirty bitmap starts with all 1s), which is not what we want for virtio-mem - we don't want to read and migrate memory that has been discarded and has no stable content. > > For ordinary migration we use the guest page hint API to clear bits in the dirty bitmap after dirty bitmap sync. Well, if we don't do bitmap syncs we'll never clear any dirty bits. That's the problem. Using the dirty bitmap for that information is less efficient, because it's definitely larger-granularity information than PAGE_SIZE. If the discarded ranges are always contiguous and at the end of a memory region, we should have some parameter in the ramblock showing where we got shrunk; then we don't check the dirty bitmap at all, rather than always assuming used_length is the one. Thanks,
On 22.02.21 18:29, Peter Xu wrote: > On Sat, Feb 20, 2021 at 02:59:42AM -0500, David Hildenbrand wrote: >> Live snapshotting ends up reading all guest memory (dirty bitmap starts with all 1s), which is not what we want for virtio-mem - we don’t want to read and migrate memory that has been discarded and has no stable content. >> >> For ordinary migration we use the guest page hint API to clear bits in the dirty bitmap after dirty bitmap sync. Well, if we don‘t do bitmap syncs we‘ll never clear any dirty bits. That‘s the problem. > > Using dirty bitmap for that information is less efficient, becase it's > definitely a larger granularity information than PAGE_SIZE. If the disgarded > ranges are always continuous and at the end of a memory region, we should have > some parameter in the ramblock showing that where we got shrinked then we don't > check dirty bitmap at all, rather than always assuming used_length is the one. They are randomly scattered across the whole RAMBlock. Shrinking/growing will be done to some degree in the future (but it won't get rid of the general sparse layout we can produce).
On Mon, Feb 22, 2021 at 06:33:27PM +0100, David Hildenbrand wrote: > On 22.02.21 18:29, Peter Xu wrote: > > On Sat, Feb 20, 2021 at 02:59:42AM -0500, David Hildenbrand wrote: > > > Live snapshotting ends up reading all guest memory (dirty bitmap starts with all 1s), which is not what we want for virtio-mem - we don't want to read and migrate memory that has been discarded and has no stable content. > > > > > > For ordinary migration we use the guest page hint API to clear bits in the dirty bitmap after dirty bitmap sync. Well, if we don't do bitmap syncs we'll never clear any dirty bits. That's the problem. > > > > Using the dirty bitmap for that information is less efficient, because it's > > definitely larger-granularity information than PAGE_SIZE. If the discarded > > ranges are always contiguous and at the end of a memory region, we should have > > some parameter in the ramblock showing where we got shrunk; then we don't > > check the dirty bitmap at all, rather than always assuming used_length is the one. > > They are randomly scattered across the whole RAMBlock. Shrinking/growing > will be done to some degree in the future (but it won't get rid of the > general sparse layout we can produce). OK. Btw I think currently live snapshot should still be reading the dirty bitmap, so maybe it's still fine. It's just that it's still not very clear to hide virtio-mem information in the dirty bitmap, imho, since that's not how we interpret the dirty bitmap - it is only for the sake of tracking page changes. What's the granule of virtio-mem for this discard behavior? Maybe we could decouple it from the dirty bitmap some day; if the unit is big enough it's also an efficiency gain, since we'd skip whole chunks rather than looping over tons of pages knowing that they're discarded. Thanks,
On 22.02.21 18:54, Peter Xu wrote: > On Mon, Feb 22, 2021 at 06:33:27PM +0100, David Hildenbrand wrote: >> On 22.02.21 18:29, Peter Xu wrote: >>> On Sat, Feb 20, 2021 at 02:59:42AM -0500, David Hildenbrand wrote: >>>> Live snapshotting ends up reading all guest memory (dirty bitmap starts with all 1s), which is not what we want for virtio-mem - we don’t want to read and migrate memory that has been discarded and has no stable content. >>>> >>>> For ordinary migration we use the guest page hint API to clear bits in the dirty bitmap after dirty bitmap sync. Well, if we don‘t do bitmap syncs we‘ll never clear any dirty bits. That‘s the problem. >>> >>> Using dirty bitmap for that information is less efficient, becase it's >>> definitely a larger granularity information than PAGE_SIZE. If the disgarded >>> ranges are always continuous and at the end of a memory region, we should have >>> some parameter in the ramblock showing that where we got shrinked then we don't >>> check dirty bitmap at all, rather than always assuming used_length is the one. >> >> They are randomly scattered across the whole RAMBlock. Shrinking/growing >> will be done to some degree in the future (but it won't get rid of the >> general sparse layout we can produce). > > OK. Btw I think currently live snapshot should still be reading dirty bitmap, > so maybe it's still fine. It's just that it's still not very clear to hide > virtio-mem information into dirty bitmap, imho, since that's not how we > interpret dirty bitmap - which is only for the sake of tracking page changes. Well, currently it is "what do we have to migrate". > > What's the granule of virtio-mem for this discard behavior? Maybe we could virtio-mem granularity is at least 1MB. This corresponds to 256 bits (32 bytes) in the dirty bitmap I think. > decouple it with dirty bitmap some day; if the unit is big enough it's also a > gain on efficiency so we skip in chunk rather than looping over tons of pages > knowing that they're discarded. Yeah, it's not optimal having to go over the dirty bitmap to cross off "discarded" parts and later having to find bits to migrate. At least find_next_bit() can skip whole longs (8 bytes) and is fairly efficient. There is certainly room for improvement (the current guest free page hinting API is certainly a hack).
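To illustrate the "cross off discarded parts" step with the numbers from this exchange (1MB virtio-mem blocks, i.e. 256 bits of 4k pages per block), a minimal self-contained sketch with a stand-in for QEMU's bitmap_clear():

#include <stdint.h>

#define PAGE_SIZE     4096ULL
#define BLOCK_SIZE    (1024ULL * 1024)  /* 1MB virtio-mem granularity */
#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Minimal stand-in for QEMU's bitmap_clear(). */
static void bitmap_clear(unsigned long *map, uint64_t start, uint64_t count)
{
    for (uint64_t bit = start; bit < start + count; bit++)
        map[bit / BITS_PER_LONG] &= ~(1UL << (bit % BITS_PER_LONG));
}

/* Cross one discarded 1MB block (256 page bits) off the dirty bitmap so
 * the migration loop never reads or saves it. */
static void drop_discarded_block(unsigned long *dirty_bitmap, uint64_t offset)
{
    bitmap_clear(dirty_bitmap, offset / PAGE_SIZE, BLOCK_SIZE / PAGE_SIZE);
}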
On 19.02.2021 23:50, Peter Xu wrote: > Andrey, > > On Fri, Feb 19, 2021 at 09:57:37AM +0300, Andrey Gruzdev wrote: >> For the discards that happen before snapshot is started, I need to dig into Linux and QEMU virtio-balloon >> code more to get clear on it. > Yes it's very tricky on how the error could trigger. > > Let's think of below sequence: > > - Start a guest with init_on_free=1 set and also a virtio-balloon device > > - Guest frees a page P and zeroed it (since init_on_free=1). Now P contains > all zeros. > > - Virtio-balloon reports this page to host, MADV_DONTNEED sent, then this > page is dropped on the host. > > - Start live snapshot, wr-protect all pages (but not including page P because > it's currently missing). Let's call it $SNAPSHOT1. > > - Guest does alloc_page(__GFP_ZERO), accidentally fetching this page P and > returned > > - So far, page P is still all zero (which is good!), then guest uses page P > and writes data to it (say, now P has data P1 rather than all zeros). > > - Live snapshot saves page P, which content P1 rather than all zeros. > > - Live snapshot completed. Saved as $SNAPSHOT1. > > Then when load snapshot $SNAPSHOT1, we'll have P contains data P1. After > snapshot loaded, when guest allocate again with alloc_page(__GFP_ZERO) on this > page P, since guest kernel "thought" this page is all-zero already so memzero() > is skipped even if __GFP_ZERO is provided. Then this page P (with content P1) > got returned for the alloc_page(__GFP_ZERO) even if __GFP_ZERO set. That could > break the caller of alloc_page(). Yep, that's quite clear. >> Anyhow I'm quite sure that adding global MISSING handler for snapshotting >> is too heavy and not really needed. > UFFDIO_ZEROPAGE installs a zero pfn and that should be all of it. There'll > definitely be overhead, but it may not be that huge as imagined. Live snapshot > is great in that we have point-in-time image of guest without stopping the > guest, so taking slightly longer time won't be a huge loss to us too. > > Actually we can also think of other ways to work around it. One way is we can > pre-fault all guest pages before wr-protect. Note that we don't need to write > to the guest page because read would suffice, since uffd-wp would also work > with zero pfn. It's just that this workaround won't help on saving snapshot > disk space, but it seems working. It would be great if you have other > workarounds, maybe as you said UFFDIO_ZEROPAGE is not the only route. > > Thanks, > Just to add: one of the good options is to keep track of virtio-balloon discarded pages and pre-fault them before migration starts. What do you think?
>>> Anyhow I'm quite sure that adding a global MISSING handler for snapshotting is too heavy and not really needed.
>> UFFDIO_ZEROPAGE installs a zero pfn and that should be all of it. There'll definitely be overhead, but it may not be as huge as imagined. Live snapshot is great in that we have a point-in-time image of the guest without stopping the guest, so taking slightly longer won't be a huge loss to us either.
>>
>> Actually we can also think of other ways to work around it. One way is that we can pre-fault all guest pages before wr-protect. Note that we don't need to write to the guest page because a read would suffice, since uffd-wp also works with the zero pfn. It's just that this workaround won't help with saving snapshot disk space, but it seems to work. It would be great if you have other workarounds; maybe as you said UFFDIO_ZEROPAGE is not the only route.
>>
>> Thanks,
>>
> Just to add: one of the good options is to keep track of virtio-balloon discarded pages and pre-fault them before migration starts. What do you think?

Just pre-fault everything and inhibit the balloon. That should work.
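For illustration, the "pre-fault everything and inhibit the balloon" sequence could look roughly like the sketch below. qemu_balloon_inhibit() and RAMBLOCK_FOREACH() are existing QEMU interfaces (the former is what postcopy already uses); prefault_block() is the hypothetical helper from the previous sketch, and the overall flow is an assumption, not the series' actual code:

    /* Sketch only: stop balloon discards, populate all guest RAM,
     * then write-protect it for the snapshot. */
    static void snapshot_prepare_ram(void)
    {
        RAMBlock *block;

        qemu_balloon_inhibit(true);    /* no more discards while we run */

        RAMBLOCK_FOREACH(block) {
            prefault_block(block->host, block->used_length,
                           qemu_real_host_page_size);
        }

        /* ... wr-protect each block with UFFDIO_WRITEPROTECT, take the
         * snapshot, then re-enable the balloon ... */
        qemu_balloon_inhibit(false);
    }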
On 22.02.2021 21:11, David Hildenbrand wrote:
> On 22.02.21 18:54, Peter Xu wrote:
>> On Mon, Feb 22, 2021 at 06:33:27PM +0100, David Hildenbrand wrote:
>>> On 22.02.21 18:29, Peter Xu wrote:
>>>> On Sat, Feb 20, 2021 at 02:59:42AM -0500, David Hildenbrand wrote:
>>>>> Live snapshotting ends up reading all guest memory (dirty bitmap starts with all 1s), which is not what we want for virtio-mem - we don't want to read and migrate memory that has been discarded and has no stable content.
>>>>>
>>>>> For ordinary migration we use the guest page hint API to clear bits in the dirty bitmap after dirty bitmap sync. Well, if we don't do bitmap syncs we'll never clear any dirty bits. That's the problem.
>>>>
>>>> Using the dirty bitmap for that information is less efficient, because it's definitely larger-granularity information than PAGE_SIZE. If the discarded ranges are always contiguous and at the end of a memory region, we should have some parameter in the ramblock showing where it got shrunk, so we don't check the dirty bitmap at all rather than always assuming used_length is the one.
>>>
>>> They are randomly scattered across the whole RAMBlock. Shrinking/growing will be done to some degree in the future (but it won't get rid of the general sparse layout we can produce).
>>
>> OK. Btw I think currently live snapshot should still be reading the dirty bitmap, so maybe it's still fine. It's just that it's still not very clear to hide virtio-mem information in the dirty bitmap, imho, since that's not how we interpret the dirty bitmap - which is only for the sake of tracking page changes.
>
> Well, currently it is "what do we have to migrate".
>
Using the dirty bitmap can prevent unexpected (overwritten) content of pre-discarded pages from getting into the snapshot, but they'll appear as zeroed on the destination.

What about other virtio-balloon features like HW_POISON, where discarded pages should come back filled with a certain pattern on repopulation?

Thanks,

>>
>> What's the granule of virtio-mem for this discard behavior? Maybe we could
>
> virtio-mem granularity is at least 1MB. This corresponds to 256 bits (32 bytes) in the dirty bitmap I think.
>
>> decouple it from the dirty bitmap some day; if the unit is big enough it's also a gain in efficiency, since we'd skip in chunks rather than looping over tons of pages knowing that they're discarded.
>
> Yeah, it's not optimal having to go over the dirty bitmap to cross off "discarded" parts and later having to find bits to migrate.
>
> At least find_next_bit() can skip whole longs (8 bytes) and is fairly efficient. There is certainly room for improvement (the current guest free page hinting API is certainly a hack).
>
On 24.02.2021 19:54, David Hildenbrand wrote:
>>>> Anyhow I'm quite sure that adding a global MISSING handler for snapshotting is too heavy and not really needed.
>>> UFFDIO_ZEROPAGE installs a zero pfn and that should be all of it. There'll definitely be overhead, but it may not be as huge as imagined. Live snapshot is great in that we have a point-in-time image of the guest without stopping the guest, so taking slightly longer won't be a huge loss to us either.
>>>
>>> Actually we can also think of other ways to work around it. One way is that we can pre-fault all guest pages before wr-protect. Note that we don't need to write to the guest page because a read would suffice, since uffd-wp also works with the zero pfn. It's just that this workaround won't help with saving snapshot disk space, but it seems to work. It would be great if you have other workarounds; maybe as you said UFFDIO_ZEROPAGE is not the only route.
>>>
>>> Thanks,
>>>
>> Just to add: one of the good options is to keep track of virtio-balloon discarded pages and pre-fault them before migration starts. What do you think?
>
> Just pre-fault everything and inhibit the balloon. That should work.
>
Yep.
On 24.02.21 17:56, Andrey Gruzdev wrote:
> On 22.02.2021 21:11, David Hildenbrand wrote:
>> On 22.02.21 18:54, Peter Xu wrote:
>>> On Mon, Feb 22, 2021 at 06:33:27PM +0100, David Hildenbrand wrote:
>>>> On 22.02.21 18:29, Peter Xu wrote:
>>>>> On Sat, Feb 20, 2021 at 02:59:42AM -0500, David Hildenbrand wrote:
>>>>>> Live snapshotting ends up reading all guest memory (dirty bitmap starts with all 1s), which is not what we want for virtio-mem - we don't want to read and migrate memory that has been discarded and has no stable content.
>>>>>>
>>>>>> For ordinary migration we use the guest page hint API to clear bits in the dirty bitmap after dirty bitmap sync. Well, if we don't do bitmap syncs we'll never clear any dirty bits. That's the problem.
>>>>>
>>>>> Using the dirty bitmap for that information is less efficient, because it's definitely larger-granularity information than PAGE_SIZE. If the discarded ranges are always contiguous and at the end of a memory region, we should have some parameter in the ramblock showing where it got shrunk, so we don't check the dirty bitmap at all rather than always assuming used_length is the one.
>>>>
>>>> They are randomly scattered across the whole RAMBlock. Shrinking/growing will be done to some degree in the future (but it won't get rid of the general sparse layout we can produce).
>>>
>>> OK. Btw I think currently live snapshot should still be reading the dirty bitmap, so maybe it's still fine. It's just that it's still not very clear to hide virtio-mem information in the dirty bitmap, imho, since that's not how we interpret the dirty bitmap - which is only for the sake of tracking page changes.
>>
>> Well, currently it is "what do we have to migrate".
>>
> Using the dirty bitmap can prevent unexpected (overwritten) content of pre-discarded pages from getting into the snapshot, but they'll appear as zeroed on the destination.

To be precise, they'll usually appear as discarded (no pages in the migration stream).

> What about other virtio-balloon features like HW_POISON, where discarded pages should come back filled with a certain pattern on repopulation?

In case of free page reporting, we skip discarding if poisoning is set to != 0 for exactly that reason:

    if (virtio_balloon_inhibited() || dev->poison_val) {
        goto skip_element;
    }
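Annotated, that check reads as follows (the comments are added for this discussion and are not in the QEMU source):

    if (virtio_balloon_inhibited()   /* discards are globally inhibited,
                                      * e.g. during postcopy */
        || dev->poison_val) {        /* guest set a non-zero poison value;
                                      * a discarded page would read back
                                      * as zero, not as the pattern */
        goto skip_element;
    }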
On 24.02.2021 20:01, David Hildenbrand wrote:
> On 24.02.21 17:56, Andrey Gruzdev wrote:
>> On 22.02.2021 21:11, David Hildenbrand wrote:
>>> On 22.02.21 18:54, Peter Xu wrote:
>>>> On Mon, Feb 22, 2021 at 06:33:27PM +0100, David Hildenbrand wrote:
>>>>> On 22.02.21 18:29, Peter Xu wrote:
>>>>>> On Sat, Feb 20, 2021 at 02:59:42AM -0500, David Hildenbrand wrote:
>>>>>>> Live snapshotting ends up reading all guest memory (dirty bitmap starts with all 1s), which is not what we want for virtio-mem - we don't want to read and migrate memory that has been discarded and has no stable content.
>>>>>>>
>>>>>>> For ordinary migration we use the guest page hint API to clear bits in the dirty bitmap after dirty bitmap sync. Well, if we don't do bitmap syncs we'll never clear any dirty bits. That's the problem.
>>>>>>
>>>>>> Using the dirty bitmap for that information is less efficient, because it's definitely larger-granularity information than PAGE_SIZE. If the discarded ranges are always contiguous and at the end of a memory region, we should have some parameter in the ramblock showing where it got shrunk, so we don't check the dirty bitmap at all rather than always assuming used_length is the one.
>>>>>
>>>>> They are randomly scattered across the whole RAMBlock. Shrinking/growing will be done to some degree in the future (but it won't get rid of the general sparse layout we can produce).
>>>>
>>>> OK. Btw I think currently live snapshot should still be reading the dirty bitmap, so maybe it's still fine. It's just that it's still not very clear to hide virtio-mem information in the dirty bitmap, imho, since that's not how we interpret the dirty bitmap - which is only for the sake of tracking page changes.
>>>
>>> Well, currently it is "what do we have to migrate".
>>>
>> Using the dirty bitmap can prevent unexpected (overwritten) content of pre-discarded pages from getting into the snapshot, but they'll appear as zeroed on the destination.
>
> To be precise, they'll usually appear as discarded (no pages in the migration stream).

I just meant the content the page retains after population - zeroes.

>
>> What about other virtio-balloon features like HW_POISON, where discarded pages should come back filled with a certain pattern on repopulation?
>
> In case of free page reporting, we skip discarding if poisoning is set to != 0 for exactly that reason:
>
>     if (virtio_balloon_inhibited() || dev->poison_val) {
>         goto skip_element;
>     }
>
Got it, thanks.