Message ID: 20210121152458.193248-1-andrey.gruzdev@virtuozzo.com
Series: UFFD write-tracking migration/snapshots
On 21.01.21 16:24, andrey.gruzdev--- via wrote:
> This patch series is a kind of 'rethinking' of Denis Plotnikov's ideas he
> implemented in his series '[PATCH v0 0/4] migration: add background snapshot'.
>
> Currently the only way to make an (external) live VM snapshot is to use the
> existing dirty page logging migration mechanism. The main problem is that it
> tends to produce a lot of page duplicates while the running VM goes on
> updating already saved pages. As a result, the vmstate image size is commonly
> several times bigger than the non-zero part of the virtual machine's RSS.
> The time required to converge RAM migration and the size of the snapshot
> image depend heavily on the guest memory write rate, sometimes resulting in
> unacceptably long snapshot creation times and huge image sizes.
>
> This series proposes a way to solve the aforementioned problems. This is done
> by using a different RAM migration mechanism based on the UFFD write
> protection management introduced in the v5.7 kernel. The migration strategy
> is to 'freeze' guest RAM content using write-protection and iteratively
> release protection for memory ranges that have already been saved to the
> migration stream. At the same time we read in pending UFFD write fault
> events and save those pages out-of-order with higher priority.

Hi,

just stumbled over this, quick question:

I recently played with UFFD_WP and noticed that write protection is only effective on pages/ranges that already have pages populated (IOW: !pte_none() in the kernel).

In case memory was never populated (or was discarded using e.g. madvise(DONTNEED)), write-protection will be skipped silently and you won't get WP events for applicable pages.

So if someone writes to a yet unpopulated page ("zero"), you won't get WP events.

I can spot that you do a single uffd_change_protection() on the whole RAMBlock.

How are you handling that scenario, or why don't you have to handle that scenario?
On 09.02.2021 15:37, David Hildenbrand wrote:
> On 21.01.21 16:24, andrey.gruzdev--- via wrote:
>> [cover letter snipped]
>
> Hi,
>
> just stumbled over this, quick question:
>
> I recently played with UFFD_WP and noticed that write protection is only
> effective on pages/ranges that already have pages populated (IOW:
> !pte_none() in the kernel).
>
> In case memory was never populated (or was discarded using e.g.
> madvise(DONTNEED)), write-protection will be skipped silently and you
> won't get WP events for applicable pages.
>
> So if someone writes to a yet unpopulated page ("zero"), you won't get WP
> events.
>
> I can spot that you do a single uffd_change_protection() on the whole
> RAMBlock.
>
> How are you handling that scenario, or why don't you have to handle that
> scenario?

Hi David,

I really wonder if such a problem exists.. If we are talking about a write to an unpopulated page, we should get a first page fault on the non-present page and populate it with protection bits from the respective vma. For a UFFD_WP vma the page will be populated non-writable. So we'll get another page fault on the present-but-read-only page and go to handle_userfault.
>> [David's question snipped]
>
> Hi David,
>
> I really wonder if such a problem exists.. If we are talking about a

I immediately ran into this issue with my simplest test cases. :)

> write to an unpopulated page, we should get a first page fault on the
> non-present page and populate it with protection bits from the respective
> vma. For a UFFD_WP vma the page will be populated non-writable. So we'll
> get another page fault on the present-but-read-only page and go to
> handle_userfault.

See the attached test program. Triggers for me on 5.11.0-rc6+ and 5.9.13-200.fc33:

gcc -lpthread uffdio_wp.c -o uffdio_wp
./uffdio_wp
WP did not fire

Uncomment the placement of the zeropage just before registering to make the WP actually trigger. If there is no PTE, there is nothing to protect.

And it makes sense: how should the fault handler know which ranges you wp-ed if there is no place to store that information (the PTEs!)? The VMA cannot tell that story; it only knows that someone registered UFFD_WP to selectively wp some parts.

You might have to register also for MISSING faults and place zero pages.
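The attachment itself is not preserved in this archive. Below is a minimal reconstruction of the kind of test described (my sketch, not David's actual uffdio_wp.c; assumes Linux >= 5.7 for UFFDIO_WRITEPROTECT):

/* Write-protect an anonymous range that was never populated, write to it,
 * and observe that no WP event arrives. Error handling trimmed. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int uffd;
static long page;
static volatile int fired;

/* Fault-handler thread: waits for one WP event, then removes protection
 * so the faulting writer can make progress. */
static void *handler(void *arg)
{
    struct uffd_msg msg;

    if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
        return NULL;
    if (msg.event == UFFD_EVENT_PAGEFAULT &&
        (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
        fired = 1;
        struct uffdio_writeprotect wp = {
            .range = { .start = msg.arg.pagefault.address &
                                ~((unsigned long)page - 1),
                       .len = page },
            .mode = 0,                     /* un-write-protect and wake */
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
    }
    return NULL;
}

int main(void)
{
    page = sysconf(_SC_PAGESIZE);
    uffd = syscall(__NR_userfaultfd, O_CLOEXEC);

    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    char *area = mmap(NULL, page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* Uncomment to populate the PTE first -- then the WP event fires: */
    /* area[0] = 0; */

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)area, .len = page },
        .mode  = UFFDIO_REGISTER_MODE_WP,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    struct uffdio_writeprotect wp = {
        .range = { .start = (unsigned long)area, .len = page },
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

    pthread_t t;
    pthread_create(&t, NULL, handler, NULL);

    area[0] = 1;        /* write to the (never populated) protected range */
    usleep(100 * 1000); /* give the handler a moment */

    printf(fired ? "WP fired\n" : "WP did not fire\n");
    return 0;
}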
Hi, David, Andrey,

On Tue, Feb 09, 2021 at 08:06:58PM +0100, David Hildenbrand wrote:
> > > [David's question snipped]

Good catch.. Indeed I overlooked that as well when reviewing the code.

> > Hi David,
> >
> > I really wonder if such a problem exists.. If we are talking about a
>
> I immediately ran into this issue with my simplest test cases. :)
>
> > write to an unpopulated page, we should get a first page fault on the
> > non-present page and populate it with protection bits from the
> > respective vma. For a UFFD_WP vma the page will be populated
> > non-writable. So we'll get another page fault on the
> > present-but-read-only page and go to handle_userfault.

The problem is even if the page is read-only, it does not yet have the uffd-wp bit set, so it won't really trigger the handle_userfault() path.

> You might have to register also for MISSING faults and place zero pages.

So I think what's missing for live snapshot is indeed to register with both missing & wp mode.

Then we'll receive two messages: for wp, we do like before. For missing, we do UFFDIO_ZEROPAGE and at the same time dump this page as a zero page.

I bet live snapshot didn't encounter this issue simply because normal live snapshots would still work, especially when there's a guest OS. Say, the worst case is we could have migrated some zero pages with some random data filled in along with the snapshot; however, if all these pages were zero pages and not used by the guest OS after all, then when we load the snapshot we won't easily notice either..

Thanks,
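A sketch of what that combined registration and fault loop could look like (not code from the series; save_page()/record_zero_page() are hypothetical snapshot hooks, and a 4k page size is assumed):

/* Track MISSING and WP faults together; resolve MISSING faults with
 * UFFDIO_ZEROPAGE and record the page as zero in the snapshot stream.
 * Error handling omitted for brevity. */
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define PAGE_SIZE_ASSUMED 4096UL          /* assume 4k pages for the sketch */

static void save_page(unsigned long addr);        /* hypothetical hook */
static void record_zero_page(unsigned long addr); /* hypothetical hook */

static void track_and_serve(int uffd, unsigned long start, unsigned long len)
{
    struct uffdio_register reg = {
        .range = { .start = start, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    struct uffdio_writeprotect wp = {
        .range = { .start = start, .len = len },
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

    struct uffd_msg msg;
    while (read(uffd, &msg, sizeof(msg)) == (ssize_t)sizeof(msg)) {
        if (msg.event != UFFD_EVENT_PAGEFAULT)
            continue;
        unsigned long addr =
            msg.arg.pagefault.address & ~(PAGE_SIZE_ASSUMED - 1);

        if (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) {
            /* "for wp, we do like before": save with priority, un-protect */
            save_page(addr);
            struct uffdio_writeprotect unwp = {
                .range = { .start = addr, .len = PAGE_SIZE_ASSUMED },
                .mode  = 0,
            };
            ioctl(uffd, UFFDIO_WRITEPROTECT, &unwp);
        } else {
            /* MISSING fault: place a zero page, dump it as a zero page */
            struct uffdio_zeropage zp = {
                .range = { .start = addr, .len = PAGE_SIZE_ASSUMED },
            };
            ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
            record_zero_page(addr);
        }
    }
}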
On Tue, Feb 09, 2021 at 03:09:28PM -0500, Peter Xu wrote:
> [previous message snipped]
>
> I bet live snapshot didn't encounter this issue simply because normal
> live snapshots would still work, especially when there's a guest OS. [...]

I'm thinking of some way to verify this from the live snapshot pov, and I've got an idea so I'll just share it out... Maybe we need a guest application that does something like below:

- mmap() a huge lot of memory

- call mlockall(), so that pages will be provisioned in the guest but without data written. IIUC on the host these pages should be backed by missing pages as long as the guest app doesn't write. Then...

- the app starts to read input from the user:

  - If the user inputs "dirty" and enter: it'll start to dirty the whole range. Writing non-zero to the 1st byte of each page would suffice.

  - If the user inputs "check" and enter: it'll read the whole memory chunk to see whether all the pages are zero pages. If it reads any non-zero page, it should bail out and print an error.

With the help of the above program, we can do below to verify that the live snapshot worked as expected on zero pages:

- Guest: start the above program, don't input yet (so it's waiting to read either the "dirty" or "check" command)

- Host: start live snapshot

- Guest: input the "dirty" command, so it starts quickly dirtying the ram

- Host: live snapshot completes

Then to verify the snapshot image, we do:

- Host: load the snapshot we've got

- Guest: (should still be in the state of waiting for a cmd) this time we enter "check"

Thanks,
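A sketch of such a guest application, with details not given in the thread filled in by assumption (1 GiB region, 4k pages, commands read from stdin):

/* Guest-side checker: provision a large region without writing data, then
 * "dirty" or "check" it on command. "check" is meant to be run after the
 * snapshot has been loaded back, and should find only zero pages.
 * Note: mlockall() may need a raised RLIMIT_MEMLOCK or privileges. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SZ   (1UL << 30)    /* 1 GiB, adjust to taste */
#define PAGE 4096UL

int main(void)
{
    unsigned char *mem = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }
    if (mlockall(MCL_CURRENT | MCL_FUTURE)) { perror("mlockall"); return 1; }

    char cmd[16];
    while (scanf("%15s", cmd) == 1) {
        if (!strcmp(cmd, "dirty")) {
            for (unsigned long off = 0; off < SZ; off += PAGE)
                mem[off] = 0xaa;        /* non-zero first byte of each page */
            printf("dirtied\n");
        } else if (!strcmp(cmd, "check")) {
            for (unsigned long off = 0; off < SZ; off += PAGE)
                if (mem[off] != 0) {
                    printf("error: non-zero page at offset %lu\n", off);
                    return 1;
                }
            printf("all zero\n");
        }
    }
    return 0;
}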
On 09.02.2021 23:31, Peter Xu wrote:
> On Tue, Feb 09, 2021 at 03:09:28PM -0500, Peter Xu wrote:
>> [previous discussion snipped]
>
> I'm thinking of some way to verify this from the live snapshot pov, and
> I've got an idea so I'll just share it out... Maybe we need a guest
> application that does something like below:
>
> [test program and procedure snipped]
>
> Thanks,

Hi David, Peter,

A little unexpected behavior, from my point of view, for UFFD write-protection. So, does that mean that UFFD_WP protection/events work only for locked memory? I'm now looking at the kernel implementation, to understand..
On 09.02.2021 22:06, David Hildenbrand wrote:
> [earlier exchange snipped]
>
> See the attached test program. Triggers for me on 5.11.0-rc6+ and
> 5.9.13-200.fc33:
>
> gcc -lpthread uffdio_wp.c -o uffdio_wp
> ./uffdio_wp
> WP did not fire
>
> Uncomment the placement of the zeropage just before registering to make
> the WP actually trigger. If there is no PTE, there is nothing to protect.
>
> And it makes sense: how should the fault handler know which ranges you
> wp-ed if there is no place to store that information (the PTEs!)? The VMA
> cannot tell that story; it only knows that someone registered UFFD_WP to
> selectively wp some parts.
>
> You might have to register also for MISSING faults and place zero pages.

Looked at the kernel code; agree that we miss WP events for unpopulated pages, since the UFFD_WP soft bit won't be set in this case. But it doesn't make the saved snapshot inconsistent or introduce security issues. The only side effect is that we may save an updated page instead of a zeroed one, just increasing the snapshot size. However, this guest-physical page has never been touched from the point of view of the saved vCPU/device state, so it is not a concern.

Often (at least on desktop Windows guests) only a small part of RAM has ever been allocated by the guest. Migration code needs to read each guest-physical page, so we'd get a lot of additional UFFD events, many more MISSING events than WP faults.

And the main problem is that adding a MISSING handler is impossible in the current single-threaded snapshot code. We'd get an immediate deadlock on the iterative page read.
On Thu, Feb 11, 2021 at 12:21:51PM +0300, Andrey Gruzdev wrote:
> [quoted discussion snipped]
>
> Hi David, Peter,
>
> A little unexpected behavior, from my point of view, for UFFD
> write-protection. So, does that mean that UFFD_WP protection/events work
> only for locked memory? I'm now looking at the kernel implementation, to
> understand..

Not really; it definitely works for all memory that we've touched. My previous example wanted to let the guest app use a not-yet-allocated page. I figured mlockall() might achieve that, hence I proposed such an example assuming it may verify the zero page issue on live snapshot. So if my understanding is correct, if we run the above scenario, the current live snapshot might fail that app when we do the "check" command at last, by finding non-zero pages.

Thanks,
On Thu, Feb 11, 2021 at 07:19:47PM +0300, Andrey Gruzdev wrote:
> On 09.02.2021 22:06, David Hildenbrand wrote:
>> [quoted discussion snipped]
>
> Looked at the kernel code; agree that we miss WP events for unpopulated
> pages, since the UFFD_WP soft bit won't be set in this case. But it
> doesn't make the saved snapshot inconsistent or introduce security
> issues. The only side effect is that we may save an updated page instead
> of a zeroed one, just increasing the snapshot size. However, this
> guest-physical page has never been touched from the point of view of the
> saved vCPU/device state, so it is not a concern.

Oh, I just remembered one thing: Linux should be zeroing pages when allocating, so even if the page has legacy content it'll be cleared with __GFP_ZERO allocations. So yeah, it would be harder to hit an issue, at least with a sensible OS. I'm not sure about Windows or others, but it could be the common case. Then the only overhead is the extra pages we keep in the live snapshot, which take some more disk space.

Or there could be firmware running without an OS at all, but it should really not read unallocated pages assuming they must be zero. That's not sane behavior even for firmware.

> Often (at least on desktop Windows guests) only a small part of RAM has
> ever been allocated by the guest. Migration code needs to read each
> guest-physical page, so we'd get a lot of additional UFFD events, many
> more MISSING events than WP faults.
>
> And the main problem is that adding a MISSING handler is impossible in
> the current single-threaded snapshot code. We'd get an immediate deadlock
> on the iterative page read.

Right. We'd need to rework the design just to save a bunch of snapshot image disk size. So now I agree with you; let's keep this in mind, but maybe it isn't worth a fix for now, at least until we find something really broken.

Andrey, do you think we should still mention this issue in the todo list of the wiki page for live snapshots?

Thanks,
On 11.02.2021 20:18, Peter Xu wrote:
> On Thu, Feb 11, 2021 at 12:21:51PM +0300, Andrey Gruzdev wrote:
>> [quoted discussion snipped]
>>
>> Hi David, Peter,
>>
>> A little unexpected behavior, from my point of view, for UFFD
>> write-protection. So, does that mean that UFFD_WP protection/events work
>> only for locked memory? I'm now looking at the kernel implementation, to
>> understand..
>
> Not really; it definitely works for all memory that we've touched. My
> previous example wanted to let the guest app use a not-yet-allocated
> page. I figured mlockall() might achieve that, hence I proposed such an
> example assuming it may verify the zero page issue on live snapshot. So
> if my understanding is correct, if we run the above scenario, the current
> live snapshot might fail that app when we do the "check" command at last,
> by finding non-zero pages.
>
> Thanks,

Yes, I understand the limitations with vma's, which lead to the fact that we can write-protect with PTE soft bits only. I think mlockall() is not required, though; mmap() with MAP_POPULATE is enough, since the problem is related only to pte_none() entries.
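A one-function sketch of the mmap() Andrey means (MAP_POPULATE pre-faults the pages at map time, so none of the entries are pte_none() afterwards):

#include <stddef.h>
#include <sys/mman.h>

/* Pre-fault all PTEs at mmap() time; a later UFFDIO_WRITEPROTECT over the
 * range then has entries to mark, without needing mlockall(). */
void *map_populated(size_t size)
{
    return mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
}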
On 11.02.2021 20:32, Peter Xu wrote:
> On Thu, Feb 11, 2021 at 07:19:47PM +0300, Andrey Gruzdev wrote:
>> [quoted discussion snipped]
>
> Right. We'd need to rework the design just to save a bunch of snapshot
> image disk size. So now I agree with you; let's keep this in mind, but
> maybe it isn't worth a fix for now, at least until we find something
> really broken.
>
> Andrey, do you think we should still mention this issue in the todo list
> of the wiki page for live snapshots?
>
> Thanks,

Yes, even if the page happens to be overwritten, it's overwritten by the same VM, so no security boundaries are crossed. And no machine code can assume that RAM content is zeroed on power-on or reset, so our snapshot state stays quite consistent.

Agree we should keep it in mind, but IMHO adding a MISSING handler and running a separate thread would make performance worse.. So I doubt it's worth adding this to the TODO list..

Thanks,
On 11.02.21 19:28, Andrey Gruzdev wrote:
> On 11.02.2021 20:32, Peter Xu wrote:
>> [quoted discussion snipped]
>
> Yes, even if the page happens to be overwritten, it's overwritten by the
> same VM, so no security boundaries are crossed. And no machine code can
> assume that RAM content is zeroed on power-on or reset, so our snapshot
> state stays quite consistent.
>
> Agree we should keep it in mind, but IMHO adding a MISSING handler and
> running a separate thread would make performance worse.. So I doubt it's
> worth adding this to the TODO list..

So, here is what happens: your snapshot will contain garbage at places where it should contain zero.

This happens when your guest starts using an unpopulated page after snapshotting started and the page has not been copied to the snapshot yet. You won't get a WP event, therefore you cannot copy "zero" to the snapshot before the content gets overridden.

If you load your snapshot, it contains garbage at places that are supposed to contain zero.

There is a feature in virtio-balloon that *depends* on previously discarded pages staying unmodified in some cases: free page reporting.

The guest will report pages (that might have been initialized to zero) to the hypervisor. The hypervisor will discard the page content if the content was initialized to zero. After that, the guest is free to reuse the pages anytime with the guarantee that the content has not been modified (-> still is zero).

See QEMU hw/virtio/virtio-balloon.c: virtio_balloon_handle_report()

"When we discard the page it has the effect of removing the page from the hypervisor itself and causing it to be zeroed when it is returned to us. So we must not discard the page [...] if the guest is expecting it to retain a non-zero value."

And Linux drivers/virtio/virtio_balloon.c: virtballoon_validate()

"Inform the hypervisor that our pages are poisoned or initialized. If we cannot do that then we should disable page reporting as it could potentially change the contents of our free pages."

It's as simple as having a Linux guest with init_on_free and free page reporting active via virtio-balloon.

Long story short, your feature might break guests (when continuing from the snapshot) when free page reporting is active. I agree that MISSING events might be too heavy, so you should disable snapshots if free page reporting is active.
On 09.02.21 19:38, Andrey Gruzdev wrote:
> On 09.02.2021 15:37, David Hildenbrand wrote:
>> [quoted discussion snipped]
>
> Hi David,
>
> I really wonder if such a problem exists.. If we are talking about a
> write to an unpopulated page, we should get a first page fault on the
> non-present page and populate it with protection bits from the respective
> vma. For a UFFD_WP vma the page will be populated non-writable. So we'll
> get another page fault on the present-but-read-only page and go to
> handle_userfault.

Hi,

here is another fun issue. Assume you

1. Have a populated page, with some valuable content
2. WP-protect the page
3. madvise(DONTNEED) that page
4. Write to the page

On write access, you won't get a WP event! Instead, you will get a UFFD_EVENT_REMOVE during 3. But you cannot stop that event (there is no dont-wake for it), so you cannot simply defer it as you can with WP events.

So if the guest inflates the balloon (including balloon page migration in Linux) or free page reporting reports a free page while snapshotting is active, you won't be able to save the old content before it is zapped, and your snapshot misses pages with actual content.

Something similar would happen with virtio-mem when unplugging blocks; however, it does not discard any pages while migration is active.

Snapshotting seems to be incompatible with concurrent discards via virtio-balloon.
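A minimal sketch of the four-step sequence above (my reconstruction, not code from the thread): UFFD_FEATURE_EVENT_REMOVE must be requested at UFFDIO_API time to receive the event, and a separate monitor thread is needed because the madvise() call blocks until the REMOVE event is read.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int uffd;

/* Monitor thread: the REMOVE event must be read here, and by the time it
 * arrives the old page content is already being zapped. */
static void *monitor(void *arg)
{
    struct uffd_msg msg;

    while (read(uffd, &msg, sizeof(msg)) == (ssize_t)sizeof(msg)) {
        if (msg.event == UFFD_EVENT_REMOVE)
            printf("UFFD_EVENT_REMOVE: content already gone\n");
        else if (msg.event == UFFD_EVENT_PAGEFAULT)
            printf("pagefault event\n");  /* never seen for step 4 */
    }
    return NULL;
}

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    uffd = syscall(__NR_userfaultfd, O_CLOEXEC);

    struct uffdio_api api = { .api = UFFD_API,
                              .features = UFFD_FEATURE_EVENT_REMOVE };
    ioctl(uffd, UFFDIO_API, &api);

    char *area = mmap(NULL, page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    area[0] = 1;                                 /* 1. populate */

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)area, .len = page },
        .mode  = UFFDIO_REGISTER_MODE_WP,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    struct uffdio_writeprotect wp = {
        .range = { .start = (unsigned long)area, .len = page },
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);       /* 2. wp-protect */

    pthread_t t;
    pthread_create(&t, NULL, monitor, NULL);

    madvise(area, page, MADV_DONTNEED);          /* 3. discard */
    area[0] = 2;                                 /* 4. write: no WP event */

    sleep(1);
    return 0;
}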
You might want to inhibit ballooning while snapshotting is active in hw/virtio/virtio-balloon.c:virtio_balloon_inhibited() just as we do for postcopy.
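One possible shape for that inhibition, sketched under the assumption of QEMU's ram_block_discard_disable() interface (present since QEMU 5.1 and consulted by virtio_balloon_inhibited()); the snapshot_* function names are hypothetical, and this is not the actual patch:

/* Make background snapshots mutually exclusive with RAM discards
 * (balloon inflation, free page reporting). */
#include "qemu/osdep.h"
#include "exec/memory.h"
#include "qapi/error.h"

static int snapshot_start_protect_ram(Error **errp)
{
    /*
     * While discards are disabled, virtio-balloon cannot zap pages behind
     * the write-protection; fails if a discard is already required.
     */
    if (ram_block_discard_disable(true)) {
        error_setg(errp, "RAM discard (e.g. virtio-balloon) is active; "
                   "cannot start background snapshot");
        return -EBUSY;
    }
    /* ... register UFFD WP on RAM blocks and start saving ... */
    return 0;
}

static void snapshot_stop_protect_ram(void)
{
    /* Re-allow balloon deflation/reporting once the snapshot completes. */
    ram_block_discard_disable(false);
}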
On Thu, Feb 11, 2021 at 08:01:29PM +0100, David Hildenbrand wrote:
> [quoted discussion snipped]
>
> There is a feature in virtio-balloon that *depends* on previously
> discarded pages staying unmodified in some cases: free page reporting.
>
> [...]
>
> Long story short, your feature might break guests (when continuing from
> the snapshot) when free page reporting is active. I agree that MISSING
> events might be too heavy, so you should disable snapshots if free page
> reporting is active.

When the page is reported with poison_val set, it is written already by the guest, right (either all zeros, or some special marks)? If so, why wouldn't it be wr-protected by uffd?

Thanks,
> Am 11.02.2021 um 21:31 schrieb Peter Xu <peterx@redhat.com>: > > On Thu, Feb 11, 2021 at 08:01:29PM +0100, David Hildenbrand wrote: >>> On 11.02.21 19:28, Andrey Gruzdev wrote: >>> On 11.02.2021 20:32, Peter Xu wrote: >>>> On Thu, Feb 11, 2021 at 07:19:47PM +0300, Andrey Gruzdev wrote: >>>>> On 09.02.2021 22:06, David Hildenbrand wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> just stumbled over this, quick question: >>>>>>>> >>>>>>>> I recently played with UFFD_WP and notices that write protection is >>>>>>>> only effective on pages/ranges that have already pages populated (IOW: >>>>>>>> !pte_none() in the kernel). >>>>>>>> >>>>>>>> In case memory was never populated (or was discarded using e.g., >>>>>>>> madvice(DONTNEED)), write-protection will be skipped silently and you >>>>>>>> won't get WP events for applicable pages. >>>>>>>> >>>>>>>> So if someone writes to a yet unpoupulated page ("zero"), you won't >>>>>>>> get WP events. >>>>>>>> >>>>>>>> I can spot that you do a single uffd_change_protection() on the whole >>>>>>>> RAMBlock. >>>>>>>> >>>>>>>> How are you handling that scenario, or why don't you have to handle >>>>>>>> that scenario? >>>>>>>> >>>>>>> Hi David, >>>>>>> >>>>>>> I really wonder if such a problem exists.. If we are talking about a >>>>>> I immediately ran into this issue with my simplest test cases. :) >>>>>> >>>>>>> write to an unpopulated page, we should get first page fault on >>>>>>> non-present page and populate it with protection bits from >>>>>>> respective vma. >>>>>>> For UFFD_WP vma's page will be populated non-writable. So we'll get >>>>>>> another page fault on present but read-only page and go to >>>>>>> handle_userfault. >>>>>> See the attached test program. Triggers for me on 5.11.0-rc6+ and >>>>>> 5.9.13-200.fc33 >>>>>> >>>>>> gcc -lpthread uffdio_wp.c -o uffdio_wp >>>>>> ./uffdio_wp >>>>>> WP did not fire >>>>>> >>>>>> Uncomment the placement of the zeropage just before registering to make >>>>>> the WP actually trigger. If there is no PTE, there is nothing to >>>>>> protect. >>>>>> >>>>>> >>>>>> And it makes sense: How should the fault handler know which ranges you >>>>>> wp-ed, if there is no place to store that information (the PTEs!). The >>>>>> VMA cannot tell that story, it only knows that someone registered >>>>>> UFFD_WP to selectively wp some parts. >>>>>> >>>>>> You might have to register also for MISSING faults and place zero pages. >>>>>> >>>>> Looked at the kernel code, agree that we miss WP events for unpopulated >>>>> pages, UFFD_WP softbit won't be set in this case. But it doesn't make saved >>>>> snapshot inconsistent or introduce security issues. The only side effect is >>>>> that we may save updated page instead of zeroed, just increasing snapshot >>>>> size. However this guest-physical page has never been touched from the point >>>>> of view of saved vCPU/device state and is not a concern. >>>> Oh I just remembered one thing, that Linux should be zeroing pages when >>>> allocating, so even if the page has legacy content it'll be cleared with >>>> __GFP_ZERO allocations. So yeah it would be harder to have issue at least with >>>> a sensible OS. I'm not sure about Windows or others, but it could be a common >>>> case. Then the only overhead is the extra pages we kept in the live snapshot, >>>> which takes some more disk space. >>>> >>>> Or there could be firmware running without OS at all, but it should really not >>>> read unallocated pages assuming there must be zero. It's not a sane behavior >>>> even for a firmware. 
>>>> >>>>> Often (at least on desktop Windows guests) only a small part of RAM has ever >>>>> been allocated by guest. Migration code needs to read each guest-physical >>>>> page, so we'll have a lot of additional UFFD events, much more MISSING >>>>> events then WP-faults. >>>>> >>>>> And the main problem is that adding MISSING handler is impossible in current >>>>> single-threaded snapshot code. We'll get an immediate deadlock on iterative >>>>> page read. >>>> Right. We'll need to rework the design but just for saving a bunch of snapshot >>>> image disk size. So now I agree with you, let's keep this in mind, but maybe >>>> it isn't worth a fix for now, at least until we figure something really broken. >>>> >>>> Andrey, do you think we should still mention this issue into the todo list of >>>> the wiki page of live snapshot? >>>> >>>> Thanks, >>>> >>> Yes, even if the page happens to be overwritten, it's overwritten by the same VM so >>> no security boundaries are crossed. And no machine code can assume that RAM content >>> is zeroed on power-on or reset so our snapshot state stays quite consistent. >>> >>> Agree we should keep it in mind, but IMHO adding MISSING handler and running separate >>> thread would make performance worse.. So I doubt it's worth adding this to TODO list.. >>> >> >> So, here is what happens: your snapshot will contain garbage at places where >> it should contain zero. >> >> This happens when your guest starts using an unpopulated page after >> snapshotting started and the page has not been copied to the snapshot yet. >> You won't get a WP event, therefore you cannot copy "zero" to the snapshot >> before content gets overridden. >> >> If you load your snapshot, it contains garbage at places that are supposed >> to contain zero. >> >> >> There is a feature in virtio-balloon that *depends* on previously discarded >> pages from staying unmodified in some cases: free page reporting. >> >> The guest will report pages (that might have been initialized to zero) to >> the hypervisor). The hypervisor will discard page content if the content was >> initialized to zero. After that, the guest is free to reuse the pages >> anytime with the guarantee that content has not been modified (-> still is >> zero). >> >> >> See QEMU hw/virtio/virtio-balloon.c: virtio_balloon_handle_report() >> >> "When we discard the page it has the effect of removing the page from the >> hypervisor itself and causing it to be zeroed when it is returned to us. So >> we must not discard the page [...] if the guest is expecting it to retain a >> non-zero value." >> >> And Linux drivers/virtio/virtio_balloon.c: virtballoon_validate() >> >> "Inform the hypervisor that our pages are poisoned or initialized. If we >> cannot do that then we should disable page reporting as it could potentially >> change the contents of our free pages." >> >> >> It's as simple as having a Linux guest with init_on_free and >> free-page-reporting active via virtio-balloon. >> >> Long story short, your feature might break guests (when continuing the >> snapshot) when free page reporting is active. I agree that MISSING events >> might be too heavy, so you should disable snapshots if free page reporting >> is active. > > When the page is reported with poison_val set, it is written already by the > guest, right (either all zeros, or some special marks)? If so, why it won't be > wr-protected by uffd? Let‘s take a look at init-on-free. The guest zeroes a page and puts it onto a buddy freelist. 
Free page reporting code takes it off that list and reports it to the hypervisor. The hypervisor discards the physical page and tells the guest it has finished processing the page. The guest re-places the page onto the free page list. From that point on, the page can be re-allocated inside the guest and is assumed to be zero. On access, a fresh (zeroed) page is populated by the hypervisor. The guest won't re-zero the page, as it has the guarantee (from free page reporting) that the page remained zero. Write-protecting the unpopulated page won't work as discussed. > > -- > Peter Xu >
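For reference, below is a minimal C sketch in the spirit of the uffdio_wp.c test program mentioned above (reconstructed, not the original attachment; assumes an x86-64 Linux of the v5.7..v5.11 era with uffd-wp support and 4k pages; error handling elided). It registers one anonymous page in UFFDIO_REGISTER_MODE_WP, write-protects it, and writes to it. With the memset() left commented out, the page has no PTE yet, so UFFDIO_WRITEPROTECT silently skips it and no WP event fires; uncommenting the memset() populates the PTE first, and the same write then triggers the handler.

/*
 * uffd_wp_sketch.c - reconstructed sketch, not the original attachment.
 * Build: gcc -lpthread uffd_wp_sketch.c -o uffd_wp_sketch
 */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PAGE 4096UL

static int uffd;

/* Resolve WP faults by dropping the protection, which wakes the writer. */
static void *wp_handler(void *arg)
{
    struct uffd_msg msg;

    (void)arg;
    while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
        if (msg.event == UFFD_EVENT_PAGEFAULT &&
            (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
            struct uffdio_writeprotect wp = {
                .range = { .start = msg.arg.pagefault.address & ~(PAGE - 1),
                           .len   = PAGE },
                .mode  = 0, /* remove protection and wake the faulter */
            };
            printf("WP event fired\n");
            ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
        }
    }
    return NULL;
}

int main(void)
{
    struct uffdio_api api = { .api = UFFD_API };
    pthread_t thread;
    char *area;

    uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
    ioctl(uffd, UFFDIO_API, &api);

    area = mmap(NULL, PAGE, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* memset(area, 0, PAGE);  <- uncomment: populate the PTE first */

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)area, .len = PAGE },
        .mode  = UFFDIO_REGISTER_MODE_WP,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    struct uffdio_writeprotect wp = {
        .range = { .start = (unsigned long)area, .len = PAGE },
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
    };
    /* Silently skips pte_none() pages: nothing gets protected below
     * unless the memset() above populated the page. */
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

    pthread_create(&thread, NULL, wp_handler, NULL);

    area[0] = 1; /* unpopulated case: plain anon fault, no WP event */
    sleep(1);
    printf("done (no \"WP event fired\" line means the WP was skipped)\n");
    return 0;
}

(Much later kernels grew UFFD_FEATURE_WP_UNPOPULATED to address exactly this class of problem, but none of the kernels discussed in this thread have it.)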
On Thu, Feb 11, 2021 at 09:44:07PM +0100, David Hildenbrand wrote: > Let's take a look at init-on-free. > > The guest zeroes a page and puts it onto a buddy freelist. Free page reporting code takes it off that list and reports it to the hypervisor. The hypervisor discards the physical page and tells the guest it has finished processing the page. The guest re-places the page onto the free page list. > > From that point on, the page can be re-allocated inside the guest and is assumed to be zero. On access, a fresh (zeroed) page is populated by the hypervisor. The guest won't re-zero the page, as it has the guarantee (from free page reporting) that the page remained zero. > > Write-protecting the unpopulated page won't work as discussed. IMHO no matter if it's init_on_alloc or init_on_free or both, as long as it's inited in some way, it means the guest OS wrote to it. Then wr-protect will work.. MADV_DONTNEED during live snapshot seems to be a separate topic as you mentioned in the other thread. For that, I agree we'd better simply let virtio_balloon_inhibited() return true for live snapshot too, just like postcopy. Thanks,
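For context, postcopy already stops the balloon from discarding RAM while it runs; the suggestion is to extend the same check to live snapshots. A hedged sketch of what that could look like on the QEMU side - the helper names follow the thread and QEMU's capability naming, but treat the exact signatures as assumptions rather than the upstream API:

#include <stdbool.h>

/* Provided elsewhere in QEMU's migration code (names assumed here). */
bool migration_in_incoming_postcopy(void);
bool migrate_background_snapshot(void);

/*
 * Ballooning and free page reporting madvise(MADV_DONTNEED) guest RAM
 * behind migration's back, so they must be inhibited while incoming
 * postcopy runs - and, per this discussion, while a background (live)
 * snapshot is in progress as well.
 */
bool virtio_balloon_inhibited(void)
{
    return migration_in_incoming_postcopy() || migrate_background_snapshot();
}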
> Am 11.02.2021 um 22:05 schrieb Peter Xu <peterx@redhat.com>: > > On Thu, Feb 11, 2021 at 09:44:07PM +0100, David Hildenbrand wrote: >> Let's take a look at init-on-free. >> >> The guest zeroes a page and puts it onto a buddy freelist. Free page reporting code takes it off that list and reports it to the hypervisor. The hypervisor discards the physical page and tells the guest it has finished processing the page. The guest re-places the page onto the free page list. >> >> From that point on, the page can be re-allocated inside the guest and is assumed to be zero. On access, a fresh (zeroed) page is populated by the hypervisor. The guest won't re-zero the page, as it has the guarantee (from free page reporting) that the page remained zero. >> >> Write-protecting the unpopulated page won't work as discussed. > > IMHO no matter if it's init_on_alloc or init_on_free or both, as long as it's > inited in some way, it means the guest OS wrote to it. Then wr-protect > will work.. The issue is when the discard happened before starting the snapshot. Write-protection won't work and the zeroed content won't be retained in the snapshot. > > MADV_DONTNEED during live snapshot seems to be a separate topic as you > mentioned in the other thread. For that, I agree we'd better simply let > virtio_balloon_inhibited() return true for live snapshot too just like > postcopy. Yes, but that's a separate issue. > > Thanks, > > -- > Peter Xu >
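The host-side mechanism at the center of this exchange is easy to demonstrate: after madvise(MADV_DONTNEED), anonymous memory loses its PTE and reads back as zeroes on the next access, so a later UFFDIO_WRITEPROTECT over the range has nothing to protect at that offset. A tiny self-contained demo:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    const size_t len = 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    p[0] = 42;                       /* populate the page */
    madvise(p, len, MADV_DONTNEED);  /* discard it, as the balloon does */
    assert(p[0] == 0);               /* reads back as a fresh zero page */
    printf("discarded page reads back as %d\n", p[0]);
    return 0;
}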
On Thu, Feb 11, 2021 at 10:09:58PM +0100, David Hildenbrand wrote:
> The issue is when the discard happened before starting the snapshot. Write-protection won‘t work and the zeroed content won‘t be retained in the snapshot.
I see what you mean now, and iiuc it will only be a problem if init_on_free=1.
I think CONFIG_INIT_ON_FREE_DEFAULT_ON should be off for most distros, so the
impact should be small. I thought about it, but indeed I didn't see a
good way to fix this without fixing the zero page copy for live snapshot.
On 12.02.21 04:06, Peter Xu wrote: > On Thu, Feb 11, 2021 at 10:09:58PM +0100, David Hildenbrand wrote: >> The issue is when the discard happened before starting the snapshot. Write-protection won‘t work and the zeroed content won‘t be retained in the snapshot. > > I see what you mean now, and iiuc it will only be a problem if init_on_free=1. > I think CONFIG_INIT_ON_FREE_DEFAULT_ON should be off for most distros, so the Yes, some distros seem to enable init_on_alloc instead. Looking at the introducing commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options") there are security use cases and it might become important with memory tagging. Note that in Linux, there was also the option to poison pages with 0, removed via f289041ed4cf ("mm, page_poison: remove CONFIG_PAGE_POISONING_ZERO"), available in some kernels that supported free page reporting. It got removed and use cases got told to use init_on_free. > impact should be small, I think. I thought about it, but indeed I didn't see a > good way to fix this if without fixing the zero page copy for live snapshot. We should really document this (unexpected) behavior of snapshotting. Otherwise, the next feature comes around that relies on pages that were discarded to remain zeroed (I even have one in mind ;) ) and forgets to disable snapshots.
On Fri, Feb 12, 2021 at 09:52:52AM +0100, David Hildenbrand wrote: > On 12.02.21 04:06, Peter Xu wrote: > > On Thu, Feb 11, 2021 at 10:09:58PM +0100, David Hildenbrand wrote: > > > The issue is when the discard happened before starting the snapshot. Write-protection won't work and the zeroed content won't be retained in the snapshot. > > > > I see what you mean now, and iiuc it will only be a problem if init_on_free=1. > > I think CONFIG_INIT_ON_FREE_DEFAULT_ON should be off for most distros, so the > > Yes, some distros seem to enable init_on_alloc instead. Looking at the > introducing commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 > and init_on_free=1 boot options") there are security use cases and it might > become important with memory tagging. > > Note that in Linux, there was also the option to poison pages with 0, > removed via f289041ed4cf ("mm, page_poison: remove > CONFIG_PAGE_POISONING_ZERO"), available in some kernels that supported free > page reporting. > > It got removed and use cases got told to use init_on_free. > > > impact should be small, I think. I thought about it, but indeed I didn't see a > > > good way to fix this without fixing the zero page copy for live snapshot. > > We should really document this (unexpected) behavior of snapshotting. > Otherwise, the next feature comes around that relies on pages that were > discarded to remain zeroed (I even have one in mind ;) ) and forgets to > disable snapshots. Agreed. I'll see whether Andrey would have any idea to work around this, or further comments. Or I can draft a patch to document this next week (unless Andrey beats me to it :).
On 12.02.2021 19:11, Peter Xu wrote: > On Fri, Feb 12, 2021 at 09:52:52AM +0100, David Hildenbrand wrote: >> On 12.02.21 04:06, Peter Xu wrote: >>> On Thu, Feb 11, 2021 at 10:09:58PM +0100, David Hildenbrand wrote: >>>> The issue is when the discard happened before starting the snapshot. Write-protection won't work and the zeroed content won't be retained in the snapshot. >>> I see what you mean now, and iiuc it will only be a problem if init_on_free=1. >>> I think CONFIG_INIT_ON_FREE_DEFAULT_ON should be off for most distros, so the >> Yes, some distros seem to enable init_on_alloc instead. Looking at the >> introducing commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 >> and init_on_free=1 boot options") there are security use cases and it might >> become important with memory tagging. >> >> Note that in Linux, there was also the option to poison pages with 0, >> removed via f289041ed4cf ("mm, page_poison: remove >> CONFIG_PAGE_POISONING_ZERO"), available in some kernels that supported free >> page reporting. >> >> It got removed and use cases got told to use init_on_free. I think we're talking about init_on_free()/init_on_alloc() on the guest side, right? Still can't get how it relates to the host's unpopulated pages.. Try to look at it from the hardware side. Is untouched SDRAM in hardware required to contain zeroes somehow? No. These 'trash' pages in the migration stream are like never-written physical memory pages: they are not really needed in the snapshot, but they do no harm, just as there's no harm in a never-written physical page being full of garbage. Do these 'trash' pages in the snapshot contain sensitive information that the same VM is not allowed to access? I think not. Otherwise we need a good example of how it could be exploited. The only issue that I see is madvise(MADV_DONTNEED) for RAM blocks during snapshotting. And free page reporting or the memory balloon is secondary - the point is that the UFFD_WP snapshot is incompatible with madvise(MADV_DONTNEED) on the hypervisor side, no matter which guest functionality induces it. >>> impact should be small, I think. I thought about it, but indeed I didn't see a >>> good way to fix this without fixing the zero page copy for live snapshot. >> We should really document this (unexpected) behavior of snapshotting. >> Otherwise, the next feature comes around that relies on pages that were >> discarded to remain zeroed (I even have one in mind ;) ) and forgets to >> disable snapshots. > Agreed. I'll see whether Andrey would have any idea to work around this, or > further comments. Or I can draft a patch to document this next week (unless > Andrey beats me to it :). > It's really better to document this specific behaviour, but also to clarify that the saved state remains consistent and secure, of course only if you agree with my arguments.
> Am 13.02.2021 um 10:34 schrieb Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>: > > >> On 12.02.2021 19:11, Peter Xu wrote: >>>> On Fri, Feb 12, 2021 at 09:52:52AM +0100, David Hildenbrand wrote: >>>>> On 12.02.21 04:06, Peter Xu wrote: >>>>>> On Thu, Feb 11, 2021 at 10:09:58PM +0100, David Hildenbrand wrote: >>>>>> The issue is when the discard happened before starting the snapshot. Write-protection won‘t work and the zeroed content won‘t be retained in the snapshot. >>>>> I see what you mean now, and iiuc it will only be a problem if init_on_free=1. >>>>> I think CONFIG_INIT_ON_FREE_DEFAULT_ON should be off for most distros, so the >>>> Yes, some distros seem to enable init_on_alloc instead. Looking at the >>>> introducing commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 >>>> and init_on_free=1 boot options") there are security use cases and it might >>>> become important with memory tagging. >>>> >>>> Note that in Linux, there was also the option to poison pages with 0, >>>> removed via f289041ed4cf ("mm, page_poison: remove >>>> CONFIG_PAGE_POISONING_ZERO"), available in some kernels that supported free >>>> page reporting. >>>> >>>> It got removed and use cases got told to use init_on_free. >> I think we talk about init_on_free()/init_on_alloc() on guest side, right? >> Still can't get how it relates to host's unpopulated pages.. >> >> Try to look from hardware side. Untouched SDRAM in hardware is required to contain zeroes somehow? No. >> These 'trash' pages in migration stream are like never written physical memory pages, they are really >> not needed in snapshot but they don't do any harm as well as there's no harm in that never-written physical >> page is full of garbage. >> >> Do these 'trash' pages in snapshot contain sensitive information not allowed to be accessed by the same VM? >> I think no. Or we need a good example how it can be potentially exploited. I tried to explain how your implementation breaks *the guest inside the snapshot* (I have no idea why you talk about sensitive information). If you run the snapshot, the guest will run into trouble, because the snapshot contains different data than the guest expects: 1. with discards before snapshotting started and free page reporting is running 2. with discards after snapshotting started Maybe Peter can enlighten you, or the links I shared.
Hi, Andrey, On Sat, Feb 13, 2021 at 12:34:07PM +0300, Andrey Gruzdev wrote: > On 12.02.2021 19:11, Peter Xu wrote: > > On Fri, Feb 12, 2021 at 09:52:52AM +0100, David Hildenbrand wrote: > > > On 12.02.21 04:06, Peter Xu wrote: > > > > On Thu, Feb 11, 2021 at 10:09:58PM +0100, David Hildenbrand wrote: > > > > > The issue is when the discard happened before starting the snapshot. Write-protection won't work and the zeroed content won't be retained in the snapshot. > > > > I see what you mean now, and iiuc it will only be a problem if init_on_free=1. > > > > I think CONFIG_INIT_ON_FREE_DEFAULT_ON should be off for most distros, so the > > > Yes, some distros seem to enable init_on_alloc instead. Looking at the > > > introducing commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 > > > and init_on_free=1 boot options") there are security use cases and it might > > > become important with memory tagging. > > > > > > Note that in Linux, there was also the option to poison pages with 0, > > > removed via f289041ed4cf ("mm, page_poison: remove > > > CONFIG_PAGE_POISONING_ZERO"), available in some kernels that supported free > > > page reporting. > > > > > > It got removed and use cases got told to use init_on_free. > > I think we talk about init_on_free()/init_on_alloc() on guest side, right? Right. IIUC it's the init_on_free() that matters. We'll have no issue if init_on_alloc=1 && init_on_free=0, since in that case all pages will be zeroed before the new page is returned to the caller. Then we're safe, I think. > Still can't get how it relates to host's unpopulated pages.. > Try to look from hardware side. Untouched SDRAM in hardware is required to contain zeroes somehow? No. > These 'trash' pages in migration stream are like never written physical memory pages, they are really > not needed in snapshot but they don't do any harm as well as there's no harm in that never-written physical > page is full of garbage. > > Do these 'trash' pages in snapshot contain sensitive information not allowed to be accessed by the same VM? > I think no. Or we need a good example how it can be potentially exploited. > > The only issue that I see is madvise(MADV_DONTNEED) for RAM blocks during snapshotting. And free page reporting > or memory balloon is secondary - the point is that UFFD_WP snapshot is incompatible with madvise(MADV_DONTNEED) on > hypervisor side. No matter which guest functionality can induce it. I think the problem is that with init_on_free=1, the kernel will assume that all the pages that got freed have been zeroed beforehand, so it thinks it's a waste of time to zero them again when the page is reused/reallocated. As a reference see kernel prep_new_page() where there's: if (!free_pages_prezeroed() && want_init_on_alloc(gfp_flags)) kernel_init_free_pages(page, 1 << order); In this case I believe free_pages_prezeroed() will return true, so we don't even need to check want_init_on_alloc() at all. Note that it'll cover all the cases where the kernel allocates with __GFP_ZERO: it means that even if the guest kernel tries alloc_page(__GFP_ZERO), it may get a page with random data after the live snapshot is loaded. So it's not about any hardware, it's an optimization of the guest kernel instead. It is actually reasonable and efficient, since if we *know* that a page is a zero page, we shouldn't bother zeroing it again. However it brought us a bit of trouble on live snapshot that the current solution might not work for all guest OS configurations.
> > > > > impact should be small, I think. I thought about it, but indeed I didn't see a > > > > good way to fix this without fixing the zero page copy for live snapshot. > > > We should really document this (unexpected) behavior of snapshotting. > > > Otherwise, the next feature comes around that relies on pages that were > > > discarded to remain zeroed (I even have one in mind ;) ) and forgets to > > > disable snapshots. > > Agreed. I'll see whether Andrey would have any idea to work around this, or > > further comments. Or I can draft a patch to document this next week (unless > > Andrey beats me to it :). > > > It's really better to document this specific behaviour, but also to clarify that the saved state remains > consistent and secure, of course only if you agree with my arguments. Yes, no rush on anything yet, let's reach a consensus on understanding the problem before trying to even document anything. If you agree with the above, feel free to draft a documentation update patch if you'd like, that could also contain the code to prohibit virtio-balloon page reporting when doing live snapshots. IMHO if the above issue exists, it'll indeed be better to implement the MISSING tracking as well with UFFDIO_ZEROPAGE - keeping trash pages in the live snapshot is still not optimal, since for a high dirty rate guest the number of trash pages could still be huge. It would definitely be more challenging than only doing wr-protect, so we may need multi-threading at least, but it'd be a pity if live snapshots randomly failed for some guests, even if rarely. The thing is the host cannot easily detect that guest config, so we can't simply block users from taking snapshots even though, in the end, the snapshot could be silently broken. Thanks,
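To make the allocation path quoted above concrete, here is a condensed userspace model of it (paraphrased from mm/page_alloc.c of that era; simplified, not the verbatim kernel code). With init_on_free enabled, free_pages_prezeroed() is true, so the page is not re-zeroed at allocation time even for __GFP_ZERO - the kernel trusts the free-time zeroing that, in the snapshot scenario, the host has silently undone:

#include <stdbool.h>
#include <string.h>

static bool init_on_free  = true;   /* models the init_on_free=1 boot option */
static bool init_on_alloc = false;

static bool want_init_on_free(void) { return init_on_free; }
static bool want_init_on_alloc(bool gfp_zero) { return init_on_alloc || gfp_zero; }

/* Kernel assumption: pages were zeroed when freed, so zeroing them again
 * at allocation time would be redundant work. */
static bool free_pages_prezeroed(void) { return want_init_on_free(); }

static void prep_new_page(char *page, size_t size, bool gfp_zero)
{
    if (!free_pages_prezeroed() && want_init_on_alloc(gfp_zero))
        memset(page, 0, size);  /* models kernel_init_free_pages() */
    /* else: page handed out as-is. If the host replaced its content
     * behind the guest's back (the snapshot case), __GFP_ZERO is
     * silently violated. */
}

int main(void)
{
    char page[4096];

    memset(page, 0xAA, sizeof(page)); /* "random" stale content */
    prep_new_page(page, sizeof(page), true /* __GFP_ZERO */);
    /* With init_on_free=1 the page still holds 0xAA here: returns 1. */
    return page[0] == 0 ? 0 : 1;
}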
On 17.02.21 00:35, Peter Xu wrote: > Hi, Andrey, > > On Sat, Feb 13, 2021 at 12:34:07PM +0300, Andrey Gruzdev wrote: >> On 12.02.2021 19:11, Peter Xu wrote: >>> On Fri, Feb 12, 2021 at 09:52:52AM +0100, David Hildenbrand wrote: >>>> On 12.02.21 04:06, Peter Xu wrote: >>>>> On Thu, Feb 11, 2021 at 10:09:58PM +0100, David Hildenbrand wrote: >>>>>> The issue is when the discard happened before starting the snapshot. Write-protection won‘t work and the zeroed content won‘t be retained in the snapshot. >>>>> I see what you mean now, and iiuc it will only be a problem if init_on_free=1. >>>>> I think CONFIG_INIT_ON_FREE_DEFAULT_ON should be off for most distros, so the >>>> Yes, some distros seem to enable init_on_alloc instead. Looking at the >>>> introducing commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 >>>> and init_on_free=1 boot options") there are security use cases and it might >>>> become important with memory tagging. >>>> >>>> Note that in Linux, there was also the option to poison pages with 0, >>>> removed via f289041ed4cf ("mm, page_poison: remove >>>> CONFIG_PAGE_POISONING_ZERO"), available in some kernels that supported free >>>> page reporting. >>>> >>>> It got removed and use cases got told to use init_on_free. >> >> I think we talk about init_on_free()/init_on_alloc() on guest side, right? > > Right. IIUC it's the init_on_free() that matters. > > We'll have no issue if init_on_alloc=1 && init_on_free=0, since in that case > all pages will be zeroed after all before the new page returned to the caller > to allocate the page. Then we're safe, I think. > >> Still can't get how it relates to host's unpopulated pages.. >> Try to look from hardware side. Untouched SDRAM in hardware is required to contain zeroes somehow? No. >> These 'trash' pages in migration stream are like never written physical memory pages, they are really >> not needed in snapshot but they don't do any harm as well as there's no harm in that never-written physical >> page is full of garbage. >> >> Do these 'trash' pages in snapshot contain sensitive information not allowed to be accessed by the same VM? >> I think no. Or we need a good example how it can be potentially exploited. >> >> The only issue that I see is madvise(MADV_DONTNEED) for RAM blocks during snapshotting. And free page reporting >> or memory balloon is secondary - the point is that UFFD_WP snapshot is incompatible with madvise(MADV_DONTNEED) on >> hypervisor side. No matter which guest functionality can induce it. > > I think the problem is if with init_on_free=1, the kernel will assume that > all the pages that got freed has been zeroed before-hand so it thinks that it's > a waste of time to zero it again when the page is reused/reallocated. As a > reference see kernel prep_new_page() where there's: > > if (!free_pages_prezeroed() && want_init_on_alloc(gfp_flags)) > kernel_init_free_pages(page, 1 << order); > > In this case I believe free_pages_prezeroed() will return true, then we don't > even need to check want_init_on_alloc() at all. Note that it'll cover all the > cases where kernel allocates with __GFP_ZERO: it means it could happen that > even the guest kernel tries to alloc_page(__GFP_ZERO) it may got a page with > random data after the live snapshot is loaded. So it's not about any hardware, > it's the optimization of guest kernel instead. It is actually reasonable and > efficient since if we *know* that page is zero page then we shouldn't bother > zeroing it again. 
However it brought us a bit of trouble on live snapshot that > the current solution might not work for all guest OS configurations. Adding to that, we are so far talking about how Linux *currently* implements it, but that is just an instance of the problem where it could happen in practice. Free page reporting documents in the spec that with the right configuration, previously reported free pages are guaranteed to retain a certain value (e.g., 0) when re-accessed. So any future guest changes that rely on the virtio spec (e.g., Windows support) would be problematic - as these pages in the snapshot don't actually keep the value.
On 17.02.2021 02:35, Peter Xu wrote: > Hi, Andrey, > > On Sat, Feb 13, 2021 at 12:34:07PM +0300, Andrey Gruzdev wrote: >> On 12.02.2021 19:11, Peter Xu wrote: >>> On Fri, Feb 12, 2021 at 09:52:52AM +0100, David Hildenbrand wrote: >>>> On 12.02.21 04:06, Peter Xu wrote: >>>>> On Thu, Feb 11, 2021 at 10:09:58PM +0100, David Hildenbrand wrote: >>>>>> The issue is when the discard happened before starting the snapshot. Write-protection won‘t work and the zeroed content won‘t be retained in the snapshot. >>>>> I see what you mean now, and iiuc it will only be a problem if init_on_free=1. >>>>> I think CONFIG_INIT_ON_FREE_DEFAULT_ON should be off for most distros, so the >>>> Yes, some distros seem to enable init_on_alloc instead. Looking at the >>>> introducing commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 >>>> and init_on_free=1 boot options") there are security use cases and it might >>>> become important with memory tagging. >>>> >>>> Note that in Linux, there was also the option to poison pages with 0, >>>> removed via f289041ed4cf ("mm, page_poison: remove >>>> CONFIG_PAGE_POISONING_ZERO"), available in some kernels that supported free >>>> page reporting. >>>> >>>> It got removed and use cases got told to use init_on_free. >> I think we talk about init_on_free()/init_on_alloc() on guest side, right? > Right. IIUC it's the init_on_free() that matters. > > We'll have no issue if init_on_alloc=1 && init_on_free=0, since in that case > all pages will be zeroed after all before the new page returned to the caller > to allocate the page. Then we're safe, I think. > >> Still can't get how it relates to host's unpopulated pages.. >> Try to look from hardware side. Untouched SDRAM in hardware is required to contain zeroes somehow? No. >> These 'trash' pages in migration stream are like never written physical memory pages, they are really >> not needed in snapshot but they don't do any harm as well as there's no harm in that never-written physical >> page is full of garbage. >> >> Do these 'trash' pages in snapshot contain sensitive information not allowed to be accessed by the same VM? >> I think no. Or we need a good example how it can be potentially exploited. >> >> The only issue that I see is madvise(MADV_DONTNEED) for RAM blocks during snapshotting. And free page reporting >> or memory balloon is secondary - the point is that UFFD_WP snapshot is incompatible with madvise(MADV_DONTNEED) on >> hypervisor side. No matter which guest functionality can induce it. > I think the problem is if with init_on_free=1, the kernel will assume that > all the pages that got freed has been zeroed before-hand so it thinks that it's > a waste of time to zero it again when the page is reused/reallocated. As a > reference see kernel prep_new_page() where there's: > > if (!free_pages_prezeroed() && want_init_on_alloc(gfp_flags)) > kernel_init_free_pages(page, 1 << order); > > In this case I believe free_pages_prezeroed() will return true, then we don't > even need to check want_init_on_alloc() at all. Note that it'll cover all the > cases where kernel allocates with __GFP_ZERO: it means it could happen that > even the guest kernel tries to alloc_page(__GFP_ZERO) it may got a page with > random data after the live snapshot is loaded. So it's not about any hardware, > it's the optimization of guest kernel instead. It is actually reasonable and > efficient since if we *know* that page is zero page then we shouldn't bother > zeroing it again. 
However it brought us a bit of trouble on live snapshot that > the current solution might not work for all guest OS configurations. So I'd like to clarify that the guest's init_on_alloc/init_on_free cannot bring any problems if we don't consider virtio-balloon here.. If these pages have been zeroed inside the guest, they have certainly been populated on the host, and UFFD_WP should work as expected. >>>>> impact should be small, I think. I thought about it, but indeed I didn't see a >>>>> good way to fix this without fixing the zero page copy for live snapshot. >>>> We should really document this (unexpected) behavior of snapshotting. >>>> Otherwise, the next feature comes around that relies on pages that were >>>> discarded to remain zeroed (I even have one in mind ;) ) and forgets to >>>> disable snapshots. >>> Agreed. I'll see whether Andrey would have any idea to work around this, or >>> further comments. Or I can draft a patch to document this next week (unless >>> Andrey beats me to it :). >>> >> It's really better to document this specific behaviour, but also to clarify that the saved state remains >> consistent and secure, of course only if you agree with my arguments. > Yes, no rush on anything yet, let's reach a consensus on understanding the > problem before trying to even document anything. If you agree with the above, feel > free to draft a documentation update patch if you'd like, that could also > contain the code to prohibit virtio-balloon page reporting when doing live snapshots. > > IMHO if the above issue exists, it'll indeed be better to implement the MISSING > tracking as well with UFFDIO_ZEROPAGE - keeping trash > pages in the live snapshot is still not optimal, since for a high dirty rate guest the number of > trash pages could still be huge. It would definitely be more challenging than > only doing wr-protect, so we may need multi-threading at least, but it'd be a pity > if live snapshots randomly failed for some guests, even if rarely. The thing is > the host cannot easily detect that guest config, so we can't simply block > users from taking snapshots even though, in the end, the snapshot could be silently broken. > > Thanks, > Agree, let's reach consensus before doing anything. For the concurrent RAM block discards it seems clear enough that they should be disabled for the duration of the snapshot, like it's done when incoming postcopy is active. For the discards that happen before the snapshot is started, I need to dig into the Linux and QEMU virtio-balloon code more to get clear on it. Anyhow I'm quite sure that adding a global MISSING handler for snapshotting is too heavy and not really needed. Thanks,
On 19.02.21 07:57, Andrey Gruzdev wrote: > On 17.02.2021 02:35, Peter Xu wrote: >> Hi, Andrey, >> >> On Sat, Feb 13, 2021 at 12:34:07PM +0300, Andrey Gruzdev wrote: >>> On 12.02.2021 19:11, Peter Xu wrote: >>>> On Fri, Feb 12, 2021 at 09:52:52AM +0100, David Hildenbrand wrote: >>>>> On 12.02.21 04:06, Peter Xu wrote: >>>>>> On Thu, Feb 11, 2021 at 10:09:58PM +0100, David Hildenbrand wrote: >>>>>>> The issue is when the discard happened before starting the snapshot. Write-protection won‘t work and the zeroed content won‘t be retained in the snapshot. >>>>>> I see what you mean now, and iiuc it will only be a problem if init_on_free=1. >>>>>> I think CONFIG_INIT_ON_FREE_DEFAULT_ON should be off for most distros, so the >>>>> Yes, some distros seem to enable init_on_alloc instead. Looking at the >>>>> introducing commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 >>>>> and init_on_free=1 boot options") there are security use cases and it might >>>>> become important with memory tagging. >>>>> >>>>> Note that in Linux, there was also the option to poison pages with 0, >>>>> removed via f289041ed4cf ("mm, page_poison: remove >>>>> CONFIG_PAGE_POISONING_ZERO"), available in some kernels that supported free >>>>> page reporting. >>>>> >>>>> It got removed and use cases got told to use init_on_free. >>> I think we talk about init_on_free()/init_on_alloc() on guest side, right? >> Right. IIUC it's the init_on_free() that matters. >> >> We'll have no issue if init_on_alloc=1 && init_on_free=0, since in that case >> all pages will be zeroed after all before the new page returned to the caller >> to allocate the page. Then we're safe, I think. >> >>> Still can't get how it relates to host's unpopulated pages.. >>> Try to look from hardware side. Untouched SDRAM in hardware is required to contain zeroes somehow? No. >>> These 'trash' pages in migration stream are like never written physical memory pages, they are really >>> not needed in snapshot but they don't do any harm as well as there's no harm in that never-written physical >>> page is full of garbage. >>> >>> Do these 'trash' pages in snapshot contain sensitive information not allowed to be accessed by the same VM? >>> I think no. Or we need a good example how it can be potentially exploited. >>> >>> The only issue that I see is madvise(MADV_DONTNEED) for RAM blocks during snapshotting. And free page reporting >>> or memory balloon is secondary - the point is that UFFD_WP snapshot is incompatible with madvise(MADV_DONTNEED) on >>> hypervisor side. No matter which guest functionality can induce it. >> I think the problem is if with init_on_free=1, the kernel will assume that >> all the pages that got freed has been zeroed before-hand so it thinks that it's >> a waste of time to zero it again when the page is reused/reallocated. As a >> reference see kernel prep_new_page() where there's: >> >> if (!free_pages_prezeroed() && want_init_on_alloc(gfp_flags)) >> kernel_init_free_pages(page, 1 << order); >> >> In this case I believe free_pages_prezeroed() will return true, then we don't >> even need to check want_init_on_alloc() at all. Note that it'll cover all the >> cases where kernel allocates with __GFP_ZERO: it means it could happen that >> even the guest kernel tries to alloc_page(__GFP_ZERO) it may got a page with >> random data after the live snapshot is loaded. So it's not about any hardware, >> it's the optimization of guest kernel instead. 
It is actually reasonable and >> efficient since if we *know* that page is zero page then we shouldn't bother >> zeroing it again. However it brought us a bit of trouble on live snapshot that >> the current solution might not work for all guest OS configurations. > > So I'd like to clarify that guest's init_on_alloc/init_on_free cannot bring any problems if we don't > consider virtio-baloon here.. If these pages have been zeroed inside the guest they certainly have got > populated on the host. And UFFD_WP should work as expected. Yes, no discards, no problem.
Andrey, On Fri, Feb 19, 2021 at 09:57:37AM +0300, Andrey Gruzdev wrote: > For the discards that happen before the snapshot is started, I need to dig into the Linux and QEMU virtio-balloon > code more to get clear on it. Yes, it's very tricky how the error could trigger. Let's think of the below sequence: - Start a guest with init_on_free=1 set and also a virtio-balloon device - Guest frees a page P and zeroes it (since init_on_free=1). Now P contains all zeros. - Virtio-balloon reports this page to the host, MADV_DONTNEED sent, then this page is dropped on the host. - Start live snapshot, wr-protect all pages (but not including page P because it's currently missing). Let's call it $SNAPSHOT1. - Guest does alloc_page(__GFP_ZERO), accidentally fetching this page P and getting it returned - So far, page P is still all zero (which is good!), then the guest uses page P and writes data to it (say, now P has data P1 rather than all zeros). - Live snapshot saves page P, whose content is P1 rather than all zeros. - Live snapshot completed. Saved as $SNAPSHOT1. Then when loading snapshot $SNAPSHOT1, we'll have P containing data P1. After the snapshot is loaded, when the guest allocates this page P again with alloc_page(__GFP_ZERO), the guest kernel "thinks" this page is all-zero already, so memzero() is skipped even if __GFP_ZERO is provided. Then this page P (with content P1) gets returned for the alloc_page(__GFP_ZERO) even with __GFP_ZERO set. That could break the caller of alloc_page(). > Anyhow I'm quite sure that adding global MISSING handler for snapshotting > is too heavy and not really needed. UFFDIO_ZEROPAGE installs a zero pfn and that should be all of it. There'll definitely be overhead, but it may not be as huge as imagined. Live snapshot is great in that we have a point-in-time image of the guest without stopping it, so taking slightly longer won't be a huge loss either. Actually we can also think of other ways to work around it. One way is we can pre-fault all guest pages before wr-protect. Note that we don't need to write to the guest page because a read would suffice, since uffd-wp also works with the zero pfn. It's just that this workaround won't help with saving snapshot disk space, but it seems to work. It would be great if you have other workarounds; maybe, as you said, UFFDIO_ZEROPAGE is not the only route. Thanks,
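For reference, resolving a MISSING fault with a zero pfn is a one-ioctl affair once a region is registered with UFFDIO_REGISTER_MODE_MISSING; the uapi name is UFFDIO_ZEROPAGE. A sketch of just the resolution step, where fault_addr and page_size are assumed inputs coming from the fault-handling thread:

#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>

static int resolve_missing_with_zeropage(int uffd, uint64_t fault_addr,
                                         uint64_t page_size)
{
    struct uffdio_zeropage zp = {
        .range = {
            .start = fault_addr & ~(page_size - 1),
            .len   = page_size,
        },
        .mode = 0, /* wake the faulting thread when done */
    };

    /* Installs the shared zero page at the faulting address instead of
     * copying data in, so the guest reads zeroes as it expects. */
    return ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
}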
On Fri, Feb 19, 2021 at 03:50:52PM -0500, Peter Xu wrote: > Andrey, > > On Fri, Feb 19, 2021 at 09:57:37AM +0300, Andrey Gruzdev wrote: > > For the discards that happen before snapshot is started, I need to dig into Linux and QEMU virtio-balloon > > code more to get clear on it. > > Yes it's very tricky on how the error could trigger. > > Let's think of below sequence: > > - Start a guest with init_on_free=1 set and also a virtio-balloon device > > - Guest frees a page P and zeroed it (since init_on_free=1). Now P contains > all zeros. > > - Virtio-balloon reports this page to host, MADV_DONTNEED sent, then this > page is dropped on the host. > > - Start live snapshot, wr-protect all pages (but not including page P because > it's currently missing). Let's call it $SNAPSHOT1. > > - Guest does alloc_page(__GFP_ZERO), accidentally fetching this page P and > returned > > - So far, page P is still all zero (which is good!), then guest uses page P > and writes data to it (say, now P has data P1 rather than all zeros). > > - Live snapshot saves page P, which content P1 rather than all zeros. > > - Live snapshot completed. Saved as $SNAPSHOT1. > > Then when load snapshot $SNAPSHOT1, we'll have P contains data P1. After > snapshot loaded, when guest allocate again with alloc_page(__GFP_ZERO) on this > page P, since guest kernel "thought" this page is all-zero already so memzero() > is skipped even if __GFP_ZERO is provided. Then this page P (with content P1) > got returned for the alloc_page(__GFP_ZERO) even if __GFP_ZERO set. That could > break the caller of alloc_page(). > > > Anyhow I'm quite sure that adding global MISSING handler for snapshotting > > is too heavy and not really needed. > > UFFDIO_ZEROPAGE installs a zero pfn and that should be all of it. There'll > definitely be overhead, but it may not be that huge as imagined. Live snapshot > is great in that we have point-in-time image of guest without stopping the > guest, so taking slightly longer time won't be a huge loss to us too. > > Actually we can also think of other ways to work around it. One way is we can > pre-fault all guest pages before wr-protect. Note that we don't need to write > to the guest page because read would suffice, since uffd-wp would also work > with zero pfn. It's just that this workaround won't help on saving snapshot > disk space, but it seems working. It would be great if you have other > workarounds, maybe as you said UFFDIO_ZEROPAGE is not the only route. Wait.. it actually seems to also solve the disk usage issue.. :) We should just need to make sure to prohibit the balloon before starting to pre-fault reads on all guest RAM. Seems awkward, but also seems working.. Hmm..
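A sketch of that pre-fault pass, under the assumptions above (the balloon is inhibited first; host_addr/size are assumed to describe the RAMBlock's host mapping). One read per page is enough to populate the shared zero pfn, which both gives uffd-wp something to protect and lets the snapshot writer detect the page as zero instead of storing garbage:

#include <stddef.h>
#include <stdint.h>

static void prefault_range(volatile const uint8_t *host_addr, size_t size,
                           size_t page_size)
{
    uint64_t sink = 0;

    /* A read fault on an unpopulated anonymous page maps the shared
     * zero pfn; no write is needed, and uffd-wp works on it afterwards. */
    for (size_t off = 0; off < size; off += page_size) {
        sink += host_addr[off];
    }
    (void)sink; /* keep the loop from being optimized away */
}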
> Am 19.02.2021 um 22:10 schrieb Peter Xu <peterx@redhat.com>: > > On Fri, Feb 19, 2021 at 03:50:52PM -0500, Peter Xu wrote: >> Andrey, >> >>> On Fri, Feb 19, 2021 at 09:57:37AM +0300, Andrey Gruzdev wrote: >>> For the discards that happen before snapshot is started, I need to dig into Linux and QEMU virtio-baloon >>> code more to get clear with it. >> >> Yes it's very tricky on how the error could trigger. >> >> Let's think of below sequence: >> >> - Start a guest with init_on_free=1 set and also a virtio-balloon device >> >> - Guest frees a page P and zeroed it (since init_on_free=1). Now P contains >> all zeros. >> >> - Virtio-balloon reports this page to host, MADV_DONTNEED sent, then this >> page is dropped on the host. >> >> - Start live snapshot, wr-protect all pages (but not including page P because >> it's currently missing). Let's call it $SNAPSHOT1. >> >> - Guest does alloc_page(__GFP_ZERO), accidentally fetching this page P and >> returned >> >> - So far, page P is still all zero (which is good!), then guest uses page P >> and writes data to it (say, now P has data P1 rather than all zeros). >> >> - Live snapshot saves page P, which content P1 rather than all zeros. >> >> - Live snapshot completed. Saved as $SNAPSHOT1. >> >> Then when load snapshot $SNAPSHOT1, we'll have P contains data P1. After >> snapshot loaded, when guest allocate again with alloc_page(__GFP_ZERO) on this >> page P, since guest kernel "thought" this page is all-zero already so memzero() >> is skipped even if __GFP_ZERO is provided. Then this page P (with content P1) >> got returned for the alloc_page(__GFP_ZERO) even if __GFP_ZERO set. That could >> break the caller of alloc_page(). >> >>> Anyhow I'm quite sure that adding global MISSING handler for snapshotting >>> is too heavy and not really needed. >> >> UFFDIO_ZEROCOPY installs a zero pfn and that should be all of it. There'll >> definitely be overhead, but it may not be that huge as imagined. Live snapshot >> is great in that we have point-in-time image of guest without stopping the >> guest, so taking slightly longer time won't be a huge loss to us too. >> >> Actually we can also think of other ways to work around it. One way is we can >> pre-fault all guest pages before wr-protect. Note that we don't need to write >> to the guest page because read would suffice, since uffd-wp would also work >> with zero pfn. It's just that this workaround won't help on saving snapshot >> disk space, but it seems working. It would be great if you have other >> workarounds, maybe as you said UFFDIO_ZEROCOPY is not the only route. > > Wait.. it actually seems to also solve the disk usage issue.. :) > > We should just need to make sure to prohibit balloon before staring to > pre-fault read on all guest ram. Seems awkward, but also seems working.. Hmm.. A shiver just went down my spine. Please don‘t just for the sake of creating a snapshot. (Just imagine you don‘t have a shared zeropage...) > -- > Peter Xu >
> Am 19.02.2021 um 22:14 schrieb David Hildenbrand <dhildenb@redhat.com>: > > >>> Am 19.02.2021 um 22:10 schrieb Peter Xu <peterx@redhat.com>: >>> >>> On Fri, Feb 19, 2021 at 03:50:52PM -0500, Peter Xu wrote: >>> Andrey, >>> >>>> On Fri, Feb 19, 2021 at 09:57:37AM +0300, Andrey Gruzdev wrote: >>>> For the discards that happen before snapshot is started, I need to dig into Linux and QEMU virtio-baloon >>>> code more to get clear with it. >>> >>> Yes it's very tricky on how the error could trigger. >>> >>> Let's think of below sequence: >>> >>> - Start a guest with init_on_free=1 set and also a virtio-balloon device >>> >>> - Guest frees a page P and zeroed it (since init_on_free=1). Now P contains >>> all zeros. >>> >>> - Virtio-balloon reports this page to host, MADV_DONTNEED sent, then this >>> page is dropped on the host. >>> >>> - Start live snapshot, wr-protect all pages (but not including page P because >>> it's currently missing). Let's call it $SNAPSHOT1. >>> >>> - Guest does alloc_page(__GFP_ZERO), accidentally fetching this page P and >>> returned >>> >>> - So far, page P is still all zero (which is good!), then guest uses page P >>> and writes data to it (say, now P has data P1 rather than all zeros). >>> >>> - Live snapshot saves page P, which content P1 rather than all zeros. >>> >>> - Live snapshot completed. Saved as $SNAPSHOT1. >>> >>> Then when load snapshot $SNAPSHOT1, we'll have P contains data P1. After >>> snapshot loaded, when guest allocate again with alloc_page(__GFP_ZERO) on this >>> page P, since guest kernel "thought" this page is all-zero already so memzero() >>> is skipped even if __GFP_ZERO is provided. Then this page P (with content P1) >>> got returned for the alloc_page(__GFP_ZERO) even if __GFP_ZERO set. That could >>> break the caller of alloc_page(). >>> >>>> Anyhow I'm quite sure that adding global MISSING handler for snapshotting >>>> is too heavy and not really needed. >>> >>> UFFDIO_ZEROCOPY installs a zero pfn and that should be all of it. There'll >>> definitely be overhead, but it may not be that huge as imagined. Live snapshot >>> is great in that we have point-in-time image of guest without stopping the >>> guest, so taking slightly longer time won't be a huge loss to us too. >>> >>> Actually we can also think of other ways to work around it. One way is we can >>> pre-fault all guest pages before wr-protect. Note that we don't need to write >>> to the guest page because read would suffice, since uffd-wp would also work >>> with zero pfn. It's just that this workaround won't help on saving snapshot >>> disk space, but it seems working. It would be great if you have other >>> workarounds, maybe as you said UFFDIO_ZEROCOPY is not the only route. >> >> Wait.. it actually seems to also solve the disk usage issue.. :) >> >> We should just need to make sure to prohibit balloon before staring to >> pre-fault read on all guest ram. Seems awkward, but also seems working.. Hmm.. > > A shiver just went down my spine. Please don‘t just for the sake of creating a snapshot. > > (Just imagine you don‘t have a shared zeropage...) ... and I just remembered we read all memory either way. Gah. I have some patches to make snapshots fly with virtio-mem so exactly that won‘t happen. But they depend on vfio support, so it might take a while.
On Fri, Feb 19, 2021 at 10:20:42PM +0100, David Hildenbrand wrote: > > A shiver just went down my spine. Please don't just for the sake of creating a snapshot. > > > > (Just imagine you don't have a shared zeropage...) > > ... and I just remembered we read all memory either way. Gah. > > I have some patches to make snapshots fly with virtio-mem so exactly that won't happen. But they depend on vfio support, so it might take a while. Sorry I can't really follow. It'll be great if virtio-mem won't have a similar problem with live snapshot in the end. Is that idea applicable to balloon too, then?
> Am 19.02.2021 um 23:47 schrieb Peter Xu <peterx@redhat.com>: > > On Fri, Feb 19, 2021 at 10:20:42PM +0100, David Hildenbrand wrote: >>> A shiver just went down my spine. Please don‘t just for the sake of creating a snapshot. >>> >>> (Just imagine you don‘t have a shared zeropage...) >> >> ... and I just remembered we read all memory either way. Gah. >> >> I have some patches to make snapshots fly with virtio-mem so exactly that won‘t happen. But they depend on vfio support, so it might take a while. > > Sorry I can't really follow. > Live snapshotting ends up reading all guest memory (dirty bitmap starts with all 1s), which is not what we want for virtio-mem - we don’t want to read and migrate memory that has been discarded and has no stable content. For ordinary migration we use the guest page hint API to clear bits in the dirty bitmap after dirty bitmap sync. Well, if we don‘t do bitmap syncs we‘ll never clear any dirty bits. That‘s the problem. > It'll be great if virtio-mem won't have similar problem with live snapshot > finally. Is that idea applicable to balloon too, then? The idea is to reuse the RamDiscardMgr as to be introduced with vfio support to teach migration code which parts of specific RAMBlock never to migrate. It avoids the hack with reusing the guest page hint API. As virtio-balloon is not applicable to RamDiscardMgr (and virtio-balloon has no memory about what has been inflated) it won‘t affect virtio-balloon. > > -- > Peter Xu >
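A hedged sketch of the interface shape David hints at - a per-RAMBlock discard manager that migration queries up front, rather than encoding "never migrate this" into the dirty bitmap. All names here are illustrative, since the real interface was still being developed at the time:

#include <stdint.h>

typedef struct RAMBlockStub RAMBlockStub; /* stand-in for QEMU's RAMBlock */

typedef int (*PopulatedSectionFn)(uint64_t offset, uint64_t len, void *opaque);

typedef struct RamDiscardMgrSketch {
    /* Invoke fn for every currently-populated range of the block, so the
     * migration code can seed its bitmap with 1s only where content is
     * present and stable. */
    int (*replay_populated)(const RAMBlockStub *rb, PopulatedSectionFn fn,
                            void *opaque);
} RamDiscardMgrSketch;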
On Sat, Feb 20, 2021 at 02:59:42AM -0500, David Hildenbrand wrote: > Live snapshotting ends up reading all guest memory (dirty bitmap starts with all 1s), which is not what we want for virtio-mem - we don't want to read and migrate memory that has been discarded and has no stable content. > > For ordinary migration we use the guest page hint API to clear bits in the dirty bitmap after dirty bitmap sync. Well, if we don't do bitmap syncs we'll never clear any dirty bits. That's the problem. Using the dirty bitmap for that information is less efficient, because it's definitely larger-granularity information than PAGE_SIZE. If the discarded ranges are always contiguous and at the end of a memory region, we should have some parameter in the ramblock showing where we got shrunk; then we don't check the dirty bitmap at all, rather than always assuming used_length is the one. Thanks,
On 22.02.21 18:29, Peter Xu wrote: > On Sat, Feb 20, 2021 at 02:59:42AM -0500, David Hildenbrand wrote: >> Live snapshotting ends up reading all guest memory (dirty bitmap starts with all 1s), which is not what we want for virtio-mem - we don’t want to read and migrate memory that has been discarded and has no stable content. >> >> For ordinary migration we use the guest page hint API to clear bits in the dirty bitmap after dirty bitmap sync. Well, if we don‘t do bitmap syncs we‘ll never clear any dirty bits. That‘s the problem. > > Using dirty bitmap for that information is less efficient, becase it's > definitely a larger granularity information than PAGE_SIZE. If the disgarded > ranges are always continuous and at the end of a memory region, we should have > some parameter in the ramblock showing that where we got shrinked then we don't > check dirty bitmap at all, rather than always assuming used_length is the one. They are randomly scattered across the whole RAMBlock. Shrinking/growing will be done to some degree in the future (but it won't get rid of the general sparse layout we can produce).
On Mon, Feb 22, 2021 at 06:33:27PM +0100, David Hildenbrand wrote: > On 22.02.21 18:29, Peter Xu wrote: > > On Sat, Feb 20, 2021 at 02:59:42AM -0500, David Hildenbrand wrote: > > > Live snapshotting ends up reading all guest memory (dirty bitmap starts with all 1s), which is not what we want for virtio-mem - we don't want to read and migrate memory that has been discarded and has no stable content. > > > > > > For ordinary migration we use the guest page hint API to clear bits in the dirty bitmap after dirty bitmap sync. Well, if we don't do bitmap syncs we'll never clear any dirty bits. That's the problem. > > > > Using the dirty bitmap for that information is less efficient, because it's > > definitely larger-granularity information than PAGE_SIZE. If the discarded > > ranges are always contiguous and at the end of a memory region, we should have > > some parameter in the ramblock showing where we got shrunk; then we don't > > check the dirty bitmap at all, rather than always assuming used_length is the one. > > They are randomly scattered across the whole RAMBlock. Shrinking/growing > will be done to some degree in the future (but it won't get rid of the > general sparse layout we can produce). OK. Btw I think currently live snapshot should still be reading the dirty bitmap, so maybe it's still fine. It's just that it's still not very clear to hide virtio-mem information in the dirty bitmap, imho, since that's not how we interpret the dirty bitmap - it is only for the sake of tracking page changes. What's the granule of virtio-mem for this discard behavior? Maybe we could decouple it from the dirty bitmap some day; if the unit is big enough it's also an efficiency gain, since we'd skip whole chunks rather than looping over tons of pages knowing that they're discarded. Thanks,
On 22.02.21 18:54, Peter Xu wrote: > On Mon, Feb 22, 2021 at 06:33:27PM +0100, David Hildenbrand wrote: >> On 22.02.21 18:29, Peter Xu wrote: >>> On Sat, Feb 20, 2021 at 02:59:42AM -0500, David Hildenbrand wrote: >>>> Live snapshotting ends up reading all guest memory (dirty bitmap starts with all 1s), which is not what we want for virtio-mem - we don’t want to read and migrate memory that has been discarded and has no stable content. >>>> >>>> For ordinary migration we use the guest page hint API to clear bits in the dirty bitmap after dirty bitmap sync. Well, if we don‘t do bitmap syncs we‘ll never clear any dirty bits. That‘s the problem. >>> >>> Using dirty bitmap for that information is less efficient, becase it's >>> definitely a larger granularity information than PAGE_SIZE. If the disgarded >>> ranges are always continuous and at the end of a memory region, we should have >>> some parameter in the ramblock showing that where we got shrinked then we don't >>> check dirty bitmap at all, rather than always assuming used_length is the one. >> >> They are randomly scattered across the whole RAMBlock. Shrinking/growing >> will be done to some degree in the future (but it won't get rid of the >> general sparse layout we can produce). > > OK. Btw I think currently live snapshot should still be reading dirty bitmap, > so maybe it's still fine. It's just that it's still not very clear to hide > virtio-mem information into dirty bitmap, imho, since that's not how we > interpret dirty bitmap - which is only for the sake of tracking page changes. Well, currently it is "what do we have to migrate". > > What's the granule of virtio-mem for this discard behavior? Maybe we could virtio-mem granularity is at least 1MB. This corresponds to 256 bits (32 bytes) in the dirty bitmap I think. > decouple it with dirty bitmap some day; if the unit is big enough it's also a > gain on efficiency so we skip in chunk rather than looping over tons of pages > knowing that they're discarded. Yeah, it's not optimal having to go over the dirty bitmap to cross off "discarded" parts and later having to find bits to migrate. At least find_next_bit() can skip whole longs (8 bytes) and is fairly efficient. There is certainly room for improvement (the current guest free page hinting API is certainly a hack).
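To illustrate the "cross off discarded parts" step with the numbers from this exchange (1MB virtio-mem blocks, i.e. 256 bits of 4k pages per block), a minimal self-contained sketch with a stand-in for QEMU's bitmap_clear():

#include <stdint.h>

#define PAGE_SIZE     4096ULL
#define BLOCK_SIZE    (1024ULL * 1024)  /* 1MB virtio-mem granularity */
#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Minimal stand-in for QEMU's bitmap_clear(). */
static void bitmap_clear(unsigned long *map, uint64_t start, uint64_t count)
{
    for (uint64_t bit = start; bit < start + count; bit++)
        map[bit / BITS_PER_LONG] &= ~(1UL << (bit % BITS_PER_LONG));
}

/* Cross one discarded 1MB block (256 page bits) off the dirty bitmap so
 * the migration loop never reads or saves it. */
static void drop_discarded_block(unsigned long *dirty_bitmap, uint64_t offset)
{
    bitmap_clear(dirty_bitmap, offset / PAGE_SIZE, BLOCK_SIZE / PAGE_SIZE);
}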
On 19.02.2021 23:50, Peter Xu wrote: > Andrey, > > On Fri, Feb 19, 2021 at 09:57:37AM +0300, Andrey Gruzdev wrote: >> For the discards that happen before snapshot is started, I need to dig into Linux and QEMU virtio-balloon >> code more to get clear on it. > Yes it's very tricky on how the error could trigger. > > Let's think of below sequence: > > - Start a guest with init_on_free=1 set and also a virtio-balloon device > > - Guest frees a page P and zeroed it (since init_on_free=1). Now P contains > all zeros. > > - Virtio-balloon reports this page to host, MADV_DONTNEED sent, then this > page is dropped on the host. > > - Start live snapshot, wr-protect all pages (but not including page P because > it's currently missing). Let's call it $SNAPSHOT1. > > - Guest does alloc_page(__GFP_ZERO), accidentally fetching this page P and > returned > > - So far, page P is still all zero (which is good!), then guest uses page P > and writes data to it (say, now P has data P1 rather than all zeros). > > - Live snapshot saves page P, which content P1 rather than all zeros. > > - Live snapshot completed. Saved as $SNAPSHOT1. > > Then when load snapshot $SNAPSHOT1, we'll have P contains data P1. After > snapshot loaded, when guest allocate again with alloc_page(__GFP_ZERO) on this > page P, since guest kernel "thought" this page is all-zero already so memzero() > is skipped even if __GFP_ZERO is provided. Then this page P (with content P1) > got returned for the alloc_page(__GFP_ZERO) even if __GFP_ZERO set. That could > break the caller of alloc_page(). Yep, that's quite clear. >> Anyhow I'm quite sure that adding global MISSING handler for snapshotting >> is too heavy and not really needed. > UFFDIO_ZEROPAGE installs a zero pfn and that should be all of it. There'll > definitely be overhead, but it may not be that huge as imagined. Live snapshot > is great in that we have point-in-time image of guest without stopping the > guest, so taking slightly longer time won't be a huge loss to us too. > > Actually we can also think of other ways to work around it. One way is we can > pre-fault all guest pages before wr-protect. Note that we don't need to write > to the guest page because read would suffice, since uffd-wp would also work > with zero pfn. It's just that this workaround won't help on saving snapshot > disk space, but it seems working. It would be great if you have other > workarounds, maybe as you said UFFDIO_ZEROPAGE is not the only route. > > Thanks, > Just to add: one of the good options is to keep track of virtio-balloon discarded pages and pre-fault them before migration starts. What do you think?
>>> Anyhow I'm quite sure that adding a global MISSING handler for snapshotting is too heavy and not really needed.
>> UFFDIO_ZEROPAGE installs a zero pfn and that should be all of it. There'll definitely be overhead, but it may not be as huge as imagined. Live snapshot is great in that we have a point-in-time image of the guest without stopping the guest, so taking slightly longer won't be a huge loss to us either.
>>
>> Actually we can also think of other ways to work around it. One way is that we can pre-fault all guest pages before wr-protect. Note that we don't need to write to the guest page because a read would suffice, since uffd-wp also works with the zero pfn. It's just that this workaround won't help with saving snapshot disk space, but it seems to work. It would be great if you have other workarounds; maybe as you said UFFDIO_ZEROPAGE is not the only route.
>>
>> Thanks,
>>
> Just to add: one of the good options is to keep track of virtio-balloon discarded pages and pre-fault them before migration starts. What do you think?

Just pre-fault everything and inhibit the balloon. That should work.
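For illustration, the "pre-fault everything and inhibit the balloon" sequence could look roughly like the sketch below. qemu_balloon_inhibit() and RAMBLOCK_FOREACH() are existing QEMU interfaces (the former is what postcopy already uses); prefault_block() is the hypothetical helper from the previous sketch, and the overall flow is an assumption, not the series' actual code:

    /* Sketch only: stop balloon discards, populate all guest RAM,
     * then write-protect it for the snapshot. */
    static void snapshot_prepare_ram(void)
    {
        RAMBlock *block;

        qemu_balloon_inhibit(true);    /* no more discards while we run */

        RAMBLOCK_FOREACH(block) {
            prefault_block(block->host, block->used_length,
                           qemu_real_host_page_size);
        }

        /* ... wr-protect each block with UFFDIO_WRITEPROTECT, take the
         * snapshot, then re-enable the balloon ... */
        qemu_balloon_inhibit(false);
    }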
On 22.02.2021 21:11, David Hildenbrand wrote:
> On 22.02.21 18:54, Peter Xu wrote:
>> On Mon, Feb 22, 2021 at 06:33:27PM +0100, David Hildenbrand wrote:
>>> On 22.02.21 18:29, Peter Xu wrote:
>>>> On Sat, Feb 20, 2021 at 02:59:42AM -0500, David Hildenbrand wrote:
>>>>> Live snapshotting ends up reading all guest memory (dirty bitmap starts with all 1s), which is not what we want for virtio-mem - we don't want to read and migrate memory that has been discarded and has no stable content.
>>>>>
>>>>> For ordinary migration we use the guest page hint API to clear bits in the dirty bitmap after dirty bitmap sync. Well, if we don't do bitmap syncs we'll never clear any dirty bits. That's the problem.
>>>>
>>>> Using the dirty bitmap for that information is less efficient, because it's definitely larger-granularity information than PAGE_SIZE. If the discarded ranges are always contiguous and at the end of a memory region, we should have some parameter in the ramblock showing where it got shrunk, so we don't check the dirty bitmap at all rather than always assuming used_length is the one.
>>>
>>> They are randomly scattered across the whole RAMBlock. Shrinking/growing will be done to some degree in the future (but it won't get rid of the general sparse layout we can produce).
>>
>> OK. Btw I think currently live snapshot should still be reading the dirty bitmap, so maybe it's still fine. It's just that it's still not very clear to hide virtio-mem information in the dirty bitmap, imho, since that's not how we interpret the dirty bitmap - which is only for the sake of tracking page changes.
>
> Well, currently it is "what do we have to migrate".
>
Using the dirty bitmap can prevent unexpected (overwritten) content of pre-discarded pages from getting into the snapshot, but they'll appear as zeroed on the destination.

What about other virtio-balloon features like HW_POISON, where discarded pages should come back filled with a certain pattern on repopulation?

Thanks,

>>
>> What's the granule of virtio-mem for this discard behavior? Maybe we could
>
> virtio-mem granularity is at least 1MB. This corresponds to 256 bits (32 bytes) in the dirty bitmap I think.
>
>> decouple it from the dirty bitmap some day; if the unit is big enough it's also a gain in efficiency, since we'd skip in chunks rather than looping over tons of pages knowing that they're discarded.
>
> Yeah, it's not optimal having to go over the dirty bitmap to cross off "discarded" parts and later having to find bits to migrate.
>
> At least find_next_bit() can skip whole longs (8 bytes) and is fairly efficient. There is certainly room for improvement (the current guest free page hinting API is certainly a hack).
>
On 24.02.2021 19:54, David Hildenbrand wrote:
>>>> Anyhow I'm quite sure that adding a global MISSING handler for snapshotting is too heavy and not really needed.
>>> UFFDIO_ZEROPAGE installs a zero pfn and that should be all of it. There'll definitely be overhead, but it may not be as huge as imagined. Live snapshot is great in that we have a point-in-time image of the guest without stopping the guest, so taking slightly longer won't be a huge loss to us either.
>>>
>>> Actually we can also think of other ways to work around it. One way is that we can pre-fault all guest pages before wr-protect. Note that we don't need to write to the guest page because a read would suffice, since uffd-wp also works with the zero pfn. It's just that this workaround won't help with saving snapshot disk space, but it seems to work. It would be great if you have other workarounds; maybe as you said UFFDIO_ZEROPAGE is not the only route.
>>>
>>> Thanks,
>>>
>> Just to add: one of the good options is to keep track of virtio-balloon discarded pages and pre-fault them before migration starts. What do you think?
>
> Just pre-fault everything and inhibit the balloon. That should work.
>
Yep.
On 24.02.21 17:56, Andrey Gruzdev wrote:
> On 22.02.2021 21:11, David Hildenbrand wrote:
>> On 22.02.21 18:54, Peter Xu wrote:
>>> On Mon, Feb 22, 2021 at 06:33:27PM +0100, David Hildenbrand wrote:
>>>> On 22.02.21 18:29, Peter Xu wrote:
>>>>> On Sat, Feb 20, 2021 at 02:59:42AM -0500, David Hildenbrand wrote:
>>>>>> Live snapshotting ends up reading all guest memory (dirty bitmap starts with all 1s), which is not what we want for virtio-mem - we don't want to read and migrate memory that has been discarded and has no stable content.
>>>>>>
>>>>>> For ordinary migration we use the guest page hint API to clear bits in the dirty bitmap after dirty bitmap sync. Well, if we don't do bitmap syncs we'll never clear any dirty bits. That's the problem.
>>>>>
>>>>> Using the dirty bitmap for that information is less efficient, because it's definitely larger-granularity information than PAGE_SIZE. If the discarded ranges are always contiguous and at the end of a memory region, we should have some parameter in the ramblock showing where it got shrunk, so we don't check the dirty bitmap at all rather than always assuming used_length is the one.
>>>>
>>>> They are randomly scattered across the whole RAMBlock. Shrinking/growing will be done to some degree in the future (but it won't get rid of the general sparse layout we can produce).
>>>
>>> OK. Btw I think currently live snapshot should still be reading the dirty bitmap, so maybe it's still fine. It's just that it's still not very clear to hide virtio-mem information in the dirty bitmap, imho, since that's not how we interpret the dirty bitmap - which is only for the sake of tracking page changes.
>>
>> Well, currently it is "what do we have to migrate".
>>
> Using the dirty bitmap can prevent unexpected (overwritten) content of pre-discarded pages from getting into the snapshot, but they'll appear as zeroed on the destination.

To be precise, they'll usually appear as discarded (no pages in the migration stream).

> What about other virtio-balloon features like HW_POISON, where discarded pages should come back filled with a certain pattern on repopulation?

In case of free page reporting, we skip discarding if poisoning is set to != 0 for exactly that reason:

    if (virtio_balloon_inhibited() || dev->poison_val) {
        goto skip_element;
    }
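Annotated, that check reads as follows (the comments are added for this discussion and are not in the QEMU source):

    if (virtio_balloon_inhibited()   /* discards are globally inhibited,
                                      * e.g. during postcopy */
        || dev->poison_val) {        /* guest set a non-zero poison value;
                                      * a discarded page would read back
                                      * as zero, not as the pattern */
        goto skip_element;
    }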
On 24.02.2021 20:01, David Hildenbrand wrote:
> On 24.02.21 17:56, Andrey Gruzdev wrote:
>> On 22.02.2021 21:11, David Hildenbrand wrote:
>>> On 22.02.21 18:54, Peter Xu wrote:
>>>> On Mon, Feb 22, 2021 at 06:33:27PM +0100, David Hildenbrand wrote:
>>>>> On 22.02.21 18:29, Peter Xu wrote:
>>>>>> On Sat, Feb 20, 2021 at 02:59:42AM -0500, David Hildenbrand wrote:
>>>>>>> Live snapshotting ends up reading all guest memory (dirty bitmap starts with all 1s), which is not what we want for virtio-mem - we don't want to read and migrate memory that has been discarded and has no stable content.
>>>>>>>
>>>>>>> For ordinary migration we use the guest page hint API to clear bits in the dirty bitmap after dirty bitmap sync. Well, if we don't do bitmap syncs we'll never clear any dirty bits. That's the problem.
>>>>>>
>>>>>> Using the dirty bitmap for that information is less efficient, because it's definitely larger-granularity information than PAGE_SIZE. If the discarded ranges are always contiguous and at the end of a memory region, we should have some parameter in the ramblock showing where it got shrunk, so we don't check the dirty bitmap at all rather than always assuming used_length is the one.
>>>>>
>>>>> They are randomly scattered across the whole RAMBlock. Shrinking/growing will be done to some degree in the future (but it won't get rid of the general sparse layout we can produce).
>>>>
>>>> OK. Btw I think currently live snapshot should still be reading the dirty bitmap, so maybe it's still fine. It's just that it's still not very clear to hide virtio-mem information in the dirty bitmap, imho, since that's not how we interpret the dirty bitmap - which is only for the sake of tracking page changes.
>>>
>>> Well, currently it is "what do we have to migrate".
>>>
>> Using the dirty bitmap can prevent unexpected (overwritten) content of pre-discarded pages from getting into the snapshot, but they'll appear as zeroed on the destination.
>
> To be precise, they'll usually appear as discarded (no pages in the migration stream).

I just meant the content the page retains after population - zeroes.

>
>> What about other virtio-balloon features like HW_POISON, where discarded pages should come back filled with a certain pattern on repopulation?
>
> In case of free page reporting, we skip discarding if poisoning is set to != 0 for exactly that reason:
>
>     if (virtio_balloon_inhibited() || dev->poison_val) {
>         goto skip_element;
>     }
>
Got it, thanks.