Message ID: 20191022221223.17338.5860.stgit@localhost.localdomain (mailing list archive)
Series: mm / virtio: Provide support for unused page reporting
On Tue, 22 Oct 2019 15:27:52 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote:

> Below are the results from various benchmarks. I primarily focused on two tests. The first is the will-it-scale/page_fault2 test, and the other is a modified version of will-it-scale/page_fault1 that was enabled to use THP. I did this as it allows for better visibility into different parts of the memory subsystem. The guest is running on one node of an E5-2630 v3 CPU with 48G of RAM that I split up into two logical nodes in the guest in order to test with NUMA as well.
>
> Test                   page_fault1 (THP)       page_fault2
> Baseline          1    1256106.33 +/-0.09%      482202.67 +/-0.46%
>                  16    8864441.67 +/-0.09%     3734692.00 +/-1.23%
>
> Patches applied   1    1257096.00 +/-0.06%      477436.00 +/-0.16%
>                  16    8864677.33 +/-0.06%     3800037.00 +/-0.19%
>
> Patches enabled   1    1258420.00 +/-0.04%      480080.00 +/-0.07%
> MADV disabled    16    8753840.00 +/-1.27%     3782764.00 +/-0.37%
>
> Patches enabled   1    1267916.33 +/-0.08%      472075.67 +/-0.39%
>                  16    8287050.33 +/-0.67%     3774500.33 +/-0.11%
>
> The results above are for a baseline with a linux-next-20191021 kernel; that kernel with this patch set applied but page reporting disabled in virtio-balloon; the patches applied but the madvise disabled by directly assigning a device; and the patches applied with page reporting fully enabled. These results include the deviation seen between the average value reported here and the high and/or low value. I observed that during the test the memory usage for the first three tests never dropped, whereas with the patches fully enabled the VM would drop to using only a few GB of the host's memory when switching from memhog to the page fault tests.
>
> Most of the overhead seen with this patch set fully enabled is due to the fact that accessing the reported pages will cause a page fault and the host will have to zero the page before giving it back to the guest. The overall guest size is kept fairly small, only a few GB, while the test is running. This overhead is much more visible when using THP than with standard 4K pages. As such, for the case where the host memory is not oversubscribed this results in a performance regression; however, if the host memory were oversubscribed this patch set should result in a performance improvement, as swapping memory from the host can be avoided.

I'm trying to understand "how valuable is this patchset" and the above resulted in some headscratching.

Overall, how valuable is this patchset? To real users running real workloads?

> There is currently an alternative patch set[1] that has been under work for some time; however, the v12 version of that patch set could not be tested as it triggered a kernel panic when I attempted to test it. It requires multiple modifications to get up and running with performance comparable to this patch set. A follow-on set has yet to be posted. As such I have not included results from that patch set, and I would appreciate it if we could keep this patch set the focus of any discussion on this thread.

Actually, the rest of us would be interested in a comparison ;)
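The MADV_DONTNEED round trip that the quoted cover letter attributes the overhead to can be shown in a short, self-contained userspace sketch. This is a simplified stand-in for what the QEMU virtio-balloon backend does on the host; the mapping, size, and flow here are illustrative assumptions, not code from the series:

#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 2 * 1024 * 1024;   /* one THP-sized region, for example */
        char *guest_ram = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (guest_ram == MAP_FAILED)
                return 1;

        memset(guest_ram, 0xaa, len);   /* the guest dirties the memory */

        /*
         * Host side: once the guest reports the pages as unused, discard the
         * backing pages.  The mapping stays valid but the physical memory is
         * released back to the host.
         */
        madvise(guest_ram, len, MADV_DONTNEED);

        /*
         * The next guest access faults, and the kernel must allocate and zero
         * a fresh page before handing it back -- the overhead visible in the
         * "Patches enabled" rows of the table above.
         */
        guest_ram[0] = 1;

        munmap(guest_ram, len);
        return 0;
}

Zeroing a freshly faulted huge page is far more work than zeroing a single 4K page, which is consistent with the larger hit reported in the page_fault1 (THP) column above.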
On Tue, 2019-10-22 at 16:01 -0700, Andrew Morton wrote:
> On Tue, 22 Oct 2019 15:27:52 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
> > Below are the results from various benchmarks. I primarily focused on two tests. The first is the will-it-scale/page_fault2 test, and the other is a modified version of will-it-scale/page_fault1 that was enabled to use THP. I did this as it allows for better visibility into different parts of the memory subsystem. The guest is running on one node of an E5-2630 v3 CPU with 48G of RAM that I split up into two logical nodes in the guest in order to test with NUMA as well.
> >
> > Test                   page_fault1 (THP)       page_fault2
> > Baseline          1    1256106.33 +/-0.09%      482202.67 +/-0.46%
> >                  16    8864441.67 +/-0.09%     3734692.00 +/-1.23%
> >
> > Patches applied   1    1257096.00 +/-0.06%      477436.00 +/-0.16%
> >                  16    8864677.33 +/-0.06%     3800037.00 +/-0.19%
> >
> > Patches enabled   1    1258420.00 +/-0.04%      480080.00 +/-0.07%
> > MADV disabled    16    8753840.00 +/-1.27%     3782764.00 +/-0.37%
> >
> > Patches enabled   1    1267916.33 +/-0.08%      472075.67 +/-0.39%
> >                  16    8287050.33 +/-0.67%     3774500.33 +/-0.11%
> >
> > The results above are for a baseline with a linux-next-20191021 kernel; that kernel with this patch set applied but page reporting disabled in virtio-balloon; the patches applied but the madvise disabled by directly assigning a device; and the patches applied with page reporting fully enabled. These results include the deviation seen between the average value reported here and the high and/or low value. I observed that during the test the memory usage for the first three tests never dropped, whereas with the patches fully enabled the VM would drop to using only a few GB of the host's memory when switching from memhog to the page fault tests.
> >
> > Most of the overhead seen with this patch set fully enabled is due to the fact that accessing the reported pages will cause a page fault and the host will have to zero the page before giving it back to the guest. The overall guest size is kept fairly small, only a few GB, while the test is running. This overhead is much more visible when using THP than with standard 4K pages. As such, for the case where the host memory is not oversubscribed this results in a performance regression; however, if the host memory were oversubscribed this patch set should result in a performance improvement, as swapping memory from the host can be avoided.
>
> I'm trying to understand "how valuable is this patchset" and the above resulted in some headscratching.
>
> Overall, how valuable is this patchset? To real users running real workloads?

A more detailed reply is in my response to your comments on patch 3. Basically the value is for host memory overcommit, in that we can avoid having to go to swap nearly as often and can potentially pack the guests even tighter with better performance.

> > There is currently an alternative patch set[1] that has been under work for some time; however, the v12 version of that patch set could not be tested as it triggered a kernel panic when I attempted to test it. It requires multiple modifications to get up and running with performance comparable to this patch set. A follow-on set has yet to be posted. As such I have not included results from that patch set, and I would appreciate it if we could keep this patch set the focus of any discussion on this thread.
>
> Actually, the rest of us would be interested in a comparison ;)

I understand that. However, the last time I tried benchmarking that patch set it blew up into a thread where we kept having to fix things on that patch set, and by the time we were done we weren't benchmarking the v12 patch set anymore since we had made so many modifications to it, and that assumes Nitesh and I were in sync. Also, I don't know what the current state of his patch set is, as he was working on some additional changes when we last discussed things.

Ideally that patch set can be reposted with the necessary fixes, and then we can go through any necessary debug, repair, and addressing of limitations there.
On 10/22/19 7:43 PM, Alexander Duyck wrote:
> On Tue, 2019-10-22 at 16:01 -0700, Andrew Morton wrote:
>> On Tue, 22 Oct 2019 15:27:52 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote:
>> [...]
>>> There is currently an alternative patch set[1] that has been under work for some time; however, the v12 version of that patch set could not be tested as it triggered a kernel panic when I attempted to test it. It requires multiple modifications to get up and running with performance comparable to this patch set. A follow-on set has yet to be posted. As such I have not included results from that patch set, and I would appreciate it if we could keep this patch set the focus of any discussion on this thread.
>>
>> Actually, the rest of us would be interested in a comparison ;)
>
> I understand that. However, the last time I tried benchmarking that patch set it blew up into a thread where we kept having to fix things on that patch set, and by the time we were done we weren't benchmarking the v12 patch set anymore since we had made so many modifications to it, and that assumes Nitesh and I were in sync. Also, I don't know what the current state of his patch set is, as he was working on some additional changes when we last discussed things.

Just an update about the current state of my patch-series:

As we last discussed, I was going to try implementing Michal Hocko's suggestion of using page-isolation APIs. To do that I have replaced __isolate_free_page() with start/undo_isolate_free_page_range(). However, I am running into some issues which I am currently investigating.

After this, I will be investigating the reason why I was seeing degradation specifically with (MAX_ORDER - 2) as the reporting order.

> Ideally that patch set can be reposted with the necessary fixes, and then we can go through any necessary debug, repair, and addressing of limitations there.
On 10/22/19 6:27 PM, Alexander Duyck wrote:
> This series provides an asynchronous means of reporting unused guest pages to a hypervisor so that the memory associated with those pages can be dropped and reused by other processes and/or guests.
>
> When enabled it will allocate a set of statistics to track the number of reported pages. When the nr_free for a given free_area is greater than this by the high water mark we will schedule a worker to begin allocating the non-reported memory and to provide it to the reporting interface via a scatterlist.
>
> Currently this is only in use by virtio-balloon; however, there is the hope that at some point in the future other hypervisors might be able to make use of it. In the virtio-balloon/QEMU implementation the hypervisor is currently using MADV_DONTNEED to indicate to the host kernel that the page is currently unused. It will be faulted back into the guest the next time the page is accessed.
>
> To track if a page is reported or not, the Uptodate flag was repurposed and used as a Reported flag for Buddy pages. While we are processing the pages in a given zone we have a set of pointers we track called reported_boundary that is used to keep our processing time to a minimum. Without these we would have to iterate through all of the reported pages, which would become a significant burden. I measured as much as a 20% performance degradation without using the boundary pointers. In the event of something like compaction needing to process the zone at the same time it currently resorts to resetting the boundary if it is rearranging the list. However, in the future it could choose to delay processing the zone if a flag is set indicating that a zone is being actively processed.
>
> Below are the results from various benchmarks. I primarily focused on two tests. The first is the will-it-scale/page_fault2 test, and the other is a modified version of will-it-scale/page_fault1 that was enabled to use THP. I did this as it allows for better visibility into different parts of the memory subsystem. The guest is running on one node of an E5-2630 v3 CPU with 48G of RAM that I split up into two logical nodes in the guest in order to test with NUMA as well.
>
> Test                   page_fault1 (THP)       page_fault2
> Baseline          1    1256106.33 +/-0.09%      482202.67 +/-0.46%
>                  16    8864441.67 +/-0.09%     3734692.00 +/-1.23%
>
> Patches applied   1    1257096.00 +/-0.06%      477436.00 +/-0.16%
>                  16    8864677.33 +/-0.06%     3800037.00 +/-0.19%
>
> Patches enabled   1    1258420.00 +/-0.04%      480080.00 +/-0.07%
> MADV disabled    16    8753840.00 +/-1.27%     3782764.00 +/-0.37%
>
> Patches enabled   1    1267916.33 +/-0.08%      472075.67 +/-0.39%
>                  16    8287050.33 +/-0.67%     3774500.33 +/-0.11%
>
> The results above are for a baseline with a linux-next-20191021 kernel; that kernel with this patch set applied but page reporting disabled in virtio-balloon; the patches applied but the madvise disabled by directly assigning a device; and the patches applied with page reporting fully enabled. These results include the deviation seen between the average value reported here and the high and/or low value. I observed that during the test the memory usage for the first three tests never dropped, whereas with the patches fully enabled the VM would drop to using only a few GB of the host's memory when switching from memhog to the page fault tests.
>
> Most of the overhead seen with this patch set fully enabled is due to the fact that accessing the reported pages will cause a page fault and the host will have to zero the page before giving it back to the guest. The overall guest size is kept fairly small, only a few GB, while the test is running. This overhead is much more visible when using THP than with standard 4K pages. As such, for the case where the host memory is not oversubscribed this results in a performance regression; however, if the host memory were oversubscribed this patch set should result in a performance improvement, as swapping memory from the host can be avoided.
>
> There is currently an alternative patch set[1] that has been under work for some time; however, the v12 version of that patch set could not be tested as it triggered a kernel panic when I attempted to test it. It requires multiple modifications to get up and running with performance comparable to this patch set. A follow-on set has yet to be posted. As such I have not included results from that patch set, and I would appreciate it if we could keep this patch set the focus of any discussion on this thread.
>
> For info on earlier versions you will need to follow the links provided with the respective versions.
>
> [1]: https://lore.kernel.org/lkml/20190812131235.27244-1-nitesh@redhat.com/
>
> Changes from v10:
> https://lore.kernel.org/lkml/20190918175109.23474.67039.stgit@localhost.localdomain/
> Rebased on "Add linux-next specific files for 20190930"
> Added page_is_reported() macro to prevent unneeded testing of PageReported bit
> Fixed several spots where comments referred to older aeration naming
> Set upper limit for phdev->capacity to page reporting high water mark
> Updated virtio page poison detection logic to also cover init_on_free
> Tweaked page_reporting_notify_free to reduce code size
> Removed dead code in non-reporting path
>
> Changes from v11:
> https://lore.kernel.org/lkml/20191001152441.27008.99285.stgit@localhost.localdomain/
> Removed unnecessary whitespace change from patch 2
> Minor tweak to get_unreported_page to avoid excess writes to boundary
> Rewrote cover page to lay out additional performance info.
>
> ---
>
> Alexander Duyck (6):
>       mm: Adjust shuffle code to allow for future coalescing
>       mm: Use zone and order instead of free area in free_list manipulators
>       mm: Introduce Reported pages
>       mm: Add device side and notifier for unused page reporting
>       virtio-balloon: Pull page poisoning config out of free page hinting
>       virtio-balloon: Add support for providing unused page reports to host
>
>  drivers/virtio/Kconfig              |    1
>  drivers/virtio/virtio_balloon.c     |   88 ++++++++-
>  include/linux/mmzone.h              |   60 ++----
>  include/linux/page-flags.h          |   11 +
>  include/linux/page_reporting.h      |   31 +++
>  include/uapi/linux/virtio_balloon.h |    1
>  mm/Kconfig                          |   11 +
>  mm/Makefile                         |    1
>  mm/compaction.c                     |    5
>  mm/memory_hotplug.c                 |    2
>  mm/page_alloc.c                     |  194 +++++++++++++++----
>  mm/page_reporting.c                 |  353 +++++++++++++++++++++++++++++++++++
>  mm/page_reporting.h                 |  225 ++++++++++++++++++++++
>  mm/shuffle.c                        |   12 +
>  mm/shuffle.h                        |    6 +
>  15 files changed, 899 insertions(+), 102 deletions(-)
>  create mode 100644 include/linux/page_reporting.h
>  create mode 100644 mm/page_reporting.c
>  create mode 100644 mm/page_reporting.h
>
> --

I think Michal Hocko suggested us to include a brief detail about the background explaining how we ended up with the current approach and what all things we have already tried.
That would help someone reviewing the patch-series for the first time to understand it in a better way.

--
Nitesh
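For readers new to the series, the trigger described in the cover letter above (only wake a reporting worker once the free count exceeds the already-reported count by a high-water mark) can be sketched in a few self-contained lines. All names below (free_area_stats, REPORTING_HWM, schedule_reporting_work) are simplified stand-ins rather than the actual patch code, which hooks the check into page_reporting_notify_free() and marks buddy pages with the Reported flag:

#include <stdio.h>

#define REPORTING_HWM   32      /* assumed high-water mark, in pages */

struct free_area_stats {
        unsigned long nr_free;  /* free pages of this order in the zone */
        unsigned long reported; /* pages already reported to the host */
};

/* Stand-in for waking the asynchronous reporting worker. */
static void schedule_reporting_work(unsigned int order)
{
        printf("schedule page reporting for order %u\n", order);
}

/*
 * Called as pages are freed: the worker is only scheduled once the free
 * count exceeds the already-reported count by the high-water mark, so
 * reporting stays batched and asynchronous instead of per-page.
 */
static void notify_free(struct free_area_stats *area, unsigned int order)
{
        if (area->nr_free > area->reported + REPORTING_HWM)
                schedule_reporting_work(order);
}

int main(void)
{
        struct free_area_stats area = { .nr_free = 100, .reported = 40 };

        notify_free(&area, 9); /* 100 > 40 + 32, so the worker is scheduled */
        return 0;
}

The Reported flag and the reported_boundary pointers mentioned in the cover letter exist so that the worker can skip pages it has already handed to the host instead of rescanning the whole free list each pass.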
On Wed, 2019-10-23 at 07:35 -0400, Nitesh Narayan Lal wrote:
> On 10/22/19 6:27 PM, Alexander Duyck wrote:
> > This series provides an asynchronous means of reporting unused guest pages to a hypervisor so that the memory associated with those pages can be dropped and reused by other processes and/or guests.
> >
> <snip>
>
> I think Michal Hocko suggested us to include a brief detail about the background explaining how we ended up with the current approach and what all things we have already tried.
> That would help someone reviewing the patch-series for the first time to understand it in a better way.

I'm not entirely sure it helps. The problem is that even the "brief" version will probably be pretty long.

From what I know, the first real public discussion of guest memory overcommit and free page hinting dates back to the 2011 KVM forum and a presentation by Rik van Riel[0].

Before I got started in the code there was already virtio-balloon free page hinting[1]. However, it was meant to be an all-at-once reporting of the free pages in the system at a given point in time, and used only for VM migration. All it does is inflate a balloon until it encounters an OOM and then it frees the memory back to the guest. One interesting piece that came out of the work on that patch set was the suggestion by Linus to use an array based incremental approach[2], which is what I based my later implementation on.

I believe Nitesh had already been working on his own approach for unused page hinting for some time at that point. Prior to submitting my RFC there was already a v7 that had been submitted by Nitesh back in mid 2018[3]. The solution was an array based approach which appeared to instrument arch_alloc_page and arch_free_page and would prevent allocations while hinting was occurring.

The first RFC I had written[4] was a synchronous approach that made use of arch_free_page to make a hypercall that would immediately flag the page as being unused. However, a hypercall per page can be expensive and we ideally don't want the guest vCPU potentially being hung up while waiting on the host mmap_sem.

At about this time I believe Nitesh's solution[5] was still trying to keep an array of pages that were unused and tracking that via arch_free_page. In the synchronous case it could cause OOM errors, and in the asynchronous approach it had issues with being overrun and not being able to track unused pages.

Later I switched to an asynchronous approach[6], originally calling it "bubble hinting". With the asynchronous approach it is necessary to have a way to track what pages have been reported and what haven't. I originally was using the page type to track it, as I had a Buddy and a TreatedBuddy, but ultimately that moved to a "Reported" page flag. In addition I pulled the counters and pointers out of the free_area/free_list and instead now have a stand-alone set of pointers and keep the reported statistics in a separate dynamic allocation.

Then Nitesh's solution had changed to the bitmap approach[7]. However, it has been pointed out that this solution doesn't deal with sparse memory, hotplug, and various other issues.

Since then both my approach and Nitesh's approach have been iterating with mostly minor changes.
[0]: https://www.linux-kvm.org/images/f/ff/2011-forum-memory-overcommit.pdf
[1]: https://lore.kernel.org/lkml/1535333539-32420-1-git-send-email-wei.w.wang@intel.com/
[2]: https://lore.kernel.org/lkml/CA+55aFzqj8wxXnHAdUTiOomipgFONVbqKMjL_tfk7e5ar1FziQ@mail.gmail.com/
[3]: https://www.spinics.net/lists/kvm/msg170113.html
[4]: https://lore.kernel.org/lkml/20190204181118.12095.38300.stgit@localhost.localdomain/
[5]: https://lore.kernel.org/lkml/20190204201854.2328-1-nitesh@redhat.com/
[6]: https://lore.kernel.org/lkml/20190530215223.13974.22445.stgit@localhost.localdomain/
[7]: https://lore.kernel.org/lkml/20190603170306.49099-1-nitesh@redhat.com/
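The cost argument against the early synchronous approach in [4] (one blocking exit per freed page) is easiest to see in a small sketch. The names below (hypercall_report_unused, on_free_page) are illustrative stand-ins, not the actual RFC code, which instrumented arch_free_page in the guest kernel:

#include <stdint.h>
#include <stdio.h>

/* Stand-in for a hypercall telling the host that a guest page is unused. */
static void hypercall_report_unused(uint64_t gpa, uint64_t len)
{
        /* In a real guest this traps to the hypervisor and blocks the vCPU,
         * for example while the host is waiting on its mmap_sem. */
        printf("report unused: gpa=0x%llx len=%llu\n",
               (unsigned long long)gpa, (unsigned long long)len);
}

/* Illustrative hook invoked on every page free, in the spirit of the
 * arch_free_page() instrumentation described above. */
static void on_free_page(uint64_t gpa, unsigned int order)
{
        hypercall_report_unused(gpa, 4096ULL << order);
}

int main(void)
{
        /* Freeing many pages means one synchronous exit per page -- the cost
         * that pushed both series toward asynchronous, batched reporting. */
        for (uint64_t gpa = 0; gpa < 16 * 4096; gpa += 4096)
                on_free_page(gpa, 0);
        return 0;
}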
On 10/22/19 6:27 PM, Alexander Duyck wrote:

[...]

> Below are the results from various benchmarks. I primarily focused on two tests. The first is the will-it-scale/page_fault2 test, and the other is a modified version of will-it-scale/page_fault1 that was enabled to use THP. I did this as it allows for better visibility into different parts of the memory subsystem. The guest is running on one node of an E5-2630 v3 CPU with 48G of RAM that I split up into two logical nodes in the guest in order to test with NUMA as well.
>
> Test                   page_fault1 (THP)       page_fault2
> Baseline          1    1256106.33 +/-0.09%      482202.67 +/-0.46%
>                  16    8864441.67 +/-0.09%     3734692.00 +/-1.23%
>
> Patches applied   1    1257096.00 +/-0.06%      477436.00 +/-0.16%
>                  16    8864677.33 +/-0.06%     3800037.00 +/-0.19%
>
> Patches enabled   1    1258420.00 +/-0.04%      480080.00 +/-0.07%
> MADV disabled    16    8753840.00 +/-1.27%     3782764.00 +/-0.37%
>
> Patches enabled   1    1267916.33 +/-0.08%      472075.67 +/-0.39%
>                  16    8287050.33 +/-0.67%     3774500.33 +/-0.11%

If I am not mistaken, then you are only observing the number of processes (and not the number of threads) launched over the 1st and the 16th vcpu, as reported by will-it-scale?
On Mon, Oct 28, 2019 at 7:34 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> On 10/22/19 6:27 PM, Alexander Duyck wrote:
>
> [...]
>
> > Below are the results from various benchmarks. I primarily focused on two tests. The first is the will-it-scale/page_fault2 test, and the other is a modified version of will-it-scale/page_fault1 that was enabled to use THP. I did this as it allows for better visibility into different parts of the memory subsystem. The guest is running on one node of an E5-2630 v3 CPU with 48G of RAM that I split up into two logical nodes in the guest in order to test with NUMA as well.
> >
> > Test                   page_fault1 (THP)       page_fault2
> > Baseline          1    1256106.33 +/-0.09%      482202.67 +/-0.46%
> >                  16    8864441.67 +/-0.09%     3734692.00 +/-1.23%
> >
> > Patches applied   1    1257096.00 +/-0.06%      477436.00 +/-0.16%
> >                  16    8864677.33 +/-0.06%     3800037.00 +/-0.19%
> >
> > Patches enabled   1    1258420.00 +/-0.04%      480080.00 +/-0.07%
> > MADV disabled    16    8753840.00 +/-1.27%     3782764.00 +/-0.37%
> >
> > Patches enabled   1    1267916.33 +/-0.08%      472075.67 +/-0.39%
> >                  16    8287050.33 +/-0.67%     3774500.33 +/-0.11%
>
> If I am not mistaken, then you are only observing the number of processes (and not the number of threads) launched over the 1st and the 16th vcpu, as reported by will-it-scale?

You are correct; these results are for the processes. I monitored them for 1 - 16, but only included the results for 1 and 16 since those seem to be the most relevant data points.