Message ID: 20191022221223.17338.5860.stgit@localhost.localdomain (mailing list archive)
Series: mm / virtio: Provide support for unused page reporting
On Tue, 22 Oct 2019 15:27:52 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote:

> Below are the results from various benchmarks. I primarily focused on two tests. The first is the will-it-scale/page_fault2 test, and the other is a modified version of will-it-scale/page_fault1 that was enabled to use THP. I did this as it allows for better visibility into different parts of the memory subsystem. The guest is running on one node of an E5-2630 v3 CPU with 48G of RAM that I split up into two logical nodes in the guest in order to test with NUMA as well.
>
> Test                   page_fault1 (THP)       page_fault2
> Baseline          1    1256106.33 +/-0.09%      482202.67 +/-0.46%
>                  16    8864441.67 +/-0.09%     3734692.00 +/-1.23%
>
> Patches applied   1    1257096.00 +/-0.06%      477436.00 +/-0.16%
>                  16    8864677.33 +/-0.06%     3800037.00 +/-0.19%
>
> Patches enabled   1    1258420.00 +/-0.04%      480080.00 +/-0.07%
> MADV disabled    16    8753840.00 +/-1.27%     3782764.00 +/-0.37%
>
> Patches enabled   1    1267916.33 +/-0.08%      472075.67 +/-0.39%
>                  16    8287050.33 +/-0.67%     3774500.33 +/-0.11%
>
> The results above are for a baseline with a linux-next-20191021 kernel; that kernel with this patch set applied but page reporting disabled in virtio-balloon; the patches applied but the madvise disabled by directly assigning a device; and the patches applied with page reporting fully enabled. These results include the deviation seen between the average value reported here and the high and/or low value. I observed that during the test the memory usage for the first three tests never dropped, whereas with the patches fully enabled the VM would drop to using only a few GB of the host's memory when switching from memhog to the page fault tests.
>
> Most of the overhead seen with this patch set fully enabled is due to the fact that accessing the reported pages will cause a page fault and the host will have to zero the page before giving it back to the guest. The overall guest size is kept fairly small, only a few GB, while the test is running. This overhead is much more visible when using THP than with standard 4K pages. As such, for the case where the host memory is not oversubscribed this results in a performance regression; however, if the host memory were oversubscribed this patch set should result in a performance improvement, as swapping memory from the host can be avoided.

I'm trying to understand "how valuable is this patchset" and the above resulted in some headscratching.

Overall, how valuable is this patchset? To real users running real workloads?

> There is currently an alternative patch set[1] that has been under work for some time; however, the v12 version of that patch set could not be tested as it triggered a kernel panic when I attempted to test it. It requires multiple modifications to get up and running with performance comparable to this patch set. A follow-on set has yet to be posted. As such I have not included results from that patch set, and I would appreciate it if we could keep this patch set the focus of any discussion on this thread.

Actually, the rest of us would be interested in a comparison ;)
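The MADV_DONTNEED round trip that the quoted cover letter attributes the overhead to can be shown in a short, self-contained userspace sketch. This is a simplified stand-in for what the QEMU virtio-balloon backend does on the host; the mapping, size, and flow here are illustrative assumptions, not code from the series:

#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 2 * 1024 * 1024;   /* one THP-sized region, for example */
        char *guest_ram = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (guest_ram == MAP_FAILED)
                return 1;

        memset(guest_ram, 0xaa, len);   /* the guest dirties the memory */

        /*
         * Host side: once the guest reports the pages as unused, discard the
         * backing pages.  The mapping stays valid but the physical memory is
         * released back to the host.
         */
        madvise(guest_ram, len, MADV_DONTNEED);

        /*
         * The next guest access faults, and the kernel must allocate and zero
         * a fresh page before handing it back -- the overhead visible in the
         * "Patches enabled" rows of the table above.
         */
        guest_ram[0] = 1;

        munmap(guest_ram, len);
        return 0;
}

Zeroing a freshly faulted huge page is far more work than zeroing a single 4K page, which is consistent with the larger hit reported in the page_fault1 (THP) column above.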
On Tue, 2019-10-22 at 16:01 -0700, Andrew Morton wrote:
> On Tue, 22 Oct 2019 15:27:52 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
> > Below are the results from various benchmarks. I primarily focused on two tests. The first is the will-it-scale/page_fault2 test, and the other is a modified version of will-it-scale/page_fault1 that was enabled to use THP. I did this as it allows for better visibility into different parts of the memory subsystem. The guest is running on one node of an E5-2630 v3 CPU with 48G of RAM that I split up into two logical nodes in the guest in order to test with NUMA as well.
> >
> > Test                   page_fault1 (THP)       page_fault2
> > Baseline          1    1256106.33 +/-0.09%      482202.67 +/-0.46%
> >                  16    8864441.67 +/-0.09%     3734692.00 +/-1.23%
> >
> > Patches applied   1    1257096.00 +/-0.06%      477436.00 +/-0.16%
> >                  16    8864677.33 +/-0.06%     3800037.00 +/-0.19%
> >
> > Patches enabled   1    1258420.00 +/-0.04%      480080.00 +/-0.07%
> > MADV disabled    16    8753840.00 +/-1.27%     3782764.00 +/-0.37%
> >
> > Patches enabled   1    1267916.33 +/-0.08%      472075.67 +/-0.39%
> >                  16    8287050.33 +/-0.67%     3774500.33 +/-0.11%
> >
> > The results above are for a baseline with a linux-next-20191021 kernel; that kernel with this patch set applied but page reporting disabled in virtio-balloon; the patches applied but the madvise disabled by directly assigning a device; and the patches applied with page reporting fully enabled. These results include the deviation seen between the average value reported here and the high and/or low value. I observed that during the test the memory usage for the first three tests never dropped, whereas with the patches fully enabled the VM would drop to using only a few GB of the host's memory when switching from memhog to the page fault tests.
> >
> > Most of the overhead seen with this patch set fully enabled is due to the fact that accessing the reported pages will cause a page fault and the host will have to zero the page before giving it back to the guest. The overall guest size is kept fairly small, only a few GB, while the test is running. This overhead is much more visible when using THP than with standard 4K pages. As such, for the case where the host memory is not oversubscribed this results in a performance regression; however, if the host memory were oversubscribed this patch set should result in a performance improvement, as swapping memory from the host can be avoided.
>
> I'm trying to understand "how valuable is this patchset" and the above resulted in some headscratching.
>
> Overall, how valuable is this patchset? To real users running real workloads?

A more detailed reply is in my response to your comments on patch 3. Basically the value is for host memory overcommit, in that we can avoid having to go to swap nearly as often and can potentially pack the guests even tighter with better performance.

> > There is currently an alternative patch set[1] that has been under work for some time; however, the v12 version of that patch set could not be tested as it triggered a kernel panic when I attempted to test it. It requires multiple modifications to get up and running with performance comparable to this patch set. A follow-on set has yet to be posted. As such I have not included results from that patch set, and I would appreciate it if we could keep this patch set the focus of any discussion on this thread.
>
> Actually, the rest of us would be interested in a comparison ;)

I understand that. However, the last time I tried benchmarking that patch set it blew up into a thread where we kept having to fix things on that patch set, and by the time we were done we weren't benchmarking the v12 patch set anymore since we had made so many modifications to it, and that assumes Nitesh and I were in sync. Also, I don't know what the current state of his patch set is, as he was working on some additional changes when we last discussed things.

Ideally that patch set can be reposted with the necessary fixes, and then we can go through any necessary debug, repair, and addressing of limitations there.
On 10/22/19 7:43 PM, Alexander Duyck wrote:
> On Tue, 2019-10-22 at 16:01 -0700, Andrew Morton wrote:
>> On Tue, 22 Oct 2019 15:27:52 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote:
>> [...]
>>> There is currently an alternative patch set[1] that has been under work for some time; however, the v12 version of that patch set could not be tested as it triggered a kernel panic when I attempted to test it. It requires multiple modifications to get up and running with performance comparable to this patch set. A follow-on set has yet to be posted. As such I have not included results from that patch set, and I would appreciate it if we could keep this patch set the focus of any discussion on this thread.
>>
>> Actually, the rest of us would be interested in a comparison ;)
>
> I understand that. However, the last time I tried benchmarking that patch set it blew up into a thread where we kept having to fix things on that patch set, and by the time we were done we weren't benchmarking the v12 patch set anymore since we had made so many modifications to it, and that assumes Nitesh and I were in sync. Also, I don't know what the current state of his patch set is, as he was working on some additional changes when we last discussed things.

Just an update about the current state of my patch-series:

As we last discussed, I was going to try implementing Michal Hocko's suggestion of using page-isolation APIs. To do that I have replaced __isolate_free_page() with start/undo_isolate_free_page_range(). However, I am running into some issues which I am currently investigating.

After this, I will be investigating the reason why I was seeing degradation specifically with (MAX_ORDER - 2) as the reporting order.

> Ideally that patch set can be reposted with the necessary fixes, and then we can go through any necessary debug, repair, and addressing of limitations there.
On 10/22/19 6:27 PM, Alexander Duyck wrote:
> This series provides an asynchronous means of reporting unused guest pages to a hypervisor so that the memory associated with those pages can be dropped and reused by other processes and/or guests.
>
> When enabled it will allocate a set of statistics to track the number of reported pages. When the nr_free for a given free_area is greater than this by the high water mark we will schedule a worker to begin allocating the non-reported memory and to provide it to the reporting interface via a scatterlist.
>
> Currently this is only in use by virtio-balloon; however, there is the hope that at some point in the future other hypervisors might be able to make use of it. In the virtio-balloon/QEMU implementation the hypervisor is currently using MADV_DONTNEED to indicate to the host kernel that the page is currently unused. It will be faulted back into the guest the next time the page is accessed.
>
> To track if a page is reported or not, the Uptodate flag was repurposed and used as a Reported flag for Buddy pages. While we are processing the pages in a given zone we have a set of pointers we track called reported_boundary that is used to keep our processing time to a minimum. Without these we would have to iterate through all of the reported pages, which would become a significant burden. I measured as much as a 20% performance degradation without using the boundary pointers. In the event of something like compaction needing to process the zone at the same time it currently resorts to resetting the boundary if it is rearranging the list. However, in the future it could choose to delay processing the zone if a flag is set indicating that a zone is being actively processed.
>
> Below are the results from various benchmarks. I primarily focused on two tests. The first is the will-it-scale/page_fault2 test, and the other is a modified version of will-it-scale/page_fault1 that was enabled to use THP. I did this as it allows for better visibility into different parts of the memory subsystem. The guest is running on one node of an E5-2630 v3 CPU with 48G of RAM that I split up into two logical nodes in the guest in order to test with NUMA as well.
>
> Test                   page_fault1 (THP)       page_fault2
> Baseline          1    1256106.33 +/-0.09%      482202.67 +/-0.46%
>                  16    8864441.67 +/-0.09%     3734692.00 +/-1.23%
>
> Patches applied   1    1257096.00 +/-0.06%      477436.00 +/-0.16%
>                  16    8864677.33 +/-0.06%     3800037.00 +/-0.19%
>
> Patches enabled   1    1258420.00 +/-0.04%      480080.00 +/-0.07%
> MADV disabled    16    8753840.00 +/-1.27%     3782764.00 +/-0.37%
>
> Patches enabled   1    1267916.33 +/-0.08%      472075.67 +/-0.39%
>                  16    8287050.33 +/-0.67%     3774500.33 +/-0.11%
>
> The results above are for a baseline with a linux-next-20191021 kernel; that kernel with this patch set applied but page reporting disabled in virtio-balloon; the patches applied but the madvise disabled by directly assigning a device; and the patches applied with page reporting fully enabled. These results include the deviation seen between the average value reported here and the high and/or low value. I observed that during the test the memory usage for the first three tests never dropped, whereas with the patches fully enabled the VM would drop to using only a few GB of the host's memory when switching from memhog to the page fault tests.
>
> Most of the overhead seen with this patch set fully enabled is due to the fact that accessing the reported pages will cause a page fault and the host will have to zero the page before giving it back to the guest. The overall guest size is kept fairly small, only a few GB, while the test is running. This overhead is much more visible when using THP than with standard 4K pages. As such, for the case where the host memory is not oversubscribed this results in a performance regression; however, if the host memory were oversubscribed this patch set should result in a performance improvement, as swapping memory from the host can be avoided.
>
> There is currently an alternative patch set[1] that has been under work for some time; however, the v12 version of that patch set could not be tested as it triggered a kernel panic when I attempted to test it. It requires multiple modifications to get up and running with performance comparable to this patch set. A follow-on set has yet to be posted. As such I have not included results from that patch set, and I would appreciate it if we could keep this patch set the focus of any discussion on this thread.
>
> For info on earlier versions you will need to follow the links provided with the respective versions.
>
> [1]: https://lore.kernel.org/lkml/20190812131235.27244-1-nitesh@redhat.com/
>
> Changes from v10:
> https://lore.kernel.org/lkml/20190918175109.23474.67039.stgit@localhost.localdomain/
> Rebased on "Add linux-next specific files for 20190930"
> Added page_is_reported() macro to prevent unneeded testing of PageReported bit
> Fixed several spots where comments referred to older aeration naming
> Set upper limit for phdev->capacity to page reporting high water mark
> Updated virtio page poison detection logic to also cover init_on_free
> Tweaked page_reporting_notify_free to reduce code size
> Removed dead code in non-reporting path
>
> Changes from v11:
> https://lore.kernel.org/lkml/20191001152441.27008.99285.stgit@localhost.localdomain/
> Removed unnecessary whitespace change from patch 2
> Minor tweak to get_unreported_page to avoid excess writes to boundary
> Rewrote cover page to lay out additional performance info.
>
> ---
>
> Alexander Duyck (6):
>       mm: Adjust shuffle code to allow for future coalescing
>       mm: Use zone and order instead of free area in free_list manipulators
>       mm: Introduce Reported pages
>       mm: Add device side and notifier for unused page reporting
>       virtio-balloon: Pull page poisoning config out of free page hinting
>       virtio-balloon: Add support for providing unused page reports to host
>
>  drivers/virtio/Kconfig              |    1
>  drivers/virtio/virtio_balloon.c     |   88 ++++++++-
>  include/linux/mmzone.h              |   60 ++----
>  include/linux/page-flags.h          |   11 +
>  include/linux/page_reporting.h      |   31 +++
>  include/uapi/linux/virtio_balloon.h |    1
>  mm/Kconfig                          |   11 +
>  mm/Makefile                         |    1
>  mm/compaction.c                     |    5
>  mm/memory_hotplug.c                 |    2
>  mm/page_alloc.c                     |  194 +++++++++++++++----
>  mm/page_reporting.c                 |  353 +++++++++++++++++++++++++++++++++++
>  mm/page_reporting.h                 |  225 ++++++++++++++++++++++
>  mm/shuffle.c                        |   12 +
>  mm/shuffle.h                        |    6 +
>  15 files changed, 899 insertions(+), 102 deletions(-)
>  create mode 100644 include/linux/page_reporting.h
>  create mode 100644 mm/page_reporting.c
>  create mode 100644 mm/page_reporting.h
>
> --

I think Michal Hocko suggested us to include a brief detail about the background explaining how we ended up with the current approach and what all things we have already tried.
That would help someone reviewing the patch-series for the first time to understand it in a better way.

--
Nitesh
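For readers new to the series, the trigger described in the cover letter above (only wake a reporting worker once the free count exceeds the already-reported count by a high-water mark) can be sketched in a few self-contained lines. All names below (free_area_stats, REPORTING_HWM, schedule_reporting_work) are simplified stand-ins rather than the actual patch code, which hooks the check into page_reporting_notify_free() and marks buddy pages with the Reported flag:

#include <stdio.h>

#define REPORTING_HWM   32      /* assumed high-water mark, in pages */

struct free_area_stats {
        unsigned long nr_free;  /* free pages of this order in the zone */
        unsigned long reported; /* pages already reported to the host */
};

/* Stand-in for waking the asynchronous reporting worker. */
static void schedule_reporting_work(unsigned int order)
{
        printf("schedule page reporting for order %u\n", order);
}

/*
 * Called as pages are freed: the worker is only scheduled once the free
 * count exceeds the already-reported count by the high-water mark, so
 * reporting stays batched and asynchronous instead of per-page.
 */
static void notify_free(struct free_area_stats *area, unsigned int order)
{
        if (area->nr_free > area->reported + REPORTING_HWM)
                schedule_reporting_work(order);
}

int main(void)
{
        struct free_area_stats area = { .nr_free = 100, .reported = 40 };

        notify_free(&area, 9); /* 100 > 40 + 32, so the worker is scheduled */
        return 0;
}

The Reported flag and the reported_boundary pointers mentioned in the cover letter exist so that the worker can skip pages it has already handed to the host instead of rescanning the whole free list each pass.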
On Wed, 2019-10-23 at 07:35 -0400, Nitesh Narayan Lal wrote:
> On 10/22/19 6:27 PM, Alexander Duyck wrote:
> > This series provides an asynchronous means of reporting unused guest pages to a hypervisor so that the memory associated with those pages can be dropped and reused by other processes and/or guests.
> >
> <snip>
>
> I think Michal Hocko suggested us to include a brief detail about the background explaining how we ended up with the current approach and what all things we have already tried.
> That would help someone reviewing the patch-series for the first time to understand it in a better way.

I'm not entirely sure it helps. The problem is that even the "brief" version will probably be pretty long.

From what I know, the first real public discussion of guest memory overcommit and free page hinting dates back to the 2011 KVM forum and a presentation by Rik van Riel[0].

Before I got started in the code there was already virtio-balloon free page hinting[1]. However, it was meant to be an all-at-once reporting of the free pages in the system at a given point in time, and used only for VM migration. All it does is inflate a balloon until it encounters an OOM and then it frees the memory back to the guest. One interesting piece that came out of the work on that patch set was the suggestion by Linus to use an array based incremental approach[2], which is what I based my later implementation on.

I believe Nitesh had already been working on his own approach for unused page hinting for some time at that point. Prior to submitting my RFC there was already a v7 that had been submitted by Nitesh back in mid 2018[3]. The solution was an array based approach which appeared to instrument arch_alloc_page and arch_free_page and would prevent allocations while hinting was occurring.

The first RFC I had written[4] was a synchronous approach that made use of arch_free_page to make a hypercall that would immediately flag the page as being unused. However, a hypercall per page can be expensive and we ideally don't want the guest vCPU potentially being hung up while waiting on the host mmap_sem.

At about this time I believe Nitesh's solution[5] was still trying to keep an array of pages that were unused and tracking that via arch_free_page. In the synchronous case it could cause OOM errors, and in the asynchronous approach it had issues with being overrun and not being able to track unused pages.

Later I switched to an asynchronous approach[6], originally calling it "bubble hinting". With the asynchronous approach it is necessary to have a way to track what pages have been reported and what haven't. I originally was using the page type to track it, as I had a Buddy and a TreatedBuddy, but ultimately that moved to a "Reported" page flag. In addition I pulled the counters and pointers out of the free_area/free_list and instead now have a stand-alone set of pointers and keep the reported statistics in a separate dynamic allocation.

Then Nitesh's solution had changed to the bitmap approach[7]. However, it has been pointed out that this solution doesn't deal with sparse memory, hotplug, and various other issues.

Since then both my approach and Nitesh's approach have been iterating with mostly minor changes.
[0]: https://www.linux-kvm.org/images/f/ff/2011-forum-memory-overcommit.pdf
[1]: https://lore.kernel.org/lkml/1535333539-32420-1-git-send-email-wei.w.wang@intel.com/
[2]: https://lore.kernel.org/lkml/CA+55aFzqj8wxXnHAdUTiOomipgFONVbqKMjL_tfk7e5ar1FziQ@mail.gmail.com/
[3]: https://www.spinics.net/lists/kvm/msg170113.html
[4]: https://lore.kernel.org/lkml/20190204181118.12095.38300.stgit@localhost.localdomain/
[5]: https://lore.kernel.org/lkml/20190204201854.2328-1-nitesh@redhat.com/
[6]: https://lore.kernel.org/lkml/20190530215223.13974.22445.stgit@localhost.localdomain/
[7]: https://lore.kernel.org/lkml/20190603170306.49099-1-nitesh@redhat.com/
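The cost argument against the early synchronous approach in [4] (one blocking exit per freed page) is easiest to see in a small sketch. The names below (hypercall_report_unused, on_free_page) are illustrative stand-ins, not the actual RFC code, which instrumented arch_free_page in the guest kernel:

#include <stdint.h>
#include <stdio.h>

/* Stand-in for a hypercall telling the host that a guest page is unused. */
static void hypercall_report_unused(uint64_t gpa, uint64_t len)
{
        /* In a real guest this traps to the hypervisor and blocks the vCPU,
         * for example while the host is waiting on its mmap_sem. */
        printf("report unused: gpa=0x%llx len=%llu\n",
               (unsigned long long)gpa, (unsigned long long)len);
}

/* Illustrative hook invoked on every page free, in the spirit of the
 * arch_free_page() instrumentation described above. */
static void on_free_page(uint64_t gpa, unsigned int order)
{
        hypercall_report_unused(gpa, 4096ULL << order);
}

int main(void)
{
        /* Freeing many pages means one synchronous exit per page -- the cost
         * that pushed both series toward asynchronous, batched reporting. */
        for (uint64_t gpa = 0; gpa < 16 * 4096; gpa += 4096)
                on_free_page(gpa, 0);
        return 0;
}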
On 10/22/19 6:27 PM, Alexander Duyck wrote:

[...]

> Below are the results from various benchmarks. I primarily focused on two tests. The first is the will-it-scale/page_fault2 test, and the other is a modified version of will-it-scale/page_fault1 that was enabled to use THP. I did this as it allows for better visibility into different parts of the memory subsystem. The guest is running on one node of an E5-2630 v3 CPU with 48G of RAM that I split up into two logical nodes in the guest in order to test with NUMA as well.
>
> Test                   page_fault1 (THP)       page_fault2
> Baseline          1    1256106.33 +/-0.09%      482202.67 +/-0.46%
>                  16    8864441.67 +/-0.09%     3734692.00 +/-1.23%
>
> Patches applied   1    1257096.00 +/-0.06%      477436.00 +/-0.16%
>                  16    8864677.33 +/-0.06%     3800037.00 +/-0.19%
>
> Patches enabled   1    1258420.00 +/-0.04%      480080.00 +/-0.07%
> MADV disabled    16    8753840.00 +/-1.27%     3782764.00 +/-0.37%
>
> Patches enabled   1    1267916.33 +/-0.08%      472075.67 +/-0.39%
>                  16    8287050.33 +/-0.67%     3774500.33 +/-0.11%

If I am not mistaken, then you are only observing the number of processes (and not the number of threads) launched over the 1st and the 16th vcpu, as reported by will-it-scale?
On Mon, Oct 28, 2019 at 7:34 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> On 10/22/19 6:27 PM, Alexander Duyck wrote:
>
> [...]
>
> > Below are the results from various benchmarks. I primarily focused on two tests. The first is the will-it-scale/page_fault2 test, and the other is a modified version of will-it-scale/page_fault1 that was enabled to use THP. I did this as it allows for better visibility into different parts of the memory subsystem. The guest is running on one node of an E5-2630 v3 CPU with 48G of RAM that I split up into two logical nodes in the guest in order to test with NUMA as well.
> >
> > Test                   page_fault1 (THP)       page_fault2
> > Baseline          1    1256106.33 +/-0.09%      482202.67 +/-0.46%
> >                  16    8864441.67 +/-0.09%     3734692.00 +/-1.23%
> >
> > Patches applied   1    1257096.00 +/-0.06%      477436.00 +/-0.16%
> >                  16    8864677.33 +/-0.06%     3800037.00 +/-0.19%
> >
> > Patches enabled   1    1258420.00 +/-0.04%      480080.00 +/-0.07%
> > MADV disabled    16    8753840.00 +/-1.27%     3782764.00 +/-0.37%
> >
> > Patches enabled   1    1267916.33 +/-0.08%      472075.67 +/-0.39%
> >                  16    8287050.33 +/-0.67%     3774500.33 +/-0.11%
>
> If I am not mistaken, then you are only observing the number of processes (and not the number of threads) launched over the 1st and the 16th vcpu, as reported by will-it-scale?

You are correct; these results are for the processes. I monitored them for 1 - 16, but only included the results for 1 and 16 since those seem to be the most relevant data points.