[v3,0/6] mm / virtio: Provide support for unused page reporting

Message ID 20190801222158.22190.96964.stgit@localhost.localdomain

Message

Alexander Duyck Aug. 1, 2019, 10:24 p.m. UTC
This series provides an asynchronous means of reporting to a hypervisor
that a guest page is no longer in use and can have the data associated
with it dropped. To do this I have implemented functionality that allows
for what I am referring to as unused page reporting.

The functionality for this is fairly simple. When enabled it will allocate
statistics to track the number of reported pages in a given free area.
When the number of free pages exceeds that count plus a high-water value,
currently 32, it will begin performing page reporting, which consists of
pulling pages off of the free list and placing them into a scatterlist.
The scatterlist is then given to the page reporting device, which performs
the required action to make the pages "reported"; in the case of
virtio-balloon this results in the pages being madvised as MADV_DONTNEED
and as such they are forced out of the guest. After this they are placed
back on the free list and, if they have not been merged, a bit is set to
indicate that they are a reported buddy page instead of a standard buddy
page. The cycle then repeats with additional non-reported pages being
pulled until the free areas all consist of reported pages.
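
To make the cycle a bit more concrete, below is a rough sketch of the flow
described above. The helper names used here (zone_free_count,
zone_reported_count, get_unreported_page, put_reported_page and
report_to_dev) are placeholders for illustration only and do not match the
interfaces actually introduced in patch 4:

#include <linux/mmzone.h>
#include <linux/scatterlist.h>

#define REPORTING_CAPACITY	32	/* bounded by the device queue size */
#define REPORTING_HWM		32	/* the high-water value noted above */

static void report_unused_pages(struct zone *zone, unsigned int order)
{
	struct scatterlist sgl[REPORTING_CAPACITY];
	struct page *page;
	unsigned int nents = 0;

	/* Only start once free pages exceed reported pages plus high water */
	if (zone_free_count(zone, order) <=
	    zone_reported_count(zone, order) + REPORTING_HWM)
		return;

	sg_init_table(sgl, REPORTING_CAPACITY);

	/* Pull non-reported pages off the free list into a scatterlist */
	while (nents < REPORTING_CAPACITY &&
	       (page = get_unreported_page(zone, order)))
		sg_set_page(&sgl[nents++], page, PAGE_SIZE << order, 0);

	/*
	 * Hand the scatterlist to the page reporting device. For
	 * virtio-balloon the host madvises the ranges MADV_DONTNEED,
	 * which forces the backing memory out of the guest.
	 */
	report_to_dev(sgl, nents);

	/*
	 * Return the pages to the free list, flagging any that do not get
	 * merged as "Reported" so they are skipped on the next pass.
	 */
	while (nents--)
		put_reported_page(zone, sg_page(&sgl[nents]), order);
}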

I am leaving a number of things hard-coded, such as limiting the lowest
order processed to PAGEBLOCK_ORDER, and have left it up to the guest to
determine the limit on how many pages it wants to allocate to process the
hints. The upper limit for this is based on the size of the queue used to
store the scatterlist.

My primary testing has just been to verify the memory is being freed after
allocation by running memhog 40g on a 40g guest and watching the total
free memory via /proc/meminfo on the host. With this I have verified most
of the memory is freed after each iteration. As far as performance goes, I
have mainly been focusing on the will-it-scale/page_fault1 test running
with 16 vcpus. With that I have seen up to a 2% difference between the
base kernel without these patches and a patched kernel with virtio-balloon
enabled or disabled.

One side effect of these patches is that the guest becomes much more
resilient in terms of NUMA locality. Because the pages are freed and then
reallocated when used, they end up much closer to the active thread, and
as a result there can be situations where this patch set will out-perform
the stock kernel when the guest memory is not local to the guest vCPUs.

Patch 4 is a bit on the large side at about 600 lines of change; however,
I really didn't see a good way to break it up since each piece feeds into
the next. I couldn't add the statistics by themselves, as it didn't really
make sense to add them without something that would either read or
increment/decrement them, or add the Reported state without something that
would set/unset it. As such I just ended up adding the entire thing as one
patch. It makes the patch a bit bigger but avoids the issues in the
previous set where I was referencing things that had not yet been added.

Changes from the RFC:
https://lore.kernel.org/lkml/20190530215223.13974.22445.stgit@localhost.localdomain/
Moved aeration requested flag out of aerator and into zone->flags.
Moved boundary out of free_area and into local variables for aeration.
Moved aeration cycle out of interrupt and into workqueue.
Left nr_free as total pages instead of splitting it between raw and aerated.
Combined size and physical address values in virtio ring into one 64b value.

Changes from v1:
https://lore.kernel.org/lkml/20190619222922.1231.27432.stgit@localhost.localdomain/
Dropped "waste page treatment" in favor of "page hinting"
Renamed files and functions from "aeration" to "page_hinting"
Moved from page->lru list to scatterlist
Replaced wait on refcnt in shutdown with RCU and cancel_delayed_work_sync
Virtio now uses scatterlist directly instead of intermediate array
Moved stats out of free_area, now in separate area and pointed to from zone
Merged patch 5 into patch 4 to improve review-ability
Updated various code comments throughout

Changes from v2:
https://lore.kernel.org/lkml/20190724165158.6685.87228.stgit@localhost.localdomain/
Dropped "page hinting" in favor of "page reporting"
Renamed files from "hinting" to "reporting"
Replaced "Hinted" page type with "Reported" page flag
Added support for page poisoning while hinting is active
Added QEMU patch that implements the PAGE_POISON feature

---

Alexander Duyck (6):
      mm: Adjust shuffle code to allow for future coalescing
      mm: Move set/get_pcppage_migratetype to mmzone.h
      mm: Use zone and order instead of free area in free_list manipulators
      mm: Introduce Reported pages
      virtio-balloon: Pull page poisoning config out of free page hinting
      virtio-balloon: Add support for providing unused page reports to host


 drivers/virtio/Kconfig              |    1 
 drivers/virtio/virtio_balloon.c     |   75 ++++++++-
 include/linux/mmzone.h              |  116 ++++++++------
 include/linux/page-flags.h          |   11 +
 include/linux/page_reporting.h      |  138 ++++++++++++++++
 include/uapi/linux/virtio_balloon.h |    1 
 mm/Kconfig                          |    5 +
 mm/Makefile                         |    1 
 mm/internal.h                       |   18 ++
 mm/memory_hotplug.c                 |    1 
 mm/page_alloc.c                     |  238 ++++++++++++++++++++--------
 mm/page_reporting.c                 |  299 +++++++++++++++++++++++++++++++++++
 mm/shuffle.c                        |   24 ---
 mm/shuffle.h                        |   32 ++++
 14 files changed, 821 insertions(+), 139 deletions(-)
 create mode 100644 include/linux/page_reporting.h
 create mode 100644 mm/page_reporting.c

--

Comments

Nitesh Narayan Lal Aug. 2, 2019, 2:41 p.m. UTC | #1
On 8/1/19 6:24 PM, Alexander Duyck wrote:
> This series provides an asynchronous means of reporting to a hypervisor
> that a guest page is no longer in use and can have the data associated
> with it dropped. To do this I have implemented functionality that allows
> for what I am referring to as unused page reporting
>
> The functionality for this is fairly simple. When enabled it will allocate
> statistics to track the number of reported pages in a given free area.
> When the number of free pages exceeds this value plus a high water value,
> currently 32, it will begin performing page reporting which consists of
> pulling pages off of free list and placing them into a scatter list. The
> scatterlist is then given to the page reporting device and it will perform
> the required action to make the pages "reported", in the case of
> virtio-balloon this results in the pages being madvised as MADV_DONTNEED
> and as such they are forced out of the guest. After this they are placed
> back on the free list, and an additional bit is added if they are not
> merged indicating that they are a reported buddy page instead of a
> standard buddy page. The cycle then repeats with additional non-reported
> pages being pulled until the free areas all consist of reported pages.
>
> I am leaving a number of things hard-coded such as limiting the lowest
> order processed to PAGEBLOCK_ORDER, and have left it up to the guest to
> determine what the limit is on how many pages it wants to allocate to
> process the hints. The upper limit for this is based on the size of the
> queue used to store the scatterlist.
>
> My primary testing has just been to verify the memory is being freed after
> allocation by running memhog 40g on a 40g guest and watching the total
> free memory via /proc/meminfo on the host. With this I have verified most
> of the memory is freed after each iteration. As far as performance I have
> been mainly focusing on the will-it-scale/page_fault1 test running with
> 16 vcpus. With that I have seen up to a 2% difference between the base
> kernel without these patches and the patches with virtio-balloon enabled
> or disabled.

A couple of questions:

- The 2% difference you have mentioned, is it visible for
  all 16 cores or just the 16th core?
- I am assuming that the difference is seen for both the "number of processes"
  and "number of threads" runs launched by page_fault1. Is that right?

>
> One side effect of these patches is that the guest becomes much more
> resilient in terms of NUMA locality. With the pages being freed and then
> reallocated when used it allows for the pages to be much closer to the
> active thread, and as a result there can be situations where this patch
> set will out-perform the stock kernel when the guest memory is not local
> to the guest vCPUs.


Was this the reason you were seeing better results for page_fault1
earlier?

>
> Patch 4 is a bit on the large side at about 600 lines of change, however
> I really didn't see a good way to break it up since each piece feeds into
> the next. So I couldn't add the statistics by themselves as it didn't
> really make sense to add them without something that will either read or
> increment/decrement them, or add the Hinted state without something that
> would set/unset it. As such I just ended up adding the entire thing as
> one patch. It makes it a bit bigger but avoids the issues in the previous
> set where I was referencing things that had not yet been added.
>
> Changes from the RFC:
> https://lore.kernel.org/lkml/20190530215223.13974.22445.stgit@localhost.localdomain/
> Moved aeration requested flag out of aerator and into zone->flags.
> Moved boundary out of free_area and into local variables for aeration.
> Moved aeration cycle out of interrupt and into workqueue.
> Left nr_free as total pages instead of splitting it between raw and aerated.
> Combined size and physical address values in virtio ring into one 64b value.
>
> Changes from v1:
> https://lore.kernel.org/lkml/20190619222922.1231.27432.stgit@localhost.localdomain/
> Dropped "waste page treatment" in favor of "page hinting"
> Renamed files and functions from "aeration" to "page_hinting"
> Moved from page->lru list to scatterlist
> Replaced wait on refcnt in shutdown with RCU and cancel_delayed_work_sync
> Virtio now uses scatterlist directly instead of intermediate array
> Moved stats out of free_area, now in separate area and pointed to from zone
> Merged patch 5 into patch 4 to improve review-ability
> Updated various code comments throughout
>
> Changes from v2:
> https://lore.kernel.org/lkml/20190724165158.6685.87228.stgit@localhost.localdomain/
> Dropped "page hinting" in favor of "page reporting"
> Renamed files from "hinting" to "reporting"
> Replaced "Hinted" page type with "Reported" page flag
> Added support for page poisoning while hinting is active
> Add QEMU patch that implements PAGE_POISON feature
>
> ---
>
> Alexander Duyck (6):
>       mm: Adjust shuffle code to allow for future coalescing
>       mm: Move set/get_pcppage_migratetype to mmzone.h
>       mm: Use zone and order instead of free area in free_list manipulators
>       mm: Introduce Reported pages
>       virtio-balloon: Pull page poisoning config out of free page hinting
>       virtio-balloon: Add support for providing unused page reports to host
>
>
>  drivers/virtio/Kconfig              |    1 
>  drivers/virtio/virtio_balloon.c     |   75 ++++++++-
>  include/linux/mmzone.h              |  116 ++++++++------
>  include/linux/page-flags.h          |   11 +
>  include/linux/page_reporting.h      |  138 ++++++++++++++++
>  include/uapi/linux/virtio_balloon.h |    1 
>  mm/Kconfig                          |    5 +
>  mm/Makefile                         |    1 
>  mm/internal.h                       |   18 ++
>  mm/memory_hotplug.c                 |    1 
>  mm/page_alloc.c                     |  238 ++++++++++++++++++++--------
>  mm/page_reporting.c                 |  299 +++++++++++++++++++++++++++++++++++
>  mm/shuffle.c                        |   24 ---
>  mm/shuffle.h                        |   32 ++++
>  14 files changed, 821 insertions(+), 139 deletions(-)
>  create mode 100644 include/linux/page_reporting.h
>  create mode 100644 mm/page_reporting.c
>
> --
>
Alexander Duyck Aug. 2, 2019, 3:13 p.m. UTC | #2
On Fri, 2019-08-02 at 10:41 -0400, Nitesh Narayan Lal wrote:
> On 8/1/19 6:24 PM, Alexander Duyck wrote:
> > This series provides an asynchronous means of reporting to a hypervisor
> > that a guest page is no longer in use and can have the data associated
> > with it dropped. To do this I have implemented functionality that allows
> > for what I am referring to as unused page reporting
> > 
> > The functionality for this is fairly simple. When enabled it will allocate
> > statistics to track the number of reported pages in a given free area.
> > When the number of free pages exceeds this value plus a high water value,
> > currently 32, it will begin performing page reporting which consists of
> > pulling pages off of free list and placing them into a scatter list. The
> > scatterlist is then given to the page reporting device and it will perform
> > the required action to make the pages "reported", in the case of
> > virtio-balloon this results in the pages being madvised as MADV_DONTNEED
> > and as such they are forced out of the guest. After this they are placed
> > back on the free list, and an additional bit is added if they are not
> > merged indicating that they are a reported buddy page instead of a
> > standard buddy page. The cycle then repeats with additional non-reported
> > pages being pulled until the free areas all consist of reported pages.
> > 
> > I am leaving a number of things hard-coded such as limiting the lowest
> > order processed to PAGEBLOCK_ORDER, and have left it up to the guest to
> > determine what the limit is on how many pages it wants to allocate to
> > process the hints. The upper limit for this is based on the size of the
> > queue used to store the scatterlist.
> > 
> > My primary testing has just been to verify the memory is being freed after
> > allocation by running memhog 40g on a 40g guest and watching the total
> > free memory via /proc/meminfo on the host. With this I have verified most
> > of the memory is freed after each iteration. As far as performance I have
> > been mainly focusing on the will-it-scale/page_fault1 test running with
> > 16 vcpus. With that I have seen up to a 2% difference between the base
> > kernel without these patches and the patches with virtio-balloon enabled
> > or disabled.
> 
> A couple of questions:
> 
> - The 2% difference which you have mentioned, is this visible for
>   all the 16 cores or just the 16th core?
> - I am assuming that the difference is seen for both "number of process"
>   and "number of threads" launched by page_fault1. Is that right?

Really, the 2% is bordering on just being noise. Sometimes it is better,
sometimes it is worse. However, I think it is just slight variability in
the tests, since it doesn't usually form any specific pattern.

I have been able to tighten it down a bit by actually splitting my guest
over 2 nodes and pinning the vCPUs so that the nodes in the guest match up
to the nodes in the host. Doing that I have seen results with less than 1%
variability between runs with the patches and without.

One thing I am looking at now is modifying the page_fault1 test to use THP
instead of 4K pages, as I suspect there is a fair bit of overhead in
accessing the pages 4K at a time vs 2M at a time. I am hoping that will let
me put more pressure on the actual change and see if there are any
additional spots I should optimize.

> > One side effect of these patches is that the guest becomes much more
> > resilient in terms of NUMA locality. With the pages being freed and then
> > reallocated when used it allows for the pages to be much closer to the
> > active thread, and as a result there can be situations where this patch
> > set will out-perform the stock kernel when the guest memory is not local
> > to the guest vCPUs.
> 
> Was this the reason because of which you were seeing better results for
> page_fault1 earlier?

Yes I am thinking so. What I have found is that in the case where the
patches are not applied on the guest it takes a few runs for the numbers
to stabilize. What I think was going on is that I was running memhog to
initially fill the guest and that was placing all the pages on one node or
the other and as such was causing additional variability as the pages were
slowly being migrated over to the other node to rebalance the workload.
One way I tested it was by trying the unpatched case with a direct-
assigned device since that forces it to pin the memory. In that case I was
getting bad results consistently as all the memory was forced to come from
one node during the pre-allocation process.
Nitesh Narayan Lal Aug. 2, 2019, 4:19 p.m. UTC | #3
On 8/2/19 11:13 AM, Alexander Duyck wrote:
> On Fri, 2019-08-02 at 10:41 -0400, Nitesh Narayan Lal wrote:
>> On 8/1/19 6:24 PM, Alexander Duyck wrote:
>>> This series provides an asynchronous means of reporting to a hypervisor
>>> that a guest page is no longer in use and can have the data associated
>>> with it dropped. To do this I have implemented functionality that allows
>>> for what I am referring to as unused page reporting
>>>
>>> The functionality for this is fairly simple. When enabled it will allocate
>>> statistics to track the number of reported pages in a given free area.
>>> When the number of free pages exceeds this value plus a high water value,
>>> currently 32, it will begin performing page reporting which consists of
>>> pulling pages off of free list and placing them into a scatter list. The
>>> scatterlist is then given to the page reporting device and it will perform
>>> the required action to make the pages "reported", in the case of
>>> virtio-balloon this results in the pages being madvised as MADV_DONTNEED
>>> and as such they are forced out of the guest. After this they are placed
>>> back on the free list, and an additional bit is added if they are not
>>> merged indicating that they are a reported buddy page instead of a
>>> standard buddy page. The cycle then repeats with additional non-reported
>>> pages being pulled until the free areas all consist of reported pages.
>>>
>>> I am leaving a number of things hard-coded such as limiting the lowest
>>> order processed to PAGEBLOCK_ORDER, and have left it up to the guest to
>>> determine what the limit is on how many pages it wants to allocate to
>>> process the hints. The upper limit for this is based on the size of the
>>> queue used to store the scatterlist.
>>>
>>> My primary testing has just been to verify the memory is being freed after
>>> allocation by running memhog 40g on a 40g guest and watching the total
>>> free memory via /proc/meminfo on the host. With this I have verified most
>>> of the memory is freed after each iteration. As far as performance I have
>>> been mainly focusing on the will-it-scale/page_fault1 test running with
>>> 16 vcpus. With that I have seen up to a 2% difference between the base
>>> kernel without these patches and the patches with virtio-balloon enabled
>>> or disabled.
>> A couple of questions:
>>
>> - The 2% difference which you have mentioned, is this visible for
>>   all the 16 cores or just the 16th core?
>> - I am assuming that the difference is seen for both "number of process"
>>   and "number of threads" launched by page_fault1. Is that right?
> Really, the 2% is bordering on just being noise. Sometimes it is better
> sometimes it is worse. However I think it is just slight variability in
> the tests since it doesn't usually form any specific pattern.
>
> I have been able to tighten it down a bit by actually splitting my guest
> over 2 nodes and pinning the vCPUs so that the nodes in the guest match up
> to the nodes in the host. Doing that I have seen results where I had less
> than 1% variability between with the patches and without.

Interesting. I usually pin the guest to a single NUMA node to avoid this.

>
> One thing I am looking at now is modifying the page_fault1 test to use THP
> instead of 4K pages as I suspect there is a fair bit of overhead in
> accessing the pages 4K at a time vs 2M at a time. I am hoping with that I
> can put more pressure on the actual change and see if there are any
> additional spots I should optimize.


+1. Right now I don't think will-it-scale touches all of the guest memory.
May I know how much memory will-it-scale/page_fault1 occupies in your case,
and how much you get back with your patch set?

Do you have any plans to run any other benchmarks as well,
just to see the impact on other sub-systems?

>>> One side effect of these patches is that the guest becomes much more
>>> resilient in terms of NUMA locality. With the pages being freed and then
>>> reallocated when used it allows for the pages to be much closer to the
>>> active thread, and as a result there can be situations where this patch
>>> set will out-perform the stock kernel when the guest memory is not local
>>> to the guest vCPUs.
>> Was this the reason because of which you were seeing better results for
>> page_fault1 earlier?
> Yes I am thinking so. What I have found is that in the case where the
> patches are not applied on the guest it takes a few runs for the numbers
> to stabilize. What I think was going on is that I was running memhog to
> initially fill the guest and that was placing all the pages on one node or
> the other and as such was causing additional variability as the pages were
> slowly being migrated over to the other node to rebalance the workload.
> One way I tested it was by trying the unpatched case with a direct-
> assigned device since that forces it to pin the memory. In that case I was
> getting bad results consistently as all the memory was forced to come from
> one node during the pre-allocation process.
>

I have also seen that the page_fault1 values take some time to stabilize on
an unmodified kernel.
What I am wondering here is whether, on a single-NUMA guest, doing the
following would give a better idea or not:

1. Pin the guest to a single NUMA node.
2. Run memhog so that it touches all the guest memory.
3. Run will-it-scale/page_fault1.

Compare/observe the values for the last core (assuming the other core
values don't drastically differ).
Alexander Duyck Aug. 2, 2019, 5:28 p.m. UTC | #4
On Fri, 2019-08-02 at 12:19 -0400, Nitesh Narayan Lal wrote:
> On 8/2/19 11:13 AM, Alexander Duyck wrote:
> > On Fri, 2019-08-02 at 10:41 -0400, Nitesh Narayan Lal wrote:
> > > On 8/1/19 6:24 PM, Alexander Duyck wrote:
> > > > This series provides an asynchronous means of reporting to a hypervisor
> > > > that a guest page is no longer in use and can have the data associated
> > > > with it dropped. To do this I have implemented functionality that allows
> > > > for what I am referring to as unused page reporting
> > > > 
> > > > The functionality for this is fairly simple. When enabled it will allocate
> > > > statistics to track the number of reported pages in a given free area.
> > > > When the number of free pages exceeds this value plus a high water value,
> > > > currently 32, it will begin performing page reporting which consists of
> > > > pulling pages off of free list and placing them into a scatter list. The
> > > > scatterlist is then given to the page reporting device and it will perform
> > > > the required action to make the pages "reported", in the case of
> > > > virtio-balloon this results in the pages being madvised as MADV_DONTNEED
> > > > and as such they are forced out of the guest. After this they are placed
> > > > back on the free list, and an additional bit is added if they are not
> > > > merged indicating that they are a reported buddy page instead of a
> > > > standard buddy page. The cycle then repeats with additional non-reported
> > > > pages being pulled until the free areas all consist of reported pages.
> > > > 
> > > > I am leaving a number of things hard-coded such as limiting the lowest
> > > > order processed to PAGEBLOCK_ORDER, and have left it up to the guest to
> > > > determine what the limit is on how many pages it wants to allocate to
> > > > process the hints. The upper limit for this is based on the size of the
> > > > queue used to store the scatterlist.
> > > > 
> > > > My primary testing has just been to verify the memory is being freed after
> > > > allocation by running memhog 40g on a 40g guest and watching the total
> > > > free memory via /proc/meminfo on the host. With this I have verified most
> > > > of the memory is freed after each iteration. As far as performance I have
> > > > been mainly focusing on the will-it-scale/page_fault1 test running with
> > > > 16 vcpus. With that I have seen up to a 2% difference between the base
> > > > kernel without these patches and the patches with virtio-balloon enabled
> > > > or disabled.
> > > A couple of questions:
> > > 
> > > - The 2% difference which you have mentioned, is this visible for
> > >   all the 16 cores or just the 16th core?
> > > - I am assuming that the difference is seen for both "number of process"
> > >   and "number of threads" launched by page_fault1. Is that right?
> > Really, the 2% is bordering on just being noise. Sometimes it is better
> > sometimes it is worse. However I think it is just slight variability in
> > the tests since it doesn't usually form any specific pattern.
> > 
> > I have been able to tighten it down a bit by actually splitting my guest
> > over 2 nodes and pinning the vCPUs so that the nodes in the guest match up
> > to the nodes in the host. Doing that I have seen results where I had less
> > than 1% variability between with the patches and without.
> 
> Interesting. I usually pin the guest to a single NUMA node to avoid this.

I was trying to put as much stress on this as I could, so my thought was
the more CPUs the better. An added advantage of splitting the guest over
2 nodes is that it splits the zone locks up, which reduces how much of a
bottleneck they are.

> > One thing I am looking at now is modifying the page_fault1 test to use THP
> > instead of 4K pages as I suspect there is a fair bit of overhead in
> > accessing the pages 4K at a time vs 2M at a time. I am hoping with that I
> > can put more pressure on the actual change and see if there are any
> > additional spots I should optimize.
> 
> +1. Right now I don't think will-it-scale touches all the guest memory.
> May I know how much memory does will-it-scale/page_fault1, occupies in your case
> and how much do you get back with your patch-set?

If I recall correctly, each process/thread of the page_fault1 test occupies
128MB of memory per iteration. When you consider that the base case with 1
thread is a half million iterations, that should be something like up to
64GB allocated and freed per thread.

One thing I overlooked testing this time around was a setup with memory
shuffling enabled. That would cause the iterations to use a larger swath
of memory, as each 128MB allocation would have its chunks randomly placed
on the tail of the free lists. I will try to re-run a test on a pair of
kernels with that enabled to see if that has any effect.
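
(For reference, the shuffling I am referring to is the page allocator
shuffling that is enabled with CONFIG_SHUFFLE_PAGE_ALLOCATOR and the
page_alloc.shuffle=1 boot parameter.)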

> Do you have any plans of running any other benchmarks as well?
> Just to see the impact on other sub-systems.

The problem is other benchmarks such as netperf aren't going to show much
since they tend to operate on 4K pages, and add a bunch of additional
overhead such as skb allocation and network header processing.

What I am trying to do is focus on benchmarking just the changes without
getting too much other code pulled in. That is why I am thinking
page_fault1, modified so that it will use MADV_HUGEPAGE, is probably the
ideal test for this. Currently the 4K page size of page_fault1 is likely
adding a bunch of overhead from us having to split and merge pages, and
that would be one of the reasons why the changes are essentially falling
into the noise.

By using THP the test will be triggering allocations of higher-order pages
and then freeing that memory at the higher order as well. It can do that
much quicker, and I can avoid the split/merge overhead. I am seeing
something on the order of about 1.3 million iterations per thread versus
the 500 thousand I was seeing with standard pages.
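
To be clear about the sort of modification I have in mind, here is a
minimal sketch of a THP-friendly page_fault1 iteration. The MEMSIZE value
and the loop structure are simplified from the actual will-it-scale test,
so treat this as an illustration rather than the exact change:

#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define MEMSIZE	(128UL * 1024 * 1024)	/* 128MB per iteration, as above */

static void page_fault_iteration(void)
{
	char *c = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (c == MAP_FAILED)
		exit(1);

	/* The only real change: hint that the region should be backed by THP */
	madvise(c, MEMSIZE, MADV_HUGEPAGE);

	/* Touch every page so it actually gets faulted in */
	memset(c, 1, MEMSIZE);

	munmap(c, MEMSIZE);
}

int main(void)
{
	for (int i = 0; i < 16; i++)
		page_fault_iteration();

	return 0;
}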

> > > > One side effect of these patches is that the guest becomes much more
> > > > resilient in terms of NUMA locality. With the pages being freed and then
> > > > reallocated when used it allows for the pages to be much closer to the
> > > > active thread, and as a result there can be situations where this patch
> > > > set will out-perform the stock kernel when the guest memory is not local
> > > > to the guest vCPUs.
> > > Was this the reason because of which you were seeing better results for
> > > page_fault1 earlier?
> > Yes I am thinking so. What I have found is that in the case where the
> > patches are not applied on the guest it takes a few runs for the numbers
> > to stabilize. What I think was going on is that I was running memhog to
> > initially fill the guest and that was placing all the pages on one node or
> > the other and as such was causing additional variability as the pages were
> > slowly being migrated over to the other node to rebalance the workload.
> > One way I tested it was by trying the unpatched case with a direct-
> > assigned device since that forces it to pin the memory. In that case I was
> > getting bad results consistently as all the memory was forced to come from
> > one node during the pre-allocation process.
> > 
> 
> I have also seen that the page_fault1 values take some time to get stabilize on
> an unmodified kernel.
> What I am wondering here is that if on a single NUMA guest doing the following
> will give the right/better idea or not:
> 
> 1. Pin the guest to a single NUMA node.
> 2. Run memhog so that it touches all the guest memory.
> 3. Run will-it-scale/page_fault1.
> 
> Compare/observe the values for the last core (this is considering the other core
> values doesn't drastically differ).

I'll rerun the test with qemu affinitized to one specific socket. It will
cut the core/thread count down to 8/16 on my test system. Also I will try
with THP and page shuffling enabled.
Alexander Duyck Aug. 2, 2019, 11:15 p.m. UTC | #5
On Fri, 2019-08-02 at 10:28 -0700, Alexander Duyck wrote:
> On Fri, 2019-08-02 at 12:19 -0400, Nitesh Narayan Lal wrote:
> > On 8/2/19 11:13 AM, Alexander Duyck wrote:
> > > On Fri, 2019-08-02 at 10:41 -0400, Nitesh Narayan Lal wrote:
> > > > On 8/1/19 6:24 PM, Alexander Duyck wrote:
> > > > > 

<snip>

> > > > > One side effect of these patches is that the guest becomes much more
> > > > > resilient in terms of NUMA locality. With the pages being freed and then
> > > > > reallocated when used it allows for the pages to be much closer to the
> > > > > active thread, and as a result there can be situations where this patch
> > > > > set will out-perform the stock kernel when the guest memory is not local
> > > > > to the guest vCPUs.
> > > > Was this the reason because of which you were seeing better results for
> > > > page_fault1 earlier?
> > > Yes I am thinking so. What I have found is that in the case where the
> > > patches are not applied on the guest it takes a few runs for the numbers
> > > to stabilize. What I think was going on is that I was running memhog to
> > > initially fill the guest and that was placing all the pages on one node or
> > > the other and as such was causing additional variability as the pages were
> > > slowly being migrated over to the other node to rebalance the workload.
> > > One way I tested it was by trying the unpatched case with a direct-
> > > assigned device since that forces it to pin the memory. In that case I was
> > > getting bad results consistently as all the memory was forced to come from
> > > one node during the pre-allocation process.
> > > 
> > 
> > I have also seen that the page_fault1 values take some time to get stabilize on
> > an unmodified kernel.
> > What I am wondering here is that if on a single NUMA guest doing the following
> > will give the right/better idea or not:
> > 
> > 1. Pin the guest to a single NUMA node.
> > 2. Run memhog so that it touches all the guest memory.
> > 3. Run will-it-scale/page_fault1.
> > 
> > Compare/observe the values for the last core (this is considering the other core
> > values doesn't drastically differ).
> 
> I'll rerun the test with qemu affinitized to one specific socket. It will
> cut the core/thread count down to 8/16 on my test system. Also I will try
> with THP and page shuffling enabled.

Okay, so here are the results with 8 cores/16 threads all affinitized to
one socket, THP-enabled page_fault1, and shuffling enabled:

With page reporting disabled in the hypervisor there wasn't much
difference. I saw a range of 0.69% to -1.35% versus baseline, and an
average of 0.16% improvement. So effectively no change.

With page reporting enabled I saw a range of -2.10% to -4.50%, with an
average of a -3.05% regression. This is much closer to what I would expect
for this patch set, as the page faulting, double zeroing (once in the host
and once in the guest), and the hinting process itself should have some
overhead.