[v2,0/5] mm / virtio: Provide support for page hinting

Message ID 20190724165158.6685.87228.stgit@localhost.localdomain

Message

Alexander Duyck July 24, 2019, 4:54 p.m. UTC
This series provides an asynchronous means of hinting to a hypervisor
that a guest page is no longer in use and can have the data associated
with it dropped. To do this I have implemented functionality that allows
for what I am referring to as page hinting.

The functionality for this is fairly simple. When enabled it will allocate
statistics to track the number of hinted pages in a given free area. When
the number of free pages exceeds this value plus a high water value,
currently 32, it will begin performing page hinting which consists of
pulling pages off of the free list and placing them into a scatter list. The
scatterlist is then given to the page hinting device and it will perform
the required action to make the pages "hinted", in the case of
virtio-balloon this results in the pages being madvised as MADV_DONTNEED
and as such they are forced out of the guest. After this they are placed
back on the free list, and an additional bit is added if they are not
merged indicating that they are a hinted buddy page instead of a standard
buddy page. The cycle then repeats with additional non-hinted pages being
pulled until the free areas all consist of hinted pages.

I am leaving a number of things hard-coded such as limiting the lowest
order processed to PAGEBLOCK_ORDER, and have left it up to the guest to
determine what the limit is on how many pages it wants to allocate to
process the hints.
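
As an illustration, the cycle just described can be modeled with the
standalone C sketch below. The stats structure and every name in it are
invented for this example and are not the code from these patches:

#include <stdio.h>

/* Toy model of the hinting cycle; every name here is an illustrative
 * placeholder rather than the actual patch code. */
#define HINTING_HWM 32                  /* the high water value, currently 32 */

struct free_area_stats {
        unsigned long nr_free;          /* pages on the free list */
        unsigned long nr_hinted;        /* of those, already hinted */
};

/* Stand-in for handing a scatterlist to the hinting device; with
 * virtio-balloon the host madvises these pages MADV_DONTNEED. */
static void report_batch(unsigned long nr)
{
        printf("hinting %lu pages\n", nr);
}

static void hint_cycle(struct free_area_stats *fa)
{
        /* Keep pulling while non-hinted free pages exceed the mark. */
        while (fa->nr_free > fa->nr_hinted + HINTING_HWM) {
                unsigned long batch = fa->nr_free - fa->nr_hinted;

                if (batch > HINTING_HWM)
                        batch = HINTING_HWM;    /* capacity per request */
                report_batch(batch);
                /* Pages return to the free list flagged as hinted
                 * buddy pages unless they merged. */
                fa->nr_hinted += batch;
        }
}

int main(void)
{
        struct free_area_stats fa = { .nr_free = 100, .nr_hinted = 0 };

        hint_cycle(&fa);        /* stops once at most 32 non-hinted remain */
        return 0;
}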

My primary testing has just been to verify the memory is being freed after
allocation by running memhog 79g on an 80g guest and watching the total
free memory via /proc/meminfo on the host. With this I have verified most
of the memory is freed after each iteration. As far as performance I have
been mainly focusing on the will-it-scale/page_fault1 test running with
16 vcpus. With that I have seen at most a 2% difference between the base
kernel without these patches and the patches with virtio-balloon disabled.
With the patches and virtio-balloon enabled with hinting the results
largely depend on the host kernel. On a 3.10 RHEL kernel I saw up to a 2%
drop in performance as I approached 16 threads, however on the latest
linux-next kernel I saw roughly a 4% to 5% improvement in performance for
all tests with 8 or more threads. I believe the difference seen is due to
the overhead for faulting pages back into the guest and zeroing of memory.

Patch 4 is a bit on the large side at about 600 lines of change, however
I really didn't see a good way to break it up since each piece feeds into
the next. So I couldn't add the statistics by themselves as it didn't
really make sense to add them without something that will either read or
increment/decrement them, or add the Hinted state without something that
would set/unset it. As such I just ended up adding the entire thing as
one patch. It makes it a bit bigger but avoids the issues in the previous
set where I was referencing things before they had been added.

Changes from the RFC:
https://lore.kernel.org/lkml/20190530215223.13974.22445.stgit@localhost.localdomain/
Moved aeration requested flag out of aerator and into zone->flags.
Moved boundary out of free_area and into local variables for aeration.
Moved aeration cycle out of interrupt and into workqueue.
Left nr_free as total pages instead of splitting it between raw and aerated.
Combined size and physical address values in virtio ring into one 64b value.

Changes from v1:
https://lore.kernel.org/lkml/20190619222922.1231.27432.stgit@localhost.localdomain/
Dropped "waste page treatment" in favor of "page hinting"
Renamed files and functions from "aeration" to "page_hinting"
Moved from page->lru list to scatterlist
Replaced wait on refcnt in shutdown with RCU and cancel_delayed_work_sync
Virtio now uses scatterlist directly instead of intermediate array
Moved stats out of free_area, now in separate area and pointed to from zone
Merged patch 5 into patch 4 to improve reviewability
Updated various code comments throughout

---

Alexander Duyck (5):
      mm: Adjust shuffle code to allow for future coalescing
      mm: Move set/get_pcppage_migratetype to mmzone.h
      mm: Use zone and order instead of free area in free_list manipulators
      mm: Introduce Hinted pages
      virtio-balloon: Add support for providing page hints to host


 drivers/virtio/Kconfig              |    1 
 drivers/virtio/virtio_balloon.c     |   47 ++++++
 include/linux/mmzone.h              |  116 ++++++++------
 include/linux/page-flags.h          |    8 +
 include/linux/page_hinting.h        |  139 ++++++++++++++++
 include/uapi/linux/virtio_balloon.h |    1 
 mm/Kconfig                          |    5 +
 mm/Makefile                         |    1 
 mm/internal.h                       |   18 ++
 mm/memory_hotplug.c                 |    1 
 mm/page_alloc.c                     |  238 ++++++++++++++++++++--------
 mm/page_hinting.c                   |  298 +++++++++++++++++++++++++++++++++++
 mm/shuffle.c                        |   24 ---
 mm/shuffle.h                        |   32 ++++
 14 files changed, 796 insertions(+), 133 deletions(-)
 create mode 100644 include/linux/page_hinting.h
 create mode 100644 mm/page_hinting.c

--

Comments

Nitesh Narayan Lal July 24, 2019, 6:40 p.m. UTC | #1
On 7/24/19 12:54 PM, Alexander Duyck wrote:
> This series provides an asynchronous means of hinting to a hypervisor
> that a guest page is no longer in use and can have the data associated
> with it dropped. To do this I have implemented functionality that allows
> for what I am referring to as page hinting.
>
> The functionality for this is fairly simple. When enabled it will allocate
> statistics to track the number of hinted pages in a given free area. When
> the number of free pages exceeds this value plus a high water value,
> currently 32,
Shouldn't we configure this to a lower number such as 16?
>  it will begin performing page hinting which consists of
> pulling pages off of the free list and placing them into a scatter list. The
> scatterlist is then given to the page hinting device and it will perform
> the required action to make the pages "hinted", in the case of
> virtio-balloon this results in the pages being madvised as MADV_DONTNEED
> and as such they are forced out of the guest. After this they are placed
> back on the free list, and an additional bit is added if they are not
> merged indicating that they are a hinted buddy page instead of a standard
> buddy page. The cycle then repeats with additional non-hinted pages being
> pulled until the free areas all consist of hinted pages.
>
> I am leaving a number of things hard-coded such as limiting the lowest
> order processed to PAGEBLOCK_ORDER,
Have you considered making this option configurable at the compile time?
>  and have left it up to the guest to
> determine what the limit is on how many pages it wants to allocate to
> process the hints.
It might make sense to set the number of pages to be hinted at a time from the
hypervisor.
>
> My primary testing has just been to verify the memory is being freed after
> allocation by running memhog 79g on an 80g guest and watching the total
> free memory via /proc/meminfo on the host. With this I have verified most
> of the memory is freed after each iteration. As far as performance I have
> been mainly focusing on the will-it-scale/page_fault1 test running with
> 16 vcpus. With that I have seen at most a 2% difference between the base
> kernel without these patches and the patches with virtio-balloon disabled.
> With the patches and virtio-balloon enabled with hinting the results
> largely depend on the host kernel. On a 3.10 RHEL kernel I saw up to a 2%
> drop in performance as I approached 16 threads,
I think this is acceptable.
>  however on the latest
> linux-next kernel I saw roughly a 4% to 5% improvement in performance for
> all tests with 8 or more threads. 
Do you mean that with your patches the will-it-scale/page_fault1 numbers were
better by 4-5% over an unmodified kernel?
> I believe the difference seen is due to
> the overhead for faulting pages back into the guest and zeroing of memory.
It may also make sense to test these patches with netperf to observe how much
of a performance drop they introduce.
> Patch 4 is a bit on the large side at about 600 lines of change, however
> I really didn't see a good way to break it up since each piece feeds into
> the next. So I couldn't add the statistics by themselves as it didn't
> really make sense to add them without something that will either read or
> increment/decrement them, or add the Hinted state without something that
> would set/unset it. As such I just ended up adding the entire thing as
> one patch. It makes it a bit bigger but avoids the issues in the previous
> set where I was referencing things before they had been added.
>
> Changes from the RFC:
> https://lore.kernel.org/lkml/20190530215223.13974.22445.stgit@localhost.localdomain/
> Moved aeration requested flag out of aerator and into zone->flags.
> Moved boundary out of free_area and into local variables for aeration.
> Moved aeration cycle out of interrupt and into workqueue.
> Left nr_free as total pages instead of splitting it between raw and aerated.
> Combined size and physical address values in virtio ring into one 64b value.
>
> Changes from v1:
> https://lore.kernel.org/lkml/20190619222922.1231.27432.stgit@localhost.localdomain/
> Dropped "waste page treatment" in favor of "page hinting"
We may still have to try and find a better name for the virtio-balloon side changes,
as "FREE_PAGE_HINT" and "PAGE_HINTING" are still confusing.
> Renamed files and functions from "aeration" to "page_hinting"
> Moved from page->lru list to scatterlist
> Replaced wait on refcnt in shutdown with RCU and cancel_delayed_work_sync
> Virtio now uses scatterlist directly instead of intermediate array
> Moved stats out of free_area, now in separate area and pointed to from zone
> Merged patch 5 into patch 4 to improve reviewability
> Updated various code comments throughout
>
> ---
>
> Alexander Duyck (5):
>       mm: Adjust shuffle code to allow for future coalescing
>       mm: Move set/get_pcppage_migratetype to mmzone.h
>       mm: Use zone and order instead of free area in free_list manipulators
>       mm: Introduce Hinted pages
>       virtio-balloon: Add support for providing page hints to host
>
>
>  drivers/virtio/Kconfig              |    1 
>  drivers/virtio/virtio_balloon.c     |   47 ++++++
>  include/linux/mmzone.h              |  116 ++++++++------
>  include/linux/page-flags.h          |    8 +
>  include/linux/page_hinting.h        |  139 ++++++++++++++++
>  include/uapi/linux/virtio_balloon.h |    1 
>  mm/Kconfig                          |    5 +
>  mm/Makefile                         |    1 
>  mm/internal.h                       |   18 ++
>  mm/memory_hotplug.c                 |    1 
>  mm/page_alloc.c                     |  238 ++++++++++++++++++++--------
>  mm/page_hinting.c                   |  298 +++++++++++++++++++++++++++++++++++
>  mm/shuffle.c                        |   24 ---
>  mm/shuffle.h                        |   32 ++++
>  14 files changed, 796 insertions(+), 133 deletions(-)
>  create mode 100644 include/linux/page_hinting.h
>  create mode 100644 mm/page_hinting.c
>
> --
David Hildenbrand July 24, 2019, 6:41 p.m. UTC | #2
On 24.07.19 20:40, Nitesh Narayan Lal wrote:
> 
> On 7/24/19 12:54 PM, Alexander Duyck wrote:
>> This series provides an asynchronous means of hinting to a hypervisor
>> that a guest page is no longer in use and can have the data associated
>> with it dropped. To do this I have implemented functionality that allows
>> for what I am referring to as page hinting.
>>
>> The functionality for this is fairly simple. When enabled it will allocate
>> statistics to track the number of hinted pages in a given free area. When
>> the number of free pages exceeds this value plus a high water value,
>> currently 32,
> Shouldn't we configure this to a lower number such as 16?
>>  it will begin performing page hinting which consists of
>> pulling pages off of the free list and placing them into a scatter list. The
>> scatterlist is then given to the page hinting device and it will perform
>> the required action to make the pages "hinted", in the case of
>> virtio-balloon this results in the pages being madvised as MADV_DONTNEED
>> and as such they are forced out of the guest. After this they are placed
>> back on the free list, and an additional bit is added if they are not
>> merged indicating that they are a hinted buddy page instead of a standard
>> buddy page. The cycle then repeats with additional non-hinted pages being
>> pulled until the free areas all consist of hinted pages.
>>
>> I am leaving a number of things hard-coded such as limiting the lowest
>> order processed to PAGEBLOCK_ORDER,
> Have you considered making this option configurable at the compile time?
>>  and have left it up to the guest to
>> determine what the limit is on how many pages it wants to allocate to
>> process the hints.
> It might make sense to set the number of pages to be hinted at a time from the
> hypervisor.
>>
>> My primary testing has just been to verify the memory is being freed after
>> allocation by running memhog 79g on an 80g guest and watching the total
>> free memory via /proc/meminfo on the host. With this I have verified most
>> of the memory is freed after each iteration. As far as performance I have
>> been mainly focusing on the will-it-scale/page_fault1 test running with
>> 16 vcpus. With that I have seen at most a 2% difference between the base
>> kernel without these patches and the patches with virtio-balloon disabled.
>> With the patches and virtio-balloon enabled with hinting the results
>> largely depend on the host kernel. On a 3.10 RHEL kernel I saw up to a 2%
>> drop in performance as I approached 16 threads,
> I think this is acceptable.
>>  however on the latest
>> linux-next kernel I saw roughly a 4% to 5% improvement in performance for
>> all tests with 8 or more threads. 
> Do you mean that with your patches the will-it-scale/page_fault1 numbers were
> better by 4-5% over an unmodified kernel?
>> I believe the difference seen is due to
>> the overhead for faulting pages back into the guest and zeroing of memory.
> It may also make sense to test these patches with netperf to observe how much
> of a performance drop they introduce.
>> Patch 4 is a bit on the large side at about 600 lines of change, however
>> I really didn't see a good way to break it up since each piece feeds into
>> the next. So I couldn't add the statistics by themselves as it didn't
>> really make sense to add them without something that will either read or
>> increment/decrement them, or add the Hinted state without something that
>> would set/unset it. As such I just ended up adding the entire thing as
>> one patch. It makes it a bit bigger but avoids the issues in the previous
>> set where I was referencing things before they had been added.
>>
>> Changes from the RFC:
>> https://lore.kernel.org/lkml/20190530215223.13974.22445.stgit@localhost.localdomain/
>> Moved aeration requested flag out of aerator and into zone->flags.
>> Moved boundary out of free_area and into local variables for aeration.
>> Moved aeration cycle out of interrupt and into workqueue.
>> Left nr_free as total pages instead of splitting it between raw and aerated.
>> Combined size and physical address values in virtio ring into one 64b value.
>>
>> Changes from v1:
>> https://lore.kernel.org/lkml/20190619222922.1231.27432.stgit@localhost.localdomain/
>> Dropped "waste page treatment" in favor of "page hinting"
> We may still have to try and find a better name for the virtio-balloon side changes,
> as "FREE_PAGE_HINT" and "PAGE_HINTING" are still confusing.

We should have named that free page reporting, but that train already
has left.
Michael S. Tsirkin July 24, 2019, 7:24 p.m. UTC | #3
On Wed, Jul 24, 2019 at 02:40:02PM -0400, Nitesh Narayan Lal wrote:
> 
> On 7/24/19 12:54 PM, Alexander Duyck wrote:
> > This series provides an asynchronous means of hinting to a hypervisor
> > that a guest page is no longer in use and can have the data associated
> > with it dropped. To do this I have implemented functionality that allows
> > for what I am referring to as page hinting.
> >
> > The functionality for this is fairly simple. When enabled it will allocate
> > statistics to track the number of hinted pages in a given free area. When
> > the number of free pages exceeds this value plus a high water value,
> > currently 32,
> Shouldn't we configure this to a lower number such as 16?
> >  it will begin performing page hinting which consists of
> > pulling pages off of the free list and placing them into a scatter list. The
> > scatterlist is then given to the page hinting device and it will perform
> > the required action to make the pages "hinted", in the case of
> > virtio-balloon this results in the pages being madvised as MADV_DONTNEED
> > and as such they are forced out of the guest. After this they are placed
> > back on the free list, and an additional bit is added if they are not
> > merged indicating that they are a hinted buddy page instead of a standard
> > buddy page. The cycle then repeats with additional non-hinted pages being
> > pulled until the free areas all consist of hinted pages.
> >
> > I am leaving a number of things hard-coded such as limiting the lowest
> > order processed to PAGEBLOCK_ORDER,
> Have you considered making this option configurable at the compile time?
> >  and have left it up to the guest to
> > determine what the limit is on how many pages it wants to allocate to
> > process the hints.
> It might make sense to set the number of pages to be hinted at a time from the
> hypervisor.
> >
> > My primary testing has just been to verify the memory is being freed after
> > allocation by running memhog 79g on an 80g guest and watching the total
> > free memory via /proc/meminfo on the host. With this I have verified most
> > of the memory is freed after each iteration. As far as performance I have
> > been mainly focusing on the will-it-scale/page_fault1 test running with
> > 16 vcpus. With that I have seen at most a 2% difference between the base
> > kernel without these patches and the patches with virtio-balloon disabled.
> > With the patches and virtio-balloon enabled with hinting the results
> > largely depend on the host kernel. On a 3.10 RHEL kernel I saw up to a 2%
> > drop in performance as I approached 16 threads,
> I think this is acceptable.
> >  however on the latest
> > linux-next kernel I saw roughly a 4% to 5% improvement in performance for
> > all tests with 8 or more threads. 
> Do you mean that with your patches the will-it-scale/page_fault1 numbers were
> better by 4-5% over an unmodified kernel?
> > I believe the difference seen is due to
> > the overhead for faulting pages back into the guest and zeroing of memory.
> It may also make sense to test these patches with netperf to observe how much
> of a performance drop they introduce.
> > Patch 4 is a bit on the large side at about 600 lines of change, however
> > I really didn't see a good way to break it up since each piece feeds into
> > the next. So I couldn't add the statistics by themselves as it didn't
> > really make sense to add them without something that will either read or
> > increment/decrement them, or add the Hinted state without something that
> > would set/unset it. As such I just ended up adding the entire thing as
> > one patch. It makes it a bit bigger but avoids the issues in the previous
> > set where I was referencing things before they had been added.
> >
> > Changes from the RFC:
> > https://lore.kernel.org/lkml/20190530215223.13974.22445.stgit@localhost.localdomain/
> > Moved aeration requested flag out of aerator and into zone->flags.
> > Moved boundary out of free_area and into local variables for aeration.
> > Moved aeration cycle out of interrupt and into workqueue.
> > Left nr_free as total pages instead of splitting it between raw and aerated.
> > Combined size and physical address values in virtio ring into one 64b value.
> >
> > Changes from v1:
> > https://lore.kernel.org/lkml/20190619222922.1231.27432.stgit@localhost.localdomain/
> > Dropped "waste page treatment" in favor of "page hinting"
> We may still have to try and find a better name for the virtio-balloon side changes,
> as "FREE_PAGE_HINT" and "PAGE_HINTING" are still confusing.

Right. In fact I'm not sure why these should be called hints at all.
VIRTIO_BALLOON_F_FREE_PAGE_HINT is a hint because pages might no
longer be free by the time they are reported.

I think of this one as a free page reporting capability.
I don't really mind what the internal kernel functions are called,
but I think for virtio uapi things that's a better name.

> > Renamed files and functions from "aeration" to "page_hinting"
> > Moved from page->lru list to scatterlist
> > Replaced wait on refcnt in shutdown with RCU and cancel_delayed_work_sync
> > Virtio now uses scatterlist directly instead of intermediate array
> > Moved stats out of free_area, now in separate area and pointed to from zone
> > Merged patch 5 into patch 4 to improve reviewability
> > Updated various code comments throughout
> >
> > ---
> >
> > Alexander Duyck (5):
> >       mm: Adjust shuffle code to allow for future coalescing
> >       mm: Move set/get_pcppage_migratetype to mmzone.h
> >       mm: Use zone and order instead of free area in free_list manipulators
> >       mm: Introduce Hinted pages
> >       virtio-balloon: Add support for providing page hints to host
> >
> >
> >  drivers/virtio/Kconfig              |    1 
> >  drivers/virtio/virtio_balloon.c     |   47 ++++++
> >  include/linux/mmzone.h              |  116 ++++++++------
> >  include/linux/page-flags.h          |    8 +
> >  include/linux/page_hinting.h        |  139 ++++++++++++++++
> >  include/uapi/linux/virtio_balloon.h |    1 
> >  mm/Kconfig                          |    5 +
> >  mm/Makefile                         |    1 
> >  mm/internal.h                       |   18 ++
> >  mm/memory_hotplug.c                 |    1 
> >  mm/page_alloc.c                     |  238 ++++++++++++++++++++--------
> >  mm/page_hinting.c                   |  298 +++++++++++++++++++++++++++++++++++
> >  mm/shuffle.c                        |   24 ---
> >  mm/shuffle.h                        |   32 ++++
> >  14 files changed, 796 insertions(+), 133 deletions(-)
> >  create mode 100644 include/linux/page_hinting.h
> >  create mode 100644 mm/page_hinting.c
> >
> > --
> -- 
> Thanks
> Nitesh
Michael S. Tsirkin July 24, 2019, 7:31 p.m. UTC | #4
On Wed, Jul 24, 2019 at 08:41:33PM +0200, David Hildenbrand wrote:
> On 24.07.19 20:40, Nitesh Narayan Lal wrote:
> > 
> > On 7/24/19 12:54 PM, Alexander Duyck wrote:
> >> This series provides an asynchronous means of hinting to a hypervisor
> >> that a guest page is no longer in use and can have the data associated
> >> with it dropped. To do this I have implemented functionality that allows
> >> for what I am referring to as page hinting.
> >>
> >> The functionality for this is fairly simple. When enabled it will allocate
> >> statistics to track the number of hinted pages in a given free area. When
> >> the number of free pages exceeds this value plus a high water value,
> >> currently 32,
> > Shouldn't we configure this to a lower number such as 16?
> >>  it will begin performing page hinting which consists of
> >> pulling pages off of the free list and placing them into a scatter list. The
> >> scatterlist is then given to the page hinting device and it will perform
> >> the required action to make the pages "hinted", in the case of
> >> virtio-balloon this results in the pages being madvised as MADV_DONTNEED
> >> and as such they are forced out of the guest. After this they are placed
> >> back on the free list, and an additional bit is added if they are not
> >> merged indicating that they are a hinted buddy page instead of a standard
> >> buddy page. The cycle then repeats with additional non-hinted pages being
> >> pulled until the free areas all consist of hinted pages.
> >>
> >> I am leaving a number of things hard-coded such as limiting the lowest
> >> order processed to PAGEBLOCK_ORDER,
> > Have you considered making this option configurable at the compile time?
> >>  and have left it up to the guest to
> >> determine what the limit is on how many pages it wants to allocate to
> >> process the hints.
> > It might make sense to set the number of pages to be hinted at a time from the
> > hypervisor.
> >>
> >> My primary testing has just been to verify the memory is being freed after
> >> allocation by running memhog 79g on an 80g guest and watching the total
> >> free memory via /proc/meminfo on the host. With this I have verified most
> >> of the memory is freed after each iteration. As far as performance I have
> >> been mainly focusing on the will-it-scale/page_fault1 test running with
> >> 16 vcpus. With that I have seen at most a 2% difference between the base
> >> kernel without these patches and the patches with virtio-balloon disabled.
> >> With the patches and virtio-balloon enabled with hinting the results
> >> largely depend on the host kernel. On a 3.10 RHEL kernel I saw up to a 2%
> >> drop in performance as I approached 16 threads,
> > I think this is acceptable.
> >>  however on the latest
> >> linux-next kernel I saw roughly a 4% to 5% improvement in performance for
> >> all tests with 8 or more threads. 
> > Do you mean that with your patches the will-it-scale/page_fault1 numbers were
> > better by 4-5% over an unmodified kernel?
> >> I believe the difference seen is due to
> >> the overhead for faulting pages back into the guest and zeroing of memory.
> > It may also make sense to test these patches with netperf to observe how much
> > of a performance drop they introduce.
> >> Patch 4 is a bit on the large side at about 600 lines of change, however
> >> I really didn't see a good way to break it up since each piece feeds into
> >> the next. So I couldn't add the statistics by themselves as it didn't
> >> really make sense to add them without something that will either read or
> >> increment/decrement them, or add the Hinted state without something that
> >> would set/unset it. As such I just ended up adding the entire thing as
> >> one patch. It makes it a bit bigger but avoids the issues in the previous
> >> set where I was referencing things before they had been added.
> >>
> >> Changes from the RFC:
> >> https://lore.kernel.org/lkml/20190530215223.13974.22445.stgit@localhost.localdomain/
> >> Moved aeration requested flag out of aerator and into zone->flags.
> >> Moved boundary out of free_area and into local variables for aeration.
> >> Moved aeration cycle out of interrupt and into workqueue.
> >> Left nr_free as total pages instead of splitting it between raw and aerated.
> >> Combined size and physical address values in virtio ring into one 64b value.
> >>
> >> Changes from v1:
> >> https://lore.kernel.org/lkml/20190619222922.1231.27432.stgit@localhost.localdomain/
> >> Dropped "waste page treatment" in favor of "page hinting"
> > We may still have to try and find a better name for the virtio-balloon side changes,
> > as "FREE_PAGE_HINT" and "PAGE_HINTING" are still confusing.
> 
> We should have named that free page reporting, but that train already
> has left.

I think VIRTIO_BALLOON_F_FREE_PAGE_HINT is different and arguably
actually does provide hints.

> -- 
> 
> Thanks,
> 
> David / dhildenb
David Hildenbrand July 24, 2019, 7:47 p.m. UTC | #5
On 24.07.19 21:31, Michael S. Tsirkin wrote:
> On Wed, Jul 24, 2019 at 08:41:33PM +0200, David Hildenbrand wrote:
>> On 24.07.19 20:40, Nitesh Narayan Lal wrote:
>>>
>>> On 7/24/19 12:54 PM, Alexander Duyck wrote:
>>>> This series provides an asynchronous means of hinting to a hypervisor
>>>> that a guest page is no longer in use and can have the data associated
>>>> with it dropped. To do this I have implemented functionality that allows
>>>> for what I am referring to as page hinting.
>>>>
>>>> The functionality for this is fairly simple. When enabled it will allocate
>>>> statistics to track the number of hinted pages in a given free area. When
>>>> the number of free pages exceeds this value plus a high water value,
>>>> currently 32,
>>> Shouldn't we configure this to a lower number such as 16?
>>>>  it will begin performing page hinting which consists of
>>>> pulling pages off of the free list and placing them into a scatter list. The
>>>> scatterlist is then given to the page hinting device and it will perform
>>>> the required action to make the pages "hinted", in the case of
>>>> virtio-balloon this results in the pages being madvised as MADV_DONTNEED
>>>> and as such they are forced out of the guest. After this they are placed
>>>> back on the free list, and an additional bit is added if they are not
>>>> merged indicating that they are a hinted buddy page instead of a standard
>>>> buddy page. The cycle then repeats with additional non-hinted pages being
>>>> pulled until the free areas all consist of hinted pages.
>>>>
>>>> I am leaving a number of things hard-coded such as limiting the lowest
>>>> order processed to PAGEBLOCK_ORDER,
>>> Have you considered making this option configurable at the compile time?
>>>>  and have left it up to the guest to
>>>> determine what the limit is on how many pages it wants to allocate to
>>>> process the hints.
>>> It might make sense to set the number of pages to be hinted at a time from the
>>> hypervisor.
>>>>
>>>> My primary testing has just been to verify the memory is being freed after
>>>> allocation by running memhog 79g on an 80g guest and watching the total
>>>> free memory via /proc/meminfo on the host. With this I have verified most
>>>> of the memory is freed after each iteration. As far as performance I have
>>>> been mainly focusing on the will-it-scale/page_fault1 test running with
>>>> 16 vcpus. With that I have seen at most a 2% difference between the base
>>>> kernel without these patches and the patches with virtio-balloon disabled.
>>>> With the patches and virtio-balloon enabled with hinting the results
>>>> largely depend on the host kernel. On a 3.10 RHEL kernel I saw up to a 2%
>>>> drop in performance as I approached 16 threads,
>>> I think this is acceptable.
>>>>  however on the latest
>>>> linux-next kernel I saw roughly a 4% to 5% improvement in performance for
>>>> all tests with 8 or more threads. 
>>> Do you mean that with your patches the will-it-scale/page_fault1 numbers were
>>> better by 4-5% over an unmodified kernel?
>>>> I believe the difference seen is due to
>>>> the overhead for faulting pages back into the guest and zeroing of memory.
>>> It may also make sense to test these patches with netperf to observe how much
>>> of a performance drop they introduce.
>>>> Patch 4 is a bit on the large side at about 600 lines of change, however
>>>> I really didn't see a good way to break it up since each piece feeds into
>>>> the next. So I couldn't add the statistics by themselves as it didn't
>>>> really make sense to add them without something that will either read or
>>>> increment/decrement them, or add the Hinted state without something that
>>>> would set/unset it. As such I just ended up adding the entire thing as
>>>> one patch. It makes it a bit bigger but avoids the issues in the previous
>>>> set where I was referencing things before they had been added.
>>>>
>>>> Changes from the RFC:
>>>> https://lore.kernel.org/lkml/20190530215223.13974.22445.stgit@localhost.localdomain/
>>>> Moved aeration requested flag out of aerator and into zone->flags.
>>>> Moved boundary out of free_area and into local variables for aeration.
>>>> Moved aeration cycle out of interrupt and into workqueue.
>>>> Left nr_free as total pages instead of splitting it between raw and aerated.
>>>> Combined size and physical address values in virtio ring into one 64b value.
>>>>
>>>> Changes from v1:
>>>> https://lore.kernel.org/lkml/20190619222922.1231.27432.stgit@localhost.localdomain/
>>>> Dropped "waste page treatment" in favor of "page hinting"
>>> We may still have to try and find a better name for the virtio-balloon side changes,
>>> as "FREE_PAGE_HINT" and "PAGE_HINTING" are still confusing.
>>
>> We should have named that free page reporting, but that train already
>> has left.
> 
> I think VIRTIO_BALLOON_F_FREE_PAGE_HINT is different and arguably
> actually does provide hints.

I guess it depends on the point of view (e.g., getting all free pages
feels more like a report). But I could also live with using the term
reporting in this context.

We could go ahead and name it all "page reporting"; that would also work for me.
Nitesh Narayan Lal July 24, 2019, 7:54 p.m. UTC | #6
On 7/24/19 3:47 PM, David Hildenbrand wrote:
> On 24.07.19 21:31, Michael S. Tsirkin wrote:
>> On Wed, Jul 24, 2019 at 08:41:33PM +0200, David Hildenbrand wrote:
>>> On 24.07.19 20:40, Nitesh Narayan Lal wrote:
>>>> On 7/24/19 12:54 PM, Alexander Duyck wrote:
>>>>> This series provides an asynchronous means of hinting to a hypervisor
>>>>> that a guest page is no longer in use and can have the data associated
>>>>> with it dropped. To do this I have implemented functionality that allows
>>>>> for what I am referring to as page hinting.
>>>>>
>>>>> The functionality for this is fairly simple. When enabled it will allocate
>>>>> statistics to track the number of hinted pages in a given free area. When
>>>>> the number of free pages exceeds this value plus a high water value,
>>>>> currently 32,
>>>> Shouldn't we configure this to a lower number such as 16?
>>>>>  it will begin performing page hinting which consists of
>>>>> pulling pages off of the free list and placing them into a scatter list. The
>>>>> scatterlist is then given to the page hinting device and it will perform
>>>>> the required action to make the pages "hinted", in the case of
>>>>> virtio-balloon this results in the pages being madvised as MADV_DONTNEED
>>>>> and as such they are forced out of the guest. After this they are placed
>>>>> back on the free list, and an additional bit is added if they are not
>>>>> merged indicating that they are a hinted buddy page instead of a standard
>>>>> buddy page. The cycle then repeats with additional non-hinted pages being
>>>>> pulled until the free areas all consist of hinted pages.
>>>>>
>>>>> I am leaving a number of things hard-coded such as limiting the lowest
>>>>> order processed to PAGEBLOCK_ORDER,
>>>> Have you considered making this option configurable at the compile time?
>>>>>  and have left it up to the guest to
>>>>> determine what the limit is on how many pages it wants to allocate to
>>>>> process the hints.
>>>> It might make sense to set the number of pages to be hinted at a time from the
>>>> hypervisor.
>>>>> My primary testing has just been to verify the memory is being freed after
>>>>> allocation by running memhog 79g on an 80g guest and watching the total
>>>>> free memory via /proc/meminfo on the host. With this I have verified most
>>>>> of the memory is freed after each iteration. As far as performance I have
>>>>> been mainly focusing on the will-it-scale/page_fault1 test running with
>>>>> 16 vcpus. With that I have seen at most a 2% difference between the base
>>>>> kernel without these patches and the patches with virtio-balloon disabled.
>>>>> With the patches and virtio-balloon enabled with hinting the results
>>>>> largely depend on the host kernel. On a 3.10 RHEL kernel I saw up to a 2%
>>>>> drop in performance as I approached 16 threads,
>>>> I think this is acceptable.
>>>>>  however on the latest
>>>>> linux-next kernel I saw roughly a 4% to 5% improvement in performance for
>>>>> all tests with 8 or more threads. 
>>>> Do you mean that with your patches the will-it-scale/page_fault1 numbers were
>>>> better by 4-5% over an unmodified kernel?
>>>>> I believe the difference seen is due to
>>>>> the overhead for faulting pages back into the guest and zeroing of memory.
>>>> It may also make sense to test these patches with netperf to observe how much
>>>> of a performance drop they introduce.
>>>>> Patch 4 is a bit on the large side at about 600 lines of change, however
>>>>> I really didn't see a good way to break it up since each piece feeds into
>>>>> the next. So I couldn't add the statistics by themselves as it didn't
>>>>> really make sense to add them without something that will either read or
>>>>> increment/decrement them, or add the Hinted state without something that
>>>>> would set/unset it. As such I just ended up adding the entire thing as
>>>>> one patch. It makes it a bit bigger but avoids the issues in the previous
>>>>> set where I was referencing things before they had been added.
>>>>>
>>>>> Changes from the RFC:
>>>>> https://lore.kernel.org/lkml/20190530215223.13974.22445.stgit@localhost.localdomain/
>>>>> Moved aeration requested flag out of aerator and into zone->flags.
>>>>> Moved boundary out of free_area and into local variables for aeration.
>>>>> Moved aeration cycle out of interrupt and into workqueue.
>>>>> Left nr_free as total pages instead of splitting it between raw and aerated.
>>>>> Combined size and physical address values in virtio ring into one 64b value.
>>>>>
>>>>> Changes from v1:
>>>>> https://lore.kernel.org/lkml/20190619222922.1231.27432.stgit@localhost.localdomain/
>>>>> Dropped "waste page treatment" in favor of "page hinting"
>>>> We may still have to try and find a better name for the virtio-balloon side changes,
>>>> as "FREE_PAGE_HINT" and "PAGE_HINTING" are still confusing.
>>> We should have named that free page reporting, but that train already
>>> has left.
>> I think VIRTIO_BALLOON_F_FREE_PAGE_HINT is different and arguably
>> actually does provide hints.
> I guess it depends on the point of view (e.g., getting all free pages
> feels more like a report). But I could also live with using the term
> reporting in this context.
>
> We could go ahead and name it all "page reporting"; that would also work for me.
I think that should work.
Having two separate names, one for the kernel and the other for the virtio
interface, will cause unnecessary confusion.
Alexander Duyck July 24, 2019, 8:27 p.m. UTC | #7
On Wed, 2019-07-24 at 14:40 -0400, Nitesh Narayan Lal wrote:
> On 7/24/19 12:54 PM, Alexander Duyck wrote:
> > This series provides an asynchronous means of hinting to a hypervisor
> > that a guest page is no longer in use and can have the data associated
> > with it dropped. To do this I have implemented functionality that allows
> > for what I am referring to as page hinting.
> > 
> > The functionality for this is fairly simple. When enabled it will allocate
> > statistics to track the number of hinted pages in a given free area. When
> > the number of free pages exceeds this value plus a high water value,
> > currently 32,
> Shouldn't we configure this to a lower number such as 16?

Yes, we could do 16.

> >  it will begin performing page hinting which consists of
> > pulling pages off of the free list and placing them into a scatter list. The
> > scatterlist is then given to the page hinting device and it will perform
> > the required action to make the pages "hinted", in the case of
> > virtio-balloon this results in the pages being madvised as MADV_DONTNEED
> > and as such they are forced out of the guest. After this they are placed
> > back on the free list, and an additional bit is added if they are not
> > merged indicating that they are a hinted buddy page instead of a standard
> > buddy page. The cycle then repeats with additional non-hinted pages being
> > pulled until the free areas all consist of hinted pages.
> > 
> > I am leaving a number of things hard-coded such as limiting the lowest
> > order processed to PAGEBLOCK_ORDER,
> Have you considered making this option configurable at the compile time?

We could. However, PAGEBLOCK_ORDER is already configurable on some
architectures. I didn't see much point in making it configurable in the
case of x86 as there are only really 2 orders this could be used with that
provide good performance, and those are MAX_ORDER - 1 and PAGEBLOCK_ORDER.
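
For reference, with 4K base pages those two orders work out as follows
(standalone arithmetic; a pageblock_order of 9, i.e. the 2M huge page
size, is the typical x86 value, and this is not patch code):

#include <stdio.h>

/* The chunk sizes behind PAGEBLOCK_ORDER and MAX_ORDER - 1 on x86. */
int main(void)
{
        unsigned long page_size = 4096;

        printf("order  9 (PAGEBLOCK_ORDER): %lu MiB\n",
               (page_size << 9) >> 20);         /* 2 MiB */
        printf("order 10 (MAX_ORDER - 1):   %lu MiB\n",
               (page_size << 10) >> 20);        /* 4 MiB */
        return 0;
}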

> >  and have left it up to the guest to
> > determine what the limit is on how many pages it wants to allocate to
> > process the hints.
> It might make sense to set the number of pages to be hinted at a time from the
> hypervisor.

We could do that. Although I would still want some upper limit on that as
I would prefer to keep the high water mark as a static value since it is
used in an inline function. Currently the virtio driver is the one
defining the capacity of pages per request.
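
For illustration only, the kind of static watermark test I have in mind
looks roughly like the sketch below; the name and value are hypothetical,
not the patch code. Keeping the mark a compile-time constant keeps the
inline check a cheap comparison on the freeing path:

#include <stdbool.h>

/* Hypothetical sketch of a static high water mark check. */
#define PAGE_HINTING_HWM 32

static inline bool page_hinting_needed(unsigned long nr_free,
                                       unsigned long nr_hinted)
{
        /* A hypervisor-supplied limit would turn this constant into
         * a load from memory on every invocation. */
        return nr_free > nr_hinted + PAGE_HINTING_HWM;
}

int main(void)
{
        return page_hinting_needed(100, 32) ? 0 : 1;    /* 100 > 64 */
}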

> > My primary testing has just been to verify the memory is being freed after
> > allocation by running memhog 79g on an 80g guest and watching the total
> > free memory via /proc/meminfo on the host. With this I have verified most
> > of the memory is freed after each iteration. As far as performance I have
> > been mainly focusing on the will-it-scale/page_fault1 test running with
> > 16 vcpus. With that I have seen at most a 2% difference between the base
> > kernel without these patches and the patches with virtio-balloon disabled.
> > With the patches and virtio-balloon enabled with hinting the results
> > largely depend on the host kernel. On a 3.10 RHEL kernel I saw up to a 2%
> > drop in performance as I approached 16 threads,
> I think this is acceptable.
> >  however on the latest
> > linux-next kernel I saw roughly a 4% to 5% improvement in performance for
> > all tests with 8 or more threads. 
> Do you mean that with your patches the will-it-scale/page_fault1 numbers were
> better by 4-5% over an unmodified kernel?

Yes. That is the odd thing. I am wondering if there was some improvement
in the zeroing of THP pages or something that is somehow improving the
cache performance for the accessing of the pages by the test in the guest.

> > I believe the difference seen is due to
> > the overhead for faulting pages back into the guest and zeroing of memory.
> It may also make sense to test these patches with netperf to observe how much
> of a performance drop they introduce.

Do you have some test you were already using? I ask because I am not sure
netperf would generate a large enough memory window size to really trigger
much of a change in terms of hinting. If you have some test in mind I
could probably set it up and run it pretty quick.

> > Patch 4 is a bit on the large side at about 600 lines of change, however
> > I really didn't see a good way to break it up since each piece feeds into
> > the next. So I couldn't add the statistics by themselves as it didn't
> > really make sense to add them without something that will either read or
> > increment/decrement them, or add the Hinted state without something that
> > would set/unset it. As such I just ended up adding the entire thing as
> > one patch. It makes it a bit bigger but avoids the issues in the previous
> > set where I was referencing things before they had been added.
> > 
> > Changes from the RFC:
> > https://lore.kernel.org/lkml/20190530215223.13974.22445.stgit@localhost.localdomain/
> > Moved aeration requested flag out of aerator and into zone->flags.
> > Moved boundary out of free_area and into local variables for aeration.
> > Moved aeration cycle out of interrupt and into workqueue.
> > Left nr_free as total pages instead of splitting it between raw and aerated.
> > Combined size and physical address values in virtio ring into one 64b value.
> > 
> > Changes from v1:
> > https://lore.kernel.org/lkml/20190619222922.1231.27432.stgit@localhost.localdomain/
> > Dropped "waste page treatment" in favor of "page hinting"
> We may still have to try and find a better name for the virtio-balloon side changes,
> as "FREE_PAGE_HINT" and "PAGE_HINTING" are still confusing.

We just need to settle on a name. Essentially all this requires is just a
quick find and replace with whatever name we decide on.
Nitesh Narayan Lal July 24, 2019, 8:38 p.m. UTC | #8
On 7/24/19 4:27 PM, Alexander Duyck wrote:
> On Wed, 2019-07-24 at 14:40 -0400, Nitesh Narayan Lal wrote:
>> On 7/24/19 12:54 PM, Alexander Duyck wrote:
>>> This series provides an asynchronous means of hinting to a hypervisor
>>> that a guest page is no longer in use and can have the data associated
>>> with it dropped. To do this I have implemented functionality that allows
>>> for what I am referring to as page hinting.
>>>
>>> The functionality for this is fairly simple. When enabled it will allocate
>>> statistics to track the number of hinted pages in a given free area. When
>>> the number of free pages exceeds this value plus a high water value,
>>> currently 32,
>> Shouldn't we configure this to a lower number such as 16?
> Yes, we could do 16.
>
>>>  it will begin performing page hinting which consists of
>>> pulling pages off of the free list and placing them into a scatter list. The
>>> scatterlist is then given to the page hinting device and it will perform
>>> the required action to make the pages "hinted", in the case of
>>> virtio-balloon this results in the pages being madvised as MADV_DONTNEED
>>> and as such they are forced out of the guest. After this they are placed
>>> back on the free list, and an additional bit is added if they are not
>>> merged indicating that they are a hinted buddy page instead of a standard
>>> buddy page. The cycle then repeats with additional non-hinted pages being
>>> pulled until the free areas all consist of hinted pages.
>>>
>>> I am leaving a number of things hard-coded such as limiting the lowest
>>> order processed to PAGEBLOCK_ORDER,
>> Have you considered making this option configurable at the compile time?
> We could. However, PAGEBLOCK_ORDER is already configurable on some
> architectures. I didn't see much point in making it configurable in the
> case of x86 as there are only really 2 orders this could be used with that
> provide good performance, and those are MAX_ORDER - 1 and PAGEBLOCK_ORDER.
>
>>>  and have left it up to the guest to
>>> determine what the limit is on how many pages it wants to allocate to
>>> process the hints.
>> It might make sense to set the number of pages to be hinted at a time from the
>> hypervisor.
> We could do that. Although I would still want some upper limit on that as
> I would prefer to keep the high water mark as a static value since it is
> used in an inline function. Currently the virtio driver is the one
> defining the capacity of pages per request.
For the upper limit I think we can rely on max vq size. Isn't it?
>
>>> My primary testing has just been to verify the memory is being freed after
>>> allocation by running memhog 79g on an 80g guest and watching the total
>>> free memory via /proc/meminfo on the host. With this I have verified most
>>> of the memory is freed after each iteration. As far as performance I have
>>> been mainly focusing on the will-it-scale/page_fault1 test running with
>>> 16 vcpus. With that I have seen at most a 2% difference between the base
>>> kernel without these patches and the patches with virtio-balloon disabled.
>>> With the patches and virtio-balloon enabled with hinting the results
>>> largely depend on the host kernel. On a 3.10 RHEL kernel I saw up to a 2%
>>> drop in performance as I approached 16 threads,
>> I think this is acceptable.
>>>  however on the latest
>>> linux-next kernel I saw roughly a 4% to 5% improvement in performance for
>>> all tests with 8 or more threads. 
>> Do you mean that with your patches the will-it-scale/page_fault1 numbers were
>> better by 4-5% over an unmodified kernel?
> Yes. That is the odd thing. I am wondering if there was some improvement
> in the zeroing of THP pages or something that is somehow improving the
> cache performance for the accessing of the pages by the test in the guest.
The values you were observing on an unmodified kernel, were they consistent over
a fresh reboot?
Do you have any sort of workload running in the host? That could also impact
the numbers.
>
>>> I believe the difference seen is due to
>>> the overhead for faulting pages back into the guest and zeroing of memory.
>> It may also make sense to test these patches with netperf to observe how much
>> of a performance drop they introduce.
> Do you have some test you were already using? I ask because I am not sure
> netperf would generate a large enough memory window size to really trigger
> much of a change in terms of hinting. If you have some test in mind I
> could probably set it up and run it pretty quick.
Earlier I tried running netperf on a guest with 2 cores, i.e., netserver
pinned to one and netperf running on the other.
You have to specify a really large packet size and run the test for at least
15-30 minutes to actually see some hinting work.
>
>>> Patch 4 is a bit on the large side at about 600 lines of change, however
>>> I really didn't see a good way to break it up since each piece feeds into
>>> the next. So I couldn't add the statistics by themselves as it didn't
>>> really make sense to add them without something that will either read or
>>> increment/decrement them, or add the Hinted state without something that
>>> would set/unset it. As such I just ended up adding the entire thing as
>>> one patch. It makes it a bit bigger but avoids the issues in the previous
>>> set where I was referencing things before they had been added.
>>>
>>> Changes from the RFC:
>>> https://lore.kernel.org/lkml/20190530215223.13974.22445.stgit@localhost.localdomain/
>>> Moved aeration requested flag out of aerator and into zone->flags.
>>> Moved boundary out of free_area and into local variables for aeration.
>>> Moved aeration cycle out of interrupt and into workqueue.
>>> Left nr_free as total pages instead of splitting it between raw and aerated.
>>> Combined size and physical address values in virtio ring into one 64b value.
>>>
>>> Changes from v1:
>>> https://lore.kernel.org/lkml/20190619222922.1231.27432.stgit@localhost.localdomain/
>>> Dropped "waste page treatment" in favor of "page hinting"
>> We may still have to try and find a better name for the virtio-balloon side changes,
>> as "FREE_PAGE_HINT" and "PAGE_HINTING" are still confusing.
> We just need to settle on a name. Essentially all this requires is just a
> quick find and replace with whatever name we decide on.
I agree.
Michael S. Tsirkin July 24, 2019, 8:38 p.m. UTC | #9
On Wed, Jul 24, 2019 at 01:27:35PM -0700, Alexander Duyck wrote:
> On Wed, 2019-07-24 at 14:40 -0400, Nitesh Narayan Lal wrote:
> > On 7/24/19 12:54 PM, Alexander Duyck wrote:
> > > This series provides an asynchronous means of hinting to a hypervisor
> > > that a guest page is no longer in use and can have the data associated
> > > with it dropped. To do this I have implemented functionality that allows
> > > for what I am referring to as page hinting.
> > > 
> > > The functionality for this is fairly simple. When enabled it will allocate
> > > statistics to track the number of hinted pages in a given free area. When
> > > the number of free pages exceeds this value plus a high water value,
> > > currently 32,
> > Shouldn't we configure this to a lower number such as 16?
> 
> Yes, we could do 16.
> 
> > >  it will begin performing page hinting which consists of
> > > pulling pages off of the free list and placing them into a scatter list. The
> > > scatterlist is then given to the page hinting device and it will perform
> > > the required action to make the pages "hinted", in the case of
> > > virtio-balloon this results in the pages being madvised as MADV_DONTNEED
> > > and as such they are forced out of the guest. After this they are placed
> > > back on the free list, and an additional bit is added if they are not
> > > merged indicating that they are a hinted buddy page instead of a standard
> > > buddy page. The cycle then repeats with additional non-hinted pages being
> > > pulled until the free areas all consist of hinted pages.
> > > 
> > > I am leaving a number of things hard-coded such as limiting the lowest
> > > order processed to PAGEBLOCK_ORDER,
> > Have you considered making this option configurable at the compile time?
> 
> We could. However, PAGEBLOCK_ORDER is already configurable on some
> architectures. I didn't see much point in making it configurable in the
> case of x86 as there are only really 2 orders this could be used with that
> provide good performance, and those are MAX_ORDER - 1 and PAGEBLOCK_ORDER.
> 
> > >  and have left it up to the guest to
> > > determine what the limit is on how many pages it wants to allocate to
> > > process the hints.
> > It might make sense to set the number of pages to be hinted at a time from the
> > hypervisor.
> 
> We could do that. Although I would still want some upper limit on that as
> I would prefer to keep the high water mark as a static value since it is
> used in an inline function. Currently the virtio driver is the one
> defining the capacity of pages per request.
> 
> > > My primary testing has just been to verify the memory is being freed after
> > > allocation by running memhog 79g on an 80g guest and watching the total
> > > free memory via /proc/meminfo on the host. With this I have verified most
> > > of the memory is freed after each iteration. As far as performance I have
> > > been mainly focusing on the will-it-scale/page_fault1 test running with
> > > 16 vcpus. With that I have seen at most a 2% difference between the base
> > > kernel without these patches and the patches with virtio-balloon disabled.
> > > With the patches and virtio-balloon enabled with hinting the results
> > > largely depend on the host kernel. On a 3.10 RHEL kernel I saw up to a 2%
> > > drop in performance as I approached 16 threads,
> > I think this is acceptable.
> > >  however on the latest
> > > linux-next kernel I saw roughly a 4% to 5% improvement in performance for
> > > all tests with 8 or more threads. 
> > Do you mean that with your patches the will-it-scale/page_fault1 numbers were
> > better by 4-5% over an unmodified kernel?
> 
> Yes. That is the odd thing. I am wondering if there was some improvement
> in the zeroing of THP pages or something that is somehow improving the
> cache performance for the accessing of the pages by the test in the guest.

Well the cache is indexed by the PA on Intel, right? So if you end up never
writing into the pages, reading them will be faster because you will end
up with a zero page. This will be offset by a fault when you finally do
write into the page.
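
A standalone demo of that behavior (illustrative only, not from the
patches):

#define _GNU_SOURCE     /* for MAP_ANONYMOUS under strict -std modes */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Dirty anonymous pages, drop them with MADV_DONTNEED as the host does
 * to hinted pages, and the next read is satisfied by zero-fill pages
 * rather than the old contents. */
int main(void)
{
        size_t len = 2 * 1024 * 1024;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
                return 1;
        memset(buf, 0xaa, len);                 /* pages now dirtied */
        madvise(buf, len, MADV_DONTNEED);       /* drop the backing pages */
        printf("byte after drop: %d\n", buf[0]);        /* prints 0 */
        return 0;
}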

> > > I believe the difference seen is due to
> > > the overhead for faulting pages back into the guest and zeroing of memory.
> > It may also make sense to test these patches with netperf to observe how much
> > of a performance drop they introduce.
> 
> Do you have some test you were already using? I ask because I am not sure
> netperf would generate a large enough memory window size to really trigger
> much of a change in terms of hinting. If you have some test in mind I
> could probably set it up and run it pretty quick.
> 
> > > Patch 4 is a bit on the large side at about 600 lines of change, however
> > > I really didn't see a good way to break it up since each piece feeds into
> > > the next. So I couldn't add the statistics by themselves as it didn't
> > > really make sense to add them without something that will either read or
> > > increment/decrement them, or add the Hinted state without something that
> > > would set/unset it. As such I just ended up adding the entire thing as
> > > one patch. It makes it a bit bigger but avoids the issues in the previous
> > > set where I was referencing things before they had been added.
> > > 
> > > Changes from the RFC:
> > > https://lore.kernel.org/lkml/20190530215223.13974.22445.stgit@localhost.localdomain/
> > > Moved aeration requested flag out of aerator and into zone->flags.
> > > Moved boundary out of free_area and into local variables for aeration.
> > > Moved aeration cycle out of interrupt and into workqueue.
> > > Left nr_free as total pages instead of splitting it between raw and aerated.
> > > Combined size and physical address values in virtio ring into one 64b value.
> > > 
> > > Changes from v1:
> > > https://lore.kernel.org/lkml/20190619222922.1231.27432.stgit@localhost.localdomain/
> > > Dropped "waste page treatment" in favor of "page hinting"
> > We may still have to try and find a better name for the virtio-balloon side changes,
> > as "FREE_PAGE_HINT" and "PAGE_HINTING" are still confusing.
> 
> We just need to settle on a name. Essentially all this requires is just a
> quick find and replace with whatever name we decide on.
Alexander Duyck July 24, 2019, 9 p.m. UTC | #10
On Wed, 2019-07-24 at 16:38 -0400, Nitesh Narayan Lal wrote:
> On 7/24/19 4:27 PM, Alexander Duyck wrote:
> > On Wed, 2019-07-24 at 14:40 -0400, Nitesh Narayan Lal wrote:
> > > On 7/24/19 12:54 PM, Alexander Duyck wrote:
> > > > This series provides an asynchronous means of hinting to a hypervisor
> > > > that a guest page is no longer in use and can have the data associated
> > > > with it dropped. To do this I have implemented functionality that allows
> > > > for what I am referring to as page hinting
> > > > 
> > > > The functionality for this is fairly simple. When enabled it will allocate
> > > > statistics to track the number of hinted pages in a given free area. When
> > > > the number of free pages exceeds this value plus a high water value,
> > > > currently 32,
> > > Shouldn't we configure this to a lower number such as 16?
> > Yes, we could do 16.
> > 
> > > >  it will begin performing page hinting which consists of
> > > > pulling pages off of free list and placing them into a scatter list. The
> > > > scatterlist is then given to the page hinting device and it will perform
> > > > the required action to make the pages "hinted", in the case of
> > > > virtio-balloon this results in the pages being madvised as MADV_DONTNEED
> > > > and as such they are forced out of the guest. After this they are placed
> > > > back on the free list, and an additional bit is added if they are not
> > > > merged indicating that they are a hinted buddy page instead of a standard
> > > > buddy page. The cycle then repeats with additional non-hinted pages being
> > > > pulled until the free areas all consist of hinted pages.
> > > > 
> > > > I am leaving a number of things hard-coded such as limiting the lowest
> > > > order processed to PAGEBLOCK_ORDER,
> > > Have you considered making this option configurable at the compile time?
> > We could. However, PAGEBLOCK_ORDER is already configurable on some
> > architectures. I didn't see much point in making it configurable in the
> > case of x86 as there are really only two orders this could be used at
> > that provide good performance: MAX_ORDER - 1 and PAGEBLOCK_ORDER. (On
> > x86 with 4K base pages those are order 10, i.e. 4M, and order 9, i.e. 2M.)
> > 
> > > >  and have left it up to the guest to
> > > > determine what the limit is on how many pages it wants to allocate to
> > > > process the hints.
> > > It might make sense to set the number of pages to be hinted at a time from the
> > > hypervisor.
> > We could do that, although I would still want some upper limit on that as
> > I would prefer to keep the high water mark as a static value since it is
> > used in an inline function. Currently the virtio driver is the one
> > defining the capacity of pages per request.
> For the upper limit I think we can rely on the max vq size, can't we?

I would still want to limit how many pages could be pulled. Otherwise we
have the risk of a hypervisor that allocates a vq size of 1024 or
something like that and with 4M pages that could essentially OOM a 4G
guest.

That is why I figure what we probably should do is base the upper limit on
either 16 or 32 so that we only have at most something like 64M or 128M of
memory being held by the driver while it is being "reported". If we leave
spare room in the ring so be it, better that than triggering unneeded OOM
conditions.
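
To make the arithmetic concrete, here is a rough sketch of the kind of cap
I have in mind. The names and constants here are hypothetical, not from
the current patches; the series hard-codes the high water value and leaves
request sizing to the virtio driver:

    #include <stdbool.h>

    /* Illustrative values only. */
    #define PH_HIGH_WATER    32  /* free pages above hinted before hinting */
    #define PH_MAX_CAPACITY  16  /* max pages held per report */

    /* Hint only once a free area has enough non-hinted pages. */
    static bool ph_should_hint(unsigned long nr_free, unsigned long nr_hinted)
    {
            return nr_free > nr_hinted + PH_HIGH_WATER;
    }

    /*
     * Clamp the per-request capacity regardless of the ring size the
     * device offers. With MAX_ORDER - 1 pages of 4M on x86, 16 entries
     * hold at most 64M while being reported (32 would mean 128M), while
     * blindly filling a 1024-entry vq could pin 4G and OOM a small guest.
     */
    static unsigned int ph_request_capacity(unsigned int vq_size)
    {
            return vq_size < PH_MAX_CAPACITY ? vq_size : PH_MAX_CAPACITY;
    }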

> > > > My primary testing has just been to verify the memory is being freed after
> > > > allocation by running memhog 79g on a 80g guest and watching the total
> > > > free memory via /proc/meminfo on the host. With this I have verified most
> > > > of the memory is freed after each iteration. As far as performance I have
> > > > been mainly focusing on the will-it-scale/page_fault1 test running with
> > > > 16 vcpus. With that I have seen at most a 2% difference between the base
> > > > kernel without these patches and the patches with virtio-balloon disabled.
> > > > With the patches and virtio-balloon enabled with hinting the results
> > > > largely depend on the host kernel. On a 3.10 RHEL kernel I saw up to a 2%
> > > > drop in performance as I approached 16 threads,
> > > I think this is acceptable.
> > > >  however on the latest
> > > > linux-next kernel I saw roughly a 4% to 5% improvement in performance for
> > > > all tests with 8 or more threads. 
> > > Do you mean that with your patches the will-it-scale/page_fault1 numbers were
> > > better by 4-5% over an unmodified kernel?
> > Yes. That is the odd thing. I am wondering if there was some improvement
> > in the zeroing of THP pages or something that is somehow improving the
> > cache performance when the test accesses the pages in the guest.
> The values you were observing on an unmodified kernel, were they consistent
> across fresh reboots?
> Do you have any sort of workload running on the host, as that could also
> impact the numbers?

The host was an unmodified linux-next kernel. What I was doing is I would
reboot, load the guest, run one kernel, swap the kernel in the guest and
just reboot the guest, run the next kernel, and then switch back to the
first kernel to make certain there wasn't anything that changed between
the runs.

I still need to do more research though. I suspect it has something to do
with the page zeroing on faults, as that was what was showing up in perf
top when we hit about 8 or more threads active in the guest.

> > > > I believe the difference seen is due to
> > > > the overhead for faulting pages back into the guest and zeroing of memory.
> > > It may also make sense to test these patches with netperf to observe how
> > > much of a performance drop they introduce.
> > Do you have some test you were already using? I ask because I am not sure
> > netperf would generate a large enough memory window size to really trigger
> > much of a change in terms of hinting. If you have some test in mind I
> > could probably set it up and run it pretty quick.
> Earlier I tried running netperf on a guest with 2 cores, i.e., netserver
> pinned to one and netperf running on the other.
> You have to specify a really large packet size and run the test for at least
> 15-30 minutes to actually see some hinting work.

I can take a look. I am not expecting much though.

> > > > Patch 4 is a bit on the large side at about 600 lines of change, however
> > > > I really didn't see a good way to break it up since each piece feeds into
> > > > the next. So I couldn't add the statistics by themselves as it didn't
> > > > really make sense to add them without something that will either read or
> > > > increment/decrement them, or add the Hinted state without something that
> > > > would set/unset it. As such I just ended up adding the entire thing as
> > > > one patch. It makes it a bit bigger but avoids the issues in the previous
> > > > set where I was referencing things before they had been added.
> > > > 
> > > > Changes from the RFC:
> > > > https://lore.kernel.org/lkml/20190530215223.13974.22445.stgit@localhost.localdomain/
> > > > Moved aeration requested flag out of aerator and into zone->flags.
> > > > Moved boundary out of free_area and into local variables for aeration.
> > > > Moved aeration cycle out of interrupt and into workqueue.
> > > > Left nr_free as total pages instead of splitting it between raw and aerated.
> > > > Combined size and physical address values in virtio ring into one 64b value.
> > > > 
> > > > Changes from v1:
> > > > https://lore.kernel.org/lkml/20190619222922.1231.27432.stgit@localhost.localdomain/
> > > > Dropped "waste page treatment" in favor of "page hinting"
> > > We may still have to try and find a better name for virtio-balloon side changes.
> > > As "FREE_PAGE_HINT" and "PAGE_HINTING" are still confusing.
> > We just need to settle on a name. Essentially all this requires is just a
> > quick find and replace with whatever name we decide on.
> I agree.

I will probably look at seeing if I can keep the kernel feature as free
page hinting and just make the virtio feature page reporting. It should be
pretty straightforward as I could just replace the mentions of react with
report and only have to tweak a few bits of patch 5.
Michael S. Tsirkin July 24, 2019, 9:32 p.m. UTC | #11
On Wed, Jul 24, 2019 at 09:47:24PM +0200, David Hildenbrand wrote:
> On 24.07.19 21:31, Michael S. Tsirkin wrote:
> > On Wed, Jul 24, 2019 at 08:41:33PM +0200, David Hildenbrand wrote:
> >> On 24.07.19 20:40, Nitesh Narayan Lal wrote:
> >>>
> >>> On 7/24/19 12:54 PM, Alexander Duyck wrote:
> >>>> [...]
> >>>> Dropped "waste page treatment" in favor of "page hinting"
> >>> We may still have to try and find a better name for virtio-balloon side changes.
> >>> As "FREE_PAGE_HINT" and "PAGE_HINTING" are still confusing.
> >>
>> We should have named that free page reporting, but that train has already
>> left.
> > 
> > I think VIRTIO_BALLOON_F_FREE_PAGE_HINT is different and arguably
> > actually does provide hints.
> 
> I guess it depends on the point of view (e.g., getting all free pages
> feels more like a report). But I could also live with using the term
> reporting in this context.
> 
> We could go ahead and name it all "page reporting", would also work for me.

So there are actually three differences between the mechanisms:
1. VIRTIO_BALLOON_F_FREE_PAGE_HINT sometimes reports pages which might no
   longer be on the free list (with subtle limitations which sometimes
   still allow the hypervisor to discard the pages)
2. VIRTIO_BALLOON_F_FREE_PAGE_HINT starts reporting when requested by the
   host
3. VIRTIO_BALLOON_F_FREE_PAGE_HINT is not incremental: each request by
   the host reports all free memory

By comparison, the proposed patches:
- always report only actually free pages
- report at a random time
- report incrementally


> -- 
> 
> Thanks,
> 
> David / dhildenb
Nitesh Narayan Lal July 25, 2019, 12:08 p.m. UTC | #12
On 7/24/19 5:00 PM, Alexander Duyck wrote:
> On Wed, 2019-07-24 at 16:38 -0400, Nitesh Narayan Lal wrote:
>> On 7/24/19 4:27 PM, Alexander Duyck wrote:
>>> On Wed, 2019-07-24 at 14:40 -0400, Nitesh Narayan Lal wrote:
>>>> On 7/24/19 12:54 PM, Alexander Duyck wrote:
>>>>> [...]
>> The values you were observing on an unmodified kernel, were they consistent
>> across fresh reboots?
>> Do you have any sort of workload running on the host, as that could also
>> impact the numbers?
> The host was an unmodified linux-next kernel. What I was doing is I would
> reboot, load the guest, run one kernel, swap the kernel in the guest and
> just reboot the guest, run the next kernel, and then switch back to the
> first kernel to make certain there wasn't anything that changed between
> the runs.


As long as the host kernel and environment remain the same for the guest
kernel with and without the hinting patches, we should be fine comparing the
two, right? We would expect the will-it-scale/page_fault1 numbers in these
two cases to be close to one another.


>
> [...]
>>>>> Dropped "waste page treatment" in favor of "page hinting"
>>>> We may still have to try and find a better name for virtio-balloon side changes.
>>>> As "FREE_PAGE_HINT" and "PAGE_HINTING" are still confusing.
>>> We just need to settle on a name. Essentially all this requires is just a
>>> quick find and replace with whatever name we decide on.
>> I agree.
> I will probably look at seeing if I can keep the kernel feature as free
> page hinting and just make the virtio feature page reporting. It should be
> pretty straightforward as I could just replace the mentions of react with
> report and only have to tweak a few bits of patch 5.


I still think that we should stick with the same name for the kernel and the
virtio sides. But as you said this is not a blocker, and it just requires a
quick find and replace.


>