[v10,0/6] mm / virtio: Provide support for unused page reporting

Message ID 20190918175109.23474.67039.stgit@localhost.localdomain

Message

Alexander Duyck Sept. 18, 2019, 5:52 p.m. UTC
This series provides an asynchronous means of reporting to a hypervisor
that a guest page is no longer in use and can have the data associated
with it dropped. To do this I have implemented functionality that allows
for what I am referring to as unused page reporting. The advantage of
unused page reporting is that we can support a significant amount of
memory over-commit with improved performance, since we can avoid having
to write/read memory from swap; the VM will instead actively participate
in freeing unused memory so it doesn't have to be written out.

The functionality for this is fairly simple. When enabled it will
allocate statistics to track the number of reported pages in a given
free area. When the number of free pages exceeds the number of reported
pages plus a high water value, currently 32, it will begin performing
page reporting, which consists of pulling non-reported pages off of the
free lists of a given zone and placing them into a scatterlist. The
scatterlist is then given to the page reporting device, which performs
the required action to make the pages "reported"; in the case of
virtio-balloon this results in the pages being madvised as
MADV_DONTNEED. After this they are placed back on their original free
list. If they are not merged during freeing, an additional bit is set
indicating that they are a "reported" buddy page instead of a standard
buddy page. The cycle then repeats, with additional non-reported pages
being pulled until the free areas all consist of reported pages.
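
To make the cycle concrete, here is a minimal sketch: a toy userspace
model with hypothetical names, in which a plain doubly-linked list
stands in for a buddy free list and a small array stands in for the
scatterlist. This is an illustration of the flow above, not the code
from this series.

/*
 * Toy model of the reporting cycle; names are hypothetical and the
 * structures are simplified stand-ins, not the series' implementation.
 */
#include <stdbool.h>
#include <stdio.h>

#define BATCH 4 /* stands in for the scatterlist capacity */

struct page {
        struct page *prev, *next;
        bool reported;
};

static struct page head; /* list sentinel */

static void list_del(struct page *p)
{
        p->prev->next = p->next;
        p->next->prev = p->prev;
}

static void add_to_free_list_tail(struct page *p)
{
        p->prev = head.prev;
        p->next = &head;
        head.prev->next = p;
        head.prev = p;
}

/* Pull up to BATCH non-reported pages off the list, as described
 * above; returns how many were gathered into the "scatterlist". */
static int get_unreported_pages(struct page **batch)
{
        int n = 0;

        for (struct page *p = head.next; p != &head && n < BATCH; p = p->next)
                if (!p->reported)
                        batch[n++] = p;
        for (int i = 0; i < n; i++)
                list_del(batch[i]);
        return n;
}

int main(void)
{
        struct page pages[10] = { 0 };
        struct page *batch[BATCH];
        int n, rounds = 0;

        head.prev = head.next = &head;
        for (int i = 0; i < 10; i++)
                add_to_free_list_tail(&pages[i]);

        /* The cycle repeats until the free list is all reported pages. */
        while ((n = get_unreported_pages(batch)) > 0) {
                rounds++;
                for (int i = 0; i < n; i++) {
                        /* the real code hands the batch to the device here */
                        batch[i]->reported = true;
                        add_to_free_list_tail(batch[i]);
                }
        }
        printf("all pages reported after %d rounds\n", rounds);
        return 0;
}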

In order to try and keep the time needed to find a non-reported page to
a minimum we maintain a "reported_boundary" pointer. This pointer is used
by the get_unreported_pages iterator to determine at what point it should
resume searching for non-reported pages. In order to guarantee pages do
not get past the scan I have modified add_to_free_list_tail so that it
will not insert pages behind the reported_boundary.

If another process needs to perform a massive manipulation of the free
list, such as compaction, it can either reset a given individual boundary
which will push the boundary back to the list_head, or it can clear the
bit indicating the zone is actively processing which will result in the
reporting process resetting all of the boundaries for a given zone.
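
For illustration, here is a compilable sketch of the boundary
bookkeeping described above, with hypothetical names and a bare
doubly-linked list standing in for the real free_area. It is a model of
the idea, not the code from the series (which keeps these manipulators
in page_alloc.c):

/*
 * Sketch only: a "reported_boundary" marks where the reported section
 * starts, tail adds stay in front of it, and interference resets it.
 */
#include <stdbool.h>
#include <stdio.h>

struct page { struct page *prev, *next; bool reported; };

struct free_area {
        struct page head;               /* list sentinel */
        struct page *reported_boundary; /* first reported page, or &head */
};

static void insert_before(struct page *pos, struct page *p)
{
        p->prev = pos->prev;
        p->next = pos;
        pos->prev->next = p;
        pos->prev = p;
}

/* Tail adds land in front of the boundary, so a page can never be
 * inserted behind the scan. When reporting is idle the boundary sits
 * at the sentinel and this degenerates into a normal tail insert. */
static void add_to_free_list_tail(struct free_area *fa, struct page *p)
{
        insert_before(fa->reported_boundary, p);
}

/* Removing the page the boundary points at moves the boundary off it. */
static void del_page_from_free_list(struct free_area *fa, struct page *p)
{
        if (fa->reported_boundary == p)
                fa->reported_boundary = p->next;
        p->prev->next = p->next;
        p->next->prev = p->prev;
}

/* What an interfering writer such as compaction would do: reset the
 * boundary back to the list head so the reporting scan starts over. */
static void reset_boundary(struct free_area *fa)
{
        fa->reported_boundary = &fa->head;
}

int main(void)
{
        struct free_area fa;
        struct page a = { 0 }, b = { 0 };

        fa.head.prev = fa.head.next = &fa.head;
        fa.reported_boundary = &fa.head;

        b.reported = true;
        add_to_free_list_tail(&fa, &b);  /* a reported page at the tail */
        fa.reported_boundary = &b;       /* the scan resumes here */
        add_to_free_list_tail(&fa, &a);  /* stays in front of &b */
        printf("first page is %s\n", fa.head.next == &a ? "a" : "b");

        del_page_from_free_list(&fa, &b);/* boundary pushed off &b */
        reset_boundary(&fa);             /* e.g. compaction interfered */
        return 0;
}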

I am leaving a number of things hard-coded such as limiting the lowest
order processed to pageblock_order, and have left it up to the guest to
determine what the limit is on how many pages it wants to allocate to
process the hints. The upper limit for this is based on the size of the
queue used to store the scatterlist.

I wanted to avoid gaming the performance testing for this. As far as
possible gains go, a significant performance improvement should be
visible in cases where guests are forced to write/read from swap. As
such, testing it would be more of a benchmark of copying a page from
swap versus just allocating a zero page. I have been verifying that the
memory is being freed by using memhog to allocate all the memory on the
guest, and then watching /proc/meminfo to verify the host sees the
memory returned after the test completes.

As far as possible regressions go, I have focused on cases where
performing the hinting would be non-optimal, such as cases where the
code isn't needed as memory is not over-committed, or the functionality
is not in use. I have been using the will-it-scale/page_fault1 test
running with 16 vcpus and have modified it to use Transparent Huge
Pages. With this I see almost no difference with the patches applied
and the feature disabled. Likewise I see almost no difference with the
feature enabled, but the madvise disabled in the hypervisor due to a
device being assigned. With the feature fully enabled in both guest and
hypervisor I see a regression of between 1.86% and 8.84% versus the
baseline. I found that most of the overhead was due to the page
faulting/zeroing that comes as a result of the pages having been
evicted from the guest.
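
To illustrate that guest-visible cost, the following self-contained
userspace program demonstrates the madvise(MADV_DONTNEED) behavior the
hypervisor applies: after the call the backing page is dropped, and the
next touch faults in a fresh zero page, much like a new allocation.

/*
 * Illustration only, not part of the series: show that a page hit by
 * MADV_DONTNEED reads back as zeroes on the next access.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 4096;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;

        memset(p, 0xaa, len);                   /* populate the page */
        printf("before: 0x%02x\n", (unsigned char)p[0]);

        if (madvise(p, len, MADV_DONTNEED))     /* drop the backing page */
                return 1;

        /* the next read faults in a zero page, just like a fresh alloc */
        printf("after:  0x%02x\n", (unsigned char)p[0]);

        munmap(p, len);
        return 0;
}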

For info on earlier versions you will need to follow the links provided
with the respective versions.

Changes from v9:
https://lore.kernel.org/lkml/20190907172225.10910.34302.stgit@localhost.localdomain/
Updated cover page
Dropped per-cpu page randomization entropy patch
Added "to_tail" boolean value to __free_one_page to improve readability
Renamed __shuffle_pick_tail to shuffle_pick_tail, avoiding extra inline function
Dropped arm64 HUGETLB_PAGE_ORDER movement patch since it is no longer needed
Significant rewrite of page reporting functionality
  Updated logic to support interruptions from compaction
  get_unreported_page will now walk through reported sections
  Moved free_list manipulators out of mmzone.h and into page_alloc.c
  Removed page_reporting.h include from mmzone.h
  Split page_reporting.h between include/linux/ and mm/
  Added "#include <asm/pgtable.h>" to mm/page_reporting.h
  Renamed page_reporting_startup/shutdown to page_reporting_register/unregister
Updated comments related to virtio page poison tracking feature

---

Alexander Duyck (6):
      mm: Adjust shuffle code to allow for future coalescing
      mm: Use zone and order instead of free area in free_list manipulators
      mm: Introduce Reported pages
      mm: Add device side and notifier for unused page reporting
      virtio-balloon: Pull page poisoning config out of free page hinting
      virtio-balloon: Add support for providing unused page reports to host


 drivers/virtio/Kconfig              |    1 
 drivers/virtio/virtio_balloon.c     |   87 ++++++++-
 include/linux/mmzone.h              |   60 ++----
 include/linux/page-flags.h          |   11 +
 include/linux/page_reporting.h      |   31 +++
 include/uapi/linux/virtio_balloon.h |    1 
 mm/Kconfig                          |   11 +
 mm/Makefile                         |    1 
 mm/compaction.c                     |    5 +
 mm/memory_hotplug.c                 |    2 
 mm/page_alloc.c                     |  194 +++++++++++++++----
 mm/page_reporting.c                 |  350 +++++++++++++++++++++++++++++++++++
 mm/page_reporting.h                 |  224 ++++++++++++++++++++++
 mm/shuffle.c                        |   12 +
 mm/shuffle.h                        |    6 +
 15 files changed, 893 insertions(+), 103 deletions(-)
 create mode 100644 include/linux/page_reporting.h
 create mode 100644 mm/page_reporting.c
 create mode 100644 mm/page_reporting.h

--

Comments

Michal Hocko Sept. 24, 2019, 2:23 p.m. UTC | #1
On Wed 18-09-19 10:52:25, Alexander Duyck wrote:
[...]
> In order to try and keep the time needed to find a non-reported page to
> a minimum we maintain a "reported_boundary" pointer. This pointer is used
> by the get_unreported_pages iterator to determine at what point it should
> resume searching for non-reported pages. In order to guarantee pages do
> not get past the scan I have modified add_to_free_list_tail so that it
> will not insert pages behind the reported_boundary.
> 
> If another process needs to perform a massive manipulation of the free
> list, such as compaction, it can either reset a given individual boundary
> which will push the boundary back to the list_head, or it can clear the
> bit indicating the zone is actively processing which will result in the
> reporting process resetting all of the boundaries for a given zone.

Is this any different from the previous version? The last review
feedback (both from me and Mel) was that we are not happy to have an
externally imposed constraint on how the page allocator is supposed to
maintain its free lists.

If this is really the only way to go forward then I would like to hear
very convincing arguments about other approaches not being feasible.
There are none in this cover letter, unfortunately. This will be a really
hard sell without them.
Alexander Duyck Sept. 24, 2019, 3:20 p.m. UTC | #2
On Tue, Sep 24, 2019 at 7:23 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 18-09-19 10:52:25, Alexander Duyck wrote:
> [...]
> > In order to try and keep the time needed to find a non-reported page to
> > a minimum we maintain a "reported_boundary" pointer. This pointer is used
> > by the get_unreported_pages iterator to determine at what point it should
> > resume searching for non-reported pages. In order to guarantee pages do
> > not get past the scan I have modified add_to_free_list_tail so that it
> > will not insert pages behind the reported_boundary.
> >
> > If another process needs to perform a massive manipulation of the free
> > list, such as compaction, it can either reset a given individual boundary
> > which will push the boundary back to the list_head, or it can clear the
> > bit indicating the zone is actively processing which will result in the
> > reporting process resetting all of the boundaries for a given zone.
>
> Is this any different from the previous version? The last review
> feedback (both from me and Mel) was that we are not happy to have an
> externally imposed constraint on how the page allocator is supposed to
> maintain its free lists.

The main change for v10 versus v9 is that I allow the page reporting
boundary to be overridden. Specifically there are two approaches that
can be taken.

The first is to simply reset the iterator for whatever list is
updated. What this will do is reset the iterator back to list_head and
then you can do whatever you want with that specific list.

The other option is to simply clear the ZONE_PAGE_REPORTING_ACTIVE
bit. That will essentially notify the page reporting code that any/all
hints that were recorded have been discarded and that it needs to
start over.

All I am trying to do with this approach is reduce the work. Without
doing this the code has to walk the entire free page list for the
higher orders every iteration and that will not be cheap. Admittedly
it is a bit more invasive than the cut/splice logic used in compaction
which is taking the pages it has already processed and moving them to
the other end of the list. However, I have reduced things so that we
only really are limiting where add_to_free_list_tail can place pages,
and we are having to check/push back the boundaries if a reported page
is removed from a free_list.

> If this is really the only way to go forward then I would like to hear
> very convincing arguments about other approaches not being feasible.
> There are none in this cover letter, unfortunately. This will be a really
> hard sell without them.

So I had considered several different approaches.

What I started out with was logic that performed the hinting as part
of the architecture-specific arch_free_page call. It worked but had
performance issues, as we were generating a hint per page freed, which
has fairly high overhead.

The approach Nitesh has been using is to try and maintain a separate
bitmap of "dirty" pages that have recently been freed. There are a few
problems I saw with that approach. First is the fact that it becomes
lossy, in that pages could be reallocated out while we are waiting for
the iterator to come through and process them. This results in a
greater amount of work, as we have to hunt and peck for the pages; as
such, the zone lock has to be released and reacquired often, which
slows this approach down further. Secondly there is the management of
the bitmap itself and sparse memory, which would likely necessitate
doing something similar to pageblock_flags in order to support possible
gaps in the zones.
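
As a rough illustration of that lossiness (a toy model with invented
names, not Nitesh's actual code): a bit only records that a page *was*
freed recently, so the scanner has to re-verify each candidate is still
free before hinting it.

/* Toy model: stale hint bits must be re-checked against actual state. */
#include <stdbool.h>
#include <stdio.h>

#define NPAGES 8

static bool hint_bit[NPAGES];  /* set when a page is freed */
static bool page_free[NPAGES]; /* the allocator's actual state */

static void mark_freed(int pfn)
{
        page_free[pfn] = true;
        hint_bit[pfn] = true;
}

static void mark_allocated(int pfn)
{
        page_free[pfn] = false; /* hint_bit may now be stale */
}

static void scan_and_hint(void)
{
        for (int pfn = 0; pfn < NPAGES; pfn++) {
                if (!hint_bit[pfn])
                        continue;
                hint_bit[pfn] = false;
                if (!page_free[pfn])
                        continue;       /* stale hint: page was reused */
                printf("hinting pfn %d\n", pfn);
        }
}

int main(void)
{
        mark_freed(1);
        mark_freed(2);
        mark_allocated(2);  /* reallocated before the scan runs */
        scan_and_hint();    /* hints pfn 1 only */
        return 0;
}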

I had considered trying to maintain a separate list entirely and have
the free pages placed there. However, that was more invasive than this
solution. In addition, modifying the free_list/free_area in any way is
problematic, as it can result in the zone lock falling into the same
cacheline as the highest order free_area.

Ultimately what I settled on was the approach we have now where adding
a page to the head of the free_list is unchanged, adding a page to the
tail requires a check to see if the iterator is currently walking the
list, and removing the page requires pushing back the iterator if the
page is at the top of the reported list. I was trying to keep the
amount of code that would have to be touched in the non-reported case
to a minimum. With this we have to test for a bit in the zone flags if
adding to tail, and we have to test for a bit in the page on a
move/del from the freelist. So for the most common free/alloc cases we
would only have the impact of the one additional page flag check.
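
Schematically, the fast-path cost being described amounts to a bit test
or two; a sketch with hypothetical flag names, not the actual patch:

/*
 * Schematic sketch: the common free/alloc paths described above only
 * gain a bit test or two, and the boundary work happens solely on the
 * slow paths. Flag names here are hypothetical.
 */
#include <stdbool.h>

struct zone { unsigned long flags; };
struct page { unsigned long flags; };

#define ZONE_PAGE_REPORTING_ACTIVE 0x1UL /* zone scan in progress */
#define PAGE_REPORTED              0x1UL /* page was already reported */

/* Adding to the head of a free list needs no reporting checks at all. */

/* Adding to the tail costs one test of a zone flag... */
static bool tail_add_needs_boundary_check(const struct zone *z)
{
        return z->flags & ZONE_PAGE_REPORTING_ACTIVE;
}

/* ...and removal costs one test of a page flag. */
static bool del_needs_boundary_pushback(const struct page *p)
{
        return p->flags & PAGE_REPORTED;
}

int main(void)
{
        struct zone z = { ZONE_PAGE_REPORTING_ACTIVE };
        struct page p = { 0 };

        if (tail_add_needs_boundary_check(&z)) {
                /* slow path: keep the insert in front of the boundary */
        }
        if (del_needs_boundary_pushback(&p)) {
                /* slow path: push the reported_boundary back */
        }
        return 0;
}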
David Hildenbrand Sept. 24, 2019, 3:32 p.m. UTC | #3
On 24.09.19 16:23, Michal Hocko wrote:
> On Wed 18-09-19 10:52:25, Alexander Duyck wrote:
> [...]
>> In order to try and keep the time needed to find a non-reported page to
>> a minimum we maintain a "reported_boundary" pointer. This pointer is used
>> by the get_unreported_pages iterator to determine at what point it should
>> resume searching for non-reported pages. In order to guarantee pages do
>> not get past the scan I have modified add_to_free_list_tail so that it
>> will not insert pages behind the reported_boundary.
>>
>> If another process needs to perform a massive manipulation of the free
>> list, such as compaction, it can either reset a given individual boundary
>> which will push the boundary back to the list_head, or it can clear the
>> bit indicating the zone is actively processing which will result in the
>> reporting process resetting all of the boundaries for a given zone.
> 
> Is this any different from the previous version? The last review
> feedback (both from me and Mel) was that we are not happy to have an
> externally imposed constraint on how the page allocator is supposed to
> maintain its free lists.
> 
> If this is really the only way to go forward then I would like to hear
> very convincing arguments about other approaches not being feasible.

Adding to what Alexander said, I don't consider the other approaches
(especially the bitmap-based approach Nitesh is currently working on)
infeasible. There might be more rough edges (e.g., sparse zones) and
eventually sometimes a little more work to be done, but definitely
feasible. Incorporating stuff into the buddy might make some tasks
(e.g., identify free pages) more efficient.

I still somewhat like the idea of capturing hints of free pages (in
whatever data structure) and then going over the hints, seeing if the
pages are still free. Then only temporarily isolating the still-free
pages, reporting them, and un-isolating them after they were reported. I
like the idea that the pages are not fake-allocated but only temporarily
blocked. That works nicely, e.g., with the movable zone (which contains
only movable data).

But anyhow, after decades of people working on free page
hinting/reporting, I am happy with anything that gets accepted upstream :D
Nitesh Narayan Lal Sept. 24, 2019, 3:51 p.m. UTC | #4
On 9/24/19 11:32 AM, David Hildenbrand wrote:
> On 24.09.19 16:23, Michal Hocko wrote:
>> On Wed 18-09-19 10:52:25, Alexander Duyck wrote:
>> [...]
>>> In order to try and keep the time needed to find a non-reported page to
>>> a minimum we maintain a "reported_boundary" pointer. This pointer is used
>>> by the get_unreported_pages iterator to determine at what point it should
>>> resume searching for non-reported pages. In order to guarantee pages do
>>> not get past the scan I have modified add_to_free_list_tail so that it
>>> will not insert pages behind the reported_boundary.
>>>
>>> If another process needs to perform a massive manipulation of the free
>>> list, such as compaction, it can either reset a given individual boundary
>>> which will push the boundary back to the list_head, or it can clear the
>>> bit indicating the zone is actively processing which will result in the
>>> reporting process resetting all of the boundaries for a given zone.
>> Is this any different from the previous version? The last review
>> feedback (both from me and Mel) was that we are not happy to have an
>> externally imposed constraint on how the page allocator is supposed to
>> maintain its free lists.
>>
>> If this is really the only way to go forward then I would like to hear
>> very convincing arguments about other approaches not being feasible.
> Adding to what Alexander said, I don't consider the other approaches
> (especially the bitmap-based approach Nitesh is currently working on)
> infeasible. There might be more rough edges (e.g., sparse zones) and
> eventually sometimes a little more work to be done, but definitely
> feasible. Incorporating stuff into the buddy might make some tasks
> (e.g., identify free pages) more efficient.

My plan was to get a framework ready which performs decently and
is acceptable upstream (keeping core-mm changes to a minimum), and then keep
optimizing it for different use cases.
Indeed, the bitmap-based approach may not be efficient for every available use
case. But then I am not sure if we want to target that, considering it may
require mm changes.

> I still somewhat like the idea of capturing hints of free pages (in
> whatever data structure) and then going over the hints, seeing if the
> pages are still free. Then only temporarily isolating the still-free
> pages, reporting them, and un-isolating them after they were reported. I
> like the idea that the pages are not fake-allocated but only temporarily
> blocked. That works nicely, e.g., with the movable zone (which contains
> only movable data).
>
> But anyhow, after decades of people working on free page
> hinting/reporting, I am happy with anything that gets accepted upstream :D

+1

>
Alexander Duyck Sept. 24, 2019, 5:07 p.m. UTC | #5
On Tue, Sep 24, 2019 at 8:32 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 24.09.19 16:23, Michal Hocko wrote:
> > On Wed 18-09-19 10:52:25, Alexander Duyck wrote:
> > [...]
> >> In order to try and keep the time needed to find a non-reported page to
> >> a minimum we maintain a "reported_boundary" pointer. This pointer is used
> >> by the get_unreported_pages iterator to determine at what point it should
> >> resume searching for non-reported pages. In order to guarantee pages do
> >> not get past the scan I have modified add_to_free_list_tail so that it
> >> will not insert pages behind the reported_boundary.
> >>
> >> If another process needs to perform a massive manipulation of the free
> >> list, such as compaction, it can either reset a given individual boundary
> >> which will push the boundary back to the list_head, or it can clear the
> >> bit indicating the zone is actively processing which will result in the
> >> reporting process resetting all of the boundaries for a given zone.
> >
> > Is this any different from the previous version? The last review
> > feedback (both from me and Mel) was that we are not happy to have an
> > externally imposed constraint on how the page allocator is supposed to
> > maintain its free lists.
> >
> > If this is really the only way to go forward then I would like to hear
> > very convincing arguments about other approaches not being feasible.
>
> Adding to what Alexander said, I don't consider the other approaches
> (especially the bitmap-based approach Nitesh is currently working on)
> infeasible. There might be more rough edges (e.g., sparse zones) and
> eventually sometimes a little more work to be done, but definitely
> feasible. Incorporating stuff into the buddy might make some tasks
> (e.g., identify free pages) more efficient.
>
> I still somewhat like the idea of capturing hints of free pages (in
> whatever data structure) and then going over the hints, seeing if the
> pages are still free. Then only temporarily isolating the still-free
> pages, reporting them, and un-isolating them after they were reported. I
> like the idea that the pages are not fake-allocated but only temporarily
> blocked. That works nicely, e.g., with the movable zone (which contains
> only movable data).

One other change in this patch set is that I split the headers so that
there is an internal header that resides in the mm tree and an
external one that provides the page reporting device structure and the
register/unregister functions. All that virtio-balloon knows is that
it is registering a notifier and will be called with scatter gather
lists for memory that is not currently in use by the kernel. It has no
visibility into the internal free_areas or the current state of the
buddy allocator. Rather than having two blocks that are both trying to
maintain that state, I have consolidated it all into the buddy
allocator with page reporting.
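
A userspace mock of that contract might look like the following. The
structure and function names echo the series, but the signatures here
are simplified stand-ins rather than the actual API:

/*
 * Mock of the driver-facing contract: register a callback, get called
 * with ranges of currently-unused memory. Not the kernel API.
 */
#include <stddef.h>
#include <stdio.h>

struct sg_entry {               /* stand-in for a scatterlist entry */
        void *addr;
        size_t len;
};

struct page_reporting_dev_info {
        int (*report)(struct page_reporting_dev_info *dev,
                      struct sg_entry *sgl, unsigned int nents);
};

static struct page_reporting_dev_info *registered;

static int page_reporting_register(struct page_reporting_dev_info *dev)
{
        if (registered)
                return -1;      /* one reporting device at a time */
        registered = dev;
        return 0;
}

static void page_reporting_unregister(struct page_reporting_dev_info *dev)
{
        if (registered == dev)
                registered = NULL;
}

/* What a virtio-balloon-like driver would supply. */
static int balloon_report(struct page_reporting_dev_info *dev,
                          struct sg_entry *sgl, unsigned int nents)
{
        (void)dev;
        for (unsigned int i = 0; i < nents; i++)
                printf("reporting %zu bytes at %p\n", sgl[i].len, sgl[i].addr);
        return 0;
}

int main(void)
{
        struct page_reporting_dev_info dev = { .report = balloon_report };
        char buf[2][64];
        struct sg_entry sgl[2] = {
                { buf[0], sizeof(buf[0]) }, { buf[1], sizeof(buf[1]) },
        };

        page_reporting_register(&dev);
        if (registered)
                registered->report(registered, sgl, 2);
        page_reporting_unregister(&dev);
        return 0;
}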

> But anyhow, after decades of people working on free page
> hinting/reporting, I am happy with anything that gets accepted upstream :D

Agreed. After working on this for 9 months I would be happy to get
something upstream that addresses this.

- Alex
David Hildenbrand Sept. 24, 2019, 5:28 p.m. UTC | #6
On 24.09.19 19:07, Alexander Duyck wrote:
> On Tue, Sep 24, 2019 at 8:32 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 24.09.19 16:23, Michal Hocko wrote:
>>> On Wed 18-09-19 10:52:25, Alexander Duyck wrote:
>>> [...]
>>>> In order to try and keep the time needed to find a non-reported page to
>>>> a minimum we maintain a "reported_boundary" pointer. This pointer is used
>>>> by the get_unreported_pages iterator to determine at what point it should
>>>> resume searching for non-reported pages. In order to guarantee pages do
>>>> not get past the scan I have modified add_to_free_list_tail so that it
>>>> will not insert pages behind the reported_boundary.
>>>>
>>>> If another process needs to perform a massive manipulation of the free
>>>> list, such as compaction, it can either reset a given individual boundary
>>>> which will push the boundary back to the list_head, or it can clear the
>>>> bit indicating the zone is actively processing which will result in the
>>>> reporting process resetting all of the boundaries for a given zone.
>>>
>>> Is this any different from the previous version? The last review
>>> feedback (both from me and Mel) was that we are not happy to have an
>>> externally imposed constraint on how the page allocator is supposed to
>>> maintain its free lists.
>>>
>>> If this is really the only way to go forward then I would like to hear
>>> very convincing arguments about other approaches not being feasible.
>>
>> Adding to what Alexander said, I don't consider the other approaches
>> (especially the bitmap-based approach Nitesh is currently working on)
>> infeasible. There might be more rough edges (e.g., sparse zones) and
>> eventually sometimes a little more work to be done, but definitely
>> feasible. Incorporating stuff into the buddy might make some tasks
>> (e.g., identify free pages) more efficient.
>>
>> I still somewhat like the idea of capturing hints of free pages (in
>> whatever data structure) and then going over the hints, seeing if the
>> pages are still free. Then only temporarily isolating the still-free
>> pages, reporting them, and un-isolating them after they were reported. I
>> like the idea that the pages are not fake-allocated but only temporarily
>> blocked. That works nicely, e.g., with the movable zone (which contains
>> only movable data).
> 
> One other change in this patch set is that I split the headers so that
> there is an internal header that resides in the mm tree and an
> external one that provides the page reporting device structure and the
> register/unregister functions. All that virtio-balloon knows is that
> it is registering a notifier and will be called with scatter gather
> lists for memory that is not currently in use by the kernel. It has no
> visibility into the internal free_areas or the current state of the
> buddy allocator. Rather than having two blocks that are both trying to
> maintain that state, I have consolidated it all into the buddy
> allocator with page reporting.
> 
>> But anyhow, after decades of people working on free page
>> hinting/reporting, I am happy with anything that gets accepted upstream :D
> 
> Agreed. After working on this for 9 months I would be happy to get
> something upstream that addresses this.

IBM upstreamed their proprietary solution - 45e576b1c3d0 ("[S390] guest
page hinting light") - in 2008.

Rik presented a generic approach in 2011 (!):
https://www.linux-kvm.org/images/f/ff/2011-forum-memory-overcommit.pdf

I think Nitesh has been working on this (initially as an intern) since
mid-2017.

So yeah, this stuff has quite some history :)

> 
> - Alex
>
Michal Hocko Sept. 26, 2019, 12:22 p.m. UTC | #7
On Tue 24-09-19 08:20:22, Alexander Duyck wrote:
> On Tue, Sep 24, 2019 at 7:23 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Wed 18-09-19 10:52:25, Alexander Duyck wrote:
> > [...]
> > > In order to try and keep the time needed to find a non-reported page to
> > > a minimum we maintain a "reported_boundary" pointer. This pointer is used
> > > by the get_unreported_pages iterator to determine at what point it should
> > > resume searching for non-reported pages. In order to guarantee pages do
> > > not get past the scan I have modified add_to_free_list_tail so that it
> > > will not insert pages behind the reported_boundary.
> > >
> > > If another process needs to perform a massive manipulation of the free
> > > list, such as compaction, it can either reset a given individual boundary
> > > which will push the boundary back to the list_head, or it can clear the
> > > bit indicating the zone is actively processing which will result in the
> > > reporting process resetting all of the boundaries for a given zone.
> >
> > Is this any different from the previous version? The last review
> > feedback (both from me and Mel) was that we are not happy to have an
> > externally imposed constraint on how the page allocator is supposed to
> > maintain its free lists.
> 
> The main change for v10 versus v9 is that I allow the page reporting
> boundary to be overridden. Specifically there are two approaches that
> can be taken.
> 
> The first is to simply reset the iterator for whatever list is
> updated. What this will do is reset the iterator back to list_head and
> then you can do whatever you want with that specific list.

OK, this is slightly better than pushing the allocator into a corner.
The allocator really has to be in control of its data structures.
I would still be happier if the allocator wouldn't really have to bother
about somebody snooping its internal state to do its own thing. So
please make sure to describe why and how much this really matters.
 
> The other option is to simply clear the ZONE_PAGE_REPORTING_ACTIVE
> bit. That will essentially notify the page reporting code that any/all
> hints that were recorded have been discarded and that it needs to
> start over.
> 
> All I am trying to do with this approach is reduce the work. Without
> doing this the code has to walk the entire free page list for the
> higher orders every iteration and that will not be cheap.

How expensive will this be?

> Admittedly
> it is a bit more invasive than the cut/splice logic used in compaction
> which is taking the pages it has already processed and moving them to
> the other end of the list. However, I have reduced things so that we
> > are only really limiting where add_to_free_list_tail can place pages,
> and we are having to check/push back the boundaries if a reported page
> is removed from a free_list.
> 
> > If this is really the only way to go forward then I would like to hear
> > very convincing arguments about other approaches not being feasible.
> > > There are none in this cover letter, unfortunately. This will be a really
> > > hard sell without them.
> 
> So I had considered several different approaches.

Thanks, this is certainly useful and it would have been even more so if
you gave some rough numbers to quantify how much overhead for different
solutions we are talking about here.
Alexander Duyck Sept. 26, 2019, 3:13 p.m. UTC | #8
On Thu, Sep 26, 2019 at 5:22 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 24-09-19 08:20:22, Alexander Duyck wrote:
> > On Tue, Sep 24, 2019 at 7:23 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Wed 18-09-19 10:52:25, Alexander Duyck wrote:
> > > [...]
> > > > In order to try and keep the time needed to find a non-reported page to
> > > > a minimum we maintain a "reported_boundary" pointer. This pointer is used
> > > > by the get_unreported_pages iterator to determine at what point it should
> > > > resume searching for non-reported pages. In order to guarantee pages do
> > > > not get past the scan I have modified add_to_free_list_tail so that it
> > > > will not insert pages behind the reported_boundary.
> > > >
> > > > If another process needs to perform a massive manipulation of the free
> > > > list, such as compaction, it can either reset a given individual boundary
> > > > which will push the boundary back to the list_head, or it can clear the
> > > > bit indicating the zone is actively processing which will result in the
> > > > reporting process resetting all of the boundaries for a given zone.
> > >
> > > Is this any different from the previous version? The last review
> > > feedback (both from me and Mel) was that we are not happy to have an
> > > externally imposed constraint on how the page allocator is supposed to
> > > maintain its free lists.
> >
> > The main change for v10 versus v9 is that I allow the page reporting
> > boundary to be overridden. Specifically there are two approaches that
> > can be taken.
> >
> > The first is to simply reset the iterator for whatever list is
> > updated. What this will do is reset the iterator back to list_head and
> > then you can do whatever you want with that specific list.
>
> OK, this is slightly better than pushing the allocator into a corner.
> The allocator really has to be in control of its data structures.
> I would still be happier if the allocator wouldn't really have to bother
> about somebody snooping its internal state to do its own thing. So
> please make sure to describe why and how much this really matters.

Okay, I can try to do that. I suppose if nothing else I can put
together a test patch that reverts these bits and can add
documentation on the amount of regression seen without those bits. I
should be able to get that taken care of and a v11 out in the next few
days.

> > The other option is to simply clear the ZONE_PAGE_REPORTING_ACTIVE
> > bit. That will essentially notify the page reporting code that any/all
> > hints that were recorded have been discarded and that it needs to
> > start over.
> >
> > All I am trying to do with this approach is reduce the work. Without
> > doing this the code has to walk the entire free page list for the
> > higher orders every iteration and that will not be cheap.
>
> How expensive will this be?

Well, without this I believe the work goes from being O(n) to O(n^2), as
we would have to walk the list every time we pull a batch of pages;
without the iterator we end up having to walk the page list
repeatedly. I suspect it becomes more expensive the more memory we
have. I'll be able to verify it later today once I can generate some
numbers.
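
As a back-of-the-envelope illustration of that scaling, here is a toy
count of list nodes visited (an assumption-laden model, not
measurements of the real allocator):

/*
 * Toy count: restarting the scan from the head for every batch re-walks
 * the reported section each pass (quadratic total work), while resuming
 * from a saved boundary visits each page once (linear).
 */
#include <stdio.h>

#define NPAGES 4096
#define BATCH  16

int main(void)
{
        long rescan = 0, resume = 0;

        /* restart from the list head for every batch */
        for (int done = 0; done < NPAGES; done += BATCH)
                rescan += done + BATCH; /* skip 'done' reported pages */

        /* remember the boundary: each page is visited exactly once */
        resume = NPAGES;

        printf("pages visited, rescan from head:   %ld\n", rescan);
        printf("pages visited, resume at boundary: %ld\n", resume);
        return 0;
}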

> > Admittedly
> > it is a bit more invasive than the cut/splice logic used in compaction
> > which is taking the pages it has already processed and moving them to
> > the other end of the list. However, I have reduced things so that we
> > are only really limiting where add_to_free_list_tail can place pages,
> > and we are having to check/push back the boundaries if a reported page
> > is removed from a free_list.
> >
> > > If this is really the only way to go forward then I would like to hear
> > > very convincing arguments about other approaches not being feasible.
> > > There are none in this cover letter, unfortunately. This will be a really
> > > hard sell without them.
> >
> > So I had considered several different approaches.
>
> Thanks, this is certainly useful and it would have been even more so if
> you gave some rough numbers to quantify how much overhead for different
> solutions we are talking about here.

I'll see what I can do. As far as the bitmap solution goes, I think Nitesh
has numbers for what he has been able to get out of it. At this point
I would assume his solution for the virtio/QEMU bits is probably
identical to mine so it should be easier to get an apples to apples
comparison.

Thanks.

- Alex