
[v11,4/6] mm: function to offer a page block on the free list

Message ID 1497004901-30593-5-git-send-email-wei.w.wang@intel.com (mailing list archive)
State New, archived

Commit Message

Wang, Wei W June 9, 2017, 10:41 a.m. UTC
Add a function to find a page block on the free list specified by the
caller. Pages from the page block may be used immediately after the
function returns. The caller is responsible for detecting or preventing
the use of such pages.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
 include/linux/mm.h |  5 +++
 mm/page_alloc.c    | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 96 insertions(+)

Comments

Dave Hansen June 12, 2017, 2:10 p.m. UTC | #1
Please stop cc'ing me on things also sent to closed mailing lists
(virtio-dev@lists.oasis-open.org).  I'm happy to review things on open
lists, but I'm not fond of the closed lists bouncing things at me.

On 06/09/2017 03:41 AM, Wei Wang wrote:
> Add a function to find a page block on the free list specified by the
> caller. Pages from the page block may be used immediately after the
> function returns. The caller is responsible for detecting or preventing
> the use of such pages.

This description doesn't tell me very much about what's going on here.
Neither does the comment.

"Pages from the page block may be used immediately after the
 function returns".

Used by who?  Does the "may" here mean that it is OK, or is it a warning
that the contents will be thrown away immediately?

The hypervisor is going to throw away the contents of these pages,
right?  As soon as the spinlock is released, someone can allocate a
page, and put good data in it.  What keeps the hypervisor from throwing
away good data?
Michael S. Tsirkin June 12, 2017, 4:28 p.m. UTC | #2
On Mon, Jun 12, 2017 at 07:10:12AM -0700, Dave Hansen wrote:
> Please stop cc'ing me on things also sent to closed mailing lists
> (virtio-dev@lists.oasis-open.org).  I'm happy to review things on open
> lists, but I'm not fond of the closed lists bouncing things at me.
> 
> On 06/09/2017 03:41 AM, Wei Wang wrote:
> > Add a function to find a page block on the free list specified by the
> > caller. Pages from the page block may be used immediately after the
> > function returns. The caller is responsible for detecting or preventing
> > the use of such pages.
> 
> This description doesn't tell me very much about what's going on here.
> Neither does the comment.
> 
> "Pages from the page block may be used immediately after the
>  function returns".
> 
> Used by who?  Does the "may" here mean that it is OK, or is it a warning
> that the contents will be thrown away immediately?

I agree here. Don't tell callers what they should do; say what the
function does. "offer" is also confusing. Here's a better comment:

--->
mm: support reporting free page blocks

This adds support for reporting blocks of pages on the free list
specified by the caller.

As pages can leave the free list during this call or immediately
afterwards, they are not guaranteed to be free after the function
returns. The only guarantee this makes is that the page was on the free
list at some point in time after the function has been invoked.

Therefore, it is not safe for the caller to use any pages on the returned
block, or to discard data that is put there after the function returns.
However, it is safe for the caller to discard data that was in one of these
pages before the function was invoked.

---

And repeat the last part in a code comment:

 * Note: it is not safe for the caller to use any pages on the returned
 * block, or to discard data that is put there after the function returns.
 * However, it is safe for the caller to discard data that was in one of these
 * pages before the function was invoked.
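
To make the intended use concrete, a caller honoring this might look
like the sketch below (hypothetical, not part of the patch:
report_block_to_host() is invented, and the order/migratetype
arguments are just examples):

	/* Walk one free list and report every block to the host. */
	static void walk_free_blocks(struct zone *zone)
	{
		struct page *page = NULL;	/* NULL: start at the list head */
		int ret;

		do {
			ret = report_unused_page_block(zone, MAX_ORDER - 1,
						       MIGRATE_MOVABLE, &page);
			if (ret == 0 && page)
				report_block_to_host(page_to_pfn(page),
						     MAX_ORDER_NR_PAGES);
		} while (ret == 0);	/* non-zero: list empty or exhausted */
	}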


> The hypervisor is going to throw away the contents of these pages,
> right?

It should be careful and only throw away contents that were there before
report_unused_page_block was invoked.  The hypervisor is responsible for not
corrupting guest memory.  But that's not something an mm patch should
worry about.

>  As soon as the spinlock is released, someone can allocate a
> page, and put good data in it.  What keeps the hypervisor from throwing
> away good data?

The API should require this explicitly. Hopefully the above answers this question.
Dave Hansen June 12, 2017, 4:42 p.m. UTC | #3
On 06/12/2017 09:28 AM, Michael S. Tsirkin wrote:
> 
>> The hypervisor is going to throw away the contents of these pages,
>> right?
> It should be careful and only throw away contents that were there before
> report_unused_page_block was invoked.  The hypervisor is responsible for not
> corrupting guest memory.  But that's not something an mm patch should
> worry about.

That makes sense.  I'm struggling to imagine how the hypervisor makes
use of this information, though.  Does it make the pages read-only
before this, and then it knows if there has not been a write *and* it
gets notified via this new mechanism that it can throw the page away?
Michael S. Tsirkin June 12, 2017, 8:34 p.m. UTC | #4
On Mon, Jun 12, 2017 at 09:42:36AM -0700, Dave Hansen wrote:
> On 06/12/2017 09:28 AM, Michael S. Tsirkin wrote:
> > 
> >> The hypervisor is going to throw away the contents of these pages,
> >> right?
> > It should be careful and only throw away contents that were there before
> > report_unused_page_block was invoked.  The hypervisor is responsible for not
> > corrupting guest memory.  But that's not something an mm patch should
> > worry about.
> 
> That makes sense.  I'm struggling to imagine how the hypervisor makes
> use of this information, though.  Does it make the pages read-only
> before this, and then it knows if there has not been a write *and* it
> gets notified via this new mechanism that it can throw the page away?

Yes, and specifically, this is how it works for migration.  Normally you
start by migrating all of memory, then send updates incrementally if
pages have been modified.  This mechanism allows skipping some pages in
the 1st stage; if they get changed, they will be migrated in the 2nd
stage.
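
In rough pseudo-C, the two host-side stages might look like this (purely
illustrative; every helper named here is made up):

	/* Stage 1: bulk copy, skipping hinted-free pages that are still
	 * clean according to dirty logging. */
	for (pfn = 0; pfn < guest_nr_pfns; pfn++) {
		if (hinted_free(pfn) && !dirtied_since_hint(pfn))
			continue;	/* contents known worthless */
		send_page(pfn);
	}

	/* Stage 2: dirty logging catches anything written after the hint,
	 * including hinted pages that the guest reused. */
	while (!converged())
		send_and_clear_dirty_pages();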
Dave Hansen June 12, 2017, 8:54 p.m. UTC | #5
On 06/12/2017 01:34 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 12, 2017 at 09:42:36AM -0700, Dave Hansen wrote:
>> On 06/12/2017 09:28 AM, Michael S. Tsirkin wrote:
>>>
>>>> The hypervisor is going to throw away the contents of these pages,
>>>> right?
>>> It should be careful and only throw away contents that were there before
>>> report_unused_page_block was invoked.  The hypervisor is responsible for not
>>> corrupting guest memory.  But that's not something an mm patch should
>>> worry about.
>>
>> That makes sense.  I'm struggling to imagine how the hypervisor makes
>> use of this information, though.  Does it make the pages read-only
>> before this, and then it knows if there has not been a write *and* it
>> gets notified via this new mechanism that it can throw the page away?
> 
> Yes, and specifically, this is how it works for migration.  Normally you
> start by migrating all of memory, then send updates incrementally if
> pages have been modified.  This mechanism allows skipping some pages in
> the 1st stage; if they get changed, they will be migrated in the 2nd
> stage.

OK, so the migration starts and marks everything read-only.  All the
pages now have read-only valuable data, or read-only worthless data in
the case that the page is in the free lists.  In order for a page to
become non-worthless, it has to have a write done to it, which the
hypervisor obviously knows about.

With this mechanism, the hypervisor knows it can discard pages which
have not had a write since they were known to have worthless contents.

Correct?

That also seems like pretty good information to include in the
changelog.  Otherwise, folks are going to be left wondering what good
the mechanism is.  It's pretty non-trivial to figure out. :)
Wang, Wei W June 13, 2017, 2:56 a.m. UTC | #6
On 06/13/2017 04:54 AM, Dave Hansen wrote:
> On 06/12/2017 01:34 PM, Michael S. Tsirkin wrote:
>> On Mon, Jun 12, 2017 at 09:42:36AM -0700, Dave Hansen wrote:
>>> On 06/12/2017 09:28 AM, Michael S. Tsirkin wrote:
>>>>> The hypervisor is going to throw away the contents of these pages,
>>>>> right?
>>>> It should be careful and only throw away contents that were there before
>>>> report_unused_page_block was invoked.  The hypervisor is responsible for not
>>>> corrupting guest memory.  But that's not something an mm patch should
>>>> worry about.
>>> That makes sense.  I'm struggling to imagine how the hypervisor makes
>>> use of this information, though.  Does it make the pages read-only
>>> before this, and then it knows if there has not been a write *and* it
>>> gets notified via this new mechanism that it can throw the page away?
>> Yes, and specifically, this is how it works for migration.  Normally you
>> start by migrating all of memory, then send updates incrementally if
>> pages have been modified.  This mechanism allows skipping some pages in
>> the 1st stage; if they get changed, they will be migrated in the 2nd
>> stage.
> OK, so the migration starts and marks everything read-only.  All the
> pages now have read-only valuable data, or read-only worthless data in
> the case that the page is in the free lists.  In order for a page to
> become non-worthless, it has to have a write done to it, which the
> hypervisor obviously knows about.
>
> With this mechanism, the hypervisor knows it can discard pages which
> have not had a write since they were known to have worthless contents.
>
> Correct?
Right. By the way, making pages read-only is one of the dirty page logging
methods that a hypervisor uses to capture the pages that are
written by the VM.

>
> That also seems like pretty good information to include in the
> changelog.  Otherwise, folks are going to be left wondering what good
> the mechanism is.  It's pretty non-trivial to figure out. :)
If necessary, I think it's better to keep the introduction at a high level:

An example of how a hypervisor can use this API:
To live migrate a VM from one physical machine to another,
the hypervisor usually transfers all the VM's memory content.
An optimization here is to skip the transfer of memory that is not
in use by the VM, because the content of the unused memory is
worthless.
This API is then used to report the unused pages to the hypervisor.
The pages that have been reported to the hypervisor as unused
pages may be used by the VM after the report. The hypervisor
has a good mechanism (i.e. dirty page logging) to capture
the change. Therefore, if the newly used pages are written with
data, the hypervisor will still transfer them to the destination machine.

What do you guys think?

Best,
Wei
Rik van Riel June 20, 2017, 4:44 p.m. UTC | #7
On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:

> The hypervisor is going to throw away the contents of these pages,
> right?  As soon as the spinlock is released, someone can allocate a
> page, and put good data in it.  What keeps the hypervisor from
> throwing
> away good data?

That looks like it may be the wrong API, then?

We already have hooks called arch_free_page and
arch_alloc_page in the VM, which are called when
pages are freed, and allocated, respectively.

Nitesh Lal (on the CC list) is working on a way
to efficiently batch recently freed pages for
free page hinting to the hypervisor.

If that is done efficiently enough (eg. with
MADV_FREE on the hypervisor side for lazy freeing,
and lazy later re-use of the pages), do we still
need the harder to use batch interface from this
patch?
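
(For reference, the lazy-freeing half of that is just a regular
madvise(2) call on the hypervisor side; e.g., for a hinted guest
range of len bytes mapped at addr:

	if (madvise(addr, len, MADV_FREE) < 0)
		perror("madvise");

The kernel may then reclaim the backing pages lazily, and a later
guest write cancels the free and repopulates them.)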
David Hildenbrand June 20, 2017, 4:49 p.m. UTC | #8
On 20.06.2017 18:44, Rik van Riel wrote:
> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
> 
>> The hypervisor is going to throw away the contents of these pages,
>> right?  As soon as the spinlock is released, someone can allocate a
>> page, and put good data in it.  What keeps the hypervisor from
>> throwing
>> away good data?
> 
> That looks like it may be the wrong API, then?
> 
> We already have hooks called arch_free_page and
> arch_alloc_page in the VM, which are called when
> pages are freed, and allocated, respectively.
> 
> Nitesh Lal (on the CC list) is working on a way
> to efficiently batch recently freed pages for
> free page hinting to the hypervisor.
> 
> If that is done efficiently enough (eg. with
> MADV_FREE on the hypervisor side for lazy freeing,
> and lazy later re-use of the pages), do we still
> need the harder to use batch interface from this
> patch?
> 
David's opinion incoming:

No, I think proper free page hinting would be the optimum solution, if
done right. This would avoid the batch interface and even render
virtio-balloon in some sense useless.
Rik van Riel June 20, 2017, 5:29 p.m. UTC | #9
On Tue, 2017-06-20 at 18:49 +0200, David Hildenbrand wrote:
> On 20.06.2017 18:44, Rik van Riel wrote:

> > Nitesh Lal (on the CC list) is working on a way
> > to efficiently batch recently freed pages for
> > free page hinting to the hypervisor.
> > 
> > If that is done efficiently enough (eg. with
> > MADV_FREE on the hypervisor side for lazy freeing,
> > and lazy later re-use of the pages), do we still
> > need the harder to use batch interface from this
> > patch?
> > 
> 
> David's opinion incoming:
> 
> No, I think proper free page hinting would be the optimum solution,
> if
> done right. This would avoid the batch interface and even render
> virtio-balloon in some sense useless.

I agree with that.  Let me go into some more detail of
what Nitesh is implementing:

1) In arch_free_page, the being-freed page is added
   to a per-cpu set of freed pages.
2) Once that set is full, arch_free_pages goes into a
   slow path, which:
   2a) Iterates over the set of freed pages, and
   2b) Checks whether they are still free, and
   2c) Adds the still free pages to a list that is
       to be passed to the hypervisor, to be MADV_FREEd.
   2d) Makes that hypercall.

Meanwhile all arch_alloc_pages has to do is make sure it
does not allocate a page while it is currently being
MADV_FREEd on the hypervisor side.

The code Wei is working on looks like it could be 
suitable for steps (2c) and (2d) above. Nitesh already
has code for steps 1 through 2b.
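
As a very rough sketch of that flow (my reading of the above, not
Nitesh's actual code; the batch size and helper names are invented,
and locking/preemption details are glossed over):

	#define FREE_PAGE_BATCH 64

	struct freed_set {
		struct page *pages[FREE_PAGE_BATCH];
		unsigned int count;
	};
	static DEFINE_PER_CPU(struct freed_set, freed_set);

	void arch_free_page(struct page *page, int order)
	{
		struct freed_set *s = this_cpu_ptr(&freed_set);
		unsigned int i;

		s->pages[s->count++] = page;		/* step 1 */
		if (s->count < FREE_PAGE_BATCH)
			return;
		for (i = 0; i < s->count; i++)		/* 2a: slow path */
			/* 2b: real code must recheck under zone->lock */
			if (PageBuddy(s->pages[i]))
				hint_list_add(s->pages[i]);	/* 2c */
		madv_free_hypercall();			/* 2d */
		s->count = 0;
	}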
Michael S. Tsirkin June 20, 2017, 6:17 p.m. UTC | #10
On Tue, Jun 20, 2017 at 06:49:33PM +0200, David Hildenbrand wrote:
> On 20.06.2017 18:44, Rik van Riel wrote:
> > On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
> > 
> >> The hypervisor is going to throw away the contents of these pages,
> >> right?  As soon as the spinlock is released, someone can allocate a
> >> page, and put good data in it.  What keeps the hypervisor from
> >> throwing
> >> away good data?
> > 
> > That looks like it may be the wrong API, then?
> > 
> > We already have hooks called arch_free_page and
> > arch_alloc_page in the VM, which are called when
> > pages are freed, and allocated, respectively.
> > 
> > Nitesh Lal (on the CC list) is working on a way
> > to efficiently batch recently freed pages for
> > free page hinting to the hypervisor.
> > 
> > If that is done efficiently enough (eg. with
> > MADV_FREE on the hypervisor side for lazy freeing,
> > and lazy later re-use of the pages), do we still
> > need the harder to use batch interface from this
> > patch?
> > 
> David's opinion incoming:
> 
> No, I think proper free page hinting would be the optimum solution, if
> done right. This would avoid the batch interface and even render
> virtio-balloon in some sense useless.

I agree generally. But we have to balance that against the fact that
this has been discussed since at least 2011 and no one has built this solution
yet.

Michael S. Tsirkin June 20, 2017, 6:26 p.m. UTC | #11
On Tue, Jun 20, 2017 at 01:29:00PM -0400, Rik van Riel wrote:
> On Tue, 2017-06-20 at 18:49 +0200, David Hildenbrand wrote:
> > On 20.06.2017 18:44, Rik van Riel wrote:
> 
> > > Nitesh Lal (on the CC list) is working on a way
> > > to efficiently batch recently freed pages for
> > > free page hinting to the hypervisor.
> > > 
> > > If that is done efficiently enough (eg. with
> > > MADV_FREE on the hypervisor side for lazy freeing,
> > > and lazy later re-use of the pages), do we still
> > > need the harder to use batch interface from this
> > > patch?
> > > 
> > 
> > David's opinion incoming:
> > 
> > No, I think proper free page hinting would be the optimum solution,
> > if
> > done right. This would avoid the batch interface and even render
> > virtio-balloon in some sense useless.
> 
> I agree with that.  Let me go into some more detail of
> what Nitesh is implementing:
> 
> 1) In arch_free_page, the being-freed page is added
>    to a per-cpu set of freed pages.
> 2) Once that set is full, arch_free_pages goes into a
>    slow path, which:
>    2a) Iterates over the set of freed pages, and
>    2b) Checks whether they are still free, and
>    2c) Adds the still free pages to a list that is
>        to be passed to the hypervisor, to be MADV_FREEd.
>    2d) Makes that hypercall.
> 
> Meanwhile all arch_alloc_pages has to do is make sure it
> does not allocate a page while it is currently being
> MADV_FREEd on the hypervisor side.
> 
> The code Wei is working on looks like it could be 
> suitable for steps (2c) and (2d) above. Nitesh already
> has code for steps 1 through 2b.
> 


So my question is this: Wei posted these numbers for balloon
inflation times:
inflating 7GB of an 8GB idle guest:

	1) allocating pages (6.5%)
	2) sending PFNs to host (68.3%)
	3) address translation (6.1%)
	4) madvise (19%)

	It takes about 4126ms for the inflating process to complete.

It seems that this is an excessive amount of time to stay
under a lock. What are your estimates for Nitesh's work?
David Hildenbrand June 20, 2017, 6:54 p.m. UTC | #12
On 20.06.2017 20:17, Michael S. Tsirkin wrote:
> On Tue, Jun 20, 2017 at 06:49:33PM +0200, David Hildenbrand wrote:
>> On 20.06.2017 18:44, Rik van Riel wrote:
>>> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
>>>
>>>> The hypervisor is going to throw away the contents of these pages,
>>>> right?  As soon as the spinlock is released, someone can allocate a
>>>> page, and put good data in it.  What keeps the hypervisor from
>>>> throwing
>>>> away good data?
>>>
>>> That looks like it may be the wrong API, then?
>>>
>>> We already have hooks called arch_free_page and
>>> arch_alloc_page in the VM, which are called when
>>> pages are freed, and allocated, respectively.
>>>
>>> Nitesh Lal (on the CC list) is working on a way
>>> to efficiently batch recently freed pages for
>>> free page hinting to the hypervisor.
>>>
>>> If that is done efficiently enough (eg. with
>>> MADV_FREE on the hypervisor side for lazy freeing,
>>> and lazy later re-use of the pages), do we still
>>> need the harder to use batch interface from this
>>> patch?
>>>
>> David's opinion incoming:
>>
>> No, I think proper free page hinting would be the optimum solution, if
>> done right. This would avoid the batch interface and even render
>> virtio-balloon in some sense useless.
> 
> I agree generally. But we have to balance that against the fact that
> this has been discussed since at least 2011 and no one has built this solution
> yet.

I totally agree, and I still think it will be hard to get decent
performance for free page hinting (let's call it challenging). But I
have heard of some interesting ideas. Surprise me.

Still, I would favor such an interface over an mm interface where people
start asking the same question over and over again ("how can this even
work"). Not only because it wasn't explained sufficiently, but
also because this interface is so specific to one use case and one
scenario (concurrent dirty tracking in the host during migration).

IMHO even simply writing all-zeros to all free pages before starting
migration (or even when freeing a page) would be a cleaner interface
than this (because it atomically works with the entity the host cares
about for migration). But yes, performance is horrible, which is why I am
not even suggesting it. Just saying that this mm interface is very, very
special, and if we could find something better, I'd favor it.
Michael S. Tsirkin June 20, 2017, 6:56 p.m. UTC | #13
On Tue, Jun 20, 2017 at 08:54:29PM +0200, David Hildenbrand wrote:
> On 20.06.2017 20:17, Michael S. Tsirkin wrote:
> > On Tue, Jun 20, 2017 at 06:49:33PM +0200, David Hildenbrand wrote:
> >> On 20.06.2017 18:44, Rik van Riel wrote:
> >>> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
> >>>
> >>>> The hypervisor is going to throw away the contents of these pages,
> >>>> right?  As soon as the spinlock is released, someone can allocate a
> >>>> page, and put good data in it.  What keeps the hypervisor from
> >>>> throwing
> >>>> away good data?
> >>>
> >>> That looks like it may be the wrong API, then?
> >>>
> >>> We already have hooks called arch_free_page and
> >>> arch_alloc_page in the VM, which are called when
> >>> pages are freed, and allocated, respectively.
> >>>
> >>> Nitesh Lal (on the CC list) is working on a way
> >>> to efficiently batch recently freed pages for
> >>> free page hinting to the hypervisor.
> >>>
> >>> If that is done efficiently enough (eg. with
> >>> MADV_FREE on the hypervisor side for lazy freeing,
> >>> and lazy later re-use of the pages), do we still
> >>> need the harder to use batch interface from this
> >>> patch?
> >>>
> >> David's opinion incoming:
> >>
> >> No, I think proper free page hinting would be the optimum solution, if
> >> done right. This would avoid the batch interface and even render
> >> virtio-balloon in some sense useless.
> > 
> > I agree generally. But we have to balance that against the fact that
> > this has been discussed since at least 2011 and no one has built this solution
> > yet.
> 
> I totally agree, and I still think it will be hard to get decent
> performance for free page hinting (let's call it challenging). But I
> have heard of some interesting ideas. Surprise me.
> 
> Still, I would favor such an interface over an mm interface where people
> start asking the same question over and over again ("how can this even
> work"). Not only because it wasn't explained sufficiently, but
> also because this interface is so specific to one use case and one
> scenario (concurrent dirty tracking in the host during migration).
> 
> IMHO even simply writing all-zeros to all free pages before starting
> migration (or even when freeing a page) would be a cleaner interface
> than this (because it atomically works with the entity the host cares
> about for migration). But yes, performance is horrible, which is why I am
> not even suggesting it. Just saying that this mm interface is very, very
> special, and if we could find something better, I'd favor it.

As long as there's a single user, changing to a better interface
once it's found won't be hard at all :)

David Hildenbrand June 20, 2017, 7:01 p.m. UTC | #14
>> IMHO even simply writing all-zeros to all free pages before starting
>> migration (or even when freeing a page) would be a cleaner interface
>> than this (because it atomically works with the entity the host cares
>> about for migration). But yes, performance is horrible, which is why I am
>> not even suggesting it. Just saying that this mm interface is very, very
>> special, and if we could find something better, I'd favor it.
> 
> As long as there's a single user, changing to a better interface
> once it's found won't be hard at all :)
> 

Hehe, more like "we made this beautiful virtio-balloon extension" - oh
there is free page hinting (assuming that it does not reuse the batch
interface here). Guess how long it would take to at least show that free
page hinting can be done. If it takes another 6 years, I am totally on
your side ;)
Rik van Riel June 20, 2017, 7:51 p.m. UTC | #15
On Tue, 2017-06-20 at 21:26 +0300, Michael S. Tsirkin wrote:
> On Tue, Jun 20, 2017 at 01:29:00PM -0400, Rik van Riel wrote:
> > I agree with that.  Let me go into some more detail of
> > what Nitesh is implementing:
> > 
> > 1) In arch_free_page, the being-freed page is added
> >    to a per-cpu set of freed pages.
> > 2) Once that set is full, arch_free_pages goes into a
> >    slow path, which:
> >    2a) Iterates over the set of freed pages, and
> >    2b) Checks whether they are still free, and
> >    2c) Adds the still free pages to a list that is
> >        to be passed to the hypervisor, to be MADV_FREEd.
> >    2d) Makes that hypercall.
> > 
> > Meanwhile all arch_alloc_pages has to do is make sure it
> > does not allocate a page while it is currently being
> > MADV_FREEd on the hypervisor side.
> > 
> > The code Wei is working on looks like it could be 
> > suitable for steps (2c) and (2d) above. Nitesh already
> > has code for steps 1 through 2b.
> 
> So my question is this: Wei posted these numbers for balloon
> inflation times:
> inflating 7GB of an 8GB idle guest:
> 
> 	1) allocating pages (6.5%)
> 	2) sending PFNs to host (68.3%)
> 	3) address translation (6.1%)
> 	4) madvise (19%)
> 
> 	It takes about 4126ms for the inflating process to complete.
> 
> It seems that this is an excessive amount of time to stay
> under a lock. What are your estimates for Nitesh's work?

That depends on the batch size used for step
(2c), and is something that we should be able
to tune for decent performance.

What seems to matter is that things are batched.
There are many ways to achieve that.
Wang, Wei W June 21, 2017, 8:38 a.m. UTC | #16
On 06/21/2017 01:29 AM, Rik van Riel wrote:
> On Tue, 2017-06-20 at 18:49 +0200, David Hildenbrand wrote:
>> On 20.06.2017 18:44, Rik van Riel wrote:
>>> Nitesh Lal (on the CC list) is working on a way
>>> to efficiently batch recently freed pages for
>>> free page hinting to the hypervisor.
>>>
>>> If that is done efficiently enough (eg. with
>>> MADV_FREE on the hypervisor side for lazy freeing,
>>> and lazy later re-use of the pages), do we still
>>> need the harder to use batch interface from this
>>> patch?
>>>
>> David's opinion incoming:
>>
>> No, I think proper free page hinting would be the optimum solution,
>> if
>> done right. This would avoid the batch interface and even render
>> virtio-balloon in some sense useless.
> I agree with that.  Let me go into some more detail of
> what Nitesh is implementing:
>
> 1) In arch_free_page, the being-freed page is added
>     to a per-cpu set of freed pages.

I have some questions here:

1. Are the pages managed one by one on the per-CPU set?
For example, when there are 2 adjacent pages, are they still
put as two nodes on the per-CPU list? Or will the buddy algorithm
be re-implemented on the per-CPU list as well?

2. Looks like this will be added to the common free function.
Normally, people may not need the free page hint; do they
need to carry the added burden?


> 2) Once that set is full, arch_free_pages goes into a
>     slow path, which:
>     2a) Iterates over the set of freed pages, and
>     2b) Checks whether they are still free, and

The pages that have been double-checked as "free" here
and added to the list for the hypervisor can still be
used immediately afterwards.


>     2c) Adds the still free pages to a list that is
>         to be passed to the hypervisor, to be MADV_FREEd.
>     2d) Makes that hypercall.
>
> Meanwhile all arch_alloc_pages has to do is make sure it
> does not allocate a page while it is currently being
> MADV_FREEd on the hypervisor side.

Is this proposed to replace the balloon driver?

>
> The code Wei is working on looks like it could be
> suitable for steps (2c) and (2d) above. Nitesh already
> has code for steps 1 through 2b.
>

May I know the advantages of the added steps? Thanks.

Best,
Wei
Michael S. Tsirkin June 21, 2017, 12:41 p.m. UTC | #17
On Tue, Jun 20, 2017 at 03:51:00PM -0400, Rik van Riel wrote:
> On Tue, 2017-06-20 at 21:26 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jun 20, 2017 at 01:29:00PM -0400, Rik van Riel wrote:
> > > I agree with that.  Let me go into some more detail of
> > > what Nitesh is implementing:
> > > 
> > > 1) In arch_free_page, the being-freed page is added
> > >    to a per-cpu set of freed pages.
> > > 2) Once that set is full, arch_free_pages goes into a
> > >    slow path, which:
> > >    2a) Iterates over the set of freed pages, and
> > >    2b) Checks whether they are still free, and
> > >    2c) Adds the still free pages to a list that is
> > >        to be passed to the hypervisor, to be MADV_FREEd.
> > >    2d) Makes that hypercall.
> > > 
> > > Meanwhile all arch_alloc_pages has to do is make sure it
> > > does not allocate a page while it is currently being
> > > MADV_FREEd on the hypervisor side.
> > > 
> > > The code Wei is working on looks like it could be 
> > > suitable for steps (2c) and (2d) above. Nitesh already
> > > has code for steps 1 through 2b.
> > 
> > So my question is this: Wei posted these numbers for balloon
> > inflation times:
> > inflating 7GB of an 8GB idle guest:
> > 
> > 	1) allocating pages (6.5%)
> > 	2) sending PFNs to host (68.3%)
> > 	3) address translation (6.1%)
> > 	4) madvise (19%)
> > 
> > 	It takes about 4126ms for the inflating process to complete.
> > 
> > It seems that this is an excessive amount of time to stay
> > under a lock. What are your estimates for Nitesh's work?
> 
> That depends on the batch size used for step
> (2c), and is something that we should be able
> to tune for decent performance.

I am not really sure how you intend to do this. Who will
drop and retake the lock? How do you make progress
instead of restarting from the beginning?
How do you combine multiple pages in a single s/g?

All these were issues that Wei's patches solved, granted in a very
limited manner (migration-specific), but OTOH without a lot of tuning.


> What seems to matter is that things are batched.
> There are many ways to achieve that.

True, this is what the patches are trying to achieve.  So far this
approach was the 1st more or less workable way to achieve that;
previous ones got us nowhere.


Christian Borntraeger June 21, 2017, 12:56 p.m. UTC | #18
On 06/20/2017 06:49 PM, David Hildenbrand wrote:
> On 20.06.2017 18:44, Rik van Riel wrote:
>> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
>>
>>> The hypervisor is going to throw away the contents of these pages,
>>> right?  As soon as the spinlock is released, someone can allocate a
>>> page, and put good data in it.  What keeps the hypervisor from
>>> throwing
>>> away good data?
>>
>> That looks like it may be the wrong API, then?
>>
>> We already have hooks called arch_free_page and
>> arch_alloc_page in the VM, which are called when
>> pages are freed, and allocated, respectively.
>>
>> Nitesh Lal (on the CC list) is working on a way
>> to efficiently batch recently freed pages for
>> free page hinting to the hypervisor.
>>
>> If that is done efficiently enough (eg. with
>> MADV_FREE on the hypervisor side for lazy freeing,
>> and lazy later re-use of the pages), do we still
>> need the harder to use batch interface from this
>> patch?
>>
> David's opinion incoming:
> 
> No, I think proper free page hinting would be the optimum solution, if
> done right. This would avoid the batch interface and even render
> virtio-balloon in some sense useless.
> 
Two reasons why I disagree:
- virtio-balloon is often used as memory hotplug (e.g. libvirt's current/max memory
uses virtio-balloon)
- free page hinting will not allow shrinking the page cache of guests (like a ballooner does)
David Hildenbrand June 21, 2017, 1:47 p.m. UTC | #19
On 21.06.2017 14:56, Christian Borntraeger wrote:
> On 06/20/2017 06:49 PM, David Hildenbrand wrote:
>> On 20.06.2017 18:44, Rik van Riel wrote:
>>> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
>>>
>>>> The hypervisor is going to throw away the contents of these pages,
>>>> right?  As soon as the spinlock is released, someone can allocate a
>>>> page, and put good data in it.  What keeps the hypervisor from
>>>> throwing
>>>> away good data?
>>>
>>> That looks like it may be the wrong API, then?
>>>
>>> We already have hooks called arch_free_page and
>>> arch_alloc_page in the VM, which are called when
>>> pages are freed, and allocated, respectively.
>>>
>>> Nitesh Lal (on the CC list) is working on a way
>>> to efficiently batch recently freed pages for
>>> free page hinting to the hypervisor.
>>>
>>> If that is done efficiently enough (eg. with
>>> MADV_FREE on the hypervisor side for lazy freeing,
>>> and lazy later re-use of the pages), do we still
>>> need the harder to use batch interface from this
>>> patch?
>>>
>> David's opinion incoming:
>>
>> No, I think proper free page hinting would be the optimum solution, if
>> done right. This would avoid the batch interface and even render
>> virtio-balloon in some sense useless.
>>

I said "some sense" for a reason. Mainly because other techniques are
being worked on that are meant to fill the holes.

> Two reasons why I disagree:
> - virtio-balloon is often used as memory hotplug (e.g. libvirt's current/max memory
> uses virtio-balloon)

I know. While one can argue whether this is real unplug, as there are
basically no guarantees (see the virtio-mem RFC), it is used because
there is simply no alternative. Still, for now some people use it for that.

> - free page hinting will not allow shrinking the page cache of guests (like a ballooner does)

There are currently some projects ongoing that try to avoid the page
cache in the guest completely.

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5d22e69..82361a6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1841,6 +1841,11 @@  extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
+extern int report_unused_page_block(struct zone *zone, unsigned int order,
+				    unsigned int migratetype,
+				    struct page **page);
+#endif
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c25de4..0aefe02 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4615,6 +4615,97 @@  void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
+#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
+
+/*
+ * Heuristically get a page block in the system that is unused.
+ * It is possible that pages from the page block are used immediately after
+ * report_unused_page_block() returns. It is the caller's responsibility
+ * to either detect or prevent the use of such pages.
+ *
+ * The free list to check: zone->free_area[order].free_list[migratetype].
+ *
+ * If the caller supplied page block (i.e. **page) is on the free list, offer
+ * the next page block on the list to the caller. Otherwise, offer the first
+ * page block on the list.
+ *
+ * Return 0 when a page block is found on the caller specified free list.
+ */
+int report_unused_page_block(struct zone *zone, unsigned int order,
+			     unsigned int migratetype, struct page **page)
+{
+	struct zone *this_zone;
+	struct list_head *this_list;
+	int ret = 0;
+	unsigned long flags;
+
+	/* Sanity check */
+	if (zone == NULL || page == NULL || order >= MAX_ORDER ||
+	    migratetype >= MIGRATE_TYPES)
+		return -EINVAL;
+
+	/* Zone validity check */
+	for_each_populated_zone(this_zone) {
+		if (zone == this_zone)
+			break;
+	}
+
+	/* Got a non-existent zone from the caller? */
+	if (zone != this_zone)
+		return -EINVAL;
+
+	spin_lock_irqsave(&this_zone->lock, flags);
+
+	this_list = &zone->free_area[order].free_list[migratetype];
+	if (list_empty(this_list)) {
+		*page = NULL;
+		ret = 1;
+		goto out;
+	}
+
+	/* The caller is asking for the first free page block on the list */
+	if ((*page) == NULL) {
+		*page = list_first_entry(this_list, struct page, lru);
+		ret = 0;
+		goto out;
+	}
+
+	/*
+	 * The page block passed from the caller is not on this free list
+	 * anymore (e.g. a 1MB free page block has been split). In this case,
+	 * offer the first page block on the free list that the caller is
+	 * asking for.
+	 */
+	if (PageBuddy(*page) && order != page_order(*page)) {
+		*page = list_first_entry(this_list, struct page, lru);
+		ret = 0;
+		goto out;
+	}
+
+	/*
+	 * The page block passed from the caller has been the last page block
+	 * on the list.
+	 */
+	if ((*page)->lru.next == this_list) {
+		*page = NULL;
+		ret = 1;
+		goto out;
+	}
+
+	/*
+	 * Finally, fall into the regular case: the page block passed from the
+	 * caller is still on the free list. Offer the next one.
+	 */
+	*page = list_next_entry((*page), lru);
+	ret = 0;
+out:
+	spin_unlock_irqrestore(&this_zone->lock, flags);
+	return ret;
+}
+EXPORT_SYMBOL(report_unused_page_block);
+
+#endif
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;