Message ID | 1497004901-30593-5-git-send-email-wei.w.wang@intel.com (mailing list archive) |
---|---|
State | New, archived |
Please stop cc'ing me on things also sent to closed mailing lists
(virtio-dev@lists.oasis-open.org). I'm happy to review things on open
lists, but I'm not fond of the closed lists bouncing things at me.

On 06/09/2017 03:41 AM, Wei Wang wrote:
> Add a function to find a page block on the free list specified by the
> caller. Pages from the page block may be used immediately after the
> function returns. The caller is responsible for detecting or preventing
> the use of such pages.

This description doesn't tell me very much about what's going on here.
Neither does the comment.

"Pages from the page block may be used immediately after the
function returns".

Used by who? Does the "may" here mean that it is OK, or is it a warning
that the contents will be thrown away immediately?

The hypervisor is going to throw away the contents of these pages,
right? As soon as the spinlock is released, someone can allocate a
page, and put good data in it. What keeps the hypervisor from throwing
away good data?
On Mon, Jun 12, 2017 at 07:10:12AM -0700, Dave Hansen wrote:
> Please stop cc'ing me on things also sent to closed mailing lists
> (virtio-dev@lists.oasis-open.org). I'm happy to review things on open
> lists, but I'm not fond of the closed lists bouncing things at me.
>
> On 06/09/2017 03:41 AM, Wei Wang wrote:
> > Add a function to find a page block on the free list specified by the
> > caller. Pages from the page block may be used immediately after the
> > function returns. The caller is responsible for detecting or preventing
> > the use of such pages.
>
> This description doesn't tell me very much about what's going on here.
> Neither does the comment.
>
> "Pages from the page block may be used immediately after the
> function returns".
>
> Used by who? Does the "may" here mean that it is OK, or is it a warning
> that the contents will be thrown away immediately?

I agree here. Don't tell callers what they should do; say what the
function does. "offer" is also confusing. Here's a better comment --->

mm: support reporting free page blocks

This adds support for reporting blocks of pages on the free list
specified by the caller.

As pages can leave the free list during this call or immediately
afterwards, they are not guaranteed to be free after the function
returns. The only guarantee this makes is that the page was on the
free list at some point in time after the function has been invoked.

Therefore, it is not safe for the caller to use any pages on the
returned block or to discard data that is put there after the function
returns. However, it is safe for the caller to discard data that was
in one of these pages before the function was invoked.

---

And repeat the last part in a code comment:

 * Note: it is not safe for the caller to use any pages on the returned
 * block or to discard data that is put there after the function returns.
 * However, it is safe for the caller to discard data that was in one of
 * these pages before the function was invoked.

> The hypervisor is going to throw away the contents of these pages,
> right?

It should be careful and only throw away contents that were there
before report_unused_page_block was invoked. The hypervisor is
responsible for not corrupting guest memory. But that's not something
an mm patch should worry about.

> As soon as the spinlock is released, someone can allocate a
> page, and put good data in it. What keeps the hypervisor from throwing
> away good data?

The API should require this explicitly. Hopefully the above answers
this question.
On 06/12/2017 09:28 AM, Michael S. Tsirkin wrote:
>
>> The hypervisor is going to throw away the contents of these pages,
>> right?
> It should be careful and only throw away contents that were there
> before report_unused_page_block was invoked. The hypervisor is
> responsible for not corrupting guest memory. But that's not something
> an mm patch should worry about.

That makes sense. I'm struggling to imagine how the hypervisor makes
use of this information, though. Does it make the pages read-only
before this, and then it knows if there has not been a write *and* it
gets notified via this new mechanism that it can throw the page away?
On Mon, Jun 12, 2017 at 09:42:36AM -0700, Dave Hansen wrote:
> On 06/12/2017 09:28 AM, Michael S. Tsirkin wrote:
> >
> >> The hypervisor is going to throw away the contents of these pages,
> >> right?
> > It should be careful and only throw away contents that were there
> > before report_unused_page_block was invoked. The hypervisor is
> > responsible for not corrupting guest memory. But that's not something
> > an mm patch should worry about.
>
> That makes sense. I'm struggling to imagine how the hypervisor makes
> use of this information, though. Does it make the pages read-only
> before this, and then it knows if there has not been a write *and* it
> gets notified via this new mechanism that it can throw the page away?

Yes, and specifically, this is how it works for migration. Normally you
start by migrating all of memory, then send updates incrementally if
pages have been modified. This mechanism allows skipping some pages in
the 1st stage; if they get changed they will be migrated in the 2nd
stage.
On 06/12/2017 01:34 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 12, 2017 at 09:42:36AM -0700, Dave Hansen wrote:
>> On 06/12/2017 09:28 AM, Michael S. Tsirkin wrote:
>>>
>>>> The hypervisor is going to throw away the contents of these pages,
>>>> right?
>>> It should be careful and only throw away contents that were there
>>> before report_unused_page_block was invoked. The hypervisor is
>>> responsible for not corrupting guest memory. But that's not something
>>> an mm patch should worry about.
>>
>> That makes sense. I'm struggling to imagine how the hypervisor makes
>> use of this information, though. Does it make the pages read-only
>> before this, and then it knows if there has not been a write *and* it
>> gets notified via this new mechanism that it can throw the page away?
>
> Yes, and specifically, this is how it works for migration. Normally you
> start by migrating all of memory, then send updates incrementally if
> pages have been modified. This mechanism allows skipping some pages in
> the 1st stage; if they get changed they will be migrated in the 2nd
> stage.

OK, so the migration starts and marks everything read-only. All the
pages now have read-only valuable data, or read-only worthless data in
the case that the page is in the free lists. In order for a page to
become non-worthless, it has to have a write done to it, which the
hypervisor obviously knows about.

With this mechanism, the hypervisor knows it can discard pages which
have not had a write since they were known to have worthless contents.

Correct?

That also seems like pretty good information to include in the
changelog. Otherwise, folks are going to be left wondering what good
the mechanism is. It's pretty non-trivial to figure out. :)
On 06/13/2017 04:54 AM, Dave Hansen wrote:
> On 06/12/2017 01:34 PM, Michael S. Tsirkin wrote:
>> On Mon, Jun 12, 2017 at 09:42:36AM -0700, Dave Hansen wrote:
>>> On 06/12/2017 09:28 AM, Michael S. Tsirkin wrote:
>>>>> The hypervisor is going to throw away the contents of these pages,
>>>>> right?
>>>> It should be careful and only throw away contents that were there
>>>> before report_unused_page_block was invoked. The hypervisor is
>>>> responsible for not corrupting guest memory. But that's not something
>>>> an mm patch should worry about.
>>> That makes sense. I'm struggling to imagine how the hypervisor makes
>>> use of this information, though. Does it make the pages read-only
>>> before this, and then it knows if there has not been a write *and* it
>>> gets notified via this new mechanism that it can throw the page away?
>> Yes, and specifically, this is how it works for migration. Normally you
>> start by migrating all of memory, then send updates incrementally if
>> pages have been modified. This mechanism allows skipping some pages in
>> the 1st stage; if they get changed they will be migrated in the 2nd
>> stage.
> OK, so the migration starts and marks everything read-only. All the
> pages now have read-only valuable data, or read-only worthless data in
> the case that the page is in the free lists. In order for a page to
> become non-worthless, it has to have a write done to it, which the
> hypervisor obviously knows about.
>
> With this mechanism, the hypervisor knows it can discard pages which
> have not had a write since they were known to have worthless contents.
>
> Correct?

Right. By the way, read-only is one of the dirty page logging methods
that a hypervisor uses to capture the pages that are written by the VM.

> That also seems like pretty good information to include in the
> changelog. Otherwise, folks are going to be left wondering what good
> the mechanism is. It's pretty non-trivial to figure out. :)

If necessary, I think it's better to keep the introduction at a high
level:

Examples of using this API by a hypervisor:

To live migrate a VM from one physical machine to another, the
hypervisor usually transfers all the VM's memory content. An
optimization here is to skip the transfer of memory that is not in use
by the VM, because the content of the unused memory is worthless. This
API is then used to report the unused pages to the hypervisor. Pages
that have been reported to the hypervisor as unused may be used by the
VM after the report. The hypervisor has a good mechanism (i.e. dirty
page logging) to capture the change. Therefore, if the newly used pages
are written with data, the hypervisor will still transfer them to the
destination machine.

What do you guys think?

Best,
Wei
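To make the interplay described above concrete, here is a small
illustrative sketch of a first migration pass that combines dirty
logging with the free page report. It is a toy model in plain C, not
QEMU/KVM code; the two arrays and send_page() are hypothetical
stand-ins for the hypervisor's dirty log and the guest's report.

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	#define NR_PAGES 8

	/* Toy state: in a real hypervisor these come from dirty page
	 * logging and from the guest's free page report. */
	static bool dirty[NR_PAGES]       = { [2] = true };
	static bool hinted_free[NR_PAGES] = { [2] = true, [5] = true };

	static void send_page(uint64_t pfn)
	{
		printf("migrating pfn %llu\n", (unsigned long long)pfn);
	}

	/* First migration pass: skip pages hinted as free, unless they
	 * were written after the hint was taken -- the dirty log
	 * catches those, so they are still sent (here or in the
	 * incremental second pass). */
	int main(void)
	{
		for (uint64_t pfn = 0; pfn < NR_PAGES; pfn++) {
			if (hinted_free[pfn] && !dirty[pfn])
				continue;
			send_page(pfn);
		}
		return 0;
	}

In this toy run, pfn 5 is skipped (hinted free, never written), while
pfn 2 is sent despite the hint because a write dirtied it afterwards.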
On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
> The hypervisor is going to throw away the contents of these pages,
> right? As soon as the spinlock is released, someone can allocate a
> page, and put good data in it. What keeps the hypervisor from
> throwing away good data?

That looks like it may be the wrong API, then?

We already have hooks called arch_free_page and arch_alloc_page in the
VM, which are called when pages are freed, and allocated, respectively.

Nitesh Lal (on the CC list) is working on a way to efficiently batch
recently freed pages for free page hinting to the hypervisor.

If that is done efficiently enough (eg. with MADV_FREE on the
hypervisor side for lazy freeing, and lazy later re-use of the pages),
do we still need the harder to use batch interface from this patch?
On 20.06.2017 18:44, Rik van Riel wrote:
> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
>
>> The hypervisor is going to throw away the contents of these pages,
>> right? As soon as the spinlock is released, someone can allocate a
>> page, and put good data in it. What keeps the hypervisor from
>> throwing away good data?
>
> That looks like it may be the wrong API, then?
>
> We already have hooks called arch_free_page and arch_alloc_page in
> the VM, which are called when pages are freed, and allocated,
> respectively.
>
> Nitesh Lal (on the CC list) is working on a way to efficiently batch
> recently freed pages for free page hinting to the hypervisor.
>
> If that is done efficiently enough (eg. with MADV_FREE on the
> hypervisor side for lazy freeing, and lazy later re-use of the
> pages), do we still need the harder to use batch interface from this
> patch?

David's opinion incoming:

No, I think proper free page hinting would be the optimum solution, if
done right. This would avoid the batch interface and even turn
virtio-balloon in some sense useless.
On Tue, 2017-06-20 at 18:49 +0200, David Hildenbrand wrote:
> On 20.06.2017 18:44, Rik van Riel wrote:
>> Nitesh Lal (on the CC list) is working on a way to efficiently batch
>> recently freed pages for free page hinting to the hypervisor.
>>
>> If that is done efficiently enough (eg. with MADV_FREE on the
>> hypervisor side for lazy freeing, and lazy later re-use of the
>> pages), do we still need the harder to use batch interface from this
>> patch?
>
> David's opinion incoming:
>
> No, I think proper free page hinting would be the optimum solution,
> if done right. This would avoid the batch interface and even turn
> virtio-balloon in some sense useless.

I agree with that. Let me go into some more detail of what Nitesh is
implementing:

1) In arch_free_page, the being-freed page is added to a per-cpu set
   of freed pages.
2) Once that set is full, arch_free_pages goes into a slow path,
   which:
   2a) Iterates over the set of freed pages, and
   2b) Checks whether they are still free, and
   2c) Adds the still free pages to a list that is to be passed to the
       hypervisor, to be MADV_FREEd.
   2d) Makes that hypercall.

Meanwhile all arch_alloc_pages has to do is make sure it does not
allocate a page while it is currently being MADV_FREEd on the
hypervisor side.

The code Wei is working on looks like it could be suitable for steps
(2c) and (2d) above. Nitesh already has code for steps 1 through 2b.
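For readers who want to see the shape of the scheme Rik describes,
here is a compact, self-contained sketch in plain C. page_still_free()
and hypercall_madv_free() are illustrative stubs rather than kernel or
KVM APIs, and the single batch stands in for what would really be a
per-cpu structure filled from the arch_free_page hook.

	#include <stdbool.h>
	#include <stddef.h>
	#include <stdio.h>

	#define BATCH_SIZE 4 /* small, for demonstration */

	struct freed_batch {
		unsigned long pfns[BATCH_SIZE];
		size_t count;
	};

	static struct freed_batch batch;

	/* Step 2b stub: real code would re-check the page under the
	 * zone lock (e.g. PageBuddy). Here: pretend odd pfns were
	 * re-allocated in the meantime. */
	static bool page_still_free(unsigned long pfn)
	{
		return pfn % 2 == 0;
	}

	/* Step 2d stub: real code would ask the host to MADV_FREE. */
	static void hypercall_madv_free(const unsigned long *pfns, size_t n)
	{
		for (size_t i = 0; i < n; i++)
			printf("hinting pfn %lu as free\n", pfns[i]);
	}

	/* Step 1: called for every freed page (from arch_free_page). */
	static void hint_page_freed(unsigned long pfn)
	{
		batch.pfns[batch.count++] = pfn;
		if (batch.count < BATCH_SIZE)
			return;

		/* Step 2: slow path once the per-cpu set is full. */
		size_t still_free = 0;
		for (size_t i = 0; i < batch.count; i++)	/* 2a */
			if (page_still_free(batch.pfns[i]))	/* 2b */
				batch.pfns[still_free++] = batch.pfns[i]; /* 2c */
		hypercall_madv_free(batch.pfns, still_free);	/* 2d */
		batch.count = 0;
	}

	int main(void)
	{
		for (unsigned long pfn = 100; pfn < 108; pfn++)
			hint_page_freed(pfn);
		return 0;
	}

The re-check in step 2b is what keeps stale entries (pages freed and
then re-allocated before the batch filled) from being hinted to the
host.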
On Tue, Jun 20, 2017 at 06:49:33PM +0200, David Hildenbrand wrote:
> On 20.06.2017 18:44, Rik van Riel wrote:
> > On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
> >
> >> The hypervisor is going to throw away the contents of these pages,
> >> right? As soon as the spinlock is released, someone can allocate a
> >> page, and put good data in it. What keeps the hypervisor from
> >> throwing away good data?
> >
> > That looks like it may be the wrong API, then?
> >
> > We already have hooks called arch_free_page and arch_alloc_page in
> > the VM, which are called when pages are freed, and allocated,
> > respectively.
> >
> > Nitesh Lal (on the CC list) is working on a way to efficiently batch
> > recently freed pages for free page hinting to the hypervisor.
> >
> > If that is done efficiently enough (eg. with MADV_FREE on the
> > hypervisor side for lazy freeing, and lazy later re-use of the
> > pages), do we still need the harder to use batch interface from
> > this patch?
>
> David's opinion incoming:
>
> No, I think proper free page hinting would be the optimum solution,
> if done right. This would avoid the batch interface and even turn
> virtio-balloon in some sense useless.

I agree generally. But we have to balance that against the fact that
this was discussed since at least 2011 and no one has built this
solution yet.
On Tue, Jun 20, 2017 at 01:29:00PM -0400, Rik van Riel wrote:
> On Tue, 2017-06-20 at 18:49 +0200, David Hildenbrand wrote:
> > On 20.06.2017 18:44, Rik van Riel wrote:
> > > Nitesh Lal (on the CC list) is working on a way to efficiently
> > > batch recently freed pages for free page hinting to the
> > > hypervisor.
> > >
> > > If that is done efficiently enough (eg. with MADV_FREE on the
> > > hypervisor side for lazy freeing, and lazy later re-use of the
> > > pages), do we still need the harder to use batch interface from
> > > this patch?
> >
> > David's opinion incoming:
> >
> > No, I think proper free page hinting would be the optimum solution,
> > if done right. This would avoid the batch interface and even turn
> > virtio-balloon in some sense useless.
>
> I agree with that. Let me go into some more detail of what Nitesh is
> implementing:
>
> 1) In arch_free_page, the being-freed page is added to a per-cpu set
>    of freed pages.
> 2) Once that set is full, arch_free_pages goes into a slow path,
>    which:
>    2a) Iterates over the set of freed pages, and
>    2b) Checks whether they are still free, and
>    2c) Adds the still free pages to a list that is to be passed to
>        the hypervisor, to be MADV_FREEd.
>    2d) Makes that hypercall.
>
> Meanwhile all arch_alloc_pages has to do is make sure it does not
> allocate a page while it is currently being MADV_FREEd on the
> hypervisor side.
>
> The code Wei is working on looks like it could be suitable for steps
> (2c) and (2d) above. Nitesh already has code for steps 1 through 2b.

So my question is this: Wei posted these numbers for balloon inflation
times, inflating 7GB of an 8GB idle guest:

1) allocating pages (6.5%)
2) sending PFNs to host (68.3%)
3) address translation (6.1%)
4) madvise (19%)

It takes about 4126ms for the inflating process to complete.

It seems that this is an excessive amount of time to stay under a
lock. What are your estimates for Nitesh's work?
On 20.06.2017 20:17, Michael S. Tsirkin wrote:
> On Tue, Jun 20, 2017 at 06:49:33PM +0200, David Hildenbrand wrote:
>> On 20.06.2017 18:44, Rik van Riel wrote:
>>> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
>>>
>>>> The hypervisor is going to throw away the contents of these pages,
>>>> right? As soon as the spinlock is released, someone can allocate a
>>>> page, and put good data in it. What keeps the hypervisor from
>>>> throwing away good data?
>>>
>>> That looks like it may be the wrong API, then?
>>>
>>> We already have hooks called arch_free_page and arch_alloc_page in
>>> the VM, which are called when pages are freed, and allocated,
>>> respectively.
>>>
>>> Nitesh Lal (on the CC list) is working on a way to efficiently batch
>>> recently freed pages for free page hinting to the hypervisor.
>>>
>>> If that is done efficiently enough (eg. with MADV_FREE on the
>>> hypervisor side for lazy freeing, and lazy later re-use of the
>>> pages), do we still need the harder to use batch interface from this
>>> patch?
>>>
>> David's opinion incoming:
>>
>> No, I think proper free page hinting would be the optimum solution,
>> if done right. This would avoid the batch interface and even turn
>> virtio-balloon in some sense useless.
>
> I agree generally. But we have to balance that against the fact that
> this was discussed since at least 2011 and no one has built this
> solution yet.

I totally agree, and I still think it will be hard to get decent
performance for free page hinting (let's call it challenging). But I
have heard of some interesting ideas. Surprise me.

Still, I would favor such an interface over an mm interface where
people keep asking the same question over and over again ("how can this
even work?"). Not only because it wasn't explained sufficiently, but
also because this interface is so special to one use case and one
scenario (concurrent dirty tracking in the host during migration).

IMHO even simply writing all-zeros to all free pages before starting
migration (or even when freeing a page) would be a cleaner interface
than this (because it atomically works with the entity the host cares
about for migration). But yes, the performance is horrible; that's why
I am not even suggesting it. I am just saying that this mm interface is
very, very special, and if we could find something better, I'd favor
it.
On Tue, Jun 20, 2017 at 08:54:29PM +0200, David Hildenbrand wrote:
> On 20.06.2017 20:17, Michael S. Tsirkin wrote:
> > On Tue, Jun 20, 2017 at 06:49:33PM +0200, David Hildenbrand wrote:
> >> On 20.06.2017 18:44, Rik van Riel wrote:
> >>> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
> >>>
> >>>> The hypervisor is going to throw away the contents of these pages,
> >>>> right? As soon as the spinlock is released, someone can allocate a
> >>>> page, and put good data in it. What keeps the hypervisor from
> >>>> throwing away good data?
> >>>
> >>> That looks like it may be the wrong API, then?
> >>>
> >>> We already have hooks called arch_free_page and arch_alloc_page in
> >>> the VM, which are called when pages are freed, and allocated,
> >>> respectively.
> >>>
> >>> Nitesh Lal (on the CC list) is working on a way to efficiently
> >>> batch recently freed pages for free page hinting to the hypervisor.
> >>>
> >>> If that is done efficiently enough (eg. with MADV_FREE on the
> >>> hypervisor side for lazy freeing, and lazy later re-use of the
> >>> pages), do we still need the harder to use batch interface from
> >>> this patch?
> >>>
> >> David's opinion incoming:
> >>
> >> No, I think proper free page hinting would be the optimum solution,
> >> if done right. This would avoid the batch interface and even turn
> >> virtio-balloon in some sense useless.
> >
> > I agree generally. But we have to balance that against the fact that
> > this was discussed since at least 2011 and no one has built this
> > solution yet.
>
> I totally agree, and I still think it will be hard to get decent
> performance for free page hinting (let's call it challenging). But I
> have heard of some interesting ideas. Surprise me.
>
> Still, I would favor such an interface over an mm interface where
> people keep asking the same question over and over again ("how can
> this even work?"). Not only because it wasn't explained sufficiently,
> but also because this interface is so special to one use case and one
> scenario (concurrent dirty tracking in the host during migration).
>
> IMHO even simply writing all-zeros to all free pages before starting
> migration (or even when freeing a page) would be a cleaner interface
> than this (because it atomically works with the entity the host cares
> about for migration). But yes, the performance is horrible; that's
> why I am not even suggesting it. I am just saying that this mm
> interface is very, very special, and if we could find something
> better, I'd favor it.

As long as there's a single user, changing to a better interface once
it's found won't be hard at all :)
>> IMHO even simply writing all-zeros to all free pages before starting
>> migration (or even when freeing a page) would be a cleaner interface
>> than this (because it atomically works with the entity the host cares
>> about for migration). But yes, the performance is horrible; that's
>> why I am not even suggesting it. I am just saying that this mm
>> interface is very, very special, and if we could find something
>> better, I'd favor it.
>
> As long as there's a single user, changing to a better interface once
> it's found won't be hard at all :)

Hehe, more like "we made this beautiful virtio-balloon extension" - oh,
there is free page hinting (assuming that it does not reuse the batch
interface here). Guess how long it would take to at least show that
free page hinting can be done. If it takes another 6 years, I am
totally on your side ;)
On Tue, 2017-06-20 at 21:26 +0300, Michael S. Tsirkin wrote:
> On Tue, Jun 20, 2017 at 01:29:00PM -0400, Rik van Riel wrote:
> > I agree with that. Let me go into some more detail of what Nitesh
> > is implementing:
> >
> > 1) In arch_free_page, the being-freed page is added to a per-cpu
> >    set of freed pages.
> > 2) Once that set is full, arch_free_pages goes into a slow path,
> >    which:
> >    2a) Iterates over the set of freed pages, and
> >    2b) Checks whether they are still free, and
> >    2c) Adds the still free pages to a list that is to be passed to
> >        the hypervisor, to be MADV_FREEd.
> >    2d) Makes that hypercall.
> >
> > Meanwhile all arch_alloc_pages has to do is make sure it does not
> > allocate a page while it is currently being MADV_FREEd on the
> > hypervisor side.
> >
> > The code Wei is working on looks like it could be suitable for
> > steps (2c) and (2d) above. Nitesh already has code for steps 1
> > through 2b.
>
> So my question is this: Wei posted these numbers for balloon
> inflation times, inflating 7GB of an 8GB idle guest:
>
> 1) allocating pages (6.5%)
> 2) sending PFNs to host (68.3%)
> 3) address translation (6.1%)
> 4) madvise (19%)
>
> It takes about 4126ms for the inflating process to complete.
>
> It seems that this is an excessive amount of time to stay under a
> lock. What are your estimates for Nitesh's work?

That depends on the batch size used for step (2c), and is something
that we should be able to tune for decent performance.

What seems to matter is that things are batched. There are many ways
to achieve that.
On 06/21/2017 01:29 AM, Rik van Riel wrote:
> On Tue, 2017-06-20 at 18:49 +0200, David Hildenbrand wrote:
>> On 20.06.2017 18:44, Rik van Riel wrote:
>>> Nitesh Lal (on the CC list) is working on a way to efficiently batch
>>> recently freed pages for free page hinting to the hypervisor.
>>>
>>> If that is done efficiently enough (eg. with MADV_FREE on the
>>> hypervisor side for lazy freeing, and lazy later re-use of the
>>> pages), do we still need the harder to use batch interface from this
>>> patch?
>>>
>> David's opinion incoming:
>>
>> No, I think proper free page hinting would be the optimum solution,
>> if done right. This would avoid the batch interface and even turn
>> virtio-balloon in some sense useless.
>
> I agree with that. Let me go into some more detail of what Nitesh is
> implementing:
>
> 1) In arch_free_page, the being-freed page is added to a per-cpu set
>    of freed pages.

I have some questions here:

1. Are the pages managed one by one on the per-CPU set? For example,
when there are 2 adjacent pages, are they still put as two nodes on the
per-CPU list? Or will the buddy algorithm be re-implemented on the
per-CPU list as well?

2. It looks like this will be added to the common free function.
Normally, people may not need the free page hint; do they need to carry
the added burden?

> 2) Once that set is full, arch_free_pages goes into a slow path,
>    which:
>    2a) Iterates over the set of freed pages, and
>    2b) Checks whether they are still free, and

The pages that have been double checked as "free" pages here and added
to the list for the hypervisor can also be immediately used.

>    2c) Adds the still free pages to a list that is to be passed to
>        the hypervisor, to be MADV_FREEd.
>    2d) Makes that hypercall.
>
> Meanwhile all arch_alloc_pages has to do is make sure it does not
> allocate a page while it is currently being MADV_FREEd on the
> hypervisor side.

Is this proposed to replace the balloon driver?

> The code Wei is working on looks like it could be suitable for steps
> (2c) and (2d) above. Nitesh already has code for steps 1 through 2b.

May I know the advantages of the added steps? Thanks.

Best,
Wei
On Tue, Jun 20, 2017 at 03:51:00PM -0400, Rik van Riel wrote:
> On Tue, 2017-06-20 at 21:26 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jun 20, 2017 at 01:29:00PM -0400, Rik van Riel wrote:
> > > I agree with that. Let me go into some more detail of what Nitesh
> > > is implementing:
> > >
> > > 1) In arch_free_page, the being-freed page is added to a per-cpu
> > >    set of freed pages.
> > > 2) Once that set is full, arch_free_pages goes into a slow path,
> > >    which:
> > >    2a) Iterates over the set of freed pages, and
> > >    2b) Checks whether they are still free, and
> > >    2c) Adds the still free pages to a list that is to be passed
> > >        to the hypervisor, to be MADV_FREEd.
> > >    2d) Makes that hypercall.
> > >
> > > Meanwhile all arch_alloc_pages has to do is make sure it does not
> > > allocate a page while it is currently being MADV_FREEd on the
> > > hypervisor side.
> > >
> > > The code Wei is working on looks like it could be suitable for
> > > steps (2c) and (2d) above. Nitesh already has code for steps 1
> > > through 2b.
> >
> > So my question is this: Wei posted these numbers for balloon
> > inflation times, inflating 7GB of an 8GB idle guest:
> >
> > 1) allocating pages (6.5%)
> > 2) sending PFNs to host (68.3%)
> > 3) address translation (6.1%)
> > 4) madvise (19%)
> >
> > It takes about 4126ms for the inflating process to complete.
> >
> > It seems that this is an excessive amount of time to stay under a
> > lock. What are your estimates for Nitesh's work?
>
> That depends on the batch size used for step (2c), and is something
> that we should be able to tune for decent performance.

I am not really sure how you intend to do this. Who will drop and
retake the lock? How do you make progress instead of restarting from
the beginning? How do you combine multiple pages in a single s/g? All
these were issues that Wei's patches solved, granted in a very limited
manner (migration-specific) but OTOH without a lot of tuning.

> What seems to matter is that things are batched. There are many ways
> to achieve that.

True, this is what the patches are trying to achieve. So far this
approach was the first more or less workable way to achieve that;
previous ones got us nowhere.
On 06/20/2017 06:49 PM, David Hildenbrand wrote:
> On 20.06.2017 18:44, Rik van Riel wrote:
>> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
>>
>>> The hypervisor is going to throw away the contents of these pages,
>>> right? As soon as the spinlock is released, someone can allocate a
>>> page, and put good data in it. What keeps the hypervisor from
>>> throwing away good data?
>>
>> That looks like it may be the wrong API, then?
>>
>> We already have hooks called arch_free_page and arch_alloc_page in
>> the VM, which are called when pages are freed, and allocated,
>> respectively.
>>
>> Nitesh Lal (on the CC list) is working on a way to efficiently batch
>> recently freed pages for free page hinting to the hypervisor.
>>
>> If that is done efficiently enough (eg. with MADV_FREE on the
>> hypervisor side for lazy freeing, and lazy later re-use of the
>> pages), do we still need the harder to use batch interface from this
>> patch?
>>
> David's opinion incoming:
>
> No, I think proper free page hinting would be the optimum solution,
> if done right. This would avoid the batch interface and even turn
> virtio-balloon in some sense useless.

Two reasons why I disagree:
- virtio-balloon is often used as memory hotplug (e.g. libvirt's
  current/max memory uses virtio-balloon).
- free page hinting will not allow shrinking the page cache of guests
  (like a ballooner does).
On 21.06.2017 14:56, Christian Borntraeger wrote:
> On 06/20/2017 06:49 PM, David Hildenbrand wrote:
>> On 20.06.2017 18:44, Rik van Riel wrote:
>>> On Mon, 2017-06-12 at 07:10 -0700, Dave Hansen wrote:
>>>
>>>> The hypervisor is going to throw away the contents of these pages,
>>>> right? As soon as the spinlock is released, someone can allocate a
>>>> page, and put good data in it. What keeps the hypervisor from
>>>> throwing away good data?
>>>
>>> That looks like it may be the wrong API, then?
>>>
>>> We already have hooks called arch_free_page and arch_alloc_page in
>>> the VM, which are called when pages are freed, and allocated,
>>> respectively.
>>>
>>> Nitesh Lal (on the CC list) is working on a way to efficiently batch
>>> recently freed pages for free page hinting to the hypervisor.
>>>
>>> If that is done efficiently enough (eg. with MADV_FREE on the
>>> hypervisor side for lazy freeing, and lazy later re-use of the
>>> pages), do we still need the harder to use batch interface from this
>>> patch?
>>>
>> David's opinion incoming:
>>
>> No, I think proper free page hinting would be the optimum solution,
>> if done right. This would avoid the batch interface and even turn
>> virtio-balloon in some sense useless.

I said "some sense" for a reason, mainly because other techniques are
being worked on that are meant to fill the holes.

> Two reasons why I disagree:
> - virtio-balloon is often used as memory hotplug (e.g. libvirt's
>   current/max memory uses virtio-balloon).

I know; while one can argue whether this is real unplug, as there are
basically no guarantees (see the virtio-mem RFC), it is used by people
because there is simply no alternative. Still, for now some people use
it for that.

> - free page hinting will not allow shrinking the page cache of guests
>   (like a ballooner does).

There are currently some projects ongoing that try to avoid the page
cache in the guest completely.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5d22e69..82361a6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1841,6 +1841,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
+extern int report_unused_page_block(struct zone *zone, unsigned int order,
+				    unsigned int migratetype,
+				    struct page **page);
+#endif
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c25de4..0aefe02 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4615,6 +4615,97 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
+#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
+
+/*
+ * Heuristically get a page block in the system that is unused.
+ * It is possible that pages from the page block are used immediately after
+ * report_unused_page_block() returns. It is the caller's responsibility
+ * to either detect or prevent the use of such pages.
+ *
+ * The free list to check: zone->free_area[order].free_list[migratetype].
+ *
+ * If the caller supplied page block (i.e. **page) is on the free list, offer
+ * the next page block on the list to the caller. Otherwise, offer the first
+ * page block on the list.
+ *
+ * Return 0 when a page block is found on the caller specified free list.
+ */
+int report_unused_page_block(struct zone *zone, unsigned int order,
+			     unsigned int migratetype, struct page **page)
+{
+	struct zone *this_zone;
+	struct list_head *this_list;
+	int ret = 0;
+	unsigned long flags;
+
+	/* Sanity check */
+	if (zone == NULL || page == NULL || order >= MAX_ORDER ||
+	    migratetype >= MIGRATE_TYPES)
+		return -EINVAL;
+
+	/* Zone validity check */
+	for_each_populated_zone(this_zone) {
+		if (zone == this_zone)
+			break;
+	}
+
+	/* Got a non-existent zone from the caller? */
+	if (zone != this_zone)
+		return -EINVAL;
+
+	spin_lock_irqsave(&this_zone->lock, flags);
+
+	this_list = &zone->free_area[order].free_list[migratetype];
+	if (list_empty(this_list)) {
+		*page = NULL;
+		ret = 1;
+		goto out;
+	}
+
+	/* The caller is asking for the first free page block on the list */
+	if ((*page) == NULL) {
+		*page = list_first_entry(this_list, struct page, lru);
+		ret = 0;
+		goto out;
+	}
+
+	/*
+	 * The page block passed from the caller is not on this free list
+	 * anymore (e.g. a 1MB free page block has been split). In this case,
+	 * offer the first page block on the free list that the caller is
+	 * asking for.
+	 */
+	if (PageBuddy(*page) && order != page_order(*page)) {
+		*page = list_first_entry(this_list, struct page, lru);
+		ret = 0;
+		goto out;
+	}
+
+	/*
+	 * The page block passed from the caller has been the last page block
+	 * on the list.
+	 */
+	if ((*page)->lru.next == this_list) {
+		*page = NULL;
+		ret = 1;
+		goto out;
+	}
+
+	/*
+	 * Finally, fall into the regular case: the page block passed from the
+	 * caller is still on the free list. Offer the next one.
+	 */
+	*page = list_next_entry((*page), lru);
+	ret = 0;
+out:
+	spin_unlock_irqrestore(&this_zone->lock, flags);
+	return ret;
+}
+EXPORT_SYMBOL(report_unused_page_block);
+
+#endif
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
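For reference, this is roughly how a caller could walk one free list
with the API above; report_to_hypervisor() is a hypothetical
placeholder for whatever the consumer (e.g. the virtio-balloon driver)
does with each reported block.

	/*
	 * Sketch of a caller iterating one free list with
	 * report_unused_page_block(); error handling is reduced to
	 * what the return values allow.
	 */
	static void report_free_list(struct zone *zone, unsigned int order,
				     unsigned int migratetype)
	{
		struct page *page = NULL;	/* NULL requests the first block */

		while (report_unused_page_block(zone, order, migratetype,
						&page) == 0) {
			/*
			 * The 2^order block at page_to_pfn(page) was on the
			 * free list at some point after the call. Only data
			 * that was there before the call may be discarded.
			 */
			report_to_hypervisor(page_to_pfn(page), order);
		}
		/* Non-zero return: list empty or exhausted (1), bad args (-EINVAL). */
	}

The resumable *page cursor is what lets a caller drop the zone lock
between calls; the cost is that the walk restarts from the list head if
the block it was holding has been split or merged in the meantime.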