[RFC,v8,0/7] KVM: Guest Free Page Hinting

Message ID 20190204201854.2328-1-nitesh@redhat.com (mailing list archive)

Message

Nitesh Narayan Lal Feb. 4, 2019, 8:18 p.m. UTC
The following patch-set proposes an efficient mechanism for handing over freed memory between the guest and the host. It enables guests with no page cache to rapidly free and reclaim memory to and from the host, respectively.

Benefit:
With this patch-series, in our test case, executed on a single system and single NUMA node with 15GB memory, we were able to successfully launch at least 5 guests
when page hinting was enabled and 3 without it. (A detailed explanation of the test procedure is provided at the bottom.)

Changelog in V8:
In this patch-series, the earlier approach [1], which was used to capture and scan the pages freed by the guest, has been changed. The new approach is briefly described below:

The patch-set still leverages the existing arch_free_page() to add this functionality. It maintains a per-CPU array which is used to store the pages freed by the guest. The maximum number of entries the array can hold is defined by MAX_FGPT_ENTRIES (1000). When the array is completely filled, it is scanned and only the pages which are still available in the buddy are retained. This process continues until the array is filled with pages which are part of the buddy free list, after which a per-CPU kernel thread is woken up.
This per-CPU kernel thread rescans the per-CPU array for any re-allocation; if a page has not been reallocated and is still present in the buddy, the thread attempts to isolate it from the buddy. If it is successfully isolated, the page is added to another per-CPU array. Once the entire scanning process is complete, all the isolated pages are reported to the host through the existing virtio-balloon driver.
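
For illustration, the capture path described above looks roughly like the sketch below. This is a simplified sketch only; the structure and helper names (free_pages_obj, record_freed_page) are placeholders and do not match the actual patch code.

#include <linux/mm.h>
#include <linux/percpu.h>

#define MAX_FGPT_ENTRIES	1000

/* One entry per page captured from the arch_free_page() hook. */
struct hint_entry {
	unsigned long pfn;
	unsigned int order;
};

struct free_pages_obj {
	struct hint_entry entries[MAX_FGPT_ENTRIES];
	unsigned int index;
};

static DEFINE_PER_CPU(struct free_pages_obj, cpu_fgpt);

/* Record a freed page in the per-CPU array. */
static void record_freed_page(struct page *page, unsigned int order)
{
	struct free_pages_obj *obj = this_cpu_ptr(&cpu_fgpt);

	if (obj->index < MAX_FGPT_ENTRIES) {
		obj->entries[obj->index].pfn = page_to_pfn(page);
		obj->entries[obj->index].order = order;
		obj->index++;
	}
	/*
	 * When the array is full, the caller re-scans it, keeps only the
	 * entries still in the buddy, and wakes the per-CPU hinting
	 * thread (both omitted here).
	 */
}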

Known Issues:
	* Fixed array size: The problem with having a fixed/hardcoded array size arises when the size of the guest varies. For example, when the guest size increases and it starts making large allocations, the fixed size limits this solution's ability to capture all the freed pages. This results in less guest free memory getting reported to the host.

Known code re-work:
	* Plan to re-use Wei's work, which communicates the poison value to the host.
	* The nomenclature used in virtio-balloon needs to be changed so that the code can easily be distinguished from Wei's Free Page Hint code.
	* Sorting based on zonenum, to avoid repetitive zone locks for the same zone.

Other required work:
	* Run other benchmarks to evaluate the performance/impact of this approach.

Test case:
Setup:
Memory-15837 MB
Guest Memory Size-5 GB
Swap-Disabled
Test Program-A simple program which allocates 4GB memory via malloc, touches it via memset, and exits (a rough sketch of such a program follows the procedure below).
Use case-Number of guests that can be launched to completion, including the successful execution of the test program.
Procedure: 
The first guest is launched and, once its console is up, the test allocation program is executed with a 4 GB memory request (because of this, the guest occupies almost 4-5 GB of memory in the host on a system without page hinting). Once this program exits, another guest is launched in the host and the same process is followed. We continue launching guests until a guest gets killed due to a low-memory condition in the host.
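
For reference, a minimal equivalent of the test allocation program described in the setup could look like this (the exact program is not part of this series; this is only a plausible reconstruction):

#include <stdlib.h>
#include <string.h>

int main(void)
{
	size_t size = 4UL << 30;	/* 4 GB */
	char *buf = malloc(size);

	if (!buf)
		return 1;
	/* Touch every byte so the host actually backs the allocation. */
	memset(buf, 1, size);
	free(buf);
	return 0;
}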

Result:
Without Hinting-3 Guests
With Hinting-5 to 7 Guests (based on the amount of memory freed/captured).

[1] https://www.spinics.net/lists/kvm/msg170113.html

Comments

Wang, Wei W Feb. 12, 2019, 9:03 a.m. UTC | #1
On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
> The following patch-set proposes an efficient mechanism for handing freed
> memory between the guest and the host. It enables the guests with no page
> cache to rapidly free and reclaims memory to and from the host respectively.
> 
> Benefit:
> With this patch-series, in our test-case, executed on a single system and
> single NUMA node with 15GB memory, we were able to successfully launch
> atleast 5 guests when page hinting was enabled and 3 without it. (Detailed
> explanation of the test procedure is provided at the bottom).
> 
> Changelog in V8:
> In this patch-series, the earlier approach [1] which was used to capture and
> scan the pages freed by the guest has been changed. The new approach is
> briefly described below:
> 
> The patch-set still leverages the existing arch_free_page() to add this
> functionality. It maintains a per CPU array which is used to store the pages
> freed by the guest. The maximum number of entries which it can hold is
> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it
> is scanned and only the pages which are available in the buddy are stored.
> This process continues until the array is filled with pages which are part of
> the buddy free list. After which it wakes up a kernel per-cpu-thread.
> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation
> and if the page is not reallocated and present in the buddy, the kernel
> thread attempts to isolate it from the buddy. If it is successfully isolated, the
> page is added to another per-cpu array. Once the entire scanning process is
> complete, all the isolated pages are reported to the host through an existing
> virtio-balloon driver.

 Hi Nitesh,

Have you guys thought about something like below, which would be simpler:

- use bitmaps to record free pages, e.g. xbitmap: https://lkml.org/lkml/2018/1/9/304.
  The bitmap can be indexed by the guest pfn, and it's globally accessed by all the CPUs;
- arch_free_page(): set the bits of the freed pages in the bitmap
 (no per-CPU array with hardcoded fixed length and no per-cpu scanning thread)
- arch_alloc_page(): clear the related bits from the bitmap
- expose 2 APIs for the callers:
  -- unsigned long get_free_page_hints(unsigned long pfn_start, unsigned int nr); 
     This API searches for the next free page chunk (@nr of pages), starting from @pfn_start.
     Bits of those free pages will be cleared after this function returns.
  -- void put_free_page_hints(unsigned long pfn_start, unsigned int nr);
     This API sets the @nr continuous bits starting from pfn_start.
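
To make the proposed interface concrete, a rough sketch of the two helpers over a plain, statically sized bitmap is shown below. This is illustrative only: the proposal above suggests an xbitmap, and the locking/atomicity that the rest of this thread debates is deliberately omitted here.

#include <linux/bitmap.h>
#include <linux/bitops.h>

#define HINT_BITMAP_BITS	(1UL << 22)	/* enough for 16 GB of 4K pages */

static unsigned long free_page_bitmap[BITS_TO_LONGS(HINT_BITMAP_BITS)];

/*
 * Find the next chunk of @nr contiguous free pages at or after @pfn_start,
 * clear their bits (the guest then treats them as "not ready") and return
 * the starting pfn, or -1UL if no such chunk exists.
 */
unsigned long get_free_page_hints(unsigned long pfn_start, unsigned int nr)
{
	unsigned long pfn = pfn_start;

	while (pfn + nr <= HINT_BITMAP_BITS) {
		unsigned long start = find_next_bit(free_page_bitmap,
						    HINT_BITMAP_BITS, pfn);
		unsigned long hole;

		if (start + nr > HINT_BITMAP_BITS)
			break;

		/* Look for a cleared bit inside [start, start + nr). */
		hole = find_next_zero_bit(free_page_bitmap, start + nr, start);
		if (hole >= start + nr) {
			bitmap_clear(free_page_bitmap, start, nr);
			return start;
		}
		pfn = hole + 1;
	}
	return -1UL;
}

/* Mark @nr pages starting at @pfn_start as ready to be allocated again. */
void put_free_page_hints(unsigned long pfn_start, unsigned int nr)
{
	bitmap_set(free_page_bitmap, pfn_start, nr);
}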

Usage example with balloon:
1) host requests to start ballooning;
2) balloon driver calls get_free_page_hints and reports the hints to host via report_vq;
3) host calls madvise(pfn_start, DONTNEED) for each reported chunk of free pages and puts pfn_start back on ack_vq;
4) balloon driver receives pfn_start and calls put_free_page_hints(pfn_start) to have the related bits in the bitmap set, indicating that those free pages are ready to be allocated.
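
A sketch of the balloon-driver side of steps 2) and 4), assuming the helpers above; report_free_chunk() and on_host_ack() are invented stand-ins for the report_vq/ack_vq plumbing:

static void balloon_report_free_pages(void)
{
	const unsigned int chunk = 256;	/* report in 1 MB chunks of 4K pages */
	unsigned long next = 0;

	for (;;) {
		unsigned long pfn = get_free_page_hints(next, chunk);

		if (pfn == -1UL)
			break;			/* no more free chunks */
		report_free_chunk(pfn, chunk);	/* queue on report_vq */
		next = pfn + chunk;
	}
}

/* ack_vq callback: the host has finished madvise() on this chunk. */
static void on_host_ack(unsigned long pfn_start, unsigned int nr)
{
	put_free_page_hints(pfn_start, nr);	/* pages usable again */
}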

In step 2) above, get_free_page_hints clears the bits, which indicates that those pages are not ready to be used by the guest yet. Why?
This is because 3) will unmap the underlying physical pages from EPT. Normally, when guest re-visits those pages, EPT violations and QEMU page faults will get a new host page to set up the related EPT entry. If guest uses that page before the page gets unmapped (i.e. right before step 3), no EPT violation happens and the guest will use the same physical page that will be unmapped and given to other host threads. So we need to make sure that the guest free page is usable only after step 3 finishes.

Back to arch_alloc_page(): it needs to check if the allocated pages have "1" set in the bitmap; if that's true, just clear the bits. Otherwise, it means step 2) above has happened and step 4) hasn't been reached. In this case, we can either have arch_alloc_page() busy-wait a bit till 4) is done for that page,
or, better, have a balloon callback which prioritizes 3) and 4) to make this page usable by the guest.
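
Expressed as code, the check described above would look roughly like the following, building on the bitmap sketched earlier; wait_or_notify_host() is a placeholder for whichever of the two options (busy-waiting or the "zap now" balloon callback) is chosen:

/* Sketch of the arch_alloc_page() hook behaviour described above. */
void arch_alloc_page(struct page *page, int order)
{
	unsigned long pfn = page_to_pfn(page);
	unsigned long i, nr = 1UL << order;

	for (i = pfn; i < pfn + nr; i++) {
		if (test_bit(i, free_page_bitmap)) {
			/* Ready free page: consume the stale hint. */
			clear_bit(i, free_page_bitmap);
		} else {
			/*
			 * The hint was already pulled by
			 * get_free_page_hints(); the page may be in
			 * flight to the host (step 2 done, step 4 not).
			 */
			wait_or_notify_host(i);	/* placeholder */
		}
	}
}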

Using bitmaps to record free page hints doesn't require taking the free pages off the buddy list and returning them later, which needs to go through the long allocation/free code path.

Best,
Wei
David Hildenbrand Feb. 12, 2019, 9:24 a.m. UTC | #2
On 12.02.19 10:03, Wang, Wei W wrote:
> On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
>> The following patch-set proposes an efficient mechanism for handing freed
>> memory between the guest and the host. It enables the guests with no page
>> cache to rapidly free and reclaims memory to and from the host respectively.
>>
>> Benefit:
>> With this patch-series, in our test-case, executed on a single system and
>> single NUMA node with 15GB memory, we were able to successfully launch
>> atleast 5 guests when page hinting was enabled and 3 without it. (Detailed
>> explanation of the test procedure is provided at the bottom).
>>
>> Changelog in V8:
>> In this patch-series, the earlier approach [1] which was used to capture and
>> scan the pages freed by the guest has been changed. The new approach is
>> briefly described below:
>>
>> The patch-set still leverages the existing arch_free_page() to add this
>> functionality. It maintains a per CPU array which is used to store the pages
>> freed by the guest. The maximum number of entries which it can hold is
>> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it
>> is scanned and only the pages which are available in the buddy are stored.
>> This process continues until the array is filled with pages which are part of
>> the buddy free list. After which it wakes up a kernel per-cpu-thread.
>> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation
>> and if the page is not reallocated and present in the buddy, the kernel
>> thread attempts to isolate it from the buddy. If it is successfully isolated, the
>> page is added to another per-cpu array. Once the entire scanning process is
>> complete, all the isolated pages are reported to the host through an existing
>> virtio-balloon driver.
> 
>  Hi Nitesh,
> 
> Have you guys thought about something like below, which would be simpler:

Responding because I'm the first to stumble over this mail, hah! :)

> 
> - use bitmaps to record free pages, e.g. xbitmap: https://lkml.org/lkml/2018/1/9/304.
>   The bitmap can be indexed by the guest pfn, and it's globally accessed by all the CPUs;

Global means all VCPUs will be competing potentially for a single lock
when freeing/allocating a page, no? What if you have 64 VCPUs
allocating/freeing memory like crazy?

(I assume some kind of locking is required even if the bitmap would be
atomic. Also, doesn't xbitmap mean that we eventually have to allocate
memory at places where we don't want to - e.g. from arch_free_page ?)

That's the big benefit of taking the pages off the buddy free list. Other
VCPUs won't stumble over them, waiting for them to get freed in the
hypervisor.

> - arch_free_page(): set the bits of the freed pages from the bitmap
>  (no per-CPU array with hardcoded fixed length and no per-cpu scanning thread)
> - arch_alloc_page(): clear the related bits from the bitmap
> - expose 2 APIs for the callers:
>   -- unsigned long get_free_page_hints(unsigned long pfn_start, unsigned int nr); 
>      This API searches for the next free page chunk (@nr of pages), starting from @pfn_start.
>      Bits of those free pages will be cleared after this function returns.
>   -- void put_free_page_hints(unsigned long pfn_start, unsigned int nr);
>      This API sets the @nr continuous bits starting from pfn_start.
> 
> Usage example with balloon:
> 1) host requests to start ballooning;
> 2) balloon driver get_free_page_hints and report the hints to host via report_vq;
> 3) host calls madvise(pfn_start, DONTNEED) for each reported chunk of free pages and put back pfn_start to ack_vq;
> 4) balloon driver receives pfn_start and calls put_free_page_hints(pfn_start) to have the related bits from the bitmap to be set, indicating that those free pages are ready to be allocated.

This sounds more like "the host requests to get free pages once in a
while" compared to "the host is always informed about free pages". At
the time where the host actually has to ask the guest (e.g. because the
host is low on memory), it might be too late to wait for guest action.
Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages
as candidates for removal and if the host is low on memory, only
scanning the guest page tables is sufficient to free up memory.

But both points might just be an implementation detail in the example
you describe.

> 
> In above 2), get_free_page_hints clears the bits which indicates that those pages are not ready to be used by the guest yet. Why?
> This is because 3) will unmap the underlying physical pages from EPT. Normally, when guest re-visits those pages, EPT violations and QEMU page faults will get a new host page to set up the related EPT entry. If guest uses that page before the page gets unmapped (i.e. right before step 3), no EPT violation happens and the guest will use the same physical page that will be unmapped and given to other host threads. So we need to make sure that the guest free page is usable only after step 3 finishes.
> 
> Back to arch_alloc_page(), it needs to check if the allocated pages have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it means step 2) above has happened and step 4) hasn't been reached. In this case, we can either have arch_alloc_page() busywaiting a bit till 4) is done for that page
> Or better to have a balloon callback which prioritize 3) and 4) to make this page usable by the guest.

Regarding the latter, the VCPU allocating a page cannot do anything if
the page (along with other pages) is just being freed by the hypervisor.
It has to busy-wait, no chance to prioritize.

> 
> Using bitmaps to record free page hints don't need to take the free pages off the buddy list and return them later, which needs to go through the long allocation/free code path.
> 

Yes, but it means that any process is able to get stuck on such a page
for as long as it takes to report the free pages to the hypervisor and
for it to call madvise(pfn_start, DONTNEED) on any such page.

Nice idea, but I think we definitely need something that can potentially
be implemented per-cpu without any global locks involved.

Thanks!

> Best,
> Wei
>
Nitesh Narayan Lal Feb. 12, 2019, 5:24 p.m. UTC | #3
On 2/12/19 4:24 AM, David Hildenbrand wrote:
> On 12.02.19 10:03, Wang, Wei W wrote:
>> On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
>>> The following patch-set proposes an efficient mechanism for handing freed
>>> memory between the guest and the host. It enables the guests with no page
>>> cache to rapidly free and reclaims memory to and from the host respectively.
>>>
>>> Benefit:
>>> With this patch-series, in our test-case, executed on a single system and
>>> single NUMA node with 15GB memory, we were able to successfully launch
>>> atleast 5 guests when page hinting was enabled and 3 without it. (Detailed
>>> explanation of the test procedure is provided at the bottom).
>>>
>>> Changelog in V8:
>>> In this patch-series, the earlier approach [1] which was used to capture and
>>> scan the pages freed by the guest has been changed. The new approach is
>>> briefly described below:
>>>
>>> The patch-set still leverages the existing arch_free_page() to add this
>>> functionality. It maintains a per CPU array which is used to store the pages
>>> freed by the guest. The maximum number of entries which it can hold is
>>> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it
>>> is scanned and only the pages which are available in the buddy are stored.
>>> This process continues until the array is filled with pages which are part of
>>> the buddy free list. After which it wakes up a kernel per-cpu-thread.
>>> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation
>>> and if the page is not reallocated and present in the buddy, the kernel
>>> thread attempts to isolate it from the buddy. If it is successfully isolated, the
>>> page is added to another per-cpu array. Once the entire scanning process is
>>> complete, all the isolated pages are reported to the host through an existing
>>> virtio-balloon driver.
>>  Hi Nitesh,
>>
>> Have you guys thought about something like below, which would be simpler:
> Responding because I'm the first to stumble over this mail, hah! :)
>
>> - use bitmaps to record free pages, e.g. xbitmap: https://lkml.org/lkml/2018/1/9/304.
>>   The bitmap can be indexed by the guest pfn, and it's globally accessed by all the CPUs;
> Global means all VCPUs will be competing potentially for a single lock
> when freeing/allocating a page, no? What if you have 64VCPUs
> allocating/freeing memory like crazy?
>
> (I assume some kind of locking is required even if the bitmap would be
> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
> memory at places where we don't want to - e.g. from arch_free_page ?)
>
> That's the big benefit of taking the pages of the buddy free list. Other
> VCPUs won't stumble over them, waiting for them to get freed in the
> hypervisor.
>
>> - arch_free_page(): set the bits of the freed pages from the bitmap
>>  (no per-CPU array with hardcoded fixed length and no per-cpu scanning thread)
>> - arch_alloc_page(): clear the related bits from the bitmap
>> - expose 2 APIs for the callers:
>>   -- unsigned long get_free_page_hints(unsigned long pfn_start, unsigned int nr); 
>>      This API searches for the next free page chunk (@nr of pages), starting from @pfn_start.
>>      Bits of those free pages will be cleared after this function returns.
>>   -- void put_free_page_hints(unsigned long pfn_start, unsigned int nr);
>>      This API sets the @nr continuous bits starting from pfn_start.
>>
>> Usage example with balloon:
>> 1) host requests to start ballooning;
>> 2) balloon driver get_free_page_hints and report the hints to host via report_vq;
>> 3) host calls madvise(pfn_start, DONTNEED) for each reported chunk of free pages and put back pfn_start to ack_vq;
>> 4) balloon driver receives pfn_start and calls put_free_page_hints(pfn_start) to have the related bits from the bitmap to be set, indicating that those free pages are ready to be allocated.
> This sounds more like "the host requests to get free pages once in a
> while" compared to "the host is always informed about free pages". At
> the time where the host actually has to ask the guest (e.g. because the
> host is low on memory), it might be to late to wait for guest action.
> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages
> as candidates for removal and if the host is low on memory, only
> scanning the guest page tables is sufficient to free up memory.
>
> But both points might just be an implementation detail in the example
> you describe.
>
>> In above 2), get_free_page_hints clears the bits which indicates that those pages are not ready to be used by the guest yet. Why?
>> This is because 3) will unmap the underlying physical pages from EPT. Normally, when guest re-visits those pages, EPT violations and QEMU page faults will get a new host page to set up the related EPT entry. If guest uses that page before the page gets unmapped (i.e. right before step 3), no EPT violation happens and the guest will use the same physical page that will be unmapped and given to other host threads. So we need to make sure that the guest free page is usable only after step 3 finishes.
>>
>> Back to arch_alloc_page(), it needs to check if the allocated pages have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it means step 2) above has happened and step 4) hasn't been reached. In this case, we can either have arch_alloc_page() busywaiting a bit till 4) is done for that page
>> Or better to have a balloon callback which prioritize 3) and 4) to make this page usable by the guest.
> Regarding the latter, the VCPU allocating a page cannot do anything if
> the page (along with other pages) is just being freed by the hypervisor.
> It has to busy-wait, no chance to prioritize.
>
>> Using bitmaps to record free page hints don't need to take the free pages off the buddy list and return them later, which needs to go through the long allocation/free code path.
>>
> Yes, but it means that any process is able to get stuck on such a page
> for as long as it takes to report the free pages to the hypervisor and
> for it to call madvise(pfn_start, DONTNEED) on any such page.
>
> Nice idea, but I think we definitely need something the can potentially
> be implemented per-cpu without any global locks involved.
>
> Thanks!
>
>> Best,
>> Wei
>>
Hi Wei,

For your comment, I agree with David. If we have one global bitmap instead
of per-CPU ones, we will have to acquire a lock.
Also as David mentioned the idea is to derive the hints from the guest,
rather than host asking for free pages.

However, I am wondering if having per-CPU bitmaps is possible.
Using these, I could possibly get rid of the fixed array size issue.
David Hildenbrand Feb. 12, 2019, 7:34 p.m. UTC | #4
On 12.02.19 18:24, Nitesh Narayan Lal wrote:
> 
> On 2/12/19 4:24 AM, David Hildenbrand wrote:
>> On 12.02.19 10:03, Wang, Wei W wrote:
>>> On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
>>>> The following patch-set proposes an efficient mechanism for handing freed
>>>> memory between the guest and the host. It enables the guests with no page
>>>> cache to rapidly free and reclaims memory to and from the host respectively.
>>>>
>>>> Benefit:
>>>> With this patch-series, in our test-case, executed on a single system and
>>>> single NUMA node with 15GB memory, we were able to successfully launch
>>>> atleast 5 guests when page hinting was enabled and 3 without it. (Detailed
>>>> explanation of the test procedure is provided at the bottom).
>>>>
>>>> Changelog in V8:
>>>> In this patch-series, the earlier approach [1] which was used to capture and
>>>> scan the pages freed by the guest has been changed. The new approach is
>>>> briefly described below:
>>>>
>>>> The patch-set still leverages the existing arch_free_page() to add this
>>>> functionality. It maintains a per CPU array which is used to store the pages
>>>> freed by the guest. The maximum number of entries which it can hold is
>>>> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it
>>>> is scanned and only the pages which are available in the buddy are stored.
>>>> This process continues until the array is filled with pages which are part of
>>>> the buddy free list. After which it wakes up a kernel per-cpu-thread.
>>>> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation
>>>> and if the page is not reallocated and present in the buddy, the kernel
>>>> thread attempts to isolate it from the buddy. If it is successfully isolated, the
>>>> page is added to another per-cpu array. Once the entire scanning process is
>>>> complete, all the isolated pages are reported to the host through an existing
>>>> virtio-balloon driver.
>>>  Hi Nitesh,
>>>
>>> Have you guys thought about something like below, which would be simpler:
>> Responding because I'm the first to stumble over this mail, hah! :)
>>
>>> - use bitmaps to record free pages, e.g. xbitmap: https://lkml.org/lkml/2018/1/9/304.
>>>   The bitmap can be indexed by the guest pfn, and it's globally accessed by all the CPUs;
>> Global means all VCPUs will be competing potentially for a single lock
>> when freeing/allocating a page, no? What if you have 64VCPUs
>> allocating/freeing memory like crazy?
>>
>> (I assume some kind of locking is required even if the bitmap would be
>> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
>> memory at places where we don't want to - e.g. from arch_free_page ?)
>>
>> That's the big benefit of taking the pages of the buddy free list. Other
>> VCPUs won't stumble over them, waiting for them to get freed in the
>> hypervisor.
>>
>>> - arch_free_page(): set the bits of the freed pages from the bitmap
>>>  (no per-CPU array with hardcoded fixed length and no per-cpu scanning thread)
>>> - arch_alloc_page(): clear the related bits from the bitmap
>>> - expose 2 APIs for the callers:
>>>   -- unsigned long get_free_page_hints(unsigned long pfn_start, unsigned int nr); 
>>>      This API searches for the next free page chunk (@nr of pages), starting from @pfn_start.
>>>      Bits of those free pages will be cleared after this function returns.
>>>   -- void put_free_page_hints(unsigned long pfn_start, unsigned int nr);
>>>      This API sets the @nr continuous bits starting from pfn_start.
>>>
>>> Usage example with balloon:
>>> 1) host requests to start ballooning;
>>> 2) balloon driver get_free_page_hints and report the hints to host via report_vq;
>>> 3) host calls madvise(pfn_start, DONTNEED) for each reported chunk of free pages and put back pfn_start to ack_vq;
>>> 4) balloon driver receives pfn_start and calls put_free_page_hints(pfn_start) to have the related bits from the bitmap to be set, indicating that those free pages are ready to be allocated.
>> This sounds more like "the host requests to get free pages once in a
>> while" compared to "the host is always informed about free pages". At
>> the time where the host actually has to ask the guest (e.g. because the
>> host is low on memory), it might be to late to wait for guest action.
>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages
>> as candidates for removal and if the host is low on memory, only
>> scanning the guest page tables is sufficient to free up memory.
>>
>> But both points might just be an implementation detail in the example
>> you describe.
>>
>>> In above 2), get_free_page_hints clears the bits which indicates that those pages are not ready to be used by the guest yet. Why?
>>> This is because 3) will unmap the underlying physical pages from EPT. Normally, when guest re-visits those pages, EPT violations and QEMU page faults will get a new host page to set up the related EPT entry. If guest uses that page before the page gets unmapped (i.e. right before step 3), no EPT violation happens and the guest will use the same physical page that will be unmapped and given to other host threads. So we need to make sure that the guest free page is usable only after step 3 finishes.
>>>
>>> Back to arch_alloc_page(), it needs to check if the allocated pages have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it means step 2) above has happened and step 4) hasn't been reached. In this case, we can either have arch_alloc_page() busywaiting a bit till 4) is done for that page
>>> Or better to have a balloon callback which prioritize 3) and 4) to make this page usable by the guest.
>> Regarding the latter, the VCPU allocating a page cannot do anything if
>> the page (along with other pages) is just being freed by the hypervisor.
>> It has to busy-wait, no chance to prioritize.
>>
>>> Using bitmaps to record free page hints don't need to take the free pages off the buddy list and return them later, which needs to go through the long allocation/free code path.
>>>
>> Yes, but it means that any process is able to get stuck on such a page
>> for as long as it takes to report the free pages to the hypervisor and
>> for it to call madvise(pfn_start, DONTNEED) on any such page.
>>
>> Nice idea, but I think we definitely need something the can potentially
>> be implemented per-cpu without any global locks involved.
>>
>> Thanks!
>>
>>> Best,
>>> Wei
>>>
> Hi Wei,
> 
> For your comment, I agree with David. If we have one global per-cpu, we
> will have to acquire a lock.
> Also as David mentioned the idea is to derive the hints from the guest,
> rather than host asking for free pages.
> 
> However, I am wondering if having per-cpu bitmaps is possible?
> Using this I can possibly get rid of the fixed array size issue.
> 

I assume we will have problems with dynamically sized bitmaps - memory
has to be allocated. Similar to a dynamically sized list.

But it is definitely worth investigating.
Wang, Wei W Feb. 13, 2019, 8:55 a.m. UTC | #5
On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
> Global means all VCPUs will be competing potentially for a single lock when
> freeing/allocating a page, no? What if you have 64VCPUs allocating/freeing
> memory like crazy?

I think the key point is that the 64 vcpus won't allocate/free on the same page simultaneously, so there is no need to have a big global lock, is there?
I think atomic operations on the bitmap would be enough.

> (I assume some kind of locking is required even if the bitmap would be
> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
> memory at places where we don't want to - e.g. from arch_free_page ?)

arch_free_page() is called from free_pages_prepare(); why can't we have memory allocation there?

It would also be doable to find a preferred place to preallocate some amount of memory for the bitmap.

> 
> That's the big benefit of taking the pages of the buddy free list. Other VCPUs
> won't stumble over them, waiting for them to get freed in the hypervisor.

As also mentioned above, I think other vcpus will not allocate/free on the same page that is in the process of being allocated/freed.

> This sounds more like "the host requests to get free pages once in a while"
> compared to "the host is always informed about free pages". At the time
> where the host actually has to ask the guest (e.g. because the host is low on
> memory), it might be to late to wait for guest action.

Option 1: Host asks for free pages:
It is not necessary to ask only when the host is under memory pressure.
This could be the orchestration layer's job to monitor the host memory usage.
For example, people could set the condition "when 50% of the host memory
has been used, start to ask a guest for some amount of free pages".

Option 2: Guest actively offers free pages:
Add a balloon callback to arch_free_page so that whenever a page gets freed its gfn
will be filled into the balloon's report_vq and the host will take away the backing
host page.
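
For illustration, option 2 would amount to something like the callback below wired into arch_free_page(); queue_gfn_for_report() is an invented name standing in for filling the balloon's report_vq:

/* Invented sketch of the option-2 callback; not part of this series. */
static void balloon_free_page_cb(struct page *page, unsigned int order)
{
	/* Hand the gfn to the balloon; the host takes the backing page away. */
	queue_gfn_for_report(page_to_pfn(page), order);
}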

Both options can be implemented. But I think option 1 would be more
efficient as the guest free pages are offered on demand.  

> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
> candidates for removal and if the host is low on memory, only scanning the
> guest page tables is sufficient to free up memory.
> 
> But both points might just be an implementation detail in the example you
> describe.

Yes, it is an implementation detail. I think DONTNEED would be easier
for the first step.

> 
> >
> > In above 2), get_free_page_hints clears the bits which indicates that those
> pages are not ready to be used by the guest yet. Why?
> > This is because 3) will unmap the underlying physical pages from EPT.
> Normally, when guest re-visits those pages, EPT violations and QEMU page
> faults will get a new host page to set up the related EPT entry. If guest uses
> that page before the page gets unmapped (i.e. right before step 3), no EPT
> violation happens and the guest will use the same physical page that will be
> unmapped and given to other host threads. So we need to make sure that
> the guest free page is usable only after step 3 finishes.
> >
> > Back to arch_alloc_page(), it needs to check if the allocated pages
> > have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it
> means step 2) above has happened and step 4) hasn't been reached. In this
> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
> for that page Or better to have a balloon callback which prioritize 3) and 4)
> to make this page usable by the guest.
> 
> Regarding the latter, the VCPU allocating a page cannot do anything if the
> page (along with other pages) is just being freed by the hypervisor.
> It has to busy-wait, no chance to prioritize.

I meant this:
With this approach, essentially the free pages have 2 states:
ready free page: the page is on the free list and it has "1" in the bitmap
non-ready free page: the page is on the free list and it has "0" in the bitmap
Ready free pages are those that can be allocated for use.
Non-ready free pages are those that are in the process of being reported to
the host and whose related EPT mapping is about to be zapped.

The non-ready pages are inserted into the report_vq and wait for the
host to zap the mappings one by one. After the mapping gets zapped
(which means the backing host page has been taken away), the host acks to
the guest to mark the free page as a ready free page (set the bit to 1 in the bitmap).

So a non-ready free page may happen to be used while it is waiting in
the report_vq to be handled by the host to zap the mapping; the balloon could
have a fast path to notify the host:
"page 0x1000 is about to be used, don’t zap the mapping when you get
0x1000 from the report_vq"  /* option [1] */

Or

"page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) and 4) above,
so that the free page will be marked as ready free page and the guest can use it".
This option will generate an extra EPT violation and QEMU page fault to get a new host
page to back the guest ready free page.

> 
> >
> > Using bitmaps to record free page hints don't need to take the free pages
> off the buddy list and return them later, which needs to go through the long
> allocation/free code path.
> >
> 
> Yes, but it means that any process is able to get stuck on such a page for as
> long as it takes to report the free pages to the hypervisor and for it to call
> madvise(pfn_start, DONTNEED) on any such page.

This only happens when a guest allocation happens to land on a page which is
being reported to the host. Using option [1] above will avoid this.

Best,
Wei
Wang, Wei W Feb. 13, 2019, 9 a.m. UTC | #6
On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
> The following patch-set proposes an efficient mechanism for handing freed
> memory between the guest and the host. It enables the guests with no page
> cache to rapidly free and reclaims memory to and from the host respectively.
> 
> Benefit:
> With this patch-series, in our test-case, executed on a single system and
> single NUMA node with 15GB memory, we were able to successfully launch
> atleast 5 guests when page hinting was enabled and 3 without it. (Detailed
> explanation of the test procedure is provided at the bottom).
> 
> Changelog in V8:
> In this patch-series, the earlier approach [1] which was used to capture and
> scan the pages freed by the guest has been changed. The new approach is
> briefly described below:
> 
> The patch-set still leverages the existing arch_free_page() to add this
> functionality. It maintains a per CPU array which is used to store the pages
> freed by the guest. The maximum number of entries which it can hold is
> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it
> is scanned and only the pages which are available in the buddy are stored.
> This process continues until the array is filled with pages which are part of
> the buddy free list. After which it wakes up a kernel per-cpu-thread.
> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation
> and if the page is not reallocated and present in the buddy, the kernel
> thread attempts to isolate it from the buddy. If it is successfully isolated, the
> page is added to another per-cpu array. Once the entire scanning process is
> complete, all the isolated pages are reported to the host through an existing
> virtio-balloon driver.

The free pages are removed from the buddy list here. When will they get returned to the buddy list so that the guest threads can use them normally?

Best,
Wei
David Hildenbrand Feb. 13, 2019, 9:19 a.m. UTC | #7
On 13.02.19 09:55, Wang, Wei W wrote:
> On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
>> Global means all VCPUs will be competing potentially for a single lock when
>> freeing/allocating a page, no? What if you have 64VCPUs allocating/freeing
>> memory like crazy?
> 
> I think the key point is that the 64 vcpus won't allocate/free on the same page simultaneously, so no need to have a global big lock, isn’t it?
> I think atomic operations on the bitmap would be enough.

If you have to resize/alloc/coordinate who will report, you will need
locking. Especially, I doubt that there is an atomic xbitmap  (prove me
wrong :) ).

> 
>> (I assume some kind of locking is required even if the bitmap would be
>> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
>> memory at places where we don't want to - e.g. from arch_free_page ?)
> 
> arch_free_pages is in free_pages_prepare, why can't we have memory allocation there?

I remember we were stumbling over some issues that were non-trivial. I
am not 100% sure anymore, but allocating memory while deep down in
the freeing part of MM core smells like "be careful".

> 
> It would also be doable to find a preferred place to preallocate some amount of memory for the bitmap.

That makes things very ugly. Especially, preallocation will most likely
require locking.

> 
>>
>> That's the big benefit of taking the pages of the buddy free list. Other VCPUs
>> won't stumble over them, waiting for them to get freed in the hypervisor.
> 
> As also mentioned above, I think other vcpus will not allocate/free on the same page that is in progress of being allocated/freed.

If a page is in the buddy but stuck in some other bitmap, there is
nothing stopping another VCPU from trying to allocate it. Nitesh has
been fighting with this problem already :)

> 
>> This sounds more like "the host requests to get free pages once in a while"
>> compared to "the host is always informed about free pages". At the time
>> where the host actually has to ask the guest (e.g. because the host is low on
>> memory), it might be to late to wait for guest action.
> 
> Option 1: Host asks for free pages:
> Not necessary to ask only when the host has been in memory pressure.
> This could be the orchestration layer's job to monitor the host memory usage.
> For example, people could set the condition "when 50% of the host memory
> has been used, start to ask a guest for some amount of free pages" 
> 
> Option 2: Guest actively offers free pages:
> Add a balloon callback to arch_free_page so that whenever a page gets freed its gfn
> will be filled into the balloon's report_vq and the host will take away the backing
> host page.
> 
> Both options can be implemented. But I think option 1 would be more
> efficient as the guest free pages are offered on demand.  

Yes, but as I mentioned this has other drawbacks. Relying on a guest
to free up memory when you really need it is not going to work. It might
work for some scenarios but should not dictate the design. It is a good
start though if it makes things easier.

Enabling/disabling free page hinting by the hypervisor via some
mechanism is on the other hand a good idea. "I have plenty of free
space, don't worry".

> 
>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
>> candidates for removal and if the host is low on memory, only scanning the
>> guest page tables is sufficient to free up memory.
>>
>> But both points might just be an implementation detail in the example you
>> describe.
> 
> Yes, it is an implementation detail. I think DONTNEED would be easier
> for the first step.
> 
>>
>>>
>>> In above 2), get_free_page_hints clears the bits which indicates that those
>> pages are not ready to be used by the guest yet. Why?
>>> This is because 3) will unmap the underlying physical pages from EPT.
>> Normally, when guest re-visits those pages, EPT violations and QEMU page
>> faults will get a new host page to set up the related EPT entry. If guest uses
>> that page before the page gets unmapped (i.e. right before step 3), no EPT
>> violation happens and the guest will use the same physical page that will be
>> unmapped and given to other host threads. So we need to make sure that
>> the guest free page is usable only after step 3 finishes.
>>>
>>> Back to arch_alloc_page(), it needs to check if the allocated pages
>>> have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it
>> means step 2) above has happened and step 4) hasn't been reached. In this
>> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
>> for that page Or better to have a balloon callback which prioritize 3) and 4)
>> to make this page usable by the guest.
>>
>> Regarding the latter, the VCPU allocating a page cannot do anything if the
>> page (along with other pages) is just being freed by the hypervisor.
>> It has to busy-wait, no chance to prioritize.
> 
> I meant this:
> With this approach, essentially the free pages have 2 states:
> ready free page: the page is on the free list and it has "1" in the bitmap
> non-ready free page: the page is on the free list and it has "0" in the bitmap
> Ready free pages are those who can be allocated to use.
> Non-ready free pages are those who are in progress of being reported to
> host and the related EPT mapping is about to be zapped. 
> 
> The non-ready pages are inserted into the report_vq and waiting for the
> host to zap the mappings one by one. After the mapping gets zapped
> (which means the backing host page has been taken away), host acks to
> the guest to mark the free page as ready free page (set the bit to 1 in the bitmap).

Yes, that's how I understood your approach. The interesting part is
where somebody finds a buddy page and wants to allocate it.

> 
> So the non-ready free page may happen to be used when they are waiting in
> the report_vq to be handled by the host to zap the mapping, balloon could
> have a fast path to notify the host:
> "page 0x1000 is about to be used, don’t zap the mapping when you get
> 0x1000 from the report_vq"  /*option [1] */

This requires coordination and in any case there will be a scenario
where you have to wait for the hypervisor to eventually finish a madv
call. You can just try to make that scenario less likely.

What you propose is synchronous in the worst case. Getting pages off the
buddy makes it possible to have it done completely asynchronously. Nobody
allocating a page has to wait.

> 
> Or
> 
> "page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) and 4) above,
> so that the free page will be marked as ready free page and the guest can use it".
> This option will generate an extra EPT violation and QEMU page fault to get a new host
> page to back the guest ready free page.

Again, coordination with the hypervisor while allocating a page. That is
to be avoided in any case.

> 
>>
>>>
>>> Using bitmaps to record free page hints don't need to take the free pages
>> off the buddy list and return them later, which needs to go through the long
>> allocation/free code path.
>>>
>>
>> Yes, but it means that any process is able to get stuck on such a page for as
>> long as it takes to report the free pages to the hypervisor and for it to call
>> madvise(pfn_start, DONTNEED) on any such page.
> 
> This only happens when the guest thread happens to get allocated on a page which is
> being reported to the host. Using option [1] above will avoid this.

I think getting pages out of the buddy system temporarily is the only
way we can avoid somebody else stumbling over a page currently getting
reported by the hypervisor. Otherwise, as I said, there are scenarios
where an allocating VCPU has to wait for the hypervisor to finish the
"freeing" task. While you can try to "speed up" that scenario -
"hypervisor please prioritize" you cannot avoid it. There will be busy
waiting.

I don't believe what you describe is going to work (especially the not
locking part when working with global resources).

What would be interesting is to see if something like an xbitmap could be
used instead of the per-vcpu list. Nitesh, do you remember what the
problem was with allocating memory from these hooks? Was it a locking issue?

Thanks!

> 
> Best,
> Wei
>
Nitesh Narayan Lal Feb. 13, 2019, 12:06 p.m. UTC | #8
On 2/13/19 4:00 AM, Wang, Wei W wrote:
> On Tuesday, February 5, 2019 4:19 AM, Nitesh Narayan Lal wrote:
>> The following patch-set proposes an efficient mechanism for handing freed
>> memory between the guest and the host. It enables the guests with no page
>> cache to rapidly free and reclaims memory to and from the host respectively.
>>
>> Benefit:
>> With this patch-series, in our test-case, executed on a single system and
>> single NUMA node with 15GB memory, we were able to successfully launch
>> atleast 5 guests when page hinting was enabled and 3 without it. (Detailed
>> explanation of the test procedure is provided at the bottom).
>>
>> Changelog in V8:
>> In this patch-series, the earlier approach [1] which was used to capture and
>> scan the pages freed by the guest has been changed. The new approach is
>> briefly described below:
>>
>> The patch-set still leverages the existing arch_free_page() to add this
>> functionality. It maintains a per CPU array which is used to store the pages
>> freed by the guest. The maximum number of entries which it can hold is
>> defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it
>> is scanned and only the pages which are available in the buddy are stored.
>> This process continues until the array is filled with pages which are part of
>> the buddy free list. After which it wakes up a kernel per-cpu-thread.
>> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation
>> and if the page is not reallocated and present in the buddy, the kernel
>> thread attempts to isolate it from the buddy. If it is successfully isolated, the
>> page is added to another per-cpu array. Once the entire scanning process is
>> complete, all the isolated pages are reported to the host through an existing
>> virtio-balloon driver.
> The free page is removed from the buddy list here. When will they get returned to the buddy list so that the guest threads can use them normally?
Once the host frees the pages, all the isolated pages are returned
to the buddy. (This is implemented in hyperlist_ready().)
>
> Best,
> Wei
Nitesh Narayan Lal Feb. 13, 2019, 12:17 p.m. UTC | #9
On 2/13/19 4:19 AM, David Hildenbrand wrote:
> On 13.02.19 09:55, Wang, Wei W wrote:
>> On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
>>> Global means all VCPUs will be competing potentially for a single lock when
>>> freeing/allocating a page, no? What if you have 64VCPUs allocating/freeing
>>> memory like crazy?
>> I think the key point is that the 64 vcpus won't allocate/free on the same page simultaneously, so no need to have a global big lock, isn’t it?
>> I think atomic operations on the bitmap would be enough.
> If you have to resize/alloc/coordinate who will report, you will need
> locking. Especially, I doubt that there is an atomic xbitmap  (prove me
> wrong :) ).
>
>>> (I assume some kind of locking is required even if the bitmap would be
>>> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
>>> memory at places where we don't want to - e.g. from arch_free_page ?)
>> arch_free_pages is in free_pages_prepare, why can't we have memory allocation there?
> I remember we were stumbling over some issues that were non-trivial. I
> am not 100% sure yet anymore, but allocating memory while deep down in
> the freeing part of MM core smells like "be careful".
>
>> It would also be doable to find a preferred place to preallocate some amount of memory for the bitmap.
> That makes things very ugly. Especially, preallocation will most likely
> require locking.
>
>>> That's the big benefit of taking the pages of the buddy free list. Other VCPUs
>>> won't stumble over them, waiting for them to get freed in the hypervisor.
>> As also mentioned above, I think other vcpus will not allocate/free on the same page that is in progress of being allocated/freed.
> If a page is in the buddy but stuck in some other bitmap, there is
> nothing stopping another VCPU from trying to allocate it. Nitesh has
> been fighting with this problem already :)
>
>>> This sounds more like "the host requests to get free pages once in a while"
>>> compared to "the host is always informed about free pages". At the time
>>> where the host actually has to ask the guest (e.g. because the host is low on
>>> memory), it might be to late to wait for guest action.
>> Option 1: Host asks for free pages:
>> Not necessary to ask only when the host has been in memory pressure.
>> This could be the orchestration layer's job to monitor the host memory usage.
>> For example, people could set the condition "when 50% of the host memory
>> has been used, start to ask a guest for some amount of free pages" 
>>
>> Option 2: Guest actively offers free pages:
>> Add a balloon callback to arch_free_page so that whenever a page gets freed its gfn
>> will be filled into the balloon's report_vq and the host will take away the backing
>> host page.
>>
>> Both options can be implemented. But I think option 1 would be more
>> efficient as the guest free pages are offered on demand.  
> Yes, but as I mentioned this has other drawbacks. Relying on a a guest
> to free up memory when you really need it is not going to work. It might
> work for some scenarios but should not dictate the design. It is a good
> start though if it makes things easier.
>
> Enabling/disabling free page hintning by the hypervisor via some
> mechanism is on the other hand a good idea. "I have plenty of free
> space, don't worry".
>
>>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
>>> candidates for removal and if the host is low on memory, only scanning the
>>> guest page tables is sufficient to free up memory.
>>>
>>> But both points might just be an implementation detail in the example you
>>> describe.
>> Yes, it is an implementation detail. I think DONTNEED would be easier
>> for the first step.
>>
>>>> In above 2), get_free_page_hints clears the bits which indicates that those
>>> pages are not ready to be used by the guest yet. Why?
>>>> This is because 3) will unmap the underlying physical pages from EPT.
>>> Normally, when guest re-visits those pages, EPT violations and QEMU page
>>> faults will get a new host page to set up the related EPT entry. If guest uses
>>> that page before the page gets unmapped (i.e. right before step 3), no EPT
>>> violation happens and the guest will use the same physical page that will be
>>> unmapped and given to other host threads. So we need to make sure that
>>> the guest free page is usable only after step 3 finishes.
>>>> Back to arch_alloc_page(), it needs to check if the allocated pages
>>>> have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it
>>> means step 2) above has happened and step 4) hasn't been reached. In this
>>> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
>>> for that page Or better to have a balloon callback which prioritize 3) and 4)
>>> to make this page usable by the guest.
>>>
>>> Regarding the latter, the VCPU allocating a page cannot do anything if the
>>> page (along with other pages) is just being freed by the hypervisor.
>>> It has to busy-wait, no chance to prioritize.
>> I meant this:
>> With this approach, essentially the free pages have 2 states:
>> ready free page: the page is on the free list and it has "1" in the bitmap
>> non-ready free page: the page is on the free list and it has "0" in the bitmap
>> Ready free pages are those who can be allocated to use.
>> Non-ready free pages are those who are in progress of being reported to
>> host and the related EPT mapping is about to be zapped. 
>>
>> The non-ready pages are inserted into the report_vq and waiting for the
>> host to zap the mappings one by one. After the mapping gets zapped
>> (which means the backing host page has been taken away), host acks to
>> the guest to mark the free page as ready free page (set the bit to 1 in the bitmap).
> Yes, that's how I understood your approach. The interesting part is
> where somebody finds a buddy page and wants to allocate it.
>
>> So the non-ready free page may happen to be used when they are waiting in
>> the report_vq to be handled by the host to zap the mapping, balloon could
>> have a fast path to notify the host:
>> "page 0x1000 is about to be used, don’t zap the mapping when you get
>> 0x1000 from the report_vq"  /*option [1] */
> This requires coordination and in any case there will be a scenario
> where you have to wait for the hypervisor to eventually finish a madv
> call. You can just try to make that scenario less likely.
>
> What you propose is synchronous in the worst case. Getting pages of the
> buddy makes it possible to have it done completely asynchronous. Nobody
> allocating a page has to wait.
>
>> Or
>>
>> "page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) and 4) above,
>> so that the free page will be marked as ready free page and the guest can use it".
>> This option will generate an extra EPT violation and QEMU page fault to get a new host
>> page to back the guest ready free page.
> Again, coordination with the hypervisor while allocating a page. That is
> to be avoided in any case.
>
>>>> Using bitmaps to record free page hints don't need to take the free pages
>>> off the buddy list and return them later, which needs to go through the long
>>> allocation/free code path.
>>> Yes, but it means that any process is able to get stuck on such a page for as
>>> long as it takes to report the free pages to the hypervisor and for it to call
>>> madvise(pfn_start, DONTNEED) on any such page.
>> This only happens when the guest thread happens to get allocated on a page which is
>> being reported to the host. Using option [1] above will avoid this.
> I think getting pages out of the buddy system temporarily is the only
> way we can avoid somebody else stumbling over a page currently getting
> reported by the hypervisor. Otherwise, as I said, there are scenarios
> where a allocating VCPU has to wait for the hypervisor to finish the
> "freeing" task. While you can try to "speedup" that scenario -
> "hypervisor please prioritize" you cannot avoid it. There will be busy
> waiting.
>
> I don't believe what you describe is going to work (especially the not
> locking part when working with global resources).
>
> What would be interesting is to see if something like a xbitmap could be
> used instead of the per-vcpu list. 
Yeap, exactly.
> Nitesh, do you remember what the
> problem was with allocating memory from these hooks? Was it a locking issue?
In the previous implementation, the issue was due to the locking. In the
current implementation, having an allocation under these hooks will
result in lots of isolation failures under memory pressure.
If by the above statement you are referring to having a dynamic array
to hold the freed pages, then that is an idea Andrea also suggested to get
around the fixed array size issue.
>
> Thanks!
>
>> Best,
>> Wei
>>
>
Michael S. Tsirkin Feb. 13, 2019, 5:09 p.m. UTC | #10
On Wed, Feb 13, 2019 at 07:17:13AM -0500, Nitesh Narayan Lal wrote:
> 
> On 2/13/19 4:19 AM, David Hildenbrand wrote:
> > On 13.02.19 09:55, Wang, Wei W wrote:
> >> On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
> >>> Global means all VCPUs will be competing potentially for a single lock when
> >>> freeing/allocating a page, no? What if you have 64VCPUs allocating/freeing
> >>> memory like crazy?
> >> I think the key point is that the 64 vcpus won't allocate/free on the same page simultaneously, so no need to have a global big lock, isn’t it?
> >> I think atomic operations on the bitmap would be enough.
> > If you have to resize/alloc/coordinate who will report, you will need
> > locking. Especially, I doubt that there is an atomic xbitmap  (prove me
> > wrong :) ).
> >
> >>> (I assume some kind of locking is required even if the bitmap would be
> >>> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
> >>> memory at places where we don't want to - e.g. from arch_free_page ?)
> >> arch_free_pages is in free_pages_prepare, why can't we have memory allocation there?
> > I remember we were stumbling over some issues that were non-trivial. I
> > am not 100% sure yet anymore, but allocating memory while deep down in
> > the freeing part of MM core smells like "be careful".
> >
> >> It would also be doable to find a preferred place to preallocate some amount of memory for the bitmap.
> > That makes things very ugly. Especially, preallocation will most likely
> > require locking.
> >
> >>> That's the big benefit of taking the pages of the buddy free list. Other VCPUs
> >>> won't stumble over them, waiting for them to get freed in the hypervisor.
> >> As also mentioned above, I think other vcpus will not allocate/free on the same page that is in progress of being allocated/freed.
> > If a page is in the buddy but stuck in some other bitmap, there is
> > nothing stopping another VCPU from trying to allocate it. Nitesh has
> > been fighting with this problem already :)
> >
> >>> This sounds more like "the host requests to get free pages once in a while"
> >>> compared to "the host is always informed about free pages". At the time
> >>> where the host actually has to ask the guest (e.g. because the host is low on
> >>> memory), it might be to late to wait for guest action.
> >> Option 1: Host asks for free pages:
> >> Not necessary to ask only when the host has been in memory pressure.
> >> This could be the orchestration layer's job to monitor the host memory usage.
> >> For example, people could set the condition "when 50% of the host memory
> >> has been used, start to ask a guest for some amount of free pages" 
> >>
> >> Option 2: Guest actively offers free pages:
> >> Add a balloon callback to arch_free_page so that whenever a page gets freed its gfn
> >> will be filled into the balloon's report_vq and the host will take away the backing
> >> host page.
> >>
> >> Both options can be implemented. But I think option 1 would be more
> >> efficient as the guest free pages are offered on demand.  
> > Yes, but as I mentioned this has other drawbacks. Relying on a a guest
> > to free up memory when you really need it is not going to work. It might
> > work for some scenarios but should not dictate the design. It is a good
> > start though if it makes things easier.
> >
> > Enabling/disabling free page hintning by the hypervisor via some
> > mechanism is on the other hand a good idea. "I have plenty of free
> > space, don't worry".
> >
> >>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
> >>> candidates for removal and if the host is low on memory, only scanning the
> >>> guest page tables is sufficient to free up memory.
> >>>
> >>> But both points might just be an implementation detail in the example you
> >>> describe.
> >> Yes, it is an implementation detail. I think DONTNEED would be easier
> >> for the first step.
> >>
> >>>> In above 2), get_free_page_hints clears the bits which indicates that those
> >>> pages are not ready to be used by the guest yet. Why?
> >>>> This is because 3) will unmap the underlying physical pages from EPT.
> >>> Normally, when guest re-visits those pages, EPT violations and QEMU page
> >>> faults will get a new host page to set up the related EPT entry. If guest uses
> >>> that page before the page gets unmapped (i.e. right before step 3), no EPT
> >>> violation happens and the guest will use the same physical page that will be
> >>> unmapped and given to other host threads. So we need to make sure that
> >>> the guest free page is usable only after step 3 finishes.
> >>>> Back to arch_alloc_page(), it needs to check if the allocated pages
> >>>> have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it
> >>> means step 2) above has happened and step 4) hasn't been reached. In this
> >>> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
> >>> for that page Or better to have a balloon callback which prioritize 3) and 4)
> >>> to make this page usable by the guest.
> >>>
> >>> Regarding the latter, the VCPU allocating a page cannot do anything if the
> >>> page (along with other pages) is just being freed by the hypervisor.
> >>> It has to busy-wait, no chance to prioritize.
> >> I meant this:
> >> With this approach, essentially the free pages have 2 states:
> >> ready free page: the page is on the free list and it has "1" in the bitmap
> >> non-ready free page: the page is on the free list and it has "0" in the bitmap
> >> Ready free pages are those who can be allocated to use.
> >> Non-ready free pages are those who are in progress of being reported to
> >> host and the related EPT mapping is about to be zapped. 
> >>
> >> The non-ready pages are inserted into the report_vq and waiting for the
> >> host to zap the mappings one by one. After the mapping gets zapped
> >> (which means the backing host page has been taken away), host acks to
> >> the guest to mark the free page as ready free page (set the bit to 1 in the bitmap).
> > Yes, that's how I understood your approach. The interesting part is
> > where somebody finds a buddy page and wants to allocate it.
> >
> >> So the non-ready free page may happen to be used when they are waiting in
> >> the report_vq to be handled by the host to zap the mapping, balloon could
> >> have a fast path to notify the host:
> >> "page 0x1000 is about to be used, don’t zap the mapping when you get
> >> 0x1000 from the report_vq"  /*option [1] */
> > This requires coordination and in any case there will be a scenario
> > where you have to wait for the hypervisor to eventually finish a madv
> > call. You can just try to make that scenario less likely.
> >
> > What you propose is synchronous in the worst case. Getting pages of the
> > buddy makes it possible to have it done completely asynchronous. Nobody
> > allocating a page has to wait.
> >
> >> Or
> >>
> >> "page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) and 4) above,
> >> so that the free page will be marked as ready free page and the guest can use it".
> >> This option will generate an extra EPT violation and QEMU page fault to get a new host
> >> page to back the guest ready free page.
> > Again, coordination with the hypervisor while allocating a page. That is
> > to be avoided in any case.
> >
> >>>> Using bitmaps to record free page hints don't need to take the free pages
> >>> off the buddy list and return them later, which needs to go through the long
> >>> allocation/free code path.
> >>> Yes, but it means that any process is able to get stuck on such a page for as
> >>> long as it takes to report the free pages to the hypervisor and for it to call
> >>> madvise(pfn_start, DONTNEED) on any such page.
> >> This only happens when the guest thread happens to get allocated on a page which is
> >> being reported to the host. Using option [1] above will avoid this.
> > I think getting pages out of the buddy system temporarily is the only
> > way we can avoid somebody else stumbling over a page currently getting
> > reported by the hypervisor. Otherwise, as I said, there are scenarios
> > where a allocating VCPU has to wait for the hypervisor to finish the
> > "freeing" task. While you can try to "speedup" that scenario -
> > "hypervisor please prioritize" you cannot avoid it. There will be busy
> > waiting.
> >
> > I don't believe what you describe is going to work (especially the not
> > locking part when working with global resources).
> >
> > What would be interesting is to see if something like a xbitmap could be
> > used instead of the per-vcpu list. 
> Yeap, exactly.
> > Nitesh, do you remember what the
> > problem was with allocating memory from these hooks? Was it a locking issue?
> In the previous implementation, the issue was due to the locking. In the
> current implementation having an allocation under these hooks will
> result in lots of isolation failures under memory pressure.

But then we shouldn't be giving memory back to the host when under
pressure at all, should we?

> By the above statement, if you are referring to having a dynamic array
> to hold the freed pages.
> Then, that is an idea Andrea also suggested to get around this fixed
> array size issue.
> >
> > Thanks!
> >
> >> Best,
> >> Wei
> >>
> >
> -- 
> Regards
> Nitesh
>
Michael S. Tsirkin Feb. 13, 2019, 5:16 p.m. UTC | #11
On Wed, Feb 13, 2019 at 10:19:05AM +0100, David Hildenbrand wrote:
> On 13.02.19 09:55, Wang, Wei W wrote:
> > On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
> >> Global means all VCPUs will be competing potentially for a single lock when
> >> freeing/allocating a page, no? What if you have 64VCPUs allocating/freeing
> >> memory like crazy?
> > 
> > I think the key point is that the 64 vcpus won't allocate/free on the same page simultaneously, so no need to have a global big lock, isn’t it?
> > I think atomic operations on the bitmap would be enough.
> 
> If you have to resize/alloc/coordinate who will report, you will need
> locking. Especially, I doubt that there is an atomic xbitmap  (prove me
> wrong :) ).
> 
> > 
> >> (I assume some kind of locking is required even if the bitmap would be
> >> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
> >> memory at places where we don't want to - e.g. from arch_free_page ?)
> > 
> > arch_free_pages is in free_pages_prepare, why can't we have memory allocation there?
> 
> I remember we were stumbling over some issues that were non-trivial. I
> am not 100% sure yet anymore, but allocating memory while deep down in
> the freeing part of MM core smells like "be careful".
> 
> > 
> > It would also be doable to find a preferred place to preallocate some amount of memory for the bitmap.
> 
> That makes things very ugly. Especially, preallocation will most likely
> require locking.
> 
> > 
> >>
> >> That's the big benefit of taking the pages of the buddy free list. Other VCPUs
> >> won't stumble over them, waiting for them to get freed in the hypervisor.
> > 
> > As also mentioned above, I think other vcpus will not allocate/free on the same page that is in progress of being allocated/freed.
> 
> If a page is in the buddy but stuck in some other bitmap, there is
> nothing stopping another VCPU from trying to allocate it. Nitesh has
> been fighting with this problem already :)
> 
> > 
> >> This sounds more like "the host requests to get free pages once in a while"
> >> compared to "the host is always informed about free pages". At the time
> >> where the host actually has to ask the guest (e.g. because the host is low on
> >> memory), it might be to late to wait for guest action.
> > 
> > Option 1: Host asks for free pages:
> > Not necessary to ask only when the host has been in memory pressure.
> > This could be the orchestration layer's job to monitor the host memory usage.
> > For example, people could set the condition "when 50% of the host memory
> > has been used, start to ask a guest for some amount of free pages" 
> > 
> > Option 2: Guest actively offers free pages:
> > Add a balloon callback to arch_free_page so that whenever a page gets freed its gfn
> > will be filled into the balloon's report_vq and the host will take away the backing
> > host page.
> > 
> > Both options can be implemented. But I think option 1 would be more
> > efficient as the guest free pages are offered on demand.  
> 
> Yes, but as I mentioned this has other drawbacks. Relying on a a guest
> to free up memory when you really need it is not going to work. It might
> work for some scenarios but should not dictate the design. It is a good
> start though if it makes things easier.

Besides, it has already been implemented in Linux.

> Enabling/disabling free page hintning by the hypervisor via some
> mechanism is on the other hand a good idea. "I have plenty of free
> space, don't worry".

The existing mechanism includes the ability to cancel reporting.

> > 
> >> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
> >> candidates for removal and if the host is low on memory, only scanning the
> >> guest page tables is sufficient to free up memory.
> >>
> >> But both points might just be an implementation detail in the example you
> >> describe.
> > 
> > Yes, it is an implementation detail. I think DONTNEED would be easier
> > for the first step.
> > 
> >>
> >>>
> >>> In above 2), get_free_page_hints clears the bits which indicates that those
> >> pages are not ready to be used by the guest yet. Why?
> >>> This is because 3) will unmap the underlying physical pages from EPT.
> >> Normally, when guest re-visits those pages, EPT violations and QEMU page
> >> faults will get a new host page to set up the related EPT entry. If guest uses
> >> that page before the page gets unmapped (i.e. right before step 3), no EPT
> >> violation happens and the guest will use the same physical page that will be
> >> unmapped and given to other host threads. So we need to make sure that
> >> the guest free page is usable only after step 3 finishes.
> >>>
> >>> Back to arch_alloc_page(), it needs to check if the allocated pages
> >>> have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it
> >> means step 2) above has happened and step 4) hasn't been reached. In this
> >> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
> >> for that page Or better to have a balloon callback which prioritize 3) and 4)
> >> to make this page usable by the guest.
> >>
> >> Regarding the latter, the VCPU allocating a page cannot do anything if the
> >> page (along with other pages) is just being freed by the hypervisor.
> >> It has to busy-wait, no chance to prioritize.
> > 
> > I meant this:
> > With this approach, essentially the free pages have 2 states:
> > ready free page: the page is on the free list and it has "1" in the bitmap
> > non-ready free page: the page is on the free list and it has "0" in the bitmap
> > Ready free pages are those who can be allocated to use.
> > Non-ready free pages are those who are in progress of being reported to
> > host and the related EPT mapping is about to be zapped. 
> > 
> > The non-ready pages are inserted into the report_vq and waiting for the
> > host to zap the mappings one by one. After the mapping gets zapped
> > (which means the backing host page has been taken away), host acks to
> > the guest to mark the free page as ready free page (set the bit to 1 in the bitmap).
> 
> Yes, that's how I understood your approach. The interesting part is
> where somebody finds a buddy page and wants to allocate it.
> 
> > 
> > So the non-ready free page may happen to be used when they are waiting in
> > the report_vq to be handled by the host to zap the mapping, balloon could
> > have a fast path to notify the host:
> > "page 0x1000 is about to be used, don’t zap the mapping when you get
> > 0x1000 from the report_vq"  /*option [1] */
> 
> This requires coordination and in any case there will be a scenario
> where you have to wait for the hypervisor to eventually finish a madv
> call. You can just try to make that scenario less likely.
> 
> What you propose is synchronous in the worst case. Getting pages of the
> buddy makes it possible to have it done completely asynchronous. Nobody
> allocating a page has to wait.
> 
> > 
> > Or
> > 
> > "page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) and 4) above,
> > so that the free page will be marked as ready free page and the guest can use it".
> > This option will generate an extra EPT violation and QEMU page fault to get a new host
> > page to back the guest ready free page.
> 
> Again, coordination with the hypervisor while allocating a page. That is
> to be avoided in any case.
> 
> > 
> >>
> >>>
> >>> Using bitmaps to record free page hints don't need to take the free pages
> >> off the buddy list and return them later, which needs to go through the long
> >> allocation/free code path.
> >>>
> >>
> >> Yes, but it means that any process is able to get stuck on such a page for as
> >> long as it takes to report the free pages to the hypervisor and for it to call
> >> madvise(pfn_start, DONTNEED) on any such page.
> > 
> > This only happens when the guest thread happens to get allocated on a page which is
> > being reported to the host. Using option [1] above will avoid this.
> 
> I think getting pages out of the buddy system temporarily is the only
> way we can avoid somebody else stumbling over a page currently getting
> reported by the hypervisor. Otherwise, as I said, there are scenarios
> where a allocating VCPU has to wait for the hypervisor to finish the
> "freeing" task. While you can try to "speedup" that scenario -
> "hypervisor please prioritize" you cannot avoid it. There will be busy
> waiting.

Right - there has to be waiting. But it does not have to be busy -
if you can defer page use until an interrupt, that's one option.
Further, if you are ready to exit to the hypervisor, it does not have to be
busy waiting.  In particular, right now virtio does not have a capability
for the device to stop queue processing.  We could add that if necessary.  In
that case, you would stop the queue and detach the buffers.  It is already
possible by resetting the balloon.  Naturally there is no magic - you
exit to the hypervisor and block there. It's not all that great
in that the VCPU does not run at all. But it is not busy waiting.


> I don't believe what you describe is going to work (especially the not
> locking part when working with global resources).
> 
> What would be interesting is to see if something like a xbitmap could be
> used instead of the per-vcpu list. Nitesh, do you remember what the
> problem was with allocating memory from these hooks? Was it a locking issue?
> 
> Thanks!
> 
> > 
> > Best,
> > Wei
> > 
> 
> 
> -- 
> 
> Thanks,
> 
> David / dhildenb
Nitesh Narayan Lal Feb. 13, 2019, 5:22 p.m. UTC | #12
On 2/13/19 12:09 PM, Michael S. Tsirkin wrote:
> On Wed, Feb 13, 2019 at 07:17:13AM -0500, Nitesh Narayan Lal wrote:
>> On 2/13/19 4:19 AM, David Hildenbrand wrote:
>>> On 13.02.19 09:55, Wang, Wei W wrote:
>>>> On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
>>>>> Global means all VCPUs will be competing potentially for a single lock when
>>>>> freeing/allocating a page, no? What if you have 64VCPUs allocating/freeing
>>>>> memory like crazy?
>>>> I think the key point is that the 64 vcpus won't allocate/free on the same page simultaneously, so no need to have a global big lock, isn’t it?
>>>> I think atomic operations on the bitmap would be enough.
>>> If you have to resize/alloc/coordinate who will report, you will need
>>> locking. Especially, I doubt that there is an atomic xbitmap  (prove me
>>> wrong :) ).
>>>
>>>>> (I assume some kind of locking is required even if the bitmap would be
>>>>> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
>>>>> memory at places where we don't want to - e.g. from arch_free_page ?)
>>>> arch_free_pages is in free_pages_prepare, why can't we have memory allocation there?
>>> I remember we were stumbling over some issues that were non-trivial. I
>>> am not 100% sure yet anymore, but allocating memory while deep down in
>>> the freeing part of MM core smells like "be careful".
>>>
>>>> It would also be doable to find a preferred place to preallocate some amount of memory for the bitmap.
>>> That makes things very ugly. Especially, preallocation will most likely
>>> require locking.
>>>
>>>>> That's the big benefit of taking the pages of the buddy free list. Other VCPUs
>>>>> won't stumble over them, waiting for them to get freed in the hypervisor.
>>>> As also mentioned above, I think other vcpus will not allocate/free on the same page that is in progress of being allocated/freed.
>>> If a page is in the buddy but stuck in some other bitmap, there is
>>> nothing stopping another VCPU from trying to allocate it. Nitesh has
>>> been fighting with this problem already :)
>>>
>>>>> This sounds more like "the host requests to get free pages once in a while"
>>>>> compared to "the host is always informed about free pages". At the time
>>>>> where the host actually has to ask the guest (e.g. because the host is low on
>>>>> memory), it might be to late to wait for guest action.
>>>> Option 1: Host asks for free pages:
>>>> Not necessary to ask only when the host has been in memory pressure.
>>>> This could be the orchestration layer's job to monitor the host memory usage.
>>>> For example, people could set the condition "when 50% of the host memory
>>>> has been used, start to ask a guest for some amount of free pages" 
>>>>
>>>> Option 2: Guest actively offers free pages:
>>>> Add a balloon callback to arch_free_page so that whenever a page gets freed its gfn
>>>> will be filled into the balloon's report_vq and the host will take away the backing
>>>> host page.
>>>>
>>>> Both options can be implemented. But I think option 1 would be more
>>>> efficient as the guest free pages are offered on demand.  
>>> Yes, but as I mentioned this has other drawbacks. Relying on a a guest
>>> to free up memory when you really need it is not going to work. It might
>>> work for some scenarios but should not dictate the design. It is a good
>>> start though if it makes things easier.
>>>
>>> Enabling/disabling free page hintning by the hypervisor via some
>>> mechanism is on the other hand a good idea. "I have plenty of free
>>> space, don't worry".
>>>
>>>>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
>>>>> candidates for removal and if the host is low on memory, only scanning the
>>>>> guest page tables is sufficient to free up memory.
>>>>>
>>>>> But both points might just be an implementation detail in the example you
>>>>> describe.
>>>> Yes, it is an implementation detail. I think DONTNEED would be easier
>>>> for the first step.
>>>>
>>>>>> In above 2), get_free_page_hints clears the bits which indicates that those
>>>>> pages are not ready to be used by the guest yet. Why?
>>>>>> This is because 3) will unmap the underlying physical pages from EPT.
>>>>> Normally, when guest re-visits those pages, EPT violations and QEMU page
>>>>> faults will get a new host page to set up the related EPT entry. If guest uses
>>>>> that page before the page gets unmapped (i.e. right before step 3), no EPT
>>>>> violation happens and the guest will use the same physical page that will be
>>>>> unmapped and given to other host threads. So we need to make sure that
>>>>> the guest free page is usable only after step 3 finishes.
>>>>>> Back to arch_alloc_page(), it needs to check if the allocated pages
>>>>>> have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it
>>>>> means step 2) above has happened and step 4) hasn't been reached. In this
>>>>> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
>>>>> for that page Or better to have a balloon callback which prioritize 3) and 4)
>>>>> to make this page usable by the guest.
>>>>>
>>>>> Regarding the latter, the VCPU allocating a page cannot do anything if the
>>>>> page (along with other pages) is just being freed by the hypervisor.
>>>>> It has to busy-wait, no chance to prioritize.
>>>> I meant this:
>>>> With this approach, essentially the free pages have 2 states:
>>>> ready free page: the page is on the free list and it has "1" in the bitmap
>>>> non-ready free page: the page is on the free list and it has "0" in the bitmap
>>>> Ready free pages are those who can be allocated to use.
>>>> Non-ready free pages are those who are in progress of being reported to
>>>> host and the related EPT mapping is about to be zapped. 
>>>>
>>>> The non-ready pages are inserted into the report_vq and waiting for the
>>>> host to zap the mappings one by one. After the mapping gets zapped
>>>> (which means the backing host page has been taken away), host acks to
>>>> the guest to mark the free page as ready free page (set the bit to 1 in the bitmap).
>>> Yes, that's how I understood your approach. The interesting part is
>>> where somebody finds a buddy page and wants to allocate it.
>>>
>>>> So the non-ready free page may happen to be used when they are waiting in
>>>> the report_vq to be handled by the host to zap the mapping, balloon could
>>>> have a fast path to notify the host:
>>>> "page 0x1000 is about to be used, don’t zap the mapping when you get
>>>> 0x1000 from the report_vq"  /*option [1] */
>>> This requires coordination and in any case there will be a scenario
>>> where you have to wait for the hypervisor to eventually finish a madv
>>> call. You can just try to make that scenario less likely.
>>>
>>> What you propose is synchronous in the worst case. Getting pages of the
>>> buddy makes it possible to have it done completely asynchronous. Nobody
>>> allocating a page has to wait.
>>>
>>>> Or
>>>>
>>>> "page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) and 4) above,
>>>> so that the free page will be marked as ready free page and the guest can use it".
>>>> This option will generate an extra EPT violation and QEMU page fault to get a new host
>>>> page to back the guest ready free page.
>>> Again, coordination with the hypervisor while allocating a page. That is
>>> to be avoided in any case.
>>>
>>>>>> Using bitmaps to record free page hints don't need to take the free pages
>>>>> off the buddy list and return them later, which needs to go through the long
>>>>> allocation/free code path.
>>>>> Yes, but it means that any process is able to get stuck on such a page for as
>>>>> long as it takes to report the free pages to the hypervisor and for it to call
>>>>> madvise(pfn_start, DONTNEED) on any such page.
>>>> This only happens when the guest thread happens to get allocated on a page which is
>>>> being reported to the host. Using option [1] above will avoid this.
>>> I think getting pages out of the buddy system temporarily is the only
>>> way we can avoid somebody else stumbling over a page currently getting
>>> reported by the hypervisor. Otherwise, as I said, there are scenarios
>>> where a allocating VCPU has to wait for the hypervisor to finish the
>>> "freeing" task. While you can try to "speedup" that scenario -
>>> "hypervisor please prioritize" you cannot avoid it. There will be busy
>>> waiting.
>>>
>>> I don't believe what you describe is going to work (especially the not
>>> locking part when working with global resources).
>>>
>>> What would be interesting is to see if something like a xbitmap could be
>>> used instead of the per-vcpu list. 
>> Yeap, exactly.
>>> Nitesh, do you remember what the
>>> problem was with allocating memory from these hooks? Was it a locking issue?
>> In the previous implementation, the issue was due to the locking. In the
>> current implementation having an allocation under these hooks will
>> result in lots of isolation failures under memory pressure.
> But then we shouldn't be giving host memory when under pressure
> at all, should we?
Under normal conditions, yes, we would not like to report any memory when
the guest is already under memory pressure.

I am not sure about the scenario where both the guest and the host are under
memory pressure: who should be given priority? Is that something pre-decided,
or does it depend on the use case?

In any case, the current implementation will not give memory back
to the host when the guest is under continuous memory pressure.

>
>> By the above statement, if you are referring to having a dynamic array
>> to hold the freed pages.
>> Then, that is an idea Andrea also suggested to get around this fixed
>> array size issue.
>>> Thanks!
>>>
>>>> Best,
>>>> Wei
>>>>
>> -- 
>> Regards
>> Nitesh
>>
>
>
David Hildenbrand Feb. 13, 2019, 5:59 p.m. UTC | #13
>>>
>>>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
>>>> candidates for removal and if the host is low on memory, only scanning the
>>>> guest page tables is sufficient to free up memory.
>>>>
>>>> But both points might just be an implementation detail in the example you
>>>> describe.
>>>
>>> Yes, it is an implementation detail. I think DONTNEED would be easier
>>> for the first step.
>>>
>>>>
>>>>>
>>>>> In above 2), get_free_page_hints clears the bits which indicates that those
>>>> pages are not ready to be used by the guest yet. Why?
>>>>> This is because 3) will unmap the underlying physical pages from EPT.
>>>> Normally, when guest re-visits those pages, EPT violations and QEMU page
>>>> faults will get a new host page to set up the related EPT entry. If guest uses
>>>> that page before the page gets unmapped (i.e. right before step 3), no EPT
>>>> violation happens and the guest will use the same physical page that will be
>>>> unmapped and given to other host threads. So we need to make sure that
>>>> the guest free page is usable only after step 3 finishes.
>>>>>
>>>>> Back to arch_alloc_page(), it needs to check if the allocated pages
>>>>> have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it
>>>> means step 2) above has happened and step 4) hasn't been reached. In this
>>>> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
>>>> for that page Or better to have a balloon callback which prioritize 3) and 4)
>>>> to make this page usable by the guest.
>>>>
>>>> Regarding the latter, the VCPU allocating a page cannot do anything if the
>>>> page (along with other pages) is just being freed by the hypervisor.
>>>> It has to busy-wait, no chance to prioritize.
>>>
>>> I meant this:
>>> With this approach, essentially the free pages have 2 states:
>>> ready free page: the page is on the free list and it has "1" in the bitmap
>>> non-ready free page: the page is on the free list and it has "0" in the bitmap
>>> Ready free pages are those who can be allocated to use.
>>> Non-ready free pages are those who are in progress of being reported to
>>> host and the related EPT mapping is about to be zapped. 
>>>
>>> The non-ready pages are inserted into the report_vq and waiting for the
>>> host to zap the mappings one by one. After the mapping gets zapped
>>> (which means the backing host page has been taken away), host acks to
>>> the guest to mark the free page as ready free page (set the bit to 1 in the bitmap).
>>
>> Yes, that's how I understood your approach. The interesting part is
>> where somebody finds a buddy page and wants to allocate it.
>>
>>>
>>> So the non-ready free page may happen to be used when they are waiting in
>>> the report_vq to be handled by the host to zap the mapping, balloon could
>>> have a fast path to notify the host:
>>> "page 0x1000 is about to be used, don’t zap the mapping when you get
>>> 0x1000 from the report_vq"  /*option [1] */
>>
>> This requires coordination and in any case there will be a scenario
>> where you have to wait for the hypervisor to eventually finish a madv
>> call. You can just try to make that scenario less likely.
>>
>> What you propose is synchronous in the worst case. Getting pages of the
>> buddy makes it possible to have it done completely asynchronous. Nobody
>> allocating a page has to wait.
>>
>>>
>>> Or
>>>
>>> "page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) and 4) above,
>>> so that the free page will be marked as ready free page and the guest can use it".
>>> This option will generate an extra EPT violation and QEMU page fault to get a new host
>>> page to back the guest ready free page.
>>
>> Again, coordination with the hypervisor while allocating a page. That is
>> to be avoided in any case.
>>
>>>
>>>>
>>>>>
>>>>> Using bitmaps to record free page hints don't need to take the free pages
>>>> off the buddy list and return them later, which needs to go through the long
>>>> allocation/free code path.
>>>>>
>>>>
>>>> Yes, but it means that any process is able to get stuck on such a page for as
>>>> long as it takes to report the free pages to the hypervisor and for it to call
>>>> madvise(pfn_start, DONTNEED) on any such page.
>>>
>>> This only happens when the guest thread happens to get allocated on a page which is
>>> being reported to the host. Using option [1] above will avoid this.
>>
>> I think getting pages out of the buddy system temporarily is the only
>> way we can avoid somebody else stumbling over a page currently getting
>> reported by the hypervisor. Otherwise, as I said, there are scenarios
>> where a allocating VCPU has to wait for the hypervisor to finish the
>> "freeing" task. While you can try to "speedup" that scenario -
>> "hypervisor please prioritize" you cannot avoid it. There will be busy
>> waiting.
> 
> Right - there has to be waiting. But it does not have to be busy -
> if you can defer page use until interrupt, that's one option.
> Further if you are ready to exit to hypervisor it does not have to be
> busy waiting.  In particular right now virtio does not have a capability
> to stop queue processing by device.  We could add that if necessary.  In
> that case, you would stop queue and detach buffers.  It is already
> possible by reseting the balloon.  Naturally there is no magic - you
> exit to hypervisor and block there. It's not all that great
> in that VCPU does not run at all. But it is not busy waiting.

Of course, you can always yield to the hypervisor and not call it busy
waiting. From the guest point of view, it is busy waiting: the VCPU is
not making progress. If I am not wrong, one can easily construct examples
where all VCPUs in the guest are waiting for the hypervisor to
madv(dontneed) pages. I don't like that approach, especially since
temporarily getting pages out of the buddy resolves these issues and
seems to work.
Michael S. Tsirkin Feb. 13, 2019, 7:08 p.m. UTC | #14
On Wed, Feb 13, 2019 at 06:59:24PM +0100, David Hildenbrand wrote:
> >>>
> >>>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
> >>>> candidates for removal and if the host is low on memory, only scanning the
> >>>> guest page tables is sufficient to free up memory.
> >>>>
> >>>> But both points might just be an implementation detail in the example you
> >>>> describe.
> >>>
> >>> Yes, it is an implementation detail. I think DONTNEED would be easier
> >>> for the first step.
> >>>
> >>>>
> >>>>>
> >>>>> In above 2), get_free_page_hints clears the bits which indicates that those
> >>>> pages are not ready to be used by the guest yet. Why?
> >>>>> This is because 3) will unmap the underlying physical pages from EPT.
> >>>> Normally, when guest re-visits those pages, EPT violations and QEMU page
> >>>> faults will get a new host page to set up the related EPT entry. If guest uses
> >>>> that page before the page gets unmapped (i.e. right before step 3), no EPT
> >>>> violation happens and the guest will use the same physical page that will be
> >>>> unmapped and given to other host threads. So we need to make sure that
> >>>> the guest free page is usable only after step 3 finishes.
> >>>>>
> >>>>> Back to arch_alloc_page(), it needs to check if the allocated pages
> >>>>> have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it
> >>>> means step 2) above has happened and step 4) hasn't been reached. In this
> >>>> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
> >>>> for that page Or better to have a balloon callback which prioritize 3) and 4)
> >>>> to make this page usable by the guest.
> >>>>
> >>>> Regarding the latter, the VCPU allocating a page cannot do anything if the
> >>>> page (along with other pages) is just being freed by the hypervisor.
> >>>> It has to busy-wait, no chance to prioritize.
> >>>
> >>> I meant this:
> >>> With this approach, essentially the free pages have 2 states:
> >>> ready free page: the page is on the free list and it has "1" in the bitmap
> >>> non-ready free page: the page is on the free list and it has "0" in the bitmap
> >>> Ready free pages are those who can be allocated to use.
> >>> Non-ready free pages are those who are in progress of being reported to
> >>> host and the related EPT mapping is about to be zapped. 
> >>>
> >>> The non-ready pages are inserted into the report_vq and waiting for the
> >>> host to zap the mappings one by one. After the mapping gets zapped
> >>> (which means the backing host page has been taken away), host acks to
> >>> the guest to mark the free page as ready free page (set the bit to 1 in the bitmap).
> >>
> >> Yes, that's how I understood your approach. The interesting part is
> >> where somebody finds a buddy page and wants to allocate it.
> >>
> >>>
> >>> So the non-ready free page may happen to be used when they are waiting in
> >>> the report_vq to be handled by the host to zap the mapping, balloon could
> >>> have a fast path to notify the host:
> >>> "page 0x1000 is about to be used, don’t zap the mapping when you get
> >>> 0x1000 from the report_vq"  /*option [1] */
> >>
> >> This requires coordination and in any case there will be a scenario
> >> where you have to wait for the hypervisor to eventually finish a madv
> >> call. You can just try to make that scenario less likely.
> >>
> >> What you propose is synchronous in the worst case. Getting pages of the
> >> buddy makes it possible to have it done completely asynchronous. Nobody
> >> allocating a page has to wait.
> >>
> >>>
> >>> Or
> >>>
> >>> "page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) and 4) above,
> >>> so that the free page will be marked as ready free page and the guest can use it".
> >>> This option will generate an extra EPT violation and QEMU page fault to get a new host
> >>> page to back the guest ready free page.
> >>
> >> Again, coordination with the hypervisor while allocating a page. That is
> >> to be avoided in any case.
> >>
> >>>
> >>>>
> >>>>>
> >>>>> Using bitmaps to record free page hints don't need to take the free pages
> >>>> off the buddy list and return them later, which needs to go through the long
> >>>> allocation/free code path.
> >>>>>
> >>>>
> >>>> Yes, but it means that any process is able to get stuck on such a page for as
> >>>> long as it takes to report the free pages to the hypervisor and for it to call
> >>>> madvise(pfn_start, DONTNEED) on any such page.
> >>>
> >>> This only happens when the guest thread happens to get allocated on a page which is
> >>> being reported to the host. Using option [1] above will avoid this.
> >>
> >> I think getting pages out of the buddy system temporarily is the only
> >> way we can avoid somebody else stumbling over a page currently getting
> >> reported by the hypervisor. Otherwise, as I said, there are scenarios
> >> where a allocating VCPU has to wait for the hypervisor to finish the
> >> "freeing" task. While you can try to "speedup" that scenario -
> >> "hypervisor please prioritize" you cannot avoid it. There will be busy
> >> waiting.
> > 
> > Right - there has to be waiting. But it does not have to be busy -
> > if you can defer page use until interrupt, that's one option.
> > Further if you are ready to exit to hypervisor it does not have to be
> > busy waiting.  In particular right now virtio does not have a capability
> > to stop queue processing by device.  We could add that if necessary.  In
> > that case, you would stop queue and detach buffers.  It is already
> > possible by reseting the balloon.  Naturally there is no magic - you
> > exit to hypervisor and block there. It's not all that great
> > in that VCPU does not run at all. But it is not busy waiting.
> 
> Of course, you can always yield to the hypervisor and not call it busy
> waiting. From the guest point of view, it is busy waiting. The VCPU is
> to making progress. If I am not wrong, one can easily construct examples
> where all VCPUs in the guest are waiting for the hypervisor to
> madv(dontneed) pages. I don't like that approach
> 
> Especially if temporarily getting pages out of the buddy resolves these
> issues and seems to work.

Well, the hypervisor can send a signal and interrupt the DONTNEED work.
But yes, I prefer not blocking the VCPU too.

I also prefer MADV_FREE generally.

> 
> -- 
> 
> Thanks,
> 
> David / dhildenb
Wang, Wei W Feb. 14, 2019, 8:48 a.m. UTC | #15
On Wednesday, February 13, 2019 8:07 PM, Nitesh Narayan Lal wrote:
> Once the host free the pages. All the isolated pages are returned back 
> to the buddy. (This is implemented in hyperlist_ready())

This actually has the same issue: the isolated pages have to wait to return to the buddy
until the host is done with madvise(DONTNEED). Otherwise, a page could be used by
a guest thread and the next moment the host takes it away for other host threads.

Best,
Wei
Wang, Wei W Feb. 14, 2019, 9:08 a.m. UTC | #16
On Wednesday, February 13, 2019 5:19 PM, David Hildenbrand wrote:
> If you have to resize/alloc/coordinate who will report, you will need locking.
> Especially, I doubt that there is an atomic xbitmap  (prove me wrong :) ).

Yes, we need change xbitmap to support it.

Just thought of another option, which would be better:
- xb_preload in prepare_alloc_pages to pre-allocate the bitmap memory;
- xb_set/clear the bit under the zone->lock, i.e. in rmqueue and free_one_page

So we use the existing zone->lock to guarantee that the xb ops
will not race with each other on the same bitmap,
and we don't add any new locks that would raise new doubts.
Also, we can probably remove the arch_alloc/free_page part.
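
A rough sketch of what that could look like (the xb_* calls are from the
proposed, non-mainline xbitmap patches with approximated signatures; the
hint_* helpers and the per-zone free_hint_xb field are made up here purely
for illustration):

        /* Allocation slow path, outside the zone lock (may sleep). */
        static inline void hint_prepare_alloc(gfp_t gfp_mask)
        {
                xb_preload(gfp_mask);   /* pre-allocate per-CPU bitmap nodes */
        }

        /* Called with zone->lock held when a page enters the free list. */
        static inline void hint_mark_free(struct zone *zone, unsigned long pfn)
        {
                xb_set_bit(&zone->free_hint_xb, pfn);
        }

        /* Called with zone->lock held when a page leaves the free list. */
        static inline void hint_mark_used(struct zone *zone, unsigned long pfn)
        {
                xb_clear_bit(&zone->free_hint_xb, pfn);
        }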

For the first step, we could use this for the existing VIRTIO_BALLOON_F_FREE_PAGE_HINT live migration optimization:
- just replace alloc_pages(VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG,
                           VIRTIO_BALLOON_FREE_PAGE_ORDER)
  with get_free_page_hints()

get_free_page_hints() was designed to clear the bit, and put_free_page_hints() is needed to set it again later, after the host finishes the madvise. For the live migration usage the host doesn't free the backing host pages, so we can give get_free_page_hints() a parameter option to not clear the bit for this usage. It will be simpler and faster.

I think reading hints from the bitmap via get_free_page_hints() should be much faster than that allocation function, which takes around 15us to get a 4MB block. Another big bonus is that we don't need free_pages() to return all the pages back to the buddy (also quite an expensive operation) when migration is done.

For the second step, we can improve ballooning, e.g. a new feature VIRTIO_BALLOON_F_ADVANCED_BALLOON which uses the same get_free_page_hints() plus a put_free_page_hints(), along with the virtio-balloon's report_vq and ack_vq, to wait for the host's ack before making the free page ready again.
(I think waiting for the host ack is the overhead that the guest has to suffer for enabling memory overcommitment, and even with this v8 patch series it also needs to do that. The optimization method was described yesterday.)
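
A very rough sketch of the second-step flow described above (all of the
helpers here - get/put_free_page_hints() and the report_vq/ack_vq wrappers -
are hypothetical; only the ordering matters: a hinted page must not become
allocatable again until the host has acked the zap):

        static void balloon_report_free_pages(struct virtio_balloon *vb)
        {
                unsigned long pfn;

                /* take the next hint and clear its bit: page is "non-ready" */
                while (get_free_page_hints(&pfn)) {
                        send_on_report_vq(vb, pfn);     /* host zaps/madvises it      */
                        wait_for_ack_on_ack_vq(vb, pfn);/* host confirms the zap done */
                        put_free_page_hints(pfn);       /* set the bit: ready again   */
                }
        }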

> Yes, but as I mentioned this has other drawbacks. Relying on a a guest to free
> up memory when you really need it is not going to work. 

why not working? Host can ask at any time (including when not urgently need it) depending on the admin's configuration.

>It might work for
> some scenarios but should not dictate the design. It is a good start though if
> it makes things easier.
 > Enabling/disabling free page hintning by the hypervisor via some
> mechanism is on the other hand a good idea. "I have plenty of free space,
> don't worry".

Also, guests are not treated identically; the host can decide which guest to ask for free pages first (offering free pages will cause the guest some performance drop).

Best,
Wei
David Hildenbrand Feb. 14, 2019, 9:34 a.m. UTC | #17
On 14.02.19 10:12, Wang, Wei W wrote:
> On Thursday, February 14, 2019 1:22 AM, Nitesh Narayan Lal wrote:
>> In normal condition yes we would not like to report any memory when the
>> guest is already under memory pressure.
>>
>> I am not sure about the scenario where both guest and the host are under
>> memory pressure, who will be given priority? Is it something per-decided or
>> it depends on the use case?
>>
> 
> That's one of the reasons that I would vote for "host to ask for free pages".

As I already said, there are scenarios where this does not work reliably.

When we

1. Let the guest report free pages
2. Allow the hypervisor to enable/disable it

We have a mechanism that is superior to what you describe.

"host to ask for free pages" can be emulated using that.

> 
> Host can have a global view of all the guest's memory states, so better to have the
> memory overcommitment policy defined on the host.
> 
> For example, the host can know guest1 is under memory pressure (thus not asking him)
> and guest2 has a huge amount of free memory. When host lacks memory to run the 3rd
> guest, it can asks guest2 to offer some free memory. 
> 
> Best,
> Wei
>
David Hildenbrand Feb. 14, 2019, 9:42 a.m. UTC | #18
On 14.02.19 09:48, Wang, Wei W wrote:
> On Wednesday, February 13, 2019 8:07 PM, Nitesh Narayan Lal wrote:
>> Once the host free the pages. All the isolated pages are returned back 
>> to the buddy. (This is implemented in hyperlist_ready())
> 
> This actually has the same issue: the isolated pages have to wait to return to the buddy
> after the host has done with madvise(DONTNEED). Otherwise, a page could be used by
> a guest thread and the next moment the host takes it to other host threads.

Yes indeed, that is the important bit. They must not be put back to the
buddy before they have been processed by the hypervisor. But as the
pages are not in the buddy, no one allocating a page will stumble over
such a page and try to allocate it. Threads trying to allocate memory
will simply pick another buddy page instead of "busy waiting" for that
page to be finished reporting.

> 
> Best,
> Wei
>
David Hildenbrand Feb. 14, 2019, 10 a.m. UTC | #19
On 14.02.19 10:08, Wang, Wei W wrote:
> On Wednesday, February 13, 2019 5:19 PM, David Hildenbrand wrote:
>> If you have to resize/alloc/coordinate who will report, you will need locking.
>> Especially, I doubt that there is an atomic xbitmap  (prove me wrong :) ).
> 
> Yes, we need change xbitmap to support it.
> 
> Just thought of another option, which would be better:
> - xb_preload in prepare_alloc_pages to pre-allocate the bitmap memory;
> - xb_set/clear the bit under the zone->lock, i.e. in rmqueue and free_one_page

And how to preload without locking?

> 
> will not be concurrently called to race on the same bitmap.
> And we don't add any new locks to generate new doubts.
> Also, we can probably remove the arch_alloc/free_page part.
> 
> For the first step, we could optimize VIRTIO_BALLOON_F_FREE_PAGE_HINT for the live migration optimization:
> - just replace alloc_pages(VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG,
>                            VIRTIO_BALLOON_FREE_PAGE_ORDER)
> with get_free_page_hints()
> 
> get_free_page_hints() was designed to clear the bit, and need put_free_page_hints() to set it later after host finishes madvise. For the live migration usage, as host doesn't free the backing host pages, so we can give get_free_page_hints a parameter option to not clear the bit for this usage. It will be simpler and faster.
> 
> I think get_free_page_hints() to read hints via bitmaps should be much faster than that allocation function, which takes around 15us to get a 4MB block. Another big bonus is that we don't need free_pages() to return all the pages back to buddy (it's a quite expensive operation too) when migration is done.
> 
> For the second step, we can improve ballooning, e.g. a new feature VIRTIO_BALLOON_F_ADVANCED_BALLOON to use the same get_free_page_hints() and another put_free_page_hints(), along with the virtio-balloon's report_vq and ack_vq to wait for the host's ack before making the free page ready.
> (I think waiting for the host ack is the overhead that the guest has to suffer for enabling memory overcommitment, and even with this v8 patch series it also needs to do that. The optimization method was described yesterday)
> 

As I already said, I don't like that approach, because it has the
fundamental issue of page allocs getting blocked. That does not mean
that it is bad, but I think what Nitesh has is superior in that
sense. Of course, things like "how to enable/disable" and much more
need to be clarified.

If you believe in your approach, feel free to come up with a prototype.
Especially the "no global locking" part could be tricky in my opinion :)

>> Yes, but as I mentioned this has other drawbacks. Relying on a a guest to free
>> up memory when you really need it is not going to work. 
> 
> why not working? Host can ask at any time (including when not urgently need it) depending on the admin's configuration.

Because any heuristic like "I am running out of memory, quickly ask
someone who will need time to respond" is prone to fail in some
scenarios. It might work for many, but it is not a "I am running out of
memory, oh look, this page has been flagged via madv(FREE), let's just
take that."

> 
>> It might work for
>> some scenarios but should not dictate the design. It is a good start though if
>> it makes things easier.
>  > Enabling/disabling free page hintning by the hypervisor via some
>> mechanism is on the other hand a good idea. "I have plenty of free space,
>> don't worry".
> 
> Also guests are not treated identically, host can decide whom to offer the free pages first (offering free pages will cause the guest some performance drop).

Yes, it should definitely be configurable somehow. You don't want free
page hinting always and in any setup.
David Hildenbrand Feb. 14, 2019, 10:44 a.m. UTC | #20
On 14.02.19 11:00, David Hildenbrand wrote:
> On 14.02.19 10:08, Wang, Wei W wrote:
>> On Wednesday, February 13, 2019 5:19 PM, David Hildenbrand wrote:
>>> If you have to resize/alloc/coordinate who will report, you will need locking.
>>> Especially, I doubt that there is an atomic xbitmap  (prove me wrong :) ).
>>
>> Yes, we need change xbitmap to support it.
>>
>> Just thought of another option, which would be better:
>> - xb_preload in prepare_alloc_pages to pre-allocate the bitmap memory;
>> - xb_set/clear the bit under the zone->lock, i.e. in rmqueue and free_one_page
> 
> And how to preload without locking?
> 
>>
>> will not be concurrently called to race on the same bitmap.
>> And we don't add any new locks to generate new doubts.
>> Also, we can probably remove the arch_alloc/free_page part.
>>
>> For the first step, we could optimize VIRTIO_BALLOON_F_FREE_PAGE_HINT for the live migration optimization:
>> - just replace alloc_pages(VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG,
>>                            VIRTIO_BALLOON_FREE_PAGE_ORDER)
>> with get_free_page_hints()
>>
>> get_free_page_hints() was designed to clear the bit, and need put_free_page_hints() to set it later after host finishes madvise. For the live migration usage, as host doesn't free the backing host pages, so we can give get_free_page_hints a parameter option to not clear the bit for this usage. It will be simpler and faster.
>>
>> I think get_free_page_hints() to read hints via bitmaps should be much faster than that allocation function, which takes around 15us to get a 4MB block. Another big bonus is that we don't need free_pages() to return all the pages back to buddy (it's a quite expensive operation too) when migration is done.
>>
>> For the second step, we can improve ballooning, e.g. a new feature VIRTIO_BALLOON_F_ADVANCED_BALLOON to use the same get_free_page_hints() and another put_free_page_hints(), along with the virtio-balloon's report_vq and ack_vq to wait for the host's ack before making the free page ready.
>> (I think waiting for the host ack is the overhead that the guest has to suffer for enabling memory overcommitment, and even with this v8 patch series it also needs to do that. The optimization method was described yesterday)
>>
> 
> As I already said, I don't like that approach, because it has the
> fundamental issue of page allocs getting blocked. That does not mean
> that it is bad, but that I think what Nitesh has is superior in that
> sense. Of course, things like "how to enable/disable", and much more
> needs to be clarified.
> 
> If you believe in your approach, feel free to come up with a prototype.
>  Especially the "no global locking" could be tricky in my opinion :)


I want to add that your approach makes sense if we expect that the
hypervisor will ask for free memory very rarely. Then, blocking during
page alloc is most probably acceptable. Depending on the setup, this
might or might not be the case. If you have some guests that are
allocating/freeing memory continuously, you might want to get back free
pages fairly often to move them to other guests.

In case the hypervisor asks for free pages, as we are not reporting
continuously, you would have to somehow report all pages currently free
to the hypervisor, making sure via the bitmap that they cannot be allocated.

You certainly don't want to track free pages in a bitmap if the
hypervisor is not asking for free pages, otherwise you will eventually
waste a big amount of memory tracking page states nobody cares
about in an xbitmap. So you would have to use another way to initially
fill the bitmap with free pages (when the hypervisor requests it), while
making sure to avoid races with pages getting allocated just while you
are creating the bitmap.
Nitesh Narayan Lal Feb. 14, 2019, 1 p.m. UTC | #21
On 2/14/19 3:48 AM, Wang, Wei W wrote:
> On Wednesday, February 13, 2019 8:07 PM, Nitesh Narayan Lal wrote:
>> Once the host free the pages. All the isolated pages are returned back 
>> to the buddy. (This is implemented in hyperlist_ready())
> This actually has the same issue: the isolated pages have to wait to return to the buddy
> after the host has done with madvise(DONTNEED). Otherwise, a page could be used by
> a guest thread and the next moment the host takes it to other host threads.
I don't think that this will be a blocking case. Let's say there are
pages from the Normal zone which are isolated and are currently being freed
by the host; at this point, even if the Normal zone runs out of free pages,
any allocation request could be served from another zone which still has
free pages.
>
> Best,
> Wei
Wang, Wei W Feb. 15, 2019, 9:05 a.m. UTC | #22
On Thursday, February 14, 2019 5:43 PM, David Hildenbrand wrote:
> Yes indeed, that is the important bit. They must not be put pack to the
> buddy before they have been processed by the hypervisor. But as the pages
> are not in the buddy, no one allocating a page will stumble over such a page
> and try to allocate it. Threads trying to allocate memory will simply pick
> another buddy page instead of "busy waiting" for that page to be finished
> reporting.

What if a guest thread tries to allocate some pages but the buddy cannot satisfy
the request because all the pages are isolated? Would it be the same case, where the
guest thread gets blocked waiting for all the isolated pages to be madvised by the
host and returned to the guest buddy, or even worse, some guest threads get killed
due to OOM?

Best,
Wei
Wang, Wei W Feb. 15, 2019, 9:15 a.m. UTC | #23
On Thursday, February 14, 2019 6:01 PM, David Hildenbrand wrote:
> And how to preload without locking?

The memory is preloaded per-CPU. The preload is usually called outside the lock.

Best,
Wei
David Hildenbrand Feb. 15, 2019, 9:33 a.m. UTC | #24
On 15.02.19 10:15, Wang, Wei W wrote:
> On Thursday, February 14, 2019 6:01 PM, David Hildenbrand wrote:
>> And how to preload without locking?
> 
> The memory is preload per-CPU. It's usually called outside the lock.

Right, that works as long as only a fixed number of pages is needed. I
remember that working for some radix tree operations.
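
For reference, a minimal sketch of the preload-then-lock pattern being
referred to, using the existing radix tree API as an analogy (the tree, lock
and function names are just examples; the xb_* API mentioned earlier is from
the proposed xbitmap patches and is not used here):

        static RADIX_TREE(hint_tree, GFP_ATOMIC);       /* example tree */
        static DEFINE_SPINLOCK(hint_lock);              /* example lock */

        int record_hint(unsigned long index, void *item)
        {
                int err;

                /* may sleep; fills the per-CPU node pool, disables preemption */
                err = radix_tree_preload(GFP_KERNEL);
                if (err)
                        return err;

                spin_lock(&hint_lock);
                err = radix_tree_insert(&hint_tree, index, item);
                spin_unlock(&hint_lock);

                radix_tree_preload_end();       /* re-enables preemption */
                return err;
        }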

> 
> Best,
> Wei
>
David Hildenbrand Feb. 15, 2019, 9:41 a.m. UTC | #25
On 15.02.19 10:05, Wang, Wei W wrote:
> On Thursday, February 14, 2019 5:43 PM, David Hildenbrand wrote:
>> Yes indeed, that is the important bit. They must not be put pack to the
>> buddy before they have been processed by the hypervisor. But as the pages
>> are not in the buddy, no one allocating a page will stumble over such a page
>> and try to allocate it. Threads trying to allocate memory will simply pick
>> another buddy page instead of "busy waiting" for that page to be finished
>> reporting.
> 
> What if a guest thread try to allocate some pages but the buddy cannot satisfy
> because all the pages are isolated? Would it be the same case that the guest thread
> gets blocked by waiting all the isolated pages to get madvised by the host and
> returned to the guest buddy, or even worse, some guest threads get killed due to oom?

Your question targets low memory situations in the guest. I think Nitesh
already answered parts of that question somewhere and I'll let him
answer it in detail, only a short comment from my side :)

I can imagine techniques where the OOM killer can be avoided, but the
OOM handler will eventually kick in and handle it.

In general your question is valid and we will have to think about a way
to avoid that happening. However, in contrast to your approach, which blocks
on potentially every page that is being hinted, in Nitesh's approach this
would only happen when the guest is really low on memory. And the general
question remains whether a guest should hint at all when it is low on
memory ("safety buffer").

> 
> Best,
> Wei
>
Nitesh Narayan Lal Feb. 15, 2019, 12:40 p.m. UTC | #26
On 2/15/19 4:05 AM, Wang, Wei W wrote:
> On Thursday, February 14, 2019 5:43 PM, David Hildenbrand wrote:
>> Yes indeed, that is the important bit. They must not be put back to the
>> buddy before they have been processed by the hypervisor. But as the pages
>> are not in the buddy, no one allocating a page will stumble over such a page
>> and try to allocate it. Threads trying to allocate memory will simply pick
>> another buddy page instead of "busy waiting" for that page to be finished
>> reporting.
> What if a guest thread try to allocate some pages but the buddy cannot satisfy
> because all the pages are isolated? Would it be the same case that the guest thread
> gets blocked by waiting all the isolated pages to get madvised by the host and
> returned to the guest buddy, or even worse, some guest threads get killed due to oom?

If you are referring to a situation where the guest is under memory pressure,
then the isolation request will fail and the memory will not be
reported/isolated. However, there can always be a corner case, and we can
definitely take that into consideration later.

>
> Best,
> Wei
David Hildenbrand Feb. 16, 2019, 9:40 a.m. UTC | #27
On 04.02.19 21:18, Nitesh Narayan Lal wrote:

Hi Nitesh,

I thought again about how s390x handles free page hinting. As that seems
to work just fine, I guess sticking to a similar model makes sense.


I already explained in this thread how it works on s390x, a short summary:

1. Each VCPU has a buffer of pfns to be reported to the hypervisor. If I
am not wrong, it contains 512 entries, so it is exactly one page big. This
buffer is stored in the hypervisor and works on page granularity.

2. This page buffer is managed via the ESSA instruction. In addition, to
synchronize with the guest ("page reused when freeing in the
hypervisor"), special bits in the host->guest page table can be
set/locked via the ESSA instruction by the guest and similarly accessed
by the hypervisor.

3. Once the buffer is full, the guest does a synchronous hypercall,
going over all 512 entries and zapping them (== similar to MADV_DONTNEED)


To mimic that, we

1. Have a static buffer per VCPU in the guest with 512 entries. You
basically have that already.

2. On every free, add the page _or_ the page after merging by the buddy
(e.g. MAX_ORDER - 1) to the buffer (this is where we could be better
than s390x). You basically have that already.

3. If the buffer is full, try to isolate all pages and do a synchronous
report to the hypervisor. You have the first part already. The second
part would require a change (don't use a separate/global thread to do
the hinting, just do it synchronously).

4. Once hinting is done, put back all isolated pages to the buddy. You
basically have that already.
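
A very rough sketch of steps 1.-4. above, just to make the shape concrete;
every identifier here is illustrative, and the isolate/report/putback helpers
stand in for whatever the real patches end up using (locking/preemption
details omitted):

#include <linux/mm.h>
#include <linux/percpu.h>

/* Illustrative helpers, not existing kernel API: */
bool hint_isolate_page(struct page *page);	/* take page off the buddy */
void hint_report_to_host(struct page **pages, unsigned int nr);
void hint_putback_page(struct page *page);	/* return page to the buddy */

#define HINT_ENTRIES	512
#define HINT_MIN_ORDER	(MAX_ORDER - 1)		/* which granularity to report */

struct hint_buf {
	unsigned int nr;
	struct page *page[HINT_ENTRIES];
};
static DEFINE_PER_CPU(struct hint_buf, hint_bufs);	/* 1. static per-VCPU buffer */

static void hint_flush(struct hint_buf *hb)
{
	unsigned int i, isolated = 0;

	/* 3. Take the pages off the buddy so the host cannot race with a
	 *    guest allocation, then report the whole batch synchronously. */
	for (i = 0; i < hb->nr; i++)
		if (hint_isolate_page(hb->page[i]))
			hb->page[isolated++] = hb->page[i];

	hint_report_to_host(hb->page, isolated);

	/* 4. Hinting is done, give the isolated pages back to the buddy. */
	for (i = 0; i < isolated; i++)
		hint_putback_page(hb->page[i]);

	hb->nr = 0;
}

/* 2. Hooked into the free path, e.g. from arch_free_page(). */
void hint_on_free(struct page *page, unsigned int order)
{
	struct hint_buf *hb = this_cpu_ptr(&hint_bufs);

	if (order < HINT_MIN_ORDER)
		return;

	hb->page[hb->nr++] = page;
	if (hb->nr == HINT_ENTRIES)
		hint_flush(hb);		/* synchronous, like the s390x hypercall */
}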


For 3. we can try what you have right now, using virtio. If we detect
that's a problem, we can do it similar to what Alexander proposes and
just do a bare hypercall. It's just a different way of carrying out the
same task.


This approach
1. Mimics what s390x does, besides supporting different granularities.
To synchronize guest->host we simply take the pages off the buddy.

2. Is basically what Alexander does, however his design limitation is
that doing any hinting on smaller granularities will not work because
there will be too many synchronous hints. Bad on fragmented guests.

3. Does not require any dynamic data structures in the guest.

4. Does not block allocation paths.

5. Blocks on e.g. every 512th free. It seems to work on s390x, so why
shouldn't it for us? We have to measure.

6. We are free to decide which granularity we report.

7. Potentially works even if the guest memory is fragmented (few
MAX_ORDER - 1 pages).

It would be worth a try. My feeling is that a synchronous report after
e.g. 512 frees should be acceptable, as it seems to be acceptable on
s390x. (basically always enabled, nobody complains).

We would have to play with how to enable/disable reporting and when to
not report because it's not worth it in the guest (e.g. low on memory).


Do you think something like this would be easy to change/implement and
measure?

Thanks!

> [...]
Wang, Wei W Feb. 18, 2019, 2:36 a.m. UTC | #28
On 02/15/2019 05:41 PM, David Hildenbrand wrote:
> On 15.02.19 10:05, Wang, Wei W wrote:
>> On Thursday, February 14, 2019 5:43 PM, David Hildenbrand wrote:
>>> Yes indeed, that is the important bit. They must not be put back to the
>>> buddy before they have been processed by the hypervisor. But as the pages
>>> are not in the buddy, no one allocating a page will stumble over such a page
>>> and try to allocate it. Threads trying to allocate memory will simply pick
>>> another buddy page instead of "busy waiting" for that page to be finished
>>> reporting.
>> What if a guest thread try to allocate some pages but the buddy cannot satisfy
>> because all the pages are isolated? Would it be the same case that the guest thread
>> gets blocked by waiting all the isolated pages to get madvised by the host and
>> returned to the guest buddy, or even worse, some guest threads get killed due to oom?
> Your question targets low memory situations in the guest. I think Nitesh
> already answered parts of that question somewhere and I'll let him
> answer it in detail, only a short comment from my side :)
>
> I can imagine techniques where the OOM killer can be avoided, but the
> OOM handler will eventually kick in and handle it.
>
> In general your question is valid and we will have to think about a way
> to avoid that from happening. However, in contrast to your approach
> blocking on potentially every page that is being hinted, in Nitesh's
> approach this would only happen when the guest is really low on memory.
> And the question is in general, if a guest wants to hint if low on
> memory ("safety buffer").

I think we shouldn't forget that the guest is low on memory because
this approach takes all the pages off the free list, not because the guest really
uses up the free memory.

A guest thread allocating one page could also potentially be blocked until all
the pages (as opposed to one page) have been madvised and returned to the
guest buddy.

Best,
Wei
Wang, Wei W Feb. 18, 2019, 2:39 a.m. UTC | #29
On 02/18/2019 10:36 AM, Wei Wang wrote:
> On 02/15/2019 05:41 PM, David Hildenbrand wrote:
>> On 15.02.19 10:05, Wang, Wei W wrote:
>>> On Thursday, February 14, 2019 5:43 PM, David Hildenbrand wrote:
>>>> Yes indeed, that is the important bit. They must not be put back to 
>>>> the
>>>> buddy before they have been processed by the hypervisor. But as the 
>>>> pages
>>>> are not in the buddy, no one allocating a page will stumble over 
>>>> such a page
>>>> and try to allocate it. Threads trying to allocate memory will 
>>>> simply pick
>>>> another buddy page instead of "busy waiting" for that page to be 
>>>> finished
>>>> reporting.
>>> What if a guest thread try to allocate some pages but the buddy 
>>> cannot satisfy
>>> because all the pages are isolated? Would it be the same case that 
>>> the guest thread
>>> gets blocked by waiting all the isolated pages to get madvised by 
>>> the host and
>>> returned to the guest buddy, or even worse, some guest threads get 
>>> killed due to oom?
>> Your question targets low memory situations in the guest. I think Nitesh
>> already answered parts of that question somewhere and I'll let him
>> answer it in detail, only a short comment from my side :)
>>
>> I can imagine techniques where the OOM killer can be avoided, but the
>> OOM handler will eventually kick in and handle it.
>>
>> In general your question is valid and we will have to think about a way
>> to avoid that from happening. However, in contrast to your approach
>> blocking on potentially every page that is being hinted, in Nitesh's
>> approach this would only happen when the guest is really low on memory.
>> And the question is in general, if a guest wants to hint if low on
>> memory ("safety buffer").
>
> I think we should forget that the guest is low on memory because
%s/should/shouldn't
> this approach takes all the pages off the list, not because the guest 
> really
> uses up the free memory.
>
> Guest allocating one page could also potentially be blocked until all 
> the pages
> (as opposed to one page) being madvised and returned to the guest buddy.
>
> Best,
> Wei
Nitesh Narayan Lal Feb. 18, 2019, 3:50 p.m. UTC | #30
On 2/16/19 4:40 AM, David Hildenbrand wrote:
> On 04.02.19 21:18, Nitesh Narayan Lal wrote:
>
> Hi Nitesh,
>
> I thought again about how s390x handles free page hinting. As that seems
> to work just fine, I guess sticking to a similar model makes sense.
>
>
> I already explained in this thread how it works on s390x, a short summary:
>
> 1. Each VCPU has a buffer of pfns to be reported to the hypervisor. If I
> am not wrong, it contains 512 entries, so is exactly 1 page big. This
> buffer is stored in the hypervisor and is on page granularity.
>
> 2. This page buffer is managed via the ESSA instruction. In addition, to
> synchronize with the guest ("page reused when freeing in the
> hypervisor"), special bits in the host->guest page table can be
> set/locked via the ESSA instruction by the guest and similarly accessed
> by the hypervisor.
>
> 3. Once the buffer is full, the guest does a synchronous hypercall,
> going over all 512 entries and zapping them (== similar to MADV_DONTNEED)
>
>
> To mimic that, we
>
> 1. Have a static buffer per VCPU in the guest with 512 entries. You
> basically have that already.
>
> 2. On every free, add the page _or_ the page after merging by the buddy
> (e.g. MAX_ORDER - 1) to the buffer (this is where we could be better
> than s390x). You basically have that already.
>
> 3. If the buffer is full, try to isolate all pages and do a synchronous
> report to the hypervisor. You have the first part already. The second
> part would require a change (don't use a separate/global thread to do
> the hinting, just do it synchronously).
>
> 4. Once hinting is done, put back all isolated pages to the buddy. You
> basically have that already.
>
>
> For 3. we can try what you have right now, using virtio. If we detect
> that's a problem, we can do it similar to what Alexander proposes and
> just do a bare hypercall. It's just a different way of carrying out the
> same task.
>
>
> This approach
> 1. Mimics what s390x does, besides supporting different granularities.
> To synchronize guest->host we simply take the pages off the buddy.
>
> 2. Is basically what Alexander does, however his design limitation is
> that doing any hinting on smaller granularities will not work because
> there will be too many synchronous hints. Bad on fragmented guests.
>
> 3. Does not require any dynamic data structures in the guest.
>
> 4. Does not block allocation paths.
>
> 5. Blocks on e.g. every 512'ed free. It seems to work on s390x, why
> shouldn't it for us. We have to measure.
>
> 6. We are free to decide which granularity we report.
>
> 7. Potentially works even if the guest memory is fragmented (little
> MAX_ORDER - 1) pages.
>
> It would be worth a try. My feeling is that a synchronous report after
> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> s390x. (basically always enabled, nobody complains).

The reason I like the current approach of reporting via a separate kernel
thread is that it doesn't block any regular allocation/freeing code path
in any way.
>
> We would have to play with how to enable/disable reporting and when to
> not report because it's not worth it in the guest (e.g. low on memory).
>
>
> Do you think something like this would be easy to change/implement and
> measure?

I can do that once I figure out a real-world guest workload with which
the two approaches can be compared.

> Thanks!
>
>> [...]
>
David Hildenbrand Feb. 18, 2019, 4:02 p.m. UTC | #31
On 18.02.19 16:50, Nitesh Narayan Lal wrote:
> 
> On 2/16/19 4:40 AM, David Hildenbrand wrote:
>> On 04.02.19 21:18, Nitesh Narayan Lal wrote:
>>
>> Hi Nitesh,
>>
>> I thought again about how s390x handles free page hinting. As that seems
>> to work just fine, I guess sticking to a similar model makes sense.
>>
>>
>> I already explained in this thread how it works on s390x, a short summary:
>>
>> 1. Each VCPU has a buffer of pfns to be reported to the hypervisor. If I
>> am not wrong, it contains 512 entries, so is exactly 1 page big. This
>> buffer is stored in the hypervisor and is on page granularity.
>>
>> 2. This page buffer is managed via the ESSA instruction. In addition, to
>> synchronize with the guest ("page reused when freeing in the
>> hypervisor"), special bits in the host->guest page table can be
>> set/locked via the ESSA instruction by the guest and similarly accessed
>> by the hypervisor.
>>
>> 3. Once the buffer is full, the guest does a synchronous hypercall,
>> going over all 512 entries and zapping them (== similar to MADV_DONTNEED)
>>
>>
>> To mimic that, we
>>
>> 1. Have a static buffer per VCPU in the guest with 512 entries. You
>> basically have that already.
>>
>> 2. On every free, add the page _or_ the page after merging by the buddy
>> (e.g. MAX_ORDER - 1) to the buffer (this is where we could be better
>> than s390x). You basically have that already.
>>
>> 3. If the buffer is full, try to isolate all pages and do a synchronous
>> report to the hypervisor. You have the first part already. The second
>> part would require a change (don't use a separate/global thread to do
>> the hinting, just do it synchronously).
>>
>> 4. Once hinting is done, put back all isolated pages to the buddy. You
>> basically have that already.
>>
>>
>> For 3. we can try what you have right now, using virtio. If we detect
>> that's a problem, we can do it similar to what Alexander proposes and
>> just do a bare hypercall. It's just a different way of carrying out the
>> same task.
>>
>>
>> This approach
>> 1. Mimics what s390x does, besides supporting different granularities.
>> To synchronize guest->host we simply take the pages off the buddy.
>>
>> 2. Is basically what Alexander does, however his design limitation is
>> that doing any hinting on smaller granularities will not work because
>> there will be too many synchronous hints. Bad on fragmented guests.
>>
>> 3. Does not require any dynamic data structures in the guest.
>>
>> 4. Does not block allocation paths.
>>
>> 5. Blocks on e.g. every 512'ed free. It seems to work on s390x, why
>> shouldn't it for us. We have to measure.
>>
>> 6. We are free to decide which granularity we report.
>>
>> 7. Potentially works even if the guest memory is fragmented (little
>> MAX_ORDER - 1) pages.
>>
>> It would be worth a try. My feeling is that a synchronous report after
>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
>> s390x. (basically always enabled, nobody complains).
> 
> The reason I like the current approach of reporting via separate kernel
> thread is that it doesn't block any regular allocation/freeing code path
> in any way.

Well, that is partially true. The work has to be done "somewhere", so
once you kick a separate kernel thread, it can easily be scheduled on
the very same VCPU in the very near future. So depending on the user,
the "hickup" is similarly visible.

Having separate kernel threads seems to raise other questions that are not
easy to answer (do we need dynamic data structures, how do we size these
data structures, how many threads do we want with e.g. a big number of
VCPUs?) that seem to be avoidable by keeping it simple and not having
separate threads. Initially I also thought that separate threads were the
natural thing to do, but now I have the feeling that they tend to
overcomplicate the problem. (And I don't want to repeat myself, but on
s390x it seems to work this way just fine, if we want to mimic that.)
Especially without us knowing whether "doing a hypercall every X free
calls" is really a problem.

>>
>> We would have to play with how to enable/disable reporting and when to
>> not report because it's not worth it in the guest (e.g. low on memory).
>>
>>
>> Do you think something like this would be easy to change/implement and
>> measure?
> 
> I can do that as I figure out a real world guest workload using which
> the two approaches can be compared.
Michael S. Tsirkin Feb. 18, 2019, 4:49 p.m. UTC | #32
On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
> It would be worth a try. My feeling is that a synchronous report after
> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> s390x. (basically always enabled, nobody complains).

What slips under the radar on an arch like s390 might
raise issues for a popular arch like x86. My fear would be
if it's only a problem e.g. for realtime. Then you get
a condition that's very hard to trigger and affects
worst case latencies.

But really what business has something that is supposedly
an optimization blocking a VCPU? We are just freeing up
lots of memory; why is it a good idea to slow that
process down?
David Hildenbrand Feb. 18, 2019, 4:59 p.m. UTC | #33
On 18.02.19 17:49, Michael S. Tsirkin wrote:
> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
>> It would be worth a try. My feeling is that a synchronous report after
>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
>> s390x. (basically always enabled, nobody complains).
> 
> What slips under the radar on an arch like s390 might
> raise issues for a popular arch like x86. My fear would be
> if it's only a problem e.g. for realtime. Then you get
> a condition that's very hard to trigger and affects
> worst case latencies.

Realtime should never use free page hinting. Just like it should never
use ballooning. Just like it should pin all pages in the hypervisor.

> 
> But really what business has something that is supposedly
> an optimization blocking a VCPU? We are just freeing up
> lots of memory why is it a good idea to slow that
> process down?

I first want to know that it is a problem before we declare it a
problem. I provided an example (s390x) where it does not seem to be a
problem. One hypercall ~every 512 frees. As simple as it can get.

Not trying to deny that it could be a problem on x86, but then I assume
it is only a problem in specific setups.

I would much rather prefer a simple solution that can eventually be
disabled in selected setups than a complicated solution that tries to fit
all possible setups. Realtime is one of the examples where such stuff is
to be disabled either way.

Optimization of space comes with a price (here: execution time).
Alexander Duyck Feb. 18, 2019, 5:31 p.m. UTC | #34
On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 18.02.19 17:49, Michael S. Tsirkin wrote:
> > On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
> >> It would be worth a try. My feeling is that a synchronous report after
> >> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> >> s390x. (basically always enabled, nobody complains).
> >
> > What slips under the radar on an arch like s390 might
> > raise issues for a popular arch like x86. My fear would be
> > if it's only a problem e.g. for realtime. Then you get
> > a condition that's very hard to trigger and affects
> > worst case latencies.
>
> Realtime should never use free page hinting. Just like it should never
> use ballooning. Just like it should pin all pages in the hypervisor.
>
> >
> > But really what business has something that is supposedly
> > an optimization blocking a VCPU? We are just freeing up
> > lots of memory why is it a good idea to slow that
> > process down?
>
> I first want to know that it is a problem before we declare it a
> problem. I provided an example (s390x) where it does not seem to be a
> problem. One hypercall ~every 512 frees. As simple as it can get.
>
> No trying to deny that it could be a problem on x86, but then I assume
> it is only a problem in specific setups.
>
> I would much rather prefer a simple solution that can eventually be
> disabled in selected setup than a complicated solution that tries to fit
> all possible setups. Realtime is one of the examples where such stuff is
> to be disabled either way.
>
> Optimization of space comes with a price (here: execution time).

One thing to keep in mind though is that if you are already having to
pull pages in and out of swap on the host in order to be able to provide
enough memory for the guests, the free page hinting should be a
significant win in terms of performance.

So far with my patch set that hints at the PMD level w/ THP enabled I
am not really seeing that much overhead for the hypercalls. The bigger
piece that is eating up CPU time is all the page faults and page
zeroing that is going on as we are cycling the memory in and out of
the guest. Some of that could probably be resolved by using MADV_FREE,
but if we are under actual memory pressure I suspect it would behave
similarly to MADV_DONTNEED.
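
For reference, the host-side operation being compared here is a plain
madvise() on the range backing the hinted guest pages. A minimal userspace
illustration (the mapping and size are made up; in the real case this would
be the VM's memory region):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * MADV_DONTNEED drops the pages immediately (the next touch faults in a
 * zero page), while MADV_FREE only marks them reclaimable, so the kernel
 * frees them lazily and mainly under memory pressure.
 */
int main(void)
{
	size_t len = 64UL << 20;	/* 64 MiB stand-in for a hinted range */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 0xaa, len);		/* fault the pages in */

	if (madvise(buf, len, MADV_DONTNEED))	/* or MADV_FREE */
		perror("madvise");

	/* After MADV_DONTNEED, touching buf again costs a fault plus page
	 * zeroing, which is the overhead referred to above. */
	return 0;
}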
David Hildenbrand Feb. 18, 2019, 5:41 p.m. UTC | #35
On 18.02.19 18:31, Alexander Duyck wrote:
> On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 18.02.19 17:49, Michael S. Tsirkin wrote:
>>> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
>>>> It would be worth a try. My feeling is that a synchronous report after
>>>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
>>>> s390x. (basically always enabled, nobody complains).
>>>
>>> What slips under the radar on an arch like s390 might
>>> raise issues for a popular arch like x86. My fear would be
>>> if it's only a problem e.g. for realtime. Then you get
>>> a condition that's very hard to trigger and affects
>>> worst case latencies.
>>
>> Realtime should never use free page hinting. Just like it should never
>> use ballooning. Just like it should pin all pages in the hypervisor.
>>
>>>
>>> But really what business has something that is supposedly
>>> an optimization blocking a VCPU? We are just freeing up
>>> lots of memory why is it a good idea to slow that
>>> process down?
>>
>> I first want to know that it is a problem before we declare it a
>> problem. I provided an example (s390x) where it does not seem to be a
>> problem. One hypercall ~every 512 frees. As simple as it can get.
>>
>> No trying to deny that it could be a problem on x86, but then I assume
>> it is only a problem in specific setups.
>>
>> I would much rather prefer a simple solution that can eventually be
>> disabled in selected setup than a complicated solution that tries to fit
>> all possible setups. Realtime is one of the examples where such stuff is
>> to be disabled either way.
>>
>> Optimization of space comes with a price (here: execution time).
> 
> One thing to keep in mind though is that if you are already having to
> pull pages in and out of swap on the host in order be able to provide
> enough memory for the guests the free page hinting should be a
> significant win in terms of performance.

Indeed. And also, we are in a virtualized environment already; we can
have any kind of sudden hiccups. (Again, realtime has special
requirements on the setup.)

Side note: I like your approach because it is simple. I don't like your
approach because it cannot deal with fragmented memory. And that can
happen easily.

The idea I described here can similarly be an extension of your
approach, merging in the "batched reporting" Nitesh proposed, so we can
report on something < MAX_ORDER, similar to s390x. In the end it boils
down to reporting via hypercall vs. reporting via virtio. The main point
is that it is synchronous and batched. (And that we properly take care
of the race between host freeing and guest allocation.)

> 
> So far with my patch set that hints at the PMD level w/ THP enabled I
> am not really seeing that much overhead for the hypercalls. The bigger
> piece that is eating up CPU time is all the page faults and page
> zeroing that is going on as we are cycling the memory in and out of
> the guest. Some of that could probably be resolved by using MADV_FREE,
> but if we are under actual memory pressure I suspect it would behave
> similar to MADV_DONTNEED.
> 

MADV_FREE is certainly the better thing to do for hinting in my opinion.
It should result in even less overhead. Thanks for the comment about the
hypercall overhead.
Michael S. Tsirkin Feb. 18, 2019, 5:54 p.m. UTC | #36
On Mon, Feb 18, 2019 at 05:59:06PM +0100, David Hildenbrand wrote:
> On 18.02.19 17:49, Michael S. Tsirkin wrote:
> > On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
> >> It would be worth a try. My feeling is that a synchronous report after
> >> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> >> s390x. (basically always enabled, nobody complains).
> > 
> > What slips under the radar on an arch like s390 might
> > raise issues for a popular arch like x86. My fear would be
> > if it's only a problem e.g. for realtime. Then you get
> > a condition that's very hard to trigger and affects
> > worst case latencies.
> 
> Realtime should never use free page hinting.

OK, maybe document this in the commit log. The RT project has
enough work as it is without needing to untangle
complex dependencies with other features.

> Just like it should never
> use ballooning.

Well, it's an aside, but why not ballooning? As long as the hypervisor does not touch the balloon,
and you don't touch the (weird, not really documented properly)
deflate-on-OOM, you are fine.
Real time is violated when you reconfigure the balloon,
but after you are done the guest is real time again.
And management certainly knows that it did something
with the balloon at the exact same time there was a latency spike.


I think this might not work well right now, but generally
I think it should be fine. No?


> Just like it should pin all pages in the hypervisor.

BTW all this is absolutely interesting to fix.
But I agree wrt hinting being kind of like pinning.


> > 
> > But really what business has something that is supposedly
> > an optimization blocking a VCPU? We are just freeing up
> > lots of memory why is it a good idea to slow that
> > process down?
> 
> I first want to know that it is a problem before we declare it a
> problem. I provided an example (s390x) where it does not seem to be a
> problem. One hypercall ~every 512 frees. As simple as it can get.
> 
> No trying to deny that it could be a problem on x86, but then I assume
> it is only a problem in specific setups.

But which setups? How are we going to identify them?

> I would much rather prefer a simple solution that can eventually be
> disabled in selected setup than a complicated solution that tries to fit
> all possible setups.

Well, I am not sure just disabling it is reasonable.  E.g. Alex shows
drastic boot time speedups.  You won't be able to come to people later
and say: you need to disable this feature; yes, you will stop getting
packet loss once in a while, but you also won't be able to boot your VMs
quickly enough.

So I'm fine with a simple implementation but the interface needs to
allow the hypervisor to process hints in parallel while guest is
running.  We can then fix any issues on hypervisor without breaking
guests.


> Realtime is one of the examples where such stuff is
> to be disabled either way.

OK so we have identified realtime. Nice even though it wasn't documented
anywhere. Are there other workloads? What are they?


> Optimization of space comes with a price (here: execution time).

I am not sure I agree. If hinting patches just slowed everyone down they
would be useless. Note how Alex showcased this by demonstrating
faster boot times.

Unlike regular ballooning, this doesn't do much to optimize space. There
are no promises so host must still have enough swap to fit guest memory
anyway.

All free page hinting does is reduce IO on the hypervisor.

So it's a tradeoff.

> -- 
> 
> Thanks,
> 
> David / dhildenb
Michael S. Tsirkin Feb. 18, 2019, 6:01 p.m. UTC | #37
On Mon, Feb 18, 2019 at 09:31:13AM -0800, Alexander Duyck wrote:
> > Optimization of space comes with a price (here: execution time).
> 
> One thing to keep in mind though is that if you are already having to
> pull pages in and out of swap on the host in order be able to provide
> enough memory for the guests the free page hinting should be a
> significant win in terms of performance.

Absolutely, and I think that's the point of the hinting, right?
To cut out swap/IO on the host.
David Hildenbrand Feb. 18, 2019, 6:29 p.m. UTC | #38
On 18.02.19 18:54, Michael S. Tsirkin wrote:
> On Mon, Feb 18, 2019 at 05:59:06PM +0100, David Hildenbrand wrote:
>> On 18.02.19 17:49, Michael S. Tsirkin wrote:
>>> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
>>>> It would be worth a try. My feeling is that a synchronous report after
>>>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
>>>> s390x. (basically always enabled, nobody complains).
>>>
>>> What slips under the radar on an arch like s390 might
>>> raise issues for a popular arch like x86. My fear would be
>>> if it's only a problem e.g. for realtime. Then you get
>>> a condition that's very hard to trigger and affects
>>> worst case latencies.
>>
>> Realtime should never use free page hinting.
> 
> OK maybe document this in commit log. RT project has
> enough work as it is without need to untangle
> complex dependencies with other features.

We most certainly should!!

> 
>> Just like it should never
>> use ballooning.
> 
> Well its an aside but why not ballooning? As long as hypervisor does not touch the balloon,
> and you don't touch the (weird, not really documented properly)
> deflate on oom, you are fine.
> Real time is violated when you reconfigure balloon,
> but  after you are done guest is real time again.
> And management certainly knows it that it did something
> with balloon at the exact same time there was a latency spike.

Fair enough, this is a potential use case. But it goes hand in hand with
pinning/unpinning pages. So yes, while this would be possible - modifying
the balloon in a "no real time" period - I doubt this is a real-life
scenario. As always, I like to be taught differently :)

Similar to "start reporting on !RT activity" and "stop reporting on RT
activity"

> 
> 
> I think this might not work well right now, but generally
> I think it should be fine. No?
> 
> 
>> Just like it should pin all pages in the hypervisor.
> 
> BTW all this is absolutely interesting to fix.
> But I agree wrt hinting being kind of like pinning.

Yes, this is all interesting stuff :)

> 
> 
>>>
>>> But really what business has something that is supposedly
>>> an optimization blocking a VCPU? We are just freeing up
>>> lots of memory why is it a good idea to slow that
>>> process down?
>>
>> I first want to know that it is a problem before we declare it a
>> problem. I provided an example (s390x) where it does not seem to be a
>> problem. One hypercall ~every 512 frees. As simple as it can get.
>>
>> No trying to deny that it could be a problem on x86, but then I assume
>> it is only a problem in specific setups.
> 
> But which setups? How are we going to identify them?

I guess it is simple (I should be careful with this word ;)): As long as
you don't isolate + pin your CPUs in the hypervisor, you can expect any
kind of sudden hiccups. We're in a virtualized world. Real time is one
example.

Using kernel threads like Nitesh does right now? They can be scheduled
anytime by the hypervisor on the exact same CPU, unless you isolate +
pin in the hypervisor. So the same problem applies.

> 
>> I would much rather prefer a simple solution that can eventually be
>> disabled in selected setup than a complicated solution that tries to fit
>> all possible setups.
> 
> Well I am not sure just disabling it is reasonable.  E.g. Alex shows
> drastic boot time speedups.  You won't be able to come to people later
> and say oh you need to disable this feature yes you will stop getting
> packet loss once in a while but you also won't be able to boot your VMs
> quickly enough.

The guest is always free to disable once up. Yes, these are nice
details, but I consider these improvements we can work on later.

> 
> So I'm fine with a simple implementation but the interface needs to
> allow the hypervisor to process hints in parallel while guest is
> running.  We can then fix any issues on hypervisor without breaking
> guests.

Yes, I am fine with defining an interface that theoretically lets us
change the implementation in the guest later. I consider this even a
prerequisite. IMHO the interface shouldn't be different, it will be
exactly the same.

It is just "who" calls the batch freeing and waits for it. And as I
outlined here, doing it without additional threads at least avoids us
for now having to think about dynamic data structures and that we can
sometimes not report "because the thread is still busy reporting or
wasn't scheduled yet".

> 
> 
>> Realtime is one of the examples where such stuff is
>> to be disabled either way.
> 
> OK so we have identified realtime. Nice even though it wasn't documented
> anywhere. Are there other workloads? What are they?

As stated above, I think these environments are easy to spot. As long as
you don't isolate and pin, surprises can happen anytime. Can you think
of others?

(this stuff really has to be documented)

> 
> 
>> Optimization of space comes with a price (here: execution time).
> 
> I am not sure I agree. If hinting patches just slowed everyone down they
> would be useless. Note how Alex show-cased this by demonstrating
> faster boot times.

Of course, like compressing the whole guest memory, things you might not
want to do ;) In the end, there has to be a net benefit.

> 
> Unlike regular ballooning, this doesn't do much to optimize space. There
> are no promises so host must still have enough swap to fit guest memory
> anyway.
> 
> All free page hinting does is reduce IO on the hypervisor.
> 
> So it's a tradeoff.

+1 to that.


The nice thing about this approach is that we can easily tweak "how many
to report in one shot" and "which sizes to report". We can play with it
fairly easily.
Michael S. Tsirkin Feb. 18, 2019, 7:16 p.m. UTC | #39
On Mon, Feb 18, 2019 at 07:29:44PM +0100, David Hildenbrand wrote:
> > 
> >>>
> >>> But really what business has something that is supposedly
> >>> an optimization blocking a VCPU? We are just freeing up
> >>> lots of memory why is it a good idea to slow that
> >>> process down?
> >>
> >> I first want to know that it is a problem before we declare it a
> >> problem. I provided an example (s390x) where it does not seem to be a
> >> problem. One hypercall ~every 512 frees. As simple as it can get.
> >>
> >> No trying to deny that it could be a problem on x86, but then I assume
> >> it is only a problem in specific setups.
> > 
> > But which setups? How are we going to identify them?
> 
> I guess is simple (I should be carefuly with this word ;) ): As long as
> you don't isolate + pin your CPUs in the hypervisor, you can expect any
> kind of sudden hickups. We're in a virtualized world. Real time is one
> example.
> 
> Using kernel threads like Nitesh does right now? It can be scheduled
> anytime by the hypervisor on the exact same cpu. Unless you isolate +
> pin in the hypervor. So the same problem applies.

Right, but we know how to handle this. Many deployments already use tools
to detect host threads kicking VCPUs out.
Getting a VCPU blocked by a kfree call would be something new.


...

> > So I'm fine with a simple implementation but the interface needs to
> > allow the hypervisor to process hints in parallel while guest is
> > running.  We can then fix any issues on hypervisor without breaking
> > guests.
> 
> Yes, I am fine with defining an interface that theoretically let's us
> change the implementation in the guest later.
> I consider this even a
> prerequisite. IMHO the interface shouldn't be different, it will be
> exactly the same.
> 
> It is just "who" calls the batch freeing and waits for it. And as I
> outlined here, doing it without additional threads at least avoids us
> for now having to think about dynamic data structures and that we can
> sometimes not report "because the thread is still busy reporting or
> wasn't scheduled yet".

Sorry, I wasn't clear. I think we need the ability to change the
implementation in the *host* later. IOW, don't rely on
the host being synchronous.
David Hildenbrand Feb. 18, 2019, 7:35 p.m. UTC | #40
On 18.02.19 20:16, Michael S. Tsirkin wrote:
> On Mon, Feb 18, 2019 at 07:29:44PM +0100, David Hildenbrand wrote:
>>>
>>>>>
>>>>> But really what business has something that is supposedly
>>>>> an optimization blocking a VCPU? We are just freeing up
>>>>> lots of memory why is it a good idea to slow that
>>>>> process down?
>>>>
>>>> I first want to know that it is a problem before we declare it a
>>>> problem. I provided an example (s390x) where it does not seem to be a
>>>> problem. One hypercall ~every 512 frees. As simple as it can get.
>>>>
>>>> No trying to deny that it could be a problem on x86, but then I assume
>>>> it is only a problem in specific setups.
>>>
>>> But which setups? How are we going to identify them?
>>
>> I guess is simple (I should be carefuly with this word ;) ): As long as
>> you don't isolate + pin your CPUs in the hypervisor, you can expect any
>> kind of sudden hickups. We're in a virtualized world. Real time is one
>> example.
>>
>> Using kernel threads like Nitesh does right now? It can be scheduled
>> anytime by the hypervisor on the exact same cpu. Unless you isolate +
>> pin in the hypervor. So the same problem applies.
> 
> Right but we know how to handle this. Many deployments already use tools
> to detect host threads kicking VCPUs out.
> Getting VCPU blocked by a kfree call would be something new.
> 

Yes, and for s390x we already have some kfree's taking longer than
others. We have to identify when it is not okay.

> 
>>> So I'm fine with a simple implementation but the interface needs to
>>> allow the hypervisor to process hints in parallel while guest is
>>> running.  We can then fix any issues on hypervisor without breaking
>>> guests.
>>
>> Yes, I am fine with defining an interface that theoretically let's us
>> change the implementation in the guest later.
>> I consider this even a
>> prerequisite. IMHO the interface shouldn't be different, it will be
>> exactly the same.
>>
>> It is just "who" calls the batch freeing and waits for it. And as I
>> outlined here, doing it without additional threads at least avoids us
>> for now having to think about dynamic data structures and that we can
>> sometimes not report "because the thread is still busy reporting or
>> wasn't scheduled yet".
> 
> Sorry I wasn't clear. I think we need ability to change the
> implementation in the *host* later. IOW don't rely on
> host being synchronous.
> 
> 
I actually misread it :). In any case, there has to be a mechanism to
synchronize.

If we are going via a bare hypercall (like s390x, like what Alexander
proposes), it is going to be a synchronous interface either way. With just a
bare hypercall, there will not really be any blocking on the guest side.

Via virtio, I guess it is waiting for a response to a request, right?
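
For the bare-hypercall variant, the guest side would boil down to something
like this; the hypercall number and argument layout are purely hypothetical,
only kvm_hypercall2() itself is existing plumbing:

#include <linux/kvm_para.h>

/* Purely hypothetical hypercall number, for illustration only. */
#define KVM_HC_HINT_RANGE	13

/*
 * Synchronous, s390x-style report of one free range: the VCPU exits to
 * the hypervisor, which can madvise the range before resuming the guest.
 * There is no request/response bookkeeping in the guest, unlike the
 * virtio path.
 */
static inline long hint_free_range(unsigned long pfn, unsigned int order)
{
	return kvm_hypercall2(KVM_HC_HINT_RANGE, pfn, order);
}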
Michael S. Tsirkin Feb. 18, 2019, 7:47 p.m. UTC | #41
On Mon, Feb 18, 2019 at 08:35:36PM +0100, David Hildenbrand wrote:
> On 18.02.19 20:16, Michael S. Tsirkin wrote:
> > On Mon, Feb 18, 2019 at 07:29:44PM +0100, David Hildenbrand wrote:
> >>>
> >>>>>
> >>>>> But really what business has something that is supposedly
> >>>>> an optimization blocking a VCPU? We are just freeing up
> >>>>> lots of memory why is it a good idea to slow that
> >>>>> process down?
> >>>>
> >>>> I first want to know that it is a problem before we declare it a
> >>>> problem. I provided an example (s390x) where it does not seem to be a
> >>>> problem. One hypercall ~every 512 frees. As simple as it can get.
> >>>>
> >>>> No trying to deny that it could be a problem on x86, but then I assume
> >>>> it is only a problem in specific setups.
> >>>
> >>> But which setups? How are we going to identify them?
> >>
> >> I guess is simple (I should be carefuly with this word ;) ): As long as
> >> you don't isolate + pin your CPUs in the hypervisor, you can expect any
> >> kind of sudden hickups. We're in a virtualized world. Real time is one
> >> example.
> >>
> >> Using kernel threads like Nitesh does right now? It can be scheduled
> >> anytime by the hypervisor on the exact same cpu. Unless you isolate +
> >> pin in the hypervor. So the same problem applies.
> > 
> > Right but we know how to handle this. Many deployments already use tools
> > to detect host threads kicking VCPUs out.
> > Getting VCPU blocked by a kfree call would be something new.
> > 
> 
> Yes, and for s390x we already have some kfree's taking longer than
> others. We have to identify when it is not okay.

Right, even if the problem exists elsewhere, this does not make it go away
or ensure that someone will work to address it :)


> > 
> >>> So I'm fine with a simple implementation but the interface needs to
> >>> allow the hypervisor to process hints in parallel while guest is
> >>> running.  We can then fix any issues on hypervisor without breaking
> >>> guests.
> >>
> >> Yes, I am fine with defining an interface that theoretically let's us
> >> change the implementation in the guest later.
> >> I consider this even a
> >> prerequisite. IMHO the interface shouldn't be different, it will be
> >> exactly the same.
> >>
> >> It is just "who" calls the batch freeing and waits for it. And as I
> >> outlined here, doing it without additional threads at least avoids us
> >> for now having to think about dynamic data structures and that we can
> >> sometimes not report "because the thread is still busy reporting or
> >> wasn't scheduled yet".
> > 
> > Sorry I wasn't clear. I think we need ability to change the
> > implementation in the *host* later. IOW don't rely on
> > host being synchronous.
> > 
> > 
> I actually misread it :) . In any way, there has to be a mechanism to
> synchronize.
> 
> If we are going via a bare hypercall (like s390x, like what Alexander
> proposes), it is going to be a synchronous interface either way. Just a
> bare hypercall, there will not really be any blocking on the guest side.

It bothers me that we are now tied to the interface being synchronous. We
won't be able to fix it if there's an issue, as that would break guests.

> Via virtio, I guess it is waiting for a response to a requests, right?

For the buffer to be used, yes. And it could mean putting some pages
aside until the hypervisor is done with them. Then you don't need timers or
tricks like this; you can get an interrupt and start using the memory.


> -- 
> 
> Thanks,
> 
> David / dhildenb
David Hildenbrand Feb. 18, 2019, 8:04 p.m. UTC | #42
>>>>> So I'm fine with a simple implementation but the interface needs to
>>>>> allow the hypervisor to process hints in parallel while guest is
>>>>> running.  We can then fix any issues on hypervisor without breaking
>>>>> guests.
>>>>
>>>> Yes, I am fine with defining an interface that theoretically let's us
>>>> change the implementation in the guest later.
>>>> I consider this even a
>>>> prerequisite. IMHO the interface shouldn't be different, it will be
>>>> exactly the same.
>>>>
>>>> It is just "who" calls the batch freeing and waits for it. And as I
>>>> outlined here, doing it without additional threads at least avoids us
>>>> for now having to think about dynamic data structures and that we can
>>>> sometimes not report "because the thread is still busy reporting or
>>>> wasn't scheduled yet".
>>>
>>> Sorry I wasn't clear. I think we need ability to change the
>>> implementation in the *host* later. IOW don't rely on
>>> host being synchronous.
>>>
>>>
>> I actually misread it :) . In any way, there has to be a mechanism to
>> synchronize.
>>
>> If we are going via a bare hypercall (like s390x, like what Alexander
>> proposes), it is going to be a synchronous interface either way. Just a
>> bare hypercall, there will not really be any blocking on the guest side.
> 
> It bothers me that we are now tied to interface being synchronous. We
> won't be able to fix it if there's an issue as that would break guests.

I assume with "fix it" you mean "fix kfree taking longer on every X call"?

Yes, as I initially wrote, this mimics s390x. That might be good (we
know it has been working for years) and bad (we are inheriting the same
problem class, if it exists). And being synchronous is part of the
approach for now.

I tend to focus on the first part (we don't know anything besides it is
working) while you focus on the second part (there could be a potential
problem). Having a real problem at hand would be great, then we would
know what exactly we actually have to fix. But read below.

> 
>> Via virtio, I guess it is waiting for a response to a requests, right?
> 
> For the buffer to be used, yes. And it could mean putting some pages
> aside until hypervisor is done with them. Then you don't need timers or
> tricks like this, you can get an interrupt and start using the memory.

I am very open to such an approach as long as we can make it work and it
is not too complicated. (-> simple)

This would mean for example

1. Collect entries to be reported per VCPU in a buffer. Say magic number
256/512.

2. Once the buffer is full, do crazy "take pages out of the balloon
action" and report them to the hypervisor via virtio. Let the VCPU
continue. This will require some memory to store the request. Small
hickup for the VCPU to kick of the reporting to the hypervisor.

3. On interrupt/response, go over the response and put the pages back to
the buddy.

(assuming that reporting a bulk of frees is better than reporting every
single free obviously)
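
A rough guest-side sketch of steps 1.-3. above; all of the names and the
virtio plumbing details are hypothetical, the only point is the shape of the
flow (submit without waiting, put the pages back from the completion
callback):

#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/virtio.h>
#include <linux/virtio_config.h>

void hint_putback_page(struct page *page);	/* illustrative: back to buddy */

#define HINT_ENTRIES	512

struct hint_request {
	struct virtqueue *vq;			/* hinting queue of the device */
	unsigned int nr;
	struct page *page[HINT_ENTRIES];
	__virtio64 pfns[HINT_ENTRIES];		/* what the device actually sees */
	struct scatterlist sg;
};

/* 2. Buffer is full: hand it to the device and let the VCPU continue. */
static void hint_submit(struct hint_request *req)
{
	unsigned int i;

	for (i = 0; i < req->nr; i++)
		req->pfns[i] = cpu_to_virtio64(req->vq->vdev,
					       page_to_pfn(req->page[i]));

	sg_init_one(&req->sg, req->pfns, req->nr * sizeof(req->pfns[0]));
	if (virtqueue_add_outbuf(req->vq, &req->sg, 1, req, GFP_ATOMIC) < 0) {
		/* Queue full: give the pages back instead of waiting. */
		for (i = 0; i < req->nr; i++)
			hint_putback_page(req->page[i]);
		req->nr = 0;
		return;
	}
	virtqueue_kick(req->vq);		/* no blocking, no response yet */
}

/* 3. Registered as the virtqueue callback: the host is done with the
 *    batch, so put the isolated pages back to the buddy. */
static void hint_done(struct virtqueue *vq)
{
	struct hint_request *req;
	unsigned int len, i;

	while ((req = virtqueue_get_buf(vq, &len))) {
		for (i = 0; i < req->nr; i++)
			hint_putback_page(req->page[i]);
		req->nr = 0;
	}
}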

This could allow nice things like "when OOM gets trigger, see if pages
are currently being reported and wait until they have been put back to
the buddy, return "new pages available", so in a real "low on memory"
scenario, no OOM killer would get involved. This could address the issue
Wei had with reporting when low on memory.

Is that something you have in mind? I assume we would have to allocate
memory when crafting the new requests. This is the only reason I tend to
prefer a synchronous interface for now. But if allocation is not a
problem, great.
Michael S. Tsirkin Feb. 18, 2019, 8:31 p.m. UTC | #43
On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
> >>>>> So I'm fine with a simple implementation but the interface needs to
> >>>>> allow the hypervisor to process hints in parallel while guest is
> >>>>> running.  We can then fix any issues on hypervisor without breaking
> >>>>> guests.
> >>>>
> >>>> Yes, I am fine with defining an interface that theoretically let's us
> >>>> change the implementation in the guest later.
> >>>> I consider this even a
> >>>> prerequisite. IMHO the interface shouldn't be different, it will be
> >>>> exactly the same.
> >>>>
> >>>> It is just "who" calls the batch freeing and waits for it. And as I
> >>>> outlined here, doing it without additional threads at least avoids us
> >>>> for now having to think about dynamic data structures and that we can
> >>>> sometimes not report "because the thread is still busy reporting or
> >>>> wasn't scheduled yet".
> >>>
> >>> Sorry I wasn't clear. I think we need ability to change the
> >>> implementation in the *host* later. IOW don't rely on
> >>> host being synchronous.
> >>>
> >>>
> >> I actually misread it :) . In any way, there has to be a mechanism to
> >> synchronize.
> >>
> >> If we are going via a bare hypercall (like s390x, like what Alexander
> >> proposes), it is going to be a synchronous interface either way. Just a
> >> bare hypercall, there will not really be any blocking on the guest side.
> > 
> > It bothers me that we are now tied to interface being synchronous. We
> > won't be able to fix it if there's an issue as that would break guests.
> 
> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
> 
> Yes, as I initially wrote, this mimics s390x. That might be good (we
> know it has been working for years) and bad (we are inheriting the same
> problem class, if it exists). And being synchronous is part of the
> approach for now.

BTW on s390 are these hypercalls handled by Linux?

> I tend to focus on the first part (we don't know anything besides it is
> working) while you focus on the second part (there could be a potential
> problem). Having a real problem at hand would be great, then we would
> know what exactly we actually have to fix. But read below.

If we end up doing a hypercall per THP, maybe we could at least
not block with interrupts disabled? Poll in the guest until the
hypervisor reports it's done?  That would already be an
improvement IMHO. E.g. perf within the guest will point you
in the right direction and towards disabling hinting.


> > 
> >> Via virtio, I guess it is waiting for a response to a requests, right?
> > 
> > For the buffer to be used, yes. And it could mean putting some pages
> > aside until hypervisor is done with them. Then you don't need timers or
> > tricks like this, you can get an interrupt and start using the memory.
> 
> I am very open to such an approach as long as we can make it work and it
> is not too complicated. (-> simple)
> 
> This would mean for example
> 
> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
> 256/512.
> 
> 2. Once the buffer is full, do crazy "take pages out of the balloon
> action" and report them to the hypervisor via virtio. Let the VCPU
> continue. This will require some memory to store the request. Small
> hickup for the VCPU to kick of the reporting to the hypervisor.
> 
> 3. On interrupt/response, go over the response and put the pages back to
> the buddy.
> 
> (assuming that reporting a bulk of frees is better than reporting every
> single free obviously)
> 
> This could allow nice things like "when OOM gets trigger, see if pages
> are currently being reported and wait until they have been put back to
> the buddy, return "new pages available", so in a real "low on memory"
> scenario, no OOM killer would get involved. This could address the issue
> Wei had with reporting when low on memory.
> 
> Is that something you have in mind?

Yes that seems more future proof I think.

> I assume we would have to allocate
> memory when crafting the new requests. This is the only reason I tend to
> prefer a synchronous interface for now. But if allocation is not a
> problem, great.

There are two main ways to avoid allocation:
1. do not add extra data on top of each chunk passed
2. add extra data but pre-allocate buffers for it
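
To make option 2 a bit more concrete, here is a minimal user-space
sketch (all names and numbers below are made up for illustration;
nothing is taken from the patch set): a small, fixed pool of report
buffers is set up once, so queueing hints never calls the allocator,
and a hint is simply dropped while every buffer is still in flight.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define HINT_ENTRIES 256   /* pages per report, i.e. the "magic number" */
#define HINT_BUFS    2     /* one buffer filling while the other is in flight */

struct hint_buf {
	uint64_t pfn[HINT_ENTRIES];
	size_t len;
	bool in_flight;        /* handed to the host, waiting for its response */
};

static struct hint_buf bufs[HINT_BUFS];   /* pre-allocated once, never grows */

/* Queue one freed pfn; returns false if every buffer is busy, in which
 * case the hint is dropped instead of allocating a new buffer. */
static bool queue_hint(uint64_t pfn)
{
	for (int i = 0; i < HINT_BUFS; i++) {
		struct hint_buf *b = &bufs[i];

		if (b->in_flight)
			continue;
		b->pfn[b->len++] = pfn;
		if (b->len == HINT_ENTRIES) {
			b->in_flight = true;   /* a real guest would kick virtio here */
			b->len = 0;
		}
		return true;
	}
	return false;
}

int main(void)
{
	unsigned long queued = 0, dropped = 0;

	for (uint64_t pfn = 0; pfn < 1000; pfn++)
		queue_hint(pfn) ? queued++ : dropped++;
	printf("queued %lu, dropped %lu (no completion handler in this toy)\n",
	       queued, dropped);
	return 0;
}

The pool size is just a knob here; the trade-off is memory per VCPU
versus how often hints get dropped.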

> -- 
> 
> Thanks,
> 
> David / dhildenb
Nitesh Narayan Lal Feb. 18, 2019, 8:40 p.m. UTC | #44
On 2/18/19 3:31 PM, Michael S. Tsirkin wrote:
> On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
>>>>>>> So I'm fine with a simple implementation but the interface needs to
>>>>>>> allow the hypervisor to process hints in parallel while guest is
>>>>>>> running.  We can then fix any issues on hypervisor without breaking
>>>>>>> guests.
>>>>>> Yes, I am fine with defining an interface that theoretically let's us
>>>>>> change the implementation in the guest later.
>>>>>> I consider this even a
>>>>>> prerequisite. IMHO the interface shouldn't be different, it will be
>>>>>> exactly the same.
>>>>>>
>>>>>> It is just "who" calls the batch freeing and waits for it. And as I
>>>>>> outlined here, doing it without additional threads at least avoids us
>>>>>> for now having to think about dynamic data structures and that we can
>>>>>> sometimes not report "because the thread is still busy reporting or
>>>>>> wasn't scheduled yet".
>>>>> Sorry I wasn't clear. I think we need ability to change the
>>>>> implementation in the *host* later. IOW don't rely on
>>>>> host being synchronous.
>>>>>
>>>>>
>>>> I actually misread it :) . In any way, there has to be a mechanism to
>>>> synchronize.
>>>>
>>>> If we are going via a bare hypercall (like s390x, like what Alexander
>>>> proposes), it is going to be a synchronous interface either way. Just a
>>>> bare hypercall, there will not really be any blocking on the guest side.
>>> It bothers me that we are now tied to interface being synchronous. We
>>> won't be able to fix it if there's an issue as that would break guests.
>> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
>>
>> Yes, as I initially wrote, this mimics s390x. That might be good (we
>> know it has been working for years) and bad (we are inheriting the same
>> problem class, if it exists). And being synchronous is part of the
>> approach for now.
> BTW on s390 are these hypercalls handled by Linux?
>
>> I tend to focus on the first part (we don't know anything besides it is
>> working) while you focus on the second part (there could be a potential
>> problem). Having a real problem at hand would be great, then we would
>> know what exactly we actually have to fix. But read below.
> If we end up doing a hypercall per THP, maybe we could at least
> not block with interrupts disabled? Poll in guest until
> hypervisor reports its done?  That would already be an
> improvement IMHO. E.g. perf within guest will point you
> in the right direction and towards disabling hinting.
>
>
>>>> Via virtio, I guess it is waiting for a response to a requests, right?
>>> For the buffer to be used, yes. And it could mean putting some pages
>>> aside until hypervisor is done with them. Then you don't need timers or
>>> tricks like this, you can get an interrupt and start using the memory.
>> I am very open to such an approach as long as we can make it work and it
>> is not too complicated. (-> simple)
>>
>> This would mean for example
>>
>> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
>> 256/512.
>>
>> 2. Once the buffer is full, do crazy "take pages out of the balloon
>> action" and report them to the hypervisor via virtio. Let the VCPU
>> continue. This will require some memory to store the request. Small
>> hickup for the VCPU to kick of the reporting to the hypervisor.
>>
>> 3. On interrupt/response, go over the response and put the pages back to
>> the buddy.
>>
>> (assuming that reporting a bulk of frees is better than reporting every
>> single free obviously)
>>
>> This could allow nice things like "when OOM gets trigger, see if pages
>> are currently being reported and wait until they have been put back to
>> the buddy, return "new pages available", so in a real "low on memory"
>> scenario, no OOM killer would get involved. This could address the issue
>> Wei had with reporting when low on memory.
>>
>> Is that something you have in mind?
> Yes that seems more future proof I think.
>
>> I assume we would have to allocate
>> memory when crafting the new requests. This is the only reason I tend to
>> prefer a synchronous interface for now. But if allocation is not a
>> problem, great.
> There are two main ways to avoid allocation:
> 1. do not add extra data on top of each chunk passed
If I am not wrong then this is close to what we have right now.
One issue I see right now is that I am polling while the host is
freeing the memory.
In the next version I could tie the logic which returns pages to the
buddy and resets the per-cpu array index value to 0 to the callback
(i.e., it happens once we receive a response from the host).
Another change which I am testing right now is to only capture
'MAX_ORDER - 1' pages.
> 2. add extra data but pre-allocate buffers for it
>
>> -- 
>>
>> Thanks,
>>
>> David / dhildenb
David Hildenbrand Feb. 18, 2019, 8:53 p.m. UTC | #45
On 18.02.19 21:31, Michael S. Tsirkin wrote:
> On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
>>>>>>> So I'm fine with a simple implementation but the interface needs to
>>>>>>> allow the hypervisor to process hints in parallel while guest is
>>>>>>> running.  We can then fix any issues on hypervisor without breaking
>>>>>>> guests.
>>>>>>
>>>>>> Yes, I am fine with defining an interface that theoretically let's us
>>>>>> change the implementation in the guest later.
>>>>>> I consider this even a
>>>>>> prerequisite. IMHO the interface shouldn't be different, it will be
>>>>>> exactly the same.
>>>>>>
>>>>>> It is just "who" calls the batch freeing and waits for it. And as I
>>>>>> outlined here, doing it without additional threads at least avoids us
>>>>>> for now having to think about dynamic data structures and that we can
>>>>>> sometimes not report "because the thread is still busy reporting or
>>>>>> wasn't scheduled yet".
>>>>>
>>>>> Sorry I wasn't clear. I think we need ability to change the
>>>>> implementation in the *host* later. IOW don't rely on
>>>>> host being synchronous.
>>>>>
>>>>>
>>>> I actually misread it :) . In any way, there has to be a mechanism to
>>>> synchronize.
>>>>
>>>> If we are going via a bare hypercall (like s390x, like what Alexander
>>>> proposes), it is going to be a synchronous interface either way. Just a
>>>> bare hypercall, there will not really be any blocking on the guest side.
>>>
>>> It bothers me that we are now tied to interface being synchronous. We
>>> won't be able to fix it if there's an issue as that would break guests.
>>
>> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
>>
>> Yes, as I initially wrote, this mimics s390x. That might be good (we
>> know it has been working for years) and bad (we are inheriting the same
>> problem class, if it exists). And being synchronous is part of the
>> approach for now.
> 
> BTW on s390 are these hypercalls handled by Linux?

I assume you mean in KVM - Yes! There is a hardware assist to handle the
"queuing of 512 pfns" but once the buffer is full, the actual hypercall
intercept will be triggered.

arch/s390/kvm/priv.c:handle_essa()

The interesting part is

down_read(&gmap->mm->mmap_sem);
for (i = 0; i < entries; ++i)
	__gmap_zap(gmap, cbrlo[i]);
up_read(&gmap->mm->mmap_sem);

cbrlo is the pfn array stored in the hypervisor.

> 
>> I tend to focus on the first part (we don't know anything besides it is
>> working) while you focus on the second part (there could be a potential
>> problem). Having a real problem at hand would be great, then we would
>> know what exactly we actually have to fix. But read below.
> 
> If we end up doing a hypercall per THP, maybe we could at least
> not block with interrupts disabled? Poll in guest until
> hypervisor reports its done?  That would already be an
> improvement IMHO. E.g. perf within guest will point you
> in the right direction and towards disabling hinting.

I think we always have the option to busy loop where we consider it
more helpful. On synchronous hypercalls, no waiting is necessary. Only
on asynchronous ones (which would most probably be virtio based).

I don't think only reporting THP will be future proof. So whatever we
come up with has to be able to deal with smaller granularities. Not
necessarily page granularity, but at least some other orders. The only
way to avoid the overhead of many hypercalls is then to report multiple
chunks in one shot.
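
One way to keep the interface order-agnostic would be to report
(pfn, order) pairs, so a single request can carry chunks of mixed
sizes. A rough sketch of such an entry (the field names are assumptions
on my side, not a proposed ABI):

#include <stdint.h>

/* One entry per free chunk; a batch of these would go out in a single
 * request, whatever the transport (hypercall or virtio) ends up being. */
struct free_hint_entry {
	uint64_t pfn;     /* first page frame number of the chunk */
	uint32_t order;   /* buddy order, so 4k pages and THP-sized chunks share one format */
	uint32_t rsvd;    /* padding / room for future flags */
};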

>>>
>>>> Via virtio, I guess it is waiting for a response to a requests, right?
>>>
>>> For the buffer to be used, yes. And it could mean putting some pages
>>> aside until hypervisor is done with them. Then you don't need timers or
>>> tricks like this, you can get an interrupt and start using the memory.
>>
>> I am very open to such an approach as long as we can make it work and it
>> is not too complicated. (-> simple)
>>
>> This would mean for example
>>
>> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
>> 256/512.
>>
>> 2. Once the buffer is full, do crazy "take pages out of the balloon
>> action" and report them to the hypervisor via virtio. Let the VCPU
>> continue. This will require some memory to store the request. Small
>> hickup for the VCPU to kick of the reporting to the hypervisor.
>>
>> 3. On interrupt/response, go over the response and put the pages back to
>> the buddy.
>>
>> (assuming that reporting a bulk of frees is better than reporting every
>> single free obviously)
>>
>> This could allow nice things like "when OOM gets trigger, see if pages
>> are currently being reported and wait until they have been put back to
>> the buddy, return "new pages available", so in a real "low on memory"
>> scenario, no OOM killer would get involved. This could address the issue
>> Wei had with reporting when low on memory.
>>
>> Is that something you have in mind?
> 
> Yes that seems more future proof I think.

And it would satisfy your request for an asynchronous interface. + we
would get rid of the kthread(s).

> 
>> I assume we would have to allocate
>> memory when crafting the new requests. This is the only reason I tend to
>> prefer a synchronous interface for now. But if allocation is not a
>> problem, great.
> 
> There are two main ways to avoid allocation:
> 1. do not add extra data on top of each chunk passed
> 2. add extra data but pre-allocate buffers for it
> 

It could theoretically happen that, while the old VCPU buffer is still
being reported, we want to free a page, so I assume we need a new
buffer. Busy waiting is an option (hmm), or we would have to skip that
page, but that is something I want to avoid. So allocating memory for
the request seems to be the cleanest approach.

But after all, as we are literally allocating buddy pages to report
temporarily, we can most probably also allocate memory. We will have to
look into the details.

So the options I see so far are

1. Do a synchronous hypercall, reporting a bunch of pages as described
initially in this thread. Release the pages to the buddy when returning
from the hypercall.

2. Do an asynchronous hypercall (allocating memory for the request),
reporting a bunch of pages. Release the pages to the buddy on response,
via interrupt.
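
To illustrate option 2, a hedged sketch of just the completion path
(names are illustrative and the buddy is a dummy stand-in): on the
response interrupt the pages become usable again and the request memory
itself can be released, without a kthread and without busy waiting.

#include <stddef.h>
#include <stdint.h>

/* Illustrative request layout, allocated when the per-VCPU buffer fills. */
struct hint_request {
	uint64_t pfn[256];
	size_t len;
};

/* Stand-in for handing an isolated page back to the buddy allocator. */
static void give_page_back_to_buddy(uint64_t pfn)
{
	(void)pfn;   /* nothing to do in this toy */
}

/* Would run from the completion interrupt: the host is done with these
 * pages, so return them and free the request afterwards. */
static void hint_request_done(struct hint_request *req)
{
	for (size_t i = 0; i < req->len; i++)
		give_page_back_to_buddy(req->pfn[i]);
}

int main(void)
{
	struct hint_request req = { .pfn = { 1234 }, .len = 1 };

	hint_request_done(&req);   /* pretend the host just acked the request */
	return 0;
}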

Thanks for the helpful discussion Michael!
David Hildenbrand Feb. 18, 2019, 9:04 p.m. UTC | #46
On 18.02.19 21:40, Nitesh Narayan Lal wrote:
> On 2/18/19 3:31 PM, Michael S. Tsirkin wrote:
>> On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
>>>>>>>> So I'm fine with a simple implementation but the interface needs to
>>>>>>>> allow the hypervisor to process hints in parallel while guest is
>>>>>>>> running.  We can then fix any issues on hypervisor without breaking
>>>>>>>> guests.
>>>>>>> Yes, I am fine with defining an interface that theoretically let's us
>>>>>>> change the implementation in the guest later.
>>>>>>> I consider this even a
>>>>>>> prerequisite. IMHO the interface shouldn't be different, it will be
>>>>>>> exactly the same.
>>>>>>>
>>>>>>> It is just "who" calls the batch freeing and waits for it. And as I
>>>>>>> outlined here, doing it without additional threads at least avoids us
>>>>>>> for now having to think about dynamic data structures and that we can
>>>>>>> sometimes not report "because the thread is still busy reporting or
>>>>>>> wasn't scheduled yet".
>>>>>> Sorry I wasn't clear. I think we need ability to change the
>>>>>> implementation in the *host* later. IOW don't rely on
>>>>>> host being synchronous.
>>>>>>
>>>>>>
>>>>> I actually misread it :) . In any way, there has to be a mechanism to
>>>>> synchronize.
>>>>>
>>>>> If we are going via a bare hypercall (like s390x, like what Alexander
>>>>> proposes), it is going to be a synchronous interface either way. Just a
>>>>> bare hypercall, there will not really be any blocking on the guest side.
>>>> It bothers me that we are now tied to interface being synchronous. We
>>>> won't be able to fix it if there's an issue as that would break guests.
>>> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
>>>
>>> Yes, as I initially wrote, this mimics s390x. That might be good (we
>>> know it has been working for years) and bad (we are inheriting the same
>>> problem class, if it exists). And being synchronous is part of the
>>> approach for now.
>> BTW on s390 are these hypercalls handled by Linux?
>>
>>> I tend to focus on the first part (we don't know anything besides it is
>>> working) while you focus on the second part (there could be a potential
>>> problem). Having a real problem at hand would be great, then we would
>>> know what exactly we actually have to fix. But read below.
>> If we end up doing a hypercall per THP, maybe we could at least
>> not block with interrupts disabled? Poll in guest until
>> hypervisor reports its done?  That would already be an
>> improvement IMHO. E.g. perf within guest will point you
>> in the right direction and towards disabling hinting.
>>
>>
>>>>> Via virtio, I guess it is waiting for a response to a requests, right?
>>>> For the buffer to be used, yes. And it could mean putting some pages
>>>> aside until hypervisor is done with them. Then you don't need timers or
>>>> tricks like this, you can get an interrupt and start using the memory.
>>> I am very open to such an approach as long as we can make it work and it
>>> is not too complicated. (-> simple)
>>>
>>> This would mean for example
>>>
>>> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
>>> 256/512.
>>>
>>> 2. Once the buffer is full, do crazy "take pages out of the balloon
>>> action" and report them to the hypervisor via virtio. Let the VCPU
>>> continue. This will require some memory to store the request. Small
>>> hickup for the VCPU to kick of the reporting to the hypervisor.
>>>
>>> 3. On interrupt/response, go over the response and put the pages back to
>>> the buddy.
>>>
>>> (assuming that reporting a bulk of frees is better than reporting every
>>> single free obviously)
>>>
>>> This could allow nice things like "when OOM gets trigger, see if pages
>>> are currently being reported and wait until they have been put back to
>>> the buddy, return "new pages available", so in a real "low on memory"
>>> scenario, no OOM killer would get involved. This could address the issue
>>> Wei had with reporting when low on memory.
>>>
>>> Is that something you have in mind?
>> Yes that seems more future proof I think.
>>
>>> I assume we would have to allocate
>>> memory when crafting the new requests. This is the only reason I tend to
>>> prefer a synchronous interface for now. But if allocation is not a
>>> problem, great.
>> There are two main ways to avoid allocation:
>> 1. do not add extra data on top of each chunk passed
> If I am not wrong then this is close to what we have right now.

Yes, minus the kthread(s) and eventually with some sort of memory
allocation for the request. Once you're asynchronous via a notification
mechanism, there is no real need for a thread anymore, hopefully.

> One issue I see right now is that I am polling while host is freeing the
> memory.
> In the next version I could tie the logic which returns pages to the
> buddy and resets the per cpu array index value to 0 with the callback.
> (i.e.., it happens once we receive an response from the host)

The question is, what happens when freeing pages and the array is not
ready to be reused yet. In that case, you want to somehow continue
freeing pages without busy waiting or eventually not reporting pages.

The callback should put the pages back to the buddy and free the request
eventually to have a fully asynchronous mechanism.

> Other change which I am testing right now is to only capture 'MAX_ORDER

I am not sure if this is an arbitrary number we came up with here. We
should really play with different orders to find a hot spot. I wouldn't
consider this high priority, though. Getting the whole concept right to
be able to deal with any magic number we come up with should be the
ultimate goal. (Stuff that only works with huge pages I consider not
future proof, especially regarding fragmented guests, which can happen
easily.)
Alexander Duyck Feb. 18, 2019, 11:47 p.m. UTC | #47
On Mon, Feb 18, 2019 at 9:42 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 18.02.19 18:31, Alexander Duyck wrote:
> > On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 18.02.19 17:49, Michael S. Tsirkin wrote:
> >>> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
> >>>> It would be worth a try. My feeling is that a synchronous report after
> >>>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> >>>> s390x. (basically always enabled, nobody complains).
> >>>
> >>> What slips under the radar on an arch like s390 might
> >>> raise issues for a popular arch like x86. My fear would be
> >>> if it's only a problem e.g. for realtime. Then you get
> >>> a condition that's very hard to trigger and affects
> >>> worst case latencies.
> >>
> >> Realtime should never use free page hinting. Just like it should never
> >> use ballooning. Just like it should pin all pages in the hypervisor.
> >>
> >>>
> >>> But really what business has something that is supposedly
> >>> an optimization blocking a VCPU? We are just freeing up
> >>> lots of memory why is it a good idea to slow that
> >>> process down?
> >>
> >> I first want to know that it is a problem before we declare it a
> >> problem. I provided an example (s390x) where it does not seem to be a
> >> problem. One hypercall ~every 512 frees. As simple as it can get.
> >>
> >> No trying to deny that it could be a problem on x86, but then I assume
> >> it is only a problem in specific setups.
> >>
> >> I would much rather prefer a simple solution that can eventually be
> >> disabled in selected setup than a complicated solution that tries to fit
> >> all possible setups. Realtime is one of the examples where such stuff is
> >> to be disabled either way.
> >>
> >> Optimization of space comes with a price (here: execution time).
> >
> > One thing to keep in mind though is that if you are already having to
> > pull pages in and out of swap on the host in order be able to provide
> > enough memory for the guests the free page hinting should be a
> > significant win in terms of performance.
>
> Indeed. And also we are in a virtualized environment already, we can
> have any kind of sudden hickups. (again, realtime has special
> requirements on the setup)
>
> Side note: I like your approach because it is simple. I don't like your
> approach because it cannot deal with fragmented memory. And that can
> happen easily.
>
> The idea I described here can be similarly be an extension of your
> approach, merging in a "batched reporting" Nitesh proposed, so we can
> report on something < MAX_ORDER, similar to s390x. In the end it boils
> down to reporting via hypercall vs. reporting via virtio. The main point
> is that it is synchronous and batched. (and that we properly take care
> of the race between host freeing and guest allocation)

I'd say the discussion is even simpler than that. My concern is more
synchronous versus asynchronous. I honestly think the cost for a
synchronous call is being overblown and we are likely to see the fault
and zeroing of pages cost more than the hypercall or virtio
transaction itself.

Also one reason why I am not a fan of working with anything less than
PMD order is because there have been issues in the past with false
memory leaks being created when hints were provided on THP pages that
essentially fragmented them. I guess khugepaged went through and
started trying to reassemble the huge pages and as a result there have
been apps that ended up consuming more memory than they would have
otherwise since they were using fragments of THP pages after doing an
MADV_DONTNEED on sections of the page.

> >
> > So far with my patch set that hints at the PMD level w/ THP enabled I
> > am not really seeing that much overhead for the hypercalls. The bigger
> > piece that is eating up CPU time is all the page faults and page
> > zeroing that is going on as we are cycling the memory in and out of
> > the guest. Some of that could probably be resolved by using MADV_FREE,
> > but if we are under actual memory pressure I suspect it would behave
> > similar to MADV_DONTNEED.
> >
>
> MADV_FREE is certainly the better thing to do for hinting in my opinion.
> It should result in even less overhead. Thanks for the comment about the
> hypercall overhead.

Yeah, no problem. The only thing I don't like about MADV_FREE is that
you have to have memory pressure before the pages really start getting
scrubbed, which is both a benefit and a drawback. Basically it defers
the freeing until you are under actual memory pressure, so when you hit
that case things start feeling much slower; that, and it limits your
allocations, since the kernel doesn't recognize the pages as free until
it would have to start trying to push memory to swap.
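
For reference, the behavior being compared is visible from plain user
space; a small standalone example (not part of the patch set)
contrasting the two host-side primitives:

#define _DEFAULT_SOURCE
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL << 20;            /* one 2 MiB chunk, THP-sized on x86 */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	memset(p, 0xaa, len);              /* fault the memory in */

	madvise(p, len, MADV_DONTNEED);    /* eager: backing pages are dropped now */
	memset(p, 0xaa, len);              /* touching it again refaults zeroed pages */

#ifdef MADV_FREE
	madvise(p, len, MADV_FREE);        /* lazy: only reclaimed under memory pressure */
#endif
	munmap(p, len);
	return 0;
}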
Alexander Duyck Feb. 19, 2019, 12:01 a.m. UTC | #48
On Mon, Feb 18, 2019 at 1:04 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 18.02.19 21:40, Nitesh Narayan Lal wrote:
> > On 2/18/19 3:31 PM, Michael S. Tsirkin wrote:
> >> On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
> >>>>>>>> So I'm fine with a simple implementation but the interface needs to
> >>>>>>>> allow the hypervisor to process hints in parallel while guest is
> >>>>>>>> running.  We can then fix any issues on hypervisor without breaking
> >>>>>>>> guests.
> >>>>>>> Yes, I am fine with defining an interface that theoretically let's us
> >>>>>>> change the implementation in the guest later.
> >>>>>>> I consider this even a
> >>>>>>> prerequisite. IMHO the interface shouldn't be different, it will be
> >>>>>>> exactly the same.
> >>>>>>>
> >>>>>>> It is just "who" calls the batch freeing and waits for it. And as I
> >>>>>>> outlined here, doing it without additional threads at least avoids us
> >>>>>>> for now having to think about dynamic data structures and that we can
> >>>>>>> sometimes not report "because the thread is still busy reporting or
> >>>>>>> wasn't scheduled yet".
> >>>>>> Sorry I wasn't clear. I think we need ability to change the
> >>>>>> implementation in the *host* later. IOW don't rely on
> >>>>>> host being synchronous.
> >>>>>>
> >>>>>>
> >>>>> I actually misread it :) . In any way, there has to be a mechanism to
> >>>>> synchronize.
> >>>>>
> >>>>> If we are going via a bare hypercall (like s390x, like what Alexander
> >>>>> proposes), it is going to be a synchronous interface either way. Just a
> >>>>> bare hypercall, there will not really be any blocking on the guest side.
> >>>> It bothers me that we are now tied to interface being synchronous. We
> >>>> won't be able to fix it if there's an issue as that would break guests.
> >>> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
> >>>
> >>> Yes, as I initially wrote, this mimics s390x. That might be good (we
> >>> know it has been working for years) and bad (we are inheriting the same
> >>> problem class, if it exists). And being synchronous is part of the
> >>> approach for now.
> >> BTW on s390 are these hypercalls handled by Linux?
> >>
> >>> I tend to focus on the first part (we don't know anything besides it is
> >>> working) while you focus on the second part (there could be a potential
> >>> problem). Having a real problem at hand would be great, then we would
> >>> know what exactly we actually have to fix. But read below.
> >> If we end up doing a hypercall per THP, maybe we could at least
> >> not block with interrupts disabled? Poll in guest until
> >> hypervisor reports its done?  That would already be an
> >> improvement IMHO. E.g. perf within guest will point you
> >> in the right direction and towards disabling hinting.
> >>
> >>
> >>>>> Via virtio, I guess it is waiting for a response to a requests, right?
> >>>> For the buffer to be used, yes. And it could mean putting some pages
> >>>> aside until hypervisor is done with them. Then you don't need timers or
> >>>> tricks like this, you can get an interrupt and start using the memory.
> >>> I am very open to such an approach as long as we can make it work and it
> >>> is not too complicated. (-> simple)
> >>>
> >>> This would mean for example
> >>>
> >>> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
> >>> 256/512.
> >>>
> >>> 2. Once the buffer is full, do crazy "take pages out of the balloon
> >>> action" and report them to the hypervisor via virtio. Let the VCPU
> >>> continue. This will require some memory to store the request. Small
> >>> hickup for the VCPU to kick of the reporting to the hypervisor.
> >>>
> >>> 3. On interrupt/response, go over the response and put the pages back to
> >>> the buddy.
> >>>
> >>> (assuming that reporting a bulk of frees is better than reporting every
> >>> single free obviously)
> >>>
> >>> This could allow nice things like "when OOM gets trigger, see if pages
> >>> are currently being reported and wait until they have been put back to
> >>> the buddy, return "new pages available", so in a real "low on memory"
> >>> scenario, no OOM killer would get involved. This could address the issue
> >>> Wei had with reporting when low on memory.
> >>>
> >>> Is that something you have in mind?
> >> Yes that seems more future proof I think.
> >>
> >>> I assume we would have to allocate
> >>> memory when crafting the new requests. This is the only reason I tend to
> >>> prefer a synchronous interface for now. But if allocation is not a
> >>> problem, great.
> >> There are two main ways to avoid allocation:
> >> 1. do not add extra data on top of each chunk passed
> > If I am not wrong then this is close to what we have right now.
>
> Yes, minus the kthread(s) and eventually with some sort of memory
> allocation for the request. Once you're asynchronous via a notification
> mechanisnm, there is no real need for a thread anymore, hopefully.
>
> > One issue I see right now is that I am polling while host is freeing the
> > memory.
> > In the next version I could tie the logic which returns pages to the
> > buddy and resets the per cpu array index value to 0 with the callback.
> > (i.e.., it happens once we receive an response from the host)
>
> The question is, what happens when freeing pages and the array is not
> ready to be reused yet. In that case, you want to somehow continue
> freeing pages without busy waiting or eventually not reporting pages.
>
> The callback should put the pages back to the buddy and free the request
> eventually to have a fully asynchronous mechanism.
>
> > Other change which I am testing right now is to only capture 'MAX_ORDER
>
> I am not sure if this is an arbitrary number we came up with here. We
> should really play with different orders to find a hot spot. I wouldn't
> consider this high priority, though. Getting the whole concept right to
> be able to deal with any magic number we come up should be the ultimate
> goal. (stuff that only works with huge pages I consider not future
> proof, especially regarding fragmented guests which can happen easily)

This essentially just ends up being another trade-off of CPU versus
memory though. Assuming we aren't using THP, we are going to take a
penalty in terms of performance but could then free individual chunks
smaller than HUGETLB_PAGE_ORDER; however, the CPU utilization is going
to be much higher in general even without the hinting. I figure for x86
we probably don't have too many options since, if I am not mistaken,
MAX_ORDER is just one or two more than HUGETLB_PAGE_ORDER.

As far as fragmentation goes, my thought is that we may want to look
into adding support to the guest for prioritizing defragmentation of
pages lower than THP size. That way we could maintain the higher
overall performance with or without the hinting, since shuffling lower
order pages around between guests would start to get expensive pretty
quickly.
Michael S. Tsirkin Feb. 19, 2019, 2:45 a.m. UTC | #49
On Mon, Feb 18, 2019 at 03:47:22PM -0800, Alexander Duyck wrote:
> > > So far with my patch set that hints at the PMD level w/ THP enabled I
> > > am not really seeing that much overhead for the hypercalls. The bigger
> > > piece that is eating up CPU time is all the page faults and page
> > > zeroing that is going on as we are cycling the memory in and out of
> > > the guest. Some of that could probably be resolved by using MADV_FREE,
> > > but if we are under actual memory pressure I suspect it would behave
> > > similar to MADV_DONTNEED.
> > >
> >
> > MADV_FREE is certainly the better thing to do for hinting in my opinion.
> > It should result in even less overhead. Thanks for the comment about the
> > hypercall overhead.
> 
> Yeah, no problem. The only thing I don't like about MADV_FREE is that
> you have to have memory pressure before the pages really start getting
> scrubbed with is both a benefit and a drawback. Basically it defers
> the freeing until you are under actual memory pressure so when you hit
> that case things start feeling much slower, that and it limits your
> allocations since the kernel doesn't recognize the pages as free until
> it would have to start trying to push memory to swap.

For sure if someone *wants* to spend cycles freeing memory,
we could add a system call that does exactly that. There's
no reason to force that on the same CPU while VCPU
is stopped though.
Andrea Arcangeli Feb. 19, 2019, 2:46 a.m. UTC | #50
Hello,

On Mon, Feb 18, 2019 at 03:47:22PM -0800, Alexander Duyck wrote:
> essentially fragmented them. I guess hugepaged went through and
> started trying to reassemble the huge pages and as a result there have
> been apps that ended up consuming more memory than they would have
> otherwise since they were using fragments of THP pages after doing an
> MADV_DONTNEED on sections of the page.

With relatively recent kernels MADV_DONTNEED doesn't necessarily free
anything when it's applied to a THP subpage; it only splits the
pagetables and queues the THP for deferred splitting. If there's
memory pressure a shrinker will be invoked and the queue is scanned
and the THPs are physically split, but to be reassembled/collapsed
after a physical split it requires at least one young pte.

If this is particularly problematic for page hinting, this behavior,
where the MADV_DONTNEED can be undone by khugepaged (if some subpage is
being frequently accessed), can be turned off by setting
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none to
0. Then the THP will only be collapsed if all 512 subpages are mapped
(i.e. they've all been re-allocated by the guest).
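
In case a concrete example helps, setting the knob from C (nothing more
than echoing 0 into the sysfs file named above; needs root):

#include <stdio.h>

int main(void)
{
	/* 0: khugepaged only collapses a THP again once all 512 subpages are mapped */
	const char *knob =
		"/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none";
	FILE *f = fopen(knob, "w");

	if (!f)
		return 1;
	fputs("0\n", f);
	return fclose(f) ? 1 : 0;
}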

Regardless of the max_ptes_none default, keeping the smaller guest
buddy orders as the last target for page hinting should be good for
performance.

> Yeah, no problem. The only thing I don't like about MADV_FREE is that
> you have to have memory pressure before the pages really start getting
> scrubbed with is both a benefit and a drawback. Basically it defers
> the freeing until you are under actual memory pressure so when you hit
> that case things start feeling much slower, that and it limits your
> allocations since the kernel doesn't recognize the pages as free until
> it would have to start trying to push memory to swap.

The guest allocation behavior should not be influenced by MADV_FREE vs
MADV_DONTNEED, the guest can't see the difference anyway, so why
should it limit the allocations?

The benefit of MADV_FREE should be that when the same guest frees and
reallocates a huge amount of RAM (i.e. a guest app allocating and
freeing lots of RAM in a loop, not so uncommon), there will be no KVM
page fault during guest re-allocations. So in absence of memory
pressure in the host it should be a major win. Overall it sounds like
a good tradeoff compared to MADV_DONTNEED that forcefully invokes MMU
notifiers and forces host allocations and KVM page faults in order to
reallocate the same RAM in the same guest.

When there's memory pressure it's up to the host Linux VM to notice
there's plenty of MADV_FREE material to free at zero I/O cost before
starting swapping.

Thanks,
Andrea
David Hildenbrand Feb. 19, 2019, 7:54 a.m. UTC | #51
On 19.02.19 01:01, Alexander Duyck wrote:
> On Mon, Feb 18, 2019 at 1:04 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 18.02.19 21:40, Nitesh Narayan Lal wrote:
>>> On 2/18/19 3:31 PM, Michael S. Tsirkin wrote:
>>>> On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
>>>>>>>>>> So I'm fine with a simple implementation but the interface needs to
>>>>>>>>>> allow the hypervisor to process hints in parallel while guest is
>>>>>>>>>> running.  We can then fix any issues on hypervisor without breaking
>>>>>>>>>> guests.
>>>>>>>>> Yes, I am fine with defining an interface that theoretically let's us
>>>>>>>>> change the implementation in the guest later.
>>>>>>>>> I consider this even a
>>>>>>>>> prerequisite. IMHO the interface shouldn't be different, it will be
>>>>>>>>> exactly the same.
>>>>>>>>>
>>>>>>>>> It is just "who" calls the batch freeing and waits for it. And as I
>>>>>>>>> outlined here, doing it without additional threads at least avoids us
>>>>>>>>> for now having to think about dynamic data structures and that we can
>>>>>>>>> sometimes not report "because the thread is still busy reporting or
>>>>>>>>> wasn't scheduled yet".
>>>>>>>> Sorry I wasn't clear. I think we need ability to change the
>>>>>>>> implementation in the *host* later. IOW don't rely on
>>>>>>>> host being synchronous.
>>>>>>>>
>>>>>>>>
>>>>>>> I actually misread it :) . In any way, there has to be a mechanism to
>>>>>>> synchronize.
>>>>>>>
>>>>>>> If we are going via a bare hypercall (like s390x, like what Alexander
>>>>>>> proposes), it is going to be a synchronous interface either way. Just a
>>>>>>> bare hypercall, there will not really be any blocking on the guest side.
>>>>>> It bothers me that we are now tied to interface being synchronous. We
>>>>>> won't be able to fix it if there's an issue as that would break guests.
>>>>> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
>>>>>
>>>>> Yes, as I initially wrote, this mimics s390x. That might be good (we
>>>>> know it has been working for years) and bad (we are inheriting the same
>>>>> problem class, if it exists). And being synchronous is part of the
>>>>> approach for now.
>>>> BTW on s390 are these hypercalls handled by Linux?
>>>>
>>>>> I tend to focus on the first part (we don't know anything besides it is
>>>>> working) while you focus on the second part (there could be a potential
>>>>> problem). Having a real problem at hand would be great, then we would
>>>>> know what exactly we actually have to fix. But read below.
>>>> If we end up doing a hypercall per THP, maybe we could at least
>>>> not block with interrupts disabled? Poll in guest until
>>>> hypervisor reports its done?  That would already be an
>>>> improvement IMHO. E.g. perf within guest will point you
>>>> in the right direction and towards disabling hinting.
>>>>
>>>>
>>>>>>> Via virtio, I guess it is waiting for a response to a requests, right?
>>>>>> For the buffer to be used, yes. And it could mean putting some pages
>>>>>> aside until hypervisor is done with them. Then you don't need timers or
>>>>>> tricks like this, you can get an interrupt and start using the memory.
>>>>> I am very open to such an approach as long as we can make it work and it
>>>>> is not too complicated. (-> simple)
>>>>>
>>>>> This would mean for example
>>>>>
>>>>> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
>>>>> 256/512.
>>>>>
>>>>> 2. Once the buffer is full, do crazy "take pages out of the balloon
>>>>> action" and report them to the hypervisor via virtio. Let the VCPU
>>>>> continue. This will require some memory to store the request. Small
>>>>> hickup for the VCPU to kick of the reporting to the hypervisor.
>>>>>
>>>>> 3. On interrupt/response, go over the response and put the pages back to
>>>>> the buddy.
>>>>>
>>>>> (assuming that reporting a bulk of frees is better than reporting every
>>>>> single free obviously)
>>>>>
>>>>> This could allow nice things like "when OOM gets trigger, see if pages
>>>>> are currently being reported and wait until they have been put back to
>>>>> the buddy, return "new pages available", so in a real "low on memory"
>>>>> scenario, no OOM killer would get involved. This could address the issue
>>>>> Wei had with reporting when low on memory.
>>>>>
>>>>> Is that something you have in mind?
>>>> Yes that seems more future proof I think.
>>>>
>>>>> I assume we would have to allocate
>>>>> memory when crafting the new requests. This is the only reason I tend to
>>>>> prefer a synchronous interface for now. But if allocation is not a
>>>>> problem, great.
>>>> There are two main ways to avoid allocation:
>>>> 1. do not add extra data on top of each chunk passed
>>> If I am not wrong then this is close to what we have right now.
>>
>> Yes, minus the kthread(s) and eventually with some sort of memory
>> allocation for the request. Once you're asynchronous via a notification
>> mechanisnm, there is no real need for a thread anymore, hopefully.
>>
>>> One issue I see right now is that I am polling while host is freeing the
>>> memory.
>>> In the next version I could tie the logic which returns pages to the
>>> buddy and resets the per cpu array index value to 0 with the callback.
>>> (i.e.., it happens once we receive an response from the host)
>>
>> The question is, what happens when freeing pages and the array is not
>> ready to be reused yet. In that case, you want to somehow continue
>> freeing pages without busy waiting or eventually not reporting pages.
>>
>> The callback should put the pages back to the buddy and free the request
>> eventually to have a fully asynchronous mechanism.
>>
>>> Other change which I am testing right now is to only capture 'MAX_ORDER
>>
>> I am not sure if this is an arbitrary number we came up with here. We
>> should really play with different orders to find a hot spot. I wouldn't
>> consider this high priority, though. Getting the whole concept right to
>> be able to deal with any magic number we come up should be the ultimate
>> goal. (stuff that only works with huge pages I consider not future
>> proof, especially regarding fragmented guests which can happen easily)
> 
> This essentially just ends up being another trade-off of CPU versus
> memory though. Assuming we aren't using THP we are going to take a
> penalty in terms of performance but could then free individual pages
> less than HUGETLB_PAGE_ORDER, but the CPU utilization is going to be
> much higher in general even without the hinting. I figure for x86 we
> probably don't have too many options since if I am not mistaken
> MAX_ORDER is just one or two more than HUGETLB_PAGE_ORDER.

THP is an implementation detail in the hypervisor. Yes, it is the common
case on x86. But it is e.g. not available on s390x yet. And we also want
this mechanism to work on s390x (e.g. for nested virtualization setups
as discussed).

If we e.g. report any granularity after merging was done in the buddy,
we could end up reporting everything from page size up to order
MAX_ORDER - 1; the hypervisor could ignore hints below a certain magic
number if that makes its life easier.

> 
> As far as fragmentation my thought is that we may want to look into
> adding support to the guest for prioritizing defragmentation on pages
> lower than THP size. Then that way we could maintain the higher
> overall performance with or without the hinting since shuffling lower
> order pages around between guests would start to get expensive pretty
> quick.

My take would be: design an interface/mechanism that allows any kind of
granularity. You can then balance between CPU overhead and space shifting.

I feel like repeating myself, but on s390x hinting is done on page
granularity, and I have never heard somebody say "how can I turn it off,
this is slowing down my system too much.". All we know is that one
hypercall per free is most probably not acceptable. We really have to
play with the numbers.

I tend to like an asynchronous reporting approach as discussed in this
thread, we would have to see if Nitesh could get it implemented.

Thanks Alexander!
David Hildenbrand Feb. 19, 2019, 8:06 a.m. UTC | #52
On 19.02.19 00:47, Alexander Duyck wrote:
> On Mon, Feb 18, 2019 at 9:42 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 18.02.19 18:31, Alexander Duyck wrote:
>>> On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 18.02.19 17:49, Michael S. Tsirkin wrote:
>>>>> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
>>>>>> It would be worth a try. My feeling is that a synchronous report after
>>>>>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
>>>>>> s390x. (basically always enabled, nobody complains).
>>>>>
>>>>> What slips under the radar on an arch like s390 might
>>>>> raise issues for a popular arch like x86. My fear would be
>>>>> if it's only a problem e.g. for realtime. Then you get
>>>>> a condition that's very hard to trigger and affects
>>>>> worst case latencies.
>>>>
>>>> Realtime should never use free page hinting. Just like it should never
>>>> use ballooning. Just like it should pin all pages in the hypervisor.
>>>>
>>>>>
>>>>> But really what business has something that is supposedly
>>>>> an optimization blocking a VCPU? We are just freeing up
>>>>> lots of memory why is it a good idea to slow that
>>>>> process down?
>>>>
>>>> I first want to know that it is a problem before we declare it a
>>>> problem. I provided an example (s390x) where it does not seem to be a
>>>> problem. One hypercall ~every 512 frees. As simple as it can get.
>>>>
>>>> No trying to deny that it could be a problem on x86, but then I assume
>>>> it is only a problem in specific setups.
>>>>
>>>> I would much rather prefer a simple solution that can eventually be
>>>> disabled in selected setup than a complicated solution that tries to fit
>>>> all possible setups. Realtime is one of the examples where such stuff is
>>>> to be disabled either way.
>>>>
>>>> Optimization of space comes with a price (here: execution time).
>>>
>>> One thing to keep in mind though is that if you are already having to
>>> pull pages in and out of swap on the host in order be able to provide
>>> enough memory for the guests the free page hinting should be a
>>> significant win in terms of performance.
>>
>> Indeed. And also we are in a virtualized environment already, we can
>> have any kind of sudden hickups. (again, realtime has special
>> requirements on the setup)
>>
>> Side note: I like your approach because it is simple. I don't like your
>> approach because it cannot deal with fragmented memory. And that can
>> happen easily.
>>
>> The idea I described here can be similarly be an extension of your
>> approach, merging in a "batched reporting" Nitesh proposed, so we can
>> report on something < MAX_ORDER, similar to s390x. In the end it boils
>> down to reporting via hypercall vs. reporting via virtio. The main point
>> is that it is synchronous and batched. (and that we properly take care
>> of the race between host freeing and guest allocation)
> 
> I'd say the discussion is even simpler then that. My concern is more
> synchronous versus asynchronous. I honestly think the cost for a
> synchronous call is being overblown and we are likely to see the fault
> and zeroing of pages cost more than the hypercall or virtio
> transaction itself.

The overhead of page faults and zeroing should be mitigated by
MADV_FREE, as Andrea correctly stated (thanks!). Then, the call overhead
(context switch) becomes relevant.

We have various discussions now :) And I think they are related.

synchronous versus asynchronous
batched vs. non-batched
MAX_ORDER - 1 vs. other/none magic number

1. synchronous call without batching on every kfree is bad. The
interface is fixed to big magic numbers, otherwise we end up having a
hypercall on every kfree. This is your approach.

2. asynchronous calls without batching would most probably have similar
problems with small granularities as we had when ballooning without
batching. Just overhead we can avoid.

3. synchronous and batched is what s390x does. It can deal with page
granularity. It is what I initially described in this sub-thread.

4. asynchronous and batched. This is the other approach we discussed
yesterday. If we can get it implemented, I would be interested in
performance numbers.

As far as I understood, Michael seems to favor something like 4 (and I
assume eventually 2 if it is similarly fast). I am a friend of either 3
or 4.

> 
> Also one reason why I am not a fan of working with anything less than
> PMD order is because there have been issues in the past with false
> memory leaks being created when hints were provided on THP pages that
> essentially fragmented them. I guess hugepaged went through and
> started trying to reassemble the huge pages and as a result there have
> been apps that ended up consuming more memory than they would have
> otherwise since they were using fragments of THP pages after doing an
> MADV_DONTNEED on sections of the page.

I understand your concerns, but we should not let bugs in the hypervisor
dictate the design. Bugs are there to be fixed. Interesting read,
though, thanks!
Nitesh Narayan Lal Feb. 19, 2019, 12:47 p.m. UTC | #53
On 2/18/19 4:04 PM, David Hildenbrand wrote:
> On 18.02.19 21:40, Nitesh Narayan Lal wrote:
>> On 2/18/19 3:31 PM, Michael S. Tsirkin wrote:
>>> On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
>>>>>>>>> So I'm fine with a simple implementation but the interface needs to
>>>>>>>>> allow the hypervisor to process hints in parallel while guest is
>>>>>>>>> running.  We can then fix any issues on hypervisor without breaking
>>>>>>>>> guests.
>>>>>>>> Yes, I am fine with defining an interface that theoretically let's us
>>>>>>>> change the implementation in the guest later.
>>>>>>>> I consider this even a
>>>>>>>> prerequisite. IMHO the interface shouldn't be different, it will be
>>>>>>>> exactly the same.
>>>>>>>>
>>>>>>>> It is just "who" calls the batch freeing and waits for it. And as I
>>>>>>>> outlined here, doing it without additional threads at least avoids us
>>>>>>>> for now having to think about dynamic data structures and that we can
>>>>>>>> sometimes not report "because the thread is still busy reporting or
>>>>>>>> wasn't scheduled yet".
>>>>>>> Sorry I wasn't clear. I think we need ability to change the
>>>>>>> implementation in the *host* later. IOW don't rely on
>>>>>>> host being synchronous.
>>>>>>>
>>>>>>>
>>>>>> I actually misread it :) . In any way, there has to be a mechanism to
>>>>>> synchronize.
>>>>>>
>>>>>> If we are going via a bare hypercall (like s390x, like what Alexander
>>>>>> proposes), it is going to be a synchronous interface either way. Just a
>>>>>> bare hypercall, there will not really be any blocking on the guest side.
>>>>> It bothers me that we are now tied to interface being synchronous. We
>>>>> won't be able to fix it if there's an issue as that would break guests.
>>>> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
>>>>
>>>> Yes, as I initially wrote, this mimics s390x. That might be good (we
>>>> know it has been working for years) and bad (we are inheriting the same
>>>> problem class, if it exists). And being synchronous is part of the
>>>> approach for now.
>>> BTW on s390 are these hypercalls handled by Linux?
>>>
>>>> I tend to focus on the first part (we don't know anything besides it is
>>>> working) while you focus on the second part (there could be a potential
>>>> problem). Having a real problem at hand would be great, then we would
>>>> know what exactly we actually have to fix. But read below.
>>> If we end up doing a hypercall per THP, maybe we could at least
>>> not block with interrupts disabled? Poll in guest until
>>> hypervisor reports its done?  That would already be an
>>> improvement IMHO. E.g. perf within guest will point you
>>> in the right direction and towards disabling hinting.
>>>
>>>
>>>>>> Via virtio, I guess it is waiting for a response to a requests, right?
>>>>> For the buffer to be used, yes. And it could mean putting some pages
>>>>> aside until hypervisor is done with them. Then you don't need timers or
>>>>> tricks like this, you can get an interrupt and start using the memory.
>>>> I am very open to such an approach as long as we can make it work and it
>>>> is not too complicated. (-> simple)
>>>>
>>>> This would mean for example
>>>>
>>>> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
>>>> 256/512.
>>>>
>>>> 2. Once the buffer is full, do crazy "take pages out of the balloon
>>>> action" and report them to the hypervisor via virtio. Let the VCPU
>>>> continue. This will require some memory to store the request. Small
>>>> hickup for the VCPU to kick of the reporting to the hypervisor.
>>>>
>>>> 3. On interrupt/response, go over the response and put the pages back to
>>>> the buddy.
>>>>
>>>> (assuming that reporting a bulk of frees is better than reporting every
>>>> single free obviously)
>>>>
>>>> This could allow nice things like "when OOM gets trigger, see if pages
>>>> are currently being reported and wait until they have been put back to
>>>> the buddy, return "new pages available", so in a real "low on memory"
>>>> scenario, no OOM killer would get involved. This could address the issue
>>>> Wei had with reporting when low on memory.
>>>>
>>>> Is that something you have in mind?
>>> Yes that seems more future proof I think.
>>>
>>>> I assume we would have to allocate
>>>> memory when crafting the new requests. This is the only reason I tend to
>>>> prefer a synchronous interface for now. But if allocation is not a
>>>> problem, great.
>>> There are two main ways to avoid allocation:
>>> 1. do not add extra data on top of each chunk passed
>> If I am not wrong then this is close to what we have right now.
> Yes, minus the kthread(s) and eventually with some sort of memory
> allocation for the request. Once you're asynchronous via a notification
> mechanisnm, there is no real need for a thread anymore, hopefully.
Whether we should go with a kthread or without it, I would like to do
some performance comparisons before commenting on this.
>
>> One issue I see right now is that I am polling while host is freeing the
>> memory.
>> In the next version I could tie the logic which returns pages to the
>> buddy and resets the per cpu array index value to 0 with the callback.
>> (i.e.., it happens once we receive an response from the host)
> The question is, what happens when freeing pages and the array is not
> ready to be reused yet. In that case, you want to somehow continue
> freeing pages without busy waiting or eventually not reporting pages.
This is what happens right now.
Having a kthread or not should not affect this behavior.
When the array is full the current approach simply skips collecting the
free pages.
>
> The callback should put the pages back to the buddy and free the request
> eventually to have a fully asynchronous mechanism.
>
>> Other change which I am testing right now is to only capture 'MAX_ORDER
> I am not sure if this is an arbitrary number we came up with here. We
> should really play with different orders to find a hot spot. I wouldn't
> consider this high priority, though. Getting the whole concept right to
> be able to deal with any magic number we come up should be the ultimate
> goal. (stuff that only works with huge pages I consider not future
> proof, especially regarding fragmented guests which can happen easily)
It's quite possible that when we are only capturing MAX_ORDER - 1 and
run a specific workload, we don't get any memory back until we re-run
the program and the buddy finally merges pages back into order
MAX_ORDER - 1 chunks.
This is why I think we may make this configurable at compile time and
keep capturing MAX_ORDER - 1 so that we don't end up breaking anything.
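
A compile-time cap could look roughly like the sketch below (the macro
name is invented for illustration and MAX_ORDER is hard-coded to the
assumed x86-64 value; none of this is from the current patch set):

#include <stdbool.h>

#define MAX_ORDER 11                    /* assumed x86-64 value, for the sketch */

/* Hypothetical knob: only free chunks of at least this order get captured. */
#ifndef PAGE_HINTING_MIN_ORDER
#define PAGE_HINTING_MIN_ORDER (MAX_ORDER - 1)
#endif

static inline bool worth_capturing(unsigned int order)
{
	return order >= PAGE_HINTING_MIN_ORDER;
}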
>
Nitesh Narayan Lal Feb. 19, 2019, 12:52 p.m. UTC | #54
On 2/18/19 9:46 PM, Andrea Arcangeli wrote:
> Hello,
>
> On Mon, Feb 18, 2019 at 03:47:22PM -0800, Alexander Duyck wrote:
>> essentially fragmented them. I guess hugepaged went through and
>> started trying to reassemble the huge pages and as a result there have
>> been apps that ended up consuming more memory than they would have
>> otherwise since they were using fragments of THP pages after doing an
>> MADV_DONTNEED on sections of the page.
> With relatively recent kernels MADV_DONTNEED doesn't necessarily free
> anything when it's applied to a THP subpage, it only splits the
> pagetables and queues the THP for deferred splitting. If there's
> memory pressure a shrinker will be invoked and the queue is scanned
> and the THPs are physically splitted, but to be reassembled/collapsed
> after a physical split it requires at least one young pte.
>
> If this is particularly problematic for page hinting, this behavior
> where the MADV_DONTNEED can be undoed by khugepaged (if some subpage is
> being frequently accessed), can be turned off by setting
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none to
> 0. Then the THP will only be collapsed if all 512 subpages are mapped
> (i.e. they've all be re-allocated by the guest).
>
> Regardless of the max_ptes_none default, keeping the smaller guest
> buddy orders as the last target for page hinting should be good for
> performance.
>
>> Yeah, no problem. The only thing I don't like about MADV_FREE is that
>> you have to have memory pressure before the pages really start getting
>> scrubbed with is both a benefit and a drawback. Basically it defers
>> the freeing until you are under actual memory pressure so when you hit
>> that case things start feeling much slower, that and it limits your
>> allocations since the kernel doesn't recognize the pages as free until
>> it would have to start trying to push memory to swap.
> The guest allocation behavior should not be influenced by MADV_FREE vs
> MADV_DONTNEED, the guest can't see the difference anyway, so why
> should it limit the allocations?
>
> The benefit of MADV_FREE should be that when the same guest frees and
> reallocates an huge amount of RAM (i.e. guest app allocating and
> freeing lots of RAM in a loop, not so uncommon), there will be no KVM
> page fault during guest re-allocations. So in absence of memory
> pressure in the host it should be a major win. Overall it sounds like
> a good tradeoff compared to MADV_DONTNEED that forcefully invokes MMU
> notifiers and forces host allocations and KVM page faults in order to
> reallocate the same RAM in the same guest.
This does make sense.
Thanks for explaining this.
>
> When there's memory pressure it's up to the host Linux VM to notice
> there's plenty of MADV_FREE material to free at zero I/O cost before
> starting swapping.
>
> Thanks,
> Andrea
David Hildenbrand Feb. 19, 2019, 1:03 p.m. UTC | #55
>>>> There are two main ways to avoid allocation:
>>>> 1. do not add extra data on top of each chunk passed
>>> If I am not wrong then this is close to what we have right now.
>> Yes, minus the kthread(s) and eventually with some sort of memory
>> allocation for the request. Once you're asynchronous via a notification
>> mechanisnm, there is no real need for a thread anymore, hopefully.
> Whether we should go with kthread or without it, I would like to do some
> performance comparison before commenting on this.
>>
>>> One issue I see right now is that I am polling while host is freeing the
>>> memory.
>>> In the next version I could tie the logic which returns pages to the
>>> buddy and resets the per cpu array index value to 0 with the callback.
>>> (i.e.., it happens once we receive an response from the host)
>> The question is, what happens when freeing pages and the array is not
>> ready to be reused yet. In that case, you want to somehow continue
>> freeing pages without busy waiting or eventually not reporting pages.
> This is what happens right now.
> Having kthread or not should not effect this behavior.
> When the array is full the current approach simply skips collecting the
> free pages.

Well, it somehow does affect your implementation. If you have a kthread
you always have to synchronize against the VCPU: "Is the pcpu array
ready to be used again".

Once you do it asynchronously from your VCPU without another thread
being involved, such synchronization is not required. Simply prepare a
request and send it off. Reuse the pcpu array instantly. At least that's
the theory :)

If you have a guest bulk freeing a lot of memory, I guess temporarily
dropping free page hints could be counter-productive. It really depends
on how fast the thread gets scheduled and how long the hinting process
takes. Having another thread involved might add a lot of latency to that
formula. We'll have to measure, but my gut feeling is that once we do
stuff asynchronously, there is no need for a thread anymore.

>>
>> The callback should put the pages back to the buddy and free the request
>> eventually to have a fully asynchronous mechanism.
>>
>>> Other change which I am testing right now is to only capture 'MAX_ORDER
>> I am not sure if this is an arbitrary number we came up with here. We
>> should really play with different orders to find a hot spot. I wouldn't
>> consider this high priority, though. Getting the whole concept right to
>> be able to deal with any magic number we come up should be the ultimate
>> goal. (stuff that only works with huge pages I consider not future
>> proof, especially regarding fragmented guests which can happen easily)
> Its quite possible that when we are only capturing MAX_ORDER - 1 and run
> a specific workload we don't get any memory back until we re-run the
> program and buddy finally starts merging of pages of order MAX_ORDER -1.
> This is why I think we may make this configurable from compile time and
> keep capturing MAX_ORDER - 1 so that we don't end up breaking anything.

Some pages will eventually never get merged. Assume you have one page of
a MAX_ORDER - 1 chunk still allocated somewhere (e.g. !movable via
kmalloc). You then skip reporting that chunk completely. Roughly
1MB/2MB/4MB wasted (depending on the arch). This stuff can add up.
Nitesh Narayan Lal Feb. 19, 2019, 2:17 p.m. UTC | #56
On 2/19/19 8:03 AM, David Hildenbrand wrote:
>>>>> There are two main ways to avoid allocation:
>>>>> 1. do not add extra data on top of each chunk passed
>>>> If I am not wrong then this is close to what we have right now.
>>> Yes, minus the kthread(s) and eventually with some sort of memory
>>> allocation for the request. Once you're asynchronous via a notification
>>> mechanisnm, there is no real need for a thread anymore, hopefully.
>> Whether we should go with kthread or without it, I would like to do some
>> performance comparison before commenting on this.
>>>> One issue I see right now is that I am polling while host is freeing the
>>>> memory.
>>>> In the next version I could tie the logic which returns pages to the
>>>> buddy and resets the per cpu array index value to 0 with the callback.
>>>> (i.e.., it happens once we receive an response from the host)
>>> The question is, what happens when freeing pages and the array is not
>>> ready to be reused yet. In that case, you want to somehow continue
>>> freeing pages without busy waiting or eventually not reporting pages.
>> This is what happens right now.
>> Having kthread or not should not effect this behavior.
>> When the array is full the current approach simply skips collecting the
>> free pages.
> Well, it somehow does affect your implementation. If you have a kthread
> you always have to synchronize against the VCPU: "Is the pcpu array
> ready to be used again".
>
> Once you do it asynchronously from your VCPU without another thread
> being involved, such synchronization is not required. Simply prepare a
> request and send it off. Reuse the pcpu array instantly. At least that's
> the theory :)
>
> If you have a guest bulk freeing a lot of memory, I guess temporarily
> dropping free page hints could be counter-productive. It really depends
> on how fast the thread gets scheduled and how long the hinting process
> takes. Having another thread involved might add a lot to that latency to
> that formula. We'll have to measure, but my gut feeling is that once we
> do stuff asynchronously, there is no need for a thread anymore.
This is true.
>
>>> The callback should put the pages back to the buddy and free the request
>>> eventually to have a fully asynchronous mechanism.
>>>
>>>> Other change which I am testing right now is to only capture 'MAX_ORDER
>>> I am not sure if this is an arbitrary number we came up with here. We
>>> should really play with different orders to find a hot spot. I wouldn't
>>> consider this high priority, though. Getting the whole concept right to
>>> be able to deal with any magic number we come up should be the ultimate
>>> goal. (stuff that only works with huge pages I consider not future
>>> proof, especially regarding fragmented guests which can happen easily)
>> Its quite possible that when we are only capturing MAX_ORDER - 1 and run
>> a specific workload we don't get any memory back until we re-run the
>> program and buddy finally starts merging of pages of order MAX_ORDER -1.
>> This is why I think we may make this configurable from compile time and
>> keep capturing MAX_ORDER - 1 so that we don't end up breaking anything.
> Eventually pages will never get merged. Assume you have 1 page of a
> MAX_ORDER - 1 chunk still allocated somewhere (e.g. !movable via
> kmalloc). You skip reporting that chunk completely. Roughly 1mb/2mb/4mb
> wasted (depending on the arch). This stuff can sum up.

After the discussion, here are the changes on which I am planning to
work next:
1. Get rid of the kthread and dynamically allocate a per-cpu array to
hold the
isolated pages. As soon as the initial per-cpu array is completely
scanned, release it
so that we don't end up blocking anything.
2. Continue capturing MAX_ORDER - 1, for now. Reduce the initial per-cpu
array size to 256
for now. As we are doing asynchronous reporting we should be fine with a
lower size array.
3. As soon as the host responds, release the pages back to the buddy
from the callback and free the request.

Benefits wrt current implementation:
1. We will not eat up performance due to kernel thread.
2. We will still be doing reporting asynchronously=> no blocking.
3. Hopefully, we will be able to free more memory.
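A minimal sketch of that flow, only to illustrate its shape; all the names
below (hinting_req, send_hints_to_host(), putback_isolated_page()) are
placeholders and not part of the posted series:

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/string.h>

#define HINTING_ARRAY_SIZE      256     /* the reduced per-cpu array size */

struct hinting_req {
        struct page *pages[HINTING_ARRAY_SIZE];
        unsigned int nr;
};

/*
 * VCPU path: once the per-cpu array is full, move the isolated pages into
 * a freshly allocated request, hand it to the host (e.g. via a virtqueue)
 * and reuse the per-cpu array right away - no kthread, no waiting.
 */
static void report_isolated_pages(struct page **pcpu_array, unsigned int nr)
{
        struct hinting_req *req = kzalloc(sizeof(*req), GFP_ATOMIC);

        if (!req)
                return;         /* drop the hints rather than block the VCPU */

        memcpy(req->pages, pcpu_array, nr * sizeof(struct page *));
        req->nr = nr;
        send_hints_to_host(req);        /* placeholder for the virtio side */
}

/* Host response callback: put the pages back and free the request. */
static void hinting_done(struct hinting_req *req)
{
        unsigned int i;

        for (i = 0; i < req->nr; i++)
                putback_isolated_page(req->pages[i]);   /* placeholder */

        kfree(req);
}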
David Hildenbrand Feb. 19, 2019, 2:21 p.m. UTC | #57
On 19.02.19 15:17, Nitesh Narayan Lal wrote:
> On 2/19/19 8:03 AM, David Hildenbrand wrote:
>>>>>> There are two main ways to avoid allocation:
>>>>>> 1. do not add extra data on top of each chunk passed
>>>>> If I am not wrong then this is close to what we have right now.
>>>> Yes, minus the kthread(s) and eventually with some sort of memory
>>>> allocation for the request. Once you're asynchronous via a notification
>>>> mechanisnm, there is no real need for a thread anymore, hopefully.
>>> Whether we should go with kthread or without it, I would like to do some
>>> performance comparison before commenting on this.
>>>>> One issue I see right now is that I am polling while host is freeing the
>>>>> memory.
>>>>> In the next version I could tie the logic which returns pages to the
>>>>> buddy and resets the per cpu array index value to 0 with the callback.
>>>>> (i.e.., it happens once we receive an response from the host)
>>>> The question is, what happens when freeing pages and the array is not
>>>> ready to be reused yet. In that case, you want to somehow continue
>>>> freeing pages without busy waiting or eventually not reporting pages.
>>> This is what happens right now.
>>> Having kthread or not should not effect this behavior.
>>> When the array is full the current approach simply skips collecting the
>>> free pages.
>> Well, it somehow does affect your implementation. If you have a kthread
>> you always have to synchronize against the VCPU: "Is the pcpu array
>> ready to be used again".
>>
>> Once you do it asynchronously from your VCPU without another thread
>> being involved, such synchronization is not required. Simply prepare a
>> request and send it off. Reuse the pcpu array instantly. At least that's
>> the theory :)
>>
>> If you have a guest bulk freeing a lot of memory, I guess temporarily
>> dropping free page hints could be counter-productive. It really depends
>> on how fast the thread gets scheduled and how long the hinting process
>> takes. Having another thread involved might add a lot to that latency to
>> that formula. We'll have to measure, but my gut feeling is that once we
>> do stuff asynchronously, there is no need for a thread anymore.
> This is true.
>>
>>>> The callback should put the pages back to the buddy and free the request
>>>> eventually to have a fully asynchronous mechanism.
>>>>
>>>>> Other change which I am testing right now is to only capture 'MAX_ORDER
>>>> I am not sure if this is an arbitrary number we came up with here. We
>>>> should really play with different orders to find a hot spot. I wouldn't
>>>> consider this high priority, though. Getting the whole concept right to
>>>> be able to deal with any magic number we come up should be the ultimate
>>>> goal. (stuff that only works with huge pages I consider not future
>>>> proof, especially regarding fragmented guests which can happen easily)
>>> Its quite possible that when we are only capturing MAX_ORDER - 1 and run
>>> a specific workload we don't get any memory back until we re-run the
>>> program and buddy finally starts merging of pages of order MAX_ORDER -1.
>>> This is why I think we may make this configurable from compile time and
>>> keep capturing MAX_ORDER - 1 so that we don't end up breaking anything.
>> Eventually pages will never get merged. Assume you have 1 page of a
>> MAX_ORDER - 1 chunk still allocated somewhere (e.g. !movable via
>> kmalloc). You skip reporting that chunk completely. Roughly 1mb/2mb/4mb
>> wasted (depending on the arch). This stuff can sum up.
> 
> After the discussion, here are the changes on which I am planning to
> work next:
> 1. Get rid of the kthread and dynamically allocate a per-cpu array to
> hold the
> isolated pages. As soon as the initial per-cpu array is completely
> scanned, release it
> so that we don't end up blocking anything.
> 2. Continue capturing MAX_ORDER - 1, for now. Reduce the initial per-cpu
> array size to 256
> for now. As we are doing asynchronous reporting we should be fine with a
> lower size array.
> 3. As soon as the host responds, release the pages back to the buddy
> from the callback and free the request.
> 
> Benefits wrt current implementation:
> 1. We will not eat up performance due to kernel thread.
> 2. We will still be doing reporting asynchronously=> no blocking.
> 3. Hopefully, we will be able to free more memory.
> 

+1 to that approach. We can fine-tune the numbers (array size, sizes to
report) easily later on. Let us know if you run into problems doing the
allocation for the request. If that is a blocker, all we are left with is
a synchronous approach, I guess. Fingers crossed. :)
Michael S. Tsirkin Feb. 19, 2019, 2:40 p.m. UTC | #58
On Tue, Feb 19, 2019 at 09:06:01AM +0100, David Hildenbrand wrote:
> On 19.02.19 00:47, Alexander Duyck wrote:
> > On Mon, Feb 18, 2019 at 9:42 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 18.02.19 18:31, Alexander Duyck wrote:
> >>> On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 18.02.19 17:49, Michael S. Tsirkin wrote:
> >>>>> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
> >>>>>> It would be worth a try. My feeling is that a synchronous report after
> >>>>>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> >>>>>> s390x. (basically always enabled, nobody complains).
> >>>>>
> >>>>> What slips under the radar on an arch like s390 might
> >>>>> raise issues for a popular arch like x86. My fear would be
> >>>>> if it's only a problem e.g. for realtime. Then you get
> >>>>> a condition that's very hard to trigger and affects
> >>>>> worst case latencies.
> >>>>
> >>>> Realtime should never use free page hinting. Just like it should never
> >>>> use ballooning. Just like it should pin all pages in the hypervisor.
> >>>>
> >>>>>
> >>>>> But really what business has something that is supposedly
> >>>>> an optimization blocking a VCPU? We are just freeing up
> >>>>> lots of memory why is it a good idea to slow that
> >>>>> process down?
> >>>>
> >>>> I first want to know that it is a problem before we declare it a
> >>>> problem. I provided an example (s390x) where it does not seem to be a
> >>>> problem. One hypercall ~every 512 frees. As simple as it can get.
> >>>>
> >>>> No trying to deny that it could be a problem on x86, but then I assume
> >>>> it is only a problem in specific setups.
> >>>>
> >>>> I would much rather prefer a simple solution that can eventually be
> >>>> disabled in selected setup than a complicated solution that tries to fit
> >>>> all possible setups. Realtime is one of the examples where such stuff is
> >>>> to be disabled either way.
> >>>>
> >>>> Optimization of space comes with a price (here: execution time).
> >>>
> >>> One thing to keep in mind though is that if you are already having to
> >>> pull pages in and out of swap on the host in order be able to provide
> >>> enough memory for the guests the free page hinting should be a
> >>> significant win in terms of performance.
> >>
> >> Indeed. And also we are in a virtualized environment already, we can
> >> have any kind of sudden hickups. (again, realtime has special
> >> requirements on the setup)
> >>
> >> Side note: I like your approach because it is simple. I don't like your
> >> approach because it cannot deal with fragmented memory. And that can
> >> happen easily.
> >>
> >> The idea I described here can be similarly be an extension of your
> >> approach, merging in a "batched reporting" Nitesh proposed, so we can
> >> report on something < MAX_ORDER, similar to s390x. In the end it boils
> >> down to reporting via hypercall vs. reporting via virtio. The main point
> >> is that it is synchronous and batched. (and that we properly take care
> >> of the race between host freeing and guest allocation)
> > 
> > I'd say the discussion is even simpler then that. My concern is more
> > synchronous versus asynchronous. I honestly think the cost for a
> > synchronous call is being overblown and we are likely to see the fault
> > and zeroing of pages cost more than the hypercall or virtio
> > transaction itself.
> 
> The overhead of page faults and zeroing should be mitigated by
> MADV_FREE, as Andrea correctly stated (thanks!). Then, the call overhead
> (context switch) becomes relevant.
> 
> We have various discussions now :) And I think they are related.
> 
> synchronous versus asynchronous
> batched vs. non-batched
> MAX_ORDER - 1 vs. other/none magic number
> 
> 1. synchronous call without batching on every kfree is bad. The
> interface is fixed to big magic numbers, otherwise we end up having a
> hypercall on every kfree. This is your approach.
> 
> 2. asynchronous calls without batching would most probably have similar
> problems with small granularities as we had when ballooning without
> batching. Just overhead we can avoid.
> 
> 3. synchronous and batched is what s390x does. It can deal with page
> granularity. It is what I initially described in this sub-thread.
> 
> 4. asynchronous and batched. This is the other approach we discussed
> yesterday. If we can get it implemented, I would be interested in
> performance numbers.
> 
> As far as I understood, Michael seems to favor something like 4 (and I
> assume eventually 2 if it is similarly fast). I am a friend of either 3
> or 4.

Well, Linus said big granularity is important for the Linux MM
and not to bother with hinting small sizes.

Alex said the cost of a hypercall is dwarfed by a page fault
after alloc. I would be curious whether async page faults
could help things somehow, though.

> > 
> > Also one reason why I am not a fan of working with anything less than
> > PMD order is because there have been issues in the past with false
> > memory leaks being created when hints were provided on THP pages that
> > essentially fragmented them. I guess hugepaged went through and
> > started trying to reassemble the huge pages and as a result there have
> > been apps that ended up consuming more memory than they would have
> > otherwise since they were using fragments of THP pages after doing an
> > MADV_DONTNEED on sections of the page.
> 
> I understand your concerns, but we should not let bugs in the hypervisor
> dictate the design. Bugs are there to be fixed. Interesting read,
> though, thanks!

Right, but if we break up a huge page we are then creating
more work for the hypervisor to reassemble it.

> -- 
> 
> Thanks,
> 
> David / dhildenb
David Hildenbrand Feb. 19, 2019, 2:44 p.m. UTC | #59
On 19.02.19 15:40, Michael S. Tsirkin wrote:
> On Tue, Feb 19, 2019 at 09:06:01AM +0100, David Hildenbrand wrote:
>> On 19.02.19 00:47, Alexander Duyck wrote:
>>> On Mon, Feb 18, 2019 at 9:42 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 18.02.19 18:31, Alexander Duyck wrote:
>>>>> On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>>
>>>>>> On 18.02.19 17:49, Michael S. Tsirkin wrote:
>>>>>>> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
>>>>>>>> It would be worth a try. My feeling is that a synchronous report after
>>>>>>>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
>>>>>>>> s390x. (basically always enabled, nobody complains).
>>>>>>>
>>>>>>> What slips under the radar on an arch like s390 might
>>>>>>> raise issues for a popular arch like x86. My fear would be
>>>>>>> if it's only a problem e.g. for realtime. Then you get
>>>>>>> a condition that's very hard to trigger and affects
>>>>>>> worst case latencies.
>>>>>>
>>>>>> Realtime should never use free page hinting. Just like it should never
>>>>>> use ballooning. Just like it should pin all pages in the hypervisor.
>>>>>>
>>>>>>>
>>>>>>> But really what business has something that is supposedly
>>>>>>> an optimization blocking a VCPU? We are just freeing up
>>>>>>> lots of memory why is it a good idea to slow that
>>>>>>> process down?
>>>>>>
>>>>>> I first want to know that it is a problem before we declare it a
>>>>>> problem. I provided an example (s390x) where it does not seem to be a
>>>>>> problem. One hypercall ~every 512 frees. As simple as it can get.
>>>>>>
>>>>>> No trying to deny that it could be a problem on x86, but then I assume
>>>>>> it is only a problem in specific setups.
>>>>>>
>>>>>> I would much rather prefer a simple solution that can eventually be
>>>>>> disabled in selected setup than a complicated solution that tries to fit
>>>>>> all possible setups. Realtime is one of the examples where such stuff is
>>>>>> to be disabled either way.
>>>>>>
>>>>>> Optimization of space comes with a price (here: execution time).
>>>>>
>>>>> One thing to keep in mind though is that if you are already having to
>>>>> pull pages in and out of swap on the host in order be able to provide
>>>>> enough memory for the guests the free page hinting should be a
>>>>> significant win in terms of performance.
>>>>
>>>> Indeed. And also we are in a virtualized environment already, we can
>>>> have any kind of sudden hickups. (again, realtime has special
>>>> requirements on the setup)
>>>>
>>>> Side note: I like your approach because it is simple. I don't like your
>>>> approach because it cannot deal with fragmented memory. And that can
>>>> happen easily.
>>>>
>>>> The idea I described here can be similarly be an extension of your
>>>> approach, merging in a "batched reporting" Nitesh proposed, so we can
>>>> report on something < MAX_ORDER, similar to s390x. In the end it boils
>>>> down to reporting via hypercall vs. reporting via virtio. The main point
>>>> is that it is synchronous and batched. (and that we properly take care
>>>> of the race between host freeing and guest allocation)
>>>
>>> I'd say the discussion is even simpler then that. My concern is more
>>> synchronous versus asynchronous. I honestly think the cost for a
>>> synchronous call is being overblown and we are likely to see the fault
>>> and zeroing of pages cost more than the hypercall or virtio
>>> transaction itself.
>>
>> The overhead of page faults and zeroing should be mitigated by
>> MADV_FREE, as Andrea correctly stated (thanks!). Then, the call overhead
>> (context switch) becomes relevant.
>>
>> We have various discussions now :) And I think they are related.
>>
>> synchronous versus asynchronous
>> batched vs. non-batched
>> MAX_ORDER - 1 vs. other/none magic number
>>
>> 1. synchronous call without batching on every kfree is bad. The
>> interface is fixed to big magic numbers, otherwise we end up having a
>> hypercall on every kfree. This is your approach.
>>
>> 2. asynchronous calls without batching would most probably have similar
>> problems with small granularities as we had when ballooning without
>> batching. Just overhead we can avoid.
>>
>> 3. synchronous and batched is what s390x does. It can deal with page
>> granularity. It is what I initially described in this sub-thread.
>>
>> 4. asynchronous and batched. This is the other approach we discussed
>> yesterday. If we can get it implemented, I would be interested in
>> performance numbers.
>>
>> As far as I understood, Michael seems to favor something like 4 (and I
>> assume eventually 2 if it is similarly fast). I am a friend of either 3
>> or 4.
> 
> Well Linus said big granularity is important for linux MM
> and not to bother with hinting small sizes.
> 

For some reason I tend to also challenge the opinions of people way
smarter than me ;) Only the numbers can tell the true story later.

> Alex said cost of a hypercall is drawfed by a pagefault
> after alloc. I would be curious whether async pagefault
> can help things somehow though.

Indeed.

> 
>>>
>>> Also one reason why I am not a fan of working with anything less than
>>> PMD order is because there have been issues in the past with false
>>> memory leaks being created when hints were provided on THP pages that
>>> essentially fragmented them. I guess hugepaged went through and
>>> started trying to reassemble the huge pages and as a result there have
>>> been apps that ended up consuming more memory than they would have
>>> otherwise since they were using fragments of THP pages after doing an
>>> MADV_DONTNEED on sections of the page.
>>
>> I understand your concerns, but we should not let bugs in the hypervisor
>> dictate the design. Bugs are there to be fixed. Interesting read,
>> though, thanks!
> 
> Right but if we break up a huge page we are then creating
> more work for hypervisor to reassemble it.

Yes, but the hypervisor can decide what to do. E.g. on s390x there are
no THP, so nothing to break up.

It is all very complicated :)
David Hildenbrand Feb. 19, 2019, 2:45 p.m. UTC | #60
>>>>
>>>> Also one reason why I am not a fan of working with anything less than
>>>> PMD order is because there have been issues in the past with false
>>>> memory leaks being created when hints were provided on THP pages that
>>>> essentially fragmented them. I guess hugepaged went through and
>>>> started trying to reassemble the huge pages and as a result there have
>>>> been apps that ended up consuming more memory than they would have
>>>> otherwise since they were using fragments of THP pages after doing an
>>>> MADV_DONTNEED on sections of the page.
>>>
>>> I understand your concerns, but we should not let bugs in the hypervisor
>>> dictate the design. Bugs are there to be fixed. Interesting read,
>>> though, thanks!
>>
>> Right but if we break up a huge page we are then creating
>> more work for hypervisor to reassemble it.
> 
> Yes, but the hypervisor can decide what to do. E.g. on s390x there are
> no THP, so nothing to break up.

To clarify, as that might be confusing: no THP in a KVM guest mapping
for now.

> 
> It is all very complicated :)
>
Alexander Duyck Feb. 19, 2019, 4:23 p.m. UTC | #61
On Mon, Feb 18, 2019 at 6:46 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Hello,
>
> On Mon, Feb 18, 2019 at 03:47:22PM -0800, Alexander Duyck wrote:
> > essentially fragmented them. I guess hugepaged went through and
> > started trying to reassemble the huge pages and as a result there have
> > been apps that ended up consuming more memory than they would have
> > otherwise since they were using fragments of THP pages after doing an
> > MADV_DONTNEED on sections of the page.
>
> With relatively recent kernels MADV_DONTNEED doesn't necessarily free
> anything when it's applied to a THP subpage, it only splits the
> pagetables and queues the THP for deferred splitting. If there's
> memory pressure a shrinker will be invoked and the queue is scanned
> and the THPs are physically splitted, but to be reassembled/collapsed
> after a physical split it requires at least one young pte.
>
> If this is particularly problematic for page hinting, this behavior
> where the MADV_DONTNEED can be undoed by khugepaged (if some subpage is
> being frequently accessed), can be turned off by setting
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none to
> 0. Then the THP will only be collapsed if all 512 subpages are mapped
> (i.e. they've all be re-allocated by the guest).
>
> Regardless of the max_ptes_none default, keeping the smaller guest
> buddy orders as the last target for page hinting should be good for
> performance.

Okay, this is good to know. Thanks.

> > Yeah, no problem. The only thing I don't like about MADV_FREE is that
> > you have to have memory pressure before the pages really start getting
> > scrubbed with is both a benefit and a drawback. Basically it defers
> > the freeing until you are under actual memory pressure so when you hit
> > that case things start feeling much slower, that and it limits your
> > allocations since the kernel doesn't recognize the pages as free until
> > it would have to start trying to push memory to swap.
>
> The guest allocation behavior should not be influenced by MADV_FREE vs
> MADV_DONTNEED, the guest can't see the difference anyway, so why
> should it limit the allocations?

Actually I was talking about the host. So if I have a guest that is
using MADV_FREE, what I have to do is create an allocation that forces
us to access swap, and that in turn ends up triggering the freeing of
the pages that were moved to the "Inactive(file)" list by the MADV_FREE
call.

The only reason it came up is that one of my test systems had a small
swap, so I ended up having to do multiple allocations and frees in
swap-sized increments to free up memory from a large guest that wasn't
in use.

> The benefit of MADV_FREE should be that when the same guest frees and
> reallocates an huge amount of RAM (i.e. guest app allocating and
> freeing lots of RAM in a loop, not so uncommon), there will be no KVM
> page fault during guest re-allocations. So in absence of memory
> pressure in the host it should be a major win. Overall it sounds like
> a good tradeoff compared to MADV_DONTNEED that forcefully invokes MMU
> notifiers and forces host allocations and KVM page faults in order to
> reallocate the same RAM in the same guest.

Right, and I do see that behavior.

> When there's memory pressure it's up to the host Linux VM to notice
> there's plenty of MADV_FREE material to free at zero I/O cost before
> starting swapping.

Right.
Alexander Duyck Feb. 19, 2019, 6:06 p.m. UTC | #62
On Mon, Feb 18, 2019 at 11:55 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.02.19 01:01, Alexander Duyck wrote:
> > On Mon, Feb 18, 2019 at 1:04 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 18.02.19 21:40, Nitesh Narayan Lal wrote:
> >>> On 2/18/19 3:31 PM, Michael S. Tsirkin wrote:
> >>>> On Mon, Feb 18, 2019 at 09:04:57PM +0100, David Hildenbrand wrote:
> >>>>>>>>>> So I'm fine with a simple implementation but the interface needs to
> >>>>>>>>>> allow the hypervisor to process hints in parallel while guest is
> >>>>>>>>>> running.  We can then fix any issues on hypervisor without breaking
> >>>>>>>>>> guests.
> >>>>>>>>> Yes, I am fine with defining an interface that theoretically let's us
> >>>>>>>>> change the implementation in the guest later.
> >>>>>>>>> I consider this even a
> >>>>>>>>> prerequisite. IMHO the interface shouldn't be different, it will be
> >>>>>>>>> exactly the same.
> >>>>>>>>>
> >>>>>>>>> It is just "who" calls the batch freeing and waits for it. And as I
> >>>>>>>>> outlined here, doing it without additional threads at least avoids us
> >>>>>>>>> for now having to think about dynamic data structures and that we can
> >>>>>>>>> sometimes not report "because the thread is still busy reporting or
> >>>>>>>>> wasn't scheduled yet".
> >>>>>>>> Sorry I wasn't clear. I think we need ability to change the
> >>>>>>>> implementation in the *host* later. IOW don't rely on
> >>>>>>>> host being synchronous.
> >>>>>>>>
> >>>>>>>>
> >>>>>>> I actually misread it :) . In any way, there has to be a mechanism to
> >>>>>>> synchronize.
> >>>>>>>
> >>>>>>> If we are going via a bare hypercall (like s390x, like what Alexander
> >>>>>>> proposes), it is going to be a synchronous interface either way. Just a
> >>>>>>> bare hypercall, there will not really be any blocking on the guest side.
> >>>>>> It bothers me that we are now tied to interface being synchronous. We
> >>>>>> won't be able to fix it if there's an issue as that would break guests.
> >>>>> I assume with "fix it" you mean "fix kfree taking longer on every X call"?
> >>>>>
> >>>>> Yes, as I initially wrote, this mimics s390x. That might be good (we
> >>>>> know it has been working for years) and bad (we are inheriting the same
> >>>>> problem class, if it exists). And being synchronous is part of the
> >>>>> approach for now.
> >>>> BTW on s390 are these hypercalls handled by Linux?
> >>>>
> >>>>> I tend to focus on the first part (we don't know anything besides it is
> >>>>> working) while you focus on the second part (there could be a potential
> >>>>> problem). Having a real problem at hand would be great, then we would
> >>>>> know what exactly we actually have to fix. But read below.
> >>>> If we end up doing a hypercall per THP, maybe we could at least
> >>>> not block with interrupts disabled? Poll in guest until
> >>>> hypervisor reports its done?  That would already be an
> >>>> improvement IMHO. E.g. perf within guest will point you
> >>>> in the right direction and towards disabling hinting.
> >>>>
> >>>>
> >>>>>>> Via virtio, I guess it is waiting for a response to a requests, right?
> >>>>>> For the buffer to be used, yes. And it could mean putting some pages
> >>>>>> aside until hypervisor is done with them. Then you don't need timers or
> >>>>>> tricks like this, you can get an interrupt and start using the memory.
> >>>>> I am very open to such an approach as long as we can make it work and it
> >>>>> is not too complicated. (-> simple)
> >>>>>
> >>>>> This would mean for example
> >>>>>
> >>>>> 1. Collect entries to be reported per VCPU in a buffer. Say magic number
> >>>>> 256/512.
> >>>>>
> >>>>> 2. Once the buffer is full, do crazy "take pages out of the balloon
> >>>>> action" and report them to the hypervisor via virtio. Let the VCPU
> >>>>> continue. This will require some memory to store the request. Small
> >>>>> hickup for the VCPU to kick of the reporting to the hypervisor.
> >>>>>
> >>>>> 3. On interrupt/response, go over the response and put the pages back to
> >>>>> the buddy.
> >>>>>
> >>>>> (assuming that reporting a bulk of frees is better than reporting every
> >>>>> single free obviously)
> >>>>>
> >>>>> This could allow nice things like "when OOM gets trigger, see if pages
> >>>>> are currently being reported and wait until they have been put back to
> >>>>> the buddy, return "new pages available", so in a real "low on memory"
> >>>>> scenario, no OOM killer would get involved. This could address the issue
> >>>>> Wei had with reporting when low on memory.
> >>>>>
> >>>>> Is that something you have in mind?
> >>>> Yes that seems more future proof I think.
> >>>>
> >>>>> I assume we would have to allocate
> >>>>> memory when crafting the new requests. This is the only reason I tend to
> >>>>> prefer a synchronous interface for now. But if allocation is not a
> >>>>> problem, great.
> >>>> There are two main ways to avoid allocation:
> >>>> 1. do not add extra data on top of each chunk passed
> >>> If I am not wrong then this is close to what we have right now.
> >>
> >> Yes, minus the kthread(s) and eventually with some sort of memory
> >> allocation for the request. Once you're asynchronous via a notification
> >> mechanisnm, there is no real need for a thread anymore, hopefully.
> >>
> >>> One issue I see right now is that I am polling while host is freeing the
> >>> memory.
> >>> In the next version I could tie the logic which returns pages to the
> >>> buddy and resets the per cpu array index value to 0 with the callback.
> >>> (i.e.., it happens once we receive an response from the host)
> >>
> >> The question is, what happens when freeing pages and the array is not
> >> ready to be reused yet. In that case, you want to somehow continue
> >> freeing pages without busy waiting or eventually not reporting pages.
> >>
> >> The callback should put the pages back to the buddy and free the request
> >> eventually to have a fully asynchronous mechanism.
> >>
> >>> Other change which I am testing right now is to only capture 'MAX_ORDER
> >>
> >> I am not sure if this is an arbitrary number we came up with here. We
> >> should really play with different orders to find a hot spot. I wouldn't
> >> consider this high priority, though. Getting the whole concept right to
> >> be able to deal with any magic number we come up should be the ultimate
> >> goal. (stuff that only works with huge pages I consider not future
> >> proof, especially regarding fragmented guests which can happen easily)
> >
> > This essentially just ends up being another trade-off of CPU versus
> > memory though. Assuming we aren't using THP we are going to take a
> > penalty in terms of performance but could then free individual pages
> > less than HUGETLB_PAGE_ORDER, but the CPU utilization is going to be
> > much higher in general even without the hinting. I figure for x86 we
> > probably don't have too many options since if I am not mistaken
> > MAX_ORDER is just one or two more than HUGETLB_PAGE_ORDER.
>
> THP is an implementation detail in the hypervisor. Yes, it is the common
> case on x86. But it is e.g. not available on s390x yet. And we also want
> this mechanism to work on s390x (e.g. for nested virtualization setups
> as discussed).
>
> If we e.g. report any granularity after merging was done in the buddy,
> we could end up reporting everything from page size up to MAX_SIZE - 1,
> the hypervisor could ignore hints below a certain magic number, if it
> makes its life easier.

For each architecture we can do a separate implementation of what to
hint on. We already do that for bare metal so why would we have guests
do the same type of hinting in the virtualization case when there are
fundamental differences in page size and features in each
architecture?

This is another reason why I think the hypercall approach is a better
idea since each architecture is likely going to want to handle things
differently and it would be a pain to try and sort that all out in a
virtio driver.

> >
> > As far as fragmentation my thought is that we may want to look into
> > adding support to the guest for prioritizing defragmentation on pages
> > lower than THP size. Then that way we could maintain the higher
> > overall performance with or without the hinting since shuffling lower
> > order pages around between guests would start to get expensive pretty
> > quick.
>
> My take would be, design an interface/mechanism that allows any kind of
> granularity. You can than balance between cpu overead and space shifting.

The problem with using "any kind of granularity" is that in the case
of memory we are already having problems with 4K pages being deemed
too small of a granularity to be useful for anything and making
operations too expensive.

I'm open to using other page orders for other architectures. Nothing
says we have to stick with THP sized pages for all architectures. I
have just been focused on x86 and this seems like the best fit for the
balance between CPU and freeing of memory for now on that
architecture.

> I feel like repeating myself, but on s390x hinting is done on page
> granularity, and I have never heard somebody say "how can I turn it off,
> this is slowing down my system too much.". All we know is that one
> hypercall per free is most probably not acceptable. We really have to
> play with the numbers.

My thought was that we could look at doing different implementations for
other architectures such as s390 and PowerPC. Odds are the
implementations would be similar but have slight differences where
appropriate, such as what order we should start hinting on, or whether
we bypass the hypercall/virtio-balloon for a host-native approach if
available.

> I tend to like an asynchronous reporting approach as discussed in this
> thread, we would have to see if Nitesh could get it implemented.

I agree it would be great if it could work. However I have concerns
given that work on this patch set dates back to 2017, major issues
such as working around device assignment have yet to be addressed, and
it seems like most of the effort is being focused on things that in my
opinion are being over-engineered for little to no benefit.

I really think that simpler would be much better in terms of design in
this case.

Thanks.

- Alex
David Hildenbrand Feb. 19, 2019, 6:31 p.m. UTC | #63
>>> This essentially just ends up being another trade-off of CPU versus
>>> memory though. Assuming we aren't using THP we are going to take a
>>> penalty in terms of performance but could then free individual pages
>>> less than HUGETLB_PAGE_ORDER, but the CPU utilization is going to be
>>> much higher in general even without the hinting. I figure for x86 we
>>> probably don't have too many options since if I am not mistaken
>>> MAX_ORDER is just one or two more than HUGETLB_PAGE_ORDER.
>>
>> THP is an implementation detail in the hypervisor. Yes, it is the common
>> case on x86. But it is e.g. not available on s390x yet. And we also want
>> this mechanism to work on s390x (e.g. for nested virtualization setups
>> as discussed).
>>
>> If we e.g. report any granularity after merging was done in the buddy,
>> we could end up reporting everything from page size up to MAX_SIZE - 1,
>> the hypervisor could ignore hints below a certain magic number, if it
>> makes its life easier.
> 
> For each architecture we can do a separate implementation of what to
> hint on. We already do that for bare metal so why would we have guests
> do the same type of hinting in the virtualization case when there are
> fundamental differences in page size and features in each
> architecture?
> 
> This is another reason why I think the hypercall approach is a better
> idea since each architecture is likely going to want to handle things
> differently and it would be a pain to try and sort that all out in a
> virtio driver.

I can't follow. We are talking about something as simple as a minimum
page granularity here that can easily be configured. Nothing that
screams for different implementations. But I get your point, we could
tune for different architectures.

> 
>>>
>>> As far as fragmentation my thought is that we may want to look into
>>> adding support to the guest for prioritizing defragmentation on pages
>>> lower than THP size. Then that way we could maintain the higher
>>> overall performance with or without the hinting since shuffling lower
>>> order pages around between guests would start to get expensive pretty
>>> quick.
>>
>> My take would be, design an interface/mechanism that allows any kind of
>> granularity. You can than balance between cpu overead and space shifting.
> 
> The problem with using "any kind of granularity" is that in the case
> of memory we are already having problems with 4K pages being deemed
> too small of a granularity to be useful for anything and making
> operations too expensive.

No, sorry, s390x does it. And via batch reporting it could work. Not
saying we should do page granularity, but "to be useful for anything" is
just wrong.

> 
> I'm open to using other page orders for other architectures. Nothing
> says we have to stick with THP sized pages for all architectures. I
> have just been focused on x86 and this seems like the best fit for the
> balance between CPU and freeing of memory for now on that
> architecture.
> 
>> I feel like repeating myself, but on s390x hinting is done on page
>> granularity, and I have never heard somebody say "how can I turn it off,
>> this is slowing down my system too much.". All we know is that one
>> hypercall per free is most probably not acceptable. We really have to
>> play with the numbers.
> 
> My thought was we could look at doing different implementations for
> other architectures such as s390 and powerPC. Odds are the
> implementations would be similar but have slight differences where
> appropriate such as what order we should start hinting on, or if we
> bypass the hypercall/virtio-balloon for a host native approach if
> available.
> 
>> I tend to like an asynchronous reporting approach as discussed in this
>> thread, we would have to see if Nitesh could get it implemented.
> 
> I agree it would be great if it could work. However I have concerns
> given that work on this patch set dates back to 2017, major issues
> such as working around device assignment have yet to be addressed, and
> it seems like most of the effort is being focused on things that in my
> opinion are being over-engineered for little to no benefit.

I can understand that you are trying to push your solution. I would do
the same. Again, I don't like a pure synchronous approach that works
one element at a time. Period. Other people might have other opinions.
This is mine - luckily I don't have the final say here :)

MST also voted for an asynchronous solution if we can make it work.
Nitesh has made significant improvements since 2017. Complicated stuff
needs time. No need to rush. People have been talking about free page
hinting since 2006. I talked to various people who experimented with
bitmap-based solutions two years ago.

So much for that - if you think your solution is the way to go, please
follow up on it. Nitesh seems to have decided to look into the
asynchronous approach you also called "great if it could work". As long
as we don't run into elementary blockers there, to me it all looks like
we are making progress, which is good. If we find out that asynchronous
doesn't work, synchronous is the only alternative.

And just so you don't get me wrong: Thanks for looking and working on
this. And thanks for sharing your opinions and insights! However making
a decision about going your way at this point does not seem reasonable
to me. We have plenty of time.
Michael S. Tsirkin Feb. 19, 2019, 7:58 p.m. UTC | #64
On Tue, Feb 19, 2019 at 10:06:35AM -0800, Alexander Duyck wrote:
> > I tend to like an asynchronous reporting approach as discussed in this
> > thread, we would have to see if Nitesh could get it implemented.
> 
> I agree it would be great if it could work. However I have concerns
> given that work on this patch set dates back to 2017, major issues
> such as working around device assignment have yet to be addressed,

BTW, for device assignment to work, your idea of sending
data directly to KVM won't work, will it?
You need to update userspace so it can update VFIO, right?
Another blocker for assignment is the ability to make holes
in an existing mapping - supported by the hardware but
not by the IOMMU drivers.

All these issues are shared with the balloon, BTW, so that
could be another reason to use the balloon.

> and
> it seems like most of the effort is being focused on things that in my
> opinion are being over-engineered for little to no benefit.
> 
> I really think that simpler would be much better in terms of design in
> this case.
> 
> Thanks.
> 
> - Alex

I personally learned a lot from Wei's work on hinting.


All this doesn't mean your approach is not the right one.
David Hildenbrand Feb. 19, 2019, 8:02 p.m. UTC | #65
On 19.02.19 20:58, Michael S. Tsirkin wrote:
> On Tue, Feb 19, 2019 at 10:06:35AM -0800, Alexander Duyck wrote:
>>> I tend to like an asynchronous reporting approach as discussed in this
>>> thread, we would have to see if Nitesh could get it implemented.
>>
>> I agree it would be great if it could work. However I have concerns
>> given that work on this patch set dates back to 2017, major issues
>> such as working around device assignment have yet to be addressed,
> 
> BTW for device assignment to work, your idea of sending
> data directly to kvm won't work, will it?
> You need to update userspace so it can update VFIO right?
> Another blocker for assignment is ability to make holes
> an an existing mapping - supported by hardware but
> not by IOMMU drivers.

I had the exact same thought and then realized that we decided to block
the balloon in user space until we figured out how to handle this properly.

I wonder if MADV_FREE behaves differently compared to MADV_DONTNEED when
finding pinned pages, but I doubt it. Most probably we'll have to
disable hinting for device assignments as well.

> 
> All the issues are shared with balloon btw, so that
> could be another reason to use the balloon.

Yes, smells like it.
Michael S. Tsirkin Feb. 19, 2019, 8:17 p.m. UTC | #66
On Tue, Feb 19, 2019 at 09:02:52PM +0100, David Hildenbrand wrote:
> On 19.02.19 20:58, Michael S. Tsirkin wrote:
> > On Tue, Feb 19, 2019 at 10:06:35AM -0800, Alexander Duyck wrote:
> >>> I tend to like an asynchronous reporting approach as discussed in this
> >>> thread, we would have to see if Nitesh could get it implemented.
> >>
> >> I agree it would be great if it could work. However I have concerns
> >> given that work on this patch set dates back to 2017, major issues
> >> such as working around device assignment have yet to be addressed,
> > 
> > BTW for device assignment to work, your idea of sending
> > data directly to kvm won't work, will it?
> > You need to update userspace so it can update VFIO right?
> > Another blocker for assignment is ability to make holes
> > an an existing mapping - supported by hardware but
> > not by IOMMU drivers.
> 
> I had the exact same thought and then realized that we decided to block
> the balloon in user space until we figured out how to handle this properly.
> 
> I wonder if MADV_FREE behaves differently compared to MADV_DONTNEED when
> finding pinned pages, but I doubt it. Most probably we'll have to
> disable hinting for device assignments as well.

OK, but let's recognize it as a bug, not a feature.
David Hildenbrand Feb. 19, 2019, 8:21 p.m. UTC | #67
On 19.02.19 21:17, Michael S. Tsirkin wrote:
> On Tue, Feb 19, 2019 at 09:02:52PM +0100, David Hildenbrand wrote:
>> On 19.02.19 20:58, Michael S. Tsirkin wrote:
>>> On Tue, Feb 19, 2019 at 10:06:35AM -0800, Alexander Duyck wrote:
>>>>> I tend to like an asynchronous reporting approach as discussed in this
>>>>> thread, we would have to see if Nitesh could get it implemented.
>>>>
>>>> I agree it would be great if it could work. However I have concerns
>>>> given that work on this patch set dates back to 2017, major issues
>>>> such as working around device assignment have yet to be addressed,
>>>
>>> BTW for device assignment to work, your idea of sending
>>> data directly to kvm won't work, will it?
>>> You need to update userspace so it can update VFIO right?
>>> Another blocker for assignment is ability to make holes
>>> an an existing mapping - supported by hardware but
>>> not by IOMMU drivers.
>>
>> I had the exact same thought and then realized that we decided to block
>> the balloon in user space until we figured out how to handle this properly.
>>
>> I wonder if MADV_FREE behaves differently compared to MADV_DONTNEED when
>> finding pinned pages, but I doubt it. Most probably we'll have to
>> disable hinting for device assignments as well.
> 
> OK but let's recognize it as a bug not a feature.
> 

Yes. BTW, an interesting read: https://lwn.net/Articles/198380/

"Pages which have been locked into memory pose an extra challenge here -
they can be part of the page cache, but they still shouldn't be taken
away by the host system. So such pages cannot be marked as "volatile."
The problem is that figuring out if a page is locked is harder than it
might seem; it can involve scanning a list of virtual memory area (VMA)
structures, which is slow. So the hinting patches add a new flag to the
address_space structure to note that somebody has locked pages from that
address space in memory."

I assume locked here actually means pinned.
Michael S. Tsirkin Feb. 19, 2019, 8:35 p.m. UTC | #68
On Tue, Feb 19, 2019 at 09:21:20PM +0100, David Hildenbrand wrote:
> On 19.02.19 21:17, Michael S. Tsirkin wrote:
> > On Tue, Feb 19, 2019 at 09:02:52PM +0100, David Hildenbrand wrote:
> >> On 19.02.19 20:58, Michael S. Tsirkin wrote:
> >>> On Tue, Feb 19, 2019 at 10:06:35AM -0800, Alexander Duyck wrote:
> >>>>> I tend to like an asynchronous reporting approach as discussed in this
> >>>>> thread, we would have to see if Nitesh could get it implemented.
> >>>>
> >>>> I agree it would be great if it could work. However I have concerns
> >>>> given that work on this patch set dates back to 2017, major issues
> >>>> such as working around device assignment have yet to be addressed,
> >>>
> >>> BTW for device assignment to work, your idea of sending
> >>> data directly to kvm won't work, will it?
> >>> You need to update userspace so it can update VFIO right?
> >>> Another blocker for assignment is ability to make holes
> >>> an an existing mapping - supported by hardware but
> >>> not by IOMMU drivers.
> >>
> >> I had the exact same thought and then realized that we decided to block
> >> the balloon in user space until we figured out how to handle this properly.
> >>
> >> I wonder if MADV_FREE behaves differently compared to MADV_DONTNEED when
> >> finding pinned pages, but I doubt it. Most probably we'll have to
> >> disable hinting for device assignments as well.
> > 
> > OK but let's recognize it as a bug not a feature.
> > 
> 
> Yes, btw interesting read: https://lwn.net/Articles/198380/

There's also slideware from Rik van Riel circa 2011. His idea
was tagging pages in guest memory; freeing a page involves:
- drop the EPT PTE
- check the page is still free
- unpin the page

This way you can hint without any exits at all, if you like.

> 
> "Pages which have been locked into memory pose an extra challenge here -
> they can be part of the page cache, but they still shouldn't be taken
> away by the host system. So such pages cannot be marked as "volatile."
> The problem is that figuring out if a page is locked is harder than it
> might seem; it can involve scanning a list of virtual memory area (VMA)
> structures, which is slow. So the hinting patches add a new flag to the
> address_space structure to note that somebody has locked pages from that
> address space in memory."
> 
> I assume locked here actually means pinned.

Locked seems to mean mlock there.

This seems to also resemble Xen's tmem a bit.


> -- 
> 
> Thanks,
> 
> David / dhildenb
Alexander Duyck Feb. 19, 2019, 9:57 p.m. UTC | #69
On Tue, Feb 19, 2019 at 10:32 AM David Hildenbrand <david@redhat.com> wrote:
>
> >>> This essentially just ends up being another trade-off of CPU versus
> >>> memory though. Assuming we aren't using THP we are going to take a
> >>> penalty in terms of performance but could then free individual pages
> >>> less than HUGETLB_PAGE_ORDER, but the CPU utilization is going to be
> >>> much higher in general even without the hinting. I figure for x86 we
> >>> probably don't have too many options since if I am not mistaken
> >>> MAX_ORDER is just one or two more than HUGETLB_PAGE_ORDER.
> >>
> >> THP is an implementation detail in the hypervisor. Yes, it is the common
> >> case on x86. But it is e.g. not available on s390x yet. And we also want
> >> this mechanism to work on s390x (e.g. for nested virtualization setups
> >> as discussed).
> >>
> >> If we e.g. report any granularity after merging was done in the buddy,
> >> we could end up reporting everything from page size up to MAX_SIZE - 1,
> >> the hypervisor could ignore hints below a certain magic number, if it
> >> makes its life easier.
> >
> > For each architecture we can do a separate implementation of what to
> > hint on. We already do that for bare metal so why would we have guests
> > do the same type of hinting in the virtualization case when there are
> > fundamental differences in page size and features in each
> > architecture?
> >
> > This is another reason why I think the hypercall approach is a better
> > idea since each architecture is likely going to want to handle things
> > differently and it would be a pain to try and sort that all out in a
> > virtio driver.
>
> I can't follow. We are talking about something as simple as a minimum
> page granularity here that can easily be configured. Nothing that
> screams for different implementations. But I get your point, we could
> tune for different architectures.

I was thinking about the guest side of things. Basically, if we need to
define different page orders for different architectures, then we start
needing to do architecture-specific includes. Then if we throw in
stuff like the fact that the first level of KVM can make use of the
host-style hints, that is another thing that will differ between
architectures. I'm just worried this stuff is going to start adding up
to a bunch of "#ifdef" cruft if we are trying to do this as a virtio
driver.
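Something like the sketch below is roughly the kind of per-architecture
tuning (and the "#ifdef" cruft that comes with it) being discussed; the
HINTING_MIN_ORDER macro is hypothetical, only the CONFIG_* symbols,
HUGETLB_PAGE_ORDER and MAX_ORDER come from the kernel:

/* Hypothetical per-arch minimum hinting order; not from the series. */
#if defined(CONFIG_X86_64)
#define HINTING_MIN_ORDER       HUGETLB_PAGE_ORDER      /* roughly PMD/THP sized */
#elif defined(CONFIG_S390)
#define HINTING_MIN_ORDER       0       /* s390x already hints at page granularity */
#else
#define HINTING_MIN_ORDER       (MAX_ORDER - 1)
#endif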

> >
> >>>
> >>> As far as fragmentation my thought is that we may want to look into
> >>> adding support to the guest for prioritizing defragmentation on pages
> >>> lower than THP size. Then that way we could maintain the higher
> >>> overall performance with or without the hinting since shuffling lower
> >>> order pages around between guests would start to get expensive pretty
> >>> quick.
> >>
> >> My take would be, design an interface/mechanism that allows any kind of
> >> granularity. You can than balance between cpu overead and space shifting.
> >
> > The problem with using "any kind of granularity" is that in the case
> > of memory we are already having problems with 4K pages being deemed
> > too small of a granularity to be useful for anything and making
> > operations too expensive.
>
> No, sorry, s390x does it. And via batch reporting it could work. Not
> saying we should do page granularity, but "to be useful for anything" is
> just wrong.

Yeah, I was engaging in a bit of hyperbole. I have had a headache this
morning so I am a bit cranky.

So I am assuming the batching is the reason why you also have an
arch_alloc_page for s390, so that you can abort the hint if a page is
reallocated before the hint is processed? I just want to confirm that my
understanding of this is correct.

If that is the case I would be much happier with an asynchronous page
hint setup, as this doesn't deprive the guest of memory while waiting
on the hint. The current logic in the patches from Nitesh has the
pages unavailable to the guest while waiting on the hint, and that has
me somewhat concerned, as it is going to hurt cache locality by
guaranteeing that we cannot reuse the same page if we are doing a cycle
of alloc and free for the same page size.

> >
> > I'm open to using other page orders for other architectures. Nothing
> > says we have to stick with THP sized pages for all architectures. I
> > have just been focused on x86 and this seems like the best fit for the
> > balance between CPU and freeing of memory for now on that
> > architecture.
> >
> >> I feel like repeating myself, but on s390x hinting is done on page
> >> granularity, and I have never heard somebody say "how can I turn it off,
> >> this is slowing down my system too much.". All we know is that one
> >> hypercall per free is most probably not acceptable. We really have to
> >> play with the numbers.
> >
> > My thought was we could look at doing different implementations for
> > other architectures such as s390 and powerPC. Odds are the
> > implementations would be similar but have slight differences where
> > appropriate such as what order we should start hinting on, or if we
> > bypass the hypercall/virtio-balloon for a host native approach if
> > available.
> >
> >> I tend to like an asynchronous reporting approach as discussed in this
> >> thread, we would have to see if Nitesh could get it implemented.
> >
> > I agree it would be great if it could work. However I have concerns
> > given that work on this patch set dates back to 2017, major issues
> > such as working around device assignment have yet to be addressed, and
> > it seems like most of the effort is being focused on things that in my
> > opinion are being over-engineered for little to no benefit.
>
> I can understand that you are trying to push your solution. I would do
> the same. Again, I don't like a pure synchronous approach that works on
> one-element-at-a-time. Period. Other people might have other opinions.
> This is mine - luckily I don't have anything to say here :)
>
> MST also voted for an asynchronous solution if we can make it work.
> Nitesh made significant improvements since 2017. Complicated stuff
> needs time. No need to rush. People have been talking about free page
> hinting since 2006. I talked to various people that experimented with
> bitmap based solutions two years ago.

Now that I think I have a better understanding of how the s390x is
handling this, I'm beginning to come around to the idea of an
asynchronous setup. The one thing that has been bugging me about the
asynchronous approach is the fact that the pages are not available to
the guest while waiting on the hint to be completed. If we can do
something like an arch_alloc_page hook that aborts the hint and
allows us to keep the page available while waiting on the hint, that
would be my preferred way of handling this.
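
As a rough, purely illustrative sketch of that idea (a toy userspace
model with made-up toy_* names standing in for the roles of
arch_free_page()/arch_alloc_page(); it is not the actual hooks or the
patches), the abort boils down to a per-page flag that the allocation
path clears:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for struct page: just an atomic "hint pending" marker. */
struct toy_page {
    atomic_bool hint_pending;
};

/* Role of arch_free_page(): the freed page becomes a hint candidate. */
static void toy_free_page(struct toy_page *page)
{
    atomic_store(&page->hint_pending, true);
}

/* Role of arch_alloc_page(): reallocation aborts any pending hint, so the
 * page never has to be held back from the allocator. */
static void toy_alloc_page(struct toy_page *page)
{
    atomic_store(&page->hint_pending, false);
}

/* Hinting thread: only report pages whose hint is still pending; the
 * atomic exchange also keeps a page from being reported twice. */
static bool toy_try_report(struct toy_page *page)
{
    return atomic_exchange(&page->hint_pending, false);
}

int main(void)
{
    struct toy_page p;

    atomic_init(&p.hint_pending, false);

    toy_free_page(&p);                             /* freed, hint queued */
    toy_alloc_page(&p);                            /* reallocated before reporting */
    printf("reported: %d\n", toy_try_report(&p));  /* 0 - hint was aborted */

    toy_free_page(&p);                             /* freed again, stays free */
    printf("reported: %d\n", toy_try_report(&p));  /* 1 - hint reaches the host */
    return 0;
}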

> So much to that, if you think your solution is the way to go, please
> follow up on it. Nitesh seems to have decided to look into the
> asynchronous approach you also called "great if it could work". As long
> as we don't run into elementary blockers there, to me it all looks like
> we are making progress, which is good. If we find out asynchronous
> doesn't work, synchronous is the only alternative.

I plan to follow up in the next week or so.

> And just so you don't get me wrong: Thanks for looking and working on
> this. And thanks for sharing your opinions and insights! However making
> a decision about going your way at this point does not seem reasonable
> to me. We have plenty of time.

I appreciate the feedback. Sorry if I seemed a bit short. As I
mentioned I've had a headache most of the morning which hasn't really
helped my mood.

Thanks.

- Alex
Michael S. Tsirkin Feb. 19, 2019, 10:17 p.m. UTC | #70
On Tue, Feb 19, 2019 at 01:57:14PM -0800, Alexander Duyck wrote:
> On Tue, Feb 19, 2019 at 10:32 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > >>> This essentially just ends up being another trade-off of CPU versus
> > >>> memory though. Assuming we aren't using THP we are going to take a
> > >>> penalty in terms of performance but could then free individual pages
> > >>> less than HUGETLB_PAGE_ORDER, but the CPU utilization is going to be
> > >>> much higher in general even without the hinting. I figure for x86 we
> > >>> probably don't have too many options since if I am not mistaken
> > >>> MAX_ORDER is just one or two more than HUGETLB_PAGE_ORDER.
> > >>
> > >> THP is an implementation detail in the hypervisor. Yes, it is the common
> > >> case on x86. But it is e.g. not available on s390x yet. And we also want
> > >> this mechanism to work on s390x (e.g. for nested virtualization setups
> > >> as discussed).
> > >>
> > >> If we e.g. report any granularity after merging was done in the buddy,
> > >> we could end up reporting everything from page size up to MAX_SIZE - 1,
> > >> the hypervisor could ignore hints below a certain magic number, if it
> > >> makes its life easier.
> > >
> > > For each architecture we can do a separate implementation of what to
> > > hint on. We already do that for bare metal so why would we have guests
> > > do the same type of hinting in the virtualization case when there are
> > > fundamental differences in page size and features in each
> > > architecture?
> > >
> > > This is another reason why I think the hypercall approach is a better
> > > idea since each architecture is likely going to want to handle things
> > > differently and it would be a pain to try and sort that all out in a
> > > virtio driver.
> >
> > I can't follow. We are talking about something as simple as a minimum
> > page granularity here that can easily be configured. Nothing that
> > screams for different implementations. But I get your point, we could
> > tune for different architectures.
> 
> I was thinking about the guest side of things. Basically if we need to
> define different page orders for different architectures then we start
> needing to do architecture specific includes. Then if we throw in
> stuff like the fact that the first level of KVM can make use of the
> host style hints then that is another thing that will be a difference
> in the different architectures.

Sorry, I didn't catch this one. What are host style hints?

> I'm just worried this stuff is going
> to start adding up to a bunch of "#ifdef" cruft if we are trying to do
> this as a virtio driver.

I agree we want to avoid that.

And by comparison, if it's up to the host, or if it's tied to logic within
the guest (such as MAX_PAGE_ORDER as suggested by Linus) as opposed to the
CPU architecture, then virtio is easier, as you can re-use config space and
feature bits to negotiate host/guest capabilities. Doing that via
hypercalls would mean adding lots of hypercalls.
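
Purely as a hypothetical illustration (neither this feature bit nor this
config field exists in the virtio-balloon spec; the names below are made
up), negotiating a minimum hinting order over virtio could look roughly
like:

#include <stdint.h>
#include <stdio.h>

#define VIRTIO_BALLOON_F_HINT_ORDER 15        /* made-up feature bit */

/* Made-up config-space extension carrying the host's preference. */
struct hint_config {
    uint32_t hint_min_order;                  /* smallest order the host wants hinted */
};

/* Guest side: honour the host's minimum if the feature was negotiated,
 * otherwise fall back to a guest-chosen default (e.g. something derived
 * from MAX_ORDER or HUGETLB_PAGE_ORDER on x86). */
static uint32_t select_hint_order(int negotiated,
                                  const struct hint_config *cfg,
                                  uint32_t guest_default)
{
    if (negotiated && cfg->hint_min_order > guest_default)
        return cfg->hint_min_order;
    return guest_default;
}

int main(void)
{
    struct hint_config cfg = { .hint_min_order = 10 };

    printf("without the feature: order %u\n",
           (unsigned)select_hint_order(0, &cfg, 9));
    printf("with the feature:    order %u\n",
           (unsigned)select_hint_order(1, &cfg, 9));
    return 0;
}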

I CC'd Wei Wang, who implemented the host-driven hints currently in the
balloon. Wei, I wonder - could you try changing from MAX_PAGE_ORDER to
HUGETLB_PAGE_ORDER? Does this affect performance for you at all? Thanks!


> > >
> > >>>
> > >>> As far as fragmentation my thought is that we may want to look into
> > >>> adding support to the guest for prioritizing defragmentation on pages
> > >>> lower than THP size. Then that way we could maintain the higher
> > >>> overall performance with or without the hinting since shuffling lower
> > >>> order pages around between guests would start to get expensive pretty
> > >>> quick.
> > >>
> > >> My take would be, design an interface/mechanism that allows any kind of
> > >> granularity. You can then balance between CPU overhead and space shifting.
> > >
> > > The problem with using "any kind of granularity" is that in the case
> > > of memory we are already having problems with 4K pages being deemed
> > > too small of a granularity to be useful for anything and making
> > > operations too expensive.
> >
> > No, sorry, s390x does it. And via batch reporting it could work. Not
> > saying we should do page granularity, but "to be useful for anything" is
> > just wrong.
> 
> Yeah, I was engaging in a bit of hyperbole. I have had a headache this
> morning so I am a bit cranky.
> 
> So I am assuming the batching is the reason why you also have an
> arch_alloc_page for the s390, so that you can abort the hint if a
> page is reallocated before the hint is processed? I just want to
> confirm that my understanding of this is correct.
> 
> If that is the case I would be much happier with an asynchronous page
> hint setup, as it doesn't deprive the guest of memory while waiting
> on the hint. The current logic in the patches from Nitesh keeps the
> pages unavailable to the guest while waiting on the hint, and that has
> me somewhat concerned: it is going to hurt cache locality, since it
> guarantees that we cannot reuse the same page if we are doing a cycle
> of alloc and free for the same page size.
> > >
> > > I'm open to using other page orders for other architectures. Nothing
> > > says we have to stick with THP sized pages for all architectures. I
> > > have just been focused on x86 and this seems like the best fit for the
> > > balance between CPU and freeing of memory for now on that
> > > architecture.
> > >
> > >> I feel like repeating myself, but on s390x hinting is done on page
> > >> granularity, and I have never heard somebody say "how can I turn it off,
> > >> this is slowing down my system too much.". All we know is that one
> > >> hypercall per free is most probably not acceptable. We really have to
> > >> play with the numbers.
> > >
> > > My thought was we could look at doing different implementations for
> > > other architectures such as s390 and powerPC. Odds are the
> > > implementations would be similar but have slight differences where
> > > appropriate such as what order we should start hinting on, or if we
> > > bypass the hypercall/virtio-balloon for a host native approach if
> > > available.
> > >
> > >> I tend to like an asynchronous reporting approach as discussed in this
> > >> thread, we would have to see if Nitesh could get it implemented.
> > >
> > > I agree it would be great if it could work. However I have concerns
> > > given that work on this patch set dates back to 2017, major issues
> > > such as working around device assignment have yet to be addressed, and
> > > it seems like most of the effort is being focused on things that in my
> > > opinion are being over-engineered for little to no benefit.
> >
> > I can understand that you are trying to push your solution. I would do
> > the same. Again, I don't like a pure synchronous approach that works on
> > one-element-at-a-time. Period. Other people might have other opinions.
> > This is mine - luckily I don't have anything to say here :)
> >
> > MST also voted for an asynchronous solution if we can make it work.
> > Nitesh made significant improvements since 2017. Complicated stuff
> > needs time. No need to rush. People have been talking about free page
> > hinting since 2006. I talked to various people that experimented with
> > bitmap based solutions two years ago.
> 
> Now that I think I have a better understanding of how the s390x is
> handling this, I'm beginning to come around to the idea of an
> asynchronous setup. The one thing that has been bugging me about the
> asynchronous approach is the fact that the pages are not available to
> the guest while waiting on the hint to be completed. If we can do
> something like an arch_alloc_page hook that aborts the hint and
> allows us to keep the page available while waiting on the hint, that
> would be my preferred way of handling this.
> 
> > So much to that, if you think your solution is the way to go, please
> > follow up on it. Nitesh seems to have decided to look into the
> > asynchronous approach you also called "great if it could work". As long
> > as we don't run into elementary blockers there, to me it all looks like
> > we are making progress, which is good. If we find out asynchronous
> > doesn't work, synchronous is the only alternative.
> 
> I plan to follow up in the next week or so.
> 
> > And just so you don't get me wrong: Thanks for looking and working on
> > this. And thanks for sharing your opinions and insights! However making
> > a decision about going your way at this point does not seem reasonable
> > to me. We have plenty of time.
> 
> I appreciate the feedback. Sorry if I seemed a bit short. As I
> mentioned I've had a headache most of the morning which hasn't really
> helped my mood.
> 
> Thanks.
> 
> - Alex
David Hildenbrand Feb. 19, 2019, 10:36 p.m. UTC | #71
>> I can't follow. We are talking about something as simple as a minimum
>> page granularity here that can easily be configured. Nothing that
>> screams for different implementations. But I get your point, we could
>> tune for different architectures.
> 
> I was thinking about the guest side of things. Basically if we need to
> define different page orders for different architectures then we start
> needing to do architecture specific includes. Then if we throw in
> stuff like the fact that the first level of KVM can make use of the
> host style hints then that is another thing that will be a difference
> in the different architectures. I'm just worried this stuff is going
> to start adding up to a bunch of "#ifdef" cruft if we are trying to do
> this as a virtio driver.

I agree that something like that is to be avoided. As MST pointed out,
feature bits and config space are a nice way to solve that at least on
the virtio side.

> 
>>>
>>>>>
>>>>> As far as fragmentation my thought is that we may want to look into
>>>>> adding support to the guest for prioritizing defragmentation on pages
>>>>> lower than THP size. Then that way we could maintain the higher
>>>>> overall performance with or without the hinting since shuffling lower
>>>>> order pages around between guests would start to get expensive pretty
>>>>> quick.
>>>>
>>>> My take would be, design an interface/mechanism that allows any kind of
>>>> granularity. You can then balance between CPU overhead and space shifting.
>>>
>>> The problem with using "any kind of granularity" is that in the case
>>> of memory we are already having problems with 4K pages being deemed
>>> too small of a granularity to be useful for anything and making
>>> operations too expensive.
>>
>> No, sorry, s390x does it. And via batch reporting it could work. Not
>> saying we should do page granularity, but "to be useful for anything" is
>> just wrong.
> 
> Yeah, I was engaging in a bit of hyperbole. I have had a headache this
> morning so I am a bit cranky.

No worries, I am very happy about this discussion. :)

> 
> So I am assuming the batching is the reason why you also have an
> arch_alloc_page for the s390, so that you can abort the hint if a
> page is reallocated before the hint is processed? I just want to
> confirm that my understanding of this is correct.

s390x is very special as it actually communicates the page state to the
hypervisor via page table bits in the guest->host mapping. The reporting
is then done batched (and synchronous) via a list of PFNs. But via the
special page table bits (along with the ESSA instruction), requests for
single pages can be canceled at any time. So allocation paths are *only*
blocked by a special page lock that is also part of the page table bits
in the guest->host mapping.

I guess now you are completely confused. The main point is: the actual
reporting to _perform_ the free is synchronous + batched. Canceling is
possible at any time via a special per-page synchronization mechanism.
(That's why pages are not taken out of the buddy: canceling is easy.)

> 
> If that is the case I would be much happier with an asynchronous page
> hint setup, as it doesn't deprive the guest of memory while waiting
> on the hint. The current logic in the patches from Nitesh keeps the
> pages unavailable to the guest while waiting on the hint, and that has
> me somewhat concerned: it is going to hurt cache locality, since it
> guarantees that we cannot reuse the same page if we are doing a cycle
> of alloc and free for the same page size.

My view on the current plan and your question:

1. We queue up *potential* hints on kfree per VCPU. We might only queue
after merging in the buddy and if we exceed a certain page order.

2. When that list is full, we *try to* take these pages out of the
buddy. We then trigger asynchronous reporting for the ones where we
succeeded.

3. Once reporting returns, we put the pages back into the buddy.

Between 1 and 2, any page can be reallocated; step 2 will simply ignore a
page that is no longer in the buddy. Pages are removed from the buddy
only between 2 and 3, so hopefully only for a very short time (see the
sketch below).

In particular, the queuing might mitigate the cache locality issues. And
there are the pcpu lists, which - as far as I remember - won't be touched,
as these pages are not "official buddy pages" (yet). But this is an
interesting point to keep in mind.
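
A rough single-threaded userspace sketch of that queue -> isolate ->
report -> return flow (a toy model: a free bitmap stands in for the buddy
and a printf stands in for the virtio request; it is not the patch code),
mainly to show that pages are only unavailable between isolation and the
end of reporting:

#include <stdbool.h>
#include <stdio.h>

#define NPAGES    64
#define QUEUE_LEN 8

static bool page_is_free[NPAGES];       /* toy buddy: true = page is free */
static unsigned queued[QUEUE_LEN];      /* step 1: potential hints */
static unsigned nqueued;

static void report_to_host(const unsigned *pfns, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        printf("hint pfn %u\n", pfns[i]);  /* stand-in for the virtio request */
}

static void flush_hints(void)
{
    unsigned isolated[QUEUE_LEN], niso = 0;

    /* step 2: take pages out of the toy buddy, skipping anything that was
     * reallocated (no longer free) since it was queued */
    for (unsigned i = 0; i < nqueued; i++) {
        unsigned pfn = queued[i];

        if (page_is_free[pfn]) {
            page_is_free[pfn] = false;  /* unavailable from here ...        */
            isolated[niso++] = pfn;
        }
    }
    nqueued = 0;

    report_to_host(isolated, niso);     /* asynchronous in the real design */

    /* step 3: ... until here, when the pages go back to the buddy */
    for (unsigned i = 0; i < niso; i++)
        page_is_free[isolated[i]] = true;
}

static void page_freed(unsigned pfn)    /* called from the free path */
{
    page_is_free[pfn] = true;
    queued[nqueued++] = pfn;            /* step 1: queue a potential hint */
    if (nqueued == QUEUE_LEN)
        flush_hints();
}

int main(void)
{
    for (unsigned pfn = 0; pfn < 6; pfn++)
        page_freed(pfn);                /* queue six potential hints */

    page_is_free[3] = false;            /* pretend pfn 3 got reallocated meanwhile */
    flush_hints();                      /* pfn 3 is skipped, the rest are reported */
    return 0;
}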

> 
>>>
>>> I'm open to using other page orders for other architectures. Nothing
>>> says we have to stick with THP sized pages for all architectures. I
>>> have just been focused on x86 and this seems like the best fit for the
>>> balance between CPU and freeing of memory for now on that
>>> architecture.
>>>
>>>> I feel like repeating myself, but on s390x hinting is done on page
>>>> granularity, and I have never heard somebody say "how can I turn it off,
>>>> this is slowing down my system too much.". All we know is that one
>>>> hypercall per free is most probably not acceptable. We really have to
>>>> play with the numbers.
>>>
>>> My thought was we could look at doing different implementations for
>>> other architectures such as s390 and powerPC. Odds are the
>>> implementations would be similar but have slight differences where
>>> appropriate such as what order we should start hinting on, or if we
>>> bypass the hypercall/virtio-balloon for a host native approach if
>>> available.
>>>
>>>> I tend to like an asynchronous reporting approach as discussed in this
>>>> thread, we would have to see if Nitesh could get it implemented.
>>>
>>> I agree it would be great if it could work. However I have concerns
>>> given that work on this patch set dates back to 2017, major issues
>>> such as working around device assignment have yet to be addressed, and
>>> it seems like most of the effort is being focused on things that in my
>>> opinion are being over-engineered for little to no benefit.
>>
>> I can understand that you are trying to push your solution. I would do
>> the same. Again, I don't like a pure synchronous approach that works on
>> one-element-at-a-time. Period. Other people might have other opinions.
>> This is mine - luckily I don't have anything to say here :)
>>
>> MST also voted for an asynchronous solution if we can make it work.
>> Nitesh made significant improvements since 2017. Complicated stuff
>> needs time. No need to rush. People have been talking about free page
>> hinting since 2006. I talked to various people that experimented with
>> bitmap based solutions two years ago.
> 
> Now that I think I have a better understanding of how the s390x is
> handling this, I'm beginning to come around to the idea of an
> asynchronous setup. The one thing that has been bugging me about the
> asynchronous approach is the fact that the pages are not available to
> the guest while waiting on the hint to be completed. If we can do
> something like an arch_alloc_page hook that aborts the hint and
> allows us to keep the page available while waiting on the hint, that
> would be my preferred way of handling this.

We'll have to see if any kind of abort is easily possible and even
necessary. s390x has the advantage that this synchronization (e.g. for
aborting a free) is built into the architecture (guest->host page
tables). I guess we cannot tell how much of an issue this actually is
before Nitesh has some kind of prototype.

> 
>> So much to that, if you think your solution is the way to go, please
>> follow up on it. Nitesh seems to have decided to look into the
>> asynchronous approach you also called "great if it could work". As long
>> as we don't run into elementary blockers there, to me it all looks like
>> we are making progress, which is good. If we find out asynchronous
>> doesn't work, synchronous is the only alternative.
> 
> I plan to follow up in the next week or so.
> 
>> And just so you don't get me wrong: Thanks for looking and working on
>> this. And thanks for sharing your opinions and insights! However making
>> a decision about going your way at this point does not seem reasonable
>> to me. We have plenty of time.
> 
> I appreciate the feedback. Sorry if I seemed a bit short. As I
> mentioned I've had a headache most of the morning which hasn't really
> helped my mood.

Absolutely no problem Alex, thanks!
Alexander Duyck Feb. 23, 2019, 12:02 a.m. UTC | #72
On Mon, Feb 4, 2019 at 1:47 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> The following patch-set proposes an efficient mechanism for handing freed memory between the guest and the host. It enables the guests with no page cache to rapidly free and reclaims memory to and from the host respectively.
>
> Benefit:
> With this patch-series, in our test-case, executed on a single system and single NUMA node with 15GB memory, we were able to successfully launch atleast 5 guests
> when page hinting was enabled and 3 without it. (Detailed explanation of the test procedure is provided at the bottom).
>
> Changelog in V8:
> In this patch-series, the earlier approach [1] which was used to capture and scan the pages freed by the guest has been changed. The new approach is briefly described below:
>
> The patch-set still leverages the existing arch_free_page() to add this functionality. It maintains a per CPU array which is used to store the pages freed by the guest. The maximum number of entries which it can hold is defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it is scanned and only the pages which are available in the buddy are stored. This process continues until the array is filled with pages which are part of the buddy free list. After which it wakes up a kernel per-cpu-thread.
> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation and if the page is not reallocated and present in the buddy, the kernel thread attempts to isolate it from the buddy. If it is successfully isolated, the page is added to another per-cpu array. Once the entire scanning process is complete, all the isolated pages are reported to the host through an existing virtio-balloon driver.
>
> Known Issues:
>         * Fixed array size: The problem with having a fixed/hardcoded array size arises when the size of the guest varies. For example when the guest size increases and it starts making large allocations fixed size limits this solution's ability to capture all the freed pages. This will result in less guest free memory getting reported to the host.
>
> Known code re-work:
>         * Plan to re-use Wei's work, which communicates the poison value to the host.
>         * The nomenclatures used in virtio-balloon needs to be changed so that the code can easily be distinguished from Wei's Free Page Hint code.
>         * Sorting based on zonenum, to avoid repetitive zone locks for the same zone.
>
> Other required work:
>         * Run other benchmarks to evaluate the performance/impact of this approach.
>
> Test case:
> Setup:
> Memory-15837 MB
> Guest Memory Size-5 GB
> Swap-Disabled
> Test Program-Simple program which allocates 4GB memory via malloc, touches it via memset and exits.
> Use case-Number of guests that can be launched completely including the successful execution of the test program.
> Procedure:
> The first guest is launched and once its console is up, the test allocation program is executed with 4 GB memory request (Due to this the guest occupies almost 4-5 GB of memory in the host in a system without page hinting). Once this program exits at that time another guest is launched in the host and the same process is followed. We continue launching the guests until a guest gets killed due to low memory condition in the host.
>
> Result:
> Without Hinting-3 Guests
> With Hinting-5 to 7 Guests(Based on the amount of memory freed/captured).
>
> [1] https://www.spinics.net/lists/kvm/msg170113.html

So I tried reproducing your test and I am not having much luck.
According to the sysctl in the guest I am seeing
"vm.guest-page-hinting = 1", which is supposed to indicate that the
hinting is enabled in both QEMU and the guest, right? I'm just wanting
to verify that this is the case before I start doing any debugging.

I'm assuming you never really ran any multi-threaded tests on a
multi-CPU guest, did you? With the patches applied I am seeing
stability issues. If I enable a VM with multiple CPUs and run
something like the page_fault1 test from the will-it-scale suite I am
seeing multiple traces being generated by the guest kernel and it
ultimately just hangs.

I have included the traces below. There end up being 3 specific
issues: a double free that is detected, an RCU stall, and then it starts
complaining about a soft lockup.

Thanks.

- Alex

-- This looks like a page complaining about a double add when added to
the LRU --
[   50.479635] list_add double add: new=fffff64480000008,
prev=ffffa000fffd50c0, next=fffff64480000008.
[   50.481066] ------------[ cut here ]------------
[   50.481753] kernel BUG at lib/list_debug.c:31!
[   50.482448] invalid opcode: 0000 [#1] SMP PTI
[   50.483108] CPU: 1 PID: 852 Comm: hinting/1 Not tainted
5.0.0-rc7-next-20190219-baseline+ #50
[   50.486362] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[   50.487881] RIP: 0010:__list_add_valid+0x4b/0x70
[   50.488623] Code: 00 00 c3 48 89 c1 48 c7 c7 d8 70 10 9e 31 c0 e8
4f db c8 ff 0f 0b 48 89 c1 48 89 fe 31 c0 48 c7 c7 88 71 10 9e e8 39
db c8 ff <0f> 0b 48 89 d1 48 c7 c7 30 71 10 9e 48 89 f2 48 89 c6 31 c0
e8 20
[   50.492626] RSP: 0018:ffffb9a8c3b4bdf0 EFLAGS: 00010246
[   50.494189] RAX: 0000000000000058 RBX: ffffa000fffd50c0 RCX: 0000000000000000
[   50.496308] RDX: 0000000000000000 RSI: ffffa000df85e6c8 RDI: ffffa000df85e6c8
[   50.497876] RBP: ffffa000fffd50c0 R08: 0000000000000273 R09: 0000000000000005
[   50.498981] R10: 0000000000000000 R11: ffffb9a8c3b4bb70 R12: fffff64480000008
[   50.500077] R13: fffff64480000008 R14: fffff64480000000 R15: ffffa000fffd5000
[   50.501184] FS:  0000000000000000(0000) GS:ffffa000df840000(0000)
knlGS:0000000000000000
[   50.502432] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.503325] CR2: 00007ffff6e47000 CR3: 000000080f76c002 CR4: 0000000000160ee0
[   50.504431] Call Trace:
[   50.505464]  free_one_page+0x2b5/0x470
[   50.506070]  hyperlist_ready+0xa9/0xc0
[   50.506662]  hinting_fn+0x1db/0x3c0
[   50.507220]  smpboot_thread_fn+0x10e/0x160
[   50.507868]  kthread+0xf8/0x130
[   50.508371]  ? sort_range+0x20/0x20
[   50.508934]  ? kthread_bind+0x10/0x10
[   50.509520]  ret_from_fork+0x35/0x40
[   50.510098] Modules linked in: ip6t_rpfilter ip6t_REJECT
nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat
ebtable_broute bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle
ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw
iptable_security ebtable_filter ebtables ip6table_filter ip6_tables
sunrpc sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
kvm_intel kvm ppdev irqbypass parport_pc joydev virtio_balloon
pcc_cpufreq i2c_piix4 pcspkr parport xfs libcrc32c cirrus
drm_kms_helper ttm drm e1000 crc32c_intel virtio_blk ata_generic
floppy serio_raw pata_acpi qemu_fw_cfg
[   50.519202] ---[ end trace 141fe2acdf2e3818 ]---
[   50.519935] RIP: 0010:__list_add_valid+0x4b/0x70
[   50.520675] Code: 00 00 c3 48 89 c1 48 c7 c7 d8 70 10 9e 31 c0 e8
4f db c8 ff 0f 0b 48 89 c1 48 89 fe 31 c0 48 c7 c7 88 71 10 9e e8 39
db c8 ff <0f> 0b 48 89 d1 48 c7 c7 30 71 10 9e 48 89 f2 48 89 c6 31 c0
e8 20
[   50.523570] RSP: 0018:ffffb9a8c3b4bdf0 EFLAGS: 00010246
[   50.524399] RAX: 0000000000000058 RBX: ffffa000fffd50c0 RCX: 0000000000000000
[   50.525516] RDX: 0000000000000000 RSI: ffffa000df85e6c8 RDI: ffffa000df85e6c8
[   50.526634] RBP: ffffa000fffd50c0 R08: 0000000000000273 R09: 0000000000000005
[   50.527754] R10: 0000000000000000 R11: ffffb9a8c3b4bb70 R12: fffff64480000008
[   50.528872] R13: fffff64480000008 R14: fffff64480000000 R15: ffffa000fffd5000
[   50.530004] FS:  0000000000000000(0000) GS:ffffa000df840000(0000)
knlGS:0000000000000000
[   50.531276] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.532189] CR2: 00007ffff6e47000 CR3: 000000080f76c002 CR4: 0000000000160ee0

-- This appears to be a deadlock on the zone lock --
[  156.436784] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  156.439195] rcu: 0-...0: (0 ticks this GP)
idle=6ca/1/0x4000000000000000 softirq=10718/10718 fqs=2546
[  156.440810] rcu: 1-...0: (1 GPs behind)
idle=8f2/1/0x4000000000000000 softirq=8233/8235 fqs=2547
[  156.442320] rcu: 2-...0: (0 ticks this GP)
idle=ae2/1/0x4000000000000002 softirq=6779/6779 fqs=2547
[  156.443910] rcu: 3-...0: (0 ticks this GP)
idle=456/1/0x4000000000000000 softirq=1616/1616 fqs=2547
[  156.445454] rcu: (detected by 14, t=60109 jiffies, g=17493, q=31)
[  156.446545] Sending NMI from CPU 14 to CPUs 0:
[  156.448330] NMI backtrace for cpu 0
[  156.448331] CPU: 0 PID: 1308 Comm: page_fault1_pro Tainted: G
D           5.0.0-rc7-next-20190219-baseline+ #50
[  156.448331] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[  156.448332] RIP: 0010:queued_spin_lock_slowpath+0x21/0x1f0
[  156.448332] Code: c0 75 ec c3 90 90 90 90 90 0f 1f 44 00 00 0f 1f
44 00 00 ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3
c3 f3 90 <eb> ec 81 fe 00 01 00 00 0f 84 44 01 00 00 81 e6 00 ff ff ff
75 3e
[  156.448333] RSP: 0000:ffffb9a8c3e83c10 EFLAGS: 00000002
[  156.448339] RAX: 0000000000000001 RBX: 0000000000000007 RCX: 0000000000000001
[  156.448340] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffa000fffd6240
[  156.448340] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000006f36aa
[  156.448341] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000081
[  156.448341] R13: 0000000000100dca R14: 0000000000000000 R15: ffffa000fffd5d00
[  156.448342] FS:  00007ffff7fec440(0000) GS:ffffa000df800000(0000)
knlGS:0000000000000000
[  156.448342] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  156.448342] CR2: 00007fffefe2d000 CR3: 0000000695904004 CR4: 0000000000160ef0
[  156.448343] Call Trace:
[  156.448343]  get_page_from_freelist+0x50f/0x1280
[  156.448343]  ? get_page_from_freelist+0xa44/0x1280
[  156.448344]  __alloc_pages_nodemask+0x141/0x2e0
[  156.448344]  alloc_pages_vma+0x73/0x180
[  156.448344]  __handle_mm_fault+0xd59/0x14e0
[  156.448345]  handle_mm_fault+0xfa/0x210
[  156.448345]  __do_page_fault+0x207/0x4c0
[  156.448345]  do_page_fault+0x32/0x140
[  156.448346]  ? async_page_fault+0x8/0x30
[  156.448346]  async_page_fault+0x1e/0x30
[  156.448346] RIP: 0033:0x401840
[  156.448347] Code: 00 00 45 31 c9 31 ff 41 b8 ff ff ff ff b9 22 00
00 00 ba 03 00 00 00 be 00 00 00 08 e8 d9 f5 ff ff 48 83 f8 ff 74 2b
48 89 c2 <c6> 02 00 48 01 ea 48 83 03 01 48 89 d1 48 29 c1 48 81 f9 ff
ff ff
[  156.448347] RSP: 002b:00007fffffffc0a0 EFLAGS: 00010293
[  156.448348] RAX: 00007fffeee48000 RBX: 00007ffff7ff7000 RCX: 0000000000fe5000
[  156.448348] RDX: 00007fffefe2d000 RSI: 0000000008000000 RDI: 0000000000000000
[  156.448349] RBP: 0000000000001000 R08: ffffffffffffffff R09: 0000000000000000
[  156.448349] R10: 0000000000000022 R11: 0000000000000246 R12: 00007fffffffc240
[  156.448349] R13: 0000000000000000 R14: 0000000000610710 R15: 0000000000000005
[  156.448355] Sending NMI from CPU 14 to CPUs 1:
[  156.489676] NMI backtrace for cpu 1
[  156.489677] CPU: 1 PID: 1309 Comm: page_fault1_pro Tainted: G
D           5.0.0-rc7-next-20190219-baseline+ #50
[  156.489677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[  156.489678] RIP: 0010:queued_spin_lock_slowpath+0x21/0x1f0
[  156.489678] Code: c0 75 ec c3 90 90 90 90 90 0f 1f 44 00 00 0f 1f
44 00 00 ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3
c3 f3 90 <eb> ec 81 fe 00 01 00 00 0f 84 44 01 00 00 81 e6 00 ff ff ff
75 3e
[  156.489679] RSP: 0000:ffffb9a8c3b4bc10 EFLAGS: 00000002
[  156.489679] RAX: 0000000000000001 RBX: 0000000000000007 RCX: 0000000000000001
[  156.489680] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffa000fffd6240
[  156.489680] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000006f36aa
[  156.489680] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000081
[  156.489681] R13: 0000000000100dca R14: 0000000000000000 R15: ffffa000fffd5d00
[  156.489681] FS:  00007ffff7fec440(0000) GS:ffffa000df840000(0000)
knlGS:0000000000000000
[  156.489682] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  156.489682] CR2: 00007ffff4608000 CR3: 000000081ddf6003 CR4: 0000000000160ee0
[  156.489682] Call Trace:
[  156.489683]  get_page_from_freelist+0x50f/0x1280
[  156.489683]  ? get_page_from_freelist+0xa44/0x1280
[  156.489683]  __alloc_pages_nodemask+0x141/0x2e0
[  156.489683]  alloc_pages_vma+0x73/0x180
[  156.489684]  __handle_mm_fault+0xd59/0x14e0
[  156.489684]  handle_mm_fault+0xfa/0x210
[  156.489684]  __do_page_fault+0x207/0x4c0
[  156.489685]  do_page_fault+0x32/0x140
[  156.489685]  ? async_page_fault+0x8/0x30
[  156.489685]  async_page_fault+0x1e/0x30
[  156.489686] RIP: 0033:0x401840
[  156.489686] Code: 00 00 45 31 c9 31 ff 41 b8 ff ff ff ff b9 22 00
00 00 ba 03 00 00 00 be 00 00 00 08 e8 d9 f5 ff ff 48 83 f8 ff 74 2b
48 89 c2 <c6> 02 00 48 01 ea 48 83 03 01 48 89 d1 48 29 c1 48 81 f9 ff
ff ff
[  156.489687] RSP: 002b:00007fffffffc0a0 EFLAGS: 00010293
[  156.489687] RAX: 00007fffeee48000 RBX: 00007ffff7ff7080 RCX: 00000000057c0000
[  156.489692] RDX: 00007ffff4608000 RSI: 0000000008000000 RDI: 0000000000000000
[  156.489693] RBP: 0000000000001000 R08: ffffffffffffffff R09: 0000000000000000
[  156.489693] R10: 0000000000000022 R11: 0000000000000246 R12: 00007fffffffc240
[  156.489694] R13: 0000000000000000 R14: 000000000060f870 R15: 0000000000000005
[  156.489696] Sending NMI from CPU 14 to CPUs 2:
[  156.530601] NMI backtrace for cpu 2
[  156.530602] CPU: 2 PID: 858 Comm: hinting/2 Tainted: G      D
    5.0.0-rc7-next-20190219-baseline+ #50
[  156.530602] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[  156.530603] RIP: 0010:queued_spin_lock_slowpath+0x21/0x1f0
[  156.530603] Code: c0 75 ec c3 90 90 90 90 90 0f 1f 44 00 00 0f 1f
44 00 00 ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3
c3 f3 90 <eb> ec 81 fe 00 01 00 00 0f 84 44 01 00 00 81 e6 00 ff ff ff
75 3e
[  156.530604] RSP: 0018:ffffa000df883e38 EFLAGS: 00000002
[  156.530604] RAX: 0000000000000001 RBX: fffff644a05a0ec8 RCX: dead000000000200
[  156.530605] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffa000fffd6240
[  156.530605] RBP: ffffa000df8af340 R08: ffffa000da2b2000 R09: 0000000000000100
[  156.530606] R10: 0000000000000004 R11: 0000000000000005 R12: fffff6449fb5fb08
[  156.530606] R13: ffffa000fffd5d00 R14: 0000000000000001 R15: 0000000000000001
[  156.530606] FS:  0000000000000000(0000) GS:ffffa000df880000(0000)
knlGS:0000000000000000
[  156.530607] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  156.530607] CR2: 00007ffff6e47000 CR3: 0000000813b34003 CR4: 0000000000160ee0
[  156.530607] Call Trace:
[  156.530608]  <IRQ>
[  156.530608]  free_pcppages_bulk+0x1af/0x6d0
[  156.530608]  free_unref_page+0x54/0x70
[  156.530608]  tlb_remove_table_rcu+0x23/0x40
[  156.530609]  rcu_core+0x2b0/0x470
[  156.530609]  __do_softirq+0xde/0x2bf
[  156.530609]  irq_exit+0xd5/0xe0
[  156.530610]  smp_apic_timer_interrupt+0x74/0x140
[  156.530610]  apic_timer_interrupt+0xf/0x20
[  156.530610]  </IRQ>
[  156.530611] RIP: 0010:_raw_spin_lock+0x10/0x20
[  156.530611] Code: b8 01 00 00 00 c3 48 8b 3c 24 be 00 02 00 00 e8
f6 cf 77 ff 31 c0 c3 0f 1f 00 0f 1f 44 00 00 31 c0 ba 01 00 00 00 f0
0f b1 17 <0f> 94 c2 84 d2 74 02 f3 c3 89 c6 e9 d0 e8 7c ff 0f 1f 44 00
00 65
[  156.530612] RSP: 0018:ffffb9a8c3bf3df0 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffff13
[  156.530612] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[  156.530613] RDX: 0000000000000001 RSI: fffff6449fd4aec0 RDI: ffffa000fffd6240
[  156.530613] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000002
[  156.530613] R10: 0000000000000000 R11: 0000000000003bf3 R12: 00000000007f52bb
[  156.530614] R13: 00000000007ecca4 R14: fffff6449fd4aec0 R15: ffffa000fffd5d00
[  156.530614]  free_one_page+0x32/0x470
[  156.530614]  ? __switch_to_asm+0x40/0x70
[  156.530615]  hyperlist_ready+0xa9/0xc0
[  156.530615]  hinting_fn+0x1db/0x3c0
[  156.530615]  smpboot_thread_fn+0x10e/0x160
[  156.530616]  kthread+0xf8/0x130
[  156.530616]  ? sort_range+0x20/0x20
[  156.530616]  ? kthread_bind+0x10/0x10
[  156.530616]  ret_from_fork+0x35/0x40
[  156.530619] Sending NMI from CPU 14 to CPUs 3:
[  156.577112] NMI backtrace for cpu 3
[  156.577113] CPU: 3 PID: 1311 Comm: page_fault1_pro Tainted: G
D           5.0.0-rc7-next-20190219-baseline+ #50
[  156.577113] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[  156.577114] RIP: 0010:queued_spin_lock_slowpath+0x21/0x1f0
[  156.577114] Code: c0 75 ec c3 90 90 90 90 90 0f 1f 44 00 00 0f 1f
44 00 00 ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3
c3 f3 90 <eb> ec 81 fe 00 01 00 00 0f 84 44 01 00 00 81 e6 00 ff ff ff
75 3e
[  156.577115] RSP: 0000:ffffb9a8c407fc10 EFLAGS: 00000002
[  156.577115] RAX: 0000000000000001 RBX: 0000000000000007 RCX: 0000000000000001
[  156.577116] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffa000fffd6240
[  156.577116] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000006f36aa
[  156.577121] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000081
[  156.577122] R13: 0000000000100dca R14: 0000000000000000 R15: ffffa000fffd5d00
[  156.577122] FS:  00007ffff7fec440(0000) GS:ffffa000df8c0000(0000)
knlGS:0000000000000000
[  156.577122] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  156.577123] CR2: 00007ffff398a000 CR3: 000000081aa00003 CR4: 0000000000160ee0
[  156.577123] Call Trace:
[  156.577123]  get_page_from_freelist+0x50f/0x1280
[  156.577124]  ? get_page_from_freelist+0xa44/0x1280
[  156.577124]  ? try_charge+0x637/0x860
[  156.577124]  __alloc_pages_nodemask+0x141/0x2e0
[  156.577125]  alloc_pages_vma+0x73/0x180
[  156.577125]  __handle_mm_fault+0xd59/0x14e0
[  156.577125]  handle_mm_fault+0xfa/0x210
[  156.577126]  __do_page_fault+0x207/0x4c0
[  156.577126]  do_page_fault+0x32/0x140
[  156.577126]  ? async_page_fault+0x8/0x30
[  156.577127]  async_page_fault+0x1e/0x30
[  156.577127] RIP: 0033:0x401840
[  156.577128] Code: 00 00 45 31 c9 31 ff 41 b8 ff ff ff ff b9 22 00
00 00 ba 03 00 00 00 be 00 00 00 08 e8 d9 f5 ff ff 48 83 f8 ff 74 2b
48 89 c2 <c6> 02 00 48 01 ea 48 83 03 01 48 89 d1 48 29 c1 48 81 f9 ff
ff ff
[  156.577128] RSP: 002b:00007fffffffc0a0 EFLAGS: 00010293
[  156.577129] RAX: 00007fffeee48000 RBX: 00007ffff7ff7180 RCX: 0000000004b42000
[  156.577129] RDX: 00007ffff398a000 RSI: 0000000008000000 RDI: 0000000000000000
[  156.577130] RBP: 0000000000001000 R08: ffffffffffffffff R09: 0000000000000000
[  156.577130] R10: 0000000000000022 R11: 0000000000000246 R12: 00007fffffffc240
[  156.577130] R13: 0000000000000000 R14: 000000000060db00 R15: 0000000000000005

-- After the above two it starts spitting this one out every 10 - 30
seconds or so --
[  183.788386] watchdog: BUG: soft lockup - CPU#14 stuck for 23s!
[kworker/14:1:121]
[  183.790003] Modules linked in: ip6t_rpfilter ip6t_REJECT
nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat
ebtable_broute bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle
ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw
iptable_security ebtable_filter ebtables ip6table_filter ip6_tables
sunrpc sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
kvm_intel kvm ppdev irqbypass parport_pc joydev virtio_balloon
pcc_cpufreq i2c_piix4 pcspkr parport xfs libcrc32c cirrus
drm_kms_helper ttm drm e1000 crc32c_intel virtio_blk ata_generic
floppy serio_raw pata_acpi qemu_fw_cfg
[  183.799984] CPU: 14 PID: 121 Comm: kworker/14:1 Tainted: G      D
        5.0.0-rc7-next-20190219-baseline+ #50
[  183.801674] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[  183.803078] Workqueue: events netstamp_clear
[  183.803873] RIP: 0010:smp_call_function_many+0x206/0x260
[  183.804847] Code: e8 0f 97 7c 00 3b 05 bd d1 1e 01 0f 83 7c fe ff
ff 48 63 d0 48 8b 4d 00 48 03 0c d5 80 28 18 9e 8b 51 18 83 e2 01 74
0a f3 90 <8b> 51 18 83 e2 01 75 f6 eb c7 0f b6 4c 24 0c 48 83 c4 10 89
ef 5b
[  183.808273] RSP: 0018:ffffb9a8c35a3d38 EFLAGS: 00000202 ORIG_RAX:
ffffffffffffff13
[  183.809662] RAX: 0000000000000000 RBX: ffffa000dfba9d88 RCX: ffffa000df8301c0
[  183.810971] RDX: 0000000000000001 RSI: 0000000000000100 RDI: ffffa000dfba9d88
[  183.812268] RBP: ffffa000dfba9d80 R08: 0000000000000000 R09: 0000000000003fff
[  183.813582] R10: 0000000000000000 R11: 000000000000000f R12: ffffffff9d02f690
[  183.814884] R13: 0000000000000000 R14: ffffa000dfba9da8 R15: 0000000000000100
[  183.816195] FS:  0000000000000000(0000) GS:ffffa000dfb80000(0000)
knlGS:0000000000000000
[  183.817673] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  183.818729] CR2: 00007ffff704b080 CR3: 0000000814c48001 CR4: 0000000000160ee0
[  183.820038] Call Trace:
[  183.820510]  ? netif_receive_skb_list+0x68/0x4a0
[  183.821367]  ? poke_int3_handler+0x40/0x40
[  183.822126]  ? netif_receive_skb_list+0x69/0x4a0
[  183.822975]  on_each_cpu+0x28/0x60
[  183.823611]  ? netif_receive_skb_list+0x68/0x4a0
[  183.824467]  text_poke_bp+0x68/0xe0
[  183.825126]  ? netif_receive_skb_list+0x68/0x4a0
[  183.825983]  __jump_label_transform+0x101/0x140
[  183.826829]  arch_jump_label_transform+0x26/0x40
[  183.827687]  __jump_label_update+0x56/0xc0
[  183.828456]  static_key_enable_cpuslocked+0x57/0x80
[  183.829358]  static_key_enable+0x16/0x20
[  183.830085]  process_one_work+0x16c/0x380
[  183.830831]  worker_thread+0x49/0x3e0
[  183.831516]  kthread+0xf8/0x130
[  183.832106]  ? rescuer_thread+0x340/0x340
[  183.832848]  ? kthread_bind+0x10/0x10
[  183.833532]  ret_from_fork+0x35/0x40
Nitesh Narayan Lal Feb. 25, 2019, 1:01 p.m. UTC | #73
On 2/22/19 7:02 PM, Alexander Duyck wrote:
> On Mon, Feb 4, 2019 at 1:47 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> The following patch-set proposes an efficient mechanism for handing freed memory between the guest and the host. It enables the guests with no page cache to rapidly free and reclaims memory to and from the host respectively.
>>
>> Benefit:
>> With this patch-series, in our test-case, executed on a single system and single NUMA node with 15GB memory, we were able to successfully launch atleast 5 guests
>> when page hinting was enabled and 3 without it. (Detailed explanation of the test procedure is provided at the bottom).
>>
>> Changelog in V8:
>> In this patch-series, the earlier approach [1] which was used to capture and scan the pages freed by the guest has been changed. The new approach is briefly described below:
>>
>> The patch-set still leverages the existing arch_free_page() to add this functionality. It maintains a per CPU array which is used to store the pages freed by the guest. The maximum number of entries which it can hold is defined by MAX_FGPT_ENTRIES(1000). When the array is completely filled, it is scanned and only the pages which are available in the buddy are stored. This process continues until the array is filled with pages which are part of the buddy free list. After which it wakes up a kernel per-cpu-thread.
>> This kernel per-cpu-thread rescans the per-cpu-array for any re-allocation and if the page is not reallocated and present in the buddy, the kernel thread attempts to isolate it from the buddy. If it is successfully isolated, the page is added to another per-cpu array. Once the entire scanning process is complete, all the isolated pages are reported to the host through an existing virtio-balloon driver.
>>
>> Known Issues:
>>         * Fixed array size: The problem with having a fixed/hardcoded array size arises when the size of the guest varies. For example when the guest size increases and it starts making large allocations fixed size limits this solution's ability to capture all the freed pages. This will result in less guest free memory getting reported to the host.
>>
>> Known code re-work:
>>         * Plan to re-use Wei's work, which communicates the poison value to the host.
>>         * The nomenclatures used in virtio-balloon needs to be changed so that the code can easily be distinguished from Wei's Free Page Hint code.
>>         * Sorting based on zonenum, to avoid repetitive zone locks for the same zone.
>>
>> Other required work:
>>         * Run other benchmarks to evaluate the performance/impact of this approach.
>>
>> Test case:
>> Setup:
>> Memory-15837 MB
>> Guest Memory Size-5 GB
>> Swap-Disabled
>> Test Program-Simple program which allocates 4GB memory via malloc, touches it via memset and exits.
>> Use case-Number of guests that can be launched completely including the successful execution of the test program.
>> Procedure:
>> The first guest is launched and once its console is up, the test allocation program is executed with 4 GB memory request (Due to this the guest occupies almost 4-5 GB of memory in the host in a system without page hinting). Once this program exits at that time another guest is launched in the host and the same process is followed. We continue launching the guests until a guest gets killed due to low memory condition in the host.
>>
>> Result:
>> Without Hinting-3 Guests
>> With Hinting-5 to 7 Guests(Based on the amount of memory freed/captured).
>>
>> [1] https://www.spinics.net/lists/kvm/msg170113.html
> So I tried reproducing your test and I am not having much luck.
> According to the sysctl in the guest I am seeing
> "vm.guest-page-hinting = 1", which is supposed to indicate that the
> hinting is enabled in both QEMU and the guest, right?
That is correct. If your guest has the balloon driver enabled it will
also enable the hinting.
> I'm just wanting
> to verify that this is the case before I start doing any debugging.
>
> I'm assuming you never really ran any multi-threaded tests on a
> multi-CPU guest, did you?
This is correct. I forgot to mention this as another todo item for me in
the cover email.
I will test with multiple vCPUs once I finalize the design changes I am
currently working on.
Thanks for pointing this out.
>  With the patches applied I am seeing
> stability issues. If I enable a VM with multiple CPUs and run
> something like the page_fault1 test from the will-it-scale suite I am
> seeing multiple traces being generated by the guest kernel and it
> ultimately just hangs.
Once I am done with the changes on which I am currently working, I will
look into this as well.
>
> I have included the traces below. There end up being 3 specific
> issues: a double free that is detected, an RCU stall, and then it starts
> complaining about a soft lockup.
>
> Thanks.
>
> - Alex
>
> -- This looks like a page complaining about a double add when added to
> the LRU --
> [   50.479635] list_add double add: new=fffff64480000008,
> prev=ffffa000fffd50c0, next=fffff64480000008.
> [   50.481066] ------------[ cut here ]------------
> [   50.481753] kernel BUG at lib/list_debug.c:31!
> [   50.482448] invalid opcode: 0000 [#1] SMP PTI
> [   50.483108] CPU: 1 PID: 852 Comm: hinting/1 Not tainted
> 5.0.0-rc7-next-20190219-baseline+ #50
> [   50.486362] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Bochs 01/01/2011
> [   50.487881] RIP: 0010:__list_add_valid+0x4b/0x70
> [   50.488623] Code: 00 00 c3 48 89 c1 48 c7 c7 d8 70 10 9e 31 c0 e8
> 4f db c8 ff 0f 0b 48 89 c1 48 89 fe 31 c0 48 c7 c7 88 71 10 9e e8 39
> db c8 ff <0f> 0b 48 89 d1 48 c7 c7 30 71 10 9e 48 89 f2 48 89 c6 31 c0
> e8 20
> [   50.492626] RSP: 0018:ffffb9a8c3b4bdf0 EFLAGS: 00010246
> [   50.494189] RAX: 0000000000000058 RBX: ffffa000fffd50c0 RCX: 0000000000000000
> [   50.496308] RDX: 0000000000000000 RSI: ffffa000df85e6c8 RDI: ffffa000df85e6c8
> [   50.497876] RBP: ffffa000fffd50c0 R08: 0000000000000273 R09: 0000000000000005
> [   50.498981] R10: 0000000000000000 R11: ffffb9a8c3b4bb70 R12: fffff64480000008
> [   50.500077] R13: fffff64480000008 R14: fffff64480000000 R15: ffffa000fffd5000
> [   50.501184] FS:  0000000000000000(0000) GS:ffffa000df840000(0000)
> knlGS:0000000000000000
> [   50.502432] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   50.503325] CR2: 00007ffff6e47000 CR3: 000000080f76c002 CR4: 0000000000160ee0
> [   50.504431] Call Trace:
> [   50.505464]  free_one_page+0x2b5/0x470
> [   50.506070]  hyperlist_ready+0xa9/0xc0
> [   50.506662]  hinting_fn+0x1db/0x3c0
> [   50.507220]  smpboot_thread_fn+0x10e/0x160
> [   50.507868]  kthread+0xf8/0x130
> [   50.508371]  ? sort_range+0x20/0x20
> [   50.508934]  ? kthread_bind+0x10/0x10
> [   50.509520]  ret_from_fork+0x35/0x40
> [   50.510098] Modules linked in: ip6t_rpfilter ip6t_REJECT
> nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat
> ebtable_broute bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle
> ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw
> iptable_security ebtable_filter ebtables ip6table_filter ip6_tables
> sunrpc sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
> kvm_intel kvm ppdev irqbypass parport_pc joydev virtio_balloon
> pcc_cpufreq i2c_piix4 pcspkr parport xfs libcrc32c cirrus
> drm_kms_helper ttm drm e1000 crc32c_intel virtio_blk ata_generic
> floppy serio_raw pata_acpi qemu_fw_cfg
> [   50.519202] ---[ end trace 141fe2acdf2e3818 ]---
> [   50.519935] RIP: 0010:__list_add_valid+0x4b/0x70
> [   50.520675] Code: 00 00 c3 48 89 c1 48 c7 c7 d8 70 10 9e 31 c0 e8
> 4f db c8 ff 0f 0b 48 89 c1 48 89 fe 31 c0 48 c7 c7 88 71 10 9e e8 39
> db c8 ff <0f> 0b 48 89 d1 48 c7 c7 30 71 10 9e 48 89 f2 48 89 c6 31 c0
> e8 20
> [   50.523570] RSP: 0018:ffffb9a8c3b4bdf0 EFLAGS: 00010246
> [   50.524399] RAX: 0000000000000058 RBX: ffffa000fffd50c0 RCX: 0000000000000000
> [   50.525516] RDX: 0000000000000000 RSI: ffffa000df85e6c8 RDI: ffffa000df85e6c8
> [   50.526634] RBP: ffffa000fffd50c0 R08: 0000000000000273 R09: 0000000000000005
> [   50.527754] R10: 0000000000000000 R11: ffffb9a8c3b4bb70 R12: fffff64480000008
> [   50.528872] R13: fffff64480000008 R14: fffff64480000000 R15: ffffa000fffd5000
> [   50.530004] FS:  0000000000000000(0000) GS:ffffa000df840000(0000)
> knlGS:0000000000000000
> [   50.531276] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   50.532189] CR2: 00007ffff6e47000 CR3: 000000080f76c002 CR4: 0000000000160ee0
>
> -- This appears to be a deadlock on the zone lock --
> [  156.436784] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [  156.439195] rcu: 0-...0: (0 ticks this GP)
> idle=6ca/1/0x4000000000000000 softirq=10718/10718 fqs=2546
> [  156.440810] rcu: 1-...0: (1 GPs behind)
> idle=8f2/1/0x4000000000000000 softirq=8233/8235 fqs=2547
> [  156.442320] rcu: 2-...0: (0 ticks this GP)
> idle=ae2/1/0x4000000000000002 softirq=6779/6779 fqs=2547
> [  156.443910] rcu: 3-...0: (0 ticks this GP)
> idle=456/1/0x4000000000000000 softirq=1616/1616 fqs=2547
> [  156.445454] rcu: (detected by 14, t=60109 jiffies, g=17493, q=31)
> [  156.446545] Sending NMI from CPU 14 to CPUs 0:
> [  156.448330] NMI backtrace for cpu 0
> [  156.448331] CPU: 0 PID: 1308 Comm: page_fault1_pro Tainted: G
> D           5.0.0-rc7-next-20190219-baseline+ #50
> [  156.448331] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Bochs 01/01/2011
> [  156.448332] RIP: 0010:queued_spin_lock_slowpath+0x21/0x1f0
> [  156.448332] Code: c0 75 ec c3 90 90 90 90 90 0f 1f 44 00 00 0f 1f
> 44 00 00 ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3
> c3 f3 90 <eb> ec 81 fe 00 01 00 00 0f 84 44 01 00 00 81 e6 00 ff ff ff
> 75 3e
> [  156.448333] RSP: 0000:ffffb9a8c3e83c10 EFLAGS: 00000002
> [  156.448339] RAX: 0000000000000001 RBX: 0000000000000007 RCX: 0000000000000001
> [  156.448340] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffa000fffd6240
> [  156.448340] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000006f36aa
> [  156.448341] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000081
> [  156.448341] R13: 0000000000100dca R14: 0000000000000000 R15: ffffa000fffd5d00
> [  156.448342] FS:  00007ffff7fec440(0000) GS:ffffa000df800000(0000)
> knlGS:0000000000000000
> [  156.448342] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  156.448342] CR2: 00007fffefe2d000 CR3: 0000000695904004 CR4: 0000000000160ef0
> [  156.448343] Call Trace:
> [  156.448343]  get_page_from_freelist+0x50f/0x1280
> [  156.448343]  ? get_page_from_freelist+0xa44/0x1280
> [  156.448344]  __alloc_pages_nodemask+0x141/0x2e0
> [  156.448344]  alloc_pages_vma+0x73/0x180
> [  156.448344]  __handle_mm_fault+0xd59/0x14e0
> [  156.448345]  handle_mm_fault+0xfa/0x210
> [  156.448345]  __do_page_fault+0x207/0x4c0
> [  156.448345]  do_page_fault+0x32/0x140
> [  156.448346]  ? async_page_fault+0x8/0x30
> [  156.448346]  async_page_fault+0x1e/0x30
> [  156.448346] RIP: 0033:0x401840
> [  156.448347] Code: 00 00 45 31 c9 31 ff 41 b8 ff ff ff ff b9 22 00
> 00 00 ba 03 00 00 00 be 00 00 00 08 e8 d9 f5 ff ff 48 83 f8 ff 74 2b
> 48 89 c2 <c6> 02 00 48 01 ea 48 83 03 01 48 89 d1 48 29 c1 48 81 f9 ff
> ff ff
> [  156.448347] RSP: 002b:00007fffffffc0a0 EFLAGS: 00010293
> [  156.448348] RAX: 00007fffeee48000 RBX: 00007ffff7ff7000 RCX: 0000000000fe5000
> [  156.448348] RDX: 00007fffefe2d000 RSI: 0000000008000000 RDI: 0000000000000000
> [  156.448349] RBP: 0000000000001000 R08: ffffffffffffffff R09: 0000000000000000
> [  156.448349] R10: 0000000000000022 R11: 0000000000000246 R12: 00007fffffffc240
> [  156.448349] R13: 0000000000000000 R14: 0000000000610710 R15: 0000000000000005
> [  156.448355] Sending NMI from CPU 14 to CPUs 1:
> [  156.489676] NMI backtrace for cpu 1
> [  156.489677] CPU: 1 PID: 1309 Comm: page_fault1_pro Tainted: G
> D           5.0.0-rc7-next-20190219-baseline+ #50
> [  156.489677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Bochs 01/01/2011
> [  156.489678] RIP: 0010:queued_spin_lock_slowpath+0x21/0x1f0
> [  156.489678] Code: c0 75 ec c3 90 90 90 90 90 0f 1f 44 00 00 0f 1f
> 44 00 00 ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3
> c3 f3 90 <eb> ec 81 fe 00 01 00 00 0f 84 44 01 00 00 81 e6 00 ff ff ff
> 75 3e
> [  156.489679] RSP: 0000:ffffb9a8c3b4bc10 EFLAGS: 00000002
> [  156.489679] RAX: 0000000000000001 RBX: 0000000000000007 RCX: 0000000000000001
> [  156.489680] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffa000fffd6240
> [  156.489680] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000006f36aa
> [  156.489680] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000081
> [  156.489681] R13: 0000000000100dca R14: 0000000000000000 R15: ffffa000fffd5d00
> [  156.489681] FS:  00007ffff7fec440(0000) GS:ffffa000df840000(0000)
> knlGS:0000000000000000
> [  156.489682] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  156.489682] CR2: 00007ffff4608000 CR3: 000000081ddf6003 CR4: 0000000000160ee0
> [  156.489682] Call Trace:
> [  156.489683]  get_page_from_freelist+0x50f/0x1280
> [  156.489683]  ? get_page_from_freelist+0xa44/0x1280
> [  156.489683]  __alloc_pages_nodemask+0x141/0x2e0
> [  156.489683]  alloc_pages_vma+0x73/0x180
> [  156.489684]  __handle_mm_fault+0xd59/0x14e0
> [  156.489684]  handle_mm_fault+0xfa/0x210
> [  156.489684]  __do_page_fault+0x207/0x4c0
> [  156.489685]  do_page_fault+0x32/0x140
> [  156.489685]  ? async_page_fault+0x8/0x30
> [  156.489685]  async_page_fault+0x1e/0x30
> [  156.489686] RIP: 0033:0x401840
> [  156.489686] Code: 00 00 45 31 c9 31 ff 41 b8 ff ff ff ff b9 22 00
> 00 00 ba 03 00 00 00 be 00 00 00 08 e8 d9 f5 ff ff 48 83 f8 ff 74 2b
> 48 89 c2 <c6> 02 00 48 01 ea 48 83 03 01 48 89 d1 48 29 c1 48 81 f9 ff
> ff ff
> [  156.489687] RSP: 002b:00007fffffffc0a0 EFLAGS: 00010293
> [  156.489687] RAX: 00007fffeee48000 RBX: 00007ffff7ff7080 RCX: 00000000057c0000
> [  156.489692] RDX: 00007ffff4608000 RSI: 0000000008000000 RDI: 0000000000000000
> [  156.489693] RBP: 0000000000001000 R08: ffffffffffffffff R09: 0000000000000000
> [  156.489693] R10: 0000000000000022 R11: 0000000000000246 R12: 00007fffffffc240
> [  156.489694] R13: 0000000000000000 R14: 000000000060f870 R15: 0000000000000005
> [  156.489696] Sending NMI from CPU 14 to CPUs 2:
> [  156.530601] NMI backtrace for cpu 2
> [  156.530602] CPU: 2 PID: 858 Comm: hinting/2 Tainted: G      D
>     5.0.0-rc7-next-20190219-baseline+ #50
> [  156.530602] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Bochs 01/01/2011
> [  156.530603] RIP: 0010:queued_spin_lock_slowpath+0x21/0x1f0
> [  156.530603] Code: c0 75 ec c3 90 90 90 90 90 0f 1f 44 00 00 0f 1f
> 44 00 00 ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3
> c3 f3 90 <eb> ec 81 fe 00 01 00 00 0f 84 44 01 00 00 81 e6 00 ff ff ff
> 75 3e
> [  156.530604] RSP: 0018:ffffa000df883e38 EFLAGS: 00000002
> [  156.530604] RAX: 0000000000000001 RBX: fffff644a05a0ec8 RCX: dead000000000200
> [  156.530605] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffa000fffd6240
> [  156.530605] RBP: ffffa000df8af340 R08: ffffa000da2b2000 R09: 0000000000000100
> [  156.530606] R10: 0000000000000004 R11: 0000000000000005 R12: fffff6449fb5fb08
> [  156.530606] R13: ffffa000fffd5d00 R14: 0000000000000001 R15: 0000000000000001
> [  156.530606] FS:  0000000000000000(0000) GS:ffffa000df880000(0000)
> knlGS:0000000000000000
> [  156.530607] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  156.530607] CR2: 00007ffff6e47000 CR3: 0000000813b34003 CR4: 0000000000160ee0
> [  156.530607] Call Trace:
> [  156.530608]  <IRQ>
> [  156.530608]  free_pcppages_bulk+0x1af/0x6d0
> [  156.530608]  free_unref_page+0x54/0x70
> [  156.530608]  tlb_remove_table_rcu+0x23/0x40
> [  156.530609]  rcu_core+0x2b0/0x470
> [  156.530609]  __do_softirq+0xde/0x2bf
> [  156.530609]  irq_exit+0xd5/0xe0
> [  156.530610]  smp_apic_timer_interrupt+0x74/0x140
> [  156.530610]  apic_timer_interrupt+0xf/0x20
> [  156.530610]  </IRQ>
> [  156.530611] RIP: 0010:_raw_spin_lock+0x10/0x20
> [  156.530611] Code: b8 01 00 00 00 c3 48 8b 3c 24 be 00 02 00 00 e8
> f6 cf 77 ff 31 c0 c3 0f 1f 00 0f 1f 44 00 00 31 c0 ba 01 00 00 00 f0
> 0f b1 17 <0f> 94 c2 84 d2 74 02 f3 c3 89 c6 e9 d0 e8 7c ff 0f 1f 44 00
> 00 65
> [  156.530612] RSP: 0018:ffffb9a8c3bf3df0 EFLAGS: 00000246 ORIG_RAX:
> ffffffffffffff13
> [  156.530612] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
> [  156.530613] RDX: 0000000000000001 RSI: fffff6449fd4aec0 RDI: ffffa000fffd6240
> [  156.530613] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000002
> [  156.530613] R10: 0000000000000000 R11: 0000000000003bf3 R12: 00000000007f52bb
> [  156.530614] R13: 00000000007ecca4 R14: fffff6449fd4aec0 R15: ffffa000fffd5d00
> [  156.530614]  free_one_page+0x32/0x470
> [  156.530614]  ? __switch_to_asm+0x40/0x70
> [  156.530615]  hyperlist_ready+0xa9/0xc0
> [  156.530615]  hinting_fn+0x1db/0x3c0
> [  156.530615]  smpboot_thread_fn+0x10e/0x160
> [  156.530616]  kthread+0xf8/0x130
> [  156.530616]  ? sort_range+0x20/0x20
> [  156.530616]  ? kthread_bind+0x10/0x10
> [  156.530616]  ret_from_fork+0x35/0x40
> [  156.530619] Sending NMI from CPU 14 to CPUs 3:
> [  156.577112] NMI backtrace for cpu 3
> [  156.577113] CPU: 3 PID: 1311 Comm: page_fault1_pro Tainted: G
> D           5.0.0-rc7-next-20190219-baseline+ #50
> [  156.577113] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Bochs 01/01/2011
> [  156.577114] RIP: 0010:queued_spin_lock_slowpath+0x21/0x1f0
> [  156.577114] Code: c0 75 ec c3 90 90 90 90 90 0f 1f 44 00 00 0f 1f
> 44 00 00 ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3
> c3 f3 90 <eb> ec 81 fe 00 01 00 00 0f 84 44 01 00 00 81 e6 00 ff ff ff
> 75 3e
> [  156.577115] RSP: 0000:ffffb9a8c407fc10 EFLAGS: 00000002
> [  156.577115] RAX: 0000000000000001 RBX: 0000000000000007 RCX: 0000000000000001
> [  156.577116] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffa000fffd6240
> [  156.577116] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000006f36aa
> [  156.577121] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000081
> [  156.577122] R13: 0000000000100dca R14: 0000000000000000 R15: ffffa000fffd5d00
> [  156.577122] FS:  00007ffff7fec440(0000) GS:ffffa000df8c0000(0000)
> knlGS:0000000000000000
> [  156.577122] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  156.577123] CR2: 00007ffff398a000 CR3: 000000081aa00003 CR4: 0000000000160ee0
> [  156.577123] Call Trace:
> [  156.577123]  get_page_from_freelist+0x50f/0x1280
> [  156.577124]  ? get_page_from_freelist+0xa44/0x1280
> [  156.577124]  ? try_charge+0x637/0x860
> [  156.577124]  __alloc_pages_nodemask+0x141/0x2e0
> [  156.577125]  alloc_pages_vma+0x73/0x180
> [  156.577125]  __handle_mm_fault+0xd59/0x14e0
> [  156.577125]  handle_mm_fault+0xfa/0x210
> [  156.577126]  __do_page_fault+0x207/0x4c0
> [  156.577126]  do_page_fault+0x32/0x140
> [  156.577126]  ? async_page_fault+0x8/0x30
> [  156.577127]  async_page_fault+0x1e/0x30
> [  156.577127] RIP: 0033:0x401840
> [  156.577128] Code: 00 00 45 31 c9 31 ff 41 b8 ff ff ff ff b9 22 00
> 00 00 ba 03 00 00 00 be 00 00 00 08 e8 d9 f5 ff ff 48 83 f8 ff 74 2b
> 48 89 c2 <c6> 02 00 48 01 ea 48 83 03 01 48 89 d1 48 29 c1 48 81 f9 ff
> ff ff
> [  156.577128] RSP: 002b:00007fffffffc0a0 EFLAGS: 00010293
> [  156.577129] RAX: 00007fffeee48000 RBX: 00007ffff7ff7180 RCX: 0000000004b42000
> [  156.577129] RDX: 00007ffff398a000 RSI: 0000000008000000 RDI: 0000000000000000
> [  156.577130] RBP: 0000000000001000 R08: ffffffffffffffff R09: 0000000000000000
> [  156.577130] R10: 0000000000000022 R11: 0000000000000246 R12: 00007fffffffc240
> [  156.577130] R13: 0000000000000000 R14: 000000000060db00 R15: 0000000000000005
>
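The CPU 2 trace looks like the root of the lockup rather than a victim of it: the hinting thread is in free_one_page(), called from hyperlist_ready(), and appears to have taken zone->lock with interrupts still enabled (EFLAGS 00000246 in the interrupted _raw_spin_lock() frame, and the same lock address ffffa000fffd6240 in RDI there, in the softirq's queued_spin_lock_slowpath() frame, and in the allocating CPUs' frames). The timer interrupt's RCU callback then re-enters the allocator through free_pcppages_bulk() on the same CPU and spins on the lock that CPU already holds. In this kernel free_one_page() takes zone->lock with a plain spin_lock() and relies on its callers to have interrupts off, the way __free_pages_ok() does. Below is a minimal sketch of the calling pattern that would avoid that deadlock; hint_release_page() is a hypothetical stand-in for whatever hyperlist_ready() does to hand pages back, not the actual code from this series.

/*
 * Illustrative sketch only -- not the code from this patch-set.
 * free_one_page(), page_to_pfn() and local_irq_save() are the real
 * kernel interfaces; hint_release_page() is made up for illustration.
 */
static void hint_release_page(struct zone *zone, struct page *page,
			      unsigned int order, int migratetype)
{
	unsigned long flags;

	/*
	 * zone->lock is also taken from softirq context (see the
	 * free_pcppages_bulk() frame above), and free_one_page()
	 * grabs it with a plain spin_lock().  Disable interrupts
	 * around the call so a timer IRQ on this CPU cannot re-enter
	 * the allocator and spin on a lock we already hold.
	 */
	local_irq_save(flags);
	free_one_page(zone, page, page_to_pfn(page), order, migratetype);
	local_irq_restore(flags);
}

If that reading is right, it also explains the soft lockup below: every other CPU ends up queued on the same zone lock in get_page_from_freelist(), so the on_each_cpu() call issued by text_poke_bp() on CPU 14 never gets answered and the watchdog fires.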
> -- After the above two, it starts spitting this one out every 10-30
> seconds or so --
> [  183.788386] watchdog: BUG: soft lockup - CPU#14 stuck for 23s!
> [kworker/14:1:121]
> [  183.790003] Modules linked in: ip6t_rpfilter ip6t_REJECT
> nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat
> ebtable_broute bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle
> ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw
> iptable_security ebtable_filter ebtables ip6table_filter ip6_tables
> sunrpc sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
> kvm_intel kvm ppdev irqbypass parport_pc joydev virtio_balloon
> pcc_cpufreq i2c_piix4 pcspkr parport xfs libcrc32c cirrus
> drm_kms_helper ttm drm e1000 crc32c_intel virtio_blk ata_generic
> floppy serio_raw pata_acpi qemu_fw_cfg
> [  183.799984] CPU: 14 PID: 121 Comm: kworker/14:1 Tainted: G      D
>         5.0.0-rc7-next-20190219-baseline+ #50
> [  183.801674] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Bochs 01/01/2011
> [  183.803078] Workqueue: events netstamp_clear
> [  183.803873] RIP: 0010:smp_call_function_many+0x206/0x260
> [  183.804847] Code: e8 0f 97 7c 00 3b 05 bd d1 1e 01 0f 83 7c fe ff
> ff 48 63 d0 48 8b 4d 00 48 03 0c d5 80 28 18 9e 8b 51 18 83 e2 01 74
> 0a f3 90 <8b> 51 18 83 e2 01 75 f6 eb c7 0f b6 4c 24 0c 48 83 c4 10 89
> ef 5b
> [  183.808273] RSP: 0018:ffffb9a8c35a3d38 EFLAGS: 00000202 ORIG_RAX:
> ffffffffffffff13
> [  183.809662] RAX: 0000000000000000 RBX: ffffa000dfba9d88 RCX: ffffa000df8301c0
> [  183.810971] RDX: 0000000000000001 RSI: 0000000000000100 RDI: ffffa000dfba9d88
> [  183.812268] RBP: ffffa000dfba9d80 R08: 0000000000000000 R09: 0000000000003fff
> [  183.813582] R10: 0000000000000000 R11: 000000000000000f R12: ffffffff9d02f690
> [  183.814884] R13: 0000000000000000 R14: ffffa000dfba9da8 R15: 0000000000000100
> [  183.816195] FS:  0000000000000000(0000) GS:ffffa000dfb80000(0000)
> knlGS:0000000000000000
> [  183.817673] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  183.818729] CR2: 00007ffff704b080 CR3: 0000000814c48001 CR4: 0000000000160ee0
> [  183.820038] Call Trace:
> [  183.820510]  ? netif_receive_skb_list+0x68/0x4a0
> [  183.821367]  ? poke_int3_handler+0x40/0x40
> [  183.822126]  ? netif_receive_skb_list+0x69/0x4a0
> [  183.822975]  on_each_cpu+0x28/0x60
> [  183.823611]  ? netif_receive_skb_list+0x68/0x4a0
> [  183.824467]  text_poke_bp+0x68/0xe0
> [  183.825126]  ? netif_receive_skb_list+0x68/0x4a0
> [  183.825983]  __jump_label_transform+0x101/0x140
> [  183.826829]  arch_jump_label_transform+0x26/0x40
> [  183.827687]  __jump_label_update+0x56/0xc0
> [  183.828456]  static_key_enable_cpuslocked+0x57/0x80
> [  183.829358]  static_key_enable+0x16/0x20
> [  183.830085]  process_one_work+0x16c/0x380
> [  183.830831]  worker_thread+0x49/0x3e0
> [  183.831516]  kthread+0xf8/0x130
> [  183.832106]  ? rescuer_thread+0x340/0x340
> [  183.832848]  ? kthread_bind+0x10/0x10
> [  183.833532]  ret_from_fork+0x35/0x40