[RFC,4/4] mm: Add merge page notifier
diff mbox series

Message ID 20190204181558.12095.83484.stgit@localhost.localdomain
State New
Headers show
Series
  • kvm: Report unused guest pages to host
Related show

Commit Message

Alexander Duyck Feb. 4, 2019, 6:15 p.m. UTC
From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Because the implementation was limiting itself to only providing hints on
pages huge TLB order sized or larger we introduced the possibility for free
pages to slip past us because they are freed as something less then
huge TLB in size and aggregated with buddies later.

To address that I am adding a new call arch_merge_page which is called
after __free_one_page has merged a pair of pages to create a higher order
page. By doing this I am able to fill the gap and provide full coverage for
all of the pages huge TLB order or larger.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 arch/x86/include/asm/page.h |   12 ++++++++++++
 arch/x86/kernel/kvm.c       |   28 ++++++++++++++++++++++++++++
 include/linux/gfp.h         |    4 ++++
 mm/page_alloc.c             |    2 ++
 4 files changed, 46 insertions(+)

Comments

Dave Hansen Feb. 4, 2019, 7:40 p.m. UTC | #1
> +void __arch_merge_page(struct zone *zone, struct page *page,
> +		       unsigned int order)
> +{
> +	/*
> +	 * The merging logic has merged a set of buddies up to the
> +	 * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
> +	 * advantage of this moment to notify the hypervisor of the free
> +	 * memory.
> +	 */
> +	if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
> +		return;
> +
> +	/*
> +	 * Drop zone lock while processing the hypercall. This
> +	 * should be safe as the page has not yet been added
> +	 * to the buddy list as of yet and all the pages that
> +	 * were merged have had their buddy/guard flags cleared
> +	 * and their order reset to 0.
> +	 */
> +	spin_unlock(&zone->lock);
> +
> +	kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
> +		       PAGE_SIZE << order);
> +
> +	/* reacquire lock and resume freeing memory */
> +	spin_lock(&zone->lock);
> +}

Why do the lock-dropping on merge but not free?  What's the difference?

This makes me really nervous.  You at *least* want to document this at
the arch_merge_page() call-site, and perhaps even the __free_one_page()
call-sites because they're near where the zone lock is taken.

The place you are calling arch_merge_page() looks OK to me, today.  But,
it can't get moved around without careful consideration.  That also
needs to be documented to warn off folks who might move code around.

The interaction between the free and merge hooks is also really
implementation-specific.  If an architecture is getting order-0
arch_free_page() notifications, it's probably worth at least documenting
that they'll *also* get arch_merge_page() notifications.

The reason x86 doesn't double-hypercall on those is not broached in the
descriptions.  That seems to be problematic.
Alexander Duyck Feb. 4, 2019, 7:51 p.m. UTC | #2
On Mon, 2019-02-04 at 11:40 -0800, Dave Hansen wrote:
> > +void __arch_merge_page(struct zone *zone, struct page *page,
> > +		       unsigned int order)
> > +{
> > +	/*
> > +	 * The merging logic has merged a set of buddies up to the
> > +	 * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
> > +	 * advantage of this moment to notify the hypervisor of the free
> > +	 * memory.
> > +	 */
> > +	if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
> > +		return;
> > +
> > +	/*
> > +	 * Drop zone lock while processing the hypercall. This
> > +	 * should be safe as the page has not yet been added
> > +	 * to the buddy list as of yet and all the pages that
> > +	 * were merged have had their buddy/guard flags cleared
> > +	 * and their order reset to 0.
> > +	 */
> > +	spin_unlock(&zone->lock);
> > +
> > +	kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
> > +		       PAGE_SIZE << order);
> > +
> > +	/* reacquire lock and resume freeing memory */
> > +	spin_lock(&zone->lock);
> > +}
> 
> Why do the lock-dropping on merge but not free?  What's the difference?

The lock has not yet been acquired in the free path. The arch_free_page
call is made from free_pages_prepare, whereas the arch_merge_page call
is made from within __free_one_page which has the requirement that the
zone lock be taken before calling the function.

> This makes me really nervous.  You at *least* want to document this at
> the arch_merge_page() call-site, and perhaps even the __free_one_page()
> call-sites because they're near where the zone lock is taken.

Okay, that makes sense. I would probably look at adding the
documentation to the arch_merge_page call-site.

> The place you are calling arch_merge_page() looks OK to me, today.  But,
> it can't get moved around without careful consideration.  That also
> needs to be documented to warn off folks who might move code around.

Agreed.

> The interaction between the free and merge hooks is also really
> implementation-specific.  If an architecture is getting order-0
> arch_free_page() notifications, it's probably worth at least documenting
> that they'll *also* get arch_merge_page() notifications.

If an architecture is getting order-0 notifications then the merge
notifications would be pointless since all the pages would be already
hinted.

I can add documentation that explains that in the case where we are
only hinting on non-zero order pages then arch_merge_page should
provide hints for when a page is merged above that threshold.

> The reason x86 doesn't double-hypercall on those is not broached in the
> descriptions.  That seems to be problematic.

I will add more documentation to address that.
Michael S. Tsirkin Feb. 10, 2019, 12:57 a.m. UTC | #3
On Mon, Feb 04, 2019 at 10:15:58AM -0800, Alexander Duyck wrote:
> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> 
> Because the implementation was limiting itself to only providing hints on
> pages huge TLB order sized or larger we introduced the possibility for free
> pages to slip past us because they are freed as something less then
> huge TLB in size and aggregated with buddies later.
> 
> To address that I am adding a new call arch_merge_page which is called
> after __free_one_page has merged a pair of pages to create a higher order
> page. By doing this I am able to fill the gap and provide full coverage for
> all of the pages huge TLB order or larger.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Looks like this will be helpful whenever active free page
hints are added. So I think it's a good idea to
add a hook.

However, could you split adding the hook to a separate
patch from the KVM hypercall based implementation?

Then e.g. Nilal's patches could reuse it too.



> ---
>  arch/x86/include/asm/page.h |   12 ++++++++++++
>  arch/x86/kernel/kvm.c       |   28 ++++++++++++++++++++++++++++
>  include/linux/gfp.h         |    4 ++++
>  mm/page_alloc.c             |    2 ++
>  4 files changed, 46 insertions(+)
> 
> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> index 4487ad7a3385..9540a97c9997 100644
> --- a/arch/x86/include/asm/page.h
> +++ b/arch/x86/include/asm/page.h
> @@ -29,6 +29,18 @@ static inline void arch_free_page(struct page *page, unsigned int order)
>  	if (static_branch_unlikely(&pv_free_page_hint_enabled))
>  		__arch_free_page(page, order);
>  }
> +
> +struct zone;
> +
> +#define HAVE_ARCH_MERGE_PAGE
> +void __arch_merge_page(struct zone *zone, struct page *page,
> +		       unsigned int order);
> +static inline void arch_merge_page(struct zone *zone, struct page *page,
> +				   unsigned int order)
> +{
> +	if (static_branch_unlikely(&pv_free_page_hint_enabled))
> +		__arch_merge_page(zone, page, order);
> +}
>  #endif
>  
>  #include <linux/range.h>
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 09c91641c36c..957bb4f427bb 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -785,6 +785,34 @@ void __arch_free_page(struct page *page, unsigned int order)
>  		       PAGE_SIZE << order);
>  }
>  
> +void __arch_merge_page(struct zone *zone, struct page *page,
> +		       unsigned int order)
> +{
> +	/*
> +	 * The merging logic has merged a set of buddies up to the
> +	 * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
> +	 * advantage of this moment to notify the hypervisor of the free
> +	 * memory.
> +	 */
> +	if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
> +		return;
> +
> +	/*
> +	 * Drop zone lock while processing the hypercall. This
> +	 * should be safe as the page has not yet been added
> +	 * to the buddy list as of yet and all the pages that
> +	 * were merged have had their buddy/guard flags cleared
> +	 * and their order reset to 0.
> +	 */
> +	spin_unlock(&zone->lock);
> +
> +	kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
> +		       PAGE_SIZE << order);
> +
> +	/* reacquire lock and resume freeing memory */
> +	spin_lock(&zone->lock);
> +}
> +
>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>  
>  /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index fdab7de7490d..4746d5560193 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -459,6 +459,10 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>  #ifndef HAVE_ARCH_FREE_PAGE
>  static inline void arch_free_page(struct page *page, int order) { }
>  #endif
> +#ifndef HAVE_ARCH_MERGE_PAGE
> +static inline void
> +arch_merge_page(struct zone *zone, struct page *page, int order) { }
> +#endif
>  #ifndef HAVE_ARCH_ALLOC_PAGE
>  static inline void arch_alloc_page(struct page *page, int order) { }
>  #endif
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c954f8c1fbc4..7a1309b0b7c5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -913,6 +913,8 @@ static inline void __free_one_page(struct page *page,
>  		page = page + (combined_pfn - pfn);
>  		pfn = combined_pfn;
>  		order++;
> +
> +		arch_merge_page(zone, page, order);
>  	}
>  	if (max_order < MAX_ORDER) {
>  		/* If we are here, it means order is >= pageblock_order.
Aaron Lu Feb. 11, 2019, 6:40 a.m. UTC | #4
On 2019/2/5 2:15, Alexander Duyck wrote:
> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> 
> Because the implementation was limiting itself to only providing hints on
> pages huge TLB order sized or larger we introduced the possibility for free
> pages to slip past us because they are freed as something less then
> huge TLB in size and aggregated with buddies later.
> 
> To address that I am adding a new call arch_merge_page which is called
> after __free_one_page has merged a pair of pages to create a higher order
> page. By doing this I am able to fill the gap and provide full coverage for
> all of the pages huge TLB order or larger.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> ---
>  arch/x86/include/asm/page.h |   12 ++++++++++++
>  arch/x86/kernel/kvm.c       |   28 ++++++++++++++++++++++++++++
>  include/linux/gfp.h         |    4 ++++
>  mm/page_alloc.c             |    2 ++
>  4 files changed, 46 insertions(+)
> 
> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> index 4487ad7a3385..9540a97c9997 100644
> --- a/arch/x86/include/asm/page.h
> +++ b/arch/x86/include/asm/page.h
> @@ -29,6 +29,18 @@ static inline void arch_free_page(struct page *page, unsigned int order)
>  	if (static_branch_unlikely(&pv_free_page_hint_enabled))
>  		__arch_free_page(page, order);
>  }
> +
> +struct zone;
> +
> +#define HAVE_ARCH_MERGE_PAGE
> +void __arch_merge_page(struct zone *zone, struct page *page,
> +		       unsigned int order);
> +static inline void arch_merge_page(struct zone *zone, struct page *page,
> +				   unsigned int order)
> +{
> +	if (static_branch_unlikely(&pv_free_page_hint_enabled))
> +		__arch_merge_page(zone, page, order);
> +}
>  #endif
>  
>  #include <linux/range.h>
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 09c91641c36c..957bb4f427bb 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -785,6 +785,34 @@ void __arch_free_page(struct page *page, unsigned int order)
>  		       PAGE_SIZE << order);
>  }
>  
> +void __arch_merge_page(struct zone *zone, struct page *page,
> +		       unsigned int order)
> +{
> +	/*
> +	 * The merging logic has merged a set of buddies up to the
> +	 * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
> +	 * advantage of this moment to notify the hypervisor of the free
> +	 * memory.
> +	 */
> +	if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
> +		return;
> +
> +	/*
> +	 * Drop zone lock while processing the hypercall. This
> +	 * should be safe as the page has not yet been added
> +	 * to the buddy list as of yet and all the pages that
> +	 * were merged have had their buddy/guard flags cleared
> +	 * and their order reset to 0.
> +	 */
> +	spin_unlock(&zone->lock);
> +
> +	kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
> +		       PAGE_SIZE << order);
> +
> +	/* reacquire lock and resume freeing memory */
> +	spin_lock(&zone->lock);
> +}
> +
>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>  
>  /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index fdab7de7490d..4746d5560193 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -459,6 +459,10 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>  #ifndef HAVE_ARCH_FREE_PAGE
>  static inline void arch_free_page(struct page *page, int order) { }
>  #endif
> +#ifndef HAVE_ARCH_MERGE_PAGE
> +static inline void
> +arch_merge_page(struct zone *zone, struct page *page, int order) { }
> +#endif
>  #ifndef HAVE_ARCH_ALLOC_PAGE
>  static inline void arch_alloc_page(struct page *page, int order) { }
>  #endif
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c954f8c1fbc4..7a1309b0b7c5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -913,6 +913,8 @@ static inline void __free_one_page(struct page *page,
>  		page = page + (combined_pfn - pfn);
>  		pfn = combined_pfn;
>  		order++;
> +
> +		arch_merge_page(zone, page, order);

Not a proper place AFAICS.

Assume we have an order-8 page being sent here for merge and its order-8
buddy is also free, then order++ became 9 and arch_merge_page() will do
the hint to host on this page as an order-9 page, no problem so far.
Then the next round, assume the now order-9 page's buddy is also free,
order++ will become 10 and arch_merge_page() will again hint to host on
this page as an order-10 page. The first hint to host became redundant.

I think the proper place is after the done_merging tag.

BTW, with arch_merge_page() at the proper place, I don't think patch3/4
is necessary - any freed page will go through merge anyway, we won't
lose any hint opportunity. Or do I miss anything?

>  	}
>  	if (max_order < MAX_ORDER) {
>  		/* If we are here, it means order is >= pageblock_order.
>
Nitesh Narayan Lal Feb. 11, 2019, 1:30 p.m. UTC | #5
On 2/9/19 7:57 PM, Michael S. Tsirkin wrote:
> On Mon, Feb 04, 2019 at 10:15:58AM -0800, Alexander Duyck wrote:
>> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>
>> Because the implementation was limiting itself to only providing hints on
>> pages huge TLB order sized or larger we introduced the possibility for free
>> pages to slip past us because they are freed as something less then
>> huge TLB in size and aggregated with buddies later.
>>
>> To address that I am adding a new call arch_merge_page which is called
>> after __free_one_page has merged a pair of pages to create a higher order
>> page. By doing this I am able to fill the gap and provide full coverage for
>> all of the pages huge TLB order or larger.
>>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Looks like this will be helpful whenever active free page
> hints are added. So I think it's a good idea to
> add a hook.
>
> However, could you split adding the hook to a separate
> patch from the KVM hypercall based implementation?
>
> Then e.g. Nilal's patches could reuse it too.
With the current design of my patch-set, if I use this hook to report
free pages. I will end up making redundant hints for the same pfns.

This is because the pages once freed by the host, are returned back to
the buddy.

>
>
>> ---
>>  arch/x86/include/asm/page.h |   12 ++++++++++++
>>  arch/x86/kernel/kvm.c       |   28 ++++++++++++++++++++++++++++
>>  include/linux/gfp.h         |    4 ++++
>>  mm/page_alloc.c             |    2 ++
>>  4 files changed, 46 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
>> index 4487ad7a3385..9540a97c9997 100644
>> --- a/arch/x86/include/asm/page.h
>> +++ b/arch/x86/include/asm/page.h
>> @@ -29,6 +29,18 @@ static inline void arch_free_page(struct page *page, unsigned int order)
>>  	if (static_branch_unlikely(&pv_free_page_hint_enabled))
>>  		__arch_free_page(page, order);
>>  }
>> +
>> +struct zone;
>> +
>> +#define HAVE_ARCH_MERGE_PAGE
>> +void __arch_merge_page(struct zone *zone, struct page *page,
>> +		       unsigned int order);
>> +static inline void arch_merge_page(struct zone *zone, struct page *page,
>> +				   unsigned int order)
>> +{
>> +	if (static_branch_unlikely(&pv_free_page_hint_enabled))
>> +		__arch_merge_page(zone, page, order);
>> +}
>>  #endif
>>  
>>  #include <linux/range.h>
>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>> index 09c91641c36c..957bb4f427bb 100644
>> --- a/arch/x86/kernel/kvm.c
>> +++ b/arch/x86/kernel/kvm.c
>> @@ -785,6 +785,34 @@ void __arch_free_page(struct page *page, unsigned int order)
>>  		       PAGE_SIZE << order);
>>  }
>>  
>> +void __arch_merge_page(struct zone *zone, struct page *page,
>> +		       unsigned int order)
>> +{
>> +	/*
>> +	 * The merging logic has merged a set of buddies up to the
>> +	 * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
>> +	 * advantage of this moment to notify the hypervisor of the free
>> +	 * memory.
>> +	 */
>> +	if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
>> +		return;
>> +
>> +	/*
>> +	 * Drop zone lock while processing the hypercall. This
>> +	 * should be safe as the page has not yet been added
>> +	 * to the buddy list as of yet and all the pages that
>> +	 * were merged have had their buddy/guard flags cleared
>> +	 * and their order reset to 0.
>> +	 */
>> +	spin_unlock(&zone->lock);
>> +
>> +	kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
>> +		       PAGE_SIZE << order);
>> +
>> +	/* reacquire lock and resume freeing memory */
>> +	spin_lock(&zone->lock);
>> +}
>> +
>>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>>  
>>  /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index fdab7de7490d..4746d5560193 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -459,6 +459,10 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>>  #ifndef HAVE_ARCH_FREE_PAGE
>>  static inline void arch_free_page(struct page *page, int order) { }
>>  #endif
>> +#ifndef HAVE_ARCH_MERGE_PAGE
>> +static inline void
>> +arch_merge_page(struct zone *zone, struct page *page, int order) { }
>> +#endif
>>  #ifndef HAVE_ARCH_ALLOC_PAGE
>>  static inline void arch_alloc_page(struct page *page, int order) { }
>>  #endif
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index c954f8c1fbc4..7a1309b0b7c5 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -913,6 +913,8 @@ static inline void __free_one_page(struct page *page,
>>  		page = page + (combined_pfn - pfn);
>>  		pfn = combined_pfn;
>>  		order++;
>> +
>> +		arch_merge_page(zone, page, order);
>>  	}
>>  	if (max_order < MAX_ORDER) {
>>  		/* If we are here, it means order is >= pageblock_order.
Michael S. Tsirkin Feb. 11, 2019, 2:17 p.m. UTC | #6
On Mon, Feb 11, 2019 at 08:30:03AM -0500, Nitesh Narayan Lal wrote:
> 
> On 2/9/19 7:57 PM, Michael S. Tsirkin wrote:
> > On Mon, Feb 04, 2019 at 10:15:58AM -0800, Alexander Duyck wrote:
> >> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> >>
> >> Because the implementation was limiting itself to only providing hints on
> >> pages huge TLB order sized or larger we introduced the possibility for free
> >> pages to slip past us because they are freed as something less then
> >> huge TLB in size and aggregated with buddies later.
> >>
> >> To address that I am adding a new call arch_merge_page which is called
> >> after __free_one_page has merged a pair of pages to create a higher order
> >> page. By doing this I am able to fill the gap and provide full coverage for
> >> all of the pages huge TLB order or larger.
> >>
> >> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > Looks like this will be helpful whenever active free page
> > hints are added. So I think it's a good idea to
> > add a hook.
> >
> > However, could you split adding the hook to a separate
> > patch from the KVM hypercall based implementation?
> >
> > Then e.g. Nilal's patches could reuse it too.
> With the current design of my patch-set, if I use this hook to report
> free pages. I will end up making redundant hints for the same pfns.
> 
> This is because the pages once freed by the host, are returned back to
> the buddy.

Suggestions on how you'd like to fix this? You do need this if
you introduce a size cut-off right?

> >
> >
> >> ---
> >>  arch/x86/include/asm/page.h |   12 ++++++++++++
> >>  arch/x86/kernel/kvm.c       |   28 ++++++++++++++++++++++++++++
> >>  include/linux/gfp.h         |    4 ++++
> >>  mm/page_alloc.c             |    2 ++
> >>  4 files changed, 46 insertions(+)
> >>
> >> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> >> index 4487ad7a3385..9540a97c9997 100644
> >> --- a/arch/x86/include/asm/page.h
> >> +++ b/arch/x86/include/asm/page.h
> >> @@ -29,6 +29,18 @@ static inline void arch_free_page(struct page *page, unsigned int order)
> >>  	if (static_branch_unlikely(&pv_free_page_hint_enabled))
> >>  		__arch_free_page(page, order);
> >>  }
> >> +
> >> +struct zone;
> >> +
> >> +#define HAVE_ARCH_MERGE_PAGE
> >> +void __arch_merge_page(struct zone *zone, struct page *page,
> >> +		       unsigned int order);
> >> +static inline void arch_merge_page(struct zone *zone, struct page *page,
> >> +				   unsigned int order)
> >> +{
> >> +	if (static_branch_unlikely(&pv_free_page_hint_enabled))
> >> +		__arch_merge_page(zone, page, order);
> >> +}
> >>  #endif
> >>  
> >>  #include <linux/range.h>
> >> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> >> index 09c91641c36c..957bb4f427bb 100644
> >> --- a/arch/x86/kernel/kvm.c
> >> +++ b/arch/x86/kernel/kvm.c
> >> @@ -785,6 +785,34 @@ void __arch_free_page(struct page *page, unsigned int order)
> >>  		       PAGE_SIZE << order);
> >>  }
> >>  
> >> +void __arch_merge_page(struct zone *zone, struct page *page,
> >> +		       unsigned int order)
> >> +{
> >> +	/*
> >> +	 * The merging logic has merged a set of buddies up to the
> >> +	 * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
> >> +	 * advantage of this moment to notify the hypervisor of the free
> >> +	 * memory.
> >> +	 */
> >> +	if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
> >> +		return;
> >> +
> >> +	/*
> >> +	 * Drop zone lock while processing the hypercall. This
> >> +	 * should be safe as the page has not yet been added
> >> +	 * to the buddy list as of yet and all the pages that
> >> +	 * were merged have had their buddy/guard flags cleared
> >> +	 * and their order reset to 0.
> >> +	 */
> >> +	spin_unlock(&zone->lock);
> >> +
> >> +	kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
> >> +		       PAGE_SIZE << order);
> >> +
> >> +	/* reacquire lock and resume freeing memory */
> >> +	spin_lock(&zone->lock);
> >> +}
> >> +
> >>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
> >>  
> >>  /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
> >> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> >> index fdab7de7490d..4746d5560193 100644
> >> --- a/include/linux/gfp.h
> >> +++ b/include/linux/gfp.h
> >> @@ -459,6 +459,10 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> >>  #ifndef HAVE_ARCH_FREE_PAGE
> >>  static inline void arch_free_page(struct page *page, int order) { }
> >>  #endif
> >> +#ifndef HAVE_ARCH_MERGE_PAGE
> >> +static inline void
> >> +arch_merge_page(struct zone *zone, struct page *page, int order) { }
> >> +#endif
> >>  #ifndef HAVE_ARCH_ALLOC_PAGE
> >>  static inline void arch_alloc_page(struct page *page, int order) { }
> >>  #endif
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index c954f8c1fbc4..7a1309b0b7c5 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -913,6 +913,8 @@ static inline void __free_one_page(struct page *page,
> >>  		page = page + (combined_pfn - pfn);
> >>  		pfn = combined_pfn;
> >>  		order++;
> >> +
> >> +		arch_merge_page(zone, page, order);
> >>  	}
> >>  	if (max_order < MAX_ORDER) {
> >>  		/* If we are here, it means order is >= pageblock_order.
> -- 
> Regards
> Nitesh
>
Alexander Duyck Feb. 11, 2019, 3:58 p.m. UTC | #7
On Mon, 2019-02-11 at 14:40 +0800, Aaron Lu wrote:
> On 2019/2/5 2:15, Alexander Duyck wrote:
> > From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > 
> > Because the implementation was limiting itself to only providing hints on
> > pages huge TLB order sized or larger we introduced the possibility for free
> > pages to slip past us because they are freed as something less then
> > huge TLB in size and aggregated with buddies later.
> > 
> > To address that I am adding a new call arch_merge_page which is called
> > after __free_one_page has merged a pair of pages to create a higher order
> > page. By doing this I am able to fill the gap and provide full coverage for
> > all of the pages huge TLB order or larger.
> > 
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > ---
> >  arch/x86/include/asm/page.h |   12 ++++++++++++
> >  arch/x86/kernel/kvm.c       |   28 ++++++++++++++++++++++++++++
> >  include/linux/gfp.h         |    4 ++++
> >  mm/page_alloc.c             |    2 ++
> >  4 files changed, 46 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> > index 4487ad7a3385..9540a97c9997 100644
> > --- a/arch/x86/include/asm/page.h
> > +++ b/arch/x86/include/asm/page.h
> > @@ -29,6 +29,18 @@ static inline void arch_free_page(struct page *page, unsigned int order)
> >  	if (static_branch_unlikely(&pv_free_page_hint_enabled))
> >  		__arch_free_page(page, order);
> >  }
> > +
> > +struct zone;
> > +
> > +#define HAVE_ARCH_MERGE_PAGE
> > +void __arch_merge_page(struct zone *zone, struct page *page,
> > +		       unsigned int order);
> > +static inline void arch_merge_page(struct zone *zone, struct page *page,
> > +				   unsigned int order)
> > +{
> > +	if (static_branch_unlikely(&pv_free_page_hint_enabled))
> > +		__arch_merge_page(zone, page, order);
> > +}
> >  #endif
> >  
> >  #include <linux/range.h>
> > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > index 09c91641c36c..957bb4f427bb 100644
> > --- a/arch/x86/kernel/kvm.c
> > +++ b/arch/x86/kernel/kvm.c
> > @@ -785,6 +785,34 @@ void __arch_free_page(struct page *page, unsigned int order)
> >  		       PAGE_SIZE << order);
> >  }
> >  
> > +void __arch_merge_page(struct zone *zone, struct page *page,
> > +		       unsigned int order)
> > +{
> > +	/*
> > +	 * The merging logic has merged a set of buddies up to the
> > +	 * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
> > +	 * advantage of this moment to notify the hypervisor of the free
> > +	 * memory.
> > +	 */
> > +	if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
> > +		return;
> > +
> > +	/*
> > +	 * Drop zone lock while processing the hypercall. This
> > +	 * should be safe as the page has not yet been added
> > +	 * to the buddy list as of yet and all the pages that
> > +	 * were merged have had their buddy/guard flags cleared
> > +	 * and their order reset to 0.
> > +	 */
> > +	spin_unlock(&zone->lock);
> > +
> > +	kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
> > +		       PAGE_SIZE << order);
> > +
> > +	/* reacquire lock and resume freeing memory */
> > +	spin_lock(&zone->lock);
> > +}
> > +
> >  #ifdef CONFIG_PARAVIRT_SPINLOCKS
> >  
> >  /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index fdab7de7490d..4746d5560193 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -459,6 +459,10 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> >  #ifndef HAVE_ARCH_FREE_PAGE
> >  static inline void arch_free_page(struct page *page, int order) { }
> >  #endif
> > +#ifndef HAVE_ARCH_MERGE_PAGE
> > +static inline void
> > +arch_merge_page(struct zone *zone, struct page *page, int order) { }
> > +#endif
> >  #ifndef HAVE_ARCH_ALLOC_PAGE
> >  static inline void arch_alloc_page(struct page *page, int order) { }
> >  #endif
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c954f8c1fbc4..7a1309b0b7c5 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -913,6 +913,8 @@ static inline void __free_one_page(struct page *page,
> >  		page = page + (combined_pfn - pfn);
> >  		pfn = combined_pfn;
> >  		order++;
> > +
> > +		arch_merge_page(zone, page, order);
> 
> Not a proper place AFAICS.
> 
> Assume we have an order-8 page being sent here for merge and its order-8
> buddy is also free, then order++ became 9 and arch_merge_page() will do
> the hint to host on this page as an order-9 page, no problem so far.
> Then the next round, assume the now order-9 page's buddy is also free,
> order++ will become 10 and arch_merge_page() will again hint to host on
> this page as an order-10 page. The first hint to host became redundant.

Actually the problem is even worse the other way around. My concern was
pages being incrementally freed.

With this setup I can catch when we have crossed the threshold from
order 8 to 9, and specifically for that case provide the hint. This
allows me to ignore orders above and below 9.

If I move the hint to the spot after the merging I have no way of
telling if I have hinted the page as a lower order or not. As such I
will hint if it is merged up to orders 9 or greater. So for example if
it merges up to order 9 and stops there then done_merging will report
an order 9 page, then if another page is freed and merged with this up
to order 10 you would be hinting on order 10. By placing the function
here I can guarantee that no more than 1 hint is provided per 2MB page.

> I think the proper place is after the done_merging tag.
> 
> BTW, with arch_merge_page() at the proper place, I don't think patch3/4
> is necessary - any freed page will go through merge anyway, we won't
> lose any hint opportunity. Or do I miss anything?

You can refer to my comment above. What I want to avoid is us hinting a
page multiple times if we aren't using MAX_ORDER - 1 as the limit. What
I am avoiding by placing this where I did is us doing a hint on orders
greater than our target hint order. So with this way I only perform one
hint per 2MB page, otherwise I would be performing multiple hints per
2MB page as every order above that would also trigger hints.
Nitesh Narayan Lal Feb. 11, 2019, 4:24 p.m. UTC | #8
On 2/11/19 9:17 AM, Michael S. Tsirkin wrote:
> On Mon, Feb 11, 2019 at 08:30:03AM -0500, Nitesh Narayan Lal wrote:
>> On 2/9/19 7:57 PM, Michael S. Tsirkin wrote:
>>> On Mon, Feb 04, 2019 at 10:15:58AM -0800, Alexander Duyck wrote:
>>>> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>>>
>>>> Because the implementation was limiting itself to only providing hints on
>>>> pages huge TLB order sized or larger we introduced the possibility for free
>>>> pages to slip past us because they are freed as something less then
>>>> huge TLB in size and aggregated with buddies later.
>>>>
>>>> To address that I am adding a new call arch_merge_page which is called
>>>> after __free_one_page has merged a pair of pages to create a higher order
>>>> page. By doing this I am able to fill the gap and provide full coverage for
>>>> all of the pages huge TLB order or larger.
>>>>
>>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>> Looks like this will be helpful whenever active free page
>>> hints are added. So I think it's a good idea to
>>> add a hook.
>>>
>>> However, could you split adding the hook to a separate
>>> patch from the KVM hypercall based implementation?
>>>
>>> Then e.g. Nilal's patches could reuse it too.
>> With the current design of my patch-set, if I use this hook to report
>> free pages. I will end up making redundant hints for the same pfns.
>>
>> This is because the pages once freed by the host, are returned back to
>> the buddy.
> Suggestions on how you'd like to fix this? You do need this if
> you introduce a size cut-off right?

I do, there are two ways to go about it.

One is to  use this and have a flag in the page structure indicating
whether that page has been freed/used or not. Though I am not sure if
this will be acceptable upstream.
Second is to find another place to invoke guest_free_page() post buddy
merging.

>
>>>
>>>> ---
>>>>  arch/x86/include/asm/page.h |   12 ++++++++++++
>>>>  arch/x86/kernel/kvm.c       |   28 ++++++++++++++++++++++++++++
>>>>  include/linux/gfp.h         |    4 ++++
>>>>  mm/page_alloc.c             |    2 ++
>>>>  4 files changed, 46 insertions(+)
>>>>
>>>> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
>>>> index 4487ad7a3385..9540a97c9997 100644
>>>> --- a/arch/x86/include/asm/page.h
>>>> +++ b/arch/x86/include/asm/page.h
>>>> @@ -29,6 +29,18 @@ static inline void arch_free_page(struct page *page, unsigned int order)
>>>>  	if (static_branch_unlikely(&pv_free_page_hint_enabled))
>>>>  		__arch_free_page(page, order);
>>>>  }
>>>> +
>>>> +struct zone;
>>>> +
>>>> +#define HAVE_ARCH_MERGE_PAGE
>>>> +void __arch_merge_page(struct zone *zone, struct page *page,
>>>> +		       unsigned int order);
>>>> +static inline void arch_merge_page(struct zone *zone, struct page *page,
>>>> +				   unsigned int order)
>>>> +{
>>>> +	if (static_branch_unlikely(&pv_free_page_hint_enabled))
>>>> +		__arch_merge_page(zone, page, order);
>>>> +}
>>>>  #endif
>>>>  
>>>>  #include <linux/range.h>
>>>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>>>> index 09c91641c36c..957bb4f427bb 100644
>>>> --- a/arch/x86/kernel/kvm.c
>>>> +++ b/arch/x86/kernel/kvm.c
>>>> @@ -785,6 +785,34 @@ void __arch_free_page(struct page *page, unsigned int order)
>>>>  		       PAGE_SIZE << order);
>>>>  }
>>>>  
>>>> +void __arch_merge_page(struct zone *zone, struct page *page,
>>>> +		       unsigned int order)
>>>> +{
>>>> +	/*
>>>> +	 * The merging logic has merged a set of buddies up to the
>>>> +	 * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
>>>> +	 * advantage of this moment to notify the hypervisor of the free
>>>> +	 * memory.
>>>> +	 */
>>>> +	if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
>>>> +		return;
>>>> +
>>>> +	/*
>>>> +	 * Drop zone lock while processing the hypercall. This
>>>> +	 * should be safe as the page has not yet been added
>>>> +	 * to the buddy list as of yet and all the pages that
>>>> +	 * were merged have had their buddy/guard flags cleared
>>>> +	 * and their order reset to 0.
>>>> +	 */
>>>> +	spin_unlock(&zone->lock);
>>>> +
>>>> +	kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
>>>> +		       PAGE_SIZE << order);
>>>> +
>>>> +	/* reacquire lock and resume freeing memory */
>>>> +	spin_lock(&zone->lock);
>>>> +}
>>>> +
>>>>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>>>>  
>>>>  /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
>>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>>> index fdab7de7490d..4746d5560193 100644
>>>> --- a/include/linux/gfp.h
>>>> +++ b/include/linux/gfp.h
>>>> @@ -459,6 +459,10 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>>>>  #ifndef HAVE_ARCH_FREE_PAGE
>>>>  static inline void arch_free_page(struct page *page, int order) { }
>>>>  #endif
>>>> +#ifndef HAVE_ARCH_MERGE_PAGE
>>>> +static inline void
>>>> +arch_merge_page(struct zone *zone, struct page *page, int order) { }
>>>> +#endif
>>>>  #ifndef HAVE_ARCH_ALLOC_PAGE
>>>>  static inline void arch_alloc_page(struct page *page, int order) { }
>>>>  #endif
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index c954f8c1fbc4..7a1309b0b7c5 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -913,6 +913,8 @@ static inline void __free_one_page(struct page *page,
>>>>  		page = page + (combined_pfn - pfn);
>>>>  		pfn = combined_pfn;
>>>>  		order++;
>>>> +
>>>> +		arch_merge_page(zone, page, order);
>>>>  	}
>>>>  	if (max_order < MAX_ORDER) {
>>>>  		/* If we are here, it means order is >= pageblock_order.
>> -- 
>> Regards
>> Nitesh
>>
>
>
Michael S. Tsirkin Feb. 11, 2019, 5:41 p.m. UTC | #9
On Mon, Feb 11, 2019 at 11:24:02AM -0500, Nitesh Narayan Lal wrote:
> 
> On 2/11/19 9:17 AM, Michael S. Tsirkin wrote:
> > On Mon, Feb 11, 2019 at 08:30:03AM -0500, Nitesh Narayan Lal wrote:
> >> On 2/9/19 7:57 PM, Michael S. Tsirkin wrote:
> >>> On Mon, Feb 04, 2019 at 10:15:58AM -0800, Alexander Duyck wrote:
> >>>> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> >>>>
> >>>> Because the implementation was limiting itself to only providing hints on
> >>>> pages huge TLB order sized or larger we introduced the possibility for free
> >>>> pages to slip past us because they are freed as something less then
> >>>> huge TLB in size and aggregated with buddies later.
> >>>>
> >>>> To address that I am adding a new call arch_merge_page which is called
> >>>> after __free_one_page has merged a pair of pages to create a higher order
> >>>> page. By doing this I am able to fill the gap and provide full coverage for
> >>>> all of the pages huge TLB order or larger.
> >>>>
> >>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> >>> Looks like this will be helpful whenever active free page
> >>> hints are added. So I think it's a good idea to
> >>> add a hook.
> >>>
> >>> However, could you split adding the hook to a separate
> >>> patch from the KVM hypercall based implementation?
> >>>
> >>> Then e.g. Nilal's patches could reuse it too.
> >> With the current design of my patch-set, if I use this hook to report
> >> free pages. I will end up making redundant hints for the same pfns.
> >>
> >> This is because the pages once freed by the host, are returned back to
> >> the buddy.
> > Suggestions on how you'd like to fix this? You do need this if
> > you introduce a size cut-off right?
> 
> I do, there are two ways to go about it.
> 
> One is to  use this and have a flag in the page structure indicating
> whether that page has been freed/used or not.

Not sure what do you mean. The refcount does this right?

> Though I am not sure if
> this will be acceptable upstream.
> Second is to find another place to invoke guest_free_page() post buddy
> merging.

Might be easier.

> >
> >>>
> >>>> ---
> >>>>  arch/x86/include/asm/page.h |   12 ++++++++++++
> >>>>  arch/x86/kernel/kvm.c       |   28 ++++++++++++++++++++++++++++
> >>>>  include/linux/gfp.h         |    4 ++++
> >>>>  mm/page_alloc.c             |    2 ++
> >>>>  4 files changed, 46 insertions(+)
> >>>>
> >>>> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> >>>> index 4487ad7a3385..9540a97c9997 100644
> >>>> --- a/arch/x86/include/asm/page.h
> >>>> +++ b/arch/x86/include/asm/page.h
> >>>> @@ -29,6 +29,18 @@ static inline void arch_free_page(struct page *page, unsigned int order)
> >>>>  	if (static_branch_unlikely(&pv_free_page_hint_enabled))
> >>>>  		__arch_free_page(page, order);
> >>>>  }
> >>>> +
> >>>> +struct zone;
> >>>> +
> >>>> +#define HAVE_ARCH_MERGE_PAGE
> >>>> +void __arch_merge_page(struct zone *zone, struct page *page,
> >>>> +		       unsigned int order);
> >>>> +static inline void arch_merge_page(struct zone *zone, struct page *page,
> >>>> +				   unsigned int order)
> >>>> +{
> >>>> +	if (static_branch_unlikely(&pv_free_page_hint_enabled))
> >>>> +		__arch_merge_page(zone, page, order);
> >>>> +}
> >>>>  #endif
> >>>>  
> >>>>  #include <linux/range.h>
> >>>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> >>>> index 09c91641c36c..957bb4f427bb 100644
> >>>> --- a/arch/x86/kernel/kvm.c
> >>>> +++ b/arch/x86/kernel/kvm.c
> >>>> @@ -785,6 +785,34 @@ void __arch_free_page(struct page *page, unsigned int order)
> >>>>  		       PAGE_SIZE << order);
> >>>>  }
> >>>>  
> >>>> +void __arch_merge_page(struct zone *zone, struct page *page,
> >>>> +		       unsigned int order)
> >>>> +{
> >>>> +	/*
> >>>> +	 * The merging logic has merged a set of buddies up to the
> >>>> +	 * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
> >>>> +	 * advantage of this moment to notify the hypervisor of the free
> >>>> +	 * memory.
> >>>> +	 */
> >>>> +	if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
> >>>> +		return;
> >>>> +
> >>>> +	/*
> >>>> +	 * Drop zone lock while processing the hypercall. This
> >>>> +	 * should be safe as the page has not yet been added
> >>>> +	 * to the buddy list as of yet and all the pages that
> >>>> +	 * were merged have had their buddy/guard flags cleared
> >>>> +	 * and their order reset to 0.
> >>>> +	 */
> >>>> +	spin_unlock(&zone->lock);
> >>>> +
> >>>> +	kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
> >>>> +		       PAGE_SIZE << order);
> >>>> +
> >>>> +	/* reacquire lock and resume freeing memory */
> >>>> +	spin_lock(&zone->lock);
> >>>> +}
> >>>> +
> >>>>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
> >>>>  
> >>>>  /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
> >>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> >>>> index fdab7de7490d..4746d5560193 100644
> >>>> --- a/include/linux/gfp.h
> >>>> +++ b/include/linux/gfp.h
> >>>> @@ -459,6 +459,10 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> >>>>  #ifndef HAVE_ARCH_FREE_PAGE
> >>>>  static inline void arch_free_page(struct page *page, int order) { }
> >>>>  #endif
> >>>> +#ifndef HAVE_ARCH_MERGE_PAGE
> >>>> +static inline void
> >>>> +arch_merge_page(struct zone *zone, struct page *page, int order) { }
> >>>> +#endif
> >>>>  #ifndef HAVE_ARCH_ALLOC_PAGE
> >>>>  static inline void arch_alloc_page(struct page *page, int order) { }
> >>>>  #endif
> >>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>> index c954f8c1fbc4..7a1309b0b7c5 100644
> >>>> --- a/mm/page_alloc.c
> >>>> +++ b/mm/page_alloc.c
> >>>> @@ -913,6 +913,8 @@ static inline void __free_one_page(struct page *page,
> >>>>  		page = page + (combined_pfn - pfn);
> >>>>  		pfn = combined_pfn;
> >>>>  		order++;
> >>>> +
> >>>> +		arch_merge_page(zone, page, order);
> >>>>  	}
> >>>>  	if (max_order < MAX_ORDER) {
> >>>>  		/* If we are here, it means order is >= pageblock_order.
> >> -- 
> >> Regards
> >> Nitesh
> >>
> >
> >
> -- 
> Regards
> Nitesh
>
Nitesh Narayan Lal Feb. 11, 2019, 6:09 p.m. UTC | #10
On 2/11/19 12:41 PM, Michael S. Tsirkin wrote:
> On Mon, Feb 11, 2019 at 11:24:02AM -0500, Nitesh Narayan Lal wrote:
>> On 2/11/19 9:17 AM, Michael S. Tsirkin wrote:
>>> On Mon, Feb 11, 2019 at 08:30:03AM -0500, Nitesh Narayan Lal wrote:
>>>> On 2/9/19 7:57 PM, Michael S. Tsirkin wrote:
>>>>> On Mon, Feb 04, 2019 at 10:15:58AM -0800, Alexander Duyck wrote:
>>>>>> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>>>>>
>>>>>> Because the implementation was limiting itself to only providing hints on
>>>>>> pages huge TLB order sized or larger we introduced the possibility for free
>>>>>> pages to slip past us because they are freed as something less then
>>>>>> huge TLB in size and aggregated with buddies later.
>>>>>>
>>>>>> To address that I am adding a new call arch_merge_page which is called
>>>>>> after __free_one_page has merged a pair of pages to create a higher order
>>>>>> page. By doing this I am able to fill the gap and provide full coverage for
>>>>>> all of the pages huge TLB order or larger.
>>>>>>
>>>>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>>>> Looks like this will be helpful whenever active free page
>>>>> hints are added. So I think it's a good idea to
>>>>> add a hook.
>>>>>
>>>>> However, could you split adding the hook to a separate
>>>>> patch from the KVM hypercall based implementation?
>>>>>
>>>>> Then e.g. Nilal's patches could reuse it too.
>>>> With the current design of my patch-set, if I use this hook to report
>>>> free pages. I will end up making redundant hints for the same pfns.
>>>>
>>>> This is because the pages once freed by the host, are returned back to
>>>> the buddy.
>>> Suggestions on how you'd like to fix this? You do need this if
>>> you introduce a size cut-off right?
>> I do, there are two ways to go about it.
>>
>> One is to  use this and have a flag in the page structure indicating
>> whether that page has been freed/used or not.
> Not sure what do you mean. The refcount does this right?
I meant a flag using which I could determine whether a PFN has been
already freed by the host or not. This is to avoid repetitive free.
>
>> Though I am not sure if
>> this will be acceptable upstream.
>> Second is to find another place to invoke guest_free_page() post buddy
>> merging.
> Might be easier.
>
>>>>>> ---
>>>>>>  arch/x86/include/asm/page.h |   12 ++++++++++++
>>>>>>  arch/x86/kernel/kvm.c       |   28 ++++++++++++++++++++++++++++
>>>>>>  include/linux/gfp.h         |    4 ++++
>>>>>>  mm/page_alloc.c             |    2 ++
>>>>>>  4 files changed, 46 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
>>>>>> index 4487ad7a3385..9540a97c9997 100644
>>>>>> --- a/arch/x86/include/asm/page.h
>>>>>> +++ b/arch/x86/include/asm/page.h
>>>>>> @@ -29,6 +29,18 @@ static inline void arch_free_page(struct page *page, unsigned int order)
>>>>>>  	if (static_branch_unlikely(&pv_free_page_hint_enabled))
>>>>>>  		__arch_free_page(page, order);
>>>>>>  }
>>>>>> +
>>>>>> +struct zone;
>>>>>> +
>>>>>> +#define HAVE_ARCH_MERGE_PAGE
>>>>>> +void __arch_merge_page(struct zone *zone, struct page *page,
>>>>>> +		       unsigned int order);
>>>>>> +static inline void arch_merge_page(struct zone *zone, struct page *page,
>>>>>> +				   unsigned int order)
>>>>>> +{
>>>>>> +	if (static_branch_unlikely(&pv_free_page_hint_enabled))
>>>>>> +		__arch_merge_page(zone, page, order);
>>>>>> +}
>>>>>>  #endif
>>>>>>  
>>>>>>  #include <linux/range.h>
>>>>>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>>>>>> index 09c91641c36c..957bb4f427bb 100644
>>>>>> --- a/arch/x86/kernel/kvm.c
>>>>>> +++ b/arch/x86/kernel/kvm.c
>>>>>> @@ -785,6 +785,34 @@ void __arch_free_page(struct page *page, unsigned int order)
>>>>>>  		       PAGE_SIZE << order);
>>>>>>  }
>>>>>>  
>>>>>> +void __arch_merge_page(struct zone *zone, struct page *page,
>>>>>> +		       unsigned int order)
>>>>>> +{
>>>>>> +	/*
>>>>>> +	 * The merging logic has merged a set of buddies up to the
>>>>>> +	 * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
>>>>>> +	 * advantage of this moment to notify the hypervisor of the free
>>>>>> +	 * memory.
>>>>>> +	 */
>>>>>> +	if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
>>>>>> +		return;
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * Drop zone lock while processing the hypercall. This
>>>>>> +	 * should be safe as the page has not yet been added
>>>>>> +	 * to the buddy list as of yet and all the pages that
>>>>>> +	 * were merged have had their buddy/guard flags cleared
>>>>>> +	 * and their order reset to 0.
>>>>>> +	 */
>>>>>> +	spin_unlock(&zone->lock);
>>>>>> +
>>>>>> +	kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
>>>>>> +		       PAGE_SIZE << order);
>>>>>> +
>>>>>> +	/* reacquire lock and resume freeing memory */
>>>>>> +	spin_lock(&zone->lock);
>>>>>> +}
>>>>>> +
>>>>>>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>>>>>>  
>>>>>>  /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
>>>>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>>>>> index fdab7de7490d..4746d5560193 100644
>>>>>> --- a/include/linux/gfp.h
>>>>>> +++ b/include/linux/gfp.h
>>>>>> @@ -459,6 +459,10 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>>>>>>  #ifndef HAVE_ARCH_FREE_PAGE
>>>>>>  static inline void arch_free_page(struct page *page, int order) { }
>>>>>>  #endif
>>>>>> +#ifndef HAVE_ARCH_MERGE_PAGE
>>>>>> +static inline void
>>>>>> +arch_merge_page(struct zone *zone, struct page *page, int order) { }
>>>>>> +#endif
>>>>>>  #ifndef HAVE_ARCH_ALLOC_PAGE
>>>>>>  static inline void arch_alloc_page(struct page *page, int order) { }
>>>>>>  #endif
>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>> index c954f8c1fbc4..7a1309b0b7c5 100644
>>>>>> --- a/mm/page_alloc.c
>>>>>> +++ b/mm/page_alloc.c
>>>>>> @@ -913,6 +913,8 @@ static inline void __free_one_page(struct page *page,
>>>>>>  		page = page + (combined_pfn - pfn);
>>>>>>  		pfn = combined_pfn;
>>>>>>  		order++;
>>>>>> +
>>>>>> +		arch_merge_page(zone, page, order);
>>>>>>  	}
>>>>>>  	if (max_order < MAX_ORDER) {
>>>>>>  		/* If we are here, it means order is >= pageblock_order.
>>>> -- 
>>>> Regards
>>>> Nitesh
>>>>
>>>
>> -- 
>> Regards
>> Nitesh
>>
>
>
Aaron Lu Feb. 12, 2019, 2:09 a.m. UTC | #11
On 2019/2/11 23:58, Alexander Duyck wrote:
> On Mon, 2019-02-11 at 14:40 +0800, Aaron Lu wrote:
>> On 2019/2/5 2:15, Alexander Duyck wrote:
>>> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>>
>>> Because the implementation was limiting itself to only providing hints on
>>> pages huge TLB order sized or larger we introduced the possibility for free
>>> pages to slip past us because they are freed as something less then
>>> huge TLB in size and aggregated with buddies later.
>>>
>>> To address that I am adding a new call arch_merge_page which is called
>>> after __free_one_page has merged a pair of pages to create a higher order
>>> page. By doing this I am able to fill the gap and provide full coverage for
>>> all of the pages huge TLB order or larger.
>>>
>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>> ---
>>>  arch/x86/include/asm/page.h |   12 ++++++++++++
>>>  arch/x86/kernel/kvm.c       |   28 ++++++++++++++++++++++++++++
>>>  include/linux/gfp.h         |    4 ++++
>>>  mm/page_alloc.c             |    2 ++
>>>  4 files changed, 46 insertions(+)
>>>
>>> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
>>> index 4487ad7a3385..9540a97c9997 100644
>>> --- a/arch/x86/include/asm/page.h
>>> +++ b/arch/x86/include/asm/page.h
>>> @@ -29,6 +29,18 @@ static inline void arch_free_page(struct page *page, unsigned int order)
>>>  	if (static_branch_unlikely(&pv_free_page_hint_enabled))
>>>  		__arch_free_page(page, order);
>>>  }
>>> +
>>> +struct zone;
>>> +
>>> +#define HAVE_ARCH_MERGE_PAGE
>>> +void __arch_merge_page(struct zone *zone, struct page *page,
>>> +		       unsigned int order);
>>> +static inline void arch_merge_page(struct zone *zone, struct page *page,
>>> +				   unsigned int order)
>>> +{
>>> +	if (static_branch_unlikely(&pv_free_page_hint_enabled))
>>> +		__arch_merge_page(zone, page, order);
>>> +}
>>>  #endif
>>>  
>>>  #include <linux/range.h>
>>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>>> index 09c91641c36c..957bb4f427bb 100644
>>> --- a/arch/x86/kernel/kvm.c
>>> +++ b/arch/x86/kernel/kvm.c
>>> @@ -785,6 +785,34 @@ void __arch_free_page(struct page *page, unsigned int order)
>>>  		       PAGE_SIZE << order);
>>>  }
>>>  
>>> +void __arch_merge_page(struct zone *zone, struct page *page,
>>> +		       unsigned int order)
>>> +{
>>> +	/*
>>> +	 * The merging logic has merged a set of buddies up to the
>>> +	 * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
>>> +	 * advantage of this moment to notify the hypervisor of the free
>>> +	 * memory.
>>> +	 */
>>> +	if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
>>> +		return;
>>> +
>>> +	/*
>>> +	 * Drop zone lock while processing the hypercall. This
>>> +	 * should be safe as the page has not yet been added
>>> +	 * to the buddy list as of yet and all the pages that
>>> +	 * were merged have had their buddy/guard flags cleared
>>> +	 * and their order reset to 0.
>>> +	 */
>>> +	spin_unlock(&zone->lock);
>>> +
>>> +	kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
>>> +		       PAGE_SIZE << order);
>>> +
>>> +	/* reacquire lock and resume freeing memory */
>>> +	spin_lock(&zone->lock);
>>> +}
>>> +
>>>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>>>  
>>>  /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>> index fdab7de7490d..4746d5560193 100644
>>> --- a/include/linux/gfp.h
>>> +++ b/include/linux/gfp.h
>>> @@ -459,6 +459,10 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>>>  #ifndef HAVE_ARCH_FREE_PAGE
>>>  static inline void arch_free_page(struct page *page, int order) { }
>>>  #endif
>>> +#ifndef HAVE_ARCH_MERGE_PAGE
>>> +static inline void
>>> +arch_merge_page(struct zone *zone, struct page *page, int order) { }
>>> +#endif
>>>  #ifndef HAVE_ARCH_ALLOC_PAGE
>>>  static inline void arch_alloc_page(struct page *page, int order) { }
>>>  #endif
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index c954f8c1fbc4..7a1309b0b7c5 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -913,6 +913,8 @@ static inline void __free_one_page(struct page *page,
>>>  		page = page + (combined_pfn - pfn);
>>>  		pfn = combined_pfn;
>>>  		order++;
>>> +
>>> +		arch_merge_page(zone, page, order);
>>
>> Not a proper place AFAICS.
>>
>> Assume we have an order-8 page being sent here for merge and its order-8
>> buddy is also free, then order++ became 9 and arch_merge_page() will do
>> the hint to host on this page as an order-9 page, no problem so far.
>> Then the next round, assume the now order-9 page's buddy is also free,
>> order++ will become 10 and arch_merge_page() will again hint to host on
>> this page as an order-10 page. The first hint to host became redundant.
> 
> Actually the problem is even worse the other way around. My concern was
> pages being incrementally freed.
> 
> With this setup I can catch when we have crossed the threshold from
> order 8 to 9, and specifically for that case provide the hint. This
> allows me to ignore orders above and below 9.

OK, I see, you are now only hinting for pages with order 9, not above.

> If I move the hint to the spot after the merging I have no way of
> telling if I have hinted the page as a lower order or not. As such I
> will hint if it is merged up to orders 9 or greater. So for example if
> it merges up to order 9 and stops there then done_merging will report
> an order 9 page, then if another page is freed and merged with this up
> to order 10 you would be hinting on order 10. By placing the function
> here I can guarantee that no more than 1 hint is provided per 2MB page.

So what's the downside of hinting the page as order-10 after merge
compared to as order-9 before the merge? I can see the same physical
range can be hinted multiple times, but the total hint number is the
same: both are 2 - in your current implementation, we hint twice for
each of the 2 order-9 pages; alternatively, we can provide hint for one
order-9 page and the merged order-10 page. I think the cost of
hypercalls are the same? Is it that we want to ease the host side
madvise(DONTNEED) since we can avoid operating the same range multiple
times?

The reason I asked is, if we can move the arch_merge_page() after
done_merging tag, we can theoretically make fewer function calls on free
path for the guest. Maybe not a big deal, I don't know...

>> I think the proper place is after the done_merging tag.
>>
>> BTW, with arch_merge_page() at the proper place, I don't think patch3/4
>> is necessary - any freed page will go through merge anyway, we won't
>> lose any hint opportunity. Or do I miss anything?
> 
> You can refer to my comment above. What I want to avoid is us hinting a
> page multiple times if we aren't using MAX_ORDER - 1 as the limit. What

Yeah that's a good point. But is this going to happen?

> I am avoiding by placing this where I did is us doing a hint on orders
> greater than our target hint order. So with this way I only perform one
> hint per 2MB page, otherwise I would be performing multiple hints per
> 2MB page as every order above that would also trigger hints.
>
Alexander Duyck Feb. 12, 2019, 5:20 p.m. UTC | #12
On Tue, 2019-02-12 at 10:09 +0800, Aaron Lu wrote:
> On 2019/2/11 23:58, Alexander Duyck wrote:
> > On Mon, 2019-02-11 at 14:40 +0800, Aaron Lu wrote:
> > > On 2019/2/5 2:15, Alexander Duyck wrote:
> > > > From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > > > 
> > > > Because the implementation was limiting itself to only providing hints on
> > > > pages huge TLB order sized or larger we introduced the possibility for free
> > > > pages to slip past us because they are freed as something less then
> > > > huge TLB in size and aggregated with buddies later.
> > > > 
> > > > To address that I am adding a new call arch_merge_page which is called
> > > > after __free_one_page has merged a pair of pages to create a higher order
> > > > page. By doing this I am able to fill the gap and provide full coverage for
> > > > all of the pages huge TLB order or larger.
> > > > 
> > > > Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > > > ---
> > > >  arch/x86/include/asm/page.h |   12 ++++++++++++
> > > >  arch/x86/kernel/kvm.c       |   28 ++++++++++++++++++++++++++++
> > > >  include/linux/gfp.h         |    4 ++++
> > > >  mm/page_alloc.c             |    2 ++
> > > >  4 files changed, 46 insertions(+)
> > > > 
> > > > diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> > > > index 4487ad7a3385..9540a97c9997 100644
> > > > --- a/arch/x86/include/asm/page.h
> > > > +++ b/arch/x86/include/asm/page.h
> > > > @@ -29,6 +29,18 @@ static inline void arch_free_page(struct page *page, unsigned int order)
> > > >  	if (static_branch_unlikely(&pv_free_page_hint_enabled))
> > > >  		__arch_free_page(page, order);
> > > >  }
> > > > +
> > > > +struct zone;
> > > > +
> > > > +#define HAVE_ARCH_MERGE_PAGE
> > > > +void __arch_merge_page(struct zone *zone, struct page *page,
> > > > +		       unsigned int order);
> > > > +static inline void arch_merge_page(struct zone *zone, struct page *page,
> > > > +				   unsigned int order)
> > > > +{
> > > > +	if (static_branch_unlikely(&pv_free_page_hint_enabled))
> > > > +		__arch_merge_page(zone, page, order);
> > > > +}
> > > >  #endif
> > > >  
> > > >  #include <linux/range.h>
> > > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > > > index 09c91641c36c..957bb4f427bb 100644
> > > > --- a/arch/x86/kernel/kvm.c
> > > > +++ b/arch/x86/kernel/kvm.c
> > > > @@ -785,6 +785,34 @@ void __arch_free_page(struct page *page, unsigned int order)
> > > >  		       PAGE_SIZE << order);
> > > >  }
> > > >  
> > > > +void __arch_merge_page(struct zone *zone, struct page *page,
> > > > +		       unsigned int order)
> > > > +{
> > > > +	/*
> > > > +	 * The merging logic has merged a set of buddies up to the
> > > > +	 * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
> > > > +	 * advantage of this moment to notify the hypervisor of the free
> > > > +	 * memory.
> > > > +	 */
> > > > +	if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
> > > > +		return;
> > > > +
> > > > +	/*
> > > > +	 * Drop zone lock while processing the hypercall. This
> > > > +	 * should be safe as the page has not yet been added
> > > > +	 * to the buddy list as of yet and all the pages that
> > > > +	 * were merged have had their buddy/guard flags cleared
> > > > +	 * and their order reset to 0.
> > > > +	 */
> > > > +	spin_unlock(&zone->lock);
> > > > +
> > > > +	kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
> > > > +		       PAGE_SIZE << order);
> > > > +
> > > > +	/* reacquire lock and resume freeing memory */
> > > > +	spin_lock(&zone->lock);
> > > > +}
> > > > +
> > > >  #ifdef CONFIG_PARAVIRT_SPINLOCKS
> > > >  
> > > >  /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
> > > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > > index fdab7de7490d..4746d5560193 100644
> > > > --- a/include/linux/gfp.h
> > > > +++ b/include/linux/gfp.h
> > > > @@ -459,6 +459,10 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> > > >  #ifndef HAVE_ARCH_FREE_PAGE
> > > >  static inline void arch_free_page(struct page *page, int order) { }
> > > >  #endif
> > > > +#ifndef HAVE_ARCH_MERGE_PAGE
> > > > +static inline void
> > > > +arch_merge_page(struct zone *zone, struct page *page, int order) { }
> > > > +#endif
> > > >  #ifndef HAVE_ARCH_ALLOC_PAGE
> > > >  static inline void arch_alloc_page(struct page *page, int order) { }
> > > >  #endif
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index c954f8c1fbc4..7a1309b0b7c5 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -913,6 +913,8 @@ static inline void __free_one_page(struct page *page,
> > > >  		page = page + (combined_pfn - pfn);
> > > >  		pfn = combined_pfn;
> > > >  		order++;
> > > > +
> > > > +		arch_merge_page(zone, page, order);
> > > 
> > > Not a proper place AFAICS.
> > > 
> > > Assume we have an order-8 page being sent here for merge and its order-8
> > > buddy is also free, then order++ became 9 and arch_merge_page() will do
> > > the hint to host on this page as an order-9 page, no problem so far.
> > > Then the next round, assume the now order-9 page's buddy is also free,
> > > order++ will become 10 and arch_merge_page() will again hint to host on
> > > this page as an order-10 page. The first hint to host became redundant.
> > 
> > Actually the problem is even worse the other way around. My concern was
> > pages being incrementally freed.
> > 
> > With this setup I can catch when we have crossed the threshold from
> > order 8 to 9, and specifically for that case provide the hint. This
> > allows me to ignore orders above and below 9.
> 
> OK, I see, you are now only hinting for pages with order 9, not above.

Right.

> > If I move the hint to the spot after the merging I have no way of
> > telling if I have hinted the page as a lower order or not. As such I
> > will hint if it is merged up to orders 9 or greater. So for example if
> > it merges up to order 9 and stops there then done_merging will report
> > an order 9 page, then if another page is freed and merged with this up
> > to order 10 you would be hinting on order 10. By placing the function
> > here I can guarantee that no more than 1 hint is provided per 2MB page.
> 
> So what's the downside of hinting the page as order-10 after merge
> compared to as order-9 before the merge? I can see the same physical
> range can be hinted multiple times, but the total hint number is the
> same: both are 2 - in your current implementation, we hint twice for
> each of the 2 order-9 pages; alternatively, we can provide hint for one
> order-9 page and the merged order-10 page. I think the cost of
> hypercalls are the same? Is it that we want to ease the host side
> madvise(DONTNEED) since we can avoid operating the same range multiple
> times?

The cost for the hypercall overhead is the same, but I would think you
are in the hypercall a bit longer for the order 10 page because you are
having to process both order 9 pages in order clear them. In my mind
doing it that way you end up having to do 50% more madvise work. For a
THP based setup it probably isn't an issue, but I would think if we are
having to invalidate things at the 4K page level that cost could add up
real quick.

I could probably try launching the guest with THP disabled in QEMU to
verify if the difference is visible or not.

> The reason I asked is, if we can move the arch_merge_page() after
> done_merging tag, we can theoretically make fewer function calls on free
> path for the guest. Maybe not a big deal, I don't know...

I suspect it really isn't that big a deal. The two functions are
essentially inline and only one will ever make use of the hypercall.

> > > I think the proper place is after the done_merging tag.
> > > 
> > > BTW, with arch_merge_page() at the proper place, I don't think patch3/4
> > > is necessary - any freed page will go through merge anyway, we won't
> > > lose any hint opportunity. Or do I miss anything?
> > 
> > You can refer to my comment above. What I want to avoid is us hinting a
> > page multiple times if we aren't using MAX_ORDER - 1 as the limit. What
> 
> Yeah that's a good point. But is this going to happen?

One of the advantages I have from splitting things out the way I did is
that I have been able to add some debug counters to track what is freed
as a higher order page and what isn't. From what I am seeing after boot
essentially all of the calls are coming from the merge logic.

I'm suspecting the typical use case is that pages are likely going to
either be freed as something THP or larger, or they will be freed in 4K
increments. By splitting things up the way I did we end up getting the
most efficient performance out of the 4K case since we avoid performing
madvise 1.5 times per page and keep it to once per page.

Patch
diff mbox series

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 4487ad7a3385..9540a97c9997 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -29,6 +29,18 @@  static inline void arch_free_page(struct page *page, unsigned int order)
 	if (static_branch_unlikely(&pv_free_page_hint_enabled))
 		__arch_free_page(page, order);
 }
+
+struct zone;
+
+#define HAVE_ARCH_MERGE_PAGE
+void __arch_merge_page(struct zone *zone, struct page *page,
+		       unsigned int order);
+static inline void arch_merge_page(struct zone *zone, struct page *page,
+				   unsigned int order)
+{
+	if (static_branch_unlikely(&pv_free_page_hint_enabled))
+		__arch_merge_page(zone, page, order);
+}
 #endif
 
 #include <linux/range.h>
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 09c91641c36c..957bb4f427bb 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -785,6 +785,34 @@  void __arch_free_page(struct page *page, unsigned int order)
 		       PAGE_SIZE << order);
 }
 
+void __arch_merge_page(struct zone *zone, struct page *page,
+		       unsigned int order)
+{
+	/*
+	 * The merging logic has merged a set of buddies up to the
+	 * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
+	 * advantage of this moment to notify the hypervisor of the free
+	 * memory.
+	 */
+	if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
+		return;
+
+	/*
+	 * Drop zone lock while processing the hypercall. This
+	 * should be safe as the page has not yet been added
+	 * to the buddy list as of yet and all the pages that
+	 * were merged have had their buddy/guard flags cleared
+	 * and their order reset to 0.
+	 */
+	spin_unlock(&zone->lock);
+
+	kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
+		       PAGE_SIZE << order);
+
+	/* reacquire lock and resume freeing memory */
+	spin_lock(&zone->lock);
+}
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fdab7de7490d..4746d5560193 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -459,6 +459,10 @@  static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
 #ifndef HAVE_ARCH_FREE_PAGE
 static inline void arch_free_page(struct page *page, int order) { }
 #endif
+#ifndef HAVE_ARCH_MERGE_PAGE
+static inline void
+arch_merge_page(struct zone *zone, struct page *page, int order) { }
+#endif
 #ifndef HAVE_ARCH_ALLOC_PAGE
 static inline void arch_alloc_page(struct page *page, int order) { }
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c954f8c1fbc4..7a1309b0b7c5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -913,6 +913,8 @@  static inline void __free_one_page(struct page *page,
 		page = page + (combined_pfn - pfn);
 		pfn = combined_pfn;
 		order++;
+
+		arch_merge_page(zone, page, order);
 	}
 	if (max_order < MAX_ORDER) {
 		/* If we are here, it means order is >= pageblock_order.