[v12,6/8] mm: support reporting free page blocks

Message ID	1499863221-16206-7-git-send-email-wei.w.wang@intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> From: Wei Wang <wei.w.wang@intel.com> To: linux-kernel@vger.kernel.org, qemu-devel@nongnu.org, virtualization@lists.linux-foundation.org, kvm@vger.kernel.org, linux-mm@kvack.org, mst@redhat.com, david@redhat.com, cornelia.huck@de.ibm.com, akpm@linux-foundation.org, mgorman@techsingularity.net, aarcange@redhat.com, amit.shah@redhat.com, pbonzini@redhat.com, wei.w.wang@intel.com, liliang.opensource@gmail.com Cc: virtio-dev@lists.oasis-open.org, yang.zhang.wz@gmail.com, quan.xu@aliyun.com Subject: [PATCH v12 6/8] mm: support reporting free page blocks Date: Wed, 12 Jul 2017 20:40:19 +0800 Message-Id: <1499863221-16206-7-git-send-email-wei.w.wang@intel.com> In-Reply-To: <1499863221-16206-1-git-send-email-wei.w.wang@intel.com> References: <1499863221-16206-1-git-send-email-wei.w.wang@intel.com> Sender: kvm-owner@vger.kernel.org Precedence: bulk

Wang, Wei W July 12, 2017, 12:40 p.m. UTC

This patch adds support for reporting blocks of pages on the free list
specified by the caller.

As pages can leave the free list during this call or immediately
afterwards, they are not guaranteed to be free after the function
returns. The only guarantee this makes is that the page was on the free
list at some point in time after the function has been invoked.

Therefore, it is not safe for caller to use any pages on the returned
block or to discard data that is put there after the function returns.
However, it is safe for caller to discard data that was in one of these
pages before the function was invoked.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
---
 include/linux/mm.h |  5 +++
 mm/page_alloc.c    | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 101 insertions(+)

Michael S. Tsirkin July 13, 2017, 12:33 a.m. UTC | #1

On Wed, Jul 12, 2017 at 08:40:19PM +0800, Wei Wang wrote:
> This patch adds support for reporting blocks of pages on the free list
> specified by the caller.
> 
> As pages can leave the free list during this call or immediately
> afterwards, they are not guaranteed to be free after the function
> returns. The only guarantee this makes is that the page was on the free
> list at some point in time after the function has been invoked.
> 
> Therefore, it is not safe for caller to use any pages on the returned
> block or to discard data that is put there after the function returns.
> However, it is safe for caller to discard data that was in one of these
> pages before the function was invoked.
> 
> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  include/linux/mm.h |  5 +++
>  mm/page_alloc.c    | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 101 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5..76cb433 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>  		unsigned long zone_start_pfn, unsigned long *zholes_size);
>  extern void free_initmem(void);
>  
> +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> +extern int report_unused_page_block(struct zone *zone, unsigned int order,
> +				    unsigned int migratetype,
> +				    struct page **page);
> +#endif
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>   * into the buddy system. The freed pages will be poisoned with pattern
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 64b7d82..8b3c9dd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  	show_swap_cache_info();
>  }
>  
> +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> +
> +/*
> + * Heuristically get a page block in the system that is unused.
> + * It is possible that pages from the page block are used immediately after
> + * report_unused_page_block() returns. It is the caller's responsibility
> + * to either detect or prevent the use of such pages.
> + *
> + * The free list to check: zone->free_area[order].free_list[migratetype].
> + *
> + * If the caller supplied page block (i.e. **page) is on the free list, offer
> + * the next page block on the list to the caller. Otherwise, offer the first
> + * page block on the list.
> + *
> + * Note: it is not safe for caller to use any pages on the returned
> + * block or to discard data that is put there after the function returns.
> + * However, it is safe for caller to discard data that was in one of these
> + * pages before the function was invoked.
> + *
> + * Return 0 when a page block is found on the caller specified free list.

Otherwise?

> + */

As an alternative, we could have an API that scans free pages
and invokes a callback under a lock. Granted, this might end up
staying a lot of time under a lock. Is this a big issue?
Some benchmarking will tell.

It would then be up to the hypervisor to decide whether it wants to play
tricks with the dirty bit or just wants to drop pages while VCPU is
stopped.


> +int report_unused_page_block(struct zone *zone, unsigned int order,
> +			     unsigned int migratetype, struct page **page)
> +{
> +	struct zone *this_zone;
> +	struct list_head *this_list;
> +	int ret = 0;
> +	unsigned long flags;
> +
> +	/* Sanity check */
> +	if (zone == NULL || page == NULL || order >= MAX_ORDER ||
> +	    migratetype >= MIGRATE_TYPES)
> +		return -EINVAL;

Why do callers this?

> +
> +	/* Zone validity check */
> +	for_each_populated_zone(this_zone) {
> +		if (zone == this_zone)
> +			break;
> +	}

Why?  Will take a long time if there are lots of zones.

> +
> +	/* Got a non-existent zone from the caller? */
> +	if (zone != this_zone)
> +		return -EINVAL;

When does this happen?

> +
> +	spin_lock_irqsave(&this_zone->lock, flags);
> +
> +	this_list = &zone->free_area[order].free_list[migratetype];
> +	if (list_empty(this_list)) {
> +		*page = NULL;
> +		ret = 1;


What does this mean?

> +		goto out;
> +	}
> +
> +	/* The caller is asking for the first free page block on the list */
> +	if ((*page) == NULL) {

if (!*page) is shorter and prettier.

> +		*page = list_first_entry(this_list, struct page, lru);
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	/*
> +	 * The page block passed from the caller is not on this free list
> +	 * anymore (e.g. a 1MB free page block has been split). In this case,
> +	 * offer the first page block on the free list that the caller is
> +	 * asking for.

This just might keep giving you same block over and over again.
E.g.
	- get 1st block
	- get 2nd block
	- 2nd gets broken up
	- get 1st block again

this way we might never make progress beyond the 1st 2 blocks


> +	 */
> +	if (PageBuddy(*page) && order != page_order(*page)) {
> +		*page = list_first_entry(this_list, struct page, lru);
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	/*
> +	 * The page block passed from the caller has been the last page block
> +	 * on the list.
> +	 */
> +	if ((*page)->lru.next == this_list) {
> +		*page = NULL;
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Finally, fall into the regular case: the page block passed from the
> +	 * caller is still on the free list. Offer the next one.
> +	 */
> +	*page = list_next_entry((*page), lru);
> +	ret = 0;
> +out:
> +	spin_unlock_irqrestore(&this_zone->lock, flags);
> +	return ret;
> +}
> +EXPORT_SYMBOL(report_unused_page_block);
> +
> +#endif
> +
>  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
>  {
>  	zoneref->zone = zone;
> -- 
> 2.7.4

Wang, Wei W July 13, 2017, 8:25 a.m. UTC | #2

On 07/13/2017 08:33 AM, Michael S. Tsirkin wrote:
> On Wed, Jul 12, 2017 at 08:40:19PM +0800, Wei Wang wrote:
>> This patch adds support for reporting blocks of pages on the free list
>> specified by the caller.
>>
>> As pages can leave the free list during this call or immediately
>> afterwards, they are not guaranteed to be free after the function
>> returns. The only guarantee this makes is that the page was on the free
>> list at some point in time after the function has been invoked.
>>
>> Therefore, it is not safe for caller to use any pages on the returned
>> block or to discard data that is put there after the function returns.
>> However, it is safe for caller to discard data that was in one of these
>> pages before the function was invoked.
>>
>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
>> Signed-off-by: Liang Li <liang.z.li@intel.com>
>> ---
>>   include/linux/mm.h |  5 +++
>>   mm/page_alloc.c    | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 101 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 46b9ac5..76cb433 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>>   		unsigned long zone_start_pfn, unsigned long *zholes_size);
>>   extern void free_initmem(void);
>>   
>> +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
>> +extern int report_unused_page_block(struct zone *zone, unsigned int order,
>> +				    unsigned int migratetype,
>> +				    struct page **page);
>> +#endif
>>   /*
>>    * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>>    * into the buddy system. The freed pages will be poisoned with pattern
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 64b7d82..8b3c9dd 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>>   	show_swap_cache_info();
>>   }
>>   
>> +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
>> +
>> +/*
>> + * Heuristically get a page block in the system that is unused.
>> + * It is possible that pages from the page block are used immediately after
>> + * report_unused_page_block() returns. It is the caller's responsibility
>> + * to either detect or prevent the use of such pages.
>> + *
>> + * The free list to check: zone->free_area[order].free_list[migratetype].
>> + *
>> + * If the caller supplied page block (i.e. **page) is on the free list, offer
>> + * the next page block on the list to the caller. Otherwise, offer the first
>> + * page block on the list.
>> + *
>> + * Note: it is not safe for caller to use any pages on the returned
>> + * block or to discard data that is put there after the function returns.
>> + * However, it is safe for caller to discard data that was in one of these
>> + * pages before the function was invoked.
>> + *
>> + * Return 0 when a page block is found on the caller specified free list.
> Otherwise?

Other values mean that no page block is found. I will add them.

>
>> + */
> As an alternative, we could have an API that scans free pages
> and invokes a callback under a lock. Granted, this might end up
> staying a lot of time under a lock. Is this a big issue?
> Some benchmarking will tell.
>
> It would then be up to the hypervisor to decide whether it wants to play
> tricks with the dirty bit or just wants to drop pages while VCPU is
> stopped.
>
>
>> +int report_unused_page_block(struct zone *zone, unsigned int order,
>> +			     unsigned int migratetype, struct page **page)
>> +{
>> +	struct zone *this_zone;
>> +	struct list_head *this_list;
>> +	int ret = 0;
>> +	unsigned long flags;
>> +
>> +	/* Sanity check */
>> +	if (zone == NULL || page == NULL || order >= MAX_ORDER ||
>> +	    migratetype >= MIGRATE_TYPES)
>> +		return -EINVAL;
> Why do callers this?
>
>> +
>> +	/* Zone validity check */
>> +	for_each_populated_zone(this_zone) {
>> +		if (zone == this_zone)
>> +			break;
>> +	}
> Why?  Will take a long time if there are lots of zones.
>
>> +
>> +	/* Got a non-existent zone from the caller? */
>> +	if (zone != this_zone)
>> +		return -EINVAL;
> When does this happen?

The above lines of code are just sanity check. If not
necessary, we can remove them.

>
>> +
>> +	spin_lock_irqsave(&this_zone->lock, flags);
>> +
>> +	this_list = &zone->free_area[order].free_list[migratetype];
>> +	if (list_empty(this_list)) {
>> +		*page = NULL;
>> +		ret = 1;
>
> What does this mean?

Just means the list is empty, and expects the caller to try again
in the next list.

Probably, use "-EAGAIN" is better?

>
>> +		*page = list_first_entry(this_list, struct page, lru);
>> +		ret = 0;
>> +		goto out;
>> +	}
>> +
>> +	/*
>> +	 * The page block passed from the caller is not on this free list
>> +	 * anymore (e.g. a 1MB free page block has been split). In this case,
>> +	 * offer the first page block on the free list that the caller is
>> +	 * asking for.
> This just might keep giving you same block over and over again.
> E.g.
> 	- get 1st block
> 	- get 2nd block
> 	- 2nd gets broken up
> 	- get 1st block again
>
> this way we might never make progress beyond the 1st 2 blocks

Not really. I think the pages are allocated in order. If the 2nd block 
isn't there, then
the 1st block must have gone, too. So, the call will return the 3rd one 
(which is the
new first) on the list.

Best,
Wei

Michal Hocko July 14, 2017, 12:30 p.m. UTC | #3

On Wed 12-07-17 20:40:19, Wei Wang wrote:
> This patch adds support for reporting blocks of pages on the free list
> specified by the caller.
> 
> As pages can leave the free list during this call or immediately
> afterwards, they are not guaranteed to be free after the function
> returns. The only guarantee this makes is that the page was on the free
> list at some point in time after the function has been invoked.
> 
> Therefore, it is not safe for caller to use any pages on the returned
> block or to discard data that is put there after the function returns.
> However, it is safe for caller to discard data that was in one of these
> pages before the function was invoked.

I do not understand what is the point of such a function and how it is
used because the patch doesn't give us any user (I haven't checked other
patches yet).

But just from the semantic point of view this sounds like a horrible
idea. The only way to get a free block of pages is to call the page
allocator. I am tempted to give it Nack right on those grounds but I
would like to hear more about what you actually want to achieve.

> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> ---
>  include/linux/mm.h |  5 +++
>  mm/page_alloc.c    | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 101 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 46b9ac5..76cb433 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>  		unsigned long zone_start_pfn, unsigned long *zholes_size);
>  extern void free_initmem(void);
>  
> +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> +extern int report_unused_page_block(struct zone *zone, unsigned int order,
> +				    unsigned int migratetype,
> +				    struct page **page);
> +#endif
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>   * into the buddy system. The freed pages will be poisoned with pattern
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 64b7d82..8b3c9dd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  	show_swap_cache_info();
>  }
>  
> +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> +
> +/*
> + * Heuristically get a page block in the system that is unused.
> + * It is possible that pages from the page block are used immediately after
> + * report_unused_page_block() returns. It is the caller's responsibility
> + * to either detect or prevent the use of such pages.
> + *
> + * The free list to check: zone->free_area[order].free_list[migratetype].
> + *
> + * If the caller supplied page block (i.e. **page) is on the free list, offer
> + * the next page block on the list to the caller. Otherwise, offer the first
> + * page block on the list.
> + *
> + * Note: it is not safe for caller to use any pages on the returned
> + * block or to discard data that is put there after the function returns.
> + * However, it is safe for caller to discard data that was in one of these
> + * pages before the function was invoked.
> + *
> + * Return 0 when a page block is found on the caller specified free list.
> + */
> +int report_unused_page_block(struct zone *zone, unsigned int order,
> +			     unsigned int migratetype, struct page **page)
> +{
> +	struct zone *this_zone;
> +	struct list_head *this_list;
> +	int ret = 0;
> +	unsigned long flags;
> +
> +	/* Sanity check */
> +	if (zone == NULL || page == NULL || order >= MAX_ORDER ||
> +	    migratetype >= MIGRATE_TYPES)
> +		return -EINVAL;
> +
> +	/* Zone validity check */
> +	for_each_populated_zone(this_zone) {
> +		if (zone == this_zone)
> +			break;
> +	}
> +
> +	/* Got a non-existent zone from the caller? */
> +	if (zone != this_zone)
> +		return -EINVAL;

Huh, what do you check for here? Why don't you simply
populated_zone(zone)?

> +
> +	spin_lock_irqsave(&this_zone->lock, flags);
> +
> +	this_list = &zone->free_area[order].free_list[migratetype];
> +	if (list_empty(this_list)) {
> +		*page = NULL;
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	/* The caller is asking for the first free page block on the list */
> +	if ((*page) == NULL) {
> +		*page = list_first_entry(this_list, struct page, lru);
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	/*
> +	 * The page block passed from the caller is not on this free list
> +	 * anymore (e.g. a 1MB free page block has been split). In this case,
> +	 * offer the first page block on the free list that the caller is
> +	 * asking for.
> +	 */
> +	if (PageBuddy(*page) && order != page_order(*page)) {
> +		*page = list_first_entry(this_list, struct page, lru);
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	/*
> +	 * The page block passed from the caller has been the last page block
> +	 * on the list.
> +	 */
> +	if ((*page)->lru.next == this_list) {
> +		*page = NULL;
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Finally, fall into the regular case: the page block passed from the
> +	 * caller is still on the free list. Offer the next one.
> +	 */
> +	*page = list_next_entry((*page), lru);
> +	ret = 0;
> +out:
> +	spin_unlock_irqrestore(&this_zone->lock, flags);
> +	return ret;
> +}
> +EXPORT_SYMBOL(report_unused_page_block);
> +
> +#endif
> +
>  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
>  {
>  	zoneref->zone = zone;
> -- 
> 2.7.4
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

Michal Hocko July 14, 2017, 12:54 p.m. UTC | #4

On Fri 14-07-17 14:30:23, Michal Hocko wrote:
> On Wed 12-07-17 20:40:19, Wei Wang wrote:
> > This patch adds support for reporting blocks of pages on the free list
> > specified by the caller.
> > 
> > As pages can leave the free list during this call or immediately
> > afterwards, they are not guaranteed to be free after the function
> > returns. The only guarantee this makes is that the page was on the free
> > list at some point in time after the function has been invoked.
> > 
> > Therefore, it is not safe for caller to use any pages on the returned
> > block or to discard data that is put there after the function returns.
> > However, it is safe for caller to discard data that was in one of these
> > pages before the function was invoked.
> 
> I do not understand what is the point of such a function and how it is
> used because the patch doesn't give us any user (I haven't checked other
> patches yet).
> 
> But just from the semantic point of view this sounds like a horrible
> idea. The only way to get a free block of pages is to call the page
> allocator. I am tempted to give it Nack right on those grounds but I
> would like to hear more about what you actually want to achieve.

OK, so I gave it another thought and giving a page which is still on the
free list to a random module is just a free ticket to a disaster.
Nacked-by: Michal Hocko <mhocko@suse.com>

> 
> > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > ---
> >  include/linux/mm.h |  5 +++
> >  mm/page_alloc.c    | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 101 insertions(+)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 46b9ac5..76cb433 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
> >  		unsigned long zone_start_pfn, unsigned long *zholes_size);
> >  extern void free_initmem(void);
> >  
> > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> > +extern int report_unused_page_block(struct zone *zone, unsigned int order,
> > +				    unsigned int migratetype,
> > +				    struct page **page);
> > +#endif
> >  /*
> >   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
> >   * into the buddy system. The freed pages will be poisoned with pattern
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 64b7d82..8b3c9dd 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
> >  	show_swap_cache_info();
> >  }
> >  
> > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> > +
> > +/*
> > + * Heuristically get a page block in the system that is unused.
> > + * It is possible that pages from the page block are used immediately after
> > + * report_unused_page_block() returns. It is the caller's responsibility
> > + * to either detect or prevent the use of such pages.
> > + *
> > + * The free list to check: zone->free_area[order].free_list[migratetype].
> > + *
> > + * If the caller supplied page block (i.e. **page) is on the free list, offer
> > + * the next page block on the list to the caller. Otherwise, offer the first
> > + * page block on the list.
> > + *
> > + * Note: it is not safe for caller to use any pages on the returned
> > + * block or to discard data that is put there after the function returns.
> > + * However, it is safe for caller to discard data that was in one of these
> > + * pages before the function was invoked.
> > + *
> > + * Return 0 when a page block is found on the caller specified free list.
> > + */
> > +int report_unused_page_block(struct zone *zone, unsigned int order,
> > +			     unsigned int migratetype, struct page **page)
> > +{
> > +	struct zone *this_zone;
> > +	struct list_head *this_list;
> > +	int ret = 0;
> > +	unsigned long flags;
> > +
> > +	/* Sanity check */
> > +	if (zone == NULL || page == NULL || order >= MAX_ORDER ||
> > +	    migratetype >= MIGRATE_TYPES)
> > +		return -EINVAL;
> > +
> > +	/* Zone validity check */
> > +	for_each_populated_zone(this_zone) {
> > +		if (zone == this_zone)
> > +			break;
> > +	}
> > +
> > +	/* Got a non-existent zone from the caller? */
> > +	if (zone != this_zone)
> > +		return -EINVAL;
> 
> Huh, what do you check for here? Why don't you simply
> populated_zone(zone)?
> 
> > +
> > +	spin_lock_irqsave(&this_zone->lock, flags);
> > +
> > +	this_list = &zone->free_area[order].free_list[migratetype];
> > +	if (list_empty(this_list)) {
> > +		*page = NULL;
> > +		ret = 1;
> > +		goto out;
> > +	}
> > +
> > +	/* The caller is asking for the first free page block on the list */
> > +	if ((*page) == NULL) {
> > +		*page = list_first_entry(this_list, struct page, lru);
> > +		ret = 0;
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * The page block passed from the caller is not on this free list
> > +	 * anymore (e.g. a 1MB free page block has been split). In this case,
> > +	 * offer the first page block on the free list that the caller is
> > +	 * asking for.
> > +	 */
> > +	if (PageBuddy(*page) && order != page_order(*page)) {
> > +		*page = list_first_entry(this_list, struct page, lru);
> > +		ret = 0;
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * The page block passed from the caller has been the last page block
> > +	 * on the list.
> > +	 */
> > +	if ((*page)->lru.next == this_list) {
> > +		*page = NULL;
> > +		ret = 1;
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * Finally, fall into the regular case: the page block passed from the
> > +	 * caller is still on the free list. Offer the next one.
> > +	 */
> > +	*page = list_next_entry((*page), lru);
> > +	ret = 0;
> > +out:
> > +	spin_unlock_irqrestore(&this_zone->lock, flags);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL(report_unused_page_block);
> > +
> > +#endif
> > +
> >  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
> >  {
> >  	zoneref->zone = zone;
> > -- 
> > 2.7.4
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 
> -- 
> Michal Hocko
> SUSE Labs

Michael S. Tsirkin July 14, 2017, 3:46 p.m. UTC | #5

On Fri, Jul 14, 2017 at 02:54:30PM +0200, Michal Hocko wrote:
> On Fri 14-07-17 14:30:23, Michal Hocko wrote:
> > On Wed 12-07-17 20:40:19, Wei Wang wrote:
> > > This patch adds support for reporting blocks of pages on the free list
> > > specified by the caller.
> > > 
> > > As pages can leave the free list during this call or immediately
> > > afterwards, they are not guaranteed to be free after the function
> > > returns. The only guarantee this makes is that the page was on the free
> > > list at some point in time after the function has been invoked.
> > > 
> > > Therefore, it is not safe for caller to use any pages on the returned
> > > block or to discard data that is put there after the function returns.
> > > However, it is safe for caller to discard data that was in one of these
> > > pages before the function was invoked.
> > 
> > I do not understand what is the point of such a function and how it is
> > used because the patch doesn't give us any user (I haven't checked other
> > patches yet).
> > 
> > But just from the semantic point of view this sounds like a horrible
> > idea. The only way to get a free block of pages is to call the page
> > allocator. I am tempted to give it Nack right on those grounds but I
> > would like to hear more about what you actually want to achieve.
> 
> OK, so I gave it another thought and giving a page which is still on the
> free list to a random module is just a free ticket to a disaster.
> Nacked-by: Michal Hocko <mhocko@suse.com>

I agree it should be EXPORT_SYMBOL_GPL. Too much power
to give to non-GPL modules.

But pls take a look at the explanation I posted.  Any kind of hypervisor
hinting will need to do this by definition - best we can do is keep a
lock while we do this.

> > 
> > > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > > ---
> > >  include/linux/mm.h |  5 +++
> > >  mm/page_alloc.c    | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 101 insertions(+)
> > > 
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 46b9ac5..76cb433 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
> > >  		unsigned long zone_start_pfn, unsigned long *zholes_size);
> > >  extern void free_initmem(void);
> > >  
> > > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> > > +extern int report_unused_page_block(struct zone *zone, unsigned int order,
> > > +				    unsigned int migratetype,
> > > +				    struct page **page);
> > > +#endif
> > >  /*
> > >   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
> > >   * into the buddy system. The freed pages will be poisoned with pattern
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 64b7d82..8b3c9dd 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
> > >  	show_swap_cache_info();
> > >  }
> > >  
> > > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> > > +
> > > +/*
> > > + * Heuristically get a page block in the system that is unused.
> > > + * It is possible that pages from the page block are used immediately after
> > > + * report_unused_page_block() returns. It is the caller's responsibility
> > > + * to either detect or prevent the use of such pages.
> > > + *
> > > + * The free list to check: zone->free_area[order].free_list[migratetype].
> > > + *
> > > + * If the caller supplied page block (i.e. **page) is on the free list, offer
> > > + * the next page block on the list to the caller. Otherwise, offer the first
> > > + * page block on the list.
> > > + *
> > > + * Note: it is not safe for caller to use any pages on the returned
> > > + * block or to discard data that is put there after the function returns.
> > > + * However, it is safe for caller to discard data that was in one of these
> > > + * pages before the function was invoked.
> > > + *
> > > + * Return 0 when a page block is found on the caller specified free list.
> > > + */
> > > +int report_unused_page_block(struct zone *zone, unsigned int order,
> > > +			     unsigned int migratetype, struct page **page)
> > > +{
> > > +	struct zone *this_zone;
> > > +	struct list_head *this_list;
> > > +	int ret = 0;
> > > +	unsigned long flags;
> > > +
> > > +	/* Sanity check */
> > > +	if (zone == NULL || page == NULL || order >= MAX_ORDER ||
> > > +	    migratetype >= MIGRATE_TYPES)
> > > +		return -EINVAL;
> > > +
> > > +	/* Zone validity check */
> > > +	for_each_populated_zone(this_zone) {
> > > +		if (zone == this_zone)
> > > +			break;
> > > +	}
> > > +
> > > +	/* Got a non-existent zone from the caller? */
> > > +	if (zone != this_zone)
> > > +		return -EINVAL;
> > 
> > Huh, what do you check for here? Why don't you simply
> > populated_zone(zone)?
> > 
> > > +
> > > +	spin_lock_irqsave(&this_zone->lock, flags);
> > > +
> > > +	this_list = &zone->free_area[order].free_list[migratetype];
> > > +	if (list_empty(this_list)) {
> > > +		*page = NULL;
> > > +		ret = 1;
> > > +		goto out;
> > > +	}
> > > +
> > > +	/* The caller is asking for the first free page block on the list */
> > > +	if ((*page) == NULL) {
> > > +		*page = list_first_entry(this_list, struct page, lru);
> > > +		ret = 0;
> > > +		goto out;
> > > +	}
> > > +
> > > +	/*
> > > +	 * The page block passed from the caller is not on this free list
> > > +	 * anymore (e.g. a 1MB free page block has been split). In this case,
> > > +	 * offer the first page block on the free list that the caller is
> > > +	 * asking for.
> > > +	 */
> > > +	if (PageBuddy(*page) && order != page_order(*page)) {
> > > +		*page = list_first_entry(this_list, struct page, lru);
> > > +		ret = 0;
> > > +		goto out;
> > > +	}
> > > +
> > > +	/*
> > > +	 * The page block passed from the caller has been the last page block
> > > +	 * on the list.
> > > +	 */
> > > +	if ((*page)->lru.next == this_list) {
> > > +		*page = NULL;
> > > +		ret = 1;
> > > +		goto out;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Finally, fall into the regular case: the page block passed from the
> > > +	 * caller is still on the free list. Offer the next one.
> > > +	 */
> > > +	*page = list_next_entry((*page), lru);
> > > +	ret = 0;
> > > +out:
> > > +	spin_unlock_irqrestore(&this_zone->lock, flags);
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL(report_unused_page_block);
> > > +
> > > +#endif
> > > +
> > >  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
> > >  {
> > >  	zoneref->zone = zone;
> > > -- 
> > > 2.7.4
> > > 
> > > --
> > > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > > the body to majordomo@kvack.org.  For more info on Linux MM,
> > > see: http://www.linux-mm.org/ .
> > > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> > 
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> -- 
> Michal Hocko
> SUSE Labs

Michael S. Tsirkin July 14, 2017, 7:17 p.m. UTC | #6

On Fri, Jul 14, 2017 at 02:30:23PM +0200, Michal Hocko wrote:
> On Wed 12-07-17 20:40:19, Wei Wang wrote:
> > This patch adds support for reporting blocks of pages on the free list
> > specified by the caller.
> > 
> > As pages can leave the free list during this call or immediately
> > afterwards, they are not guaranteed to be free after the function
> > returns. The only guarantee this makes is that the page was on the free
> > list at some point in time after the function has been invoked.
> > 
> > Therefore, it is not safe for caller to use any pages on the returned
> > block or to discard data that is put there after the function returns.
> > However, it is safe for caller to discard data that was in one of these
> > pages before the function was invoked.
> 
> I do not understand what is the point of such a function and how it is
> used because the patch doesn't give us any user (I haven't checked other
> patches yet).
> 
> But just from the semantic point of view this sounds like a horrible
> idea. The only way to get a free block of pages is to call the page
> allocator. I am tempted to give it Nack right on those grounds but I
> would like to hear more about what you actually want to achieve.

Basically it's a performance hint to the hypervisor.
For example, these pages would be good candidates to
move around as they are not mapped into any running
applications.

As such, it's important not to slow down other parts of the system too
much - otherwise we are speeding up one part of the system while we slow
down other parts of it, which is why it's trying to drop the lock as
soon a possible.

As long as hypervisor does not assume it can drop these pages, and as
long it's correct in most cases.  we are OK even if the hint is slightly
wrong because hypervisor notifications are racing with allocations.

There are patches to do more tricks - if hypervisor tracks all
memory writes we might actually use this hint to discard data -
but that is just implementation detail.


> > Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> > Signed-off-by: Liang Li <liang.z.li@intel.com>
> > ---
> >  include/linux/mm.h |  5 +++
> >  mm/page_alloc.c    | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 101 insertions(+)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 46b9ac5..76cb433 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
> >  		unsigned long zone_start_pfn, unsigned long *zholes_size);
> >  extern void free_initmem(void);
> >  
> > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> > +extern int report_unused_page_block(struct zone *zone, unsigned int order,
> > +				    unsigned int migratetype,
> > +				    struct page **page);
> > +#endif
> >  /*
> >   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
> >   * into the buddy system. The freed pages will be poisoned with pattern
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 64b7d82..8b3c9dd 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
> >  	show_swap_cache_info();
> >  }
> >  
> > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
> > +
> > +/*
> > + * Heuristically get a page block in the system that is unused.
> > + * It is possible that pages from the page block are used immediately after
> > + * report_unused_page_block() returns. It is the caller's responsibility
> > + * to either detect or prevent the use of such pages.
> > + *
> > + * The free list to check: zone->free_area[order].free_list[migratetype].
> > + *
> > + * If the caller supplied page block (i.e. **page) is on the free list, offer
> > + * the next page block on the list to the caller. Otherwise, offer the first
> > + * page block on the list.
> > + *
> > + * Note: it is not safe for caller to use any pages on the returned
> > + * block or to discard data that is put there after the function returns.
> > + * However, it is safe for caller to discard data that was in one of these
> > + * pages before the function was invoked.
> > + *
> > + * Return 0 when a page block is found on the caller specified free list.
> > + */
> > +int report_unused_page_block(struct zone *zone, unsigned int order,
> > +			     unsigned int migratetype, struct page **page)
> > +{
> > +	struct zone *this_zone;
> > +	struct list_head *this_list;
> > +	int ret = 0;
> > +	unsigned long flags;
> > +
> > +	/* Sanity check */
> > +	if (zone == NULL || page == NULL || order >= MAX_ORDER ||
> > +	    migratetype >= MIGRATE_TYPES)
> > +		return -EINVAL;
> > +
> > +	/* Zone validity check */
> > +	for_each_populated_zone(this_zone) {
> > +		if (zone == this_zone)
> > +			break;
> > +	}
> > +
> > +	/* Got a non-existent zone from the caller? */
> > +	if (zone != this_zone)
> > +		return -EINVAL;
> 
> Huh, what do you check for here? Why don't you simply
> populated_zone(zone)?
> 
> > +
> > +	spin_lock_irqsave(&this_zone->lock, flags);
> > +
> > +	this_list = &zone->free_area[order].free_list[migratetype];
> > +	if (list_empty(this_list)) {
> > +		*page = NULL;
> > +		ret = 1;
> > +		goto out;
> > +	}
> > +
> > +	/* The caller is asking for the first free page block on the list */
> > +	if ((*page) == NULL) {
> > +		*page = list_first_entry(this_list, struct page, lru);
> > +		ret = 0;
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * The page block passed from the caller is not on this free list
> > +	 * anymore (e.g. a 1MB free page block has been split). In this case,
> > +	 * offer the first page block on the free list that the caller is
> > +	 * asking for.
> > +	 */
> > +	if (PageBuddy(*page) && order != page_order(*page)) {
> > +		*page = list_first_entry(this_list, struct page, lru);
> > +		ret = 0;
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * The page block passed from the caller has been the last page block
> > +	 * on the list.
> > +	 */
> > +	if ((*page)->lru.next == this_list) {
> > +		*page = NULL;
> > +		ret = 1;
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * Finally, fall into the regular case: the page block passed from the
> > +	 * caller is still on the free list. Offer the next one.
> > +	 */
> > +	*page = list_next_entry((*page), lru);
> > +	ret = 0;
> > +out:
> > +	spin_unlock_irqrestore(&this_zone->lock, flags);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL(report_unused_page_block);
> > +
> > +#endif
> > +
> >  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
> >  {
> >  	zoneref->zone = zone;
> > -- 
> > 2.7.4
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 
> -- 
> Michal Hocko
> SUSE Labs

Michal Hocko July 17, 2017, 3:24 p.m. UTC | #7

On Fri 14-07-17 22:17:13, Michael S. Tsirkin wrote:
> On Fri, Jul 14, 2017 at 02:30:23PM +0200, Michal Hocko wrote:
> > On Wed 12-07-17 20:40:19, Wei Wang wrote:
> > > This patch adds support for reporting blocks of pages on the free list
> > > specified by the caller.
> > > 
> > > As pages can leave the free list during this call or immediately
> > > afterwards, they are not guaranteed to be free after the function
> > > returns. The only guarantee this makes is that the page was on the free
> > > list at some point in time after the function has been invoked.
> > > 
> > > Therefore, it is not safe for caller to use any pages on the returned
> > > block or to discard data that is put there after the function returns.
> > > However, it is safe for caller to discard data that was in one of these
> > > pages before the function was invoked.
> > 
> > I do not understand what is the point of such a function and how it is
> > used because the patch doesn't give us any user (I haven't checked other
> > patches yet).
> > 
> > But just from the semantic point of view this sounds like a horrible
> > idea. The only way to get a free block of pages is to call the page
> > allocator. I am tempted to give it Nack right on those grounds but I
> > would like to hear more about what you actually want to achieve.
> 
> Basically it's a performance hint to the hypervisor.
> For example, these pages would be good candidates to
> move around as they are not mapped into any running
> applications.
>
> As such, it's important not to slow down other parts of the system too
> much - otherwise we are speeding up one part of the system while we slow
> down other parts of it, which is why it's trying to drop the lock as
> soon a possible.

So why cannot you simply allocate those page and then do whatever you
need. You can tell the page allocator to do only a lightweight
allocation by the gfp_mask - e.g. GFP_NOWAIT or if you even do not want
to risk kswapd intervening then 0 mask.

> As long as hypervisor does not assume it can drop these pages, and as
> long it's correct in most cases.  we are OK even if the hint is slightly
> wrong because hypervisor notifications are racing with allocations.

But the page could have been reused anytime after the lock is dropped
and you cannot check for that except for elevating the reference count.

Wang, Wei W July 18, 2017, 2:12 a.m. UTC | #8

On 07/17/2017 11:24 PM, Michal Hocko wrote:
> On Fri 14-07-17 22:17:13, Michael S. Tsirkin wrote:
>> On Fri, Jul 14, 2017 at 02:30:23PM +0200, Michal Hocko wrote:
>>> On Wed 12-07-17 20:40:19, Wei Wang wrote:
>>>> This patch adds support for reporting blocks of pages on the free list
>>>> specified by the caller.
>>>>
>>>> As pages can leave the free list during this call or immediately
>>>> afterwards, they are not guaranteed to be free after the function
>>>> returns. The only guarantee this makes is that the page was on the free
>>>> list at some point in time after the function has been invoked.
>>>>
>>>> Therefore, it is not safe for caller to use any pages on the returned
>>>> block or to discard data that is put there after the function returns.
>>>> However, it is safe for caller to discard data that was in one of these
>>>> pages before the function was invoked.
>>> I do not understand what is the point of such a function and how it is
>>> used because the patch doesn't give us any user (I haven't checked other
>>> patches yet).
>>>
>>> But just from the semantic point of view this sounds like a horrible
>>> idea. The only way to get a free block of pages is to call the page
>>> allocator. I am tempted to give it Nack right on those grounds but I
>>> would like to hear more about what you actually want to achieve.
>> Basically it's a performance hint to the hypervisor.
>> For example, these pages would be good candidates to
>> move around as they are not mapped into any running
>> applications.
>>
>> As such, it's important not to slow down other parts of the system too
>> much - otherwise we are speeding up one part of the system while we slow
>> down other parts of it, which is why it's trying to drop the lock as
>> soon a possible.

Probably I should have included the introduction of the usages in
the log. Hope it is not too later to explain here:

Live migration needs to transfer the VM's memory from the source
machine to the destination round by round. For the 1st round, all the VM's
memory is transferred. From the 2nd round, only the pieces of memory
that were written by the guest (after the 1st round) are transferred. One
method that is popularly used by the hypervisor to track which part of
memory is written is to write-protect all the guest memory.

This patch enables the optimization of the 1st round memory transfer -
the hypervisor can skip the transfer of guest unused pages in the 1st round.
It is not concerned that the memory pages are used after they are given to
the hypervisor as a hint of the unused pages, because they will be tracked
by the hypervisor and transferred in the next round if they are used and
written.

> So why cannot you simply allocate those page and then do whatever you
> need. You can tell the page allocator to do only a lightweight
> allocation by the gfp_mask - e.g. GFP_NOWAIT or if you even do not want
> to risk kswapd intervening then 0 mask.

Here are the 2 reasons that we can't get the hint of unused pages by 
allocating
them:

1) It's expected that live migration shouldn't affect the things running 
inside
the VM - take away all the free pages from the guest would greatly slow 
down the
activities inside guest (e.g. the network transmission may be stuck due 
to the lack of
sk_buf).

2) The hint of free pages are used to optimize the 1st round memory 
transfer, so the hint
is expect to be gotten by the hypervisor as quick as possible. Depending 
on the memory
size of the guest, allocation of all the free memory would be too long 
for the case.

Hope it clarifies the use case.

>> As long as hypervisor does not assume it can drop these pages, and as
>> long it's correct in most cases.  we are OK even if the hint is slightly
>> wrong because hypervisor notifications are racing with allocations.
> But the page could have been reused anytime after the lock is dropped
> and you cannot check for that except for elevating the reference count.

As also introduced above, the hypervisor uses a dirty page logging mechanism
to track which memory page is written by the guest when live migration 
begins.

Best,
Wei

Michal Hocko July 19, 2017, 8:13 a.m. UTC | #9

On Tue 18-07-17 10:12:14, Wei Wang wrote:
[...]
> Probably I should have included the introduction of the usages in
> the log. Hope it is not too later to explain here:

Yes this should have been described in the cover.
 
> Live migration needs to transfer the VM's memory from the source
> machine to the destination round by round. For the 1st round, all the VM's
> memory is transferred. From the 2nd round, only the pieces of memory
> that were written by the guest (after the 1st round) are transferred. One
> method that is popularly used by the hypervisor to track which part of
> memory is written is to write-protect all the guest memory.
> 
> This patch enables the optimization of the 1st round memory transfer -
> the hypervisor can skip the transfer of guest unused pages in the 1st round.

All you should need is the check for the page reference count, no?  I
assume you do some sort of pfn walk and so you should be able to get an
access to the struct page.

Wang, Wei W July 19, 2017, 12:01 p.m. UTC | #10

On 07/19/2017 04:13 PM, Michal Hocko wrote:
> On Tue 18-07-17 10:12:14, Wei Wang wrote:
> [...]
>> Probably I should have included the introduction of the usages in
>> the log. Hope it is not too later to explain here:
> Yes this should have been described in the cover.

OK, I will do it in the next version.

>   
>> Live migration needs to transfer the VM's memory from the source
>> machine to the destination round by round. For the 1st round, all the VM's
>> memory is transferred. From the 2nd round, only the pieces of memory
>> that were written by the guest (after the 1st round) are transferred. One
>> method that is popularly used by the hypervisor to track which part of
>> memory is written is to write-protect all the guest memory.
>>
>> This patch enables the optimization of the 1st round memory transfer -
>> the hypervisor can skip the transfer of guest unused pages in the 1st round.
> All you should need is the check for the page reference count, no?  I
> assume you do some sort of pfn walk and so you should be able to get an
> access to the struct page.

Not necessarily - the guest struct page is not seen by the hypervisor. The
hypervisor only gets those guest pfns which are hinted as unused. From the
hypervisor (host) point of view, a guest physical address corresponds to a
virtual address of a host process. So, once the hypervisor knows a guest
physical page is unsued, it knows that the corresponding virtual memory of
the process doesn't need to be transferred in the 1st round.

Best,
Wei

Michal Hocko July 24, 2017, 9 a.m. UTC | #11

On Wed 19-07-17 20:01:18, Wei Wang wrote:
> On 07/19/2017 04:13 PM, Michal Hocko wrote:
[...
> >All you should need is the check for the page reference count, no?  I
> >assume you do some sort of pfn walk and so you should be able to get an
> >access to the struct page.
> 
> Not necessarily - the guest struct page is not seen by the hypervisor. The
> hypervisor only gets those guest pfns which are hinted as unused. From the
> hypervisor (host) point of view, a guest physical address corresponds to a
> virtual address of a host process. So, once the hypervisor knows a guest
> physical page is unsued, it knows that the corresponding virtual memory of
> the process doesn't need to be transferred in the 1st round.

I am sorry, but I do not understand. Why cannot _guest_ simply check the
struct page ref count and send them to the hypervisor? Is there any
documentation which describes the workflow or code which would use your
new API?

Wang, Wei W July 25, 2017, 9:32 a.m. UTC | #12

On 07/24/2017 05:00 PM, Michal Hocko wrote:
> On Wed 19-07-17 20:01:18, Wei Wang wrote:
>> On 07/19/2017 04:13 PM, Michal Hocko wrote:
> [...
>>> All you should need is the check for the page reference count, no?  I
>>> assume you do some sort of pfn walk and so you should be able to get an
>>> access to the struct page.
>> Not necessarily - the guest struct page is not seen by the hypervisor. The
>> hypervisor only gets those guest pfns which are hinted as unused. From the
>> hypervisor (host) point of view, a guest physical address corresponds to a
>> virtual address of a host process. So, once the hypervisor knows a guest
>> physical page is unsued, it knows that the corresponding virtual memory of
>> the process doesn't need to be transferred in the 1st round.
> I am sorry, but I do not understand. Why cannot _guest_ simply check the
> struct page ref count and send them to the hypervisor?

Were you suggesting the following?
1) get a free page block from the page list using the API;
2) if page->ref_count == 0, send it to the hypervisor

Btw, ref_count may also change at any time.

> Is there any
> documentation which describes the workflow or code which would use your
> new API?
>

It's used in the balloon driver (patch 8). We don't have any docs yet, but
I think the high level workflow is the two steps above.


Best,
Wei

Michal Hocko July 25, 2017, 11:25 a.m. UTC | #13

On Tue 25-07-17 17:32:00, Wei Wang wrote:
> On 07/24/2017 05:00 PM, Michal Hocko wrote:
> >On Wed 19-07-17 20:01:18, Wei Wang wrote:
> >>On 07/19/2017 04:13 PM, Michal Hocko wrote:
> >[...
> >>>All you should need is the check for the page reference count, no?  I
> >>>assume you do some sort of pfn walk and so you should be able to get an
> >>>access to the struct page.
> >>Not necessarily - the guest struct page is not seen by the hypervisor. The
> >>hypervisor only gets those guest pfns which are hinted as unused. From the
> >>hypervisor (host) point of view, a guest physical address corresponds to a
> >>virtual address of a host process. So, once the hypervisor knows a guest
> >>physical page is unsued, it knows that the corresponding virtual memory of
> >>the process doesn't need to be transferred in the 1st round.
> >I am sorry, but I do not understand. Why cannot _guest_ simply check the
> >struct page ref count and send them to the hypervisor?
> 
> Were you suggesting the following?
> 1) get a free page block from the page list using the API;

No. Use a pfn walk, check the reference count and skip those pages which
have 0 ref count. I suspected that you need to do some sort of the pfn
walk anyway because you somehow have to evaluate a memory to migrate,
right?

> 2) if page->ref_count == 0, send it to the hypervisor

yes

> Btw, ref_count may also change at any time.
> 
> >Is there any
> >documentation which describes the workflow or code which would use your
> >new API?
> >
> 
> It's used in the balloon driver (patch 8). We don't have any docs yet, but
> I think the high level workflow is the two steps above.

I will have a look.

Wang, Wei W July 25, 2017, 11:56 a.m. UTC | #14

On 07/25/2017 07:25 PM, Michal Hocko wrote:
> On Tue 25-07-17 17:32:00, Wei Wang wrote:
>> On 07/24/2017 05:00 PM, Michal Hocko wrote:
>>> On Wed 19-07-17 20:01:18, Wei Wang wrote:
>>>> On 07/19/2017 04:13 PM, Michal Hocko wrote:
>>> [...
>>>>> All you should need is the check for the page reference count, no?  I
>>>>> assume you do some sort of pfn walk and so you should be able to get an
>>>>> access to the struct page.
>>>> Not necessarily - the guest struct page is not seen by the hypervisor. The
>>>> hypervisor only gets those guest pfns which are hinted as unused. From the
>>>> hypervisor (host) point of view, a guest physical address corresponds to a
>>>> virtual address of a host process. So, once the hypervisor knows a guest
>>>> physical page is unsued, it knows that the corresponding virtual memory of
>>>> the process doesn't need to be transferred in the 1st round.
>>> I am sorry, but I do not understand. Why cannot _guest_ simply check the
>>> struct page ref count and send them to the hypervisor?
>> Were you suggesting the following?
>> 1) get a free page block from the page list using the API;
> No. Use a pfn walk, check the reference count and skip those pages which
> have 0 ref count.

"pfn walk" - do you mean start from the first pfn, and scan all the pfns 
that the VM has?

> I suspected that you need to do some sort of the pfn
> walk anyway because you somehow have to evaluate a memory to migrate,
> right?

We don't need to do the pfn walk in the guest kernel. When the API 
reports, for example,
a 2MB free page block, the API caller offers to the hypervisor the base 
address of the page
block, and size=2MB, to the hypervisor.

The hypervisor maintains a bitmap of all the guest physical memory (a 
bit corresponds to
a guest pfn). When migrating memory, only the pfns that are set in the 
bitmap are transferred
to the destination machine. So, when the hypervisor receives a 2MB free 
page block, the
corresponding bits in the bitmap are cleared.

Best,
Wei

Michal Hocko July 25, 2017, 12:41 p.m. UTC | #15

On Tue 25-07-17 19:56:24, Wei Wang wrote:
> On 07/25/2017 07:25 PM, Michal Hocko wrote:
> >On Tue 25-07-17 17:32:00, Wei Wang wrote:
> >>On 07/24/2017 05:00 PM, Michal Hocko wrote:
> >>>On Wed 19-07-17 20:01:18, Wei Wang wrote:
> >>>>On 07/19/2017 04:13 PM, Michal Hocko wrote:
> >>>[...
> >>>>>All you should need is the check for the page reference count, no?  I
> >>>>>assume you do some sort of pfn walk and so you should be able to get an
> >>>>>access to the struct page.
> >>>>Not necessarily - the guest struct page is not seen by the hypervisor. The
> >>>>hypervisor only gets those guest pfns which are hinted as unused. From the
> >>>>hypervisor (host) point of view, a guest physical address corresponds to a
> >>>>virtual address of a host process. So, once the hypervisor knows a guest
> >>>>physical page is unsued, it knows that the corresponding virtual memory of
> >>>>the process doesn't need to be transferred in the 1st round.
> >>>I am sorry, but I do not understand. Why cannot _guest_ simply check the
> >>>struct page ref count and send them to the hypervisor?
> >>Were you suggesting the following?
> >>1) get a free page block from the page list using the API;
> >No. Use a pfn walk, check the reference count and skip those pages which
> >have 0 ref count.
> 
> 
> "pfn walk" - do you mean start from the first pfn, and scan all the pfns
> that the VM has?

yes

> >I suspected that you need to do some sort of the pfn
> >walk anyway because you somehow have to evaluate a memory to migrate,
> >right?
> 
> We don't need to do the pfn walk in the guest kernel. When the API
> reports, for example, a 2MB free page block, the API caller offers to
> the hypervisor the base address of the page block, and size=2MB, to
> the hypervisor.

So you want to skip pfn walks by regularly calling into the page
allocator to update your bitmap. If that is the case then would an API
that would allow you to update your bitmap via a callback be s
sufficient? Something like
	void walk_free_mem(int node, int min_order,
			void (*visit)(unsigned long pfn, unsigned long nr_pages))

The function will call the given callback for each free memory block on
the given node starting from the given min_order. The callback will be
strictly an atomic and very light context. You can update your bitmap
from there.

This would address my main concern that the allocator internals would
get outside of the allocator proper. A nasty callback which would be too
expensive could still stall other allocations and cause latencies but
the locking will be under mm core control at least.

Does that sound useful?

Wang, Wei W July 25, 2017, 2:47 p.m. UTC | #16

On Tuesday, July 25, 2017 8:42 PM, hal Hocko wrote:
> On Tue 25-07-17 19:56:24, Wei Wang wrote:
> > On 07/25/2017 07:25 PM, Michal Hocko wrote:
> > >On Tue 25-07-17 17:32:00, Wei Wang wrote:
> > >>On 07/24/2017 05:00 PM, Michal Hocko wrote:
> > >>>On Wed 19-07-17 20:01:18, Wei Wang wrote:
> > >>>>On 07/19/2017 04:13 PM, Michal Hocko wrote:
> > >>>[...
> > We don't need to do the pfn walk in the guest kernel. When the API
> > reports, for example, a 2MB free page block, the API caller offers to
> > the hypervisor the base address of the page block, and size=2MB, to
> > the hypervisor.
> 
> So you want to skip pfn walks by regularly calling into the page allocator to
> update your bitmap. If that is the case then would an API that would allow you
> to update your bitmap via a callback be s sufficient? Something like
> 	void walk_free_mem(int node, int min_order,
> 			void (*visit)(unsigned long pfn, unsigned long nr_pages))
> 
> The function will call the given callback for each free memory block on the given
> node starting from the given min_order. The callback will be strictly an atomic
> and very light context. You can update your bitmap from there.

I would need to introduce more about the background here:
The hypervisor and the guest live in their own address space. The hypervisor's bitmap
isn't seen by the guest. I think we also wouldn't be able to give a callback function 
from the hypervisor to the guest in this case.

> 
> This would address my main concern that the allocator internals would get
> outside of the allocator proper. 

What issue would it have to expose the internal, for_each_zone()?
I think new code which would call it will also be strictly checked when they
are pushed to upstream.

Best,
Wei

Michal Hocko July 25, 2017, 2:53 p.m. UTC | #17

On Tue 25-07-17 14:47:16, Wang, Wei W wrote:
> On Tuesday, July 25, 2017 8:42 PM, hal Hocko wrote:
> > On Tue 25-07-17 19:56:24, Wei Wang wrote:
> > > On 07/25/2017 07:25 PM, Michal Hocko wrote:
> > > >On Tue 25-07-17 17:32:00, Wei Wang wrote:
> > > >>On 07/24/2017 05:00 PM, Michal Hocko wrote:
> > > >>>On Wed 19-07-17 20:01:18, Wei Wang wrote:
> > > >>>>On 07/19/2017 04:13 PM, Michal Hocko wrote:
> > > >>>[...
> > > We don't need to do the pfn walk in the guest kernel. When the API
> > > reports, for example, a 2MB free page block, the API caller offers to
> > > the hypervisor the base address of the page block, and size=2MB, to
> > > the hypervisor.
> > 
> > So you want to skip pfn walks by regularly calling into the page allocator to
> > update your bitmap. If that is the case then would an API that would allow you
> > to update your bitmap via a callback be s sufficient? Something like
> > 	void walk_free_mem(int node, int min_order,
> > 			void (*visit)(unsigned long pfn, unsigned long nr_pages))
> > 
> > The function will call the given callback for each free memory block on the given
> > node starting from the given min_order. The callback will be strictly an atomic
> > and very light context. You can update your bitmap from there.
> 
> I would need to introduce more about the background here:
> The hypervisor and the guest live in their own address space. The hypervisor's bitmap
> isn't seen by the guest. I think we also wouldn't be able to give a callback function 
> from the hypervisor to the guest in this case.

How did you plan to use your original API which export struct page array
then?

> > This would address my main concern that the allocator internals would get
> > outside of the allocator proper. 
> 
> What issue would it have to expose the internal, for_each_zone()?

zone is a MM internal concept. No code outside of the MM proper should
really care about zones.

Wang, Wei W July 26, 2017, 2:22 a.m. UTC | #18

On 07/25/2017 10:53 PM, Michal Hocko wrote:
> On Tue 25-07-17 14:47:16, Wang, Wei W wrote:
>> On Tuesday, July 25, 2017 8:42 PM, hal Hocko wrote:
>>> On Tue 25-07-17 19:56:24, Wei Wang wrote:
>>>> On 07/25/2017 07:25 PM, Michal Hocko wrote:
>>>>> On Tue 25-07-17 17:32:00, Wei Wang wrote:
>>>>>> On 07/24/2017 05:00 PM, Michal Hocko wrote:
>>>>>>> On Wed 19-07-17 20:01:18, Wei Wang wrote:
>>>>>>>> On 07/19/2017 04:13 PM, Michal Hocko wrote:
>>>>>>> [...
>>>> We don't need to do the pfn walk in the guest kernel. When the API
>>>> reports, for example, a 2MB free page block, the API caller offers to
>>>> the hypervisor the base address of the page block, and size=2MB, to
>>>> the hypervisor.
>>> So you want to skip pfn walks by regularly calling into the page allocator to
>>> update your bitmap. If that is the case then would an API that would allow you
>>> to update your bitmap via a callback be s sufficient? Something like
>>> 	void walk_free_mem(int node, int min_order,
>>> 			void (*visit)(unsigned long pfn, unsigned long nr_pages))
>>>
>>> The function will call the given callback for each free memory block on the given
>>> node starting from the given min_order. The callback will be strictly an atomic
>>> and very light context. You can update your bitmap from there.
>> I would need to introduce more about the background here:
>> The hypervisor and the guest live in their own address space. The hypervisor's bitmap
>> isn't seen by the guest. I think we also wouldn't be able to give a callback function
>> from the hypervisor to the guest in this case.
> How did you plan to use your original API which export struct page array
> then?


That's where the virtio-balloon driver comes in. It uses a shared ring 
mechanism to
send the guest memory info to the hypervisor.

We didn't expose the struct page array from the guest to the hypervisor. 
For example, when
a 2MB free page block is reported from the free page list, the info put 
on the ring is just
(base address of the 2MB continuous memory, size=2M).


>
>>> This would address my main concern that the allocator internals would get
>>> outside of the allocator proper.
>> What issue would it have to expose the internal, for_each_zone()?
> zone is a MM internal concept. No code outside of the MM proper should
> really care about zones.

I think this is also what Andrew suggested in the previous discussion:
https://lkml.org/lkml/2017/3/16/951

Move the code to virtio-balloon and a little layering violation seems 
acceptable.


Best,
Wei

Michal Hocko July 26, 2017, 10:24 a.m. UTC | #19

On Wed 26-07-17 10:22:23, Wei Wang wrote:
> On 07/25/2017 10:53 PM, Michal Hocko wrote:
> >On Tue 25-07-17 14:47:16, Wang, Wei W wrote:
> >>On Tuesday, July 25, 2017 8:42 PM, hal Hocko wrote:
> >>>On Tue 25-07-17 19:56:24, Wei Wang wrote:
> >>>>On 07/25/2017 07:25 PM, Michal Hocko wrote:
> >>>>>On Tue 25-07-17 17:32:00, Wei Wang wrote:
> >>>>>>On 07/24/2017 05:00 PM, Michal Hocko wrote:
> >>>>>>>On Wed 19-07-17 20:01:18, Wei Wang wrote:
> >>>>>>>>On 07/19/2017 04:13 PM, Michal Hocko wrote:
> >>>>>>>[...
> >>>>We don't need to do the pfn walk in the guest kernel. When the API
> >>>>reports, for example, a 2MB free page block, the API caller offers to
> >>>>the hypervisor the base address of the page block, and size=2MB, to
> >>>>the hypervisor.
> >>>So you want to skip pfn walks by regularly calling into the page allocator to
> >>>update your bitmap. If that is the case then would an API that would allow you
> >>>to update your bitmap via a callback be s sufficient? Something like
> >>>	void walk_free_mem(int node, int min_order,
> >>>			void (*visit)(unsigned long pfn, unsigned long nr_pages))
> >>>
> >>>The function will call the given callback for each free memory block on the given
> >>>node starting from the given min_order. The callback will be strictly an atomic
> >>>and very light context. You can update your bitmap from there.
> >>I would need to introduce more about the background here:
> >>The hypervisor and the guest live in their own address space. The hypervisor's bitmap
> >>isn't seen by the guest. I think we also wouldn't be able to give a callback function
> >>from the hypervisor to the guest in this case.
> >How did you plan to use your original API which export struct page array
> >then?
> 
> 
> That's where the virtio-balloon driver comes in. It uses a shared ring
> mechanism to
> send the guest memory info to the hypervisor.
> 
> We didn't expose the struct page array from the guest to the hypervisor. For
> example, when
> a 2MB free page block is reported from the free page list, the info put on
> the ring is just
> (base address of the 2MB continuous memory, size=2M).

So what exactly prevents virtio-balloon from using the above proposed
callback mechanism and export what is needed to the hypervisor?
 
> >>>This would address my main concern that the allocator internals would get
> >>>outside of the allocator proper.
> >>What issue would it have to expose the internal, for_each_zone()?
> >zone is a MM internal concept. No code outside of the MM proper should
> >really care about zones.
> 
> I think this is also what Andrew suggested in the previous discussion:
> https://lkml.org/lkml/2017/3/16/951
> 
> Move the code to virtio-balloon and a little layering violation seems
> acceptable.

Andrew didn't suggest to expose zones to modules. If I read his words
properly he said that a functionality which "provides a snapshot of the
present system unused pages" is just too specific for virtio usecase
to be a generic function and as such it should be in virtio. I tend
to agree. Even the proposed callback API is a layer violation. We
should just make sure that the API is not inherently dangerous. That
is why I have chosen to give only pfn and nr_pages to the caller. Sure
somebody could argue that the caller could do pfn_to_page and do nasty
things. That would be a fair argument but nothing really prevents
anybody to do th pfn walk already so there shouldn't inherently more
risk. We can document what we expect from the API user and that would be
much easier to describe than struct page API IMHO.

Wang, Wei W July 26, 2017, 11:44 a.m. UTC | #20

On 07/26/2017 06:24 PM, Michal Hocko wrote:
> On Wed 26-07-17 10:22:23, Wei Wang wrote:
>> On 07/25/2017 10:53 PM, Michal Hocko wrote:
>>> On Tue 25-07-17 14:47:16, Wang, Wei W wrote:
>>>> On Tuesday, July 25, 2017 8:42 PM, hal Hocko wrote:
>>>>> On Tue 25-07-17 19:56:24, Wei Wang wrote:
>>>>>> On 07/25/2017 07:25 PM, Michal Hocko wrote:
>>>>>>> On Tue 25-07-17 17:32:00, Wei Wang wrote:
>>>>>>>> On 07/24/2017 05:00 PM, Michal Hocko wrote:
>>>>>>>>> On Wed 19-07-17 20:01:18, Wei Wang wrote:
>>>>>>>>>> On 07/19/2017 04:13 PM, Michal Hocko wrote:
>>>>>>>>> [...
>>>>>> We don't need to do the pfn walk in the guest kernel. When the API
>>>>>> reports, for example, a 2MB free page block, the API caller offers to
>>>>>> the hypervisor the base address of the page block, and size=2MB, to
>>>>>> the hypervisor.
>>>>> So you want to skip pfn walks by regularly calling into the page allocator to
>>>>> update your bitmap. If that is the case then would an API that would allow you
>>>>> to update your bitmap via a callback be s sufficient? Something like
>>>>> 	void walk_free_mem(int node, int min_order,
>>>>> 			void (*visit)(unsigned long pfn, unsigned long nr_pages))
>>>>>
>>>>> The function will call the given callback for each free memory block on the given
>>>>> node starting from the given min_order. The callback will be strictly an atomic
>>>>> and very light context. You can update your bitmap from there.
>>>> I would need to introduce more about the background here:
>>>> The hypervisor and the guest live in their own address space. The hypervisor's bitmap
>>>> isn't seen by the guest. I think we also wouldn't be able to give a callback function
>>> >from the hypervisor to the guest in this case.
>>> How did you plan to use your original API which export struct page array
>>> then?
>>
>> That's where the virtio-balloon driver comes in. It uses a shared ring
>> mechanism to
>> send the guest memory info to the hypervisor.
>>
>> We didn't expose the struct page array from the guest to the hypervisor. For
>> example, when
>> a 2MB free page block is reported from the free page list, the info put on
>> the ring is just
>> (base address of the 2MB continuous memory, size=2M).
> So what exactly prevents virtio-balloon from using the above proposed
> callback mechanism and export what is needed to the hypervisor?

I thought about it more. Probably we can use the callback function with 
a little change like this:

void walk_free_mem(void *opaque1, void (*visit)(void *opaque2, unsigned 
long pfn,
            unsigned long nr_pages))
{
     ...
     for_each_populated_zone(zone) {
                    for_each_migratetype_order(order, type) {
                         report_unused_page_block(zone, order, type, 
&page); // from patch 6
                         pfn = page_to_pfn(page);
                         visit(opaque1, pfn, 1 << order);
                     }
     }
}

The above function scans all the free list and directly sends each free 
page block to the
hypervisor via the virtio_balloon callback below. No need to implement a 
bitmap.

In virtio-balloon, we have the callback:
void *virtio_balloon_report_unused_pages(void *opaque,  unsigned long pfn,
unsigned long nr_pages)
{
     struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
     ...put the free page block to the the ring of vb;
}


What do you think?


Best,
Wei

Michal Hocko July 26, 2017, 11:55 a.m. UTC | #21

On Wed 26-07-17 19:44:23, Wei Wang wrote:
[...]
> I thought about it more. Probably we can use the callback function with a
> little change like this:
> 
> void walk_free_mem(void *opaque1, void (*visit)(void *opaque2, unsigned long
> pfn,
>            unsigned long nr_pages))
> {
>     ...
>     for_each_populated_zone(zone) {
>                    for_each_migratetype_order(order, type) {
>                         report_unused_page_block(zone, order, type, &page);
> // from patch 6
>                         pfn = page_to_pfn(page);
>                         visit(opaque1, pfn, 1 << order);
>                     }
>     }
> }
> 
> The above function scans all the free list and directly sends each free page
> block to the
> hypervisor via the virtio_balloon callback below. No need to implement a
> bitmap.
> 
> In virtio-balloon, we have the callback:
> void *virtio_balloon_report_unused_pages(void *opaque,  unsigned long pfn,
> unsigned long nr_pages)
> {
>     struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
>     ...put the free page block to the the ring of vb;
> }
> 
> 
> What do you think?

I do not mind conveying a context to the callback. I would still prefer
to keep the original min_order to check semantic though. Why? Well,
it doesn't make much sense to scan low order free blocks all the time
because they are simply too volatile. Larger blocks tend to surivive for
longer. So I assume you would only care about larger free blocks. This
will also make the call cheaper.

Wang, Wei W July 26, 2017, 12:47 p.m. UTC | #22

On Wednesday, July 26, 2017 7:55 PM, Michal Hocko wrote:
> On Wed 26-07-17 19:44:23, Wei Wang wrote:
> [...]
> > I thought about it more. Probably we can use the callback function
> > with a little change like this:
> >
> > void walk_free_mem(void *opaque1, void (*visit)(void *opaque2,
> > unsigned long pfn,
> >            unsigned long nr_pages))
> > {
> >     ...
> >     for_each_populated_zone(zone) {
> >                    for_each_migratetype_order(order, type) {
> >                         report_unused_page_block(zone, order, type,
> > &page); // from patch 6
> >                         pfn = page_to_pfn(page);
> >                         visit(opaque1, pfn, 1 << order);
> >                     }
> >     }
> > }
> >
> > The above function scans all the free list and directly sends each
> > free page block to the hypervisor via the virtio_balloon callback
> > below. No need to implement a bitmap.
> >
> > In virtio-balloon, we have the callback:
> > void *virtio_balloon_report_unused_pages(void *opaque,  unsigned long
> > pfn, unsigned long nr_pages) {
> >     struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> >     ...put the free page block to the the ring of vb; }
> >
> >
> > What do you think?
> 
> I do not mind conveying a context to the callback. I would still prefer
> to keep the original min_order to check semantic though. Why? Well,
> it doesn't make much sense to scan low order free blocks all the time
> because they are simply too volatile. Larger blocks tend to surivive for
> longer. So I assume you would only care about larger free blocks. This
> will also make the call cheaper.
> --

OK, I will keep min order there in the next version.

Best,
Wei

[v12,6/8] mm: support reporting free page blocks

Commit Message

Comments

Patch