[v2,repost,6/7] mm: add the related functions to get free page info

Message ID 1469582616-5729-7-git-send-email-liang.z.li@intel.com (mailing list archive)
State New, archived

Commit Message

Liang Li July 27, 2016, 1:23 a.m. UTC
Save the free page info into a page bitmap, which will be used by the
virtio balloon device driver.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Amit Shah <amit.shah@redhat.com>
---
 mm/page_alloc.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

Comments

Dave Hansen July 27, 2016, 4:40 p.m. UTC | #1
On 07/26/2016 06:23 PM, Liang Li wrote:
> +	for_each_migratetype_order(order, t) {
> +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
> +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> +			if (pfn >= start_pfn && pfn <= end_pfn) {
> +				page_num = 1UL << order;
> +				if (pfn + page_num > end_pfn)
> +					page_num = end_pfn - pfn;
> +				bitmap_set(bitmap, pfn - start_pfn, page_num);
> +			}
> +		}
> +	}

Nit:  The 'page_num' nomenclature really confused me here.  It is the
number of bits being set in the bitmap.  Seems like calling it nr_pages
or num_pages would be more appropriate.

Isn't this bitmap out of date by the time it's sent up to the
hypervisor?  Is there something that makes the inaccuracy OK here?
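
To make the naming nit concrete, here is a minimal userspace sketch of the
clamping arithmetic from the quoted loop, with the counter renamed to
nr_pages. The helper name nr_pages_in_window() is hypothetical and not part
of the patch. It assumes the window is half-open, [start_pfn, end_pfn), which
appears to match the patch's effective behaviour, since a block starting
exactly at end_pfn is clamped to zero bits.

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of the patch: number of bitmap bits that a
 * free block of the given order, starting at pfn, contributes to the
 * half-open window [start_pfn, end_pfn).
 */
static unsigned long nr_pages_in_window(unsigned long pfn, unsigned int order,
                                        unsigned long start_pfn,
                                        unsigned long end_pfn)
{
    unsigned long nr_pages = 1UL << order;

    assert(start_pfn <= end_pfn);

    /* Like the patch, only blocks whose first page is inside the window count. */
    if (pfn < start_pfn || pfn >= end_pfn)
        return 0;

    /* Clamp a block that runs past the end of the window. */
    if (pfn + nr_pages > end_pfn)
        nr_pages = end_pfn - pfn;

    return nr_pages;
}
```

With this naming, the value passed as the length argument of bitmap_set()
reads as a count of pages rather than as a page number.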
Michael S. Tsirkin July 27, 2016, 10:05 p.m. UTC | #2
On Wed, Jul 27, 2016 at 09:40:56AM -0700, Dave Hansen wrote:
> On 07/26/2016 06:23 PM, Liang Li wrote:
> > +	for_each_migratetype_order(order, t) {
> > +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
> > +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> > +			if (pfn >= start_pfn && pfn <= end_pfn) {
> > +				page_num = 1UL << order;
> > +				if (pfn + page_num > end_pfn)
> > +					page_num = end_pfn - pfn;
> > +				bitmap_set(bitmap, pfn - start_pfn, page_num);
> > +			}
> > +		}
> > +	}
> 
> Nit:  The 'page_num' nomenclature really confused me here.  It is the
> number of bits being set in the bitmap.  Seems like calling it nr_pages
> or num_pages would be more appropriate.
> 
> Isn't this bitmap out of date by the time it's send up to the
> hypervisor?  Is there something that makes the inaccuracy OK here?

Yes. Calling these free pages is unfortunate. It's likely to confuse
people into thinking they can just discard these pages.

The hypervisor sends a request. We respond with this list of pages, and
the guarantee the hypervisor needs is that these were free sometime
between request and response, so they are safe to free if they are
unmodified since the request. The hypervisor can detect modifications
itself and does not need guest help.

Maybe just call these "free if unmodified" and reflect this
everywhere - verbose but hey. Better naming suggestions would be
welcome.
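
The "free if unmodified" rule can be sketched as a bitwise operation on two
bitmaps. This is only an illustration under stated assumptions: the function
compute_skip_bitmap() and its parameter names are hypothetical, both bitmaps
are assumed to use the same page-to-bit mapping, and the dirty bitmap is
assumed to cover every write since the request was sent.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical hypervisor-side helper: a page may be skipped only if the
 * guest reported it AND it has not been written since the request.
 */
static void compute_skip_bitmap(const unsigned long *reported_free,
                                const unsigned long *dirty_since_request,
                                unsigned long *skip, size_t nr_words)
{
    assert(reported_free && dirty_since_request && skip);

    for (size_t i = 0; i < nr_words; i++)
        skip[i] = reported_free[i] & ~dirty_since_request[i];
}
```

A page that was reported and later written drops out of the skip set, which
is why a stale report does not by itself lose data.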
Michael S. Tsirkin July 27, 2016, 10:13 p.m. UTC | #3
On Wed, Jul 27, 2016 at 09:23:35AM +0800, Liang Li wrote:
> Save the free page info into a page bitmap, will be used in virtio
> balloon device driver.
> 
> Signed-off-by: Liang Li <liang.z.li@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
> Cc: Amit Shah <amit.shah@redhat.com>
> ---
>  mm/page_alloc.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 46 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7da61ad..3ad8b10 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4523,6 +4523,52 @@ unsigned long get_max_pfn(void)
>  }
>  EXPORT_SYMBOL(get_max_pfn);
>  
> +static void mark_free_pages_bitmap(struct zone *zone, unsigned long start_pfn,
> +	unsigned long end_pfn, unsigned long *bitmap, unsigned long len)
> +{
> +	unsigned long pfn, flags, page_num;
> +	unsigned int order, t;
> +	struct list_head *curr;
> +
> +	if (zone_is_empty(zone))
> +		return;
> +	end_pfn = min(start_pfn + len, end_pfn);
> +	spin_lock_irqsave(&zone->lock, flags);
> +
> +	for_each_migratetype_order(order, t) {

Why not do each order separately? This way you can
use a single bit to pass a huge page to host.

Not a requirement but hey.

Alternatively (and maybe that is a better idea), if you wanted to,
you could just skip lone 4K pages.
It's not clear that they are worth bothering with.
Add a flag to start with some reasonably large order and go from there.


> +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
> +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> +			if (pfn >= start_pfn && pfn <= end_pfn) {
> +				page_num = 1UL << order;
> +				if (pfn + page_num > end_pfn)
> +					page_num = end_pfn - pfn;
> +				bitmap_set(bitmap, pfn - start_pfn, page_num);
> +			}
> +		}
> +	}
> +
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +}
> +
> +int get_free_pages(unsigned long start_pfn, unsigned long end_pfn,
> +		unsigned long *bitmap, unsigned long len)
> +{
> +	struct zone *zone;
> +	int ret = 0;
> +
> +	if (bitmap == NULL || start_pfn > end_pfn || start_pfn >= max_pfn)
> +		return 0;
> +	if (end_pfn < max_pfn)
> +		ret = 1;
> +	if (end_pfn >= max_pfn)
> +		ret = 0;
> +
> +	for_each_populated_zone(zone)
> +		mark_free_pages_bitmap(zone, start_pfn, end_pfn, bitmap, len);
> +	return ret;
> +}
> +EXPORT_SYMBOL(get_free_pages);
> +
>  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
>  {
>  	zoneref->zone = zone;
> -- 
> 1.9.1
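
The per-order idea from the inline comment above can be sketched in
userspace as follows. Nothing here comes from the patch: order_bitmap_bits()
and order_bitmap_index() are hypothetical names, min_order stands in for the
suggested "start with some reasonably large order" flag, and start_pfn is
assumed to be aligned to the block size of the order in question.

```c
#include <assert.h>
#include <stdbool.h>

/* Bits needed to cover nr_pfns page frames at one bit per 2^order block. */
static unsigned long order_bitmap_bits(unsigned long nr_pfns, unsigned int order)
{
    return (nr_pfns + (1UL << order) - 1) >> order;
}

/*
 * Compute the bit position of the free block starting at pfn within the
 * bitmap for its order. Returns false when the block is smaller than
 * min_order and should simply be skipped.
 */
static bool order_bitmap_index(unsigned long pfn, unsigned int order,
                               unsigned int min_order, unsigned long start_pfn,
                               unsigned long *bit)
{
    assert(pfn >= start_pfn);

    if (order < min_order)
        return false;

    *bit = (pfn - start_pfn) >> order;
    return true;
}
```

For example, with 4K pages an order-9 block is 2 MiB, so 1024 page frames
need 2 bits at order 9 instead of 1024 bits at order 0.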
Dave Hansen July 27, 2016, 10:16 p.m. UTC | #4
On 07/27/2016 03:05 PM, Michael S. Tsirkin wrote:
> On Wed, Jul 27, 2016 at 09:40:56AM -0700, Dave Hansen wrote:
>> On 07/26/2016 06:23 PM, Liang Li wrote:
>>> +	for_each_migratetype_order(order, t) {
>>> +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
>>> +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
>>> +			if (pfn >= start_pfn && pfn <= end_pfn) {
>>> +				page_num = 1UL << order;
>>> +				if (pfn + page_num > end_pfn)
>>> +					page_num = end_pfn - pfn;
>>> +				bitmap_set(bitmap, pfn - start_pfn, page_num);
>>> +			}
>>> +		}
>>> +	}
>>
>> Nit:  The 'page_num' nomenclature really confused me here.  It is the
>> number of bits being set in the bitmap.  Seems like calling it nr_pages
>> or num_pages would be more appropriate.
>>
>> Isn't this bitmap out of date by the time it's send up to the
>> hypervisor?  Is there something that makes the inaccuracy OK here?
> 
> Yes. Calling these free pages is unfortunate. It's likely to confuse
> people thinking they can just discard these pages.
> 
> Hypervisor sends a request. We respond with this list of pages, and
> the guarantee hypervisor needs is that these were free sometime between request
> and response, so they are safe to free if they are unmodified
> since the request. hypervisor can detect modifications so
> it can detect modifications itself and does not need guest help.

Ahh, that makes sense.

So the hypervisor is trying to figure out: "Which pages do I move?".  It
wants to know which pages the guest thinks have good data and need to
move.  But, the list of free pages is (likely) smaller than the list of
pages with good data, so it asks for that instead.

A write to a page means that it has valuable data, regardless of whether
it was in the free list or not.

The hypervisor only skips moving pages that were free *and* were never
written to.  So we never lose data, even if this "get free page info"
stuff is totally out of date.

The patch description and code comments are, um, a _bit_ light for this
level of subtlety. :)
Michael S. Tsirkin July 27, 2016, 11:05 p.m. UTC | #5
On Wed, Jul 27, 2016 at 03:16:57PM -0700, Dave Hansen wrote:
> On 07/27/2016 03:05 PM, Michael S. Tsirkin wrote:
> > On Wed, Jul 27, 2016 at 09:40:56AM -0700, Dave Hansen wrote:
> >> On 07/26/2016 06:23 PM, Liang Li wrote:
> >>> +	for_each_migratetype_order(order, t) {
> >>> +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
> >>> +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> >>> +			if (pfn >= start_pfn && pfn <= end_pfn) {
> >>> +				page_num = 1UL << order;
> >>> +				if (pfn + page_num > end_pfn)
> >>> +					page_num = end_pfn - pfn;
> >>> +				bitmap_set(bitmap, pfn - start_pfn, page_num);
> >>> +			}
> >>> +		}
> >>> +	}
> >>
> >> Nit:  The 'page_num' nomenclature really confused me here.  It is the
> >> number of bits being set in the bitmap.  Seems like calling it nr_pages
> >> or num_pages would be more appropriate.
> >>
> >> Isn't this bitmap out of date by the time it's send up to the
> >> hypervisor?  Is there something that makes the inaccuracy OK here?
> > 
> > Yes. Calling these free pages is unfortunate. It's likely to confuse
> > people thinking they can just discard these pages.
> > 
> > Hypervisor sends a request. We respond with this list of pages, and
> > the guarantee hypervisor needs is that these were free sometime between request
> > and response, so they are safe to free if they are unmodified
> > since the request. hypervisor can detect modifications so
> > it can detect modifications itself and does not need guest help.
> 
> Ahh, that makes sense.
> 
> So the hypervisor is trying to figure out: "Which pages do I move?".  It
> wants to know which pages the guest thinks have good data and need to
> move.  But, the list of free pages is (likely) smaller than the list of
> pages with good data, so it asks for that instead.
> 
> A write to a page means that it has valuable data, regardless of whether
> it was in the free list or not.
> 
> The hypervisor only skips moving pages that were free *and* were never
> written to.

Right - except never is a long time, so we just need it "since the request
was received".

> So we never lose data, even if this "get free page info"
> stuff is totally out of date.

If you include pages that were written to before the request,
then yes, data will be lost. This is why we do this scan
after we get the request and not e.g. on boot :)

> The patch description and code comments are, um, a _bit_ light for this
> level of subtlety. :)

Add to that: it is always safe to skip a page and not add it to the list.
So the requirement is about when a page must *not* be on this list:
it must not be there if it is needed by the guest but was not
modified since the request.

Calling it "free" will just keep confusing people.  Either use the
verbose "free or modified" or invent a new word like "discardable" and
add a comment explaining that a page is always discardable unless its
content is needed by Linux but was not modified since the request.
Liang Li July 28, 2016, 12:10 a.m. UTC | #6
> Subject: Re: [PATCH v2 repost 6/7] mm: add the related functions to get free
> page info
> 
> On 07/26/2016 06:23 PM, Liang Li wrote:
> > +	for_each_migratetype_order(order, t) {
> > +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
> > +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> > +			if (pfn >= start_pfn && pfn <= end_pfn) {
> > +				page_num = 1UL << order;
> > +				if (pfn + page_num > end_pfn)
> > +					page_num = end_pfn - pfn;
> > +				bitmap_set(bitmap, pfn - start_pfn, page_num);
> > +			}
> > +		}
> > +	}
> 
> Nit:  The 'page_num' nomenclature really confused me here.  It is the
> number of bits being set in the bitmap.  Seems like calling it nr_pages or
> num_pages would be more appropriate.
> 

You are right, will change.

> Isn't this bitmap out of date by the time it's send up to the hypervisor?  Is
> there something that makes the inaccuracy OK here?

Yes. Dirty page logging will be used to correct the inaccuracy.
Dirty page logging should be started before getting the free page bitmap;
then, if some of the free pages stop being free because they are written
to, these pages will be tracked by the dirty page logging mechanism.

Thanks!
Liang
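
Why the ordering matters can be shown with a toy model. This is a sketch
under stated assumptions, not the actual mechanism: struct guest_model,
guest_write() and pages_skipped() are hypothetical, the guest has only a
handful of pages tracked in a single word, and a write is assumed to both
take the page off the free list and, if logging is on, mark it dirty.

```c
#include <assert.h>
#include <stdbool.h>

struct guest_model {
    unsigned long free_bits;   /* pages currently on the guest's free lists */
    unsigned long dirty_bits;  /* pages the hypervisor has logged as written */
    bool logging;              /* is dirty page logging enabled? */
};

static void guest_write(struct guest_model *g, unsigned int page)
{
    assert(page < 8 * sizeof(unsigned long));

    g->free_bits &= ~(1UL << page);       /* page is allocated and in use */
    if (g->logging)
        g->dirty_bits |= 1UL << page;     /* write is seen by the hypervisor */
}

/* Returns the set of pages the hypervisor would skip in each ordering. */
static unsigned long pages_skipped(bool log_before_snapshot)
{
    struct guest_model g = { .free_bits = 0x3UL };   /* pages 0 and 1 free */
    unsigned long snapshot;

    if (log_before_snapshot) {
        g.logging = true;
        snapshot = g.free_bits;
        guest_write(&g, 1);    /* races with sending the bitmap, but logged */
    } else {
        snapshot = g.free_bits;
        guest_write(&g, 1);    /* happens before logging starts: not seen */
        g.logging = true;
    }

    return snapshot & ~g.dirty_bits;
}
```

In this model, starting logging first leaves only page 0 in the skip set,
while taking the snapshot first wrongly keeps the written page 1 in it.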
Michael S. Tsirkin July 28, 2016, 12:17 a.m. UTC | #7
On Thu, Jul 28, 2016 at 12:10:16AM +0000, Li, Liang Z wrote:
> > Subject: Re: [PATCH v2 repost 6/7] mm: add the related functions to get free
> > page info
> > 
> > On 07/26/2016 06:23 PM, Liang Li wrote:
> > > +	for_each_migratetype_order(order, t) {
> > > +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
> > > +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> > > +			if (pfn >= start_pfn && pfn <= end_pfn) {
> > > +				page_num = 1UL << order;
> > > +				if (pfn + page_num > end_pfn)
> > > +					page_num = end_pfn - pfn;
> > > +				bitmap_set(bitmap, pfn - start_pfn, page_num);
> > > +			}
> > > +		}
> > > +	}
> > 
> > Nit:  The 'page_num' nomenclature really confused me here.  It is the
> > number of bits being set in the bitmap.  Seems like calling it nr_pages or
> > num_pages would be more appropriate.
> > 
> 
> You are right,  will change.
> 
> > Isn't this bitmap out of date by the time it's send up to the hypervisor?  Is
> > there something that makes the inaccuracy OK here?
> 
> Yes. The dirty page logging will be used to correct the inaccuracy.
> The dirty page logging should be started before getting the free page bitmap, then if some of the free pages become no free for writing, these pages will be tracked by the dirty page logging mechanism.
> 
> Thanks!
> Liang

Right, but this should be clear from the code and naming.
Liang Li July 28, 2016, 4:36 a.m. UTC | #8
> On 07/27/2016 03:05 PM, Michael S. Tsirkin wrote:
> > On Wed, Jul 27, 2016 at 09:40:56AM -0700, Dave Hansen wrote:
> >> On 07/26/2016 06:23 PM, Liang Li wrote:
> >>> +	for_each_migratetype_order(order, t) {
> >>> +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
> >>> +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> >>> +			if (pfn >= start_pfn && pfn <= end_pfn) {
> >>> +				page_num = 1UL << order;
> >>> +				if (pfn + page_num > end_pfn)
> >>> +					page_num = end_pfn - pfn;
> >>> +				bitmap_set(bitmap, pfn - start_pfn, page_num);
> >>> +			}
> >>> +		}
> >>> +	}
> >>
> >> Nit:  The 'page_num' nomenclature really confused me here.  It is the
> >> number of bits being set in the bitmap.  Seems like calling it
> >> nr_pages or num_pages would be more appropriate.
> >>
> >> Isn't this bitmap out of date by the time it's send up to the
> >> hypervisor?  Is there something that makes the inaccuracy OK here?
> >
> > Yes. Calling these free pages is unfortunate. It's likely to confuse
> > people thinking they can just discard these pages.
> >
> > Hypervisor sends a request. We respond with this list of pages, and
> > the guarantee hypervisor needs is that these were free sometime
> > between request and response, so they are safe to free if they are
> > unmodified since the request. hypervisor can detect modifications so
> > it can detect modifications itself and does not need guest help.
> 
> Ahh, that makes sense.
> 
> So the hypervisor is trying to figure out: "Which pages do I move?".  It wants
> to know which pages the guest thinks have good data and need to move.
> But, the list of free pages is (likely) smaller than the list of pages with good
> data, so it asks for that instead.
> 
> A write to a page means that it has valuable data, regardless of whether it
> was in the free list or not.
> 
> The hypervisor only skips moving pages that were free *and* were never
> written to.  So we never lose data, even if this "get free page info"
> stuff is totally out of date.
> 
> The patch description and code comments are, um, a _bit_ light for this level
> of subtlety. :)

I will add more description about this in v3.

Thanks!
Liang
Liang Li July 28, 2016, 5:30 a.m. UTC | #9
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 7da61ad..3ad8b10 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4523,6 +4523,52 @@ unsigned long get_max_pfn(void)
> >  }
> >  EXPORT_SYMBOL(get_max_pfn);
> >
> > +static void mark_free_pages_bitmap(struct zone *zone, unsigned long start_pfn,
> > +	unsigned long end_pfn, unsigned long *bitmap, unsigned long len)
> > +{
> > +	unsigned long pfn, flags, page_num;
> > +	unsigned int order, t;
> > +	struct list_head *curr;
> > +
> > +	if (zone_is_empty(zone))
> > +		return;
> > +	end_pfn = min(start_pfn + len, end_pfn);
> > +	spin_lock_irqsave(&zone->lock, flags);
> > +
> > +	for_each_migratetype_order(order, t) {
> 
> Why not do each order separately? This way you can use a single bit to pass a
> huge page to host.
> 

I thought about that before, and did not do it because of the complexity and the
small benefit. Using separate page bitmaps for each order won't help to reduce the
data traffic, except by ignoring the pages with small order.

> Not a requirement but hey.
> 
> Alternatively (and maybe that is a better idea0 if you wanted to, you could
> just skip lone 4K pages.
> It's not clear that they are worth bothering with.
> Add a flag to start with some reasonably large order and go from there.
> 
One of the main reasons for this patch is to reduce the network traffic as much as
possible, so it looks strange to skip the lone 4K pages. Can skipping these pages make
live migration faster? I think it depends on the number of lone 4K pages.

On the other hand, it's faster to send one bit through virtio than to send 4K bytes
through even a 10Gbps network, is that true?

Liang
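
One half of that comparison is plain arithmetic and can be sketched as
below. The helpers wire_time_ns() and bitmap_bytes() are hypothetical, the
calculation counts payload bits only (no protocol overhead), and it says
nothing about how long the virtio side takes, so it frames the question
rather than answering it.

```c
#include <assert.h>

/* Time in nanoseconds to put `bytes` of payload on a link of `gbit_per_s`. */
static double wire_time_ns(unsigned long bytes, double gbit_per_s)
{
    assert(gbit_per_s > 0.0);
    return (double)bytes * 8.0 / gbit_per_s;   /* 1 Gbit/s == 1 bit per ns */
}

/* Size in bytes of a one-bit-per-page bitmap covering nr_pages pages. */
static unsigned long bitmap_bytes(unsigned long nr_pages)
{
    return (nr_pages + 7) / 8;
}
```

Under these assumptions a 4096-byte page occupies a 10 Gbit/s link for
roughly 3.3 microseconds, and a guest with 2^20 4K pages (4 GiB) needs a
128 KiB bitmap at one bit per page.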

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7da61ad..3ad8b10 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4523,6 +4523,52 @@  unsigned long get_max_pfn(void)
 }
 EXPORT_SYMBOL(get_max_pfn);
 
+static void mark_free_pages_bitmap(struct zone *zone, unsigned long start_pfn,
+	unsigned long end_pfn, unsigned long *bitmap, unsigned long len)
+{
+	unsigned long pfn, flags, page_num;
+	unsigned int order, t;
+	struct list_head *curr;
+
+	if (zone_is_empty(zone))
+		return;
+	end_pfn = min(start_pfn + len, end_pfn);
+	spin_lock_irqsave(&zone->lock, flags);
+
+	for_each_migratetype_order(order, t) {
+		list_for_each(curr, &zone->free_area[order].free_list[t]) {
+			pfn = page_to_pfn(list_entry(curr, struct page, lru));
+			if (pfn >= start_pfn && pfn <= end_pfn) {
+				page_num = 1UL << order;
+				if (pfn + page_num > end_pfn)
+					page_num = end_pfn - pfn;
+				bitmap_set(bitmap, pfn - start_pfn, page_num);
+			}
+		}
+	}
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+int get_free_pages(unsigned long start_pfn, unsigned long end_pfn,
+		unsigned long *bitmap, unsigned long len)
+{
+	struct zone *zone;
+	int ret = 0;
+
+	if (bitmap == NULL || start_pfn > end_pfn || start_pfn >= max_pfn)
+		return 0;
+	if (end_pfn < max_pfn)
+		ret = 1;
+	if (end_pfn >= max_pfn)
+		ret = 0;
+
+	for_each_populated_zone(zone)
+		mark_free_pages_bitmap(zone, start_pfn, end_pfn, bitmap, len);
+	return ret;
+}
+EXPORT_SYMBOL(get_free_pages);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;