
[v33,1/4] mm: add a function to get free page blocks

Message ID 1529037793-35521-2-git-send-email-wei.w.wang@intel.com (mailing list archive)
State New, archived

Commit Message

Wang, Wei W June 15, 2018, 4:43 a.m. UTC
This patch adds a function to get free page blocks from a free page
list. The obtained free page blocks are hints about free pages, because
there is no guarantee that they are still on the free page list after
the function returns.

One example use of this function is to accelerate live migration by
skipping the transfer of free pages reported by the guest. A popular
method for the hypervisor to track which parts of memory are written
during live migration is to write-protect all the guest memory. Thus,
pages that are hinted as free but are written after this function
returns will be caught by the hypervisor and added to the next round of
memory transfer.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 include/linux/mm.h |  1 +
 mm/page_alloc.c    | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+)
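The interplay described in the commit message (free-page hints plus write-protection dirty tracking) can be sketched with a small userspace simulation. This is a hypothetical illustration, not kernel or QEMU code: all names (`migration_sim`, `apply_hint`, etc.) are invented for the sketch. The point it demonstrates is why stale hints are safe: a page hinted free but written afterwards shows up in the dirty log and is re-queued for the next round.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

/* Hypothetical simulation of free-page hinting during live migration:
 * the hypervisor write-protects guest memory, collects free-page hints,
 * and re-sends any hinted page that the guest dirtied afterwards. */
struct migration_sim {
    bool need_send[NPAGES];  /* pages still queued for transfer */
    bool dirty[NPAGES];      /* write-protection fault log */
};

static void start_round(struct migration_sim *s)
{
    for (size_t i = 0; i < NPAGES; i++) {
        s->need_send[i] = true;   /* initially, send everything */
        s->dirty[i] = false;      /* write-protect: clear dirty log */
    }
}

/* A free-page hint lets the hypervisor skip transferring the page... */
static void apply_hint(struct migration_sim *s, size_t pfn)
{
    s->need_send[pfn] = false;
}

/* ...but a guest write (caught by write protection) logs it dirty... */
static void guest_write(struct migration_sim *s, size_t pfn)
{
    s->dirty[pfn] = true;
}

/* ...so a stale hint is corrected in the next transfer round. */
static void next_round(struct migration_sim *s)
{
    for (size_t i = 0; i < NPAGES; i++)
        if (s->dirty[i])
            s->need_send[i] = true;
}
```

For example, hinting pages 3 and 5 removes both from the transfer set; if the guest then writes page 3, `next_round()` re-queues it while page 5 stays skipped.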

Comments

Linus Torvalds June 15, 2018, 11:08 p.m. UTC | #1
On Fri, Jun 15, 2018 at 2:08 PM Wei Wang <wei.w.wang@intel.com> wrote:
>
> This patch adds a function to get free pages blocks from a free page
> list. The obtained free page blocks are hints about free pages, because
> there is no guarantee that they are still on the free page list after
> the function returns.

Ack. This is the kind of simple interface where I don't need to worry
about the MM code calling out to random drivers or subsystems.

I think that "order" should be checked for validity, but from a MM
standpoint I think this is fine.

                Linus
Wang, Wei W June 16, 2018, 1:19 a.m. UTC | #2
On Saturday, June 16, 2018 7:09 AM, Linus Torvalds wrote:
> On Fri, Jun 15, 2018 at 2:08 PM Wei Wang <wei.w.wang@intel.com> wrote:
> >
> > This patch adds a function to get free pages blocks from a free page
> > list. The obtained free page blocks are hints about free pages,
> > because there is no guarantee that they are still on the free page
> > list after the function returns.
>
> Ack. This is the kind of simple interface where I don't need to worry about
> the MM code calling out to random drivers or subsystems.
>
> I think that "order" should be checked for validity, but from a MM standpoint
> I think this is fine.

Thanks, will add a validity check for "order".

Best,
Wei
Matthew Wilcox (Oracle) June 16, 2018, 4:50 a.m. UTC | #3
On Fri, Jun 15, 2018 at 12:43:10PM +0800, Wei Wang wrote:
> +/**
> + * get_from_free_page_list - get free page blocks from a free page list
> + * @order: the order of the free page list to check
> + * @buf: the array to store the physical addresses of the free page blocks
> + * @size: the array size
> + *
> + * This function offers hints about free pages. There is no guarantee that
> + * the obtained free pages are still on the free page list after the function
> + * returns. pfn_to_page on the obtained free pages is strongly discouraged
> + * and if there is an absolute need for that, make sure to contact MM people
> + * to discuss potential problems.
> + *
> + * The addresses are currently stored to the array in little endian. This
> + * avoids the overhead of converting endianness by the caller who needs data
> + * in the little endian format. Big endian support can be added on demand in
> + * the future.
> + *
> + * Return the number of free page blocks obtained from the free page list.
> + * The maximum number of free page blocks that can be obtained is limited to
> + * the caller's array size.
> + */

Please use:

 * Return: The number of free page blocks obtained from the free page list.

Also, please include a

 * Context: Any context.

or

 * Context: Process context.

or whatever other context this function can be called from.  Since you're
taking the lock irqsafe, I assume this can be called from any context, but
I wonder if it makes sense to have this function callable from interrupt
context.  Maybe this should be callable from process context only.

> +uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t size)
> +{
> +	struct zone *zone;
> +	enum migratetype mt;
> +	struct page *page;
> +	struct list_head *list;
> +	unsigned long addr, flags;
> +	uint32_t index = 0;
> +
> +	for_each_populated_zone(zone) {
> +		spin_lock_irqsave(&zone->lock, flags);
> +		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> +			list = &zone->free_area[order].free_list[mt];
> +			list_for_each_entry(page, list, lru) {
> +				addr = page_to_pfn(page) << PAGE_SHIFT;
> +				if (likely(index < size)) {
> +					buf[index++] = cpu_to_le64(addr);
> +				} else {
> +					spin_unlock_irqrestore(&zone->lock,
> +							       flags);
> +					return index;
> +				}
> +			}
> +		}
> +		spin_unlock_irqrestore(&zone->lock, flags);
> +	}
> +
> +	return index;
> +}

I wonder if (to address Michael's concern), you shouldn't instead use
the first free chunk of pages to return the addresses of all the pages.
ie something like this:

	__le64 *ret = NULL;
	unsigned int max = (PAGE_SIZE << order) / sizeof(__le64);

	for_each_populated_zone(zone) {
		spin_lock_irq(&zone->lock);
		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
			list = &zone->free_area[order].free_list[mt];
			list_for_each_entry_safe(page, list, lru, ...) {
				if (index == size)
					break;
				addr = page_to_pfn(page) << PAGE_SHIFT;
				if (!ret) {
					list_del(...);
					ret = addr;
				}
				ret[index++] = cpu_to_le64(addr);
			}
		}
		spin_unlock_irq(&zone->lock);
	}

	return ret;
}

You'll need to return the page to the freelist afterwards, but free_pages()
should take care of that.
Wang, Wei W June 17, 2018, 12:07 a.m. UTC | #4
On Saturday, June 16, 2018 12:50 PM, Matthew Wilcox wrote:
> On Fri, Jun 15, 2018 at 12:43:10PM +0800, Wei Wang wrote:
> > +/**
> > + * get_from_free_page_list - get free page blocks from a free page
> > +list
> > + * @order: the order of the free page list to check
> > + * @buf: the array to store the physical addresses of the free page
> > +blocks
> > + * @size: the array size
> > + *
> > + * This function offers hints about free pages. There is no guarantee
> > +that
> > + * the obtained free pages are still on the free page list after the
> > +function
> > + * returns. pfn_to_page on the obtained free pages is strongly
> > +discouraged
> > + * and if there is an absolute need for that, make sure to contact MM
> > +people
> > + * to discuss potential problems.
> > + *
> > + * The addresses are currently stored to the array in little endian.
> > +This
> > + * avoids the overhead of converting endianness by the caller who
> > +needs data
> > + * in the little endian format. Big endian support can be added on
> > +demand in
> > + * the future.
> > + *
> > + * Return the number of free page blocks obtained from the free page list.
> > + * The maximum number of free page blocks that can be obtained is
> > +limited to
> > + * the caller's array size.
> > + */
> 
> Please use:
> 
>  * Return: The number of free page blocks obtained from the free page list.
> 
> Also, please include a
> 
>  * Context: Any context.
> 
> or
> 
>  * Context: Process context.
> 
> or whatever other context this function can be called from.  Since you're
> taking the lock irqsafe, I assume this can be called from any context, but I
> wonder if it makes sense to have this function callable from interrupt context.
> Maybe this should be callable from process context only.

Thanks, sounds better to make it process context only.

> 
> > +uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t
> > +size) {
> > +	struct zone *zone;
> > +	enum migratetype mt;
> > +	struct page *page;
> > +	struct list_head *list;
> > +	unsigned long addr, flags;
> > +	uint32_t index = 0;
> > +
> > +	for_each_populated_zone(zone) {
> > +		spin_lock_irqsave(&zone->lock, flags);
> > +		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> > +			list = &zone->free_area[order].free_list[mt];
> > +			list_for_each_entry(page, list, lru) {
> > +				addr = page_to_pfn(page) << PAGE_SHIFT;
> > +				if (likely(index < size)) {
> > +					buf[index++] = cpu_to_le64(addr);
> > +				} else {
> > +					spin_unlock_irqrestore(&zone->lock,
> > +							       flags);
> > +					return index;
> > +				}
> > +			}
> > +		}
> > +		spin_unlock_irqrestore(&zone->lock, flags);
> > +	}
> > +
> > +	return index;
> > +}
> 
> I wonder if (to address Michael's concern), you shouldn't instead use the first
> free chunk of pages to return the addresses of all the pages.
> ie something like this:
> 
> 	__le64 *ret = NULL;
> 	unsigned int max = (PAGE_SIZE << order) / sizeof(__le64);
> 
> 	for_each_populated_zone(zone) {
> 		spin_lock_irq(&zone->lock);
> 		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> 			list = &zone->free_area[order].free_list[mt];
> 			list_for_each_entry_safe(page, list, lru, ...) {
> 				if (index == size)
> 					break;
> 				addr = page_to_pfn(page) << PAGE_SHIFT;
> 				if (!ret) {
> 					list_del(...);

Thanks for sharing. But we probably would not take this approach, for the reasons below:

1) I'm not sure that getting a block of free pages to use could be that simple (just plucking it from the list as above). I think it is more prudent to let the callers allocate the array via the regular allocation functions.

2) Callers may need to use this with their own defined protocols, and they want the header and payload (i.e. the obtained hints) to reside in physically contiguous memory (there are tricks they could use to make it work with non-contiguous memory, but that would just complicate everything). In this case, it is better to have callers allocate the memory on their own and pass the payload part of that memory to this API to be filled.

Best,
Wei
Michael S. Tsirkin June 18, 2018, 2:16 a.m. UTC | #5
On Fri, Jun 15, 2018 at 09:50:05PM -0700, Matthew Wilcox wrote:
> I wonder if (to address Michael's concern), you shouldn't instead use
> the first free chunk of pages to return the addresses of all the pages.
> ie something like this:
> 
> 	__le64 *ret = NULL;
> 	unsigned int max = (PAGE_SIZE << order) / sizeof(__le64);
> 
> 	for_each_populated_zone(zone) {
> 		spin_lock_irq(&zone->lock);
> 		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> 			list = &zone->free_area[order].free_list[mt];
> 			list_for_each_entry_safe(page, list, lru, ...) {
> 				if (index == size)
> 					break;
> 				addr = page_to_pfn(page) << PAGE_SHIFT;
> 				if (!ret) {
> 					list_del(...);
> 					ret = addr;
> 				}
> 				ret[index++] = cpu_to_le64(addr);
> 			}
> 		}
> 		spin_unlock_irq(&zone->lock);
> 	}
> 
> 	return ret;
> }
> 
> You'll need to return the page to the freelist afterwards, but free_pages()
> should take care of that.

Yes, Wei already came up with the idea to stick this data into a
MAX_ORDER allocation. Are you sure just taking an entry off
the list like that has no bad side effects?
I have a vague memory of someone complaining that everyone
must go through get_free_pages()/kmalloc(), but I can't
find that anymore.
Michael S. Tsirkin June 26, 2018, 1:55 a.m. UTC | #6
On Sat, Jun 16, 2018 at 08:08:53AM +0900, Linus Torvalds wrote:
> On Fri, Jun 15, 2018 at 2:08 PM Wei Wang <wei.w.wang@intel.com> wrote:
> >
> > This patch adds a function to get free pages blocks from a free page
> > list. The obtained free page blocks are hints about free pages, because
> > there is no guarantee that they are still on the free page list after
> > the function returns.

...

> > +uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t size)

...

> 
> Ack. This is the kind of simple interface where I don't need to worry
> about the MM code calling out to random drivers or subsystems.
> 
> I think that "order" should be checked for validity, but from a MM
> standpoint I think this is fine.
> 
>                 Linus

The only issue seems to be getting hold of a buf that's large enough -
and we don't really know what the size should be, or whether one
buf would be enough.

Linus, do you think it would be ok to have get_from_free_page_list
actually pop entries from the free list and use them as the buffer
to store PAs?

Caller would be responsible for freeing the returned entries.
Linus Torvalds June 27, 2018, 4:05 p.m. UTC | #7
[ Sorry for slow reply, my travels have made a mess of my inbox ]

On Mon, Jun 25, 2018 at 6:55 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> Linus, do you think it would be ok to have get_from_free_page_list
> actually pop entries from the free list and use them as the buffer
> to store PAs?

Honestly, what I think the best option would be is to get rid of this
interface *entirely*, and just have the balloon code do

    #define GFP_MINFLAGS (__GFP_NORETRY | __GFP_NOWARN |
__GFP_THISNODE | __GFP_NOMEMALLOC)

    struct page *page =  alloc_pages(GFP_MINFLAGS, MAX_ORDER-1);

 which is not a new interface, and simply removes the max-order page
from the list if at all possible.

The above has the advantage of "just working", and not having any races.

Now, because you don't want to necessarily *entirely* deplete the max
order, I'd suggest that the *one* new interface you add is just a "how
many max-order pages are there" interface. So then you can query
(either before or after getting the max-order page) just how many of
them there were and whether you want to give that page back.

Notice? No need for any page lists or physical addresses. No races. No
complex new functions.

The physical address you can just get from the "struct page" you got.

And if you run out of memory because of getting a page, you get all
the usual "hey, we ran out of memory" responses..

Wouldn't the above be sufficient?

            Linus
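Linus's suggestion above can be illustrated with a rough userspace analogy. This is a hedged sketch, not the actual balloon implementation: in the kernel the call would be alloc_pages(GFP_MINFLAGS, MAX_ORDER-1) returning a struct page, whereas here aligned_alloc() stands in for the allocator, and all names are invented for illustration. What the sketch shows is the race-free property: while you hold the block, nobody else can be using it, so reporting its address as "free" is safe, and freeing it afterwards returns it to the allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace analogy (hypothetical) of "allocate a max-order block,
 * report its address as a free-page hint, then give it back". */

#define BLOCK_SIZE (4u << 20)   /* stand-in for a max-order block */

/* Returns the address hint, or 0 if no block was available.
 * On success, *block holds the block so the caller can return it. */
static uintptr_t report_one_free_block(void **block)
{
    *block = aligned_alloc(4096, BLOCK_SIZE);
    if (!*block)
        return 0;               /* allocator exhausted: stop hinting */
    return (uintptr_t)*block;   /* hint = address of a block we own */
}

static void return_free_block(void *block)
{
    free(block);                /* back to the allocator's free list */
}
```

The "no races" property Linus points out falls out directly: the hint is only ever the address of memory the hinting code itself owns at that moment, so it cannot be concurrently reused by anyone else before the hint is taken.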
Michael S. Tsirkin June 27, 2018, 7:07 p.m. UTC | #8
On Wed, Jun 27, 2018 at 09:05:39AM -0700, Linus Torvalds wrote:
> [ Sorry for slow reply, my travels have made a mess of my inbox ]
> 
> On Mon, Jun 25, 2018 at 6:55 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > Linus, do you think it would be ok to have get_from_free_page_list
> > actually pop entries from the free list and use them as the buffer
> > to store PAs?
> 
> Honestly, what I think the best option would be is to get rid of this
> interface *entirely*, and just have the balloon code do
> 
>     #define GFP_MINFLAGS (__GFP_NORETRY | __GFP_NOWARN |
> __GFP_THISNODE | __GFP_NOMEMALLOC)
> 
>     struct page *page =  alloc_pages(GFP_MINFLAGS, MAX_ORDER-1);
> 
>  which is not a new interface, and simply removes the max-order page
> from the list if at all possible.
> 
> The above has the advantage of "just working", and not having any races.
> 
> Now, because you don't want to necessarily *entirely* deplete the max
> order, I'd suggest that the *one* new interface you add is just a "how
> many max-order pages are there" interface. So then you can query
> (either before or after getting the max-order page) just how many of
> them there were and whether you want to give that page back.
> 
> Notice? No need for any page lists or physical addresses. No races. No
> complex new functions.
> 
> The physical address you can just get from the "struct page" you got.
> 
> And if you run out of memory because of getting a page, you get all
> the usual "hey, we ran out of memory" responses..
> 
> Wouldn't the above be sufficient?
> 
>             Linus

I think so, thanks!

Wei, to put it in balloon terms, I think there's one thing we missed: if
you do manage to allocate a page, and you don't have a use for it, then
hey, you can just give it to the host because you know it's free - you
are going to return it to the free list.
Wang, Wei W June 28, 2018, 8:39 a.m. UTC | #9
On 06/28/2018 03:07 AM, Michael S. Tsirkin wrote:
> On Wed, Jun 27, 2018 at 09:05:39AM -0700, Linus Torvalds wrote:
>> [ Sorry for slow reply, my travels have made a mess of my inbox ]
>>
>> On Mon, Jun 25, 2018 at 6:55 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>> Linus, do you think it would be ok to have get_from_free_page_list
>>> actually pop entries from the free list and use them as the buffer
>>> to store PAs?
>> Honestly, what I think the best option would be is to get rid of this
>> interface *entirely*, and just have the balloon code do
>>
>>      #define GFP_MINFLAGS (__GFP_NORETRY | __GFP_NOWARN |
>> __GFP_THISNODE | __GFP_NOMEMALLOC)
>>
>>      struct page *page =  alloc_pages(GFP_MINFLAGS, MAX_ORDER-1);
>>
>>   which is not a new interface, and simply removes the max-order page
>> from the list if at all possible.
>>
>> The above has the advantage of "just working", and not having any races.
>>
>> Now, because you don't want to necessarily *entirely* deplete the max
>> order, I'd suggest that the *one* new interface you add is just a "how
>> many max-order pages are there" interface. So then you can query
>> (either before or after getting the max-order page) just how many of
>> them there were and whether you want to give that page back.
>>
>> Notice? No need for any page lists or physical addresses. No races. No
>> complex new functions.
>>
>> The physical address you can just get from the "struct page" you got.
>>
>> And if you run out of memory because of getting a page, you get all
>> the usual "hey, we ran out of memory" responses..
>>
>> Wouldn't the above be sufficient?
>>
>>              Linus


Thanks for the elaboration.

> I think so, thanks!
>
> Wei, to put it in balloon terms, I think there's one thing we missed: if
> you do manage to allocate a page, and you don't have a use for it, then
> hey, you can just give it to the host because you know it's free - you
> are going to return it to the free list.
>

I'm not sure this would be better than Linus' previous suggestion,
because live migration is expected to be performed without disturbing
the guest. If we allocate as much of the free memory as possible, the
guest applications would be seriously affected. For example, the
network would become very slow, as the allocation of sk_buffs often
triggers OOM during live migration. And if live migration happens from
time to time, and users try memory-related tools like "free -h" on the
guest, the reported statistics (e.g. the free memory dropping abruptly
due to the balloon allocations) would confuse them.

With the previous suggestion, we only get hints of the free pages (i.e. 
just report the address of free pages to host without taking them off 
the list).

Best,
Wei

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e49388..c58b4e5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2002,6 +2002,7 @@  extern void free_area_init(unsigned long * zones_size);
 extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
+uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t size);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 07b3c23..7c816d9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5043,6 +5043,58 @@  void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
+/**
+ * get_from_free_page_list - get free page blocks from a free page list
+ * @order: the order of the free page list to check
+ * @buf: the array to store the physical addresses of the free page blocks
+ * @size: the array size
+ *
+ * This function offers hints about free pages. There is no guarantee that
+ * the obtained free pages are still on the free page list after the function
+ * returns. pfn_to_page on the obtained free pages is strongly discouraged
+ * and if there is an absolute need for that, make sure to contact MM people
+ * to discuss potential problems.
+ *
+ * The addresses are currently stored to the array in little endian. This
+ * avoids the overhead of converting endianness by the caller who needs data
+ * in the little endian format. Big endian support can be added on demand in
+ * the future.
+ *
+ * Return the number of free page blocks obtained from the free page list.
+ * The maximum number of free page blocks that can be obtained is limited to
+ * the caller's array size.
+ */
+uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t size)
+{
+	struct zone *zone;
+	enum migratetype mt;
+	struct page *page;
+	struct list_head *list;
+	unsigned long addr, flags;
+	uint32_t index = 0;
+
+	for_each_populated_zone(zone) {
+		spin_lock_irqsave(&zone->lock, flags);
+		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+			list = &zone->free_area[order].free_list[mt];
+			list_for_each_entry(page, list, lru) {
+				addr = page_to_pfn(page) << PAGE_SHIFT;
+				if (likely(index < size)) {
+					buf[index++] = cpu_to_le64(addr);
+				} else {
+					spin_unlock_irqrestore(&zone->lock,
+							       flags);
+					return index;
+				}
+			}
+		}
+		spin_unlock_irqrestore(&zone->lock, flags);
+	}
+
+	return index;
+}
+EXPORT_SYMBOL_GPL(get_from_free_page_list);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;