
[v29,1/4] mm: support reporting free page blocks

Message ID 1522031994-7246-2-git-send-email-wei.w.wang@intel.com (mailing list archive)
State New, archived

Commit Message

Wang, Wei W March 26, 2018, 2:39 a.m. UTC
This patch adds support to walk through the free page blocks in the
system and report them via a callback function. Some page blocks may
leave the free list after zone->lock is released, so it is the caller's
responsibility to either detect or prevent the use of such pages.

One use example of this patch is to accelerate live migration by skipping
the transfer of free pages reported from the guest. A popular method used
by the hypervisor to track which part of memory is written during live
migration is to write-protect all the guest memory. So, those pages that
are reported as free pages but are written after the report function
returns will be captured by the hypervisor, and they will be added to the
next round of memory transfer.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
---
 include/linux/mm.h |  6 ++++
 mm/page_alloc.c    | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 102 insertions(+)
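
To make the intended calling convention concrete, here is a rough caller sketch (not part of the patch; the struct and function names below are made up for illustration). The callback only records the reported ranges into a preallocated buffer, so it neither sleeps nor allocates while zone->lock is held, and the consumer has to tolerate ranges that stop being free after the walk returns:

/* Hypothetical caller sketch -- not part of this patch. */
struct free_range {
	unsigned long pfn;
	unsigned long nr_pages;
};

struct range_buf {
	struct free_range *ranges;	/* preallocated by the caller */
	unsigned long capacity;
	unsigned long used;
};

/* Runs under zone->lock with IRQs disabled: record the range and return. */
static int record_free_range(void *opaque, unsigned long pfn,
			     unsigned long nr_pages)
{
	struct range_buf *buf = opaque;

	if (buf->used == buf->capacity)
		return 1;	/* a non-zero return stops the walk */

	buf->ranges[buf->used].pfn = pfn;
	buf->ranges[buf->used].nr_pages = nr_pages;
	buf->used++;
	return 0;
}

static void report_free_pages(struct range_buf *buf)
{
	/*
	 * Query only the largest order: low orders are volatile, and any
	 * reported range may be reallocated after the walk returns, so the
	 * consumer (e.g. a hypervisor skipping free pages during live
	 * migration) must be able to tolerate stale entries.
	 */
	walk_free_mem_block(buf, MAX_ORDER - 1, record_free_range);
}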

Comments

Andrew Morton March 26, 2018, 9:22 p.m. UTC | #1
On Mon, 26 Mar 2018 10:39:51 +0800 Wei Wang <wei.w.wang@intel.com> wrote:

> This patch adds support to walk through the free page blocks in the
> system and report them via a callback function. Some page blocks may
> leave the free list after zone->lock is released, so it is the caller's
> responsibility to either detect or prevent the use of such pages.
> 
> One use example of this patch is to accelerate live migration by skipping
> the transfer of free pages reported from the guest. A popular method used
> by the hypervisor to track which part of memory is written during live
> migration is to write-protect all the guest memory. So, those pages that
> are reported as free pages but are written after the report function
> returns will be captured by the hypervisor, and they will be added to the
> next round of memory transfer.
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4912,6 +4912,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  	show_swap_cache_info();
>  }
>  
> +/*
> + * Walk through a free page list and report the found pfn range via the
> + * callback.
> + *
> + * Return 0 if it completes the reporting. Otherwise, return the non-zero
> + * value returned from the callback.
> + */
> +static int walk_free_page_list(void *opaque,
> +			       struct zone *zone,
> +			       int order,
> +			       enum migratetype mt,
> +			       int (*report_pfn_range)(void *,
> +						       unsigned long,
> +						       unsigned long))
> +{
> +	struct page *page;
> +	struct list_head *list;
> +	unsigned long pfn, flags;
> +	int ret = 0;
> +
> +	spin_lock_irqsave(&zone->lock, flags);
> +	list = &zone->free_area[order].free_list[mt];
> +	list_for_each_entry(page, list, lru) {
> +		pfn = page_to_pfn(page);
> +		ret = report_pfn_range(opaque, pfn, 1 << order);
> +		if (ret)
> +			break;
> +	}
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +
> +	return ret;
> +}
> +
> +/**
> + * walk_free_mem_block - Walk through the free page blocks in the system
> + * @opaque: the context passed from the caller
> + * @min_order: the minimum order of free lists to check
> + * @report_pfn_range: the callback to report the pfn range of the free pages
> + *
> + * If the callback returns a non-zero value, stop iterating the list of free
> + * page blocks. Otherwise, continue to report.
> + *
> + * Please note that there are no locking guarantees for the callback and
> + * that the reported pfn range might be freed or disappear after the
> + * callback returns so the caller has to be very careful how it is used.
> + *
> + * The callback itself must not sleep or perform any operations which would
> + * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
> + * or via any lock dependency. It is generally advisable to implement
> + * the callback as simple as possible and defer any heavy lifting to a
> + * different context.
> + *
> + * There is no guarantee that each free range will be reported only once
> + * during one walk_free_mem_block invocation.
> + *
> + * pfn_to_page on the given range is strongly discouraged and if there is
> + * an absolute need for that make sure to contact MM people to discuss
> + * potential problems.
> + *
> + * The function itself might sleep so it cannot be called from atomic
> + * contexts.

I don't see how walk_free_mem_block() can sleep.

> + * In general low orders tend to be very volatile and so it makes more
> + * sense to query larger ones first for various optimizations which like
> + * ballooning etc... This will reduce the overhead as well.
> + *
> + * Return 0 if it completes the reporting. Otherwise, return the non-zero
> + * value returned from the callback.
> + */
> +int walk_free_mem_block(void *opaque,
> +			int min_order,
> +			int (*report_pfn_range)(void *opaque,
> +			unsigned long pfn,
> +			unsigned long num))
> +{
> +	struct zone *zone;
> +	int order;
> +	enum migratetype mt;
> +	int ret;
> +
> +	for_each_populated_zone(zone) {
> +		for (order = MAX_ORDER - 1; order >= min_order; order--) {
> +			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> +				ret = walk_free_page_list(opaque, zone,
> +							  order, mt,
> +							  report_pfn_range);
> +				if (ret)
> +					return ret;
> +			}
> +		}
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(walk_free_mem_block);

This looks like it could take a long time.  Will we end up needing to
add cond_resched() in there somewhere?

>  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
>  {
>  	zoneref->zone = zone;
> -- 
> 2.7.4
Wang, Wei W March 27, 2018, 6:23 a.m. UTC | #2
On 03/27/2018 05:22 AM, Andrew Morton wrote:
> On Mon, 26 Mar 2018 10:39:51 +0800 Wei Wang <wei.w.wang@intel.com> wrote:
>
>> This patch adds support to walk through the free page blocks in the
>> system and report them via a callback function. Some page blocks may
>> leave the free list after zone->lock is released, so it is the caller's
>> responsibility to either detect or prevent the use of such pages.
>>
>> One use example of this patch is to accelerate live migration by skipping
>> the transfer of free pages reported from the guest. A popular method used
>> by the hypervisor to track which part of memory is written during live
>> migration is to write-protect all the guest memory. So, those pages that
>> are reported as free pages but are written after the report function
>> returns will be captured by the hypervisor, and they will be added to the
>> next round of memory transfer.
>>
>> ...
>>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4912,6 +4912,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>>   	show_swap_cache_info();
>>   }
>>   
>> +/*
>> + * Walk through a free page list and report the found pfn range via the
>> + * callback.
>> + *
>> + * Return 0 if it completes the reporting. Otherwise, return the non-zero
>> + * value returned from the callback.
>> + */
>> +static int walk_free_page_list(void *opaque,
>> +			       struct zone *zone,
>> +			       int order,
>> +			       enum migratetype mt,
>> +			       int (*report_pfn_range)(void *,
>> +						       unsigned long,
>> +						       unsigned long))
>> +{
>> +	struct page *page;
>> +	struct list_head *list;
>> +	unsigned long pfn, flags;
>> +	int ret = 0;
>> +
>> +	spin_lock_irqsave(&zone->lock, flags);
>> +	list = &zone->free_area[order].free_list[mt];
>> +	list_for_each_entry(page, list, lru) {
>> +		pfn = page_to_pfn(page);
>> +		ret = report_pfn_range(opaque, pfn, 1 << order);
>> +		if (ret)
>> +			break;
>> +	}
>> +	spin_unlock_irqrestore(&zone->lock, flags);
>> +
>> +	return ret;
>> +}
>> +
>> +/**
>> + * walk_free_mem_block - Walk through the free page blocks in the system
>> + * @opaque: the context passed from the caller
>> + * @min_order: the minimum order of free lists to check
>> + * @report_pfn_range: the callback to report the pfn range of the free pages
>> + *
>> + * If the callback returns a non-zero value, stop iterating the list of free
>> + * page blocks. Otherwise, continue to report.
>> + *
>> + * Please note that there are no locking guarantees for the callback and
>> + * that the reported pfn range might be freed or disappear after the
>> + * callback returns so the caller has to be very careful how it is used.
>> + *
>> + * The callback itself must not sleep or perform any operations which would
>> + * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
>> + * or via any lock dependency. It is generally advisable to implement
>> + * the callback as simple as possible and defer any heavy lifting to a
>> + * different context.
>> + *
>> + * There is no guarantee that each free range will be reported only once
>> + * during one walk_free_mem_block invocation.
>> + *
>> + * pfn_to_page on the given range is strongly discouraged and if there is
>> + * an absolute need for that make sure to contact MM people to discuss
>> + * potential problems.
>> + *
>> + * The function itself might sleep so it cannot be called from atomic
>> + * contexts.
> I don't see how walk_free_mem_block() can sleep.

OK, it would be better to remove this sentence for the current version. 
But I think we could probably keep it if we decide to add cond_resched() 
below.

>
>> + * In general low orders tend to be very volatile and so it makes more
>> + * sense to query larger ones first for various optimizations which like
>> + * ballooning etc... This will reduce the overhead as well.
>> + *
>> + * Return 0 if it completes the reporting. Otherwise, return the non-zero
>> + * value returned from the callback.
>> + */
>> +int walk_free_mem_block(void *opaque,
>> +			int min_order,
>> +			int (*report_pfn_range)(void *opaque,
>> +			unsigned long pfn,
>> +			unsigned long num))
>> +{
>> +	struct zone *zone;
>> +	int order;
>> +	enum migratetype mt;
>> +	int ret;
>> +
>> +	for_each_populated_zone(zone) {
>> +		for (order = MAX_ORDER - 1; order >= min_order; order--) {
>> +			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
>> +				ret = walk_free_page_list(opaque, zone,
>> +							  order, mt,
>> +							  report_pfn_range);
>> +				if (ret)
>> +					return ret;
>> +			}
==>
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(walk_free_mem_block);
> This looks like it could take a long time.  Will we end up needing to
> add cond_resched() in there somewhere?

OK. How about adding cond_resched() at the place marked "==>" above (i.e. once 
per order)?


Best,
Wei
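
Wei's suggestion above amounts to rescheduling once per order, after the
migratetype loop has finished and zone->lock has been dropped. A sketch of
that placement (this is the proposal under discussion, not code that was
merged):

	for_each_populated_zone(zone) {
		for (order = MAX_ORDER - 1; order >= min_order; order--) {
			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
				ret = walk_free_page_list(opaque, zone,
							  order, mt,
							  report_pfn_range);
				if (ret)
					return ret;
			}
			/* zone->lock is not held here; safe to reschedule */
			cond_resched();
		}
	}
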
Michal Hocko March 27, 2018, 6:33 a.m. UTC | #3
On Tue 27-03-18 14:23:51, Wei Wang wrote:
> On 03/27/2018 05:22 AM, Andrew Morton wrote:
> > On Mon, 26 Mar 2018 10:39:51 +0800 Wei Wang <wei.w.wang@intel.com> wrote:
> > 
> > > This patch adds support to walk through the free page blocks in the
> > > system and report them via a callback function. Some page blocks may
> > > leave the free list after zone->lock is released, so it is the caller's
> > > responsibility to either detect or prevent the use of such pages.
> > > 
> > > One use example of this patch is to accelerate live migration by skipping
> > > the transfer of free pages reported from the guest. A popular method used
> > > by the hypervisor to track which part of memory is written during live
> > > migration is to write-protect all the guest memory. So, those pages that
> > > are reported as free pages but are written after the report function
> > > returns will be captured by the hypervisor, and they will be added to the
> > > next round of memory transfer.
> > > 
> > > ...
> > > 
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -4912,6 +4912,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
> > >   	show_swap_cache_info();
> > >   }
> > > +/*
> > > + * Walk through a free page list and report the found pfn range via the
> > > + * callback.
> > > + *
> > > + * Return 0 if it completes the reporting. Otherwise, return the non-zero
> > > + * value returned from the callback.
> > > + */
> > > +static int walk_free_page_list(void *opaque,
> > > +			       struct zone *zone,
> > > +			       int order,
> > > +			       enum migratetype mt,
> > > +			       int (*report_pfn_range)(void *,
> > > +						       unsigned long,
> > > +						       unsigned long))
> > > +{
> > > +	struct page *page;
> > > +	struct list_head *list;
> > > +	unsigned long pfn, flags;
> > > +	int ret = 0;
> > > +
> > > +	spin_lock_irqsave(&zone->lock, flags);
> > > +	list = &zone->free_area[order].free_list[mt];
> > > +	list_for_each_entry(page, list, lru) {
> > > +		pfn = page_to_pfn(page);
> > > +		ret = report_pfn_range(opaque, pfn, 1 << order);
> > > +		if (ret)
> > > +			break;
> > > +	}
> > > +	spin_unlock_irqrestore(&zone->lock, flags);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +/**
> > > + * walk_free_mem_block - Walk through the free page blocks in the system
> > > + * @opaque: the context passed from the caller
> > > + * @min_order: the minimum order of free lists to check
> > > + * @report_pfn_range: the callback to report the pfn range of the free pages
> > > + *
> > > + * If the callback returns a non-zero value, stop iterating the list of free
> > > + * page blocks. Otherwise, continue to report.
> > > + *
> > > + * Please note that there are no locking guarantees for the callback and
> > > + * that the reported pfn range might be freed or disappear after the
> > > + * callback returns so the caller has to be very careful how it is used.
> > > + *
> > > + * The callback itself must not sleep or perform any operations which would
> > > + * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
> > > + * or via any lock dependency. It is generally advisable to implement
> > > + * the callback as simple as possible and defer any heavy lifting to a
> > > + * different context.
> > > + *
> > > + * There is no guarantee that each free range will be reported only once
> > > + * during one walk_free_mem_block invocation.
> > > + *
> > > + * pfn_to_page on the given range is strongly discouraged and if there is
> > > + * an absolute need for that make sure to contact MM people to discuss
> > > + * potential problems.
> > > + *
> > > + * The function itself might sleep so it cannot be called from atomic
> > > + * contexts.
> > I don't see how walk_free_mem_block() can sleep.
> 
> OK, it would be better to remove this sentence for the current version. But
> I think we could probably keep it if we decide to add cond_resched() below.

The point of this sentence was to make any user aware that the function
might sleep from the very beginning rather than chase existing callers
when we need to add cond_resched or sleep for any other reason. So I
would rather keep it.
Michael S. Tsirkin March 27, 2018, 4:07 p.m. UTC | #4
On Tue, Mar 27, 2018 at 08:33:22AM +0200, Michal Hocko wrote:
> > > > + * The function itself might sleep so it cannot be called from atomic
> > > > + * contexts.
> > > I don't see how walk_free_mem_block() can sleep.
> > 
> > OK, it would be better to remove this sentence for the current version. But
> > I think we could probably keep it if we decide to add cond_resched() below.
> 
> The point of this sentence was to make any user aware that the function
> might sleep from the very begining rather than chase existing callers
> when we need to add cond_resched or sleep for any other reason. So I
> would rather keep it.

Let's say what it is then - "will be changed to sleep in the future".

> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko March 28, 2018, 7:01 a.m. UTC | #5
On Tue 27-03-18 19:07:22, Michael S. Tsirkin wrote:
> On Tue, Mar 27, 2018 at 08:33:22AM +0200, Michal Hocko wrote:
> > > > > + * The function itself might sleep so it cannot be called from atomic
> > > > > + * contexts.
> > > > I don't see how walk_free_mem_block() can sleep.
> > > 
> > > OK, it would be better to remove this sentence for the current version. But
> > > I think we could probably keep it if we decide to add cond_resched() below.
> > 
> > The point of this sentence was to make any user aware that the function
> > might sleep from the very beginning rather than chase existing callers
> > when we need to add cond_resched or sleep for any other reason. So I
> > would rather keep it.
> 
> Let's say what it is then - "will be changed to sleep in the future".

Do we really want to describe the precise implementation in the
documentation? I thought the main purpose of the documentation is to
describe the _contract_. If I am curious about the implementation I can
look at the code. As I've said earlier in this patchset's lifetime, this
interface is rather dangerous because we are exposing the guts of our
internal data structures. So we had better set expectations of what can and
cannot be done right from the beginning. I definitely do not want
somebody to simply look at the code and see that the interface is
sleepable and abuse that fact.
Michael S. Tsirkin April 10, 2018, 6:19 p.m. UTC | #6
On Mon, Mar 26, 2018 at 02:22:54PM -0700, Andrew Morton wrote:
> On Mon, 26 Mar 2018 10:39:51 +0800 Wei Wang <wei.w.wang@intel.com> wrote:
> 
> > This patch adds support to walk through the free page blocks in the
> > system and report them via a callback function. Some page blocks may
> > leave the free list after zone->lock is released, so it is the caller's
> > responsibility to either detect or prevent the use of such pages.
> > 
> > One use example of this patch is to accelerate live migration by skipping
> > the transfer of free pages reported from the guest. A popular method used
> > by the hypervisor to track which part of memory is written during live
> > migration is to write-protect all the guest memory. So, those pages that
> > are reported as free pages but are written after the report function
> > returns will be captured by the hypervisor, and they will be added to the
> > next round of memory transfer.
> > 
> > ...
> >
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4912,6 +4912,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
> >  	show_swap_cache_info();
> >  }
> >  
> > +/*
> > + * Walk through a free page list and report the found pfn range via the
> > + * callback.
> > + *
> > + * Return 0 if it completes the reporting. Otherwise, return the non-zero
> > + * value returned from the callback.
> > + */
> > +static int walk_free_page_list(void *opaque,
> > +			       struct zone *zone,
> > +			       int order,
> > +			       enum migratetype mt,
> > +			       int (*report_pfn_range)(void *,
> > +						       unsigned long,
> > +						       unsigned long))
> > +{
> > +	struct page *page;
> > +	struct list_head *list;
> > +	unsigned long pfn, flags;
> > +	int ret = 0;
> > +
> > +	spin_lock_irqsave(&zone->lock, flags);
> > +	list = &zone->free_area[order].free_list[mt];
> > +	list_for_each_entry(page, list, lru) {
> > +		pfn = page_to_pfn(page);
> > +		ret = report_pfn_range(opaque, pfn, 1 << order);
> > +		if (ret)
> > +			break;
> > +	}
> > +	spin_unlock_irqrestore(&zone->lock, flags);
> > +
> > +	return ret;
> > +}
> > +
> > +/**
> > + * walk_free_mem_block - Walk through the free page blocks in the system
> > + * @opaque: the context passed from the caller
> > + * @min_order: the minimum order of free lists to check
> > + * @report_pfn_range: the callback to report the pfn range of the free pages
> > + *
> > + * If the callback returns a non-zero value, stop iterating the list of free
> > + * page blocks. Otherwise, continue to report.
> > + *
> > + * Please note that there are no locking guarantees for the callback and
> > + * that the reported pfn range might be freed or disappear after the
> > + * callback returns so the caller has to be very careful how it is used.
> > + *
> > + * The callback itself must not sleep or perform any operations which would
> > + * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
> > + * or via any lock dependency. It is generally advisable to implement
> > + * the callback as simple as possible and defer any heavy lifting to a
> > + * different context.
> > + *
> > + * There is no guarantee that each free range will be reported only once
> > + * during one walk_free_mem_block invocation.
> > + *
> > + * pfn_to_page on the given range is strongly discouraged and if there is
> > + * an absolute need for that make sure to contact MM people to discuss
> > + * potential problems.
> > + *
> > + * The function itself might sleep so it cannot be called from atomic
> > + * contexts.
> 
> I don't see how walk_free_mem_block() can sleep.
> 
> > + * In general low orders tend to be very volatile and so it makes more
> > + * sense to query larger ones first for various optimizations which like
> > + * ballooning etc... This will reduce the overhead as well.
> > + *
> > + * Return 0 if it completes the reporting. Otherwise, return the non-zero
> > + * value returned from the callback.
> > + */
> > +int walk_free_mem_block(void *opaque,
> > +			int min_order,
> > +			int (*report_pfn_range)(void *opaque,
> > +			unsigned long pfn,
> > +			unsigned long num))
> > +{
> > +	struct zone *zone;
> > +	int order;
> > +	enum migratetype mt;
> > +	int ret;
> > +
> > +	for_each_populated_zone(zone) {
> > +		for (order = MAX_ORDER - 1; order >= min_order; order--) {
> > +			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> > +				ret = walk_free_page_list(opaque, zone,
> > +							  order, mt,
> > +							  report_pfn_range);
> > +				if (ret)
> > +					return ret;
> > +			}
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(walk_free_mem_block);
> 
> This looks like it could take a long time.  Will we end up needing to
> add cond_resched() in there somewhere?

Andrew, were your questions answered? If yes could I bother you for an ack on this?


> >  static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
> >  {
> >  	zoneref->zone = zone;
> > -- 
> > 2.7.4
Andrew Morton April 10, 2018, 8:54 p.m. UTC | #7
On Tue, 10 Apr 2018 21:19:31 +0300 "Michael S. Tsirkin" <mst@redhat.com> wrote:

> 
> Andrew, were your questions answered? If yes could I bother you for an ack on this?
> 

Still not very happy that readers are told that "this function may
sleep" when it clearly doesn't do so.  If we wish to be able to change
it to sleep in the future then that should be mentioned.  And even put a
might_sleep() in there, to catch people who didn't read the comments...

Otherwise it looks OK.
Michael S. Tsirkin April 10, 2018, 11:25 p.m. UTC | #8
On Tue, Apr 10, 2018 at 01:54:29PM -0700, Andrew Morton wrote:
> On Tue, 10 Apr 2018 21:19:31 +0300 "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > 
> > Andrew, were your questions answered? If yes could I bother you for an ack on this?
> > 
> 
> Still not very happy that readers are told that "this function may
> sleep" when it clearly doesn't do so.  If we wish to be able to change
> it to sleep in the future then that should be mentioned.  And even put a
> might_sleep() in there, to catch people who didn't read the comments...
> 
> Otherwise it looks OK.

Oh, might_sleep with a comment explaining it's for the future sounds
good to me. I queued this - Wei, could you post a patch on top pls?
Wang, Wei W April 11, 2018, 1:22 a.m. UTC | #9
On 04/11/2018 07:25 AM, Michael S. Tsirkin wrote:
> On Tue, Apr 10, 2018 at 01:54:29PM -0700, Andrew Morton wrote:
>> On Tue, 10 Apr 2018 21:19:31 +0300 "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>
>>> Andrew, were your questions answered? If yes could I bother you for an ack on this?
>>>
>> Still not very happy that readers are told that "this function may
>> sleep" when it clearly doesn't do so.  If we wish to be able to change
>> it to sleep in the future then that should be mentioned.  And even put a
>> might_sleep() in there, to catch people who didn't read the comments...
>>
>> Otherwise it looks OK.
> Oh, might_sleep with a comment explaining it's for the future sounds
> good to me. I queued this - Wei, could you post a patch on top pls?
>

I'm just wondering whether it is necessary to add another might_sleep(), 
because we already have a cond_resched() there, which wraps a __might_sleep().

Best,
Wei
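
For context, the might_sleep() that Andrew and Michael ask for would be a
small patch on top of this one, along these lines (a sketch only; the actual
follow-up patch is not shown in this thread). As Wei notes, cond_resched()
already wraps a __might_sleep() check, so an explicit might_sleep() at the
function entry would mainly document the contract and catch atomic-context
callers on paths that never reach a cond_resched():

int walk_free_mem_block(void *opaque,
			int min_order,
			int (*report_pfn_range)(void *opaque,
			unsigned long pfn,
			unsigned long num))
{
	struct zone *zone;
	int order;
	enum migratetype mt;
	int ret;

	/*
	 * Document the "may sleep" contract up front and catch
	 * atomic-context callers early, even before the walk actually
	 * gains a cond_resched().
	 */
	might_sleep();

	/* ... the zone/order/migratetype loops remain as in the patch ... */
}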

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42..c72b5a9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1944,6 +1944,12 @@  extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+extern int walk_free_mem_block(void *opaque,
+			       int min_order,
+			       int (*report_pfn_range)(void *opaque,
+						       unsigned long pfn,
+						       unsigned long num));
+
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 635d7dd..d58de87 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4912,6 +4912,102 @@  void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
+/*
+ * Walk through a free page list and report the found pfn range via the
+ * callback.
+ *
+ * Return 0 if it completes the reporting. Otherwise, return the non-zero
+ * value returned from the callback.
+ */
+static int walk_free_page_list(void *opaque,
+			       struct zone *zone,
+			       int order,
+			       enum migratetype mt,
+			       int (*report_pfn_range)(void *,
+						       unsigned long,
+						       unsigned long))
+{
+	struct page *page;
+	struct list_head *list;
+	unsigned long pfn, flags;
+	int ret = 0;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	list = &zone->free_area[order].free_list[mt];
+	list_for_each_entry(page, list, lru) {
+		pfn = page_to_pfn(page);
+		ret = report_pfn_range(opaque, pfn, 1 << order);
+		if (ret)
+			break;
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	return ret;
+}
+
+/**
+ * walk_free_mem_block - Walk through the free page blocks in the system
+ * @opaque: the context passed from the caller
+ * @min_order: the minimum order of free lists to check
+ * @report_pfn_range: the callback to report the pfn range of the free pages
+ *
+ * If the callback returns a non-zero value, stop iterating the list of free
+ * page blocks. Otherwise, continue to report.
+ *
+ * Please note that there are no locking guarantees for the callback and
+ * that the reported pfn range might be freed or disappear after the
+ * callback returns so the caller has to be very careful how it is used.
+ *
+ * The callback itself must not sleep or perform any operations which would
+ * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
+ * or via any lock dependency. It is generally advisable to implement
+ * the callback as simple as possible and defer any heavy lifting to a
+ * different context.
+ *
+ * There is no guarantee that each free range will be reported only once
+ * during one walk_free_mem_block invocation.
+ *
+ * pfn_to_page on the given range is strongly discouraged and if there is
+ * an absolute need for that make sure to contact MM people to discuss
+ * potential problems.
+ *
+ * The function itself might sleep so it cannot be called from atomic
+ * contexts.
+ *
+ * In general low orders tend to be very volatile and so it makes more
+ * sense to query larger ones first for various optimizations which like
+ * ballooning etc... This will reduce the overhead as well.
+ *
+ * Return 0 if it completes the reporting. Otherwise, return the non-zero
+ * value returned from the callback.
+ */
+int walk_free_mem_block(void *opaque,
+			int min_order,
+			int (*report_pfn_range)(void *opaque,
+			unsigned long pfn,
+			unsigned long num))
+{
+	struct zone *zone;
+	int order;
+	enum migratetype mt;
+	int ret;
+
+	for_each_populated_zone(zone) {
+		for (order = MAX_ORDER - 1; order >= min_order; order--) {
+			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+				ret = walk_free_page_list(opaque, zone,
+							  order, mt,
+							  report_pfn_range);
+				if (ret)
+					return ret;
+			}
+		}
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(walk_free_mem_block);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;