[v15,4/7] mm: Introduce Reported pages

Message ID 20191205162238.19548.68238.stgit@localhost.localdomain (mailing list archive)
State New, archived
Series mm / virtio: Provide support for free page reporting

Commit Message

Alexander H Duyck Dec. 5, 2019, 4:22 p.m. UTC
From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

In order to pave the way for free page reporting in virtualized
environments we will need a way to get pages out of the free lists and
identify those pages after they have been returned. To accomplish this,
this patch adds the concept of a Reported Buddy, which is essentially
meant to just be the Uptodate flag used in conjunction with the Buddy
page type.

To prevent the reported pages from leaking outside of the buddy lists I
added a check to clear the PageReported bit in the del_page_from_free_list
function. As a result any reported page that is split, merged, or
allocated will have the flag cleared prior to the PageBuddy value being
cleared.

The process for reporting pages is fairly simple. Once we free a page that
meets the minimum order for page reporting we will schedule a worker thread
to start 2s or more in the future. That worker thread will begin working
from the lowest supported page reporting order up to MAX_ORDER - 1 pulling
unreported pages from the free list and storing them in the scatterlist.

When processing each individual free list it is necessary for the worker
thread to release the zone lock when it needs to stop and report the full
scatterlist of pages. To reduce the work of the next iteration the worker
thread will rotate the free list so that the first unreported page in the
free list becomes the first entry in the list.
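
As a rough userspace illustration of the rotation described above (not the kernel code itself), the sketch below uses a hand-rolled circular list. The names struct node, first_unreported() and rotate_to_front() are hypothetical stand-ins for struct page, the PageReported() test and list_rotate_to_front().

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, loosely modeled on list_head. */
struct node {
	struct node *prev, *next;
	bool reported;	/* stand-in for PageReported() */
	int id;
};

static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

static void list_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Return the first entry that has not been reported, or NULL if none. */
static struct node *first_unreported(struct node *head)
{
	struct node *p;

	for (p = head->next; p != head; p = p->next)
		if (!p->reported)
			return p;
	return NULL;
}

/*
 * Make 'first' the first entry of the list by moving the list head so
 * that it sits immediately before 'first'. The entries that used to
 * precede 'first' end up at the tail, in their original order.
 */
static void rotate_to_front(struct node *first, struct node *head)
{
	/* unlink the head from the ring */
	head->prev->next = head->next;
	head->next->prev = head->prev;

	/* reinsert the head just before 'first' */
	head->prev = first->prev;
	head->next = first;
	first->prev->next = head;
	first->prev = head;
}
```

With entries 0..4 on the list and 0 and 1 already reported, rotating to the first unreported entry leaves the order 2, 3, 4, 0, 1, so the next pass starts on unreported pages without walking past the reported ones.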

It will then call a reporting function providing information on how many
entries are in the scatterlist. Once the function completes it will return
the pages to the free area from which they were allocated and start over
pulling more pages from the free areas until there are no longer enough
pages to report on to keep the worker busy, or we have processed as many
pages as were contained in the free area when we started processing the
list.
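
The batching described above can be sketched in plain C. This is only an illustration under assumptions: struct batch and its helpers are hypothetical, page frame numbers stand in for scatterlist entries, and a counter stands in for the report callback. It mirrors the convention in the patch where the offset starts at the capacity and counts down, so the number of leftover entries is the capacity minus the offset.

```c
#include <stddef.h>

#define CAPACITY 32	/* mirrors PAGE_REPORTING_CAPACITY */

struct batch {
	unsigned long pfn[CAPACITY];	/* stand-in for scatterlist entries */
	unsigned int offset;		/* slots still free; filled from the end */
	unsigned int reports;		/* number of full batches handed off */
};

static void batch_init(struct batch *b)
{
	b->offset = CAPACITY;
	b->reports = 0;
}

/*
 * Add one entry. A full batch is only handed off when another entry
 * shows up, which is when the worker would drop the lock and report.
 */
static void batch_add(struct batch *b, unsigned long pfn)
{
	if (!b->offset) {
		b->reports++;		/* "report" the full batch */
		b->offset = CAPACITY;	/* and start filling again */
	}
	b->pfn[--b->offset] = pfn;
}

/* Entries still waiting to be reported; they start at &pfn[offset]. */
static unsigned int batch_leftover(const struct batch *b)
{
	return CAPACITY - b->offset;
}
```

Feeding 70 entries through batch_add() hands off two full batches and leaves 6 entries to be flushed before going idle.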

The worker thread will work in a round-robin fashion making its way
through each zone requesting reporting, and through each reportable free
list within that zone. Once all free areas within the zone have been
processed it will check to see if there have been any requests for
reporting while it was processing. If so it will reschedule the worker
thread to start up again in roughly 2s and exit.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/page-flags.h     |   11 +
 include/linux/page_reporting.h |   25 +++
 mm/Kconfig                     |   11 +
 mm/Makefile                    |    1 
 mm/page_alloc.c                |   17 ++
 mm/page_reporting.c            |  336 ++++++++++++++++++++++++++++++++++++++++
 mm/page_reporting.h            |   54 ++++++
 7 files changed, 451 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/page_reporting.h
 create mode 100644 mm/page_reporting.c
 create mode 100644 mm/page_reporting.h

Comments

Nitesh Narayan Lal Dec. 16, 2019, 10:17 a.m. UTC | #1
On 12/5/19 11:22 AM, Alexander Duyck wrote:
> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>
> In order to pave the way for free page reporting in virtualized
> environments we will need a way to get pages out of the free lists and
> identify those pages after they have been returned. To accomplish this,
> this patch adds the concept of a Reported Buddy, which is essentially
> meant to just be the Uptodate flag used in conjunction with the Buddy
> page type.

[...]

> +enum {
> +	PAGE_REPORTING_IDLE = 0,
> +	PAGE_REPORTING_REQUESTED,
> +	PAGE_REPORTING_ACTIVE
> +};
> +
> +/* request page reporting */
> +static void
> +__page_reporting_request(struct page_reporting_dev_info *prdev)
> +{
> +	unsigned int state;
> +
> +	/* Check to see if we are in desired state */
> +	state = atomic_read(&prdev->state);
> +	if (state == PAGE_REPORTING_REQUESTED)
> +		return;
> +
> +	/*
> +	 *  If reporting is already active there is nothing we need to do.
> +	 *  Test against 0 as that represents PAGE_REPORTING_IDLE.
> +	 */
> +	state = atomic_xchg(&prdev->state, PAGE_REPORTING_REQUESTED);
> +	if (state != PAGE_REPORTING_IDLE)
> +		return;
> +
> +	/*
> +	 * Delay the start of work to allow a sizable queue to build. For
> +	 * now we are limiting this to running no more than once every
> +	 * couple of seconds.
> +	 */
> +	schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
> +}
> +

I think you recently switched to using an atomic variable for maintaining page
reporting status as I was doing in v12.
Which is good, as we will not have a disagreement on it now.

On a side note, apologies for not getting involved actively in the
discussions/reviews as I am on PTO.

Nitesh Narayan Lal Dec. 16, 2019, 11:44 a.m. UTC | #2
On 12/5/19 11:22 AM, Alexander Duyck wrote:
> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>
> In order to pave the way for free page reporting in virtualized
> environments we will need a way to get pages out of the free lists and
> identify those pages after they have been returned. To accomplish this,
> this patch adds the concept of a Reported Buddy, which is essentially
> meant to just be the Uptodate flag used in conjunction with the Buddy
> page type.
>
> To prevent the reported pages from leaking outside of the buddy lists I
> added a check to clear the PageReported bit in the del_page_from_free_list
> function. As a result any reported page that is split, merged, or
> allocated will have the flag cleared prior to the PageBuddy value being
> cleared.
>
> The process for reporting pages is fairly simple. Once we free a page that
> meets the minimum order for page reporting we will schedule a worker thread
> to start 2s or more in the future. That worker thread will begin working
> from the lowest supported page reporting order up to MAX_ORDER - 1 pulling
> unreported pages from the free list and storing them in the scatterlist.
>
> When processing each individual free list it is necessary for the worker
> thread to release the zone lock when it needs to stop and report the full
> scatterlist of pages. To reduce the work of the next iteration the worker
> thread will rotate the free list so that the first unreported page in the
> free list becomes the first entry in the list.

[...]

> k);
> +
> +	return err;
> +}
> +
> +static int
> +page_reporting_process_zone(struct page_reporting_dev_info *prdev,
> +			    struct scatterlist *sgl, struct zone *zone)
> +{
> +	unsigned int order, mt, leftover, offset = PAGE_REPORTING_CAPACITY;
> +	unsigned long watermark;
> +	int err = 0;
> +
> +	/* Generate minimum watermark to be able to guarantee progress */
> +	watermark = low_wmark_pages(zone) +
> +		    (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER);
> +
> +	/*
> +	 * Cancel request if insufficient free memory or if we failed
> +	 * to allocate page reporting statistics for the zone.
> +	 */
> +	if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
> +		return err;
> +


Would it not make more sense to check the low watermark condition before every
reporting request generated for a batch of 32 isolated pages?
Or would that be too costly?

> +	/* Process each free list starting from lowest order/mt */
> +	for (order = PAGE_REPORTING_MIN_ORDER; order < MAX_ORDER; order++) {
> +		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> +			/* We do not pull pages from the isolate free list */
> +			if (is_migrate_isolate(mt))
> +				continue;
> +
> +			err = page_reporting_cycle(prdev, zone, order, mt,
> +						   sgl, &offset);
> +			if (err)
> +				return err;
> +		}
> +	}
> +
> +	/* report the leftover pages before going idle */
> +	leftover = PAGE_REPORTING_CAPACITY - offset;
> +	if (leftover) {
> +		sgl = &sgl[offset];
> +		err = prdev->report(prdev, sgl, leftover);
> +
> +		/* flush any remaining pages out from the last report */
> +		spin_lock_irq(&zone->lock);
> +		page_reporting_drain(prdev, sgl, leftover, !err);
> +		spin_unlock_irq(&zone->lock);
> +	}
> +
> +	return err;
> +}

Alexander Duyck Dec. 16, 2019, 4:10 p.m. UTC | #3
On Mon, 2019-12-16 at 06:44 -0500, Nitesh Narayan Lal wrote:
> On 12/5/19 11:22 AM, Alexander Duyck wrote:
> > From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > 
> > In order to pave the way for free page reporting in virtualized
> > environments we will need a way to get pages out of the free lists and
> > identify those pages after they have been returned. To accomplish this,
> > this patch adds the concept of a Reported Buddy, which is essentially
> > meant to just be the Uptodate flag used in conjunction with the Buddy
> > page type.
> > 
> > To prevent the reported pages from leaking outside of the buddy lists I
> > added a check to clear the PageReported bit in the del_page_from_free_list
> > function. As a result any reported page that is split, merged, or
> > allocated will have the flag cleared prior to the PageBuddy value being
> > cleared.
> > 
> > The process for reporting pages is fairly simple. Once we free a page that
> > meets the minimum order for page reporting we will schedule a worker thread
> > to start 2s or more in the future. That worker thread will begin working
> > from the lowest supported page reporting order up to MAX_ORDER - 1 pulling
> > unreported pages from the free list and storing them in the scatterlist.
> > 
> > When processing each individual free list it is necessary for the worker
> > thread to release the zone lock when it needs to stop and report the full
> > scatterlist of pages. To reduce the work of the next iteration the worker
> > thread will rotate the free list so that the first unreported page in the
> > free list becomes the first entry in the list.
> 
> [...]
> 
> > k);
> > +
> > +	return err;
> > +}
> > +
> > +static int
> > +page_reporting_process_zone(struct page_reporting_dev_info *prdev,
> > +			    struct scatterlist *sgl, struct zone *zone)
> > +{
> > +	unsigned int order, mt, leftover, offset = PAGE_REPORTING_CAPACITY;
> > +	unsigned long watermark;
> > +	int err = 0;
> > +
> > +	/* Generate minimum watermark to be able to guarantee progress */
> > +	watermark = low_wmark_pages(zone) +
> > +		    (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER);
> > +
> > +	/*
> > +	 * Cancel request if insufficient free memory or if we failed
> > +	 * to allocate page reporting statistics for the zone.
> > +	 */
> > +	if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
> > +		return err;
> > +
> 
> Will it not make more sense to check the low watermark condition before every
> reporting request generated for a bunch of 32 isolated pages?
> or will that be too costly?

My thought is to wait until we are actually processing the request. That
way we are only performing this check once every 2 seconds instead of
every time we are thinking about requesting page reporting.

Keep in mind I removed the reported_pages tracking statistics so we now
are requesting as soon as we free any page. So if we moved the check to
the request itself it would mean that a low memory condition would result
in us repeatedly checking the low water mark and failing the test.
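
For reference, the threshold computed once per pass can be written out as a small standalone sketch. The helper name reporting_watermark() is hypothetical, and the minimum order is passed in as a parameter because the value of PAGE_REPORTING_MIN_ORDER is not shown in the quoted code.

```c
#define CAPACITY 32	/* mirrors PAGE_REPORTING_CAPACITY */

/*
 * Minimum number of free pages a zone needs before a pass is worth
 * starting: the low watermark plus one full batch of minimum-order
 * pages, so that at least one complete report can be built.
 */
static unsigned long reporting_watermark(unsigned long low_wmark_pages,
					 unsigned int min_order)
{
	return low_wmark_pages + ((unsigned long)CAPACITY << min_order);
}
```

Assuming, purely for illustration, a minimum order of 9 and a low watermark of 1000 pages, the threshold comes to 1000 + (32 << 9) = 17384 pages.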

Alexander Duyck Dec. 16, 2019, 4:28 p.m. UTC | #4
On Mon, 2019-12-16 at 05:17 -0500, Nitesh Narayan Lal wrote:
> On 12/5/19 11:22 AM, Alexander Duyck wrote:
> > From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > 
> > In order to pave the way for free page reporting in virtualized
> > environments we will need a way to get pages out of the free lists and
> > identify those pages after they have been returned. To accomplish this,
> > this patch adds the concept of a Reported Buddy, which is essentially
> > meant to just be the Uptodate flag used in conjunction with the Buddy
> > page type.
> 
> [...]
> 
> > +enum {
> > +	PAGE_REPORTING_IDLE = 0,
> > +	PAGE_REPORTING_REQUESTED,
> > +	PAGE_REPORTING_ACTIVE
> > +};
> > +
> > +/* request page reporting */
> > +static void
> > +__page_reporting_request(struct page_reporting_dev_info *prdev)
> > +{
> > +	unsigned int state;
> > +
> > +	/* Check to see if we are in desired state */
> > +	state = atomic_read(&prdev->state);
> > +	if (state == PAGE_REPORTING_REQUESTED)
> > +		return;
> > +
> > +	/*
> > +	 *  If reporting is already active there is nothing we need to do.
> > +	 *  Test against 0 as that represents PAGE_REPORTING_IDLE.
> > +	 */
> > +	state = atomic_xchg(&prdev->state, PAGE_REPORTING_REQUESTED);
> > +	if (state != PAGE_REPORTING_IDLE)
> > +		return;
> > +
> > +	/*
> > +	 * Delay the start of work to allow a sizable queue to build. For
> > +	 * now we are limiting this to running no more than once every
> > +	 * couple of seconds.
> > +	 */
> > +	schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
> > +}
> > +
> 
> I think you recently switched to using an atomic variable for maintaining page
> reporting status as I was doing in v12.
> Which is good, as we will not have a disagreement on it now.

There are still some differences between our approaches if I am not
mistaken. Specifically I have code in place so that any requests to report
while we are actively working on reporting will trigger another pass being
scheduled after we completed. I still believe you were lacking any logic
like that as I recall.

> On a side note, apologies for not getting involved actively in the
> discussions/reviews as I am on PTO.

No problem. I've been mostly looking for input from the core MM
maintainers anyway as we sorted most of the virtualization pieces some
time ago.

Nitesh Narayan Lal Dec. 17, 2019, 8:55 a.m. UTC | #5
On 12/16/19 11:28 AM, Alexander Duyck wrote:
> On Mon, 2019-12-16 at 05:17 -0500, Nitesh Narayan Lal wrote:
>> On 12/5/19 11:22 AM, Alexander Duyck wrote:
>>> From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
>>>
>>> In order to pave the way for free page reporting in virtualized
>>> environments we will need a way to get pages out of the free lists and
>>> identify those pages after they have been returned. To accomplish this,
>>> this patch adds the concept of a Reported Buddy, which is essentially
>>> meant to just be the Uptodate flag used in conjunction with the Buddy
>>> page type.
>> [...]
>>
>>> +enum {
>>> +	PAGE_REPORTING_IDLE = 0,
>>> +	PAGE_REPORTING_REQUESTED,
>>> +	PAGE_REPORTING_ACTIVE
>>> +};
>>> +
>>> +/* request page reporting */
>>> +static void
>>> +__page_reporting_request(struct page_reporting_dev_info *prdev)
>>> +{
>>> +	unsigned int state;
>>> +
>>> +	/* Check to see if we are in desired state */
>>> +	state = atomic_read(&prdev->state);
>>> +	if (state == PAGE_REPORTING_REQUESTED)
>>> +		return;
>>> +
>>> +	/*
>>> +	 *  If reporting is already active there is nothing we need to do.
>>> +	 *  Test against 0 as that represents PAGE_REPORTING_IDLE.
>>> +	 */
>>> +	state = atomic_xchg(&prdev->state, PAGE_REPORTING_REQUESTED);
>>> +	if (state != PAGE_REPORTING_IDLE)
>>> +		return;
>>> +
>>> +	/*
>>> +	 * Delay the start of work to allow a sizable queue to build. For
>>> +	 * now we are limiting this to running no more than once every
>>> +	 * couple of seconds.
>>> +	 */
>>> +	schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
>>> +}
>>> +
>> I think you recently switched to using an atomic variable for maintaining page
>> reporting status as I was doing in v12.
>> Which is good, as we will not have a disagreement on it now.
> There is still some differences between our approaches if I am not
> mistaken. Specifically I have code in place so that any requests to report
> while we are actively working on reporting will trigger another pass being
> scheduled after we completed. I still believe you were lacking any logic
> like that as I recall.
>

Yes, I was specifically referring to the atomic state variable.
Though I am wondering if having an atomic variable to track page reporting state
is better than having a page reporting specific unsigned long flag, which we can
manipulate via __set_bit() and __clear_bit().

>> On a side note, apologies for not getting involved actively in the
>> discussions/reviews as I am on PTO.
> No problem. I've been mostly looking for input from the core MM
> maintainers anyway as we sorted most of the virtualization pieces some
> time ago.
>

Alexander Duyck Dec. 17, 2019, 4:31 p.m. UTC | #6
On Tue, 2019-12-17 at 03:55 -0500, Nitesh Narayan Lal wrote:
> On 12/16/19 11:28 AM, Alexander Duyck wrote:
> > On Mon, 2019-12-16 at 05:17 -0500, Nitesh Narayan Lal wrote:
> > > On 12/5/19 11:22 AM, Alexander Duyck wrote:
> > > > From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > > > 
> > > > In order to pave the way for free page reporting in virtualized
> > > > environments we will need a way to get pages out of the free lists and
> > > > identify those pages after they have been returned. To accomplish this,
> > > > this patch adds the concept of a Reported Buddy, which is essentially
> > > > meant to just be the Uptodate flag used in conjunction with the Buddy
> > > > page type.
> > > [...]
> > > 
> > > > +enum {
> > > > +	PAGE_REPORTING_IDLE = 0,
> > > > +	PAGE_REPORTING_REQUESTED,
> > > > +	PAGE_REPORTING_ACTIVE
> > > > +};
> > > > +
> > > > +/* request page reporting */
> > > > +static void
> > > > +__page_reporting_request(struct page_reporting_dev_info *prdev)
> > > > +{
> > > > +	unsigned int state;
> > > > +
> > > > +	/* Check to see if we are in desired state */
> > > > +	state = atomic_read(&prdev->state);
> > > > +	if (state == PAGE_REPORTING_REQUESTED)
> > > > +		return;
> > > > +
> > > > +	/*
> > > > +	 *  If reporting is already active there is nothing we need to do.
> > > > +	 *  Test against 0 as that represents PAGE_REPORTING_IDLE.
> > > > +	 */
> > > > +	state = atomic_xchg(&prdev->state, PAGE_REPORTING_REQUESTED);
> > > > +	if (state != PAGE_REPORTING_IDLE)
> > > > +		return;
> > > > +
> > > > +	/*
> > > > +	 * Delay the start of work to allow a sizable queue to build. For
> > > > +	 * now we are limiting this to running no more than once every
> > > > +	 * couple of seconds.
> > > > +	 */
> > > > +	schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
> > > > +}
> > > > +
> > > I think you recently switched to using an atomic variable for maintaining page
> > > reporting status as I was doing in v12.
> > > Which is good, as we will not have a disagreement on it now.
> > There is still some differences between our approaches if I am not
> > mistaken. Specifically I have code in place so that any requests to report
> > while we are actively working on reporting will trigger another pass being
> > scheduled after we completed. I still believe you were lacking any logic
> > like that as I recall.
> > 
> 
> Yes, I was specifically referring to the atomic state variable.
> Though I am wondering if having an atomic variable to track page reporting state
> is better than having a page reporting specific unsigned long flag, which we can
> manipulate via __set_bit() and __clear_bit().

So the reason for using an atomic state variable is that I only really
have 3 possible states: idle, active, and requested. It allows for a
pretty simple state machine: any transition from idle indicates that we
need to schedule the worker, we transition from requested to active when
the worker starts, and if at the end of a pass we are still in the active
state it means we can transition back to idle; otherwise we reschedule the
worker.

Doing the same sort of thing using the bitops would require at least 2
bits. In addition, since I cannot use the zone lock to protect the state,
I cannot use the non-atomic versions of things such as __set_bit and
__clear_bit; they would require additional locking protections.
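
The three-state machine described above can be sketched in userspace with C11 atomics. This is only an approximation under assumptions: pr_request(), pr_worker_begin() and pr_worker_end() are hypothetical names, and atomic_int with atomic_exchange() and atomic_compare_exchange_strong() stands in for the kernel's atomic_t, atomic_xchg() and a compare-and-exchange at the end of the pass.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { PR_IDLE = 0, PR_REQUESTED, PR_ACTIVE };

/*
 * Request a reporting pass. Returns true only for the caller that
 * moved the state out of idle, which is the one that must schedule
 * the worker.
 */
static bool pr_request(atomic_int *state)
{
	/* cheap read first to avoid a needless exchange */
	if (atomic_load(state) == PR_REQUESTED)
		return false;

	return atomic_exchange(state, PR_REQUESTED) == PR_IDLE;
}

/* Worker start: requested -> active. */
static void pr_worker_begin(atomic_int *state)
{
	atomic_store(state, PR_ACTIVE);
}

/*
 * Worker end: go back to idle only if nobody requested another pass
 * while we were active. Returns true if the worker must be rescheduled.
 */
static bool pr_worker_end(atomic_int *state)
{
	int expected = PR_ACTIVE;

	return !atomic_compare_exchange_strong(state, &expected, PR_IDLE);
}
```

A request that arrives while the worker is active flips the state to requested without scheduling anything; the worker notices at the end of the pass because the compare-and-exchange from active to idle fails, and it reschedules itself.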

Mel Gorman Dec. 18, 2019, 7:31 a.m. UTC | #7
On Tue, Dec 17, 2019 at 08:31:59AM -0800, Alexander Duyck wrote:
> > > > I think you recently switched to using an atomic variable for maintaining page
> > > > reporting status as I was doing in v12.
> > > > Which is good, as we will not have a disagreement on it now.
> > >
> > > There is still some differences between our approaches if I am not
> > > mistaken. Specifically I have code in place so that any requests to report
> > > while we are actively working on reporting will trigger another pass being
> > > scheduled after we completed. I still believe you were lacking any logic
> > > like that as I recall.
> > > 
> > 
> > Yes, I was specifically referring to the atomic state variable.
> > Though I am wondering if having an atomic variable to track page reporting state
> > is better than having a page reporting specific unsigned long flag, which we can
> > manipulate via __set_bit() and __clear_bit().
> 
> So the reason for using an atomic state variable is because I only really
> have 3 possible states; idle, active, and requested. It allows for a
> pretty simple state machine as any transition from idle indicates that we
> need to schedule the worker, transition from requested to active when the
> worker starts, and if at the end of a pass if we are still in the active
> state it means we can transition back to idle, otherwise we reschedule the
> worker.
> 
> In order to do the same sort of thing using the bitops would require at
> least 2 bits. In addition with the requirement that I cannot use the zone
> lock for protection of the state I cannot use the non-atomic versions of
> things such as __set_bit and __clear_bit so they would require additional
> locking protections.
> 

I completely agree with this. I had pointed out in an earlier review
that expanding the scope of the zone lock was inappropriate, the
non-atomic operations on separate flags potentially miss updates, and
in general I prefer the atomic variable approach a lot more than the
previous zone->flag based approach.

Patch

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1bf83c8fcaa7..49c2697046b9 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -163,6 +163,9 @@  enum pageflags {
 
 	/* non-lru isolated movable page */
 	PG_isolated = PG_reclaim,
+
+	/* Only valid for buddy pages. Used to track pages that are reported */
+	PG_reported = PG_uptodate,
 };
 
 #ifndef __GENERATING_BOUNDS_H
@@ -432,6 +435,14 @@  static inline bool set_hwpoison_free_buddy_page(struct page *page)
 #endif
 
 /*
+ * PageReported() is used to track reported free pages within the Buddy
+ * allocator. We can use the non-atomic version of the test and set
+ * operations as both should be shielded with the zone lock to prevent
+ * any possible races on the setting or clearing of the bit.
+ */
+__PAGEFLAG(Reported, reported, PF_NO_COMPOUND)
+
+/*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
  * with the PAGE_MAPPING_ANON bit set to distinguish it.  See rmap.h.
diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
new file mode 100644
index 000000000000..32355486f572
--- /dev/null
+++ b/include/linux/page_reporting.h
@@ -0,0 +1,25 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PAGE_REPORTING_H
+#define _LINUX_PAGE_REPORTING_H
+
+#include <linux/mmzone.h>
+#include <linux/scatterlist.h>
+
+#define PAGE_REPORTING_CAPACITY		32
+
+struct page_reporting_dev_info {
+	/* function that alters pages to make them "reported" */
+	int (*report)(struct page_reporting_dev_info *prdev,
+		      struct scatterlist *sg, unsigned int nents);
+
+	/* work struct for processing reports */
+	struct delayed_work work;
+
+	/* Current state of page reporting */
+	atomic_t state;
+};
+
+/* Tear-down and bring-up for page reporting devices */
+void page_reporting_unregister(struct page_reporting_dev_info *prdev);
+int page_reporting_register(struct page_reporting_dev_info *prdev);
+#endif /*_LINUX_PAGE_REPORTING_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index ab80933be65f..d40a873402ff 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -237,6 +237,17 @@  config COMPACTION
 	  linux-mm@kvack.org.
 
 #
+# support for free page reporting
+config PAGE_REPORTING
+	bool "Free page reporting"
+	def_bool n
+	help
+	  Free page reporting allows for the incremental acquisition of
+	  free pages from the buddy allocator for the purpose of reporting
+	  those pages to another entity, such as a hypervisor, so that the
+	  memory can be freed within the host for other uses.
+
+#
 # support for page migration
 #
 config MIGRATION
diff --git a/mm/Makefile b/mm/Makefile
index 059115e6efb4..e55649063735 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -117,3 +117,4 @@  obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
+obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 500b242c6f7f..290148398c26 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -74,6 +74,7 @@ 
 #include <asm/div64.h>
 #include "internal.h"
 #include "shuffle.h"
+#include "page_reporting.h"
 
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
@@ -909,6 +910,10 @@  static inline void move_to_free_list(struct page *page, struct zone *zone,
 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
 					   unsigned int order)
 {
+	/* clear reported state and update reported page count */
+	if (page_reported(page))
+		__ClearPageReported(page);
+
 	list_del(&page->lru);
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
@@ -972,7 +977,7 @@  static inline void del_page_from_free_list(struct page *page, struct zone *zone,
 static inline void __free_one_page(struct page *page,
 		unsigned long pfn,
 		struct zone *zone, unsigned int order,
-		int migratetype)
+		int migratetype, bool report)
 {
 	struct capture_control *capc = task_capc(zone);
 	unsigned long uninitialized_var(buddy_pfn);
@@ -1057,6 +1062,10 @@  static inline void __free_one_page(struct page *page,
 		add_to_free_list_tail(page, zone, order, migratetype);
 	else
 		add_to_free_list(page, zone, order, migratetype);
+
+	/* Notify page reporting subsystem of freed page */
+	if (report)
+		page_reporting_notify_free(order);
 }
 
 /*
@@ -1373,7 +1382,7 @@  static void free_pcppages_bulk(struct zone *zone, int count,
 		if (unlikely(isolated_pageblocks))
 			mt = get_pageblock_migratetype(page);
 
-		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
+		__free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
 		trace_mm_page_pcpu_drain(page, 0, mt);
 	}
 	spin_unlock(&zone->lock);
@@ -1389,7 +1398,7 @@  static void free_one_page(struct zone *zone,
 		is_migrate_isolate(migratetype))) {
 		migratetype = get_pfnblock_migratetype(page, pfn);
 	}
-	__free_one_page(page, pfn, zone, order, migratetype);
+	__free_one_page(page, pfn, zone, order, migratetype, true);
 	spin_unlock(&zone->lock);
 }
 
@@ -3249,7 +3258,7 @@  void __putback_isolated_page(struct page *page, unsigned int order)
 	mt = get_pfnblock_migratetype(page, pfn);
 
 	/* Return isolated page to tail of freelist. */
-	__free_one_page(page, pfn, zone, order, mt);
+	__free_one_page(page, pfn, zone, order, mt, false);
 }
 
 /*
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
new file mode 100644
index 000000000000..90154e71d6f9
--- /dev/null
+++ b/mm/page_reporting.c
@@ -0,0 +1,336 @@ 
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/page_reporting.h>
+#include <linux/gfp.h>
+#include <linux/export.h>
+#include <linux/delay.h>
+#include <linux/scatterlist.h>
+
+#include "page_reporting.h"
+#include "internal.h"
+
+#define PAGE_REPORTING_DELAY	(2 * HZ)
+static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
+
+enum {
+	PAGE_REPORTING_IDLE = 0,
+	PAGE_REPORTING_REQUESTED,
+	PAGE_REPORTING_ACTIVE
+};
+
+/* request page reporting */
+static void
+__page_reporting_request(struct page_reporting_dev_info *prdev)
+{
+	unsigned int state;
+
+	/* Check to see if we are in desired state */
+	state = atomic_read(&prdev->state);
+	if (state == PAGE_REPORTING_REQUESTED)
+		return;
+
+	/*
+	 *  If reporting is already active there is nothing we need to do.
+	 *  Test against 0 as that represents PAGE_REPORTING_IDLE.
+	 */
+	state = atomic_xchg(&prdev->state, PAGE_REPORTING_REQUESTED);
+	if (state != PAGE_REPORTING_IDLE)
+		return;
+
+	/*
+	 * Delay the start of work to allow a sizable queue to build. For
+	 * now we are limiting this to running no more than once every
+	 * couple of seconds.
+	 */
+	schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
+}
+
+/* notify prdev of free page reporting request */
+void __page_reporting_notify(void)
+{
+	struct page_reporting_dev_info *prdev;
+
+	/*
+	 * We use RCU to protect the pr_dev_info pointer. In almost all
+	 * cases this should be present, however in the unlikely case of
+	 * a shutdown this will be NULL and we should exit.
+	 */
+	rcu_read_lock();
+	prdev = rcu_dereference(pr_dev_info);
+	if (likely(prdev))
+		__page_reporting_request(prdev);
+
+	rcu_read_unlock();
+}
+
+static void
+page_reporting_drain(struct page_reporting_dev_info *prdev,
+		     struct scatterlist *sgl, unsigned int nents, bool reported)
+{
+	struct scatterlist *sg = sgl;
+
+	/*
+	 * Drain the now reported pages back into their respective
+	 * free lists/areas. We assume at least one page is populated.
+	 */
+	do {
+		unsigned int order = get_order(sg->length);
+		struct page *page = sg_page(sg);
+
+		__putback_isolated_page(page, order);
+
+		/* If the pages were not reported due to error skip flagging */
+		if (!reported)
+			continue;
+
+		/*
+		 * If page was not comingled with another page we can
+		 * consider the result to be "reported" since the page
+		 * hasn't been modified, otherwise we will need to
+		 * report on the new larger page when we make our way
+		 * up to that higher order.
+		 */
+		if (PageBuddy(page) && page_order(page) == order)
+			__SetPageReported(page);
+	} while ((sg = sg_next(sg)));
+
+	/* reinitialize scatterlist now that it is empty */
+	sg_init_table(sgl, nents);
+}
+
+/*
+ * The page reporting cycle consists of 4 stages, fill, report, drain, and
+ * idle. We will cycle through the first 3 stages until we cannot obtain a
+ * full scatterlist of pages, in that case we will switch to idle.
+ */
+static int
+page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
+		     unsigned int order, unsigned int mt,
+		     struct scatterlist *sgl, unsigned int *offset)
+{
+	struct free_area *area = &zone->free_area[order];
+	struct list_head *list = &area->free_list[mt];
+	unsigned int page_len = PAGE_SIZE << order;
+	struct page *page, *next;
+	unsigned long budget;
+	int err = 0;
+
+	/*
+	 * Perform early check, if free area is empty there is
+	 * nothing to process so we can skip this free_list.
+	 */
+	if (list_empty(list))
+		return err;
+
+	spin_lock_irq(&zone->lock);
+
+	/*
+	 * Only process as many pages as are present at the start of this
+	 * call. If additional pages are freed we can process them in a
+	 * couple seconds when we start the next pass.
+	 *
+	 * The assumption with this is that most of the pages are of one
+	 * migratetype so we should really only need this for one of the
+	 * lists in a given free area.
+	 */
+	budget = area->nr_free;
+
+	/* loop through free list adding unreported pages to sg list */
+	list_for_each_entry_safe(page, next, list, lru) {
+		/* If we consumed our budget we should go idle for now */
+		if (!budget--)
+			break;
+
+		/* We are going to skip over the reported pages. */
+		if (PageReported(page))
+			continue;
+
+		/* Attempt to add page to sg list */
+		if (*offset) {
+			if (!__isolate_free_page(page, order))
+				break;
+
+			sg_set_page(&sgl[--(*offset)], page, page_len, 0);
+			continue;
+		}
+
+		/*
+		 * Make the first non-reported entry in the free list
+		 * the new head of the free list before we exit.
+		 */
+		if (!list_is_first(&page->lru, list))
+			list_rotate_to_front(&page->lru, list);
+
+		/* release lock before waiting on report processing */
+		spin_unlock_irq(&zone->lock);
+
+		/* reset entries since the full list will be reported */
+		*offset = PAGE_REPORTING_CAPACITY;
+
+		/* begin processing pages in local list */
+		err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
+
+		/* reacquire zone lock and resume processing */
+		spin_lock_irq(&zone->lock);
+
+		/* flush reported pages from the sg list */
+		page_reporting_drain(prdev, sgl, PAGE_REPORTING_CAPACITY, !err);
+
+		/* exit on error */
+		if (err)
+			break;
+
+		/*
+		 * Reset next to first entry; the old next isn't valid
+		 * since we dropped the lock to report the pages.
+		 */
+		next = list_first_entry(list, struct page, lru);
+	}
+
+	spin_unlock_irq(&zone->lock);
+
+	return err;
+}
+
+static int
+page_reporting_process_zone(struct page_reporting_dev_info *prdev,
+			    struct scatterlist *sgl, struct zone *zone)
+{
+	unsigned int order, mt, leftover, offset = PAGE_REPORTING_CAPACITY;
+	unsigned long watermark;
+	int err = 0;
+
+	/* Generate minimum watermark to be able to guarantee progress */
+	watermark = low_wmark_pages(zone) +
+		    (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER);
+
+	/*
+	 * Cancel the request if the zone does not have enough free
+	 * memory to stay above the watermark computed above.
+	 */
+	if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+		return err;
+
+	/* Process each free list starting from lowest order/mt */
+	for (order = PAGE_REPORTING_MIN_ORDER; order < MAX_ORDER; order++) {
+		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+			/* We do not pull pages from the isolate free list */
+			if (is_migrate_isolate(mt))
+				continue;
+
+			err = page_reporting_cycle(prdev, zone, order, mt,
+						   sgl, &offset);
+			if (err)
+				return err;
+		}
+	}
+
+	/* report the leftover pages before going idle */
+	leftover = PAGE_REPORTING_CAPACITY - offset;
+	if (leftover) {
+		sgl = &sgl[offset];
+		err = prdev->report(prdev, sgl, leftover);
+
+		/* flush any remaining pages out from the last report */
+		spin_lock_irq(&zone->lock);
+		page_reporting_drain(prdev, sgl, leftover, !err);
+		spin_unlock_irq(&zone->lock);
+	}
+
+	return err;
+}
+
+static void page_reporting_process(struct work_struct *work)
+{
+	struct delayed_work *d_work = to_delayed_work(work);
+	struct page_reporting_dev_info *prdev =
+		container_of(d_work, struct page_reporting_dev_info, work);
+	int err = 0, state = PAGE_REPORTING_ACTIVE;
+	struct scatterlist *sgl;
+	struct zone *zone;
+
+	/*
+	 * Alter the state so we can detect any future requests while not
+	 * scheduling any additional work.
+	 */
+	atomic_set(&prdev->state, state);
+
+	/* allocate scatterlist to store pages being reported on */
+	sgl = kmalloc_array(PAGE_REPORTING_CAPACITY, sizeof(*sgl), GFP_KERNEL);
+	if (!sgl)
+		goto err_out;
+
+	sg_init_table(sgl, PAGE_REPORTING_CAPACITY);
+
+	for_each_zone(zone) {
+		err = page_reporting_process_zone(prdev, sgl, zone);
+		if (err)
+			break;
+	}
+
+	kfree(sgl);
+err_out:
+	/*
+	 * If the state has reverted back to requested then there may be
+	 * additional pages to be processed. We will defer for
+	 * PAGE_REPORTING_DELAY to allow more pages to accumulate.
+	 */
+	state = atomic_cmpxchg(&prdev->state, state, PAGE_REPORTING_IDLE);
+	if (state == PAGE_REPORTING_REQUESTED)
+		schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
+}
+
+static DEFINE_MUTEX(page_reporting_mutex);
+DEFINE_STATIC_KEY_FALSE(page_reporting_enabled);
+
+int page_reporting_register(struct page_reporting_dev_info *prdev)
+{
+	int err = 0;
+
+	mutex_lock(&page_reporting_mutex);
+
+	/* nothing to do if already in use */
+	if (rcu_access_pointer(pr_dev_info)) {
+		err = -EBUSY;
+		goto err_out;
+	}
+
+	/* initialize state and work structures */
+	atomic_set(&prdev->state, PAGE_REPORTING_IDLE);
+	INIT_DELAYED_WORK(&prdev->work, &page_reporting_process);
+
+	/* Begin initial flush of zones */
+	__page_reporting_request(prdev);
+
+	/* Assign device to allow notifications */
+	rcu_assign_pointer(pr_dev_info, prdev);
+
+	/* enable page reporting notification */
+	if (!static_key_enabled(&page_reporting_enabled)) {
+		static_branch_enable(&page_reporting_enabled);
+		pr_info("Free page reporting enabled\n");
+	}
+err_out:
+	mutex_unlock(&page_reporting_mutex);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(page_reporting_register);
+
+void page_reporting_unregister(struct page_reporting_dev_info *prdev)
+{
+	mutex_lock(&page_reporting_mutex);
+
+	if (rcu_access_pointer(pr_dev_info) == prdev) {
+		/* Disable page reporting notification */
+		RCU_INIT_POINTER(pr_dev_info, NULL);
+		synchronize_rcu();
+
+		/* Flush any existing work, and lock it out */
+		cancel_delayed_work_sync(&prdev->work);
+	}
+
+	mutex_unlock(&page_reporting_mutex);
+}
+EXPORT_SYMBOL_GPL(page_reporting_unregister);
diff --git a/mm/page_reporting.h b/mm/page_reporting.h
new file mode 100644
index 000000000000..aa6d37f4dc22
--- /dev/null
+++ b/mm/page_reporting.h
@@ -0,0 +1,54 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _MM_PAGE_REPORTING_H
+#define _MM_PAGE_REPORTING_H
+
+#include <linux/mmzone.h>
+#include <linux/pageblock-flags.h>
+#include <linux/page-isolation.h>
+#include <linux/jump_label.h>
+#include <linux/slab.h>
+#include <asm/pgtable.h>
+#include <linux/scatterlist.h>
+
+#define PAGE_REPORTING_MIN_ORDER	pageblock_order
+
+#ifdef CONFIG_PAGE_REPORTING
+DECLARE_STATIC_KEY_FALSE(page_reporting_enabled);
+void __page_reporting_notify(void);
+
+static inline bool page_reported(struct page *page)
+{
+	return static_branch_unlikely(&page_reporting_enabled) &&
+	       PageReported(page);
+}
+
+/**
+ * page_reporting_notify_free - Free page notification to start page processing
+ * @order: order of the page that was just freed
+ *
+ * This function is meant to act as a screener for __page_reporting_notify.
+ * It returns early unless page reporting is enabled and the freed page is
+ * of at least PAGE_REPORTING_MIN_ORDER. Only when both conditions hold does
+ * it call __page_reporting_notify to start page processing.
+ */
+static inline void page_reporting_notify_free(unsigned int order)
+{
+	/* Called from hot path in __free_one_page() */
+	if (!static_branch_unlikely(&page_reporting_enabled))
+		return;
+
+	/* Determine if we have crossed reporting threshold */
+	if (order < PAGE_REPORTING_MIN_ORDER)
+		return;
+
+	/* This will add a few cycles, but should be called infrequently */
+	__page_reporting_notify();
+}
+#else /* CONFIG_PAGE_REPORTING */
+#define page_reported(_page)	false
+
+static inline void page_reporting_notify_free(unsigned int order)
+{
+}
+#endif /* CONFIG_PAGE_REPORTING */
+#endif /* _MM_PAGE_REPORTING_H */