diff mbox series

[RFC,v12,1/2] mm: page_reporting: core infrastructure

Message ID 20190812131235.27244-2-nitesh@redhat.com (mailing list archive)
State New, archived
Headers show
Series mm: Support for page reporting | expand

Commit Message

Nitesh Narayan Lal Aug. 12, 2019, 1:12 p.m. UTC
This patch introduces the core infrastructure for free page reporting in
virtual environments. It enables the kernel to track the free pages which
can be reported to its hypervisor so that the hypervisor could
free and reuse that memory as per its requirement.

While the pages are getting processed in the hypervisor (e.g.,
via MADV_DONTNEED), the guest must not use them, otherwise, data loss
would be possible. To avoid such a situation, these pages are
temporarily removed from the buddy. The amount of pages removed
temporarily from the buddy is governed by the backend(virtio-balloon
in our case).

To efficiently identify free pages that can to be reported to the
hypervisor, bitmaps in a coarse granularity are used. Only fairly big
chunks are reported to the hypervisor - especially, to not break up THP
in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
in the bitmap are an indication whether a page *might* be free, not a
guarantee. A new hook after buddy merging sets the bits.

Bitmaps are stored per zone, protected by the zone lock. A workqueue
asynchronously processes the bitmaps, trying to isolate and report pages
that are still free. The backend (virtio-balloon) is responsible for
reporting these batched pages to the host synchronously. Once reporting/
freeing is complete, isolated pages are returned back to the buddy.

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 include/linux/mmzone.h         |  11 ++
 include/linux/page_reporting.h |  63 +++++++
 mm/Kconfig                     |   6 +
 mm/Makefile                    |   1 +
 mm/page_alloc.c                |  42 ++++-
 mm/page_reporting.c            | 332 +++++++++++++++++++++++++++++++++
 6 files changed, 448 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/page_reporting.h
 create mode 100644 mm/page_reporting.c

Comments

Alexander Duyck Aug. 12, 2019, 6:47 p.m. UTC | #1
On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> This patch introduces the core infrastructure for free page reporting in
> virtual environments. It enables the kernel to track the free pages which
> can be reported to its hypervisor so that the hypervisor could
> free and reuse that memory as per its requirement.
>
> While the pages are getting processed in the hypervisor (e.g.,
> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
> would be possible. To avoid such a situation, these pages are
> temporarily removed from the buddy. The amount of pages removed
> temporarily from the buddy is governed by the backend(virtio-balloon
> in our case).
>
> To efficiently identify free pages that can to be reported to the
> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> chunks are reported to the hypervisor - especially, to not break up THP
> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> in the bitmap are an indication whether a page *might* be free, not a
> guarantee. A new hook after buddy merging sets the bits.
>
> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> asynchronously processes the bitmaps, trying to isolate and report pages
> that are still free. The backend (virtio-balloon) is responsible for
> reporting these batched pages to the host synchronously. Once reporting/
> freeing is complete, isolated pages are returned back to the buddy.
>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>

So if I understand correctly the hotplug support for this is still not
included correct? I assume that is the case since I don't see any
logic for zone resizing.

Also I don't see how this dealt with the sparse issue that was pointed
out earlier. Specifically how would you deal with a zone that has a
wide range between the base and the end and a huge gap somewhere
in-between?

> ---
>  include/linux/mmzone.h         |  11 ++
>  include/linux/page_reporting.h |  63 +++++++
>  mm/Kconfig                     |   6 +
>  mm/Makefile                    |   1 +
>  mm/page_alloc.c                |  42 ++++-
>  mm/page_reporting.c            | 332 +++++++++++++++++++++++++++++++++
>  6 files changed, 448 insertions(+), 7 deletions(-)
>  create mode 100644 include/linux/page_reporting.h
>  create mode 100644 mm/page_reporting.c
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index d77d717c620c..ba5f5b508f25 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -559,6 +559,17 @@ struct zone {
>         /* Zone statistics */
>         atomic_long_t           vm_stat[NR_VM_ZONE_STAT_ITEMS];
>         atomic_long_t           vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
> +#ifdef CONFIG_PAGE_REPORTING
> +       /* Pointer to the bitmap in PAGE_REPORTING_MIN_ORDER granularity */
> +       unsigned long *bitmap;
> +       /* Preserve start and end PFN in case they change due to hotplug */
> +       unsigned long base_pfn;
> +       unsigned long end_pfn;
> +       /* Free pages of granularity PAGE_REPORTING_MIN_ORDER */
> +       atomic_t free_pages;
> +       /* Number of bits required in the bitmap */
> +       unsigned long nbits;
> +#endif
>  } ____cacheline_internodealigned_in_smp;

Okay, so the original thing this patch set had going for it was that
it was non-invasive. However, now you are adding a bunch of stuff to
the zone. That kind of loses the non-invasive argument for this patch
set compared to mine.

If we are going to continue further with this patch set then it might
be worth looking into dynamically allocating the space you need for
this block. At a minimum you could probably look at making the bitmap
an RCU based setup so you could define the base and end along with the
bitmap. It would probably help to resolve the hotplug issues you still
need to address.

>  enum pgdat_flags {
> diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
> new file mode 100644
> index 000000000000..37a39589939d
> --- /dev/null
> +++ b/include/linux/page_reporting.h
> @@ -0,0 +1,63 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_PAGE_REPORTING_H
> +#define _LINUX_PAGE_REPORTING_H
> +
> +#define PAGE_REPORTING_MIN_ORDER               (MAX_ORDER - 2)
> +#define PAGE_REPORTING_MAX_PAGES               16
> +
> +#ifdef CONFIG_PAGE_REPORTING
> +struct page_reporting_config {
> +       /* function to hint batch of isolated pages */
> +       void (*report)(struct page_reporting_config *phconf,
> +                      unsigned int num_pages);
> +
> +       /* scatterlist to hold the isolated pages to be hinted */
> +       struct scatterlist *sg;
> +
> +       /*
> +        * Maxmimum pages that are going to be hinted to the hypervisor at a
> +        * time of granularity >= PAGE_REPORTING_MIN_ORDER.
> +        */
> +       int max_pages;
> +
> +       /* work object to process page reporting rqeuests */
> +       struct work_struct reporting_work;
> +
> +       /* tracks the number of reporting request processed at a time */
> +       atomic_t refcnt;
> +};
> +
> +void __page_reporting_enqueue(struct page *page);
> +void __return_isolated_page(struct zone *zone, struct page *page);
> +void set_pageblock_migratetype(struct page *page, int migratetype);
> +
> +/**
> + * page_reporting_enqueue - checks the eligibility of the freed page based on
> + * its order for further page reporting processing.
> + * @page: page which has been freed.
> + * @order: order for the the free page.
> + */
> +static inline void page_reporting_enqueue(struct page *page, int order)
> +{
> +       if (order < PAGE_REPORTING_MIN_ORDER)
> +               return;
> +       __page_reporting_enqueue(page);
> +}
> +
> +int page_reporting_enable(struct page_reporting_config *phconf);
> +void page_reporting_disable(struct page_reporting_config *phconf);
> +#else
> +static inline void page_reporting_enqueue(struct page *page, int order)
> +{
> +}
> +
> +static inline int page_reporting_enable(struct page_reporting_config *phconf)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static inline void page_reporting_disable(struct page_reporting_config *phconf)
> +{
> +}
> +#endif /* CONFIG_PAGE_REPORTING */
> +#endif /* _LINUX_PAGE_REPORTING_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 56cec636a1fc..6a35a9dad350 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -736,4 +736,10 @@ config ARCH_HAS_PTE_SPECIAL
>  config ARCH_HAS_HUGEPD
>         bool
>
> +# PAGE_REPORTING will allow the guest to report the free pages to the
> +# host in fixed chunks as soon as a fixed threshold is reached.
> +config PAGE_REPORTING
> +       bool
> +       def_bool n
> +       depends on X86_64
>  endmenu
> diff --git a/mm/Makefile b/mm/Makefile
> index 338e528ad436..6a32cdfa61c2 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -104,3 +104,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
>  obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
>  obj-$(CONFIG_HMM_MIRROR) += hmm.o
>  obj-$(CONFIG_MEMFD_CREATE) += memfd.o
> +obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 272c6de1bf4e..aa7c89d50c85 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -68,6 +68,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/nmi.h>
>  #include <linux/psi.h>
> +#include <linux/page_reporting.h>
>
>  #include <asm/sections.h>
>  #include <asm/tlbflush.h>
> @@ -903,7 +904,7 @@ compaction_capture(struct capture_control *capc, struct page *page,
>  static inline void __free_one_page(struct page *page,
>                 unsigned long pfn,
>                 struct zone *zone, unsigned int order,
> -               int migratetype)
> +               int migratetype, bool needs_reporting)
>  {
>         unsigned long combined_pfn;
>         unsigned long uninitialized_var(buddy_pfn);
> @@ -1006,7 +1007,8 @@ static inline void __free_one_page(struct page *page,
>                                 migratetype);
>         else
>                 add_to_free_area(page, &zone->free_area[order], migratetype);
> -
> +       if (needs_reporting)
> +               page_reporting_enqueue(page, order);
>  }
>
>  /*
> @@ -1317,7 +1319,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>                 if (unlikely(isolated_pageblocks))
>                         mt = get_pageblock_migratetype(page);
>
> -               __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> +               __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>                 trace_mm_page_pcpu_drain(page, 0, mt);
>         }
>         spin_unlock(&zone->lock);
> @@ -1326,14 +1328,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  static void free_one_page(struct zone *zone,
>                                 struct page *page, unsigned long pfn,
>                                 unsigned int order,
> -                               int migratetype)
> +                               int migratetype, bool needs_reporting)
>  {
>         spin_lock(&zone->lock);
>         if (unlikely(has_isolate_pageblock(zone) ||
>                 is_migrate_isolate(migratetype))) {
>                 migratetype = get_pfnblock_migratetype(page, pfn);
>         }
> -       __free_one_page(page, pfn, zone, order, migratetype);
> +       __free_one_page(page, pfn, zone, order, migratetype, needs_reporting);
>         spin_unlock(&zone->lock);
>  }
>
> @@ -1423,7 +1425,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>         migratetype = get_pfnblock_migratetype(page, pfn);
>         local_irq_save(flags);
>         __count_vm_events(PGFREE, 1 << order);
> -       free_one_page(page_zone(page), page, pfn, order, migratetype);
> +       free_one_page(page_zone(page), page, pfn, order, migratetype, true);
>         local_irq_restore(flags);
>  }
>
> @@ -2197,6 +2199,32 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>         return NULL;
>  }
>
> +#ifdef CONFIG_PAGE_REPORTING
> +/**
> + * return_isolated_page - returns a reported page back to the buddy.
> + * @zone: zone from where the page was isolated.
> + * @page: page which will be returned.
> + */
> +void __return_isolated_page(struct zone *zone, struct page *page)
> +{
> +       unsigned int order, mt;
> +       unsigned long pfn;
> +
> +       /* zone lock should be held when this function is called */
> +       lockdep_assert_held(&zone->lock);
> +
> +       mt = get_pageblock_migratetype(page);
> +       pfn = page_to_pfn(page);
> +
> +       if (unlikely(has_isolate_pageblock(zone) || is_migrate_isolate(mt)))
> +               mt = get_pfnblock_migratetype(page, pfn);
> +
> +       order = page_private(page);
> +       set_page_private(page, 0);
> +
> +       __free_one_page(page, pfn, zone, order, mt, false);
> +}
> +#endif /* CONFIG_PAGE_REPORTING */
>
>  /*
>   * This array describes the order lists are fallen back to when
> @@ -3041,7 +3069,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>          */
>         if (migratetype >= MIGRATE_PCPTYPES) {
>                 if (unlikely(is_migrate_isolate(migratetype))) {
> -                       free_one_page(zone, page, pfn, 0, migratetype);
> +                       free_one_page(zone, page, pfn, 0, migratetype, true);
>                         return;
>                 }
>                 migratetype = MIGRATE_MOVABLE;
> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> new file mode 100644
> index 000000000000..4ee2c172cd9a
> --- /dev/null
> +++ b/mm/page_reporting.c
> @@ -0,0 +1,332 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Page reporting core infrastructure to enable a VM to report free pages to its
> + * hypervisor.
> + *
> + * Copyright Red Hat, Inc. 2019
> + *
> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/page_reporting.h>
> +#include <linux/scatterlist.h>
> +#include "internal.h"
> +
> +static struct page_reporting_config __rcu *page_reporting_conf __read_mostly;
> +static DEFINE_MUTEX(page_reporting_mutex);
> +
> +static inline unsigned long pfn_to_bit(struct page *page, struct zone *zone)
> +{
> +       unsigned long bitnr;
> +
> +       bitnr = (page_to_pfn(page) - zone->base_pfn) >>
> +               PAGE_REPORTING_MIN_ORDER;
> +
> +       return bitnr;
> +}
> +
> +static void return_isolated_page(struct zone *zone,
> +                                struct page_reporting_config *phconf)
> +{
> +       struct scatterlist *sg = phconf->sg;
> +
> +       spin_lock(&zone->lock);
> +       do {
> +               __return_isolated_page(zone, sg_page(sg));
> +       } while (!sg_is_last(sg++));
> +       spin_unlock(&zone->lock);
> +}
> +
> +static void bitmap_set_bit(struct page *page, struct zone *zone)
> +{
> +       unsigned long bitnr = 0;
> +
> +       /* zone lock should be held when this function is called */
> +       lockdep_assert_held(&zone->lock);
> +

So why does the zone lock need to be held? What could you possibly
race against? If nothing else it might make more sense to look at
moving the bitmap, base, end, and length values into one RCU
allocation structure so that you could do without the requirement of
the zone lock to manipulate the bitmap.

> +       bitnr = pfn_to_bit(page, zone);
> +       /* set bit if it is not already set and is a valid bit */
> +       if (zone->bitmap && bitnr < zone->nbits &&
> +           !test_and_set_bit(bitnr, zone->bitmap))
> +               atomic_inc(&zone->free_pages);
> +}
> +
> +static int process_free_page(struct page *page,
> +                            struct page_reporting_config *phconf, int count)
> +{
> +       int mt, order, ret = 0;
> +
> +       mt = get_pageblock_migratetype(page);
> +       order = page_private(page);
> +       ret = __isolate_free_page(page, order);
> +
> +       if (ret) {
> +               /*
> +                * Preserving order and migratetype for reuse while
> +                * releasing the pages back to the buddy.
> +                */
> +               set_pageblock_migratetype(page, mt);
> +               set_page_private(page, order);
> +
> +               sg_set_page(&phconf->sg[count++], page,
> +                           PAGE_SIZE << order, 0);
> +       }
> +
> +       return count;
> +}
> +
> +/**
> + * scan_zone_bitmap - scans the bitmap for the requested zone.
> + * @phconf: page reporting configuration object initialized by the backend.
> + * @zone: zone for which page reporting is requested.
> + *
> + * For every page marked in the bitmap it checks if it is still free if so it
> + * isolates and adds them to a scatterlist. As soon as the number of isolated
> + * pages reach the threshold set by the backend, they are reported to the
> + * hypervisor by the backend. Once the hypervisor responds after processing
> + * they are returned back to the buddy for reuse.
> + */
> +static void scan_zone_bitmap(struct page_reporting_config *phconf,
> +                            struct zone *zone)
> +{
> +       unsigned long setbit;
> +       struct page *page;
> +       int count = 0;
> +
> +       sg_init_table(phconf->sg, phconf->max_pages);
> +
> +       for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
> +               /* Process only if the page is still online */
> +               page = pfn_to_online_page((setbit << PAGE_REPORTING_MIN_ORDER) +
> +                                         zone->base_pfn);
> +               if (!page)
> +                       continue;
> +

Shouldn't you be clearing the bit and dropping the reference to
free_pages before you move on to the next bit? Otherwise you are going
to be stuck with those aren't you?

> +               spin_lock(&zone->lock);
> +
> +               /* Ensure page is still free and can be processed */
> +               if (PageBuddy(page) && page_private(page) >=
> +                   PAGE_REPORTING_MIN_ORDER)
> +                       count = process_free_page(page, phconf, count);
> +
> +               spin_unlock(&zone->lock);

So I kind of wonder just how much overhead you are taking for bouncing
the zone lock once per page here. Especially since it can result in
you not actually making any progress since the page may have already
been reallocated.

> +               /* Page has been processed, adjust its bit and zone counter */
> +               clear_bit(setbit, zone->bitmap);
> +               atomic_dec(&zone->free_pages);

So earlier you were setting this bit and required that the zone->lock
be held while doing so. Here you are clearing it without any
zone->lock being held.

> +               if (count == phconf->max_pages) {
> +                       /* Report isolated pages to the hypervisor */
> +                       phconf->report(phconf, count);
> +
> +                       /* Return processed pages back to the buddy */
> +                       return_isolated_page(zone, phconf);
> +
> +                       /* Reset for next reporting */
> +                       sg_init_table(phconf->sg, phconf->max_pages);
> +                       count = 0;
> +               }
> +       }
> +       /*
> +        * If the number of isolated pages does not meet the max_pages
> +        * threshold, we would still prefer to report them as we have already
> +        * isolated them.
> +        */
> +       if (count) {
> +               sg_mark_end(&phconf->sg[count - 1]);
> +               phconf->report(phconf, count);
> +
> +               return_isolated_page(zone, phconf);
> +       }
> +}
> +
> +/**
> + * page_reporting_wq - checks the number of free_pages in all the zones and
> + * invokes a request to scan the respective bitmap if free_pages reaches or
> + * exceeds the threshold specified by the backend.
> + */
> +static void page_reporting_wq(struct work_struct *work)
> +{
> +       struct page_reporting_config *phconf =
> +               container_of(work, struct page_reporting_config,
> +                            reporting_work);
> +       struct zone *zone;
> +
> +       for_each_populated_zone(zone) {
> +               if (atomic_read(&zone->free_pages) >= phconf->max_pages)
> +                       scan_zone_bitmap(phconf, zone);
> +       }
> +       /*
> +        * We have processed all the zones, we can process new page reporting
> +        * request now.
> +        */
> +       atomic_set(&phconf->refcnt, 0);

It doesn't make sense to reset the refcnt once you have made a single pass.

> +}
> +
> +/**
> + * __page_reporting_enqueue - tracks the freed page in the respective zone's
> + * bitmap and enqueues a new page reporting job to the workqueue if possible.
> + */
> +void __page_reporting_enqueue(struct page *page)
> +{
> +       struct page_reporting_config *phconf;
> +       struct zone *zone;
> +
> +       rcu_read_lock();
> +       /*
> +        * We should not process this page if either page reporting is not
> +        * yet completely enabled or it has been disabled by the backend.
> +        */
> +       phconf = rcu_dereference(page_reporting_conf);
> +       if (!phconf)
> +               return;
> +
> +       zone = page_zone(page);
> +       bitmap_set_bit(page, zone);
> +
> +       /*
> +        * We should not enqueue a job if a previously enqueued reporting work
> +        * is in progress or we don't have enough free pages in the zone.
> +        */
> +       if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
> +           !atomic_cmpxchg(&phconf->refcnt, 0, 1))

This doesn't make any sense to me. Why are you only incrementing the
refcount if it is zero? Combining this with the assignment above, this
isn't really a refcnt. It is just an oversized bitflag.

Also I am pretty sure this results in the opportunity to miss pages
because there is nothing to prevent you from possibly missing a ton of
pages you could hint on if a large number of pages are pushed out all
at once and then the system goes idle in terms of memory allocation
and freeing.

> +               schedule_work(&phconf->reporting_work);
> +
> +       rcu_read_unlock();
> +}
> +
> +/**
> + * zone_reporting_cleanup - resets the page reporting fields and free the
> + * bitmap for all the initialized zones.
> + */
> +static void zone_reporting_cleanup(void)
> +{
> +       struct zone *zone;
> +
> +       for_each_populated_zone(zone) {
> +               /*
> +                * Bitmap may not be allocated for all the zones if the
> +                * initialization fails before reaching to the last one.
> +                */
> +               if (!zone->bitmap)
> +                       continue;
> +               bitmap_free(zone->bitmap);
> +               zone->bitmap = NULL;
> +               atomic_set(&zone->free_pages, 0);
> +       }
> +}
> +
> +static int zone_bitmap_alloc(struct zone *zone)
> +{
> +       unsigned long bitmap_size, pages;
> +
> +       pages = zone->end_pfn - zone->base_pfn;
> +       bitmap_size = (pages >> PAGE_REPORTING_MIN_ORDER) + 1;
> +
> +       if (!bitmap_size)
> +               return 0;
> +
> +       zone->bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
> +       if (!zone->bitmap)
> +               return -ENOMEM;
> +
> +       zone->nbits = bitmap_size;
> +
> +       return 0;
> +}
> +

So as per your comments in the cover page, the two functions above
should also probably be plugged into the zone resizing logic somewhere
so if a zone is resized the bitmap is adjusted.

> +/**
> + * zone_reporting_init - For each zone initializes the page reporting fields
> + * and allocates the respective bitmap.
> + *
> + * This function returns 0 on successful initialization, -ENOMEM otherwise.
> + */
> +static int zone_reporting_init(void)
> +{
> +       struct zone *zone;
> +       int ret;
> +
> +       for_each_populated_zone(zone) {
> +#ifdef CONFIG_ZONE_DEVICE
> +               /* we can not report pages which are not in the buddy */
> +               if (zone_idx(zone) == ZONE_DEVICE)
> +                       continue;
> +#endif

I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
zone will be considered "populated".


> +               spin_lock(&zone->lock);
> +               zone->base_pfn = zone->zone_start_pfn;
> +               zone->end_pfn = zone_end_pfn(zone);
> +               spin_unlock(&zone->lock);
> +
> +               ret = zone_bitmap_alloc(zone);
> +               if (ret < 0) {
> +                       zone_reporting_cleanup();
> +                       return ret;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
> +void page_reporting_disable(struct page_reporting_config *phconf)
> +{
> +       mutex_lock(&page_reporting_mutex);
> +
> +       if (rcu_access_pointer(page_reporting_conf) != phconf)
> +               return;
> +
> +       RCU_INIT_POINTER(page_reporting_conf, NULL);
> +       synchronize_rcu();
> +
> +       /* Cancel any pending reporting request */
> +       cancel_work_sync(&phconf->reporting_work);
> +
> +       /* Free the scatterlist used for isolated pages */
> +       kfree(phconf->sg);
> +       phconf->sg = NULL;
> +
> +       /* Cleanup the bitmaps and old tracking data */
> +       zone_reporting_cleanup();
> +
> +       mutex_unlock(&page_reporting_mutex);
> +}
> +EXPORT_SYMBOL_GPL(page_reporting_disable);
> +
> +int page_reporting_enable(struct page_reporting_config *phconf)
> +{
> +       int ret = 0;
> +
> +       mutex_lock(&page_reporting_mutex);
> +
> +       /* check if someone is already using page reporting*/
> +       if (rcu_access_pointer(page_reporting_conf)) {
> +               ret = -EBUSY;
> +               goto out;
> +       }
> +
> +       /* allocate scatterlist to hold isolated pages */
> +       phconf->sg = kcalloc(phconf->max_pages, sizeof(*phconf->sg),
> +                            GFP_KERNEL);
> +       if (!phconf->sg) {
> +               ret = -ENOMEM;
> +               goto out;
> +       }
> +
> +       /* initialize each zone's fields required for page reporting */
> +       ret = zone_reporting_init();
> +       if (ret < 0) {
> +               kfree(phconf->sg);
> +               goto out;
> +       }
> +
> +       atomic_set(&phconf->refcnt, 0);
> +       INIT_WORK(&phconf->reporting_work, page_reporting_wq);
> +
> +       /* assign the configuration object provided by the backend */
> +       rcu_assign_pointer(page_reporting_conf, phconf);
> +
> +out:
> +       mutex_unlock(&page_reporting_mutex);
> +       return ret;
> +}
> +EXPORT_SYMBOL_GPL(page_reporting_enable);
> --
> 2.21.0
>
Nitesh Narayan Lal Aug. 12, 2019, 8:04 p.m. UTC | #2
On 8/12/19 2:47 PM, Alexander Duyck wrote:
> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> This patch introduces the core infrastructure for free page reporting in
>> virtual environments. It enables the kernel to track the free pages which
>> can be reported to its hypervisor so that the hypervisor could
>> free and reuse that memory as per its requirement.
>>
>> While the pages are getting processed in the hypervisor (e.g.,
>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
>> would be possible. To avoid such a situation, these pages are
>> temporarily removed from the buddy. The amount of pages removed
>> temporarily from the buddy is governed by the backend(virtio-balloon
>> in our case).
>>
>> To efficiently identify free pages that can to be reported to the
>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>> chunks are reported to the hypervisor - especially, to not break up THP
>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>> in the bitmap are an indication whether a page *might* be free, not a
>> guarantee. A new hook after buddy merging sets the bits.
>>
>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>> asynchronously processes the bitmaps, trying to isolate and report pages
>> that are still free. The backend (virtio-balloon) is responsible for
>> reporting these batched pages to the host synchronously. Once reporting/
>> freeing is complete, isolated pages are returned back to the buddy.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> So if I understand correctly the hotplug support for this is still not
> included correct? 


That is correct, I have it as an ongoing-item in my cover-email.
In case, we decide to go with this approach do you think it is a blocker?


> I assume that is the case since I don't see any
> logic for zone resizing.
>
> Also I don't see how this dealt with the sparse issue that was pointed
> out earlier. Specifically how would you deal with a zone that has a
> wide range between the base and the end and a huge gap somewhere
> in-between?

It doesn't. However, considering we are tracking page on order of (MAX_ORDER -
2) I don't think the loss will be significant.
In any case, this is certainly a drawback of this approach and I should add this
in my cover.
The one thing which I did change in this version is that now I started
maintaining bitmap for each zone.

>
>> ---
>>  include/linux/mmzone.h         |  11 ++
>>  include/linux/page_reporting.h |  63 +++++++
>>  mm/Kconfig                     |   6 +
>>  mm/Makefile                    |   1 +
>>  mm/page_alloc.c                |  42 ++++-
>>  mm/page_reporting.c            | 332 +++++++++++++++++++++++++++++++++
>>  6 files changed, 448 insertions(+), 7 deletions(-)
>>  create mode 100644 include/linux/page_reporting.h
>>  create mode 100644 mm/page_reporting.c
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index d77d717c620c..ba5f5b508f25 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -559,6 +559,17 @@ struct zone {
>>         /* Zone statistics */
>>         atomic_long_t           vm_stat[NR_VM_ZONE_STAT_ITEMS];
>>         atomic_long_t           vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
>> +#ifdef CONFIG_PAGE_REPORTING
>> +       /* Pointer to the bitmap in PAGE_REPORTING_MIN_ORDER granularity */
>> +       unsigned long *bitmap;
>> +       /* Preserve start and end PFN in case they change due to hotplug */
>> +       unsigned long base_pfn;
>> +       unsigned long end_pfn;
>> +       /* Free pages of granularity PAGE_REPORTING_MIN_ORDER */
>> +       atomic_t free_pages;
>> +       /* Number of bits required in the bitmap */
>> +       unsigned long nbits;
>> +#endif
>>  } ____cacheline_internodealigned_in_smp;
> Okay, so the original thing this patch set had going for it was that
> it was non-invasive. However, now you are adding a bunch of stuff to
> the zone. That kind of loses the non-invasive argument for this patch
> set compared to mine.

I think it is fair to add that it not as invasive as yours. :) (But that has its
own pros and cons)
In any case, I do understand your point.

>
>
> If we are going to continue further with this patch set then it might
> be worth looking into dynamically allocating the space you need for
> this block.

Not sure if I understood this part. Dynamic allocation for the structure which
you are suggesting below?


>  At a minimum you could probably look at making the bitmap
> an RCU based setup so you could define the base and end along with the
> bitmap. It would probably help to resolve the hotplug issues you still
> need to address.


Thanks for the suggestion. I will look into it.


>
>>  enum pgdat_flags {
>> diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
>> new file mode 100644
>> index 000000000000..37a39589939d
>> --- /dev/null
>> +++ b/include/linux/page_reporting.h
>> @@ -0,0 +1,63 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_PAGE_REPORTING_H
>> +#define _LINUX_PAGE_REPORTING_H
>> +
>> +#define PAGE_REPORTING_MIN_ORDER               (MAX_ORDER - 2)
>> +#define PAGE_REPORTING_MAX_PAGES               16
>> +
>> +#ifdef CONFIG_PAGE_REPORTING
>> +struct page_reporting_config {
>> +       /* function to hint batch of isolated pages */
>> +       void (*report)(struct page_reporting_config *phconf,
>> +                      unsigned int num_pages);
>> +
>> +       /* scatterlist to hold the isolated pages to be hinted */
>> +       struct scatterlist *sg;
>> +
>> +       /*
>> +        * Maxmimum pages that are going to be hinted to the hypervisor at a
>> +        * time of granularity >= PAGE_REPORTING_MIN_ORDER.
>> +        */
>> +       int max_pages;
>> +
>> +       /* work object to process page reporting rqeuests */
>> +       struct work_struct reporting_work;
>> +
>> +       /* tracks the number of reporting request processed at a time */
>> +       atomic_t refcnt;
>> +};
>> +
>> +void __page_reporting_enqueue(struct page *page);
>> +void __return_isolated_page(struct zone *zone, struct page *page);
>> +void set_pageblock_migratetype(struct page *page, int migratetype);
>> +
>> +/**
>> + * page_reporting_enqueue - checks the eligibility of the freed page based on
>> + * its order for further page reporting processing.
>> + * @page: page which has been freed.
>> + * @order: order for the the free page.
>> + */
>> +static inline void page_reporting_enqueue(struct page *page, int order)
>> +{
>> +       if (order < PAGE_REPORTING_MIN_ORDER)
>> +               return;
>> +       __page_reporting_enqueue(page);
>> +}
>> +
>> +int page_reporting_enable(struct page_reporting_config *phconf);
>> +void page_reporting_disable(struct page_reporting_config *phconf);
>> +#else
>> +static inline void page_reporting_enqueue(struct page *page, int order)
>> +{
>> +}
>> +
>> +static inline int page_reporting_enable(struct page_reporting_config *phconf)
>> +{
>> +       return -EOPNOTSUPP;
>> +}
>> +
>> +static inline void page_reporting_disable(struct page_reporting_config *phconf)
>> +{
>> +}
>> +#endif /* CONFIG_PAGE_REPORTING */
>> +#endif /* _LINUX_PAGE_REPORTING_H */
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 56cec636a1fc..6a35a9dad350 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -736,4 +736,10 @@ config ARCH_HAS_PTE_SPECIAL
>>  config ARCH_HAS_HUGEPD
>>         bool
>>
>> +# PAGE_REPORTING will allow the guest to report the free pages to the
>> +# host in fixed chunks as soon as a fixed threshold is reached.
>> +config PAGE_REPORTING
>> +       bool
>> +       def_bool n
>> +       depends on X86_64
>>  endmenu
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 338e528ad436..6a32cdfa61c2 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -104,3 +104,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
>>  obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
>>  obj-$(CONFIG_HMM_MIRROR) += hmm.o
>>  obj-$(CONFIG_MEMFD_CREATE) += memfd.o
>> +obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 272c6de1bf4e..aa7c89d50c85 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -68,6 +68,7 @@
>>  #include <linux/lockdep.h>
>>  #include <linux/nmi.h>
>>  #include <linux/psi.h>
>> +#include <linux/page_reporting.h>
>>
>>  #include <asm/sections.h>
>>  #include <asm/tlbflush.h>
>> @@ -903,7 +904,7 @@ compaction_capture(struct capture_control *capc, struct page *page,
>>  static inline void __free_one_page(struct page *page,
>>                 unsigned long pfn,
>>                 struct zone *zone, unsigned int order,
>> -               int migratetype)
>> +               int migratetype, bool needs_reporting)
>>  {
>>         unsigned long combined_pfn;
>>         unsigned long uninitialized_var(buddy_pfn);
>> @@ -1006,7 +1007,8 @@ static inline void __free_one_page(struct page *page,
>>                                 migratetype);
>>         else
>>                 add_to_free_area(page, &zone->free_area[order], migratetype);
>> -
>> +       if (needs_reporting)
>> +               page_reporting_enqueue(page, order);
>>  }
>>
>>  /*
>> @@ -1317,7 +1319,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>                 if (unlikely(isolated_pageblocks))
>>                         mt = get_pageblock_migratetype(page);
>>
>> -               __free_one_page(page, page_to_pfn(page), zone, 0, mt);
>> +               __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>>                 trace_mm_page_pcpu_drain(page, 0, mt);
>>         }
>>         spin_unlock(&zone->lock);
>> @@ -1326,14 +1328,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>  static void free_one_page(struct zone *zone,
>>                                 struct page *page, unsigned long pfn,
>>                                 unsigned int order,
>> -                               int migratetype)
>> +                               int migratetype, bool needs_reporting)
>>  {
>>         spin_lock(&zone->lock);
>>         if (unlikely(has_isolate_pageblock(zone) ||
>>                 is_migrate_isolate(migratetype))) {
>>                 migratetype = get_pfnblock_migratetype(page, pfn);
>>         }
>> -       __free_one_page(page, pfn, zone, order, migratetype);
>> +       __free_one_page(page, pfn, zone, order, migratetype, needs_reporting);
>>         spin_unlock(&zone->lock);
>>  }
>>
>> @@ -1423,7 +1425,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>>         migratetype = get_pfnblock_migratetype(page, pfn);
>>         local_irq_save(flags);
>>         __count_vm_events(PGFREE, 1 << order);
>> -       free_one_page(page_zone(page), page, pfn, order, migratetype);
>> +       free_one_page(page_zone(page), page, pfn, order, migratetype, true);
>>         local_irq_restore(flags);
>>  }
>>
>> @@ -2197,6 +2199,32 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>>         return NULL;
>>  }
>>
>> +#ifdef CONFIG_PAGE_REPORTING
>> +/**
>> + * return_isolated_page - returns a reported page back to the buddy.
>> + * @zone: zone from where the page was isolated.
>> + * @page: page which will be returned.
>> + */
>> +void __return_isolated_page(struct zone *zone, struct page *page)
>> +{
>> +       unsigned int order, mt;
>> +       unsigned long pfn;
>> +
>> +       /* zone lock should be held when this function is called */
>> +       lockdep_assert_held(&zone->lock);
>> +
>> +       mt = get_pageblock_migratetype(page);
>> +       pfn = page_to_pfn(page);
>> +
>> +       if (unlikely(has_isolate_pageblock(zone) || is_migrate_isolate(mt)))
>> +               mt = get_pfnblock_migratetype(page, pfn);
>> +
>> +       order = page_private(page);
>> +       set_page_private(page, 0);
>> +
>> +       __free_one_page(page, pfn, zone, order, mt, false);
>> +}
>> +#endif /* CONFIG_PAGE_REPORTING */
>>
>>  /*
>>   * This array describes the order lists are fallen back to when
>> @@ -3041,7 +3069,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>>          */
>>         if (migratetype >= MIGRATE_PCPTYPES) {
>>                 if (unlikely(is_migrate_isolate(migratetype))) {
>> -                       free_one_page(zone, page, pfn, 0, migratetype);
>> +                       free_one_page(zone, page, pfn, 0, migratetype, true);
>>                         return;
>>                 }
>>                 migratetype = MIGRATE_MOVABLE;
>> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
>> new file mode 100644
>> index 000000000000..4ee2c172cd9a
>> --- /dev/null
>> +++ b/mm/page_reporting.c
>> @@ -0,0 +1,332 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Page reporting core infrastructure to enable a VM to report free pages to its
>> + * hypervisor.
>> + *
>> + * Copyright Red Hat, Inc. 2019
>> + *
>> + * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/slab.h>
>> +#include <linux/page_reporting.h>
>> +#include <linux/scatterlist.h>
>> +#include "internal.h"
>> +
>> +static struct page_reporting_config __rcu *page_reporting_conf __read_mostly;
>> +static DEFINE_MUTEX(page_reporting_mutex);
>> +
>> +static inline unsigned long pfn_to_bit(struct page *page, struct zone *zone)
>> +{
>> +       unsigned long bitnr;
>> +
>> +       bitnr = (page_to_pfn(page) - zone->base_pfn) >>
>> +               PAGE_REPORTING_MIN_ORDER;
>> +
>> +       return bitnr;
>> +}
>> +
>> +static void return_isolated_page(struct zone *zone,
>> +                                struct page_reporting_config *phconf)
>> +{
>> +       struct scatterlist *sg = phconf->sg;
>> +
>> +       spin_lock(&zone->lock);
>> +       do {
>> +               __return_isolated_page(zone, sg_page(sg));
>> +       } while (!sg_is_last(sg++));
>> +       spin_unlock(&zone->lock);
>> +}
>> +
>> +static void bitmap_set_bit(struct page *page, struct zone *zone)
>> +{
>> +       unsigned long bitnr = 0;
>> +
>> +       /* zone lock should be held when this function is called */
>> +       lockdep_assert_held(&zone->lock);
>> +
> So why does the zone lock need to be held? What could you possibly
> race against? If nothing else it might make more sense to look at
> moving the bitmap, base, end, and length values into one RCU
> allocation structure so that you could do without the requirement of
> the zone lock to manipulate the bitmap.


Makes sense to me.


>> +       bitnr = pfn_to_bit(page, zone);
>> +       /* set bit if it is not already set and is a valid bit */
>> +       if (zone->bitmap && bitnr < zone->nbits &&
>> +           !test_and_set_bit(bitnr, zone->bitmap))
>> +               atomic_inc(&zone->free_pages);
>> +}
>> +
>> +static int process_free_page(struct page *page,
>> +                            struct page_reporting_config *phconf, int count)
>> +{
>> +       int mt, order, ret = 0;
>> +
>> +       mt = get_pageblock_migratetype(page);
>> +       order = page_private(page);
>> +       ret = __isolate_free_page(page, order);
>> +
>> +       if (ret) {
>> +               /*
>> +                * Preserving order and migratetype for reuse while
>> +                * releasing the pages back to the buddy.
>> +                */
>> +               set_pageblock_migratetype(page, mt);
>> +               set_page_private(page, order);
>> +
>> +               sg_set_page(&phconf->sg[count++], page,
>> +                           PAGE_SIZE << order, 0);
>> +       }
>> +
>> +       return count;
>> +}
>> +
>> +/**
>> + * scan_zone_bitmap - scans the bitmap for the requested zone.
>> + * @phconf: page reporting configuration object initialized by the backend.
>> + * @zone: zone for which page reporting is requested.
>> + *
>> + * For every page marked in the bitmap it checks if it is still free if so it
>> + * isolates and adds them to a scatterlist. As soon as the number of isolated
>> + * pages reach the threshold set by the backend, they are reported to the
>> + * hypervisor by the backend. Once the hypervisor responds after processing
>> + * they are returned back to the buddy for reuse.
>> + */
>> +static void scan_zone_bitmap(struct page_reporting_config *phconf,
>> +                            struct zone *zone)
>> +{
>> +       unsigned long setbit;
>> +       struct page *page;
>> +       int count = 0;
>> +
>> +       sg_init_table(phconf->sg, phconf->max_pages);
>> +
>> +       for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
>> +               /* Process only if the page is still online */
>> +               page = pfn_to_online_page((setbit << PAGE_REPORTING_MIN_ORDER) +
>> +                                         zone->base_pfn);
>> +               if (!page)
>> +                       continue;
>> +
> Shouldn't you be clearing the bit and dropping the reference to
> free_pages before you move on to the next bit? Otherwise you are going
> to be stuck with those aren't you?


+1. Thanks for pointing this out.


>> +               spin_lock(&zone->lock);
>> +
>> +               /* Ensure page is still free and can be processed */
>> +               if (PageBuddy(page) && page_private(page) >=
>> +                   PAGE_REPORTING_MIN_ORDER)
>> +                       count = process_free_page(page, phconf, count);
>> +
>> +               spin_unlock(&zone->lock);
> So I kind of wonder just how much overhead you are taking for bouncing
> the zone lock once per page here. Especially since it can result in
> you not actually making any progress since the page may have already
> been reallocated.


Any suggestion about how can I see the overhead involved here?


>> +               /* Page has been processed, adjust its bit and zone counter */
>> +               clear_bit(setbit, zone->bitmap);
>> +               atomic_dec(&zone->free_pages);
> So earlier you were setting this bit and required that the zone->lock
> be held while doing so. Here you are clearing it without any
> zone->lock being held.


I should have held the zone->lock here while clearing the bit.


>
>> +               if (count == phconf->max_pages) {
>> +                       /* Report isolated pages to the hypervisor */
>> +                       phconf->report(phconf, count);
>> +
>> +                       /* Return processed pages back to the buddy */
>> +                       return_isolated_page(zone, phconf);
>> +
>> +                       /* Reset for next reporting */
>> +                       sg_init_table(phconf->sg, phconf->max_pages);
>> +                       count = 0;
>> +               }
>> +       }
>> +       /*
>> +        * If the number of isolated pages does not meet the max_pages
>> +        * threshold, we would still prefer to report them as we have already
>> +        * isolated them.
>> +        */
>> +       if (count) {
>> +               sg_mark_end(&phconf->sg[count - 1]);
>> +               phconf->report(phconf, count);
>> +
>> +               return_isolated_page(zone, phconf);
>> +       }
>> +}
>> +
>> +/**
>> + * page_reporting_wq - checks the number of free_pages in all the zones and
>> + * invokes a request to scan the respective bitmap if free_pages reaches or
>> + * exceeds the threshold specified by the backend.
>> + */
>> +static void page_reporting_wq(struct work_struct *work)
>> +{
>> +       struct page_reporting_config *phconf =
>> +               container_of(work, struct page_reporting_config,
>> +                            reporting_work);
>> +       struct zone *zone;
>> +
>> +       for_each_populated_zone(zone) {
>> +               if (atomic_read(&zone->free_pages) >= phconf->max_pages)
>> +                       scan_zone_bitmap(phconf, zone);
>> +       }
>> +       /*
>> +        * We have processed all the zones, we can process new page reporting
>> +        * request now.
>> +        */
>> +       atomic_set(&phconf->refcnt, 0);
> It doesn't make sense to reset the refcnt once you have made a single pass.
>
>> +}
>> +
>> +/**
>> + * __page_reporting_enqueue - tracks the freed page in the respective zone's
>> + * bitmap and enqueues a new page reporting job to the workqueue if possible.
>> + */
>> +void __page_reporting_enqueue(struct page *page)
>> +{
>> +       struct page_reporting_config *phconf;
>> +       struct zone *zone;
>> +
>> +       rcu_read_lock();
>> +       /*
>> +        * We should not process this page if either page reporting is not
>> +        * yet completely enabled or it has been disabled by the backend.
>> +        */
>> +       phconf = rcu_dereference(page_reporting_conf);
>> +       if (!phconf)
>> +               return;
>> +
>> +       zone = page_zone(page);
>> +       bitmap_set_bit(page, zone);
>> +
>> +       /*
>> +        * We should not enqueue a job if a previously enqueued reporting work
>> +        * is in progress or we don't have enough free pages in the zone.
>> +        */
>> +       if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
>> +           !atomic_cmpxchg(&phconf->refcnt, 0, 1))
> This doesn't make any sense to me. Why are you only incrementing the
> refcount if it is zero? Combining this with the assignment above, this
> isn't really a refcnt. It is just an oversized bitflag.
>

The reason why I have this here is that at a time I only want a single request
to be en-queued.
Now, every time free_page threshold for any zone is met I am checking each zone
for possible reporting.

I think there are two change required here:
1. rename this flag.
2. use refcnt to actually track the usage of page_hinting_config object separately.

> Also I am pretty sure this results in the opportunity to miss pages
> because there is nothing to prevent you from possibly missing a ton of
> pages you could hint on if a large number of pages are pushed out all
> at once and then the system goes idle in terms of memory allocation
> and freeing.

I have failed to reproduce this kind of situation.
I have a test app which simply allocates large memory chunk, touches it and then
exits. In this case, I get most of the memory back.


>
>> +               schedule_work(&phconf->reporting_work);
>> +
>> +       rcu_read_unlock();
>> +}
>> +
>> +/**
>> + * zone_reporting_cleanup - resets the page reporting fields and free the
>> + * bitmap for all the initialized zones.
>> + */
>> +static void zone_reporting_cleanup(void)
>> +{
>> +       struct zone *zone;
>> +
>> +       for_each_populated_zone(zone) {
>> +               /*
>> +                * Bitmap may not be allocated for all the zones if the
>> +                * initialization fails before reaching to the last one.
>> +                */
>> +               if (!zone->bitmap)
>> +                       continue;
>> +               bitmap_free(zone->bitmap);
>> +               zone->bitmap = NULL;
>> +               atomic_set(&zone->free_pages, 0);
>> +       }
>> +}
>> +
>> +static int zone_bitmap_alloc(struct zone *zone)
>> +{
>> +       unsigned long bitmap_size, pages;
>> +
>> +       pages = zone->end_pfn - zone->base_pfn;
>> +       bitmap_size = (pages >> PAGE_REPORTING_MIN_ORDER) + 1;
>> +
>> +       if (!bitmap_size)
>> +               return 0;
>> +
>> +       zone->bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
>> +       if (!zone->bitmap)
>> +               return -ENOMEM;
>> +
>> +       zone->nbits = bitmap_size;
>> +
>> +       return 0;
>> +}
>> +
> So as per your comments in the cover page, the two functions above
> should also probably be plugged into the zone resizing logic somewhere
> so if a zone is resized the bitmap is adjusted.


I think the right way will be to have a common interface which could be called
here and in memory hotplug/unplug case.
Right now, in my prototype, I have a different functions which adjusts the size
of the bitmap based on the memory notifier action.


>
>> +/**
>> + * zone_reporting_init - For each zone initializes the page reporting fields
>> + * and allocates the respective bitmap.
>> + *
>> + * This function returns 0 on successful initialization, -ENOMEM otherwise.
>> + */
>> +static int zone_reporting_init(void)
>> +{
>> +       struct zone *zone;
>> +       int ret;
>> +
>> +       for_each_populated_zone(zone) {
>> +#ifdef CONFIG_ZONE_DEVICE
>> +               /* we can not report pages which are not in the buddy */
>> +               if (zone_idx(zone) == ZONE_DEVICE)
>> +                       continue;
>> +#endif
> I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
> zone will be considered "populated".


hmm, I was not aware of it. I will dig in more about it.


>
>> +               spin_lock(&zone->lock);
>> +               zone->base_pfn = zone->zone_start_pfn;
>> +               zone->end_pfn = zone_end_pfn(zone);
>> +               spin_unlock(&zone->lock);
>> +
>> +               ret = zone_bitmap_alloc(zone);
>> +               if (ret < 0) {
>> +                       zone_reporting_cleanup();
>> +                       return ret;
>> +               }
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +void page_reporting_disable(struct page_reporting_config *phconf)
>> +{
>> +       mutex_lock(&page_reporting_mutex);
>> +
>> +       if (rcu_access_pointer(page_reporting_conf) != phconf)
>> +               return;
>> +
>> +       RCU_INIT_POINTER(page_reporting_conf, NULL);
>> +       synchronize_rcu();
>> +
>> +       /* Cancel any pending reporting request */
>> +       cancel_work_sync(&phconf->reporting_work);
>> +
>> +       /* Free the scatterlist used for isolated pages */
>> +       kfree(phconf->sg);
>> +       phconf->sg = NULL;
>> +
>> +       /* Cleanup the bitmaps and old tracking data */
>> +       zone_reporting_cleanup();
>> +
>> +       mutex_unlock(&page_reporting_mutex);
>> +}
>> +EXPORT_SYMBOL_GPL(page_reporting_disable);
>> +
>> +int page_reporting_enable(struct page_reporting_config *phconf)
>> +{
>> +       int ret = 0;
>> +
>> +       mutex_lock(&page_reporting_mutex);
>> +
>> +       /* check if someone is already using page reporting*/
>> +       if (rcu_access_pointer(page_reporting_conf)) {
>> +               ret = -EBUSY;
>> +               goto out;
>> +       }
>> +
>> +       /* allocate scatterlist to hold isolated pages */
>> +       phconf->sg = kcalloc(phconf->max_pages, sizeof(*phconf->sg),
>> +                            GFP_KERNEL);
>> +       if (!phconf->sg) {
>> +               ret = -ENOMEM;
>> +               goto out;
>> +       }
>> +
>> +       /* initialize each zone's fields required for page reporting */
>> +       ret = zone_reporting_init();
>> +       if (ret < 0) {
>> +               kfree(phconf->sg);
>> +               goto out;
>> +       }
>> +
>> +       atomic_set(&phconf->refcnt, 0);
>> +       INIT_WORK(&phconf->reporting_work, page_reporting_wq);
>> +
>> +       /* assign the configuration object provided by the backend */
>> +       rcu_assign_pointer(page_reporting_conf, phconf);
>> +
>> +out:
>> +       mutex_unlock(&page_reporting_mutex);
>> +       return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(page_reporting_enable);
>> --
>> 2.21.0
>>
David Hildenbrand Aug. 12, 2019, 8:05 p.m. UTC | #3
>> ---
>>  include/linux/mmzone.h         |  11 ++
>>  include/linux/page_reporting.h |  63 +++++++
>>  mm/Kconfig                     |   6 +
>>  mm/Makefile                    |   1 +
>>  mm/page_alloc.c                |  42 ++++-
>>  mm/page_reporting.c            | 332 +++++++++++++++++++++++++++++++++
>>  6 files changed, 448 insertions(+), 7 deletions(-)
>>  create mode 100644 include/linux/page_reporting.h
>>  create mode 100644 mm/page_reporting.c
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index d77d717c620c..ba5f5b508f25 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -559,6 +559,17 @@ struct zone {
>>         /* Zone statistics */
>>         atomic_long_t           vm_stat[NR_VM_ZONE_STAT_ITEMS];
>>         atomic_long_t           vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
>> +#ifdef CONFIG_PAGE_REPORTING
>> +       /* Pointer to the bitmap in PAGE_REPORTING_MIN_ORDER granularity */
>> +       unsigned long *bitmap;
>> +       /* Preserve start and end PFN in case they change due to hotplug */
>> +       unsigned long base_pfn;
>> +       unsigned long end_pfn;
>> +       /* Free pages of granularity PAGE_REPORTING_MIN_ORDER */
>> +       atomic_t free_pages;
>> +       /* Number of bits required in the bitmap */
>> +       unsigned long nbits;
>> +#endif
>>  } ____cacheline_internodealigned_in_smp;
> 
> Okay, so the original thing this patch set had going for it was that
> it was non-invasive. However, now you are adding a bunch of stuff to
> the zone. That kind of loses the non-invasive argument for this patch
> set compared to mine.
> 

Adding something to "struct zone" is certainly less invasive than core
buddy modifications, just saying (I agree that this is suboptimal. I
would have guessed that all that's needed is a pointer to some private
structure here). However, the migratetype thingy below looks fishy to me.

> If we are going to continue further with this patch set then it might
> be worth looking into dynamically allocating the space you need for
> this block. At a minimum you could probably look at making the bitmap
> an RCU based setup so you could define the base and end along with the
> bitmap. It would probably help to resolve the hotplug issues you still
> need to address.

Yeah, I guess that makes sense.

[...]
>> +
>> +static int process_free_page(struct page *page,
>> +                            struct page_reporting_config *phconf, int count)
>> +{
>> +       int mt, order, ret = 0;
>> +
>> +       mt = get_pageblock_migratetype(page);
>> +       order = page_private(page);
>> +       ret = __isolate_free_page(page, order);
>> +

I just started looking into the wonderful world of
isolation/compaction/migration.

I don't think saving/restoring the migratetype is correct here. AFAIK,
MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
movable pages and up in UNMOVABLE or ordinary kernel allocations on
MOVABLE. So that shouldn't be an issue - I guess.

1. You should never allocate something that is no
MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
CMA here. There should at least be a !is_migrate_isolate_page() check
somewhere

2. set_migratetype_isolate() takes the zone lock, so to avoid racing
with isolation code, you have to hold the zone lock. Your code seems to
do that, so at least you cannot race against isolation.

3. You could end up temporarily allocating something in the
ZONE_MOVABLE. The pages you allocate are, however, not movable. There
would have to be a way to make alloc_contig_range()/offlining code
properly wait until the pages have been processed. Not sure about the
real implications, though - too many details in the code (I wonder if
Alex' series has a way of dealing with that)

When you restore the migratetype, you could suddenly overwrite e.g.,
ISOLATE, which feels wrong.

[...]
> So as per your comments in the cover page, the two functions above
> should also probably be plugged into the zone resizing logic somewhere
> so if a zone is resized the bitmap is adjusted.
> 
>> +/**
>> + * zone_reporting_init - For each zone initializes the page reporting fields
>> + * and allocates the respective bitmap.
>> + *
>> + * This function returns 0 on successful initialization, -ENOMEM otherwise.
>> + */
>> +static int zone_reporting_init(void)
>> +{
>> +       struct zone *zone;
>> +       int ret;
>> +
>> +       for_each_populated_zone(zone) {
>> +#ifdef CONFIG_ZONE_DEVICE
>> +               /* we can not report pages which are not in the buddy */
>> +               if (zone_idx(zone) == ZONE_DEVICE)
>> +                       continue;
>> +#endif
> 
> I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
> zone will be considered "populated".
> 
I think you are right (although it's confusing, we will have present
sections part of a zone but the zone has no present_pages - screams like
a re factoring - leftover from ZONE_DEVICE introduction).
Nitesh Narayan Lal Aug. 13, 2019, 10:30 a.m. UTC | #4
On 8/12/19 4:05 PM, David Hildenbrand wrote:
>>> ---
>>>  include/linux/mmzone.h         |  11 ++
>>>  include/linux/page_reporting.h |  63 +++++++
>>>  mm/Kconfig                     |   6 +
>>>  mm/Makefile                    |   1 +
>>>  mm/page_alloc.c                |  42 ++++-
>>>  mm/page_reporting.c            | 332 +++++++++++++++++++++++++++++++++
>>>  6 files changed, 448 insertions(+), 7 deletions(-)
>>>  create mode 100644 include/linux/page_reporting.h
>>>  create mode 100644 mm/page_reporting.c
>>>
>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>> index d77d717c620c..ba5f5b508f25 100644
>>> --- a/include/linux/mmzone.h
>>> +++ b/include/linux/mmzone.h
>>> @@ -559,6 +559,17 @@ struct zone {
>>>         /* Zone statistics */
>>>         atomic_long_t           vm_stat[NR_VM_ZONE_STAT_ITEMS];
>>>         atomic_long_t           vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
>>> +#ifdef CONFIG_PAGE_REPORTING
>>> +       /* Pointer to the bitmap in PAGE_REPORTING_MIN_ORDER granularity */
>>> +       unsigned long *bitmap;
>>> +       /* Preserve start and end PFN in case they change due to hotplug */
>>> +       unsigned long base_pfn;
>>> +       unsigned long end_pfn;
>>> +       /* Free pages of granularity PAGE_REPORTING_MIN_ORDER */
>>> +       atomic_t free_pages;
>>> +       /* Number of bits required in the bitmap */
>>> +       unsigned long nbits;
>>> +#endif
>>>  } ____cacheline_internodealigned_in_smp;
>> Okay, so the original thing this patch set had going for it was that
>> it was non-invasive. However, now you are adding a bunch of stuff to
>> the zone. That kind of loses the non-invasive argument for this patch
>> set compared to mine.
>>
> Adding something to "struct zone" is certainly less invasive than core
> buddy modifications, just saying (I agree that this is suboptimal. I
> would have guessed that all that's needed is a pointer to some private
> structure here). 


I think having just a pointer to a private structure makes sense here.
If I am not wrong then I can probably make an allocation for it for each
populated zone at the time I enable page reporting.

> However, the migratetype thingy below looks fishy to me.
>
>> If we are going to continue further with this patch set then it might
>> be worth looking into dynamically allocating the space you need for
>> this block. At a minimum you could probably look at making the bitmap
>> an RCU based setup so you could define the base and end along with the
>> bitmap. It would probably help to resolve the hotplug issues you still
>> need to address.
> Yeah, I guess that makes sense.
>
> [...]
>>> +
>>> +static int process_free_page(struct page *page,
>>> +                            struct page_reporting_config *phconf, int count)
>>> +{
>>> +       int mt, order, ret = 0;
>>> +
>>> +       mt = get_pageblock_migratetype(page);
>>> +       order = page_private(page);
>>> +       ret = __isolate_free_page(page, order);
>>> +
> I just started looking into the wonderful world of
> isolation/compaction/migration.
>
> I don't think saving/restoring the migratetype is correct here. AFAIK,
> MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
> movable pages and up in UNMOVABLE or ordinary kernel allocations on
> MOVABLE. So that shouldn't be an issue - I guess.
>
> 1. You should never allocate something that is no
> MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
> CMA here. There should at least be a !is_migrate_isolate_page() check
> somewhere
>
> 2. set_migratetype_isolate() takes the zone lock, so to avoid racing
> with isolation code, you have to hold the zone lock. Your code seems to
> do that, so at least you cannot race against isolation.
>
> 3. You could end up temporarily allocating something in the
> ZONE_MOVABLE. The pages you allocate are, however, not movable. There
> would have to be a way to make alloc_contig_range()/offlining code
> properly wait until the pages have been processed. Not sure about the
> real implications, though - too many details in the code (I wonder if
> Alex' series has a way of dealing with that)
>
> When you restore the migratetype, you could suddenly overwrite e.g.,
> ISOLATE, which feels wrong.


I was triggering an occasional CPU stall bug earlier, with saving and restoring
the migratetype I was able to fix it.
But I will further look into this to figure out if it is really required.

> [...]
>> So as per your comments in the cover page, the two functions above
>> should also probably be plugged into the zone resizing logic somewhere
>> so if a zone is resized the bitmap is adjusted.
>>
>>> +/**
>>> + * zone_reporting_init - For each zone initializes the page reporting fields
>>> + * and allocates the respective bitmap.
>>> + *
>>> + * This function returns 0 on successful initialization, -ENOMEM otherwise.
>>> + */
>>> +static int zone_reporting_init(void)
>>> +{
>>> +       struct zone *zone;
>>> +       int ret;
>>> +
>>> +       for_each_populated_zone(zone) {
>>> +#ifdef CONFIG_ZONE_DEVICE
>>> +               /* we can not report pages which are not in the buddy */
>>> +               if (zone_idx(zone) == ZONE_DEVICE)
>>> +                       continue;
>>> +#endif
>> I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
>> zone will be considered "populated".
>>
> I think you are right (although it's confusing, we will have present
> sections part of a zone but the zone has no present_pages - screams like
> a re factoring - leftover from ZONE_DEVICE introduction).


I think in that case it is safe to have this check here.
What do you guys suggest?


>
David Hildenbrand Aug. 13, 2019, 10:34 a.m. UTC | #5
>>>> +static int process_free_page(struct page *page,
>>>> +                            struct page_reporting_config *phconf, int count)
>>>> +{
>>>> +       int mt, order, ret = 0;
>>>> +
>>>> +       mt = get_pageblock_migratetype(page);
>>>> +       order = page_private(page);
>>>> +       ret = __isolate_free_page(page, order);
>>>> +
>> I just started looking into the wonderful world of
>> isolation/compaction/migration.
>>
>> I don't think saving/restoring the migratetype is correct here. AFAIK,
>> MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
>> movable pages and up in UNMOVABLE or ordinary kernel allocations on
>> MOVABLE. So that shouldn't be an issue - I guess.
>>
>> 1. You should never allocate something that is no
>> MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
>> CMA here. There should at least be a !is_migrate_isolate_page() check
>> somewhere
>>
>> 2. set_migratetype_isolate() takes the zone lock, so to avoid racing
>> with isolation code, you have to hold the zone lock. Your code seems to
>> do that, so at least you cannot race against isolation.
>>
>> 3. You could end up temporarily allocating something in the
>> ZONE_MOVABLE. The pages you allocate are, however, not movable. There
>> would have to be a way to make alloc_contig_range()/offlining code
>> properly wait until the pages have been processed. Not sure about the
>> real implications, though - too many details in the code (I wonder if
>> Alex' series has a way of dealing with that)
>>
>> When you restore the migratetype, you could suddenly overwrite e.g.,
>> ISOLATE, which feels wrong.
> 
> 
> I was triggering an occasional CPU stall bug earlier, with saving and restoring
> the migratetype I was able to fix it.
> But I will further look into this to figure out if it is really required.
> 

You should especially look into handling isolated/cma pages. Maybe that
was the original issue. Alex seems to have added that in his latest
series (skipping isolated/cma pageblocks completely) as well.

>> [...]
>>> So as per your comments in the cover page, the two functions above
>>> should also probably be plugged into the zone resizing logic somewhere
>>> so if a zone is resized the bitmap is adjusted.
>>>
>>>> +/**
>>>> + * zone_reporting_init - For each zone initializes the page reporting fields
>>>> + * and allocates the respective bitmap.
>>>> + *
>>>> + * This function returns 0 on successful initialization, -ENOMEM otherwise.
>>>> + */
>>>> +static int zone_reporting_init(void)
>>>> +{
>>>> +       struct zone *zone;
>>>> +       int ret;
>>>> +
>>>> +       for_each_populated_zone(zone) {
>>>> +#ifdef CONFIG_ZONE_DEVICE
>>>> +               /* we can not report pages which are not in the buddy */
>>>> +               if (zone_idx(zone) == ZONE_DEVICE)
>>>> +                       continue;
>>>> +#endif
>>> I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
>>> zone will be considered "populated".
>>>
>> I think you are right (although it's confusing, we will have present
>> sections part of a zone but the zone has no present_pages - screams like
>> a re factoring - leftover from ZONE_DEVICE introduction).
> 
> 
> I think in that case it is safe to have this check here.
> What do you guys suggest?

If it's not needed, I'd say drop it (eventually add a comment).
Nitesh Narayan Lal Aug. 13, 2019, 10:42 a.m. UTC | #6
On 8/13/19 6:34 AM, David Hildenbrand wrote:
>>>>> +static int process_free_page(struct page *page,
>>>>> +                            struct page_reporting_config *phconf, int count)
>>>>> +{
>>>>> +       int mt, order, ret = 0;
[...]
>>>>> +/**
>>>>> + * zone_reporting_init - For each zone initializes the page reporting fields
>>>>> + * and allocates the respective bitmap.
>>>>> + *
>>>>> + * This function returns 0 on successful initialization, -ENOMEM otherwise.
>>>>> + */
>>>>> +static int zone_reporting_init(void)
>>>>> +{
>>>>> +       struct zone *zone;
>>>>> +       int ret;
>>>>> +
>>>>> +       for_each_populated_zone(zone) {
>>>>> +#ifdef CONFIG_ZONE_DEVICE
>>>>> +               /* we can not report pages which are not in the buddy */
>>>>> +               if (zone_idx(zone) == ZONE_DEVICE)
>>>>> +                       continue;
>>>>> +#endif
>>>> I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
>>>> zone will be considered "populated".
>>>>
>>> I think you are right (although it's confusing, we will have present
>>> sections part of a zone but the zone has no present_pages - screams like
>>> a re factoring - leftover from ZONE_DEVICE introduction).
>>
>> I think in that case it is safe to have this check here.
>> What do you guys suggest?
> If it's not needed, I'd say drop it (eventually add a comment).


Comment to mention that we do not expect a zone with non-buddy page to be
initialized here?

>
>
David Hildenbrand Aug. 13, 2019, 10:44 a.m. UTC | #7
On 13.08.19 12:42, Nitesh Narayan Lal wrote:
> 
> On 8/13/19 6:34 AM, David Hildenbrand wrote:
>>>>>> +static int process_free_page(struct page *page,
>>>>>> +                            struct page_reporting_config *phconf, int count)
>>>>>> +{
>>>>>> +       int mt, order, ret = 0;
> [...]
>>>>>> +/**
>>>>>> + * zone_reporting_init - For each zone initializes the page reporting fields
>>>>>> + * and allocates the respective bitmap.
>>>>>> + *
>>>>>> + * This function returns 0 on successful initialization, -ENOMEM otherwise.
>>>>>> + */
>>>>>> +static int zone_reporting_init(void)
>>>>>> +{
>>>>>> +       struct zone *zone;
>>>>>> +       int ret;
>>>>>> +
>>>>>> +       for_each_populated_zone(zone) {
>>>>>> +#ifdef CONFIG_ZONE_DEVICE
>>>>>> +               /* we can not report pages which are not in the buddy */
>>>>>> +               if (zone_idx(zone) == ZONE_DEVICE)
>>>>>> +                       continue;
>>>>>> +#endif
>>>>> I'm pretty sure this isn't needed since I don't think the ZONE_DEVICE
>>>>> zone will be considered "populated".
>>>>>
>>>> I think you are right (although it's confusing, we will have present
>>>> sections part of a zone but the zone has no present_pages - screams like
>>>> a re factoring - leftover from ZONE_DEVICE introduction).
>>>
>>> I think in that case it is safe to have this check here.
>>> What do you guys suggest?
>> If it's not needed, I'd say drop it (eventually add a comment).
> 
> 
> Comment to mention that we do not expect a zone with non-buddy page to be
> initialized here?

Something along these lines, or something like

/* ZONE_DEVICE is never considered populated */
Alexander Duyck Aug. 13, 2019, 11:14 p.m. UTC | #8
On Tue, Aug 13, 2019 at 3:34 AM David Hildenbrand <david@redhat.com> wrote:
>
> >>>> +static int process_free_page(struct page *page,
> >>>> +                            struct page_reporting_config *phconf, int count)
> >>>> +{
> >>>> +       int mt, order, ret = 0;
> >>>> +
> >>>> +       mt = get_pageblock_migratetype(page);
> >>>> +       order = page_private(page);
> >>>> +       ret = __isolate_free_page(page, order);
> >>>> +
> >> I just started looking into the wonderful world of
> >> isolation/compaction/migration.
> >>
> >> I don't think saving/restoring the migratetype is correct here. AFAIK,
> >> MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
> >> movable pages and up in UNMOVABLE or ordinary kernel allocations on
> >> MOVABLE. So that shouldn't be an issue - I guess.
> >>
> >> 1. You should never allocate something that is no
> >> MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
> >> CMA here. There should at least be a !is_migrate_isolate_page() check
> >> somewhere
> >>
> >> 2. set_migratetype_isolate() takes the zone lock, so to avoid racing
> >> with isolation code, you have to hold the zone lock. Your code seems to
> >> do that, so at least you cannot race against isolation.
> >>
> >> 3. You could end up temporarily allocating something in the
> >> ZONE_MOVABLE. The pages you allocate are, however, not movable. There
> >> would have to be a way to make alloc_contig_range()/offlining code
> >> properly wait until the pages have been processed. Not sure about the
> >> real implications, though - too many details in the code (I wonder if
> >> Alex' series has a way of dealing with that)
> >>
> >> When you restore the migratetype, you could suddenly overwrite e.g.,
> >> ISOLATE, which feels wrong.
> >
> >
> > I was triggering an occasional CPU stall bug earlier, with saving and restoring
> > the migratetype I was able to fix it.
> > But I will further look into this to figure out if it is really required.
> >
>
> You should especially look into handling isolated/cma pages. Maybe that
> was the original issue. Alex seems to have added that in his latest
> series (skipping isolated/cma pageblocks completely) as well.

So as far as skipping isolated pageblocks, I get the reason for
skipping isolated, but why would we need to skip CMA? I had made the
change I did based on comments you had made earlier. But while working
on some of the changes to address isolation better and looking over
several spots in the code it seems like CMA is already being used as
an allocation fallback for MIGRATE_MOVABLE. If that is the case
wouldn't it make sense to allow pulling pages and reporting them while
they are in the free_list?
David Hildenbrand Aug. 14, 2019, 7:07 a.m. UTC | #9
On 14.08.19 01:14, Alexander Duyck wrote:
> On Tue, Aug 13, 2019 at 3:34 AM David Hildenbrand <david@redhat.com> wrote:
>>
>>>>>> +static int process_free_page(struct page *page,
>>>>>> +                            struct page_reporting_config *phconf, int count)
>>>>>> +{
>>>>>> +       int mt, order, ret = 0;
>>>>>> +
>>>>>> +       mt = get_pageblock_migratetype(page);
>>>>>> +       order = page_private(page);
>>>>>> +       ret = __isolate_free_page(page, order);
>>>>>> +
>>>> I just started looking into the wonderful world of
>>>> isolation/compaction/migration.
>>>>
>>>> I don't think saving/restoring the migratetype is correct here. AFAIK,
>>>> MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
>>>> movable pages and up in UNMOVABLE or ordinary kernel allocations on
>>>> MOVABLE. So that shouldn't be an issue - I guess.
>>>>
>>>> 1. You should never allocate something that is no
>>>> MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
>>>> CMA here. There should at least be a !is_migrate_isolate_page() check
>>>> somewhere
>>>>
>>>> 2. set_migratetype_isolate() takes the zone lock, so to avoid racing
>>>> with isolation code, you have to hold the zone lock. Your code seems to
>>>> do that, so at least you cannot race against isolation.
>>>>
>>>> 3. You could end up temporarily allocating something in the
>>>> ZONE_MOVABLE. The pages you allocate are, however, not movable. There
>>>> would have to be a way to make alloc_contig_range()/offlining code
>>>> properly wait until the pages have been processed. Not sure about the
>>>> real implications, though - too many details in the code (I wonder if
>>>> Alex' series has a way of dealing with that)
>>>>
>>>> When you restore the migratetype, you could suddenly overwrite e.g.,
>>>> ISOLATE, which feels wrong.
>>>
>>>
>>> I was triggering an occasional CPU stall bug earlier, with saving and restoring
>>> the migratetype I was able to fix it.
>>> But I will further look into this to figure out if it is really required.
>>>
>>
>> You should especially look into handling isolated/cma pages. Maybe that
>> was the original issue. Alex seems to have added that in his latest
>> series (skipping isolated/cma pageblocks completely) as well.
> 
> So as far as skipping isolated pageblocks, I get the reason for
> skipping isolated, but why would we need to skip CMA? I had made the
> change I did based on comments you had made earlier. But while working
> on some of the changes to address isolation better and looking over
> several spots in the code it seems like CMA is already being used as
> an allocation fallback for MIGRATE_MOVABLE. If that is the case
> wouldn't it make sense to allow pulling pages and reporting them while
> they are in the free_list?

I was assuming that CMA is also to be skipped because "static int
fallbacks[MIGRATE_TYPES][4]" in mm/page_alloc.c doesn't handle CMA at
all, meaning we should never fallback to CMA or from CMA to another type
- at least when stealing pages from another migratetype. So it smells
like MIGRATE_CMA is static -> the area is marked once and will never be
converted to something else (except MIGRATE_ISOLATE temporarily).

I assume you are talking about gfp_to_alloc_flags()/prepare_alloc_pages():

#ifdef CONFIG_CMA
	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
		alloc_flags |= ALLOC_CMA;
#endif

Yeah, this looks like MOVABLE allocations can fallback to CMA
pageblocks. And from what I read, "CMA may use its own migratetype
(MIGRATE_CMA) which behaves similarly to ZONE_MOVABLE but can be put in
arbitrary places."

So I think you are right, it could be that it is safe to temporarily
pull out CMA pages (in contrast to isolated pages) - assuming it is fine
to have temporary unmovable allocations on them (different discussion).

(I am learning about the details as we discuss :) )

The important part would then be to never allocate from the isolated
pageblocks and to never overwrite MIGRATE_ISOLATE.
Nitesh Narayan Lal Aug. 14, 2019, 12:49 p.m. UTC | #10
On 8/14/19 3:07 AM, David Hildenbrand wrote:
> On 14.08.19 01:14, Alexander Duyck wrote:
>> On Tue, Aug 13, 2019 at 3:34 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>>> +static int process_free_page(struct page *page,
>>>>>>> +                            struct page_reporting_config *phconf, int count)
>>>>>>> +{
>>>>>>> +       int mt, order, ret = 0;
>>>>>>> +
>>>>>>> +       mt = get_pageblock_migratetype(page);
>>>>>>> +       order = page_private(page);
>>>>>>> +       ret = __isolate_free_page(page, order);
>>>>>>> +
>>>>> I just started looking into the wonderful world of
>>>>> isolation/compaction/migration.
>>>>>
>>>>> I don't think saving/restoring the migratetype is correct here. AFAIK,
>>>>> MOVABLE/UNMOVABLE/RECLAIMABLE is just a hint, doesn't mean that e.g.,
>>>>> movable pages and up in UNMOVABLE or ordinary kernel allocations on
>>>>> MOVABLE. So that shouldn't be an issue - I guess.
>>>>>
>>>>> 1. You should never allocate something that is no
>>>>> MOVABLE/UNMOVABLE/RECLAIMABLE. Especially not, if you have ISOLATE or
>>>>> CMA here. There should at least be a !is_migrate_isolate_page() check
>>>>> somewhere
>>>>>
>>>>> 2. set_migratetype_isolate() takes the zone lock, so to avoid racing
>>>>> with isolation code, you have to hold the zone lock. Your code seems to
>>>>> do that, so at least you cannot race against isolation.
>>>>>
>>>>> 3. You could end up temporarily allocating something in the
>>>>> ZONE_MOVABLE. The pages you allocate are, however, not movable. There
>>>>> would have to be a way to make alloc_contig_range()/offlining code
>>>>> properly wait until the pages have been processed. Not sure about the
>>>>> real implications, though - too many details in the code (I wonder if
>>>>> Alex' series has a way of dealing with that)
>>>>>
>>>>> When you restore the migratetype, you could suddenly overwrite e.g.,
>>>>> ISOLATE, which feels wrong.
>>>>
>>>> I was triggering an occasional CPU stall bug earlier, with saving and restoring
>>>> the migratetype I was able to fix it.
>>>> But I will further look into this to figure out if it is really required.
>>>>
>>> You should especially look into handling isolated/cma pages. Maybe that
>>> was the original issue. Alex seems to have added that in his latest
>>> series (skipping isolated/cma pageblocks completely) as well.
>> So as far as skipping isolated pageblocks, I get the reason for
>> skipping isolated, but why would we need to skip CMA? I had made the
>> change I did based on comments you had made earlier. But while working
>> on some of the changes to address isolation better and looking over
>> several spots in the code it seems like CMA is already being used as
>> an allocation fallback for MIGRATE_MOVABLE. If that is the case
>> wouldn't it make sense to allow pulling pages and reporting them while
>> they are in the free_list?
> I was assuming that CMA is also to be skipped because "static int
> fallbacks[MIGRATE_TYPES][4]" in mm/page_alloc.c doesn't handle CMA at
> all, meaning we should never fallback to CMA or from CMA to another type
> - at least when stealing pages from another migratetype. So it smells
> like MIGRATE_CMA is static -> the area is marked once and will never be
> converted to something else (except MIGRATE_ISOLATE temporarily).
>
> I assume you are talking about gfp_to_alloc_flags()/prepare_alloc_pages():


I am also trying to look into this to get more understanding of it.
Another thing which I am looking into right now is the difference between
get/set_pcppage_migratetype() and ge/set_pageblock_migratetype().
To an extent, I do understand what is the benefit of using
get/set_pcppage_migratetype() by reading the comments. However, I am not sure
how it gets along with MIGRATE_CMA.
Hopefully, I will have an understanding of it soon.

> #ifdef CONFIG_CMA
> 	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> 		alloc_flags |= ALLOC_CMA;
> #endif
>
> Yeah, this looks like MOVABLE allocations can fallback to CMA
> pageblocks. And from what I read, "CMA may use its own migratetype
> (MIGRATE_CMA) which behaves similarly to ZONE_MOVABLE but can be put in
> arbitrary places."
>
> So I think you are right, it could be that it is safe to temporarily
> pull out CMA pages (in contrast to isolated pages) - assuming it is fine
> to have temporary unmovable allocations on them (different discussion).
>
> (I am learning about the details as we discuss :) )
>
> The important part would then be to never allocate from the isolated
> pageblocks and to never overwrite MIGRATE_ISOLATE.


Agreed. I think I should just avoid isolating pages with migratetype
MIGRATE_ISOLATE.
Adding a check with is_migrate_isolate_page() before isolating the page should
do it.


>
Nitesh Narayan Lal Aug. 14, 2019, 3:49 p.m. UTC | #11
On 8/12/19 2:47 PM, Alexander Duyck wrote:
> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> This patch introduces the core infrastructure for free page reporting in
>> virtual environments. It enables the kernel to track the free pages which
>> can be reported to its hypervisor so that the hypervisor could
>> free and reuse that memory as per its requirement.
>>
>> While the pages are getting processed in the hypervisor (e.g.,
>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
>> would be possible. To avoid such a situation, these pages are
>> temporarily removed from the buddy. The amount of pages removed
>> temporarily from the buddy is governed by the backend(virtio-balloon
>> in our case).
>>
>> To efficiently identify free pages that can to be reported to the
>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>> chunks are reported to the hypervisor - especially, to not break up THP
>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>> in the bitmap are an indication whether a page *might* be free, not a
>> guarantee. A new hook after buddy merging sets the bits.
>>
>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>> asynchronously processes the bitmaps, trying to isolate and report pages
>> that are still free. The backend (virtio-balloon) is responsible for
>> reporting these batched pages to the host synchronously. Once reporting/
>> freeing is complete, isolated pages are returned back to the buddy.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>
[...]
>> +}
>> +
>> +/**
>> + * __page_reporting_enqueue - tracks the freed page in the respective zone's
>> + * bitmap and enqueues a new page reporting job to the workqueue if possible.
>> + */
>> +void __page_reporting_enqueue(struct page *page)
>> +{
>> +       struct page_reporting_config *phconf;
>> +       struct zone *zone;
>> +
>> +       rcu_read_lock();
>> +       /*
>> +        * We should not process this page if either page reporting is not
>> +        * yet completely enabled or it has been disabled by the backend.
>> +        */
>> +       phconf = rcu_dereference(page_reporting_conf);
>> +       if (!phconf)
>> +               return;
>> +
>> +       zone = page_zone(page);
>> +       bitmap_set_bit(page, zone);
>> +
>> +       /*
>> +        * We should not enqueue a job if a previously enqueued reporting work
>> +        * is in progress or we don't have enough free pages in the zone.
>> +        */
>> +       if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
>> +           !atomic_cmpxchg(&phconf->refcnt, 0, 1))
> This doesn't make any sense to me. Why are you only incrementing the
> refcount if it is zero? Combining this with the assignment above, this
> isn't really a refcnt. It is just an oversized bitflag.


The intent for having an extra variable was to ensure that at a time only one
reporting job is enqueued. I do agree that for that purpose I really don't need
a reference counter and I should have used something like bool
'page_hinting_active'. But with bool, I think there could be a possible chance
of race. Maybe I should rename this variable and keep it as atomic.
Any thoughts?


>
> Also I am pretty sure this results in the opportunity to miss pages
> because there is nothing to prevent you from possibly missing a ton of
> pages you could hint on if a large number of pages are pushed out all
> at once and then the system goes idle in terms of memory allocation
> and freeing.


I was looking at how you are enqueuing/processing reporting jobs for each zone.
I am wondering if I should also consider something on similar lines as having
that I might be able to address the concern which you have raised above. But it
would also mean that I have to add an additional flag in the zone_flags. :)

>
[...]
>
>> +EXPORT_SYMBOL_GPL(page_reporting_enable);
>> --
>> 2.21.0
>>
Alexander Duyck Aug. 14, 2019, 4:11 p.m. UTC | #12
On Wed, Aug 14, 2019 at 8:49 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 8/12/19 2:47 PM, Alexander Duyck wrote:
> > On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >> This patch introduces the core infrastructure for free page reporting in
> >> virtual environments. It enables the kernel to track the free pages which
> >> can be reported to its hypervisor so that the hypervisor could
> >> free and reuse that memory as per its requirement.
> >>
> >> While the pages are getting processed in the hypervisor (e.g.,
> >> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
> >> would be possible. To avoid such a situation, these pages are
> >> temporarily removed from the buddy. The amount of pages removed
> >> temporarily from the buddy is governed by the backend(virtio-balloon
> >> in our case).
> >>
> >> To efficiently identify free pages that can to be reported to the
> >> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> >> chunks are reported to the hypervisor - especially, to not break up THP
> >> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> >> in the bitmap are an indication whether a page *might* be free, not a
> >> guarantee. A new hook after buddy merging sets the bits.
> >>
> >> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> >> asynchronously processes the bitmaps, trying to isolate and report pages
> >> that are still free. The backend (virtio-balloon) is responsible for
> >> reporting these batched pages to the host synchronously. Once reporting/
> >> freeing is complete, isolated pages are returned back to the buddy.
> >>
> >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> >
> [...]
> >> +}
> >> +
> >> +/**
> >> + * __page_reporting_enqueue - tracks the freed page in the respective zone's
> >> + * bitmap and enqueues a new page reporting job to the workqueue if possible.
> >> + */
> >> +void __page_reporting_enqueue(struct page *page)
> >> +{
> >> +       struct page_reporting_config *phconf;
> >> +       struct zone *zone;
> >> +
> >> +       rcu_read_lock();
> >> +       /*
> >> +        * We should not process this page if either page reporting is not
> >> +        * yet completely enabled or it has been disabled by the backend.
> >> +        */
> >> +       phconf = rcu_dereference(page_reporting_conf);
> >> +       if (!phconf)
> >> +               return;
> >> +
> >> +       zone = page_zone(page);
> >> +       bitmap_set_bit(page, zone);
> >> +
> >> +       /*
> >> +        * We should not enqueue a job if a previously enqueued reporting work
> >> +        * is in progress or we don't have enough free pages in the zone.
> >> +        */
> >> +       if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
> >> +           !atomic_cmpxchg(&phconf->refcnt, 0, 1))
> > This doesn't make any sense to me. Why are you only incrementing the
> > refcount if it is zero? Combining this with the assignment above, this
> > isn't really a refcnt. It is just an oversized bitflag.
>
>
> The intent for having an extra variable was to ensure that at a time only one
> reporting job is enqueued. I do agree that for that purpose I really don't need
> a reference counter and I should have used something like bool
> 'page_hinting_active'. But with bool, I think there could be a possible chance
> of race. Maybe I should rename this variable and keep it as atomic.
> Any thoughts?

You could just use a bitflag to achieve what you are doing here. That
is the primary use case for many of the test_and_set_bit type
operations. However one issue with doing it as a bitflag is that you
have no way of telling that you took care of all requesters. That is
where having an actual reference count comes in handy as you know
exactly how many zones are requesting to be reported on.

> >
> > Also I am pretty sure this results in the opportunity to miss pages
> > because there is nothing to prevent you from possibly missing a ton of
> > pages you could hint on if a large number of pages are pushed out all
> > at once and then the system goes idle in terms of memory allocation
> > and freeing.
>
>
> I was looking at how you are enqueuing/processing reporting jobs for each zone.
> I am wondering if I should also consider something on similar lines as having
> that I might be able to address the concern which you have raised above. But it
> would also mean that I have to add an additional flag in the zone_flags. :)

You could do it either in the zone or outside the zone as yet another
bitmap. I decided to put the flags inside the zone because there was a
number of free bits there and it should be faster since we were
already using the zone structure.
Nitesh Narayan Lal Aug. 15, 2019, 1:15 p.m. UTC | #13
On 8/14/19 12:11 PM, Alexander Duyck wrote:
> On Wed, Aug 14, 2019 at 8:49 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>
>> On 8/12/19 2:47 PM, Alexander Duyck wrote:
>>> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>> This patch introduces the core infrastructure for free page reporting in
>>>> virtual environments. It enables the kernel to track the free pages which
>>>> can be reported to its hypervisor so that the hypervisor could
>>>> free and reuse that memory as per its requirement.
>>>>
>>>> While the pages are getting processed in the hypervisor (e.g.,
>>>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
>>>> would be possible. To avoid such a situation, these pages are
>>>> temporarily removed from the buddy. The amount of pages removed
>>>> temporarily from the buddy is governed by the backend(virtio-balloon
>>>> in our case).
>>>>
>>>> To efficiently identify free pages that can to be reported to the
>>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>>>> chunks are reported to the hypervisor - especially, to not break up THP
>>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>>>> in the bitmap are an indication whether a page *might* be free, not a
>>>> guarantee. A new hook after buddy merging sets the bits.
>>>>
>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>>>> asynchronously processes the bitmaps, trying to isolate and report pages
>>>> that are still free. The backend (virtio-balloon) is responsible for
>>>> reporting these batched pages to the host synchronously. Once reporting/
>>>> freeing is complete, isolated pages are returned back to the buddy.
>>>>
>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> [...]
>>>> +}
>>>> +
>>>> +/**
>>>> + * __page_reporting_enqueue - tracks the freed page in the respective zone's
>>>> + * bitmap and enqueues a new page reporting job to the workqueue if possible.
>>>> + */
>>>> +void __page_reporting_enqueue(struct page *page)
>>>> +{
>>>> +       struct page_reporting_config *phconf;
>>>> +       struct zone *zone;
>>>> +
>>>> +       rcu_read_lock();
>>>> +       /*
>>>> +        * We should not process this page if either page reporting is not
>>>> +        * yet completely enabled or it has been disabled by the backend.
>>>> +        */
>>>> +       phconf = rcu_dereference(page_reporting_conf);
>>>> +       if (!phconf)
>>>> +               return;
>>>> +
>>>> +       zone = page_zone(page);
>>>> +       bitmap_set_bit(page, zone);
>>>> +
>>>> +       /*
>>>> +        * We should not enqueue a job if a previously enqueued reporting work
>>>> +        * is in progress or we don't have enough free pages in the zone.
>>>> +        */
>>>> +       if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
>>>> +           !atomic_cmpxchg(&phconf->refcnt, 0, 1))
>>> This doesn't make any sense to me. Why are you only incrementing the
>>> refcount if it is zero? Combining this with the assignment above, this
>>> isn't really a refcnt. It is just an oversized bitflag.
>>
>> The intent for having an extra variable was to ensure that at a time only one
>> reporting job is enqueued. I do agree that for that purpose I really don't need
>> a reference counter and I should have used something like bool
>> 'page_hinting_active'. But with bool, I think there could be a possible chance
>> of race. Maybe I should rename this variable and keep it as atomic.
>> Any thoughts?
> You could just use a bitflag to achieve what you are doing here. That
> is the primary use case for many of the test_and_set_bit type
> operations. However one issue with doing it as a bitflag is that you
> have no way of telling that you took care of all requesters.

I think you are right, I might end up missing on certain reporting
opportunities in some special cases. Specifically when the pages which are
part of this new reporting request belongs to a section of the bitmap which
has already been scanned. Although, I have failed to reproduce this kind of
situation in an actual setup.

>  That is
> where having an actual reference count comes in handy as you know
> exactly how many zones are requesting to be reported on.


True.

>
>>> Also I am pretty sure this results in the opportunity to miss pages
>>> because there is nothing to prevent you from possibly missing a ton of
>>> pages you could hint on if a large number of pages are pushed out all
>>> at once and then the system goes idle in terms of memory allocation
>>> and freeing.
>>
>> I was looking at how you are enqueuing/processing reporting jobs for each zone.
>> I am wondering if I should also consider something on similar lines as having
>> that I might be able to address the concern which you have raised above. But it
>> would also mean that I have to add an additional flag in the zone_flags. :)
> You could do it either in the zone or outside the zone as yet another
> bitmap. I decided to put the flags inside the zone because there was a
> number of free bits there and it should be faster since we were
> already using the zone structure.

There are two possibilities which could happen while I am reporting:
1. Another request might come in for a different zone.
2. Another request could come in for the same zone and the pages belong to a
    section of the bitmap which has already been scanned.

Having a per zone flag to indicate reporting status will solve the first
issue and to an extent the second as well. Having refcnt will possibly solve
both of them. What I am wondering about is that in my case I could easily
impact the performance negatively by performing more bitmap scanning.
Nitesh Narayan Lal Aug. 15, 2019, 7:22 p.m. UTC | #14
On 8/15/19 9:15 AM, Nitesh Narayan Lal wrote:
> On 8/14/19 12:11 PM, Alexander Duyck wrote:
>> On Wed, Aug 14, 2019 at 8:49 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>> On 8/12/19 2:47 PM, Alexander Duyck wrote:
>>>> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>>> This patch introduces the core infrastructure for free page reporting in
>>>>> virtual environments. It enables the kernel to track the free pages which
>>>>> can be reported to its hypervisor so that the hypervisor could
>>>>> free and reuse that memory as per its requirement.
>>>>>
>>>>> While the pages are getting processed in the hypervisor (e.g.,
>>>>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
>>>>> would be possible. To avoid such a situation, these pages are
>>>>> temporarily removed from the buddy. The amount of pages removed
>>>>> temporarily from the buddy is governed by the backend(virtio-balloon
>>>>> in our case).
>>>>>
>>>>> To efficiently identify free pages that can to be reported to the
>>>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>>>>> chunks are reported to the hypervisor - especially, to not break up THP
>>>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>>>>> in the bitmap are an indication whether a page *might* be free, not a
>>>>> guarantee. A new hook after buddy merging sets the bits.
>>>>>
>>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>>>>> asynchronously processes the bitmaps, trying to isolate and report pages
>>>>> that are still free. The backend (virtio-balloon) is responsible for
>>>>> reporting these batched pages to the host synchronously. Once reporting/
>>>>> freeing is complete, isolated pages are returned back to the buddy.
>>>>>
>>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>>> [...]
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * __page_reporting_enqueue - tracks the freed page in the respective zone's
>>>>> + * bitmap and enqueues a new page reporting job to the workqueue if possible.
>>>>> + */
>>>>> +void __page_reporting_enqueue(struct page *page)
>>>>> +{
>>>>> +       struct page_reporting_config *phconf;
>>>>> +       struct zone *zone;
>>>>> +
>>>>> +       rcu_read_lock();
>>>>> +       /*
>>>>> +        * We should not process this page if either page reporting is not
>>>>> +        * yet completely enabled or it has been disabled by the backend.
>>>>> +        */
>>>>> +       phconf = rcu_dereference(page_reporting_conf);
>>>>> +       if (!phconf)
>>>>> +               return;
>>>>> +
>>>>> +       zone = page_zone(page);
>>>>> +       bitmap_set_bit(page, zone);
>>>>> +
>>>>> +       /*
>>>>> +        * We should not enqueue a job if a previously enqueued reporting work
>>>>> +        * is in progress or we don't have enough free pages in the zone.
>>>>> +        */
>>>>> +       if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
>>>>> +           !atomic_cmpxchg(&phconf->refcnt, 0, 1))
>>>> This doesn't make any sense to me. Why are you only incrementing the
>>>> refcount if it is zero? Combining this with the assignment above, this
>>>> isn't really a refcnt. It is just an oversized bitflag.
>>> The intent for having an extra variable was to ensure that at a time only one
>>> reporting job is enqueued. I do agree that for that purpose I really don't need
>>> a reference counter and I should have used something like bool
>>> 'page_hinting_active'. But with bool, I think there could be a possible chance
>>> of race. Maybe I should rename this variable and keep it as atomic.
>>> Any thoughts?
>> You could just use a bitflag to achieve what you are doing here. That
>> is the primary use case for many of the test_and_set_bit type
>> operations. However one issue with doing it as a bitflag is that you
>> have no way of telling that you took care of all requesters.
> I think you are right, I might end up missing on certain reporting
> opportunities in some special cases. Specifically when the pages which are
> part of this new reporting request belongs to a section of the bitmap which
> has already been scanned. Although, I have failed to reproduce this kind of
> situation in an actual setup.
>
>>  That is
>> where having an actual reference count comes in handy as you know
>> exactly how many zones are requesting to be reported on.
>
> True.
>
>>>> Also I am pretty sure this results in the opportunity to miss pages
>>>> because there is nothing to prevent you from possibly missing a ton of
>>>> pages you could hint on if a large number of pages are pushed out all
>>>> at once and then the system goes idle in terms of memory allocation
>>>> and freeing.
>>> I was looking at how you are enqueuing/processing reporting jobs for each zone.
>>> I am wondering if I should also consider something on similar lines as having
>>> that I might be able to address the concern which you have raised above. But it
>>> would also mean that I have to add an additional flag in the zone_flags. :)
>> You could do it either in the zone or outside the zone as yet another
>> bitmap. I decided to put the flags inside the zone because there was a
>> number of free bits there and it should be faster since we were
>> already using the zone structure.
> There are two possibilities which could happen while I am reporting:
> 1. Another request might come in for a different zone.
> 2. Another request could come in for the same zone and the pages belong to a
>     section of the bitmap which has already been scanned.
>
> Having a per zone flag to indicate reporting status will solve the first
> issue and to an extent the second as well. Having refcnt will possibly solve
> both of them. What I am wondering about is that in my case I could easily
> impact the performance negatively by performing more bitmap scanning.
>
>

I realized that it may not be possible for me to directly adopt either refcnt
or zone flags just because of the way I have page reporting setup right now.

For now, I will just replace the refcnt with a bitflag as that should work
for most of the cases.  Nevertheless, I will also keep looking for a better way.
Alexander Duyck Aug. 15, 2019, 11 p.m. UTC | #15
On Thu, Aug 15, 2019 at 12:23 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 8/15/19 9:15 AM, Nitesh Narayan Lal wrote:
> > On 8/14/19 12:11 PM, Alexander Duyck wrote:
> >> On Wed, Aug 14, 2019 at 8:49 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>> On 8/12/19 2:47 PM, Alexander Duyck wrote:
> >>>> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>>>> This patch introduces the core infrastructure for free page reporting in
> >>>>> virtual environments. It enables the kernel to track the free pages which
> >>>>> can be reported to its hypervisor so that the hypervisor could
> >>>>> free and reuse that memory as per its requirement.
> >>>>>
> >>>>> While the pages are getting processed in the hypervisor (e.g.,
> >>>>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
> >>>>> would be possible. To avoid such a situation, these pages are
> >>>>> temporarily removed from the buddy. The amount of pages removed
> >>>>> temporarily from the buddy is governed by the backend(virtio-balloon
> >>>>> in our case).
> >>>>>
> >>>>> To efficiently identify free pages that can to be reported to the
> >>>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> >>>>> chunks are reported to the hypervisor - especially, to not break up THP
> >>>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> >>>>> in the bitmap are an indication whether a page *might* be free, not a
> >>>>> guarantee. A new hook after buddy merging sets the bits.
> >>>>>
> >>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> >>>>> asynchronously processes the bitmaps, trying to isolate and report pages
> >>>>> that are still free. The backend (virtio-balloon) is responsible for
> >>>>> reporting these batched pages to the host synchronously. Once reporting/
> >>>>> freeing is complete, isolated pages are returned back to the buddy.
> >>>>>
> >>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> >>> [...]
> >>>>> +}
> >>>>> +
> >>>>> +/**
> >>>>> + * __page_reporting_enqueue - tracks the freed page in the respective zone's
> >>>>> + * bitmap and enqueues a new page reporting job to the workqueue if possible.
> >>>>> + */
> >>>>> +void __page_reporting_enqueue(struct page *page)
> >>>>> +{
> >>>>> +       struct page_reporting_config *phconf;
> >>>>> +       struct zone *zone;
> >>>>> +
> >>>>> +       rcu_read_lock();
> >>>>> +       /*
> >>>>> +        * We should not process this page if either page reporting is not
> >>>>> +        * yet completely enabled or it has been disabled by the backend.
> >>>>> +        */
> >>>>> +       phconf = rcu_dereference(page_reporting_conf);
> >>>>> +       if (!phconf)
> >>>>> +               return;
> >>>>> +
> >>>>> +       zone = page_zone(page);
> >>>>> +       bitmap_set_bit(page, zone);
> >>>>> +
> >>>>> +       /*
> >>>>> +        * We should not enqueue a job if a previously enqueued reporting work
> >>>>> +        * is in progress or we don't have enough free pages in the zone.
> >>>>> +        */
> >>>>> +       if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
> >>>>> +           !atomic_cmpxchg(&phconf->refcnt, 0, 1))
> >>>> This doesn't make any sense to me. Why are you only incrementing the
> >>>> refcount if it is zero? Combining this with the assignment above, this
> >>>> isn't really a refcnt. It is just an oversized bitflag.
> >>> The intent for having an extra variable was to ensure that at a time only one
> >>> reporting job is enqueued. I do agree that for that purpose I really don't need
> >>> a reference counter and I should have used something like bool
> >>> 'page_hinting_active'. But with bool, I think there could be a possible chance
> >>> of race. Maybe I should rename this variable and keep it as atomic.
> >>> Any thoughts?
> >> You could just use a bitflag to achieve what you are doing here. That
> >> is the primary use case for many of the test_and_set_bit type
> >> operations. However one issue with doing it as a bitflag is that you
> >> have no way of telling that you took care of all requesters.
> > I think you are right, I might end up missing on certain reporting
> > opportunities in some special cases. Specifically when the pages which are
> > part of this new reporting request belongs to a section of the bitmap which
> > has already been scanned. Although, I have failed to reproduce this kind of
> > situation in an actual setup.
> >
> >>  That is
> >> where having an actual reference count comes in handy as you know
> >> exactly how many zones are requesting to be reported on.
> >
> > True.
> >
> >>>> Also I am pretty sure this results in the opportunity to miss pages
> >>>> because there is nothing to prevent you from possibly missing a ton of
> >>>> pages you could hint on if a large number of pages are pushed out all
> >>>> at once and then the system goes idle in terms of memory allocation
> >>>> and freeing.
> >>> I was looking at how you are enqueuing/processing reporting jobs for each zone.
> >>> I am wondering if I should also consider something on similar lines as having
> >>> that I might be able to address the concern which you have raised above. But it
> >>> would also mean that I have to add an additional flag in the zone_flags. :)
> >> You could do it either in the zone or outside the zone as yet another
> >> bitmap. I decided to put the flags inside the zone because there was a
> >> number of free bits there and it should be faster since we were
> >> already using the zone structure.
> > There are two possibilities which could happen while I am reporting:
> > 1. Another request might come in for a different zone.
> > 2. Another request could come in for the same zone and the pages belong to a
> >     section of the bitmap which has already been scanned.
> >
> > Having a per zone flag to indicate reporting status will solve the first
> > issue and to an extent the second as well. Having refcnt will possibly solve
> > both of them. What I am wondering about is that in my case I could easily
> > impact the performance negatively by performing more bitmap scanning.
> >
> >
>
> I realized that it may not be possible for me to directly adopt either refcnt
> or zone flags just because of the way I have page reporting setup right now.
>
> For now, I will just replace the refcnt with a bitflag as that should work
> for most of the cases.  Nevertheless, I will also keep looking for a better way.

If nothing else something you could consider is a refcnt for the
number of bits you have set in your bitfield. Then all you would need
to be doing is replace the cmpxchg with just a atomic_fetch_inc and
what you would need to do is have your worker thread track how many
bits it has cleared and subtract that from the refcnt at the end.
Nitesh Narayan Lal Aug. 16, 2019, 6:35 p.m. UTC | #16
On 8/15/19 7:00 PM, Alexander Duyck wrote:
> On Thu, Aug 15, 2019 at 12:23 PM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
[...]
>>>>>>> +}
>>>>>>> +
>>>>>>> +/**
>>>>>>> + * __page_reporting_enqueue - tracks the freed page in the respective zone's
>>>>>>> + * bitmap and enqueues a new page reporting job to the workqueue if possible.
>>>>>>> + */
>>>>>>> +void __page_reporting_enqueue(struct page *page)
>>>>>>> +{
>>>>>>> +       struct page_reporting_config *phconf;
>>>>>>> +       struct zone *zone;
>>>>>>> +
>>>>>>> +       rcu_read_lock();
>>>>>>> +       /*
>>>>>>> +        * We should not process this page if either page reporting is not
>>>>>>> +        * yet completely enabled or it has been disabled by the backend.
>>>>>>> +        */
>>>>>>> +       phconf = rcu_dereference(page_reporting_conf);
>>>>>>> +       if (!phconf)
>>>>>>> +               return;
>>>>>>> +
>>>>>>> +       zone = page_zone(page);
>>>>>>> +       bitmap_set_bit(page, zone);
>>>>>>> +
>>>>>>> +       /*
>>>>>>> +        * We should not enqueue a job if a previously enqueued reporting work
>>>>>>> +        * is in progress or we don't have enough free pages in the zone.
>>>>>>> +        */
>>>>>>> +       if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
>>>>>>> +           !atomic_cmpxchg(&phconf->refcnt, 0, 1))
>>>>>> This doesn't make any sense to me. Why are you only incrementing the
>>>>>> refcount if it is zero? Combining this with the assignment above, this
>>>>>> isn't really a refcnt. It is just an oversized bitflag.
>>>>> The intent for having an extra variable was to ensure that at a time only one
>>>>> reporting job is enqueued. I do agree that for that purpose I really don't need
>>>>> a reference counter and I should have used something like bool
>>>>> 'page_hinting_active'. But with bool, I think there could be a possible chance
>>>>> of race. Maybe I should rename this variable and keep it as atomic.
>>>>> Any thoughts?
>>>> You could just use a bitflag to achieve what you are doing here. That
>>>> is the primary use case for many of the test_and_set_bit type
>>>> operations. However one issue with doing it as a bitflag is that you
>>>> have no way of telling that you took care of all requesters.
>>> I think you are right, I might end up missing on certain reporting
>>> opportunities in some special cases. Specifically when the pages which are
>>> part of this new reporting request belongs to a section of the bitmap which
>>> has already been scanned. Although, I have failed to reproduce this kind of
>>> situation in an actual setup.
>>>
>>>>  That is
>>>> where having an actual reference count comes in handy as you know
>>>> exactly how many zones are requesting to be reported on.
>>> True.
>>>
>>>>>> Also I am pretty sure this results in the opportunity to miss pages
>>>>>> because there is nothing to prevent you from possibly missing a ton of
>>>>>> pages you could hint on if a large number of pages are pushed out all
>>>>>> at once and then the system goes idle in terms of memory allocation
>>>>>> and freeing.
>>>>> I was looking at how you are enqueuing/processing reporting jobs for each zone.
>>>>> I am wondering if I should also consider something on similar lines as having
>>>>> that I might be able to address the concern which you have raised above. But it
>>>>> would also mean that I have to add an additional flag in the zone_flags. :)
>>>> You could do it either in the zone or outside the zone as yet another
>>>> bitmap. I decided to put the flags inside the zone because there was a
>>>> number of free bits there and it should be faster since we were
>>>> already using the zone structure.
>>> There are two possibilities which could happen while I am reporting:
>>> 1. Another request might come in for a different zone.
>>> 2. Another request could come in for the same zone and the pages belong to a
>>>     section of the bitmap which has already been scanned.
>>>
>>> Having a per zone flag to indicate reporting status will solve the first
>>> issue and to an extent the second as well. Having refcnt will possibly solve
>>> both of them. What I am wondering about is that in my case I could easily
>>> impact the performance negatively by performing more bitmap scanning.
>>>
>>>
>> I realized that it may not be possible for me to directly adopt either refcnt
>> or zone flags just because of the way I have page reporting setup right now.
>>
>> For now, I will just replace the refcnt with a bitflag as that should work
>> for most of the cases.  Nevertheless, I will also keep looking for a better way.
> If nothing else something you could consider is a refcnt for the
> number of bits you have set in your bitfield. Then all you would need
> to be doing is replace the cmpxchg with just a atomic_fetch_inc and
> what you would need to do is have your worker thread track how many
> bits it has cleared and subtract that from the refcnt at the end.

Thanks, I will think about this suggestion as well.

Based on your previous suggestion and what you have already proposed in your
series I can think of a way to atleast ensure reporting for zones freeing pages
after getting scanned in wq.
(In case I decide to go ahead with this approach I will mention that this change
is based on your series. Please do let me know if there is a better way to give
credit)

However, a situation where the same zone is reporting pages from the bitmap
section already scanned with zero freeing activity on other zones, may not
be entirely handled.

In any case, I think what I have in my mind will be better than what I have
right now. I will try to implement and test it to see if it can actually work.
Nitesh Narayan Lal Aug. 20, 2019, 2:11 p.m. UTC | #17
On 8/12/19 4:04 PM, Nitesh Narayan Lal wrote:
> On 8/12/19 2:47 PM, Alexander Duyck wrote:
>> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>> This patch introduces the core infrastructure for free page reporting in
>>> virtual environments. It enables the kernel to track the free pages which
>>> can be reported to its hypervisor so that the hypervisor could
>>> free and reuse that memory as per its requirement.
>>>
>>> While the pages are getting processed in the hypervisor (e.g.,
>>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
>>> would be possible. To avoid such a situation, these pages are
>>> temporarily removed from the buddy. The amount of pages removed
>>> temporarily from the buddy is governed by the backend(virtio-balloon
>>> in our case).
>>>
>>> To efficiently identify free pages that can to be reported to the
>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>>> chunks are reported to the hypervisor - especially, to not break up THP
>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>>> in the bitmap are an indication whether a page *might* be free, not a
>>> guarantee. A new hook after buddy merging sets the bits.
>>>
>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>>> asynchronously processes the bitmaps, trying to isolate and report pages
>>> that are still free. The backend (virtio-balloon) is responsible for
>>> reporting these batched pages to the host synchronously. Once reporting/
>>> freeing is complete, isolated pages are returned back to the buddy.
>>>
>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> So if I understand correctly the hotplug support for this is still not
>> included correct? 
>
> That is correct, I have it as an ongoing-item in my cover-email.
> In case, we decide to go with this approach do you think it is a blocker?

I am planning to defer memory hotplug/hotremove support for the future. Due to
following reasons:
* I would like to first get a basic framework ready and merged (in case we
  decide to go ahead with this approach) and then build on top of it.
* Memory hotplug/hotremove is not a primary use case in our mind right now and
  hence I am not considering this as a blocker for the first step.

Following are the items which I intend to address before my next submission:
* Use zone flag and reference counter to track the number of zones requesting
  page reporting.
* Move the bitmap and its respective fields into a structure whose rcu-protected
  pointer object is maintained on a per-zone basis.
* Pick Alexander's patch for page poisoning support and test them with my patch
  set. (@Alexander: I will keep your signed-off for these patches to indicate
  you are the original author, please do let me know there is a better way to
  give credit).
* Address other suggestions/comments received on v12.

Looking forward to any suggestions/comments.


[...]
> +
> +       /* assign the configuration object provided by the backend */
> +       rcu_assign_pointer(page_reporting_conf, phconf);
> +
> +out:
> +       mutex_unlock(&page_reporting_mutex);
> +       return ret;
> +}
> +EXPORT_SYMBOL_GPL(page_reporting_enable);
> --
> 2.21.0
>
Nitesh Narayan Lal Aug. 30, 2019, 3:15 p.m. UTC | #18
On 8/12/19 2:47 PM, Alexander Duyck wrote:
> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> This patch introduces the core infrastructure for free page reporting in
>> virtual environments. It enables the kernel to track the free pages which
>> can be reported to its hypervisor so that the hypervisor could
>> free and reuse that memory as per its requirement.
>>
>> While the pages are getting processed in the hypervisor (e.g.,
>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
>> would be possible. To avoid such a situation, these pages are
>> temporarily removed from the buddy. The amount of pages removed
>> temporarily from the buddy is governed by the backend(virtio-balloon
>> in our case).
>>
>> To efficiently identify free pages that can to be reported to the
>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>> chunks are reported to the hypervisor - especially, to not break up THP
>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>> in the bitmap are an indication whether a page *might* be free, not a
>> guarantee. A new hook after buddy merging sets the bits.
>>
>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>> asynchronously processes the bitmaps, trying to isolate and report pages
>> that are still free. The backend (virtio-balloon) is responsible for
>> reporting these batched pages to the host synchronously. Once reporting/
>> freeing is complete, isolated pages are returned back to the buddy.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
[...]
>> +static void scan_zone_bitmap(struct page_reporting_config *phconf,
>> +                            struct zone *zone)
>> +{
>> +       unsigned long setbit;
>> +       struct page *page;
>> +       int count = 0;
>> +
>> +       sg_init_table(phconf->sg, phconf->max_pages);
>> +
>> +       for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
>> +               /* Process only if the page is still online */
>> +               page = pfn_to_online_page((setbit << PAGE_REPORTING_MIN_ORDER) +
>> +                                         zone->base_pfn);
>> +               if (!page)
>> +                       continue;
>> +
> Shouldn't you be clearing the bit and dropping the reference to
> free_pages before you move on to the next bit? Otherwise you are going
> to be stuck with those aren't you?
>
>> +               spin_lock(&zone->lock);
>> +
>> +               /* Ensure page is still free and can be processed */
>> +               if (PageBuddy(page) && page_private(page) >=
>> +                   PAGE_REPORTING_MIN_ORDER)
>> +                       count = process_free_page(page, phconf, count);
>> +
>> +               spin_unlock(&zone->lock);
> So I kind of wonder just how much overhead you are taking for bouncing
> the zone lock once per page here. Especially since it can result in
> you not actually making any progress since the page may have already
> been reallocated.
>

I am wondering if there is a way to measure this overhead?
After thinking about this, I do understand your point.
One possible way which I can think of to address this is by having a
page_reporting_dequeue() hook somewhere in the allocation path.
    
For some reason, I am not seeing this work as I would have expected
but I don't have solid reasoning to share yet. It could be simply
because I am putting my hook at the wrong place. I will continue
investigating this.

In any case, I may be over complicating things here, so please let me
if there is a better way to do this.

If this overhead is not significant we can probably live with it.
Alexander Duyck Aug. 30, 2019, 3:31 p.m. UTC | #19
On Fri, Aug 30, 2019 at 8:15 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 8/12/19 2:47 PM, Alexander Duyck wrote:
> > On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >> This patch introduces the core infrastructure for free page reporting in
> >> virtual environments. It enables the kernel to track the free pages which
> >> can be reported to its hypervisor so that the hypervisor could
> >> free and reuse that memory as per its requirement.
> >>
> >> While the pages are getting processed in the hypervisor (e.g.,
> >> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
> >> would be possible. To avoid such a situation, these pages are
> >> temporarily removed from the buddy. The amount of pages removed
> >> temporarily from the buddy is governed by the backend(virtio-balloon
> >> in our case).
> >>
> >> To efficiently identify free pages that can to be reported to the
> >> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> >> chunks are reported to the hypervisor - especially, to not break up THP
> >> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> >> in the bitmap are an indication whether a page *might* be free, not a
> >> guarantee. A new hook after buddy merging sets the bits.
> >>
> >> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> >> asynchronously processes the bitmaps, trying to isolate and report pages
> >> that are still free. The backend (virtio-balloon) is responsible for
> >> reporting these batched pages to the host synchronously. Once reporting/
> >> freeing is complete, isolated pages are returned back to the buddy.
> >>
> >> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> [...]
> >> +static void scan_zone_bitmap(struct page_reporting_config *phconf,
> >> +                            struct zone *zone)
> >> +{
> >> +       unsigned long setbit;
> >> +       struct page *page;
> >> +       int count = 0;
> >> +
> >> +       sg_init_table(phconf->sg, phconf->max_pages);
> >> +
> >> +       for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
> >> +               /* Process only if the page is still online */
> >> +               page = pfn_to_online_page((setbit << PAGE_REPORTING_MIN_ORDER) +
> >> +                                         zone->base_pfn);
> >> +               if (!page)
> >> +                       continue;
> >> +
> > Shouldn't you be clearing the bit and dropping the reference to
> > free_pages before you move on to the next bit? Otherwise you are going
> > to be stuck with those aren't you?
> >
> >> +               spin_lock(&zone->lock);
> >> +
> >> +               /* Ensure page is still free and can be processed */
> >> +               if (PageBuddy(page) && page_private(page) >=
> >> +                   PAGE_REPORTING_MIN_ORDER)
> >> +                       count = process_free_page(page, phconf, count);
> >> +
> >> +               spin_unlock(&zone->lock);
> > So I kind of wonder just how much overhead you are taking for bouncing
> > the zone lock once per page here. Especially since it can result in
> > you not actually making any progress since the page may have already
> > been reallocated.
> >
>
> I am wondering if there is a way to measure this overhead?
> After thinking about this, I do understand your point.
> One possible way which I can think of to address this is by having a
> page_reporting_dequeue() hook somewhere in the allocation path.

Really in order to stress this you probably need to have a lot of
CPUs, a lot of memory, and something that forces a lot of pages to get
hit such as the memory shuffling feature.

> For some reason, I am not seeing this work as I would have expected
> but I don't have solid reasoning to share yet. It could be simply
> because I am putting my hook at the wrong place. I will continue
> investigating this.
>
> In any case, I may be over complicating things here, so please let me
> if there is a better way to do this.

I have already been demonstrating the "better way" I think there is to
do this. I will push v7 of it early next week unless there is some
other feedback. By putting the bit in the page and controlling what
comes into and out of the lists it makes most of this quite a bit
easier. The only limitation is you have to modify where things get
placed in the lists so you don't create a "vapor lock" that would
stall the feed of pages into the reporting engine.

> If this overhead is not significant we can probably live with it.

You have bigger issues you still have to overcome as I recall. Didn't
you still need to sort out hotplug and a sparse map with a wide span
in a zone? Without those resolved the bitmap approach is still a no-go
regardless of performance.

- Alex
Nitesh Narayan Lal Aug. 30, 2019, 4:05 p.m. UTC | #20
On 8/30/19 11:31 AM, Alexander Duyck wrote:
> On Fri, Aug 30, 2019 at 8:15 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>
>> On 8/12/19 2:47 PM, Alexander Duyck wrote:
>>> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>>> This patch introduces the core infrastructure for free page reporting in
>>>> virtual environments. It enables the kernel to track the free pages which
>>>> can be reported to its hypervisor so that the hypervisor could
>>>> free and reuse that memory as per its requirement.
>>>>
>>>> While the pages are getting processed in the hypervisor (e.g.,
>>>> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
>>>> would be possible. To avoid such a situation, these pages are
>>>> temporarily removed from the buddy. The amount of pages removed
>>>> temporarily from the buddy is governed by the backend(virtio-balloon
>>>> in our case).
>>>>
>>>> To efficiently identify free pages that can to be reported to the
>>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>>>> chunks are reported to the hypervisor - especially, to not break up THP
>>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
>>>> in the bitmap are an indication whether a page *might* be free, not a
>>>> guarantee. A new hook after buddy merging sets the bits.
>>>>
>>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>>>> asynchronously processes the bitmaps, trying to isolate and report pages
>>>> that are still free. The backend (virtio-balloon) is responsible for
>>>> reporting these batched pages to the host synchronously. Once reporting/
>>>> freeing is complete, isolated pages are returned back to the buddy.
>>>>
>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> [...]
>>>> +static void scan_zone_bitmap(struct page_reporting_config *phconf,
>>>> +                            struct zone *zone)
>>>> +{
>>>> +       unsigned long setbit;
>>>> +       struct page *page;
>>>> +       int count = 0;
>>>> +
>>>> +       sg_init_table(phconf->sg, phconf->max_pages);
>>>> +
>>>> +       for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
>>>> +               /* Process only if the page is still online */
>>>> +               page = pfn_to_online_page((setbit << PAGE_REPORTING_MIN_ORDER) +
>>>> +                                         zone->base_pfn);
>>>> +               if (!page)
>>>> +                       continue;
>>>> +
>>> Shouldn't you be clearing the bit and dropping the reference to
>>> free_pages before you move on to the next bit? Otherwise you are going
>>> to be stuck with those aren't you?
>>>
>>>> +               spin_lock(&zone->lock);
>>>> +
>>>> +               /* Ensure page is still free and can be processed */
>>>> +               if (PageBuddy(page) && page_private(page) >=
>>>> +                   PAGE_REPORTING_MIN_ORDER)
>>>> +                       count = process_free_page(page, phconf, count);
>>>> +
>>>> +               spin_unlock(&zone->lock);
>>> So I kind of wonder just how much overhead you are taking for bouncing
>>> the zone lock once per page here. Especially since it can result in
>>> you not actually making any progress since the page may have already
>>> been reallocated.
>>>
>> I am wondering if there is a way to measure this overhead?
>> After thinking about this, I do understand your point.
>> One possible way which I can think of to address this is by having a
>> page_reporting_dequeue() hook somewhere in the allocation path.
> Really in order to stress this you probably need to have a lot of
> CPUs, a lot of memory, and something that forces a lot of pages to get
> hit such as the memory shuffling feature.

I will think about it, thanks for the suggestion.

>
>> For some reason, I am not seeing this work as I would have expected
>> but I don't have solid reasoning to share yet. It could be simply
>> because I am putting my hook at the wrong place. I will continue
>> investigating this.
>>
>> In any case, I may be over complicating things here, so please let me
>> if there is a better way to do this.
> I have already been demonstrating the "better way" I think there is to
> do this. I will push v7 of it early next week unless there is some
> other feedback. By putting the bit in the page and controlling what
> comes into and out of the lists it makes most of this quite a bit
> easier. The only limitation is you have to modify where things get
> placed in the lists so you don't create a "vapor lock" that would
> stall the feed of pages into the reporting engine.
>
>> If this overhead is not significant we can probably live with it.
> You have bigger issues you still have to overcome as I recall. Didn't
> you still need to sort out hotplu

For memory hotplug, my impression is that it should
not be a blocker for taking the first step (in case we do decide to
go ahead with this approach). Another reason why I am considering
this as future work is that memory hot(un)plug is still under
development and requires fixing. (Specifically, issue such as zone
shrinking which will directly impact the bitmap approach is still
under discussion).

> g and a sparse map with a wide span
> in a zone? Without those resolved the bitmap approach is still a no-go
> regardless of performance.

For sparsity, the memory wastage should not be significant as I
am tracking pages on the granularity of (MAX_ORDER - 2) and maintaining
the bitmaps on a per-zone basis (which was not the case earlier).

However, if you do consider this as a block I will think about it and try to fix it.
In the worst case, if I don't find a solution I will add this as a known limitation
for this approach in my cover.

> - Alex
David Hildenbrand Sept. 4, 2019, 8:40 a.m. UTC | #21
>>> For some reason, I am not seeing this work as I would have expected
>>> but I don't have solid reasoning to share yet. It could be simply
>>> because I am putting my hook at the wrong place. I will continue
>>> investigating this.
>>>
>>> In any case, I may be over complicating things here, so please let me
>>> if there is a better way to do this.
>> I have already been demonstrating the "better way" I think there is to
>> do this. I will push v7 of it early next week unless there is some
>> other feedback. By putting the bit in the page and controlling what
>> comes into and out of the lists it makes most of this quite a bit
>> easier. The only limitation is you have to modify where things get
>> placed in the lists so you don't create a "vapor lock" that would
>> stall the feed of pages into the reporting engine.
>>
>>> If this overhead is not significant we can probably live with it.
>> You have bigger issues you still have to overcome as I recall. Didn't
>> you still need to sort out hotplu
> 
> For memory hotplug, my impression is that it should
> not be a blocker for taking the first step (in case we do decide to
> go ahead with this approach). Another reason why I am considering
> this as future work is that memory hot(un)plug is still under
> development and requires fixing. (Specifically, issue such as zone
> shrinking which will directly impact the bitmap approach is still
> under discussion).

Memory hotplug is way more reliable than memory unplug - however, but
both are used in production. E.g. the zone shrinking you mention is
something that is not "optimal", but works in most scenarios. So both
features are rather in a "production/fixing" stage than a pure
"development" stage.

However, I do agree that memory hot(un)plug is not something
high-priority for free page hinting in the first shot. If it works out
of the box (Alexander's approach) - great - if not it is just one
additional scenario where free page hinting might not be optimal yet
(like vfio where we can't use it completely right now). After all, most
virtual environment (under KVM) I am aware of don't use DIMM-base memory
hot(un)plug at all.

The important part is that it will not break when memory is added/removed.

> 
>> g and a sparse map with a wide span
>> in a zone? Without those resolved the bitmap approach is still a no-go
>> regardless of performance.
> 
> For sparsity, the memory wastage should not be significant as I
> am tracking pages on the granularity of (MAX_ORDER - 2) and maintaining
> the bitmaps on a per-zone basis (which was not the case earlier).

To handle sparsity one would have to have separate bitmaps for each
section. Roughly 64 bit per 128MB section. Scanning the scattered bitmap
would get more expensive. Of course, one could have a hierarchical
bitmap to speed up the search etc.

But again, I think we should have a look how much of an issue sparse
zones/nodes actually are in virtual machines (KVM). The setups I am
aware of minimize sparsity (e.g., QEMU will place DIMMs consecutively in
memory, only aligning to e.g, 128MB on x86 if required). Usually all
memory in VMs is onlined to the NORMAL zone (except special cases of
course). The only "issue" would be if you start mixing DIMMs of
different nodes when hotplugging memory. The overhead for 1bit per 2MB
could be almost neglectable in most setups.

But I do agree that for the bitmap-based approach there might be more
scenarios where you'd rather not want to use free page hinting and
instead disable it.

> 
> However, if you do consider this as a block I will think about it and try to fix it.
> In the worst case, if I don't find a solution I will add this as a known limitation
> for this approach in my cover.
> 
>> - Alex
Alexander Duyck Oct. 10, 2019, 8:36 p.m. UTC | #22
On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>

<snip>

> +static int process_free_page(struct page *page,
> +                            struct page_reporting_config *phconf, int count)
> +{
> +       int mt, order, ret = 0;
> +
> +       mt = get_pageblock_migratetype(page);
> +       order = page_private(page);
> +       ret = __isolate_free_page(page, order);
> +
> +       if (ret) {
> +               /*
> +                * Preserving order and migratetype for reuse while
> +                * releasing the pages back to the buddy.
> +                */
> +               set_pageblock_migratetype(page, mt);
> +               set_page_private(page, order);
> +
> +               sg_set_page(&phconf->sg[count++], page,
> +                           PAGE_SIZE << order, 0);
> +       }
> +
> +       return count;
> +}
> +
> +/**
> + * scan_zone_bitmap - scans the bitmap for the requested zone.
> + * @phconf: page reporting configuration object initialized by the backend.
> + * @zone: zone for which page reporting is requested.
> + *
> + * For every page marked in the bitmap it checks if it is still free if so it
> + * isolates and adds them to a scatterlist. As soon as the number of isolated
> + * pages reach the threshold set by the backend, they are reported to the
> + * hypervisor by the backend. Once the hypervisor responds after processing
> + * they are returned back to the buddy for reuse.
> + */
> +static void scan_zone_bitmap(struct page_reporting_config *phconf,
> +                            struct zone *zone)
> +{
> +       unsigned long setbit;
> +       struct page *page;
> +       int count = 0;
> +
> +       sg_init_table(phconf->sg, phconf->max_pages);
> +
> +       for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
> +               /* Process only if the page is still online */
> +               page = pfn_to_online_page((setbit << PAGE_REPORTING_MIN_ORDER) +
> +                                         zone->base_pfn);
> +               if (!page)
> +                       continue;
> +
> +               spin_lock(&zone->lock);
> +
> +               /* Ensure page is still free and can be processed */
> +               if (PageBuddy(page) && page_private(page) >=
> +                   PAGE_REPORTING_MIN_ORDER)
> +                       count = process_free_page(page, phconf, count);
> +
> +               spin_unlock(&zone->lock);
> +               /* Page has been processed, adjust its bit and zone counter */
> +               clear_bit(setbit, zone->bitmap);
> +               atomic_dec(&zone->free_pages);
> +
> +               if (count == phconf->max_pages) {
> +                       /* Report isolated pages to the hypervisor */
> +                       phconf->report(phconf, count);
> +
> +                       /* Return processed pages back to the buddy */
> +                       return_isolated_page(zone, phconf);
> +
> +                       /* Reset for next reporting */
> +                       sg_init_table(phconf->sg, phconf->max_pages);
> +                       count = 0;
> +               }
> +       }
> +       /*
> +        * If the number of isolated pages does not meet the max_pages
> +        * threshold, we would still prefer to report them as we have already
> +        * isolated them.
> +        */
> +       if (count) {
> +               sg_mark_end(&phconf->sg[count - 1]);
> +               phconf->report(phconf, count);
> +
> +               return_isolated_page(zone, phconf);
> +       }
> +}
> +

So one thing that occurred to me is that this code is missing checks
so that it doesn't try to hint isolated pages. With the bitmap
approach you need an additional check so that you aren't pulling
isolated pages out and reporting them.
Nitesh Narayan Lal Oct. 11, 2019, 11:02 a.m. UTC | #23
On 10/10/19 4:36 PM, Alexander Duyck wrote:
> On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> <snip>
>
>> +static int process_free_page(struct page *page,
>> +                            struct page_reporting_config *phconf, int count)
>> +{
>> +       int mt, order, ret = 0;
>> +
>> +       mt = get_pageblock_migratetype(page);
>> +       order = page_private(page);
>> +       ret = __isolate_free_page(page, order);
>> +
>> +       if (ret) {
>> +               /*
>> +                * Preserving order and migratetype for reuse while
>> +                * releasing the pages back to the buddy.
>> +                */
>> +               set_pageblock_migratetype(page, mt);
>> +               set_page_private(page, order);
>> +
>> +               sg_set_page(&phconf->sg[count++], page,
>> +                           PAGE_SIZE << order, 0);
>> +       }
>> +
>> +       return count;
>> +}
>> +
>> +/**
>> + * scan_zone_bitmap - scans the bitmap for the requested zone.
>> + * @phconf: page reporting configuration object initialized by the backend.
>> + * @zone: zone for which page reporting is requested.
>> + *
>> + * For every page marked in the bitmap it checks if it is still free if so it
>> + * isolates and adds them to a scatterlist. As soon as the number of isolated
>> + * pages reach the threshold set by the backend, they are reported to the
>> + * hypervisor by the backend. Once the hypervisor responds after processing
>> + * they are returned back to the buddy for reuse.
>> + */
>> +static void scan_zone_bitmap(struct page_reporting_config *phconf,
>> +                            struct zone *zone)
>> +{
>> +       unsigned long setbit;
>> +       struct page *page;
>> +       int count = 0;
>> +
>> +       sg_init_table(phconf->sg, phconf->max_pages);
>> +
>> +       for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
>> +               /* Process only if the page is still online */
>> +               page = pfn_to_online_page((setbit << PAGE_REPORTING_MIN_ORDER) +
>> +                                         zone->base_pfn);
>> +               if (!page)
>> +                       continue;
>> +
>> +               spin_lock(&zone->lock);
>> +
>> +               /* Ensure page is still free and can be processed */
>> +               if (PageBuddy(page) && page_private(page) >=
>> +                   PAGE_REPORTING_MIN_ORDER)
>> +                       count = process_free_page(page, phconf, count);
>> +
>> +               spin_unlock(&zone->lock);
>> +               /* Page has been processed, adjust its bit and zone counter */
>> +               clear_bit(setbit, zone->bitmap);
>> +               atomic_dec(&zone->free_pages);
>> +
>> +               if (count == phconf->max_pages) {
>> +                       /* Report isolated pages to the hypervisor */
>> +                       phconf->report(phconf, count);
>> +
>> +                       /* Return processed pages back to the buddy */
>> +                       return_isolated_page(zone, phconf);
>> +
>> +                       /* Reset for next reporting */
>> +                       sg_init_table(phconf->sg, phconf->max_pages);
>> +                       count = 0;
>> +               }
>> +       }
>> +       /*
>> +        * If the number of isolated pages does not meet the max_pages
>> +        * threshold, we would still prefer to report them as we have already
>> +        * isolated them.
>> +        */
>> +       if (count) {
>> +               sg_mark_end(&phconf->sg[count - 1]);
>> +               phconf->report(phconf, count);
>> +
>> +               return_isolated_page(zone, phconf);
>> +       }
>> +}
>> +
> So one thing that occurred to me is that this code is missing checks
> so that it doesn't try to hint isolated pages. With the bitmap
> approach you need an additional check so that you aren't pulling
> isolated pages out and reporting them.

I think you mean that we should not report pages of type MIGRATE_ISOLATE.

The current code on which I am working, I have added the
is_migrate_isolate_page() check
to ensure that I am not processing these pages.
diff mbox series

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d77d717c620c..ba5f5b508f25 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -559,6 +559,17 @@  struct zone {
 	/* Zone statistics */
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 	atomic_long_t		vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
+#ifdef CONFIG_PAGE_REPORTING
+	/* Pointer to the bitmap in PAGE_REPORTING_MIN_ORDER granularity */
+	unsigned long *bitmap;
+	/* Preserve start and end PFN in case they change due to hotplug */
+	unsigned long base_pfn;
+	unsigned long end_pfn;
+	/* Free pages of granularity PAGE_REPORTING_MIN_ORDER */
+	atomic_t free_pages;
+	/* Number of bits required in the bitmap */
+	unsigned long nbits;
+#endif
 } ____cacheline_internodealigned_in_smp;
 
 enum pgdat_flags {
diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
new file mode 100644
index 000000000000..37a39589939d
--- /dev/null
+++ b/include/linux/page_reporting.h
@@ -0,0 +1,63 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PAGE_REPORTING_H
+#define _LINUX_PAGE_REPORTING_H
+
+#define PAGE_REPORTING_MIN_ORDER		(MAX_ORDER - 2)
+#define PAGE_REPORTING_MAX_PAGES		16
+
+#ifdef CONFIG_PAGE_REPORTING
+struct page_reporting_config {
+	/* function to hint batch of isolated pages */
+	void (*report)(struct page_reporting_config *phconf,
+		       unsigned int num_pages);
+
+	/* scatterlist to hold the isolated pages to be hinted */
+	struct scatterlist *sg;
+
+	/*
+	 * Maxmimum pages that are going to be hinted to the hypervisor at a
+	 * time of granularity >= PAGE_REPORTING_MIN_ORDER.
+	 */
+	int max_pages;
+
+	/* work object to process page reporting rqeuests */
+	struct work_struct reporting_work;
+
+	/* tracks the number of reporting request processed at a time */
+	atomic_t refcnt;
+};
+
+void __page_reporting_enqueue(struct page *page);
+void __return_isolated_page(struct zone *zone, struct page *page);
+void set_pageblock_migratetype(struct page *page, int migratetype);
+
+/**
+ * page_reporting_enqueue - checks the eligibility of the freed page based on
+ * its order for further page reporting processing.
+ * @page: page which has been freed.
+ * @order: order for the the free page.
+ */
+static inline void page_reporting_enqueue(struct page *page, int order)
+{
+	if (order < PAGE_REPORTING_MIN_ORDER)
+		return;
+	__page_reporting_enqueue(page);
+}
+
+int page_reporting_enable(struct page_reporting_config *phconf);
+void page_reporting_disable(struct page_reporting_config *phconf);
+#else
+static inline void page_reporting_enqueue(struct page *page, int order)
+{
+}
+
+static inline int page_reporting_enable(struct page_reporting_config *phconf)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void page_reporting_disable(struct page_reporting_config *phconf)
+{
+}
+#endif /* CONFIG_PAGE_REPORTING */
+#endif /* _LINUX_PAGE_REPORTING_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 56cec636a1fc..6a35a9dad350 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,10 @@  config ARCH_HAS_PTE_SPECIAL
 config ARCH_HAS_HUGEPD
 	bool
 
+# PAGE_REPORTING will allow the guest to report the free pages to the
+# host in fixed chunks as soon as a fixed threshold is reached.
+config PAGE_REPORTING
+       bool
+       def_bool n
+       depends on X86_64
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 338e528ad436..6a32cdfa61c2 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -104,3 +104,4 @@  obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 272c6de1bf4e..aa7c89d50c85 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -68,6 +68,7 @@ 
 #include <linux/lockdep.h>
 #include <linux/nmi.h>
 #include <linux/psi.h>
+#include <linux/page_reporting.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -903,7 +904,7 @@  compaction_capture(struct capture_control *capc, struct page *page,
 static inline void __free_one_page(struct page *page,
 		unsigned long pfn,
 		struct zone *zone, unsigned int order,
-		int migratetype)
+		int migratetype, bool needs_reporting)
 {
 	unsigned long combined_pfn;
 	unsigned long uninitialized_var(buddy_pfn);
@@ -1006,7 +1007,8 @@  static inline void __free_one_page(struct page *page,
 				migratetype);
 	else
 		add_to_free_area(page, &zone->free_area[order], migratetype);
-
+	if (needs_reporting)
+		page_reporting_enqueue(page, order);
 }
 
 /*
@@ -1317,7 +1319,7 @@  static void free_pcppages_bulk(struct zone *zone, int count,
 		if (unlikely(isolated_pageblocks))
 			mt = get_pageblock_migratetype(page);
 
-		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
+		__free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
 		trace_mm_page_pcpu_drain(page, 0, mt);
 	}
 	spin_unlock(&zone->lock);
@@ -1326,14 +1328,14 @@  static void free_pcppages_bulk(struct zone *zone, int count,
 static void free_one_page(struct zone *zone,
 				struct page *page, unsigned long pfn,
 				unsigned int order,
-				int migratetype)
+				int migratetype, bool needs_reporting)
 {
 	spin_lock(&zone->lock);
 	if (unlikely(has_isolate_pageblock(zone) ||
 		is_migrate_isolate(migratetype))) {
 		migratetype = get_pfnblock_migratetype(page, pfn);
 	}
-	__free_one_page(page, pfn, zone, order, migratetype);
+	__free_one_page(page, pfn, zone, order, migratetype, needs_reporting);
 	spin_unlock(&zone->lock);
 }
 
@@ -1423,7 +1425,7 @@  static void __free_pages_ok(struct page *page, unsigned int order)
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-	free_one_page(page_zone(page), page, pfn, order, migratetype);
+	free_one_page(page_zone(page), page, pfn, order, migratetype, true);
 	local_irq_restore(flags);
 }
 
@@ -2197,6 +2199,32 @@  struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	return NULL;
 }
 
+#ifdef CONFIG_PAGE_REPORTING
+/**
+ * return_isolated_page - returns a reported page back to the buddy.
+ * @zone: zone from where the page was isolated.
+ * @page: page which will be returned.
+ */
+void __return_isolated_page(struct zone *zone, struct page *page)
+{
+	unsigned int order, mt;
+	unsigned long pfn;
+
+	/* zone lock should be held when this function is called */
+	lockdep_assert_held(&zone->lock);
+
+	mt = get_pageblock_migratetype(page);
+	pfn = page_to_pfn(page);
+
+	if (unlikely(has_isolate_pageblock(zone) || is_migrate_isolate(mt)))
+		mt = get_pfnblock_migratetype(page, pfn);
+
+	order = page_private(page);
+	set_page_private(page, 0);
+
+	__free_one_page(page, pfn, zone, order, mt, false);
+}
+#endif /* CONFIG_PAGE_REPORTING */
 
 /*
  * This array describes the order lists are fallen back to when
@@ -3041,7 +3069,7 @@  static void free_unref_page_commit(struct page *page, unsigned long pfn)
 	 */
 	if (migratetype >= MIGRATE_PCPTYPES) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(zone, page, pfn, 0, migratetype);
+			free_one_page(zone, page, pfn, 0, migratetype, true);
 			return;
 		}
 		migratetype = MIGRATE_MOVABLE;
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
new file mode 100644
index 000000000000..4ee2c172cd9a
--- /dev/null
+++ b/mm/page_reporting.c
@@ -0,0 +1,332 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Page reporting core infrastructure to enable a VM to report free pages to its
+ * hypervisor.
+ *
+ * Copyright Red Hat, Inc. 2019
+ *
+ * Author(s): Nitesh Narayan Lal <nitesh@redhat.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/page_reporting.h>
+#include <linux/scatterlist.h>
+#include "internal.h"
+
+static struct page_reporting_config __rcu *page_reporting_conf __read_mostly;
+static DEFINE_MUTEX(page_reporting_mutex);
+
+static inline unsigned long pfn_to_bit(struct page *page, struct zone *zone)
+{
+	unsigned long bitnr;
+
+	bitnr = (page_to_pfn(page) - zone->base_pfn) >>
+		PAGE_REPORTING_MIN_ORDER;
+
+	return bitnr;
+}
+
+static void return_isolated_page(struct zone *zone,
+				 struct page_reporting_config *phconf)
+{
+	struct scatterlist *sg = phconf->sg;
+
+	spin_lock(&zone->lock);
+	do {
+		__return_isolated_page(zone, sg_page(sg));
+	} while (!sg_is_last(sg++));
+	spin_unlock(&zone->lock);
+}
+
+static void bitmap_set_bit(struct page *page, struct zone *zone)
+{
+	unsigned long bitnr = 0;
+
+	/* zone lock should be held when this function is called */
+	lockdep_assert_held(&zone->lock);
+
+	bitnr = pfn_to_bit(page, zone);
+	/* set bit if it is not already set and is a valid bit */
+	if (zone->bitmap && bitnr < zone->nbits &&
+	    !test_and_set_bit(bitnr, zone->bitmap))
+		atomic_inc(&zone->free_pages);
+}
+
+static int process_free_page(struct page *page,
+			     struct page_reporting_config *phconf, int count)
+{
+	int mt, order, ret = 0;
+
+	mt = get_pageblock_migratetype(page);
+	order = page_private(page);
+	ret = __isolate_free_page(page, order);
+
+	if (ret) {
+		/*
+		 * Preserving order and migratetype for reuse while
+		 * releasing the pages back to the buddy.
+		 */
+		set_pageblock_migratetype(page, mt);
+		set_page_private(page, order);
+
+		sg_set_page(&phconf->sg[count++], page,
+			    PAGE_SIZE << order, 0);
+	}
+
+	return count;
+}
+
+/**
+ * scan_zone_bitmap - scans the bitmap for the requested zone.
+ * @phconf: page reporting configuration object initialized by the backend.
+ * @zone: zone for which page reporting is requested.
+ *
+ * For every page marked in the bitmap it checks if it is still free if so it
+ * isolates and adds them to a scatterlist. As soon as the number of isolated
+ * pages reach the threshold set by the backend, they are reported to the
+ * hypervisor by the backend. Once the hypervisor responds after processing
+ * they are returned back to the buddy for reuse.
+ */
+static void scan_zone_bitmap(struct page_reporting_config *phconf,
+			     struct zone *zone)
+{
+	unsigned long setbit;
+	struct page *page;
+	int count = 0;
+
+	sg_init_table(phconf->sg, phconf->max_pages);
+
+	for_each_set_bit(setbit, zone->bitmap, zone->nbits) {
+		/* Process only if the page is still online */
+		page = pfn_to_online_page((setbit << PAGE_REPORTING_MIN_ORDER) +
+					  zone->base_pfn);
+		if (!page)
+			continue;
+
+		spin_lock(&zone->lock);
+
+		/* Ensure page is still free and can be processed */
+		if (PageBuddy(page) && page_private(page) >=
+		    PAGE_REPORTING_MIN_ORDER)
+			count = process_free_page(page, phconf, count);
+
+		spin_unlock(&zone->lock);
+		/* Page has been processed, adjust its bit and zone counter */
+		clear_bit(setbit, zone->bitmap);
+		atomic_dec(&zone->free_pages);
+
+		if (count == phconf->max_pages) {
+			/* Report isolated pages to the hypervisor */
+			phconf->report(phconf, count);
+
+			/* Return processed pages back to the buddy */
+			return_isolated_page(zone, phconf);
+
+			/* Reset for next reporting */
+			sg_init_table(phconf->sg, phconf->max_pages);
+			count = 0;
+		}
+	}
+	/*
+	 * If the number of isolated pages does not meet the max_pages
+	 * threshold, we would still prefer to report them as we have already
+	 * isolated them.
+	 */
+	if (count) {
+		sg_mark_end(&phconf->sg[count - 1]);
+		phconf->report(phconf, count);
+
+		return_isolated_page(zone, phconf);
+	}
+}
+
+/**
+ * page_reporting_wq - checks the number of free_pages in all the zones and
+ * invokes a request to scan the respective bitmap if free_pages reaches or
+ * exceeds the threshold specified by the backend.
+ */
+static void page_reporting_wq(struct work_struct *work)
+{
+	struct page_reporting_config *phconf =
+		container_of(work, struct page_reporting_config,
+			     reporting_work);
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		if (atomic_read(&zone->free_pages) >= phconf->max_pages)
+			scan_zone_bitmap(phconf, zone);
+	}
+	/*
+	 * We have processed all the zones, we can process new page reporting
+	 * request now.
+	 */
+	atomic_set(&phconf->refcnt, 0);
+}
+
+/**
+ * __page_reporting_enqueue - tracks the freed page in the respective zone's
+ * bitmap and enqueues a new page reporting job to the workqueue if possible.
+ */
+void __page_reporting_enqueue(struct page *page)
+{
+	struct page_reporting_config *phconf;
+	struct zone *zone;
+
+	rcu_read_lock();
+	/*
+	 * We should not process this page if either page reporting is not
+	 * yet completely enabled or it has been disabled by the backend.
+	 */
+	phconf = rcu_dereference(page_reporting_conf);
+	if (!phconf)
+		return;
+
+	zone = page_zone(page);
+	bitmap_set_bit(page, zone);
+
+	/*
+	 * We should not enqueue a job if a previously enqueued reporting work
+	 * is in progress or we don't have enough free pages in the zone.
+	 */
+	if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
+	    !atomic_cmpxchg(&phconf->refcnt, 0, 1))
+		schedule_work(&phconf->reporting_work);
+
+	rcu_read_unlock();
+}
+
+/**
+ * zone_reporting_cleanup - resets the page reporting fields and free the
+ * bitmap for all the initialized zones.
+ */
+static void zone_reporting_cleanup(void)
+{
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		/*
+		 * Bitmap may not be allocated for all the zones if the
+		 * initialization fails before reaching to the last one.
+		 */
+		if (!zone->bitmap)
+			continue;
+		bitmap_free(zone->bitmap);
+		zone->bitmap = NULL;
+		atomic_set(&zone->free_pages, 0);
+	}
+}
+
+static int zone_bitmap_alloc(struct zone *zone)
+{
+	unsigned long bitmap_size, pages;
+
+	pages = zone->end_pfn - zone->base_pfn;
+	bitmap_size = (pages >> PAGE_REPORTING_MIN_ORDER) + 1;
+
+	if (!bitmap_size)
+		return 0;
+
+	zone->bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
+	if (!zone->bitmap)
+		return -ENOMEM;
+
+	zone->nbits = bitmap_size;
+
+	return 0;
+}
+
+/**
+ * zone_reporting_init - For each zone initializes the page reporting fields
+ * and allocates the respective bitmap.
+ *
+ * This function returns 0 on successful initialization, -ENOMEM otherwise.
+ */
+static int zone_reporting_init(void)
+{
+	struct zone *zone;
+	int ret;
+
+	for_each_populated_zone(zone) {
+#ifdef CONFIG_ZONE_DEVICE
+		/* we can not report pages which are not in the buddy */
+		if (zone_idx(zone) == ZONE_DEVICE)
+			continue;
+#endif
+		spin_lock(&zone->lock);
+		zone->base_pfn = zone->zone_start_pfn;
+		zone->end_pfn = zone_end_pfn(zone);
+		spin_unlock(&zone->lock);
+
+		ret = zone_bitmap_alloc(zone);
+		if (ret < 0) {
+			zone_reporting_cleanup();
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+void page_reporting_disable(struct page_reporting_config *phconf)
+{
+	mutex_lock(&page_reporting_mutex);
+
+	if (rcu_access_pointer(page_reporting_conf) != phconf)
+		return;
+
+	RCU_INIT_POINTER(page_reporting_conf, NULL);
+	synchronize_rcu();
+
+	/* Cancel any pending reporting request */
+	cancel_work_sync(&phconf->reporting_work);
+
+	/* Free the scatterlist used for isolated pages */
+	kfree(phconf->sg);
+	phconf->sg = NULL;
+
+	/* Cleanup the bitmaps and old tracking data */
+	zone_reporting_cleanup();
+
+	mutex_unlock(&page_reporting_mutex);
+}
+EXPORT_SYMBOL_GPL(page_reporting_disable);
+
+int page_reporting_enable(struct page_reporting_config *phconf)
+{
+	int ret = 0;
+
+	mutex_lock(&page_reporting_mutex);
+
+	/* check if someone is already using page reporting*/
+	if (rcu_access_pointer(page_reporting_conf)) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	/* allocate scatterlist to hold isolated pages */
+	phconf->sg = kcalloc(phconf->max_pages, sizeof(*phconf->sg),
+			     GFP_KERNEL);
+	if (!phconf->sg) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* initialize each zone's fields required for page reporting */
+	ret = zone_reporting_init();
+	if (ret < 0) {
+		kfree(phconf->sg);
+		goto out;
+	}
+
+	atomic_set(&phconf->refcnt, 0);
+	INIT_WORK(&phconf->reporting_work, page_reporting_wq);
+
+	/* assign the configuration object provided by the backend */
+	rcu_assign_pointer(page_reporting_conf, phconf);
+
+out:
+	mutex_unlock(&page_reporting_mutex);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(page_reporting_enable);