
[v1,5/6] mm: Add logic for separating "aerated" pages from "raw" pages

Message ID 20190619223331.1231.39271.stgit@localhost.localdomain (mailing list archive)
State New, archived
Series mm / virtio: Provide support for paravirtual waste page treatment

Commit Message

Alexander H Duyck June 19, 2019, 10:33 p.m. UTC
From: Alexander Duyck <alexander.h.duyck@linux.intel.com>

Add a set of pointers we shall call "boundary" which represents the upper
boundary between the "raw" and "aerated" pages. The general idea is that in
order for a page to cross from one side of the boundary to the other it
will need to go through the aeration treatment.

By doing this we should be able to make certain that we keep the aerated
pages as one contiguous block at the end of each free list. This will allow
us to efficiently walk the free lists whenever we need to go in and start
processing hints to the hypervisor that the pages are no longer in use.

An added advantage of this approach is that we should reduce the overall
memory footprint of the guest, as it will be more likely to recycle warm
pages than the aerated pages that are likely to be cache cold.

Since we will only be aerating one zone at a time, we keep the boundary
limited to being defined just for the zone we are currently placing aerated
pages into. Doing this we can keep the number of additional pointers needed
quite small.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
---
 include/linux/memory_aeration.h |   57 ++++++++
 include/linux/mmzone.h          |    8 +
 include/linux/page-flags.h      |    8 +
 mm/Makefile                     |    1 
 mm/aeration.c                   |  270 +++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c                 |    4 -
 6 files changed, 347 insertions(+), 1 deletion(-)
 create mode 100644 mm/aeration.c

Comments

Dave Hansen June 25, 2019, 8:24 p.m. UTC | #1
On 6/19/19 3:33 PM, Alexander Duyck wrote:
> Add a set of pointers we shall call "boundary" which represents the upper
> boundary between the "raw" and "aerated" pages. The general idea is that in
> order for a page to cross from one side of the boundary to the other it
> will need to go through the aeration treatment.

Aha!  The mysterious "boundary"!

But, how can you introduce code that deals with boundaries before
introducing the boundary itself?  Or was that comment misplaced?

FWIW, I'm not a fan of these commit messages.  They are really hard to
map to the data structures.

	One goal in this set is to avoid creating new data structures.
	We accomplish that by reusing the free lists to hold aerated and
	non-aerated pages.  But, in order to use the existing free list,
	we need a boundary to separate aerated from raw.

Further:

	Pages are temporarily removed from the free lists while aerating
	them.

This needs a justification why you chose this path, and also what the
larger implications are.

> By doing this we should be able to make certain that we keep the aerated
> pages as one contiguous block on the end of each free list. This will allow
> us to efficiently walk the free lists whenever we need to go in and start
> processing hints to the hypervisor that the pages are no longer in use.

You don't really walk them though, right?  It *keeps* you from having to
ever walk the lists.

I also don't see what the boundary has to do with aerated pages being on
the tail of the list.  If you want them on the tail, you just always
list_add_tail() them.

> And added advantage to this approach is that we should be reducing the
> overall memory footprint of the guest as it will be more likely to recycle
> warm pages versus the aerated pages that are likely to be cache cold.

I'm confused.  Isn't an aerated page non-present on the guest?  That's
worse than cache cold.  It costs a VMEXIT to bring back in.

> Since we will only be aerating one zone at a time we keep the boundary
> limited to being defined for just the zone we are currently placing aerated
> pages into. Doing this we can keep the number of additional poitners needed
> quite small.

							pointers ^

> +struct list_head *__aerator_get_tail(unsigned int order, int migratetype);
>  static inline struct list_head *aerator_get_tail(struct zone *zone,
>  						 unsigned int order,
>  						 int migratetype)
>  {
> +#ifdef CONFIG_AERATION
> +	if (order >= AERATOR_MIN_ORDER &&
> +	    test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
> +		return __aerator_get_tail(order, migratetype);
> +#endif
>  	return &zone->free_area[order].free_list[migratetype];
>  }

Logically, I have no idea what this is doing.  "Go get pages out of the
aerated list?"  "raw list"?  Needs comments.

> +static inline void aerator_del_from_boundary(struct page *page,
> +					     struct zone *zone)
> +{
> +	if (PageAerated(page) && test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
> +		__aerator_del_from_boundary(page, zone);
> +}
> +
>  static inline void set_page_aerated(struct page *page,
>  				    struct zone *zone,
>  				    unsigned int order,
> @@ -28,6 +59,9 @@ static inline void set_page_aerated(struct page *page,
>  	/* record migratetype and flag page as aerated */
>  	set_pcppage_migratetype(page, migratetype);
>  	__SetPageAerated(page);
> +
> +	/* update boundary of new migratetype and record it */
> +	aerator_add_to_boundary(page, zone);
>  #endif
>  }
>  
> @@ -39,11 +73,19 @@ static inline void clear_page_aerated(struct page *page,
>  	if (likely(!PageAerated(page)))
>  		return;
>  
> +	/* push boundary back if we removed the upper boundary */
> +	aerator_del_from_boundary(page, zone);
> +
>  	__ClearPageAerated(page);
>  	area->nr_free_aerated--;
>  #endif
>  }
>  
> +static inline unsigned long aerator_raw_pages(struct free_area *area)
> +{
> +	return area->nr_free - area->nr_free_aerated;
> +}
> +
>  /**
>   * aerator_notify_free - Free page notification that will start page processing
>   * @zone: Pointer to current zone of last page processed
> @@ -57,5 +99,20 @@ static inline void clear_page_aerated(struct page *page,
>   */
>  static inline void aerator_notify_free(struct zone *zone, int order)
>  {
> +#ifdef CONFIG_AERATION
> +	if (!static_key_false(&aerator_notify_enabled))
> +		return;
> +	if (order < AERATOR_MIN_ORDER)
> +		return;
> +	if (test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
> +		return;
> +	if (aerator_raw_pages(&zone->free_area[order]) < AERATOR_HWM)
> +		return;
> +
> +	__aerator_notify(zone);
> +#endif
>  }

Again, this is really hard to review.  I see some possible overhead in a
fast path here, but only if aerator_notify_free() is called in a fast
path.  Is it?  I have to go digging in the previous patches to figure
that out.

> +static struct aerator_dev_info *a_dev_info;
> +struct static_key aerator_notify_enabled;
> +
> +struct list_head *boundary[MAX_ORDER - AERATOR_MIN_ORDER][MIGRATE_TYPES];
> +
> +static void aerator_reset_boundary(struct zone *zone, unsigned int order,
> +				   unsigned int migratetype)
> +{
> +	boundary[order - AERATOR_MIN_ORDER][migratetype] =
> +			&zone->free_area[order].free_list[migratetype];
> +}
> +
> +#define for_each_aerate_migratetype_order(_order, _type) \
> +	for (_order = MAX_ORDER; _order-- != AERATOR_MIN_ORDER;) \
> +		for (_type = MIGRATE_TYPES; _type--;)
> +
> +static void aerator_populate_boundaries(struct zone *zone)
> +{
> +	unsigned int order, mt;
> +
> +	if (test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
> +		return;
> +
> +	for_each_aerate_migratetype_order(order, mt)
> +		aerator_reset_boundary(zone, order, mt);
> +
> +	set_bit(ZONE_AERATION_ACTIVE, &zone->flags);
> +}

This function appears misnamed as it's doing more than boundary
manipulation.

> +struct list_head *__aerator_get_tail(unsigned int order, int migratetype)
> +{
> +	return boundary[order - AERATOR_MIN_ORDER][migratetype];
> +}
> +
> +void __aerator_del_from_boundary(struct page *page, struct zone *zone)
> +{
> +	unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
> +	int mt = get_pcppage_migratetype(page);
> +	struct list_head **tail = &boundary[order][mt];
> +
> +	if (*tail == &page->lru)
> +		*tail = page->lru.next;
> +}

Ewww.  Please just track the page that's the boundary, not the list head
inside the page that's the boundary.

This also at least needs one comment along the lines of: Move the
boundary if the page representing the boundary is being removed.


> +void aerator_add_to_boundary(struct page *page, struct zone *zone)
> +{
> +	unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
> +	int mt = get_pcppage_migratetype(page);
> +	struct list_head **tail = &boundary[order][mt];
> +
> +	*tail = &page->lru;
> +}
> +
> +void aerator_shutdown(void)
> +{
> +	static_key_slow_dec(&aerator_notify_enabled);
> +
> +	while (atomic_read(&a_dev_info->refcnt))
> +		msleep(20);

We generally frown on open-coded check/sleep loops.  What is this for?

> +	WARN_ON(!list_empty(&a_dev_info->batch));
> +
> +	a_dev_info = NULL;
> +}
> +EXPORT_SYMBOL_GPL(aerator_shutdown);
> +
> +static void aerator_schedule_initial_aeration(void)
> +{
> +	struct zone *zone;
> +
> +	for_each_populated_zone(zone) {
> +		spin_lock(&zone->lock);
> +		__aerator_notify(zone);
> +		spin_unlock(&zone->lock);
> +	}
> +}

Why do we need an initial aeration?

> +int aerator_startup(struct aerator_dev_info *sdev)
> +{
> +	if (a_dev_info)
> +		return -EBUSY;
> +
> +	INIT_LIST_HEAD(&sdev->batch);
> +	atomic_set(&sdev->refcnt, 0);
> +
> +	a_dev_info = sdev;
> +	aerator_schedule_initial_aeration();
> +
> +	static_key_slow_inc(&aerator_notify_enabled);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(aerator_startup);
> +
> +static void aerator_fill(struct zone *zone)
> +{
> +	struct list_head *batch = &a_dev_info->batch;
> +	int budget = a_dev_info->capacity;

Where does capacity come from?

> +	unsigned int order, mt;
> +
> +	for_each_aerate_migratetype_order(order, mt) {
> +		struct page *page;
> +
> +		/*
> +		 * Pull pages from free list until we have drained
> +		 * it or we have filled the batch reactor.
> +		 */

What's a reactor?

> +		while ((page = get_aeration_page(zone, order, mt))) {
> +			list_add_tail(&page->lru, batch);
> +
> +			if (!--budget)
> +				return;
> +		}
> +	}
> +
> +	/*
> +	 * If there are no longer enough free pages to fully populate
> +	 * the aerator, then we can just shut it down for this zone.
> +	 */
> +	clear_bit(ZONE_AERATION_REQUESTED, &zone->flags);
> +	atomic_dec(&a_dev_info->refcnt);
> +}

Huh, so this is the number of threads doing aeration?  Didn't we just
make a big deal about there only being one zone being aerated at a time?
 Or, did I misunderstand what refcnt is from its lack of clear
documentation?

> +static void aerator_drain(struct zone *zone)
> +{
> +	struct list_head *list = &a_dev_info->batch;
> +	struct page *page;
> +
> +	/*
> +	 * Drain the now aerated pages back into their respective
> +	 * free lists/areas.
> +	 */
> +	while ((page = list_first_entry_or_null(list, struct page, lru))) {
> +		list_del(&page->lru);
> +		put_aeration_page(zone, page);
> +	}
> +}
> +
> +static void aerator_scrub_zone(struct zone *zone)
> +{
> +	/* See if there are any pages to pull */
> +	if (!test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
> +		return;

How would someone ask for the zone to be scrubbed when aeration has not
been requested?

> +	spin_lock(&zone->lock);
> +
> +	do {
> +		aerator_fill(zone);

Should this say:

		/* Pull pages out of the allocator into a local list */

?

> +		if (list_empty(&a_dev_info->batch))
> +			break;

		/* no pages were acquired, give up */

> +		spin_unlock(&zone->lock);
> +
> +		/*
> +		 * Start aerating the pages in the batch, and then
> +		 * once that is completed we can drain the reactor
> +		 * and refill the reactor, restarting the cycle.
> +		 */
> +		a_dev_info->react(a_dev_info);

After reading (most of) this set, I'm going to reiterate my suggestion:
please find new nomenclature.  I can't parse that comment and I don't
know whether that's because it's a bad comment or whether you really
mean "cycle" the english word or "cycle" referring to some new
definition relating to this patch set.

I've asked quite nicely a few times now.

> +		spin_lock(&zone->lock);
> +
> +		/*
> +		 * Guarantee boundaries are populated before we
> +		 * start placing aerated pages in the zone.
> +		 */
> +		aerator_populate_boundaries(zone);

aerator_populate_boundaries() has apparent concurrency checks via
ZONE_AERATION_ACTIVE.  Why are those needed when this is called under a
spinlock?
Alexander Duyck July 8, 2019, 7:02 p.m. UTC | #2
On Tue, 2019-06-25 at 13:24 -0700, Dave Hansen wrote:
> On 6/19/19 3:33 PM, Alexander Duyck wrote:
> > Add a set of pointers we shall call "boundary" which represents the upper
> > boundary between the "raw" and "aerated" pages. The general idea is that in
> > order for a page to cross from one side of the boundary to the other it
> > will need to go through the aeration treatment.
> 
> Aha!  The mysterious "boundary"!
> 
> But, how can you introduce code that deals with boundaries before
> introducing the boundary itself?  Or was that comment misplaced?

The comment in the earlier patch was misplaced. Basically the logic before
this patch would just add the aerated pages directly to the tail of the
free_list; however, if it had to leave and come back, there was nothing to
prevent us from creating a mess of interleaved "raw" and "aerated" pages.
With this patch we are guaranteed that any "raw" pages are added above the
"aerated" pages and will be pulled for processing.

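[Editor's note: for reference, a minimal sketch of how tail insertions are
routed through the boundary in the earlier patch of this series; the helper
name is from that patch, but treat the body as illustrative:

	static inline void add_to_free_area_tail(struct page *page,
						 struct zone *zone,
						 unsigned int order,
						 int migratetype)
	{
		/*
		 * While aeration is active this returns the boundary, so
		 * the page lands above the aerated block instead of
		 * interleaving with it.
		 */
		struct list_head *tail = aerator_get_tail(zone, order,
							  migratetype);

		list_add_tail(&page->lru, tail);
		zone->free_area[order].nr_free++;
	}
]
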
> FWIW, I'm not a fan of these commit messages.  They are really hard to
> map to the data structures.
> 
> 	One goal in this set is to avoid creating new data structures.
> 	We accomplish that by reusing the free lists to hold aerated and
> 	non-aerated pages.  But, in order to use the existing free list,
> 	we need a boundary to separate aerated from raw.
> 
> Further:
> 
> 	Pages are temporarily removed from the free lists while aerating
> 	them.
> 
> This needs a justification why you chose this path, and also what the
> larger implications are.

Well the big advantage is that we aren't messing with the individual
free_area or free_list structures. My initial implementation added a third
pointer to split the free_list and do the work, and it actually had
performance implications as it increased the size of the free_area and
zone.

> > By doing this we should be able to make certain that we keep the aerated
> > pages as one contiguous block on the end of each free list. This will allow
> > us to efficiently walk the free lists whenever we need to go in and start
> > processing hints to the hypervisor that the pages are no longer in use.
> 
> You don't really walk them though, right?  It *keeps* you from having to
> ever walk the lists.

It all depends on your definition of "walk". In the case of this logic we
will ultimately have to do one pass over all the "raw" pages to process
them, so I consider that a walk through the free_list. However, we can
avoid all of the already processed pages since we have the flag and the
pointer to what should be the top of the list for the "aerated" pages.

> I also don't see what the boundary has to do with aerated pages being on
> the tail of the list.  If you want them on the tail, you just always
> list_add_tail() them.

The issue is that there are multiple things that can add to the tail of
the list. For example the shuffle code or the lower order buddy expecting
its buddy to be freed. In those cases I don't want to add to tail so
instead I am adding those to the boundary. By doing that I can avoid
having the tail of the list becoming interleaved with raw and aerated
pages.

> > And added advantage to this approach is that we should be reducing the
> > overall memory footprint of the guest as it will be more likely to recycle
> > warm pages versus the aerated pages that are likely to be cache cold.
> 
> I'm confused.  Isn't an aerated page non-present on the guest?  That's
> worse than cache cold.  It costs a VMEXIT to bring back in.

I suppose so, it would be worse than being cache cold.

> > Since we will only be aerating one zone at a time we keep the boundary
> > limited to being defined for just the zone we are currently placing aerated
> > pages into. Doing this we can keep the number of additional poitners needed
> > quite small.
> 
> 							pointers ^
> 
> > +struct list_head *__aerator_get_tail(unsigned int order, int migratetype);
> >  static inline struct list_head *aerator_get_tail(struct zone *zone,
> >  						 unsigned int order,
> >  						 int migratetype)
> >  {
> > +#ifdef CONFIG_AERATION
> > +	if (order >= AERATOR_MIN_ORDER &&
> > +	    test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
> > +		return __aerator_get_tail(order, migratetype);
> > +#endif
> >  	return &zone->free_area[order].free_list[migratetype];
> >  }
> 
> Logically, I have no idea what this is doing.  "Go get pages out of the
> aerated list?"  "raw list"?  Needs comments.

I'll add comments. Really now that I think about it I should probably
change the name for this anyway. What is really being returned is the tail
for the non-aerated list. Specifically if ZONE_AERATION_ACTIVE is set we
want to prevent any insertions below the list of aerated pages, so we are
returning the first entry in the aerated list and using that as the
tail/head of a list tail insertion.

Ugh. I really need to go back and name this better.
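[Editor's note: something like the following comments would capture that,
restating the helper above with its behavior unchanged:

	static inline struct list_head *aerator_get_tail(struct zone *zone,
							 unsigned int order,
							 int migratetype)
	{
	#ifdef CONFIG_AERATION
		/*
		 * While aeration is active, return the boundary (the first
		 * aerated entry) so tail insertions land above the aerated
		 * block rather than interleaving with it.
		 */
		if (order >= AERATOR_MIN_ORDER &&
		    test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
			return __aerator_get_tail(order, migratetype);
	#endif
		/* No aeration in progress: use the real free_list tail. */
		return &zone->free_area[order].free_list[migratetype];
	}
]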

> > +static inline void aerator_del_from_boundary(struct page *page,
> > +					     struct zone *zone)
> > +{
> > +	if (PageAerated(page) && test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
> > +		__aerator_del_from_boundary(page, zone);
> > +}
> > +
> >  static inline void set_page_aerated(struct page *page,
> >  				    struct zone *zone,
> >  				    unsigned int order,
> > @@ -28,6 +59,9 @@ static inline void set_page_aerated(struct page *page,
> >  	/* record migratetype and flag page as aerated */
> >  	set_pcppage_migratetype(page, migratetype);
> >  	__SetPageAerated(page);
> > +
> > +	/* update boundary of new migratetype and record it */
> > +	aerator_add_to_boundary(page, zone);
> >  #endif
> >  }
> >  
> > @@ -39,11 +73,19 @@ static inline void clear_page_aerated(struct page *page,
> >  	if (likely(!PageAerated(page)))
> >  		return;
> >  
> > +	/* push boundary back if we removed the upper boundary */
> > +	aerator_del_from_boundary(page, zone);
> > +
> >  	__ClearPageAerated(page);
> >  	area->nr_free_aerated--;
> >  #endif
> >  }
> >  
> > +static inline unsigned long aerator_raw_pages(struct free_area *area)
> > +{
> > +	return area->nr_free - area->nr_free_aerated;
> > +}
> > +
> >  /**
> >   * aerator_notify_free - Free page notification that will start page processing
> >   * @zone: Pointer to current zone of last page processed
> > @@ -57,5 +99,20 @@ static inline void clear_page_aerated(struct page *page,
> >   */
> >  static inline void aerator_notify_free(struct zone *zone, int order)
> >  {
> > +#ifdef CONFIG_AERATION
> > +	if (!static_key_false(&aerator_notify_enabled))
> > +		return;
> > +	if (order < AERATOR_MIN_ORDER)
> > +		return;
> > +	if (test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
> > +		return;
> > +	if (aerator_raw_pages(&zone->free_area[order]) < AERATOR_HWM)
> > +		return;
> > +
> > +	__aerator_notify(zone);
> > +#endif
> >  }
> 
> Again, this is really hard to review.  I see some possible overhead in a
> fast path here, but only if aerator_notify_free() is called in a fast
> path.  Is it?  I have to go digging in the previous patches to figure
> that out.

This is called at the end of __free_one_page().

I tried to limit the impact as much as possible by ordering the checks the
way I did. The order check should limit the impact pretty significantly as
that is the only one that will be triggered for every page, then the
higher order pages are left to deal with the test_bit and
aerator_raw_pages checks.

> > +static struct aerator_dev_info *a_dev_info;
> > +struct static_key aerator_notify_enabled;
> > +
> > +struct list_head *boundary[MAX_ORDER - AERATOR_MIN_ORDER][MIGRATE_TYPES];
> > +
> > +static void aerator_reset_boundary(struct zone *zone, unsigned int order,
> > +				   unsigned int migratetype)
> > +{
> > +	boundary[order - AERATOR_MIN_ORDER][migratetype] =
> > +			&zone->free_area[order].free_list[migratetype];
> > +}
> > +
> > +#define for_each_aerate_migratetype_order(_order, _type) \
> > +	for (_order = MAX_ORDER; _order-- != AERATOR_MIN_ORDER;) \
> > +		for (_type = MIGRATE_TYPES; _type--;)
> > +
> > +static void aerator_populate_boundaries(struct zone *zone)
> > +{
> > +	unsigned int order, mt;
> > +
> > +	if (test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
> > +		return;
> > +
> > +	for_each_aerate_migratetype_order(order, mt)
> > +		aerator_reset_boundary(zone, order, mt);
> > +
> > +	set_bit(ZONE_AERATION_ACTIVE, &zone->flags);
> > +}
> 
> This function appears misnamed as it's doing more than boundary
> manipulation.

The ZONE_AERATION_ACTIVE flag is what is used to indicate that the
boundaries are being tracked. Without that we just fall back to using the
free_list tail.

> > +struct list_head *__aerator_get_tail(unsigned int order, int migratetype)
> > +{
> > +	return boundary[order - AERATOR_MIN_ORDER][migratetype];
> > +}
> > +
> > +void __aerator_del_from_boundary(struct page *page, struct zone *zone)
> > +{
> > +	unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
> > +	int mt = get_pcppage_migratetype(page);
> > +	struct list_head **tail = &boundary[order][mt];
> > +
> > +	if (*tail == &page->lru)
> > +		*tail = page->lru.next;
> > +}
> 
> Ewww.  Please just track the page that's the boundary, not the list head
> inside the page that's the boundary.
> 
> This also at least needs one comment along the lines of: Move the
> boundary if the page representing the boundary is being removed.

So the reason for using the list_head is because we can end up with a
boundary for an empty list. In that case we don't have a page to point to
but just the list_head for the list itself. It actually makes things quite
a bit simpler, otherwise I have to perform extra checks to see if the list
is empty.

I'll work on updating the comments.

> 
> > +void aerator_add_to_boundary(struct page *page, struct zone *zone)
> > +{
> > +	unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
> > +	int mt = get_pcppage_migratetype(page);
> > +	struct list_head **tail = &boundary[order][mt];
> > +
> > +	*tail = &page->lru;
> > +}
> > +
> > +void aerator_shutdown(void)
> > +{
> > +	static_key_slow_dec(&aerator_notify_enabled);
> > +
> > +	while (atomic_read(&a_dev_info->refcnt))
> > +		msleep(20);
> 
> We generally frown on open-coded check/sleep loops.  What is this for?

We are waiting on the aerator to finish processing the list it had active.
With the static key disabled we should see the refcount wind down to 0.
Once that occurs we can safely free the a_dev_info structure since there
will be no other uses of it.

> > +	WARN_ON(!list_empty(&a_dev_info->batch));
> > +
> > +	a_dev_info = NULL;
> > +}
> > +EXPORT_SYMBOL_GPL(aerator_shutdown);
> > +
> > +static void aerator_schedule_initial_aeration(void)
> > +{
> > +	struct zone *zone;
> > +
> > +	for_each_populated_zone(zone) {
> > +		spin_lock(&zone->lock);
> > +		__aerator_notify(zone);
> > +		spin_unlock(&zone->lock);
> > +	}
> > +}
> 
> Why do we need an initial aeration?

This is mostly about avoiding any possible races while we are bringing up
the aerator. If we assume we are just going to start a cycle of aeration
for all zones when the aerator is brought up, it makes it easier to be sure
we have gone through and checked all of the zones after initialization is
complete.

> > +int aerator_startup(struct aerator_dev_info *sdev)
> > +{
> > +	if (a_dev_info)
> > +		return -EBUSY;
> > +
> > +	INIT_LIST_HEAD(&sdev->batch);
> > +	atomic_set(&sdev->refcnt, 0);
> > +
> > +	a_dev_info = sdev;
> > +	aerator_schedule_initial_aeration();
> > +
> > +	static_key_slow_inc(&aerator_notify_enabled);
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(aerator_startup);
> > +
> > +static void aerator_fill(struct zone *zone)
> > +{
> > +	struct list_head *batch = &a_dev_info->batch;
> > +	int budget = a_dev_info->capacity;
> 
> Where does capacity come from?

It is the limit on how many pages we can process at a time. The value is
set in a_dev_info before the call to aerator_startup.

> > +	unsigned int order, mt;
> > +
> > +	for_each_aerate_migratetype_order(order, mt) {
> > +		struct page *page;
> > +
> > +		/*
> > +		 * Pull pages from free list until we have drained
> > +		 * it or we have filled the batch reactor.
> > +		 */
> 
> What's a reactor?

A hold-over from an earlier patch. Basically the batch reactor was the
list containing the pages to be processed. It was a chemistry term in
regards to aeration. I should update that to instead say we have reached
the capacity of the aeration device.

> > +		while ((page = get_aeration_page(zone, order, mt))) {
> > +			list_add_tail(&page->lru, batch);
> > +
> > +			if (!--budget)
> > +				return;
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * If there are no longer enough free pages to fully populate
> > +	 * the aerator, then we can just shut it down for this zone.
> > +	 */
> > +	clear_bit(ZONE_AERATION_REQUESTED, &zone->flags);
> > +	atomic_dec(&a_dev_info->refcnt);
> > +}
> 
> Huh, so this is the number of threads doing aeration?  Didn't we just
> make a big deal about there only being one zone being aerated at a time?
>  Or, did I misunderstand what refcnt is from its lack of clear
> documentation?

The refcnt is the number of zones requesting aeration plus one additional
if the thread is active. We are limited to only having pages from one zone
in the aerator at a time. That is to prevent us from having to maintain
multiple boundaries.
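[Editor's note: for readers without the earlier patches handy, a rough
sketch of the protocol being described; __aerator_notify() itself is
defined in a prior patch of this series, and aerator_work is a placeholder
name (assume a DECLARE_DELAYED_WORK(aerator_work, aerator_cycle)
somewhere):

	void __aerator_notify(struct zone *zone)
	{
		set_bit(ZONE_AERATION_REQUESTED, &zone->flags);

		/*
		 * One reference per zone that has requested aeration;
		 * whichever request takes the count from zero kicks off
		 * the processing thread.
		 */
		if (atomic_fetch_inc(&a_dev_info->refcnt))
			return;

		schedule_delayed_work(&aerator_work, 0);
	}
]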

> > +static void aerator_drain(struct zone *zone)
> > +{
> > +	struct list_head *list = &a_dev_info->batch;
> > +	struct page *page;
> > +
> > +	/*
> > +	 * Drain the now aerated pages back into their respective
> > +	 * free lists/areas.
> > +	 */
> > +	while ((page = list_first_entry_or_null(list, struct page, lru))) {
> > +		list_del(&page->lru);
> > +		put_aeration_page(zone, page);
> > +	}
> > +}
> > +
> > +static void aerator_scrub_zone(struct zone *zone)
> > +{
> > +	/* See if there are any pages to pull */
> > +	if (!test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
> > +		return;
> 
> How would someone ask for the zone to be scrubbed when aeration has not
> been requested?

I'm not sure what you are asking here. Basically this function is called
per zone by aerator_cycle. Now that I think about it, I should probably
swap the names around so that we perform a cycle per zone and just scrub
memory generically.
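
[Editor's note: for context, aerator_cycle() (also from an earlier patch)
has roughly the following shape; this is a simplified assumption, as the
real version also manages a_dev_info->refcnt and rescheduling:

	static void aerator_cycle(struct work_struct *work)
	{
		struct zone *zone;

		/* Visit every zone; aerator_scrub_zone() bails out early
		 * on zones that did not request aeration. */
		for_each_populated_zone(zone)
			aerator_scrub_zone(zone);
	}
]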

> > +	spin_lock(&zone->lock);
> > +
> > +	do {
> > +		aerator_fill(zone);
> 
> Should this say:
> 
> 		/* Pull pages out of the allocator into a local list */
> 
> ?

Yes, we are filling the local list with "raw" pages from the zone.

> 
> > +		if (list_empty(&a_dev_info->batch))
> > +			break;
> 
> 		/* no pages were acquired, give up */

Correct.

> 
> > +		spin_unlock(&zone->lock);
> > +
> > +		/*
> > +		 * Start aerating the pages in the batch, and then
> > +		 * once that is completed we can drain the reactor
> > +		 * and refill the reactor, restarting the cycle.
> > +		 */
> > +		a_dev_info->react(a_dev_info);
> 
> After reading (most of) this set, I'm going to reiterate my suggestion:
> please find new nomenclature.  I can't parse that comment and I don't
> know whether that's because it's a bad comment or whether you really
> mean "cycle" the english word or "cycle" referring to some new
> definition relating to this patch set.
> 
> I've asked quite nicely a few times now.

The "cycle" in this case refers to fill, react, drain, and idle or repeat.

> > +		spin_lock(&zone->lock);
> > +
> > +		/*
> > +		 * Guarantee boundaries are populated before we
> > +		 * start placing aerated pages in the zone.
> > +		 */
> > +		aerator_populate_boundaries(zone);
> 
> aerator_populate_boundaries() has apparent concurrency checks via
> ZONE_AERATION_ACTIVE.  Why are those needed when this is called under a
> spinlock?

I probably could move the spin_lock down. It isn't really needed for the
population of the boundaries, it was needed for the draining of the
aerator into the free_lists.  I'll move the lock, although I might need to
add a smp_mb__before_atomic to make sure that any callers see the boundary
values before they see the updated bit.
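
[Editor's note: concretely, the ordering described above would look
something like this inside aerator_populate_boundaries(); a sketch, not a
committed fix:

	for_each_aerate_migratetype_order(order, mt)
		aerator_reset_boundary(zone, order, mt);

	/*
	 * Make sure the boundary stores above are visible before any
	 * CPU can observe ZONE_AERATION_ACTIVE and start using them.
	 */
	smp_mb__before_atomic();
	set_bit(ZONE_AERATION_ACTIVE, &zone->flags);
]
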
Dave Hansen July 8, 2019, 7:36 p.m. UTC | #3
On 7/8/19 12:02 PM, Alexander Duyck wrote:
> On Tue, 2019-06-25 at 13:24 -0700, Dave Hansen wrote:
>> I also don't see what the boundary has to do with aerated pages being on
>> the tail of the list.  If you want them on the tail, you just always
>> list_add_tail() them.
> 
> The issue is that there are multiple things that can add to the tail of
> the list. For example the shuffle code or the lower order buddy expecting
> its buddy to be freed. In those cases I don't want to add to tail so
> instead I am adding those to the boundary. By doing that I can avoid
> having the tail of the list becoming interleaved with raw and aerated
> pages.

So, it sounds like we've got the following data structure rules:

1. We have one list_head and one list of pages
2. For the purposes of allocation, the list is treated the same as
   before these patches
3. For a "free()", the behavior changes and we now have two "tails":
   3a. Aerated pages are freed into the tail of the list
   3b. Cold pages are freed at the boundary between aerated and non.
       This serves to...  This is also referred to as a "tail".
   3c. Hot pages are never aerated and are still freed into the head
       of the list.

Did I miss any?  Could you please spell it out this way in future
changelogs?


>>> +struct list_head *__aerator_get_tail(unsigned int order, int migratetype);
>>>  static inline struct list_head *aerator_get_tail(struct zone *zone,
>>>  						 unsigned int order,
>>>  						 int migratetype)
>>>  {
>>> +#ifdef CONFIG_AERATION
>>> +	if (order >= AERATOR_MIN_ORDER &&
>>> +	    test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
>>> +		return __aerator_get_tail(order, migratetype);
>>> +#endif
>>>  	return &zone->free_area[order].free_list[migratetype];
>>>  }
>>
>> Logically, I have no idea what this is doing.  "Go get pages out of the
>> aerated list?"  "raw list"?  Needs comments.
> 
> I'll add comments. Really now that I think about it I should probably
> change the name for this anyway. What is really being returned is the tail
> for the non-aerated list. Specifically if ZONE_AERATION_ACTIVE is set we
> want to prevent any insertions below the list of aerated pages, so we are
> returning the first entry in the aerated list and using that as the
> tail/head of a list tail insertion.
> 
> Ugh. I really need to go back and name this better.

OK, so we now have two tails?  One that's called both a boundary and a
tail at different parts of the code?

>>>  static inline void aerator_notify_free(struct zone *zone, int order)
>>>  {
>>> +#ifdef CONFIG_AERATION
>>> +	if (!static_key_false(&aerator_notify_enabled))
>>> +		return;
>>> +	if (order < AERATOR_MIN_ORDER)
>>> +		return;
>>> +	if (test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
>>> +		return;
>>> +	if (aerator_raw_pages(&zone->free_area[order]) < AERATOR_HWM)
>>> +		return;
>>> +
>>> +	__aerator_notify(zone);
>>> +#endif
>>>  }
>>
>> Again, this is really hard to review.  I see some possible overhead in a
>> fast path here, but only if aerator_notify_free() is called in a fast
>> path.  Is it?  I have to go digging in the previous patches to figure
>> that out.
> 
> This is called at the end of __free_one_page().
> 
> I tried to limit the impact as much as possible by ordering the checks the
> way I did. The order check should limit the impact pretty significantly as
> that is the only one that will be triggered for every page, then the
> higher order pages are left to deal with the test_bit and
> aerator_raw_pages checks.

That sounds like a good idea.  But, that good idea is very hard to
distill from the code in the patch.

Imagine if the function started being commented with:

/* Called from a hot path in __free_one_page() */

And said:


	if (!static_key_false(&aerator_notify_enabled))
		return;

	/* Avoid (slow) notifications when no aeration is performed: */
	if (order < AERATOR_MIN_ORDER)
		return;
	if (test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
		return;

	/* Some other relevant comment: */
	if (aerator_raw_pages(&zone->free_area[order]) < AERATOR_HWM)
		return;

	/* This is slow, but should happen very rarely: */
	__aerator_notify(zone);

>>> +static void aerator_populate_boundaries(struct zone *zone)
>>> +{
>>> +	unsigned int order, mt;
>>> +
>>> +	if (test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
>>> +		return;
>>> +
>>> +	for_each_aerate_migratetype_order(order, mt)
>>> +		aerator_reset_boundary(zone, order, mt);
>>> +
>>> +	set_bit(ZONE_AERATION_ACTIVE, &zone->flags);
>>> +}
>>
>> This function appears misnamed as it's doing more than boundary
>> manipulation.
> 
> The ZONE_AERATION_ACTIVE flag is what is used to indicate that the
> boundaries are being tracked. Without that we just fall back to using the
> free_list tail.

Is the flag used for other things?  Or just to indicate that boundaries
are being tracked?

>>> +struct list_head *__aerator_get_tail(unsigned int order, int migratetype)
>>> +{
>>> +	return boundary[order - AERATOR_MIN_ORDER][migratetype];
>>> +}
>>> +
>>> +void __aerator_del_from_boundary(struct page *page, struct zone *zone)
>>> +{
>>> +	unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
>>> +	int mt = get_pcppage_migratetype(page);
>>> +	struct list_head **tail = &boundary[order][mt];
>>> +
>>> +	if (*tail == &page->lru)
>>> +		*tail = page->lru.next;
>>> +}
>>
>> Ewww.  Please just track the page that's the boundary, not the list head
>> inside the page that's the boundary.
>>
>> This also at least needs one comment along the lines of: Move the
>> boundary if the page representing the boundary is being removed.
> 
> So the reason for using the list_head is because we can end up with a
> boundary for an empty list. In that case we don't have a page to point to
> but just the list_head for the list itself. It actually makes things quite
> a bit simpler, otherwise I have to perform extra checks to see if the list
> is empty.

Could you please double-check that keeping a 'struct page *' is truly
more messy?

>>> +void aerator_add_to_boundary(struct page *page, struct zone *zone)
>>> +{
>>> +	unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
>>> +	int mt = get_pcppage_migratetype(page);
>>> +	struct list_head **tail = &boundary[order][mt];
>>> +
>>> +	*tail = &page->lru;
>>> +}
>>> +
>>> +void aerator_shutdown(void)
>>> +{
>>> +	static_key_slow_dec(&aerator_notify_enabled);
>>> +
>>> +	while (atomic_read(&a_dev_info->refcnt))
>>> +		msleep(20);
>>
>> We generally frown on open-coded check/sleep loops.  What is this for?
> 
> We are waiting on the aerator to finish processing the list it had active.
> With the static key disabled we should see the refcount wind down to 0.
> Once that occurs we can safely free the a_dev_info structure since there
> will be no other uses of it.

That's fine, but we still don't open-code sleep loops.  Please remove this.

"Wait until we can free the thing" sounds to me like RCU.  Do you want
to use RCU here?  A synchronize_rcu() call can be a very powerful thing
if the read-side critical sections are amenable to it.
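
[Editor's note: to make that concrete, one shape the RCU variant could
take; a sketch under the assumption that a_dev_info is only ever
dereferenced under rcu_read_lock(), and none of this is in the patch as
posted:

	void aerator_shutdown(void)
	{
		static_key_slow_dec(&aerator_notify_enabled);

		/* Unpublish the device, then wait out any readers that
		 * may still hold a reference obtained under RCU. */
		rcu_assign_pointer(a_dev_info, NULL);
		synchronize_rcu();
	}

with readers doing:

	rcu_read_lock();
	info = rcu_dereference(a_dev_info);
	if (info) {
		/* ... use info ... */
	}
	rcu_read_unlock();
]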

>>> +static void aerator_schedule_initial_aeration(void)
>>> +{
>>> +	struct zone *zone;
>>> +
>>> +	for_each_populated_zone(zone) {
>>> +		spin_lock(&zone->lock);
>>> +		__aerator_notify(zone);
>>> +		spin_unlock(&zone->lock);
>>> +	}
>>> +}
>>
>> Why do we need an initial aeration?
> 
> This is mostly about avoiding any possible races while we are bringing up
> the aerator. If we assume we are just going to start a cycle of aeration
> for all zones when the aerator is brought up, it makes it easier to be sure
> we have gone through and checked all of the zones after initialization is
> complete.

Let me ask a different way:  What will happen if we don't have this?
Will things crash?  Will they be slow?  Do we not know?

>>> +{
>>> +	struct list_head *batch = &a_dev_info->batch;
>>> +	int budget = a_dev_info->capacity;
>>
>> Where does capacity come from?
> 
> It is the limit on how many pages we can process at a time. The value is
> set in a_dev_info before the call to aerator_startup.

Let me ask another way: Does it come from the user?  Or is it
automatically determined by some in-kernel heuristic?

>>> +		while ((page = get_aeration_page(zone, order, mt))) {
>>> +			list_add_tail(&page->lru, batch);
>>> +
>>> +			if (!--budget)
>>> +				return;
>>> +		}
>>> +	}
>>> +
>>> +	/*
>>> +	 * If there are no longer enough free pages to fully populate
>>> +	 * the aerator, then we can just shut it down for this zone.
>>> +	 */
>>> +	clear_bit(ZONE_AERATION_REQUESTED, &zone->flags);
>>> +	atomic_dec(&a_dev_info->refcnt);
>>> +}
>>
>> Huh, so this is the number of threads doing aeration?  Didn't we just
>> make a big deal about there only being one zone being aerated at a time?
>>  Or, did I misunderstand what refcnt is from its lack of clear
>> documentation?
> 
> The refcnt is the number of zones requesting aeration plus one additional
> if the thread is active. We are limited to only having pages from one zone
> in the aerator at a time. That is to prevent us from having to maintain
> multiple boundaries.

That sounds like excellent documentation to add to 'refcnt's definition.

>>> +static void aerator_drain(struct zone *zone)
>>> +{
>>> +	struct list_head *list = &a_dev_info->batch;
>>> +	struct page *page;
>>> +
>>> +	/*
>>> +	 * Drain the now aerated pages back into their respective
>>> +	 * free lists/areas.
>>> +	 */
>>> +	while ((page = list_first_entry_or_null(list, struct page, lru))) {
>>> +		list_del(&page->lru);
>>> +		put_aeration_page(zone, page);
>>> +	}
>>> +}
>>> +
>>> +static void aerator_scrub_zone(struct zone *zone)
>>> +{
>>> +	/* See if there are any pages to pull */
>>> +	if (!test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
>>> +		return;
>>
>> How would someone ask for the zone to be scrubbed when aeration has not
>> been requested?
> 
> I'm not sure what you are asking here. Basically this function is called
> per zone by aerator_cycle. Now that I think about it, I should probably
> swap the names around so that we perform a cycle per zone and just scrub
> memory generically.

It looks like aerator_cycle() calls aerator_scrub_zone() on all zones
all the time.  This is the code responsible for ensuring that we don't
do any aeration work on zones that do not need it.
Alexander Duyck July 8, 2019, 10:02 p.m. UTC | #4
On Mon, 2019-07-08 at 12:36 -0700, Dave Hansen wrote:
> On 7/8/19 12:02 PM, Alexander Duyck wrote:
> > On Tue, 2019-06-25 at 13:24 -0700, Dave Hansen wrote:
> > > I also don't see what the boundary has to do with aerated pages being on
> > > the tail of the list.  If you want them on the tail, you just always
> > > list_add_tail() them.
> > 
> > The issue is that there are multiple things that can add to the tail of
> > the list. For example the shuffle code or the lower order buddy expecting
> > its buddy to be freed. In those cases I don't want to add to tail so
> > instead I am adding those to the boundary. By doing that I can avoid
> > having the tail of the list becoming interleaved with raw and aerated
> > pages.
> 
> So, it sounds like we've got the following data structure rules:
> 
> 1. We have one list_head and one list of pages
> 2. For the purposes of allocation, the list is treated the same as
>    before these patches

So these 2 points are correct.

> 3. For a "free()", the behavior changes and we now have two "tails":
>    3a. Aerated pages are freed into the tail of the list
>    3b. Cold pages are freed at the boundary between aerated and non.
>        This serves to...  This is also referred to as a "tail".
>    3c. Hot pages are never aerated and are still freed into the head
>        of the list.
> 
> Did I miss any?  Could you please spell it out this way in future
> changelogs?

So the logic for 3a and 3b is actually the same location. The difference
is that the boundary pointer will move up to the page in the case of 3a,
and will not move in the case of 3b. That was why I was kind of annoyed
with myself as I was calling it the aerator "tail" when it is really the
head of the aeration list.

So the change I am planning to make in terms of naming is to refer to
__aerator_get_boundary in the function below. Boundary makes more sense in
my mind anyway because it is the head of one list and the tail of the
other.

> 
> > > > +struct list_head *__aerator_get_tail(unsigned int order, int migratetype);
> > > >  static inline struct list_head *aerator_get_tail(struct zone *zone,
> > > >  						 unsigned int order,
> > > >  						 int migratetype)
> > > >  {
> > > > +#ifdef CONFIG_AERATION
> > > > +	if (order >= AERATOR_MIN_ORDER &&
> > > > +	    test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
> > > > +		return __aerator_get_tail(order, migratetype);
> > > > +#endif
> > > >  	return &zone->free_area[order].free_list[migratetype];
> > > >  }
> > > 
> > > Logically, I have no idea what this is doing.  "Go get pages out of the
> > > aerated list?"  "raw list"?  Needs comments.
> > 
> > I'll add comments. Really now that I think about it I should probably
> > change the name for this anyway. What is really being returned is the tail
> > for the non-aerated list. Specifically if ZONE_AERATION_ACTIVE is set we
> > want to prevent any insertions below the list of aerated pages, so we are
> > returning the first entry in the aerated list and using that as the
> > tail/head of a list tail insertion.
> > 
> > Ugh. I really need to go back and name this better.
> 
> OK, so we now have two tails?  One that's called both a boundary and a
> tail at different parts of the code?

Yes, that is the naming issue I was getting at. I would prefer to go with
boundary where I can since it is both a head of one list and the tail of
the other.

I will try to clean this all up before I submit this again.

> > > >  static inline void aerator_notify_free(struct zone *zone, int order)
> > > >  {
> > > > +#ifdef CONFIG_AERATION
> > > > +	if (!static_key_false(&aerator_notify_enabled))
> > > > +		return;
> > > > +	if (order < AERATOR_MIN_ORDER)
> > > > +		return;
> > > > +	if (test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
> > > > +		return;
> > > > +	if (aerator_raw_pages(&zone->free_area[order]) < AERATOR_HWM)
> > > > +		return;
> > > > +
> > > > +	__aerator_notify(zone);
> > > > +#endif
> > > >  }
> > > 
> > > Again, this is really hard to review.  I see some possible overhead in a
> > > fast path here, but only if aerator_notify_free() is called in a fast
> > > path.  Is it?  I have to go digging in the previous patches to figure
> > > that out.
> > 
> > This is called at the end of __free_one_page().
> > 
> > I tried to limit the impact as much as possible by ordering the checks the
> > way I did. The order check should limit the impact pretty significantly as
> > that is the only one that will be triggered for every page, then the
> > higher order pages are left to deal with the test_bit and
> > aerator_raw_pages checks.
> 
> That sounds like a good idea.  But, that good idea is very hard to
> distill from the code in the patch.
> 
> Imagine if the function stared being commented with:
> 
> /* Called from a hot path in __free_one_page() */
> 
> And said:
> 
> 
> 	if (!static_key_false(&aerator_notify_enabled))
> 		return;
> 
> 	/* Avoid (slow) notifications when no aeration is performed: */
> 	if (order < AERATOR_MIN_ORDER)
> 		return;
> 	if (test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
> 		return;
> 
> 	/* Some other relevant comment: */
> 	if (aerator_raw_pages(&zone->free_area[order]) < AERATOR_HWM)
> 		return;
> 
> 	/* This is slow, but should happen very rarely: */
> 	__aerator_notify(zone);
> 

I'll go through and work on cleaning up the comments.

> > > > +static void aerator_populate_boundaries(struct zone *zone)
> > > > +{
> > > > +	unsigned int order, mt;
> > > > +
> > > > +	if (test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
> > > > +		return;
> > > > +
> > > > +	for_each_aerate_migratetype_order(order, mt)
> > > > +		aerator_reset_boundary(zone, order, mt);
> > > > +
> > > > +	set_bit(ZONE_AERATION_ACTIVE, &zone->flags);
> > > > +}
> > > 
> > > This function appears misnamed as it's doing more than boundary
> > > manipulation.
> > 
> > The ZONE_AERATION_ACTIVE flag is what is used to indicate that the
> > boundaries are being tracked. Without that we just fall back to using the
> > free_list tail.
> 
> Is the flag used for other things?  Or just to indicate that boundaries
> are being tracked?

Just the boundaries. It gets set before the first time we have to flush
out a batch of pages, and is cleared after we have determined that there
are no longer any pages to pull and our local list is empty.

> > > > +struct list_head *__aerator_get_tail(unsigned int order, int migratetype)
> > > > +{
> > > > +	return boundary[order - AERATOR_MIN_ORDER][migratetype];
> > > > +}
> > > > +
> > > > +void __aerator_del_from_boundary(struct page *page, struct zone *zone)
> > > > +{
> > > > +	unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
> > > > +	int mt = get_pcppage_migratetype(page);
> > > > +	struct list_head **tail = &boundary[order][mt];
> > > > +
> > > > +	if (*tail == &page->lru)
> > > > +		*tail = page->lru.next;
> > > > +}
> > > 
> > > Ewww.  Please just track the page that's the boundary, not the list head
> > > inside the page that's the boundary.
> > > 
> > > This also at least needs one comment along the lines of: Move the
> > > boundary if the page representing the boundary is being removed.
> > 
> > So the reason for using the list_head is because we can end up with a
> > boundary for an empty list. In that case we don't have a page to point to
> > but just the list_head for the list itself. It actually makes things quite
> > a bit simpler, otherwise I have to perform extra checks to see if the list
> > is empty.
> 
> Could you please double-check that keeping a 'struct page *' is truly
> more messy?

Well there are a few places I am using this where using a page pointer
would be an issue.

1. add_to_free_area_tail
      Using a page pointer here would be difficult since we are adding a
      page to a list, not to another page.
2. aerator_populate_boundaries
      We were initializing the boundary to the list head for each of the
      free_lists that we could possibly be placing pages into. Translating
      to a page would require additional overhead.
3. __aerator_del_from_boundary
      What we can end up with here, if we aren't careful, is a page pointer
      that doesn't actually point to a page, in the case where the
      free_list is empty.

In my mind, in order to handle this correctly I would have to start using
NULL when the list is empty, and add a check to __aerator_del_from_boundary
that grabs the free_list for the page and tests against the head of the
free list, to make certain that removing the page will not leave us
pointing at something that isn't a page.
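
[Editor's note: roughly, the extra handling a struct page boundary would
need on the delete side looks like the following; purely illustrative, and
boundary_page is a hypothetical replacement for the list_head array:

	void __aerator_del_from_boundary(struct page *page, struct zone *zone)
	{
		unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
		int mt = get_pcppage_migratetype(page);
		struct page **bpage = &boundary_page[order][mt];

		if (*bpage != page)
			return;

		/*
		 * If the page being removed was the last aerated entry,
		 * page->lru.next is the free_list head itself, not a page.
		 */
		if (page->lru.next ==
		    &zone->free_area[page_private(page)].free_list[mt])
			*bpage = NULL;
		else
			*bpage = list_entry(page->lru.next, struct page, lru);
	}
]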


> > > > +void aerator_add_to_boundary(struct page *page, struct zone *zone)
> > > > +{
> > > > +	unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
> > > > +	int mt = get_pcppage_migratetype(page);
> > > > +	struct list_head **tail = &boundary[order][mt];
> > > > +
> > > > +	*tail = &page->lru;
> > > > +}
> > > > +
> > > > +void aerator_shutdown(void)
> > > > +{
> > > > +	static_key_slow_dec(&aerator_notify_enabled);
> > > > +
> > > > +	while (atomic_read(&a_dev_info->refcnt))
> > > > +		msleep(20);
> > > 
> > > We generally frown on open-coded check/sleep loops.  What is this for?
> > 
> > We are waiting on the aerator to finish processing the list it had active.
> > With the static key disabled we should see the refcount wind down to 0.
> > Once that occurs we can safely free the a_dev_info structure since there
> > will be no other uses of it.
> 
> That's fine, but we still don't open-code sleep loops.  Please remove this.
> 
> "Wait until we can free the thing" sounds to me like RCU.  Do you want
> to use RCU here?  A synchronize_rcu() call can be a very powerful thing
> if the read-side critical sections are amenable to it.

So the issue is I am not entirely sure RCU would be a good fit here. Now I
could handle the __aerator_notify call via an RCU setup, however the call
to aerator_cycle probably wouldn't work well with it since it would be
holding onto a_dev_info for an extended period of time and we wouldn't
want to stall RCU out because the system is busy aerating a big section of
memory.

I'll have to think about this some more. As it currently stands I don't
think this completely solves what it is meant to anyway since I think it
is possible to race and end up with a scenario where another CPU might be
able to get past the static key check before we disable it, and then we
could free a_dev_info before it has a chance to take a reference to it.

> > > > +static void aerator_schedule_initial_aeration(void)
> > > > +{
> > > > +	struct zone *zone;
> > > > +
> > > > +	for_each_populated_zone(zone) {
> > > > +		spin_lock(&zone->lock);
> > > > +		__aerator_notify(zone);
> > > > +		spin_unlock(&zone->lock);
> > > > +	}
> > > > +}
> > > 
> > > Why do we need an initial aeration?
> > 
> > This is mostly about avoiding any possible races while we are bringing up
> > the aerator. If we assume we are just going to start a cycle of aeration
> > for all zones when the aerator is brought up, it makes it easier to be sure
> > we have gone through and checked all of the zones after initialization is
> > complete.
> 
> Let me ask a different way:  What will happen if we don't have this?
> Will things crash?  Will they be slow?  Do we not know?

I wouldn't expect any crashes. We may just not end up with the memory
being freed for some time if all the pages are freed before the aerator
device is registered, and there isn't any memory activity after that.

This was mostly about just making sure we flush the memory after the
device has been initialized.

> > > > +{
> > > > +	struct list_head *batch = &a_dev_info->batch;
> > > > +	int budget = a_dev_info->capacity;
> > > 
> > > Where does capacity come from?
> > 
> > It is the limit on how many pages we can process at a time. The value is
> > set in a_dev_info before the call to aerator_startup.
> 
> Let me ask another way: Does it come from the user?  Or is it
> automatically determined by some in-kernel heuristic?

It is being provided by the module that registers the aeration device. In
patch 6 of the series we determined that we wanted to process 32 pages at
a time, so we set that as the limit since that is the number of hints we
had allocated in the virtio-balloon driver.
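
[Editor's note: in other words, registration on the driver side looks
roughly like this; function and variable names here are hypothetical:

	static void my_aerator_react(struct aerator_dev_info *a_dev_info)
	{
		/*
		 * Hint the pages linked on a_dev_info->batch to the
		 * hypervisor, then return; the core drains and refills
		 * the batch for us.
		 */
	}

	static struct aerator_dev_info my_a_dev_info = {
		.react    = my_aerator_react,
		.capacity = 32,	/* max pages per batch */
	};

	static int my_driver_init(void)
	{
		/* batch and refcnt are initialized by the core */
		return aerator_startup(&my_a_dev_info);
	}
]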

> > > > +		while ((page = get_aeration_page(zone, order, mt))) {
> > > > +			list_add_tail(&page->lru, batch);
> > > > +
> > > > +			if (!--budget)
> > > > +				return;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * If there are no longer enough free pages to fully populate
> > > > +	 * the aerator, then we can just shut it down for this zone.
> > > > +	 */
> > > > +	clear_bit(ZONE_AERATION_REQUESTED, &zone->flags);
> > > > +	atomic_dec(&a_dev_info->refcnt);
> > > > +}
> > > 
> > > Huh, so this is the number of threads doing aeration?  Didn't we just
> > > make a big deal about there only being one zone being aerated at a time?
> > >  Or, did I misunderstand what refcnt is from its lack of clear
> > > documentation?
> > 
> > The refcnt is the number of zones requesting aeration plus one additional
> > if the thread is active. We are limited to only having pages from one zone
> > in the aerator at a time. That is to prevent us from having to maintain
> > multiple boundaries.
> 
> That sounds like excellent documentation to add to 'refcnt's definition.

Will do.

> > > > +static void aerator_drain(struct zone *zone)
> > > > +{
> > > > +	struct list_head *list = &a_dev_info->batch;
> > > > +	struct page *page;
> > > > +
> > > > +	/*
> > > > +	 * Drain the now aerated pages back into their respective
> > > > +	 * free lists/areas.
> > > > +	 */
> > > > +	while ((page = list_first_entry_or_null(list, struct page, lru))) {
> > > > +		list_del(&page->lru);
> > > > +		put_aeration_page(zone, page);
> > > > +	}
> > > > +}
> > > > +
> > > > +static void aerator_scrub_zone(struct zone *zone)
> > > > +{
> > > > +	/* See if there are any pages to pull */
> > > > +	if (!test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
> > > > +		return;
> > > 
> > > How would someone ask for the zone to be scrubbed when aeration has not
> > > been requested?
> > 
> > I'm not sure what you are asking here. Basically this function is called
> > per zone by aerator_cycle. Now that I think about it, I should probably
> > swap the names around so that we perform a cycle per zone and just scrub
> > memory generically.
> 
> It looks like aerator_cycle() calls aerator_scrub_zone() on all zones
> all the time.  This is the code responsible for ensuring that we don't
> do any aeration work on zones that do not need it.

Yes, that is correct.

Based on your comment here and a few other spots I am assuming you would
prefer to see these sorts of tests pulled out and done before we call the
function. I'm assuming that is the case, since I started to see the
pattern, so I will update that for the next patch set.
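
[Editor's note: for example, the hoisted version of the check above might
look like this in aerator_cycle(); illustrative only:

	for_each_populated_zone(zone) {
		/* Skip zones that never requested aeration */
		if (!test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
			continue;

		aerator_scrub_zone(zone);
	}
]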

Patch

diff --git a/include/linux/memory_aeration.h b/include/linux/memory_aeration.h
index 44cfbc259778..2f45196218b1 100644
--- a/include/linux/memory_aeration.h
+++ b/include/linux/memory_aeration.h
@@ -3,19 +3,50 @@ 
 #define _LINUX_MEMORY_AERATION_H
 
 #include <linux/mmzone.h>
+#include <linux/jump_label.h>
 #include <linux/pageblock-flags.h>
+#include <asm/pgtable_types.h>
 
+#define AERATOR_MIN_ORDER	pageblock_order
+#define AERATOR_HWM		32
+
+struct aerator_dev_info {
+	void (*react)(struct aerator_dev_info *a_dev_info);
+	struct list_head batch;
+	unsigned long capacity;
+	atomic_t refcnt;
+};
+
+extern struct static_key aerator_notify_enabled;
+
+void __aerator_notify(struct zone *zone);
 struct page *get_aeration_page(struct zone *zone, unsigned int order,
 			       int migratetype);
 void put_aeration_page(struct zone *zone, struct page *page);
 
+void __aerator_del_from_boundary(struct page *page, struct zone *zone);
+void aerator_add_to_boundary(struct page *page, struct zone *zone);
+
+struct list_head *__aerator_get_tail(unsigned int order, int migratetype);
 static inline struct list_head *aerator_get_tail(struct zone *zone,
 						 unsigned int order,
 						 int migratetype)
 {
+#ifdef CONFIG_AERATION
+	if (order >= AERATOR_MIN_ORDER &&
+	    test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
+		return __aerator_get_tail(order, migratetype);
+#endif
 	return &zone->free_area[order].free_list[migratetype];
 }
 
+static inline void aerator_del_from_boundary(struct page *page,
+					     struct zone *zone)
+{
+	if (PageAerated(page) && test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
+		__aerator_del_from_boundary(page, zone);
+}
+
 static inline void set_page_aerated(struct page *page,
 				    struct zone *zone,
 				    unsigned int order,
@@ -28,6 +59,9 @@  static inline void set_page_aerated(struct page *page,
 	/* record migratetype and flag page as aerated */
 	set_pcppage_migratetype(page, migratetype);
 	__SetPageAerated(page);
+
+	/* update boundary of new migratetype and record it */
+	aerator_add_to_boundary(page, zone);
 #endif
 }
 
@@ -39,11 +73,19 @@  static inline void clear_page_aerated(struct page *page,
 	if (likely(!PageAerated(page)))
 		return;
 
+	/* push boundary back if we removed the upper boundary */
+	aerator_del_from_boundary(page, zone);
+
 	__ClearPageAerated(page);
 	area->nr_free_aerated--;
 #endif
 }
 
+static inline unsigned long aerator_raw_pages(struct free_area *area)
+{
+	return area->nr_free - area->nr_free_aerated;
+}
+
 /**
  * aerator_notify_free - Free page notification that will start page processing
  * @zone: Pointer to current zone of last page processed
@@ -57,5 +99,20 @@  static inline void clear_page_aerated(struct page *page,
  */
 static inline void aerator_notify_free(struct zone *zone, int order)
 {
+#ifdef CONFIG_AERATION
+	if (!static_key_false(&aerator_notify_enabled))
+		return;
+	if (order < AERATOR_MIN_ORDER)
+		return;
+	if (test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
+		return;
+	if (aerator_raw_pages(&zone->free_area[order]) < AERATOR_HWM)
+		return;
+
+	__aerator_notify(zone);
+#endif
 }
+
+void aerator_shutdown(void);
+int aerator_startup(struct aerator_dev_info *sdev);
 #endif /*_LINUX_MEMORY_AERATION_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7d89722ae9eb..52190a791e63 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -554,6 +554,14 @@  enum zone_flags {
 	ZONE_BOOSTED_WATERMARK,		/* zone recently boosted watermarks.
 					 * Cleared when kswapd is woken.
 					 */
+	ZONE_AERATION_REQUESTED,	/* zone has enabled aeration and is
+					 * requesting that data be scrubbed
+					 * out of its higher-order pages.
+					 */
+	ZONE_AERATION_ACTIVE,		/* zone has enabled aeration and is
+					 * actively cleaning the data out of
+					 * its higher-order pages.
+					 */
 };
 
 static inline unsigned long zone_managed_pages(struct zone *zone)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b848517da64c..f16e73318d49 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -745,6 +745,14 @@  static inline int page_has_type(struct page *page)
 PAGE_TYPE_OPS(Offline, offline)
 
 /*
+ * PageAerated() is an alias for Offline; however, it is not an exclusive
+ * value. It is meant to be seen in combination with PageBuddy() and
+ * indicates that the page has been scrubbed while waiting in the
+ * buddy system.
+ */
+PAGE_TYPE_OPS(Aerated, offline)
+
+/*
  * If kmemcg is enabled, the buddy allocator will set PageKmemcg() on
  * pages allocated with __GFP_ACCOUNT. It gets cleared on page free.
  */
diff --git a/mm/Makefile b/mm/Makefile
index ac5e5ba78874..26c2fcd2b89d 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -104,3 +104,4 @@  obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_HMM) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_AERATION) += aeration.o
diff --git a/mm/aeration.c b/mm/aeration.c
new file mode 100644
index 000000000000..720dc51cb215
--- /dev/null
+++ b/mm/aeration.c
@@ -0,0 +1,270 @@ 
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/page-isolation.h>
+#include <linux/gfp.h>
+#include <linux/export.h>
+#include <linux/delay.h>
+#include <linux/slab.h>
+
+static struct aerator_dev_info *a_dev_info;
+struct static_key aerator_notify_enabled;
+
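+/*
+ * Per order/migratetype insertion points. Each entry points at the first
+ * aerated page on its free list (or the list head if there are none),
+ * keeping the aerated pages in one contiguous block at the tail. Only
+ * tracks the zone currently being aerated (ZONE_AERATION_ACTIVE).
+ */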
+struct list_head *boundary[MAX_ORDER - AERATOR_MIN_ORDER][MIGRATE_TYPES];
+
+static void aerator_reset_boundary(struct zone *zone, unsigned int order,
+				   unsigned int migratetype)
+{
+	boundary[order - AERATOR_MIN_ORDER][migratetype] =
+			&zone->free_area[order].free_list[migratetype];
+}
+
+#define for_each_aerate_migratetype_order(_order, _type) \
+	for (_order = MAX_ORDER; _order-- != AERATOR_MIN_ORDER;) \
+		for (_type = MIGRATE_TYPES; _type--;)
+
+static void aerator_populate_boundaries(struct zone *zone)
+{
+	unsigned int order, mt;
+
+	if (test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
+		return;
+
+	for_each_aerate_migratetype_order(order, mt)
+		aerator_reset_boundary(zone, order, mt);
+
+	set_bit(ZONE_AERATION_ACTIVE, &zone->flags);
+}
+
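+/* The boundary stands in for the free list tail while aeration is active */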
+struct list_head *__aerator_get_tail(unsigned int order, int migratetype)
+{
+	return boundary[order - AERATOR_MIN_ORDER][migratetype];
+}
+
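+/*
+ * If the page being removed is the current boundary, push the boundary
+ * back to the next entry on the free list so it remains valid.
+ */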
+void __aerator_del_from_boundary(struct page *page, struct zone *zone)
+{
+	unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
+	int mt = get_pcppage_migratetype(page);
+	struct list_head **tail = &boundary[order][mt];
+
+	if (*tail == &page->lru)
+		*tail = page->lru.next;
+}
+
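+/* The page just aerated becomes the new boundary for its order/migratetype */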
+void aerator_add_to_boundary(struct page *page, struct zone *zone)
+{
+	unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
+	int mt = get_pcppage_migratetype(page);
+	struct list_head **tail = &boundary[order][mt];
+
+	*tail = &page->lru;
+}
+
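+/* Unregister the aeration device, waiting for outstanding work to finish */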
+void aerator_shutdown(void)
+{
+	static_key_slow_dec(&aerator_notify_enabled);
+
+	while (atomic_read(&a_dev_info->refcnt))
+		msleep(20);
+
+	WARN_ON(!list_empty(&a_dev_info->batch));
+
+	a_dev_info = NULL;
+}
+EXPORT_SYMBOL_GPL(aerator_shutdown);
+
+static void aerator_schedule_initial_aeration(void)
+{
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		spin_lock(&zone->lock);
+		__aerator_notify(zone);
+		spin_unlock(&zone->lock);
+	}
+}
+
+int aerator_startup(struct aerator_dev_info *sdev)
+{
+	if (a_dev_info)
+		return -EBUSY;
+
+	INIT_LIST_HEAD(&sdev->batch);
+	atomic_set(&sdev->refcnt, 0);
+
+	a_dev_info = sdev;
+	aerator_schedule_initial_aeration();
+
+	static_key_slow_inc(&aerator_notify_enabled);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(aerator_startup);
+
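+/*
+ * Fill stage: pull raw pages off the zone's free lists into the batch
+ * until we either hit the device's capacity or run out of pages, in
+ * which case the zone's request bit is cleared and its reference dropped.
+ */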
+static void aerator_fill(struct zone *zone)
+{
+	struct list_head *batch = &a_dev_info->batch;
+	int budget = a_dev_info->capacity;
+	unsigned int order, mt;
+
+	for_each_aerate_migratetype_order(order, mt) {
+		struct page *page;
+
+		/*
+		 * Pull pages from the free list until we have drained
+		 * it or filled the batch reactor.
+		 */
+		while ((page = get_aeration_page(zone, order, mt))) {
+			list_add_tail(&page->lru, batch);
+
+			if (!--budget)
+				return;
+		}
+	}
+
+	/*
+	 * If there are no longer enough free pages to fully populate
+	 * the aerator, then we can just shut it down for this zone.
+	 */
+	clear_bit(ZONE_AERATION_REQUESTED, &zone->flags);
+	atomic_dec(&a_dev_info->refcnt);
+}
+
+static void aerator_drain(struct zone *zone)
+{
+	struct list_head *list = &a_dev_info->batch;
+	struct page *page;
+
+	/*
+	 * Drain the now aerated pages back into their respective
+	 * free lists/areas.
+	 */
+	while ((page = list_first_entry_or_null(list, struct page, lru))) {
+		list_del(&page->lru);
+		put_aeration_page(zone, page);
+	}
+}
+
+static void aerator_scrub_zone(struct zone *zone)
+{
+	/* See if there are any pages to pull */
+	if (!test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
+		return;
+
+	spin_lock(&zone->lock);
+
+	do {
+		aerator_fill(zone);
+
+		if (list_empty(&a_dev_info->batch))
+			break;
+
+		spin_unlock(&zone->lock);
+
+		/*
+		 * Start aerating the pages in the batch; once that
+		 * completes we can drain and then refill the reactor,
+		 * restarting the cycle.
+		 */
+		a_dev_info->react(a_dev_info);
+
+		spin_lock(&zone->lock);
+
+		/*
+		 * Guarantee boundaries are populated before we
+		 * start placing aerated pages in the zone.
+		 */
+		aerator_populate_boundaries(zone);
+
+		/*
+		 * We should have a list of pages that have been
+		 * processed. Return them to their original free lists.
+		 */
+		aerator_drain(zone);
+
+		/* keep pulling pages till there are none to pull */
+	} while (test_bit(ZONE_AERATION_REQUESTED, &zone->flags));
+
+	clear_bit(ZONE_AERATION_ACTIVE, &zone->flags);
+
+	spin_unlock(&zone->lock);
+}
+
+/**
+ * aerator_cycle - start aerating a batch of pages, drain, and refill
+ *
+ * The aerator cycle consists of four stages: fill, react, drain, and idle.
+ * We will cycle through the first three stages until we fail to obtain any
+ * pages, at which point we will switch to idle and the thread will go back
+ * to sleep awaiting the next request for aeration.
+ */
+static void aerator_cycle(struct work_struct *work)
+{
+	struct zone *zone = first_online_pgdat()->node_zones;
+	int refcnt;
+
+	/*
+	 * We want to hold one additional reference against the number of
+	 * active hints, as we may clear the hint that originally brought us
+	 * here. We will drop that reference once we have either vaporized
+	 * the content of the pages or discovered that all of the pages were
+	 * stolen out from under us.
+	 */
+	atomic_inc(&a_dev_info->refcnt);
+
+	for (;;) {
+		aerator_scrub_zone(zone);
+
+		/*
+		 * Move to next zone, if at the end of the list
+		 * test to see if we can just go into idle.
+		 */
+		zone = next_zone(zone);
+		if (zone)
+			continue;
+		zone = first_online_pgdat()->node_zones;
+
+		/*
+		 * If we never generated any pages and we are
+		 * holding the only remaining reference to active
+		 * hints, then we can let this go for now and
+		 * go idle.
+		 */
+		refcnt = atomic_read(&a_dev_info->refcnt);
+		if (refcnt != 1)
+			continue;
+		if (atomic_try_cmpxchg(&a_dev_info->refcnt, &refcnt, 0))
+			break;
+	}
+}
+
+static DECLARE_DELAYED_WORK(aerator_work, &aerator_cycle);
+
+void __aerator_notify(struct zone *zone)
+{
+	/*
+	 * We can use separate test and set operations here as there
+	 * is nothing else that can set or clear this bit while we are
+	 * holding the zone lock. The advantage to doing it this way is
+	 * that we don't have to dirty the cacheline unless we are
+	 * changing the value.
+	 */
+	set_bit(ZONE_AERATION_REQUESTED, &zone->flags);
+
+	if (atomic_fetch_inc(&a_dev_info->refcnt))
+		return;
+
+	/*
+	 * We should never be calling this function while there are already
+	 * pages in the list being aerated. If we are called under such a
+	 * circumstance, report an error.
+	 */
+	WARN_ON(!list_empty(&a_dev_info->batch));
+
+	/*
+	 * Delay the start of work to allow a sizable queue to build. For
+	 * now we are limiting this to running no more than 10 times per
+	 * second.
+	 */
+	schedule_delayed_work(&aerator_work, HZ / 10);
+}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eb7ba8385374..45269c46c662 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2168,8 +2168,10 @@  struct page *get_aeration_page(struct zone *zone, unsigned int order,
 	list_for_each_entry_from_reverse(page, list, lru) {
 		if (PageAerated(page)) {
 			page = list_first_entry(list, struct page, lru);
-			if (PageAerated(page))
+			if (PageAerated(page)) {
+				aerator_add_to_boundary(page, zone);
 				break;
+			}
 		}
 
 		del_page_from_free_area(page, zone, order);