diff mbox

[RFC,v4,06/40] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists

Message ID 20130925231454.26184.19783.stgit@srivatsabhat.in.ibm.com (mailing list archive)
State RFC, archived
Headers show

Commit Message

Srivatsa S. Bhat Sept. 25, 2013, 11:14 p.m. UTC
The zones' freelists need to be made region-aware, in order to influence
page allocation and freeing algorithms. So in every free list in the zone, we
would like to demarcate the pageblocks belonging to different memory regions
(we can do this using a set of pointers, and thus avoid splitting up the
freelists).

Also, we would like to keep the pageblocks in the freelists sorted in
region-order. That is, pageblocks belonging to region-0 would come first,
followed by pageblocks belonging to region-1 and so on, within a given
freelist. Of course, a set of pageblocks belonging to the same region need
not be sorted; it is sufficient if we maintain the pageblocks in
region-sorted-order, rather than a full address-sorted-order.

For each freelist within the zone, we maintain a set of pointers to
pageblocks belonging to the various memory regions in that zone.

Eg:

    |<---Region0--->|   |<---Region1--->|   |<-------Region2--------->|
     ____      ____      ____      ____      ____      ____      ____
--> |____|--> |____|--> |____|--> |____|--> |____|--> |____|--> |____|-->

                 ^                  ^                              ^
                 |                  |                              |
                Reg0               Reg1                          Reg2


Page allocation will proceed as usual - pick the first item on the free list.
But we don't want to keep updating these region pointers every time we allocate
a pageblock from the freelist. So, instead of pointing to the *first* pageblock
of that region, we maintain the region pointers such that they point to the
*last* pageblock in that region, as shown in the figure above. That way, as
long as there are > 1 pageblocks in that region in that freelist, that region
pointer doesn't need to be updated.


Page allocation algorithm:
-------------------------

The heart of the page allocation algorithm remains as it is - pick the first
item on the appropriate freelist and return it.


Arrangement of pageblocks in the zone freelists:
-----------------------------------------------

This is the main change - we keep the pageblocks in region-sorted order,
where pageblocks belonging to region-0 come first, followed by those belonging
to region-1 and so on. But the pageblocks within a given region need *not* be
sorted, since we need them to be only region-sorted and not fully
address-sorted.

This sorting is performed when adding pages back to the freelists, thus
avoiding any region-related overhead in the critical page allocation
paths.

Strategy to consolidate allocations to a minimum no. of regions:
---------------------------------------------------------------

Page allocation happens in the order of increasing region number. We would
like to do light-weight page reclaim or compaction (for the purpose of memory
power management) in the reverse order, to keep the allocated pages within
a minimum number of regions (approximately). The latter part is implemented
in subsequent patches.

---------------------------- Increasing region number---------------------->

Direction of allocation--->                <---Direction of reclaim/compaction

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 mm/page_alloc.c |  154 +++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 138 insertions(+), 16 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Dave Hansen Sept. 26, 2013, 10:16 p.m. UTC | #1
On 09/25/2013 04:14 PM, Srivatsa S. Bhat wrote:
> @@ -605,16 +713,22 @@ static inline void __free_one_page(struct page *page,
>  		buddy_idx = __find_buddy_index(combined_idx, order + 1);
>  		higher_buddy = higher_page + (buddy_idx - combined_idx);
>  		if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
> -			list_add_tail(&page->lru,
> -				&zone->free_area[order].free_list[migratetype].list);
> +
> +			/*
> +			 * Implementing an add_to_freelist_tail() won't be
> +			 * very useful because both of them (almost) add to
> +			 * the tail within the region. So we could potentially
> +			 * switch off this entire "is next-higher buddy free?"
> +			 * logic when memory regions are used.
> +			 */
> +			add_to_freelist(page, &area->free_list[migratetype]);
>  			goto out;
>  		}
>  	}

Commit 6dda9d55b says that this had some discrete performance gains.
It's a bummer that this deoptimizes it, and I think that (expected)
performance degredation at least needs to be referenced _somewhere_.

I also find it very hard to take code seriously which stuff like this:

> +#ifdef CONFIG_DEBUG_PAGEALLOC
> +		WARN(region->nr_free == 0, "%s: nr_free messed up\n", __func__);
> +#endif

nine times.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Srivatsa S. Bhat Sept. 27, 2013, 6:34 a.m. UTC | #2
On 09/27/2013 03:46 AM, Dave Hansen wrote:
> On 09/25/2013 04:14 PM, Srivatsa S. Bhat wrote:
>> @@ -605,16 +713,22 @@ static inline void __free_one_page(struct page *page,
>>  		buddy_idx = __find_buddy_index(combined_idx, order + 1);
>>  		higher_buddy = higher_page + (buddy_idx - combined_idx);
>>  		if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
>> -			list_add_tail(&page->lru,
>> -				&zone->free_area[order].free_list[migratetype].list);
>> +
>> +			/*
>> +			 * Implementing an add_to_freelist_tail() won't be
>> +			 * very useful because both of them (almost) add to
>> +			 * the tail within the region. So we could potentially
>> +			 * switch off this entire "is next-higher buddy free?"
>> +			 * logic when memory regions are used.
>> +			 */
>> +			add_to_freelist(page, &area->free_list[migratetype]);
>>  			goto out;
>>  		}
>>  	}
> 
> Commit 6dda9d55b says that this had some discrete performance gains.

I had seen the comments about this but not the patch which made that change.
Thanks for pointing the commit to me! But now that I went through the changelog
carefully, it appears as if there were only some slight benefits in huge page
allocation benchmarks, and the results were either inconclusive or unsubstantial
in most other benchmarks that the author tried.

> It's a bummer that this deoptimizes it, and I think that (expected)
> performance degredation at least needs to be referenced _somewhere_.
>

I'm not so sure about that. Yes, I know that my patchset treats all pages
equally (by adding all of them _far_ _away_ from the head of the list), but
given that the above commit didn't show any significant improvements, I doubt
whether my patchset will lead to any noticeable _degradation_. Perhaps I'll try
out the huge-page allocation benchmark and observe what happens with my patchset.
 
> I also find it very hard to take code seriously which stuff like this:
> 
>> +#ifdef CONFIG_DEBUG_PAGEALLOC
>> +		WARN(region->nr_free == 0, "%s: nr_free messed up\n", __func__);
>> +#endif
> 
> nine times.
> 

Hmm, those debug checks were pretty invaluable for me when testing the code.
I retained them in the patches so that if other people test it and find
problems, they would be able to send bug reports with good amount of info as
to what exactly went wrong. Besides, this patchset adds a ton of new code, and
this list manipulation framework along with the bitmap-based radix tree is one
of the core components. If that goes for a toss, everything from there onwards
will be a train-wreck! So I felt having these checks and balances would be very
useful to validate the correct working of each piece and to debug complex
problems easily.

But please help me understand your point correctly - are you suggesting that
I remove these checks completely or just make them gel well with the other code
so that they don't become such an eyesore as they are at the moment (with all
the #ifdefs sticking out etc)?

If you are suggesting the latter, I completely agree with you. I'll find out
a way to do that, and if you have any suggestions, please let me know!

Thank you!

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Weiner Oct. 23, 2013, 10:17 a.m. UTC | #3
On Thu, Sep 26, 2013 at 04:44:56AM +0530, Srivatsa S. Bhat wrote:
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -517,6 +517,111 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
>  	return 0;
>  }
>  
> +static void add_to_freelist(struct page *page, struct free_list *free_list)
> +{
> +	struct list_head *prev_region_list, *lru;
> +	struct mem_region_list *region;
> +	int region_id, i;
> +
> +	lru = &page->lru;
> +	region_id = page_zone_region_id(page);
> +
> +	region = &free_list->mr_list[region_id];
> +	region->nr_free++;
> +
> +	if (region->page_block) {
> +		list_add_tail(lru, region->page_block);
> +		return;
> +	}
> +
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> +	WARN(region->nr_free != 1, "%s: nr_free is not unity\n", __func__);
> +#endif
> +
> +	if (!list_empty(&free_list->list)) {
> +		for (i = region_id - 1; i >= 0; i--) {
> +			if (free_list->mr_list[i].page_block) {
> +				prev_region_list =
> +					free_list->mr_list[i].page_block;
> +				goto out;
> +			}
> +		}
> +	}
> +
> +	/* This is the first region, so add to the head of the list */
> +	prev_region_list = &free_list->list;
> +
> +out:
> +	list_add(lru, prev_region_list);
> +
> +	/* Save pointer to page block of this region */
> +	region->page_block = lru;

"Pageblock" has a different meaning in the allocator already.

The things you string up here are just called pages, regardless of
which order they are in and how many pages they can be split into.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Srivatsa S. Bhat Oct. 23, 2013, 4:09 p.m. UTC | #4
On 10/23/2013 03:47 PM, Johannes Weiner wrote:
> On Thu, Sep 26, 2013 at 04:44:56AM +0530, Srivatsa S. Bhat wrote:
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -517,6 +517,111 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
>>  	return 0;
>>  }
>>  
>> +static void add_to_freelist(struct page *page, struct free_list *free_list)
>> +{
>> +	struct list_head *prev_region_list, *lru;
>> +	struct mem_region_list *region;
>> +	int region_id, i;
>> +
>> +	lru = &page->lru;
>> +	region_id = page_zone_region_id(page);
>> +
>> +	region = &free_list->mr_list[region_id];
>> +	region->nr_free++;
>> +
>> +	if (region->page_block) {
>> +		list_add_tail(lru, region->page_block);
>> +		return;
>> +	}
>> +
>> +#ifdef CONFIG_DEBUG_PAGEALLOC
>> +	WARN(region->nr_free != 1, "%s: nr_free is not unity\n", __func__);
>> +#endif
>> +
>> +	if (!list_empty(&free_list->list)) {
>> +		for (i = region_id - 1; i >= 0; i--) {
>> +			if (free_list->mr_list[i].page_block) {
>> +				prev_region_list =
>> +					free_list->mr_list[i].page_block;
>> +				goto out;
>> +			}
>> +		}
>> +	}
>> +
>> +	/* This is the first region, so add to the head of the list */
>> +	prev_region_list = &free_list->list;
>> +
>> +out:
>> +	list_add(lru, prev_region_list);
>> +
>> +	/* Save pointer to page block of this region */
>> +	region->page_block = lru;
> 
> "Pageblock" has a different meaning in the allocator already.
> 
> The things you string up here are just called pages, regardless of
> which order they are in and how many pages they can be split into.
> 

Ah, yes. I'll fix that.

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e9d8082..d48eb04 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -517,6 +517,111 @@  static inline int page_is_buddy(struct page *page, struct page *buddy,
 	return 0;
 }
 
+static void add_to_freelist(struct page *page, struct free_list *free_list)
+{
+	struct list_head *prev_region_list, *lru;
+	struct mem_region_list *region;
+	int region_id, i;
+
+	lru = &page->lru;
+	region_id = page_zone_region_id(page);
+
+	region = &free_list->mr_list[region_id];
+	region->nr_free++;
+
+	if (region->page_block) {
+		list_add_tail(lru, region->page_block);
+		return;
+	}
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	WARN(region->nr_free != 1, "%s: nr_free is not unity\n", __func__);
+#endif
+
+	if (!list_empty(&free_list->list)) {
+		for (i = region_id - 1; i >= 0; i--) {
+			if (free_list->mr_list[i].page_block) {
+				prev_region_list =
+					free_list->mr_list[i].page_block;
+				goto out;
+			}
+		}
+	}
+
+	/* This is the first region, so add to the head of the list */
+	prev_region_list = &free_list->list;
+
+out:
+	list_add(lru, prev_region_list);
+
+	/* Save pointer to page block of this region */
+	region->page_block = lru;
+}
+
+static void del_from_freelist(struct page *page, struct free_list *free_list)
+{
+	struct list_head *prev_page_lru, *lru, *p;
+	struct mem_region_list *region;
+	int region_id;
+
+	lru = &page->lru;
+	region_id = page_zone_region_id(page);
+	region = &free_list->mr_list[region_id];
+	region->nr_free--;
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	WARN(region->nr_free < 0, "%s: nr_free is negative\n", __func__);
+
+	/* Verify whether this page indeed belongs to this free list! */
+
+	list_for_each(p, &free_list->list) {
+		if (p == lru)
+			goto page_found;
+	}
+
+	WARN(1, "%s: page doesn't belong to the given freelist!\n", __func__);
+
+page_found:
+#endif
+
+	/*
+	 * If we are not deleting the last pageblock in this region (i.e.,
+	 * farthest from list head, but not necessarily the last numerically),
+	 * then we need not update the region->page_block pointer.
+	 */
+	if (lru != region->page_block) {
+		list_del(lru);
+#ifdef CONFIG_DEBUG_PAGEALLOC
+		WARN(region->nr_free == 0, "%s: nr_free messed up\n", __func__);
+#endif
+		return;
+	}
+
+	prev_page_lru = lru->prev;
+	list_del(lru);
+
+	if (region->nr_free == 0) {
+		region->page_block = NULL;
+	} else {
+		region->page_block = prev_page_lru;
+#ifdef CONFIG_DEBUG_PAGEALLOC
+		WARN(prev_page_lru == &free_list->list,
+			"%s: region->page_block points to list head\n",
+								__func__);
+#endif
+	}
+}
+
+/**
+ * Move a given page from one freelist to another.
+ */
+static void move_page_freelist(struct page *page, struct free_list *old_list,
+			       struct free_list *new_list)
+{
+	del_from_freelist(page, old_list);
+	add_to_freelist(page, new_list);
+}
+
 /*
  * Freeing function for a buddy system allocator.
  *
@@ -550,6 +655,7 @@  static inline void __free_one_page(struct page *page,
 	unsigned long combined_idx;
 	unsigned long uninitialized_var(buddy_idx);
 	struct page *buddy;
+	struct free_area *area;
 
 	VM_BUG_ON(!zone_is_initialized(zone));
 
@@ -579,8 +685,9 @@  static inline void __free_one_page(struct page *page,
 			__mod_zone_freepage_state(zone, 1 << order,
 						  migratetype);
 		} else {
-			list_del(&buddy->lru);
-			zone->free_area[order].nr_free--;
+			area = &zone->free_area[order];
+			del_from_freelist(buddy, &area->free_list[migratetype]);
+			area->nr_free--;
 			rmv_page_order(buddy);
 		}
 		combined_idx = buddy_idx & page_idx;
@@ -589,6 +696,7 @@  static inline void __free_one_page(struct page *page,
 		order++;
 	}
 	set_page_order(page, order);
+	area = &zone->free_area[order];
 
 	/*
 	 * If this is not the largest possible page, check if the buddy
@@ -605,16 +713,22 @@  static inline void __free_one_page(struct page *page,
 		buddy_idx = __find_buddy_index(combined_idx, order + 1);
 		higher_buddy = higher_page + (buddy_idx - combined_idx);
 		if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
-			list_add_tail(&page->lru,
-				&zone->free_area[order].free_list[migratetype].list);
+
+			/*
+			 * Implementing an add_to_freelist_tail() won't be
+			 * very useful because both of them (almost) add to
+			 * the tail within the region. So we could potentially
+			 * switch off this entire "is next-higher buddy free?"
+			 * logic when memory regions are used.
+			 */
+			add_to_freelist(page, &area->free_list[migratetype]);
 			goto out;
 		}
 	}
 
-	list_add(&page->lru,
-		&zone->free_area[order].free_list[migratetype].list);
+	add_to_freelist(page, &area->free_list[migratetype]);
 out:
-	zone->free_area[order].nr_free++;
+	area->nr_free++;
 }
 
 static inline int free_pages_check(struct page *page)
@@ -833,7 +947,7 @@  static inline void expand(struct zone *zone, struct page *page,
 			continue;
 		}
 #endif
-		list_add(&page[size].lru, &area->free_list[migratetype].list);
+		add_to_freelist(&page[size], &area->free_list[migratetype]);
 		area->nr_free++;
 		set_page_order(&page[size], high);
 	}
@@ -900,7 +1014,7 @@  struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 
 		page = list_entry(area->free_list[migratetype].list.next,
 							struct page, lru);
-		list_del(&page->lru);
+		del_from_freelist(page, &area->free_list[migratetype]);
 		rmv_page_order(page);
 		area->nr_free--;
 		expand(zone, page, order, current_order, area, migratetype);
@@ -941,7 +1055,8 @@  int move_freepages(struct zone *zone,
 {
 	struct page *page;
 	unsigned long order;
-	int pages_moved = 0;
+	struct free_area *area;
+	int pages_moved = 0, old_mt;
 
 #ifndef CONFIG_HOLES_IN_ZONE
 	/*
@@ -969,8 +1084,10 @@  int move_freepages(struct zone *zone,
 		}
 
 		order = page_order(page);
-		list_move(&page->lru,
-			  &zone->free_area[order].free_list[migratetype].list);
+		old_mt = get_freepage_migratetype(page);
+		area = &zone->free_area[order];
+		move_page_freelist(page, &area->free_list[old_mt],
+				    &area->free_list[migratetype]);
 		set_freepage_migratetype(page, migratetype);
 		page += 1 << order;
 		pages_moved += 1 << order;
@@ -1064,7 +1181,7 @@  __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 	struct free_area *area;
 	int current_order;
 	struct page *page;
-	int migratetype, new_type, i;
+	int migratetype, new_type, i, mt;
 
 	/* Find the largest possible block of pages in the other list */
 	for (current_order = MAX_ORDER-1; current_order >= order;
@@ -1089,7 +1206,8 @@  __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 							  migratetype);
 
 			/* Remove the page from the freelists */
-			list_del(&page->lru);
+			mt = get_freepage_migratetype(page);
+			del_from_freelist(page, &area->free_list[mt]);
 			rmv_page_order(page);
 
 			/*
@@ -1449,7 +1567,8 @@  static int __isolate_free_page(struct page *page, unsigned int order)
 	}
 
 	/* Remove page from free list */
-	list_del(&page->lru);
+	mt = get_freepage_migratetype(page);
+	del_from_freelist(page, &zone->free_area[order].free_list[mt]);
 	zone->free_area[order].nr_free--;
 	rmv_page_order(page);
 
@@ -6442,6 +6561,8 @@  __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 	int order, i;
 	unsigned long pfn;
 	unsigned long flags;
+	int mt;
+
 	/* find the first valid pfn */
 	for (pfn = start_pfn; pfn < end_pfn; pfn++)
 		if (pfn_valid(pfn))
@@ -6474,7 +6595,8 @@  __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 		printk(KERN_INFO "remove from free list %lx %d %lx\n",
 		       pfn, 1 << order, end_pfn);
 #endif
-		list_del(&page->lru);
+		mt = get_freepage_migratetype(page);
+		del_from_freelist(page, &zone->free_area[order].free_list[mt]);
 		rmv_page_order(page);
 		zone->free_area[order].nr_free--;
 #ifdef CONFIG_HIGHMEM