[1/1] mm/khugepaged: reduce process visible downtime by pre-zeroing hugepage

Message ID 20240308074921.45752-1-ioworker0@gmail.com (mailing list archive)
State New
Series [1/1] mm/khugepaged: reduce process visible downtime by pre-zeroing hugepage

Commit Message

Lance Yang March 8, 2024, 7:49 a.m. UTC
The patch reduces the process-visible downtime during hugepage
collapse. This is achieved by pre-zeroing the hugepage before
acquiring the mmap_lock (write mode) if nr_ptes_none >= 256, without
affecting the efficiency of khugepaged.

On an Intel Core i5 CPU, the process-visible downtime during
hugepage collapse is as follows:

| nr_ptes_none  | w/o __GFP_ZERO | w/ __GFP_ZERO  |  Change |
|---------------|----------------|----------------|---------|
|      511      |     233us      |      95us      |  -59.21%|
|      384      |     376us      |     219us      |  -41.20%|
|      256      |     421us      |     323us      |  -23.28%|
|      128      |     523us      |     507us      |   -3.06%|

Of course, alloc_charge_hpage() will take longer to run with
the __GFP_ZERO flag.

|       Func           | w/o __GFP_ZERO | w/ __GFP_ZERO |
|----------------------|----------------|---------------|
| alloc_charge_hpage   |      198us     |      295us    |

But it's not a big deal, because it doesn't increase the total
time khugepaged spends collapsing a hugepage. In fact, the total
time would decrease.

Signed-off-by: Lance Yang <ioworker0@gmail.com>
---
 mm/khugepaged.c | 33 +++++++++++++++++++++++++++------
 1 file changed, 27 insertions(+), 6 deletions(-)

Comments

David Hildenbrand March 11, 2024, 4:19 p.m. UTC | #1
On 08.03.24 08:49, Lance Yang wrote:
> The patch reduces the process visible downtime during hugepage
> collapse. This is achieved by pre-zeroing the hugepage before
> acquiring the mmap_lock (write mode) if nr_ptes_none >= 256, without
> affecting the efficiency of khugepaged.
> 
> On an Intel Core i5 CPU, the process visible downtime during
> hugepage collapse is as follows:
> 
> | nr_ptes_none  | w/o __GFP_ZERO | w/ __GFP_ZERO  |  Change |
> |---------------|----------------|----------------|---------|
> |      511      |     233us      |      95us      |  -59.21%|
> |      384      |     376us      |     219us      |  -41.20%|
> |      256      |     421us      |     323us      |  -23.28%|
> |      128      |     523us      |     507us      |   -3.06%|
> 
> Of course, alloc_charge_hpage() will take longer to run with
> the __GFP_ZERO flag.
> 
> |       Func           | w/o __GFP_ZERO | w/ __GFP_ZERO |
> |----------------------|----------------|---------------|
> | alloc_charge_hpage   |      198us     |      295us    |
> 
> But it's not a big deal because it doesn't impact the total
> time spent by khugepaged in collapsing a hugepage. In fact,
> it would decrease.

It does look sane to me and not overly complicated.

But, it's an optimization really only when we have quite a bunch of 
pte_none(), possibly repeatedly so that it really makes a difference.

Usually, when we repeatedly collapse that many pte_none() we're just 
wasting a lot of memory and should re-evaluate life choices :)

So my question is: do we really care about it that much that we care to 
optimize?

But again, LGTM.
Lance Yang March 12, 2024, 1:09 p.m. UTC | #2
Hey David,

Thanks for taking the time to review!

On Tue, Mar 12, 2024 at 12:19 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.03.24 08:49, Lance Yang wrote:
> > The patch reduces the process visible downtime during hugepage
> > collapse. This is achieved by pre-zeroing the hugepage before
> > acquiring the mmap_lock (write mode) if nr_ptes_none >= 256, without
> > affecting the efficiency of khugepaged.
> >
> > On an Intel Core i5 CPU, the process visible downtime during
> > hugepage collapse is as follows:
> >
> > | nr_ptes_none  | w/o __GFP_ZERO | w/ __GFP_ZERO  |  Change |
> > |---------------|----------------|----------------|---------|
> > |      511      |     233us      |      95us      |  -59.21%|
> > |      384      |     376us      |     219us      |  -41.20%|
> > |      256      |     421us      |     323us      |  -23.28%|
> > |      128      |     523us      |     507us      |   -3.06%|
> >
> > Of course, alloc_charge_hpage() will take longer to run with
> > the __GFP_ZERO flag.
> >
> > |       Func           | w/o __GFP_ZERO | w/ __GFP_ZERO |
> > |----------------------|----------------|---------------|
> > | alloc_charge_hpage   |      198us     |      295us    |
> >
> > But it's not a big deal because it doesn't impact the total
> > time spent by khugepaged in collapsing a hugepage. In fact,
> > it would decrease.
>
> It does look sane to me and not overly complicated.
>
> But, it's an optimization really only when we have quite a bunch of
> pte_none(), possibly repeatedly so that it really makes a difference.
>
> Usually, when we repeatedly collapse that many pte_none() we're just
> wasting a lot of memory and should re-evaluate life choices :)

Agreed! It seems that the default value of max_ptes_none may be set too
high, which could result in the memory wastage issue we're discussing.

>
> So my question is: do we really care about it that much that we care to
> optimize?

IMO, although it may not be our main concern, reducing the impact of
khugepaged on the process remains crucial. I think that users also prefer
minimal interference from khugepaged.

>
> But again, LGTM.

Thanks again for your time!

Best,
Lance
>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand March 12, 2024, 1:19 p.m. UTC | #3
On 12.03.24 14:09, Lance Yang wrote:
> Hey David,
> 
> Thanks for taking time to review!
> 
> On Tue, Mar 12, 2024 at 12:19 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 08.03.24 08:49, Lance Yang wrote:
>>> The patch reduces the process visible downtime during hugepage
>>> collapse. This is achieved by pre-zeroing the hugepage before
>>> acquiring the mmap_lock (write mode) if nr_ptes_none >= 256, without
>>> affecting the efficiency of khugepaged.
>>>
>>> On an Intel Core i5 CPU, the process visible downtime during
>>> hugepage collapse is as follows:
>>>
>>> | nr_ptes_none  | w/o __GFP_ZERO | w/ __GFP_ZERO  |  Change |
>>> |---------------|----------------|----------------|---------|
>>> |      511      |     233us      |      95us      |  -59.21%|
>>> |      384      |     376us      |     219us      |  -41.20%|
>>> |      256      |     421us      |     323us      |  -23.28%|
>>> |      128      |     523us      |     507us      |   -3.06%|
>>>
>>> Of course, alloc_charge_hpage() will take longer to run with
>>> the __GFP_ZERO flag.
>>>
>>> |       Func           | w/o __GFP_ZERO | w/ __GFP_ZERO |
>>> |----------------------|----------------|---------------|
>>> | alloc_charge_hpage   |      198us     |      295us    |
>>>
>>> But it's not a big deal because it doesn't impact the total
>>> time spent by khugepaged in collapsing a hugepage. In fact,
>>> it would decrease.
>>
>> It does look sane to me and not overly complicated.
>>
>> But, it's an optimization really only when we have quite a bunch of
>> pte_none(), possibly repeatedly so that it really makes a difference.
>>
>> Usually, when we repeatedly collapse that many pte_none() we're just
>> wasting a lot of memory and should re-evaluate life choices :)
> 
> Agreed! It seems that the default value of max_ptes_none may be set too
> high, which could result in the memory wastage issue we're discussing.

IIRC, some companies disable it completely (set to 0) because of that.
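
For reference, that's the max_ptes_none tunable, e.g.:

  echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none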

> 
>>
>> So my question is: do we really care about it that much that we care to
>> optimize?
> 
> IMO, although it may not be our main concern, reducing the impact of
> khugepaged on the process remains crucial. I think that users also prefer
> minimal interference from khugepaged.

The problem I am having with this is that for the *common* case where we 
have a small number of pte_none(), we cannot really optimize because we 
have to perform the copy.

So this feels like we're rather optimizing a corner case, and I am not 
so sure if that is really worth it.

Other thoughts?
Lance Yang March 12, 2024, 1:55 p.m. UTC | #4
On Tue, Mar 12, 2024 at 9:19 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.03.24 14:09, Lance Yang wrote:
> > Hey David,
> >
> > Thanks for taking time to review!
> >
> > On Tue, Mar 12, 2024 at 12:19 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 08.03.24 08:49, Lance Yang wrote:
> >>> The patch reduces the process visible downtime during hugepage
> >>> collapse. This is achieved by pre-zeroing the hugepage before
> >>> acquiring the mmap_lock (write mode) if nr_ptes_none >= 256, without
> >>> affecting the efficiency of khugepaged.
> >>>
> >>> On an Intel Core i5 CPU, the process visible downtime during
> >>> hugepage collapse is as follows:
> >>>
> >>> | nr_ptes_none  | w/o __GFP_ZERO | w/ __GFP_ZERO  |  Change |
> >>> |---------------|----------------|----------------|---------|
> >>> |      511      |     233us      |      95us      |  -59.21%|
> >>> |      384      |     376us      |     219us      |  -41.20%|
> >>> |      256      |     421us      |     323us      |  -23.28%|
> >>> |      128      |     523us      |     507us      |   -3.06%|
> >>>
> >>> Of course, alloc_charge_hpage() will take longer to run with
> >>> the __GFP_ZERO flag.
> >>>
> >>> |       Func           | w/o __GFP_ZERO | w/ __GFP_ZERO |
> >>> |----------------------|----------------|---------------|
> >>> | alloc_charge_hpage   |      198us     |      295us    |
> >>>
> >>> But it's not a big deal because it doesn't impact the total
> >>> time spent by khugepaged in collapsing a hugepage. In fact,
> >>> it would decrease.
> >>
> >> It does look sane to me and not overly complicated.
> >>
> >> But, it's an optimization really only when we have quite a bunch of
> >> pte_none(), possibly repeatedly so that it really makes a difference.
> >>
> >> Usually, when we repeatedly collapse that many pte_none() we're just
> >> wasting a lot of memory and should re-evaluate life choices :)
> >
> > Agreed! It seems that the default value of max_ptes_none may be set too
> > high, which could result in the memory wastage issue we're discussing.
>
> IIRC, some companies disable it completely (set to 0) because of that.
>
> >
> >>
> >> So my question is: do we really care about it that much that we care to
> >> optimize?
> >
> > IMO, although it may not be our main concern, reducing the impact of
> > khugepaged on the process remains crucial. I think that users also prefer
> > minimal interference from khugepaged.
>
> The problem I am having with this is that for the *common* case where we
> have a small number of pte_none(), we cannot really optimize because we
> have to perform the copy.
>
> So this feels like we're rather optimizing a corner case, and I am not
> so sure if that is really worth it.
>
> Other thoughts?

Another thought is to introduce a khugepaged/alloc_zeroed_hpage knob
in the THP sysfs settings. This would let users decide whether to avoid
unnecessary copies when nr_ptes_none > 0.
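
If we went that route, usage might look like the following (the knob
name here is just a proposal; it does not exist today):

  echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_zeroed_hpage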

>
> --
> Cheers,
>
> David / dhildenb
>
Lance Yang March 14, 2024, 2:19 p.m. UTC | #5
Another thought, suggested by Bang Li, is to record which PTEs are none
in hpage_collapse_scan_pmd(). Then, before acquiring the mmap_lock
(write mode), we would pre-zero only those pages as needed.
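
A rough sketch of the idea (untested; the bitmap and loop below are
only meant to illustrate the flow, not the final code):

  /* in hpage_collapse_scan_pmd(), under the PTE lock */
  DECLARE_BITMAP(pte_none_bitmap, HPAGE_PMD_NR);
  ...
  if (pte_none(pteval))
          __set_bit(_pte - pte, pte_none_bitmap);

  /* later, before taking the mmap_lock in write mode */
  unsigned int i;

  for_each_set_bit(i, pte_none_bitmap, HPAGE_PMD_NR)
          clear_user_highpage(hpage + i, address + i * PAGE_SIZE);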

What do you think?

Thanks,
Lance

On Tue, Mar 12, 2024 at 9:55 PM Lance Yang <ioworker0@gmail.com> wrote:
>
> On Tue, Mar 12, 2024 at 9:19 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 12.03.24 14:09, Lance Yang wrote:
> > > Hey David,
> > >
> > > Thanks for taking time to review!
> > >
> > > On Tue, Mar 12, 2024 at 12:19 AM David Hildenbrand <david@redhat.com> wrote:
> > >>
> > >> On 08.03.24 08:49, Lance Yang wrote:
> > >>> The patch reduces the process visible downtime during hugepage
> > >>> collapse. This is achieved by pre-zeroing the hugepage before
> > >>> acquiring the mmap_lock (write mode) if nr_ptes_none >= 256, without
> > >>> affecting the efficiency of khugepaged.
> > >>>
> > >>> On an Intel Core i5 CPU, the process visible downtime during
> > >>> hugepage collapse is as follows:
> > >>>
> > >>> | nr_ptes_none  | w/o __GFP_ZERO | w/ __GFP_ZERO  |  Change |
> > >>> |---------------|----------------|----------------|---------|
> > >>> |      511      |     233us      |      95us      |  -59.21%|
> > >>> |      384      |     376us      |     219us      |  -41.20%|
> > >>> |      256      |     421us      |     323us      |  -23.28%|
> > >>> |      128      |     523us      |     507us      |   -3.06%|
> > >>>
> > >>> Of course, alloc_charge_hpage() will take longer to run with
> > >>> the __GFP_ZERO flag.
> > >>>
> > >>> |       Func           | w/o __GFP_ZERO | w/ __GFP_ZERO |
> > >>> |----------------------|----------------|---------------|
> > >>> | alloc_charge_hpage   |      198us     |      295us    |
> > >>>
> > >>> But it's not a big deal because it doesn't impact the total
> > >>> time spent by khugepaged in collapsing a hugepage. In fact,
> > >>> it would decrease.
> > >>
> > >> It does look sane to me and not overly complicated.
> > >>
> > >> But, it's an optimization really only when we have quite a bunch of
> > >> pte_none(), possibly repeatedly so that it really makes a difference.
> > >>
> > >> Usually, when we repeatedly collapse that many pte_none() we're just
> > >> wasting a lot of memory and should re-evaluate life choices :)
> > >
> > > Agreed! It seems that the default value of max_ptes_none may be set too
> > > high, which could result in the memory wastage issue we're discussing.
> >
> > IIRC, some companies disable it completely (set to 0) because of that.
> >
> > >
> > >>
> > >> So my question is: do we really care about it that much that we care to
> > >> optimize?
> > >
> > > IMO, although it may not be our main concern, reducing the impact of
> > > khugepaged on the process remains crucial. I think that users also prefer
> > > minimal interference from khugepaged.
> >
> > The problem I am having with this is that for the *common* case where we
> > have a small number of pte_none(), we cannot really optimize because we
> > have to perform the copy.
> >
> > So this feels like we're rather optimizing a corner case, and I am not
> > so sure if that is really worth it.
> >
> > Other thoughts?
>
> Another thought is to introduce khugepaged/alloc_zeroed_hpage for THP
> sysfs settings. This would enable users to decide whether to avoid unnecessary
> copies when nr_ptes_none > 0.
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
David Hildenbrand March 15, 2024, 12:18 p.m. UTC | #6
On 14.03.24 15:19, Lance Yang wrote:
> Another thought suggested by Bang Li is that we record which pte is none
> in hpage_collapse_scan_pmd. Then, before acquiring the mmap_lock (write mode),
> we will pre-zero pages as needed.

Here is my point of view: we cannot optimize the common case where we 
have mostly !pte_none() in a similar way.

So why do we care about the less common case? Is the process-visible
downtime reduction for that less common case really noticeable?

Or is it rather something that looks good in a micro-benchmark, but
won't really make any difference in actual applications (again, where
the common case will still result in the same downtime)?

I'm not against this; I rather wonder "do we really care". I'd like to
hear other opinions.

>>>>>
>>>>> So my question is: do we really care about it that much that we care to
>>>>> optimize?
>>>>
>>>> IMO, although it may not be our main concern, reducing the impact of
>>>> khugepaged on the process remains crucial. I think that users also prefer
>>>> minimal interference from khugepaged.
>>>
>>> The problem I am having with this is that for the *common* case where we
>>> have a small number of pte_none(), we cannot really optimize because we
>>> have to perform the copy.
>>>
>>> So this feels like we're rather optimizing a corner case, and I am not
>>> so sure if that is really worth it.
>>>
>>> Other thoughts?
>>
>> Another thought is to introduce khugepaged/alloc_zeroed_hpage for THP
>> sysfs settings. This would enable users to decide whether to avoid unnecessary
>> copies when nr_ptes_none > 0.

Hm, new toggles for that, not sure ... I'd much prefer something
without any new toggles, especially for this case.

Patch

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 38830174608f..a2872596b865 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -88,6 +88,7 @@  static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
 static unsigned int khugepaged_max_ptes_none __read_mostly;
 static unsigned int khugepaged_max_ptes_swap __read_mostly;
 static unsigned int khugepaged_max_ptes_shared __read_mostly;
+static unsigned int khugepaged_min_ptes_none_prezero __read_mostly;
 
 #define MM_SLOTS_HASH_BITS 10
 static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
@@ -96,6 +97,7 @@  static struct kmem_cache *mm_slot_cache __ro_after_init;
 
 struct collapse_control {
 	bool is_khugepaged;
+	bool alloc_zeroed_hpage;
 
 	/* Num pages scanned per node */
 	u32 node_load[MAX_NUMNODES];
@@ -396,6 +398,7 @@  int __init khugepaged_init(void)
 	khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
 	khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;
 	khugepaged_max_ptes_shared = HPAGE_PMD_NR / 2;
+	khugepaged_min_ptes_none_prezero = HPAGE_PMD_NR / 2;
 
 	return 0;
 }
@@ -782,6 +785,7 @@  static int __collapse_huge_page_copy(pte_t *pte,
 				     struct vm_area_struct *vma,
 				     unsigned long address,
 				     spinlock_t *ptl,
+				     struct collapse_control *cc,
 				     struct list_head *compound_pagelist)
 {
 	struct page *src_page;
@@ -797,7 +801,8 @@  static int __collapse_huge_page_copy(pte_t *pte,
 	     _pte++, page++, _address += PAGE_SIZE) {
 		pteval = ptep_get(_pte);
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
-			clear_user_highpage(page, _address);
+			if (!cc->alloc_zeroed_hpage)
+				clear_user_highpage(page, _address);
 			continue;
 		}
 		src_page = pte_page(pteval);
@@ -1067,6 +1072,9 @@  static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
 	int node = hpage_collapse_find_target_node(cc);
 	struct folio *folio;
 
+	if (cc->alloc_zeroed_hpage)
+		gfp |= __GFP_ZERO;
+
 	if (!hpage_collapse_alloc_folio(&folio, gfp, node, &cc->alloc_nmask)) {
 		*hpage = NULL;
 		return SCAN_ALLOC_HUGE_PAGE_FAIL;
@@ -1209,7 +1217,7 @@  static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	anon_vma_unlock_write(vma->anon_vma);
 
 	result = __collapse_huge_page_copy(pte, hpage, pmd, _pmd,
-					   vma, address, pte_ptl,
+					   vma, address, pte_ptl, cc,
 					   &compound_pagelist);
 	pte_unmap(pte);
 	if (unlikely(result != SCAN_SUCCEED))
@@ -1272,6 +1280,7 @@  static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
+	cc->alloc_zeroed_hpage = false;
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (!pte) {
 		result = SCAN_PMD_NULL;
@@ -1408,6 +1417,10 @@  static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
+		if (cc->is_khugepaged &&
+		    none_or_zero >= khugepaged_min_ptes_none_prezero)
+			cc->alloc_zeroed_hpage = true;
+
 		result = collapse_huge_page(mm, address, referenced,
 					    unmapped, cc);
 		/* collapse_huge_page will return with the mmap_lock released */
@@ -2054,7 +2067,8 @@  static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	index = start;
 	list_for_each_entry(page, &pagelist, lru) {
 		while (index < page->index) {
-			clear_highpage(hpage + (index % HPAGE_PMD_NR));
+			if (!cc->alloc_zeroed_hpage)
+				clear_highpage(hpage + (index % HPAGE_PMD_NR));
 			index++;
 		}
 		if (copy_mc_highpage(hpage + (page->index % HPAGE_PMD_NR), page) > 0) {
@@ -2064,7 +2078,8 @@  static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		index++;
 	}
 	while (index < end) {
-		clear_highpage(hpage + (index % HPAGE_PMD_NR));
+		if (!cc->alloc_zeroed_hpage)
+			clear_highpage(hpage + (index % HPAGE_PMD_NR));
 		index++;
 	}
 
@@ -2234,6 +2249,7 @@  static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 	swap = 0;
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
+	cc->alloc_zeroed_hpage = false;
 	rcu_read_lock();
 	xas_for_each(&xas, page, start + HPAGE_PMD_NR - 1) {
 		if (xas_retry(&xas, page))
@@ -2305,11 +2321,16 @@  static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 	rcu_read_unlock();
 
 	if (result == SCAN_SUCCEED) {
-		if (cc->is_khugepaged &&
-		    present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
+		if (!cc->is_khugepaged)
+			result = collapse_file(mm, addr, file, start, cc);
+		else if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
 			result = SCAN_EXCEED_NONE_PTE;
 			count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
 		} else {
+			if (HPAGE_PMD_NR - present >=
+			    khugepaged_min_ptes_none_prezero)
+				cc->alloc_zeroed_hpage = true;
+
 			result = collapse_file(mm, addr, file, start, cc);
 		}
 	}