
mm, hugetlb: Avoid double clearing for hugetlb pages

Message ID 20201019182853.7467-1-gpiccoli@canonical.com (mailing list archive)
State New, archived
Series mm, hugetlb: Avoid double clearing for hugetlb pages

Commit Message

Guilherme Piccoli Oct. 19, 2020, 6:28 p.m. UTC
Commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1
boot options") introduced the option of clearing/initializing kernel
pages at allocation time - this can be enabled either with a boot
parameter or a Kconfig setting. The goal of the change was kernel hardening.

Although the performance degradation of "init_on_alloc" is considered
low, there is one case in which it is noticeable and may impact system
latency: the allocation of hugetlb pages. Those pages are meant to be
used by userspace *only* (unlike THP, for example). At hugetlb
allocation time, their component pages go through the
initialization/clearing process in

 prep_new_page()
  kernel_init_free_pages()

and, when used in userspace mappings, the hugetlb pages are cleared
_again_; I've verified that in practice by running the kernel
selftest[0] for hugetlb mapping - the pages go through clear_huge_page()
on page fault [ see hugetlb_no_page() ].
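
For reference, the two clearing sites look roughly like this (heavily
simplified sketch, not verbatim from any particular kernel tree):

 /* mm/page_alloc.c: allocation-time clearing, reached via prep_new_page() */
 if (want_init_on_alloc(gfp_flags))
 	kernel_init_free_pages(page, 1 << order);

 /* mm/hugetlb.c: fault-time clearing in hugetlb_no_page() */
 clear_huge_page(page, address, pages_per_huge_page(h));
 __SetPageUptodate(page);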

This patch proposes a way to prevent this wasted work by skipping the
page initialization/clearing if the page is a component of a hugetlb
page (even if "init_on_alloc" or the respective Kconfig option is set).
The performance improvement measured in [1] demonstrates that the patch
is effective and brings the hugetlb allocation time down to the same
level as with "init_on_alloc" disabled. Although we used sysctl to
allocate hugetlb pages in our tests, the same delay happens at early
boot time when hugetlb parameters are set on the kernel cmdline (and
"init_on_alloc" is set).

[0] tools/testing/selftests/vm/map_hugetlb.c

[1] Test results - all tests were executed on a pristine 5.9+ kernel
from 2020-10-19, at commit 7cf726a594353010. A virtual machine with 96G
of total memory was used; the test consists of allocating 64G of 2M
hugetlb pages. Results below:

* Without this patch, init_on_alloc=1
$ cat /proc/meminfo |grep "MemA\|Hugetlb"
MemAvailable:   97892212 kB
Hugetlb:               0 kB

$ time echo 32768 > /proc/sys/vm/nr_hugepages
real    0m24.189s
user    0m0.000s
sys     0m24.184s

$ cat /proc/meminfo |grep "MemA\|Hugetlb"
MemAvailable:   30784732 kB
Hugetlb:        67108864 kB

* Without this patch, init_on_alloc=0
$ cat /proc/meminfo |grep "MemA\|Hugetlb"
MemAvailable:   97892752 kB
Hugetlb:               0 kB

$ time echo 32768 > /proc/sys/vm/nr_hugepages
real    0m0.316s
user    0m0.000s
sys     0m0.316s

$ cat /proc/meminfo |grep "MemA\|Hugetlb"
MemAvailable:   30783628 kB
Hugetlb:        67108864 kB

* WITH this patch, init_on_alloc=1
$ cat /proc/meminfo |grep "MemA\|Hugetlb"
MemAvailable:   97891952 kB
Hugetlb:               0 kB

$ time echo 32768 > /proc/sys/vm/nr_hugepages
real    0m0.209s
user    0m0.000s
sys     0m0.209s

$ cat /proc/meminfo |grep "MemA\|Hugetlb"
MemAvailable:   30782964 kB
Hugetlb:        67108864 kB

* WITH this patch, init_on_alloc=0
$ cat /proc/meminfo |grep "MemA\|Hugetlb"
MemAvailable:   97892620 kB
Hugetlb:               0 kB

$ time echo 32768 > /proc/sys/vm/nr_hugepages
real    0m0.206s
user    0m0.000s
sys     0m0.206s

$ cat /proc/meminfo |grep "MemA\|Hugetlb"
MemAvailable:   30783804 kB
Hugetlb:        67108864 kB

Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
---


Hi everybody, thanks in advance for the review/comments. I'd like to
point out 2 things related to the implementation:

1) I understand that adding GFP flags is not really welcomed by the
mm community; I considered passing this as a function parameter, but
that would be a hacky mess, so I decided to add the flag, since this
seems to be a fair use of the flag mechanism (to control actions on
pages). If anybody has a better/simpler suggestion on how to implement
this, I'm all ears - thanks!

2) The checkpatch script gave me the following error, but I decided to
ignore it in order to maintain the same format present in the file:

ERROR: space required after that close brace '}'
#171: FILE: include/trace/events/mmflags.h:52:


 include/linux/gfp.h            | 14 +++++++++-----
 include/linux/mm.h             |  2 +-
 include/trace/events/mmflags.h |  3 ++-
 mm/hugetlb.c                   |  7 +++++++
 tools/perf/builtin-kmem.c      |  1 +
 5 files changed, 20 insertions(+), 7 deletions(-)

Comments

Michal Hocko Oct. 20, 2020, 8:20 a.m. UTC | #1
On Mon 19-10-20 15:28:53, Guilherme G. Piccoli wrote:
[...]
> $ time echo 32768 > /proc/sys/vm/nr_hugepages
> real    0m24.189s
> user    0m0.000s
> sys     0m24.184s
> 
> $ cat /proc/meminfo |grep "MemA\|Hugetlb"
> MemAvailable:   30784732 kB
> Hugetlb:        67108864 kB
> 
> * Without this patch, init_on_alloc=0
> $ cat /proc/meminfo |grep "MemA\|Hugetlb"
> MemAvailable:   97892752 kB
> Hugetlb:               0 kB
> 
> $ time echo 32768 > /proc/sys/vm/nr_hugepages
> real    0m0.316s
> user    0m0.000s
> sys     0m0.316s

Yes, zeroing is quite costly and that is to be expected when the feature
is enabled. Hugetlb, like other allocator users, performs its own
initialization rather than going through the __GFP_ZERO path. More on
that below.

Could you be more specific about why this is a problem? The hugetlb pool
is usually preallocated once during early boot. 24s for 65GB of 2MB pages
is a non-trivial amount of time, but it doesn't look like a major
disaster either. If the pool is allocated later it can take much more
time due to memory fragmentation.

I definitely do not want to downplay this, but I would like to hear about
real-life examples of the problem.

[...]
> 
> Hi everybody, thanks in advance for the review/comments. I'd like to
> point 2 things related to the implementation:
> 
> 1) I understand that adding GFP flags is not really welcome by the
> mm community; I've considered passing that as function parameter but
> that would be a hacky mess, so I decided to add the flag since it seems
> this is a fair use of the flag mechanism (to control actions on pages).
> If anybody has a better/simpler suggestion to implement this, I'm all
> ears - thanks!

This has been discussed already (http://lkml.kernel.org/r/20190514143537.10435-4-glider@google.com).
Previously it has been brought up in the SLUB context AFAIR. Your numbers
are quite clear here, but do we really need a gfp flag, with all the
problems we tend to grow with them?

One potential way around this, specifically for hugetlb, would be to use
__GFP_ZERO when allocating from the page allocator and mark that fact in
the struct page while it is sitting in the pool. The page fault handler
could then skip the zeroing phase. Not an act of beauty TBH, but it
fits into the existing model of full control over initialization.
Btw. it would allow implementing init_on_free semantics as well. I
haven't implemented the actual two main methods,
hugetlb_test_clear_pre_init_page and hugetlb_mark_pre_init_page, because
I am not entirely sure about the current state of the hugetlb struct
page in the pool. But there should be a lot of room in there (or in the
tail pages). Mike will certainly know much better. But the skeleton of
the patch would look something like this (not even compile tested):

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index b5c109703daa..031af7cdf8a7 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -724,7 +724,8 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 			error = PTR_ERR(page);
 			goto out;
 		}
-		clear_huge_page(page, addr, pages_per_huge_page(h));
+		if (!hugetlb_test_clear_pre_init_page(page))
+			clear_huge_page(page, addr, pages_per_huge_page(h));
 		__SetPageUptodate(page);
 		error = huge_add_to_page_cache(page, mapping, index);
 		if (unlikely(error)) {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 67fc6383995b..83cc8abb4d69 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1413,6 +1413,7 @@ static void __free_huge_page(struct page *page)
 	page->mapping = NULL;
 	restore_reserve = PagePrivate(page);
 	ClearPagePrivate(page);
+	hugetlb_test_clear_pre_init_page(page);
 
 	/*
 	 * If PagePrivate() was set on page, page allocation consumed a
@@ -1703,6 +1704,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 	int order = huge_page_order(h);
 	struct page *page;
 	bool alloc_try_hard = true;
+	bool pre_init = false;
 
 	/*
 	 * By default we always try hard to allocate the page with
@@ -1718,10 +1720,18 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 		gfp_mask |= __GFP_RETRY_MAYFAIL;
 	if (nid == NUMA_NO_NODE)
 		nid = numa_mem_id();
+
+	/* prevent from double initialization */
+	if (want_init_on_alloc(gfp_mask)) {
+		gfp_mask |= __GFP_ZERO;
+		pre_init = true;
+	}
+
 	page = __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
-	if (page)
+	if (page) {
 		__count_vm_event(HTLB_BUDDY_PGALLOC);
-	else
+		hugetlb_mark_pre_init_page(page);
+	} else
 		__count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
 
 	/*
@@ -4221,6 +4231,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_release_all;
 	}
 
+	hugetlb_test_clear_pre_init_page(new_page);
 	copy_user_huge_page(new_page, old_page, address, vma,
 			    pages_per_huge_page(h));
 	__SetPageUptodate(new_page);
@@ -4411,7 +4422,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			ret = vmf_error(PTR_ERR(page));
 			goto out;
 		}
-		clear_huge_page(page, address, pages_per_huge_page(h));
+		if (!hugetlb_test_clear_pre_init_page(page))
+			clear_huge_page(page, address, pages_per_huge_page(h));
 		__SetPageUptodate(page);
 		new_page = true;
 
@@ -4709,6 +4721,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		if (IS_ERR(page))
 			goto out;
 
+		hugetlb_test_clear_pre_init_page(page);
 		ret = copy_huge_page_from_user(page,
 						(const void __user *) src_addr,
 						pages_per_huge_page(h), false);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index eddbe4e56c73..8cc1fc9c4d13 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -525,7 +525,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	unsigned long flags = qp->flags;
 	int ret;
 	bool has_unmovable = false;
-	pte_t *pte;
+	pte_t *pte, *mapped_pte;
 	spinlock_t *ptl;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
@@ -539,7 +539,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
-	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		if (!pte_present(*pte))
 			continue;
@@ -571,7 +571,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 		} else
 			break;
 	}
-	pte_unmap_unlock(pte - 1, ptl);
+	pte_unmap_unlock(mapped_pte, ptl);
 	cond_resched();
 
 	if (has_unmovable)
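
As a reference only (not part of the original mail), the two missing
helpers could be sketched as below, under the assumption - the open
question deferred to Mike above - that a bit in the first tail page's
private field is unused while the page sits in the hugetlb pool. The
bit number and helper placement are hypothetical:

/* hypothetical: a spare bit in the first tail page's private field */
#define HPAGE_PRE_INIT_BIT	0

static inline void hugetlb_mark_pre_init_page(struct page *page)
{
	/* page[1] is the first tail page of the compound huge page */
	set_bit(HPAGE_PRE_INIT_BIT, &page[1].private);
}

static inline bool hugetlb_test_clear_pre_init_page(struct page *page)
{
	return test_and_clear_bit(HPAGE_PRE_INIT_BIT, &page[1].private);
}
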
David Hildenbrand Oct. 20, 2020, 1:36 p.m. UTC | #2
On 20.10.20 10:20, Michal Hocko wrote:
> On Mon 19-10-20 15:28:53, Guilherme G. Piccoli wrote:
> [...]
>> $ time echo 32768 > /proc/sys/vm/nr_hugepages
>> real    0m24.189s
>> user    0m0.000s
>> sys     0m24.184s
>>
>> $ cat /proc/meminfo |grep "MemA\|Hugetlb"
>> MemAvailable:   30784732 kB
>> Hugetlb:        67108864 kB
>>
>> * Without this patch, init_on_alloc=0
>> $ cat /proc/meminfo |grep "MemA\|Hugetlb"
>> MemAvailable:   97892752 kB
>> Hugetlb:               0 kB
>>
>> $ time echo 32768 > /proc/sys/vm/nr_hugepages
>> real    0m0.316s
>> user    0m0.000s
>> sys     0m0.316s
> 
> Yes zeroying is quite costly and that is to be expected when the feature
> is enabled. Hugetlb like other allocator users perform their own
> initialization rather than go through __GFP_ZERO path. More on that
> below.
> 
> Could you be more specific about why this is a problem. Hugetlb pool is
> usualy preallocatd once during early boot. 24s for 65GB of 2MB pages
> is non trivial amount of time but it doens't look like a major disaster
> either. If the pool is allocated later it can take much more time due to
> memory fragmentation.
> 
> I definitely do not want to downplay this but I would like to hear about
> the real life examples of the problem.
> 
> [...]
>>
>> Hi everybody, thanks in advance for the review/comments. I'd like to
>> point 2 things related to the implementation:
>>
>> 1) I understand that adding GFP flags is not really welcome by the
>> mm community; I've considered passing that as function parameter but
>> that would be a hacky mess, so I decided to add the flag since it seems
>> this is a fair use of the flag mechanism (to control actions on pages).
>> If anybody has a better/simpler suggestion to implement this, I'm all
>> ears - thanks!
> 
> This has been discussed already (http://lkml.kernel.org/r/20190514143537.10435-4-glider@google.com.
> Previously it has been brought up in SLUB context AFAIR. Your numbers
> are quite clear here but do we really need a gfp flag with all the
> problems we tend to grow in with them?
> 
> One potential way around this specifically for hugetlb would be to use
> __GFP_ZERO when allocating from the allocator and marking the fact in
> the struct page while it is sitting in the pool. Page fault handler
> could then skip the zeroying phase. Not an act of beauty TBH but it
> fits into the existing model of the full control over initialization.
> Btw. it would allow to implement init_on_free semantic as well. I
> haven't implemented the actual two main methods
> hugetlb_test_clear_pre_init_page and hugetlb_mark_pre_init_page because
> I am not entirely sure about the current state of hugetlb struct page in
> the pool. But there should be a lot of room in there (or in tail pages).
> Mike will certainly know much better. But the skeleton of the patch
> would look like something like this (not even compile tested).

Something like that is certainly nicer than the proposed gfp flags.
(__GFP_NOINIT_ON_ALLOC is just ugly, especially for optimizing such
corner-case features.)
Mike Kravetz Oct. 20, 2020, 4:55 p.m. UTC | #3
On 10/20/20 1:20 AM, Michal Hocko wrote:
> On Mon 19-10-20 15:28:53, Guilherme G. Piccoli wrote:
> 
> Yes zeroying is quite costly and that is to be expected when the feature
> is enabled. Hugetlb like other allocator users perform their own
> initialization rather than go through __GFP_ZERO path. More on that
> below.
> 
> Could you be more specific about why this is a problem. Hugetlb pool is
> usualy preallocatd once during early boot. 24s for 65GB of 2MB pages
> is non trivial amount of time but it doens't look like a major disaster
> either. If the pool is allocated later it can take much more time due to
> memory fragmentation.
> 
> I definitely do not want to downplay this but I would like to hear about
> the real life examples of the problem.
> 
> [...]
>>
>> Hi everybody, thanks in advance for the review/comments. I'd like to
>> point 2 things related to the implementation:
>>
>> 1) I understand that adding GFP flags is not really welcome by the
>> mm community; I've considered passing that as function parameter but
>> that would be a hacky mess, so I decided to add the flag since it seems
>> this is a fair use of the flag mechanism (to control actions on pages).
>> If anybody has a better/simpler suggestion to implement this, I'm all
>> ears - thanks!
> 
> This has been discussed already (http://lkml.kernel.org/r/20190514143537.10435-4-glider@google.com.
> Previously it has been brought up in SLUB context AFAIR. Your numbers
> are quite clear here but do we really need a gfp flag with all the
> problems we tend to grow in with them?
> 
> One potential way around this specifically for hugetlb would be to use
> __GFP_ZERO when allocating from the allocator and marking the fact in
> the struct page while it is sitting in the pool. Page fault handler
> could then skip the zeroying phase. Not an act of beauty TBH but it
> fits into the existing model of the full control over initialization.
> Btw. it would allow to implement init_on_free semantic as well. I
> haven't implemented the actual two main methods
> hugetlb_test_clear_pre_init_page and hugetlb_mark_pre_init_page because
> I am not entirely sure about the current state of hugetlb struct page in
> the pool. But there should be a lot of room in there (or in tail pages).
> Mike will certainly know much better. But the skeleton of the patch
> would look like something like this (not even compile tested).

Thanks Michal.  I was not involved in the discussions for init_on_alloc,
so was waiting for someone else to comment.

My first thought was to also do as you propose: skip the clear on page
fault if the page was already cleared at allocation time.  Yes, there
should be plenty of room to store this state while huge pages are in the pool.

Of course, users will still see the delays at allocation time pointed
out in the commit message.  I guess that should be expected.  We do have
users who allocate over 1TB of huge pages via sysctl.  Those pages are
used and cleared via page faults, but not necessarily all at the
same time.  If such users ever set init_on_alloc they would see a
huge delay.  My 'guess' is that such users are unlikely to ever use
init_on_alloc or init_on_free for general performance reasons.
Guilherme Piccoli Oct. 20, 2020, 7:19 p.m. UTC | #4
Hi Michal, thanks a lot for your thorough response. I'll address the
comments inline, below. Thanks also to David and Mike - in fact, I almost
don't need to respond here after Mike; he went right to the point I'm
going to discuss heh...


On 20/10/2020 05:20, Michal Hocko wrote:
> 
> Yes zeroying is quite costly and that is to be expected when the feature
> is enabled. Hugetlb like other allocator users perform their own
> initialization rather than go through __GFP_ZERO path. More on that
> below.
> 
> Could you be more specific about why this is a problem. Hugetlb pool is
> usualy preallocatd once during early boot. 24s for 65GB of 2MB pages
> is non trivial amount of time but it doens't look like a major disaster
> either. If the pool is allocated later it can take much more time due to
> memory fragmentation.
> 
> I definitely do not want to downplay this but I would like to hear about
> the real life examples of the problem.

Indeed, 24s of delay (!) is not so harmful for boot time, but...64G was
just my simple test in a guest; the real case is much worse! It aligns
with Mike's comment: we have complaints of minute-long delays, due to a
very big pool of hugepages being allocated.

Users have their own methodology for allocating pages; some prefer to
do that "later" for a variety of reasons, so early boot time allocations
are not always used - that shouldn't be the only focus of the discussion
here.
In the specific report I had, the user complained about more than 3
minutes to allocate ~542G of 2M hugetlb pages.

Now, you'll ask why in the heck they are using init_on_alloc then -
right? The Kconfig option "CONFIG_INIT_ON_ALLOC_DEFAULT_ON" is set
by default in Ubuntu, for hardening reasons. So, the workaround for the
users complaining of delays in allocating hugetlb pages is currently to
set "init_on_alloc" to 0. It's a bit lame to ask users to disable such a
hardening feature just because we have a double initialization in hugetlb...


> 
> 
> This has been discussed already (http://lkml.kernel.org/r/20190514143537.10435-4-glider@google.com.
> Previously it has been brought up in SLUB context AFAIR. Your numbers
> are quite clear here but do we really need a gfp flag with all the
> problems we tend to grow in with them?
> 
> One potential way around this specifically for hugetlb would be to use
> __GFP_ZERO when allocating from the allocator and marking the fact in
> the struct page while it is sitting in the pool. Page fault handler
> could then skip the zeroying phase. Not an act of beauty TBH but it
> fits into the existing model of the full control over initialization.
> Btw. it would allow to implement init_on_free semantic as well. I
> haven't implemented the actual two main methods
> hugetlb_test_clear_pre_init_page and hugetlb_mark_pre_init_page because
> I am not entirely sure about the current state of hugetlb struct page in
> the pool. But there should be a lot of room in there (or in tail pages).
> Mike will certainly know much better. But the skeleton of the patch
> would look like something like this (not even compile tested).
> [code...]

Thanks a lot for pointing me to the previous discussion! I should have
done my homework properly and read all versions of the patchset...my
bad! I'm glad to see this problem was discussed and considered early in
the patch submission; I guess it only missed more real-world numbers.

Your approach seems interesting, but as per Mike's response (which seems
to have anticipated all my arguments heheh) your approach is a bit
reversed, solving a ""non-existent"" problem (zeroing hugetlb pages
at fault time), whereas the big problem tentatively fixed here is the
massive delay at allocation time of the hugetlb pages.

I understand that your suggestion has no burden of introducing more GFP
flags, and I agree that those are potentially dangerous if misused (and
I totally agree with David that __GFP_NOINIT_ON_ALLOC is heinous, I'd
rather go with the originally proposed __GFP_NO_AUTOINIT), but...
wouldn't that be letting the code drive a design decision? Like "oh,
adding a flag is so bad...better just let this bug/perf issue stay".

I agree with the arguments here, don't get me wrong - especially since
I'm far from being any kind of mm expert, I trust your judgement that
GFP flags are the utmost villains, but at the same time I'd rather not
change something (like the hugetlb zeroing code) that is not really
fixing the issue discussed here. I'm open to other suggestions, of
course, but the GFP flag seems the least hacky way of fixing this, and
ultimately, the flags are meant for this, right? Controlling page
behavior.

About misuse of a GFP flag, this is a risk for every "API" in the kernel,
and we rely on the (knowingly great) kernel review process to block
that. We could even have a more "terrifying" comment around the
flag, asking new users to CC all relevant people in the patch
submission before using it...

Anyway, thanks a bunch for the good points raised here, Michal, David and
Mike, and I appreciate your patience with somebody trying to mess with
your GFP flags. Let me know your thoughts!

Cheers,


Guilherme
David Hildenbrand Oct. 20, 2020, 8:07 p.m. UTC | #5
On 20.10.20 21:19, Guilherme G. Piccoli wrote:
> Hi Michal, thanks a lot for your thorough response. I'll address the
> comments inline, below. Thanks also David and Mike - in fact, I almost
> don't need to respond here after Mike, he was right to the point I'm
> going to discuss heh...
> 
> 
> On 20/10/2020 05:20, Michal Hocko wrote:
>>
>> Yes zeroying is quite costly and that is to be expected when the feature
>> is enabled. Hugetlb like other allocator users perform their own
>> initialization rather than go through __GFP_ZERO path. More on that
>> below.
>>
>> Could you be more specific about why this is a problem. Hugetlb pool is
>> usualy preallocatd once during early boot. 24s for 65GB of 2MB pages
>> is non trivial amount of time but it doens't look like a major disaster
>> either. If the pool is allocated later it can take much more time due to
>> memory fragmentation.
>>
>> I definitely do not want to downplay this but I would like to hear about
>> the real life examples of the problem.
> 
> Indeed, 24s of delay (!) is not so harmful for boot time, but...64G was
> just my simple test in a guest, the real case is much worse! It aligns
> with Mike's comment, we have complains of minute-like delays, due to a
> very big pool of hugepages being allocated.
> 
> Users have their own methodology for allocating pages, some would prefer
> do that "later" for a variety of reasons, so early boot time allocations
> are not always used, that shouldn't be the only focus of the discussion
> here.
> In the specific report I had, the user complains about more than 3
> minutes to allocate ~542G of 2M hugetlb pages.
> 
> Now, you'll ask why in the heck they are using init_on_alloc then -
> right? So, the Kconfig option "CONFIG_INIT_ON_ALLOC_DEFAULT_ON" is set
> by default in Ubuntu, for hardening reasons. So, the workaround for the
> users complaining of delays in allocating hugetlb pages currently is to
> set "init_on_alloc" to 0. It's a bit lame to ask users to disable such
> hardening thing just because we have a double initialization in hugetlb...
> 
> 
>>
>>
>> This has been discussed already (http://lkml.kernel.org/r/20190514143537.10435-4-glider@google.com.
>> Previously it has been brought up in SLUB context AFAIR. Your numbers
>> are quite clear here but do we really need a gfp flag with all the
>> problems we tend to grow in with them?
>>
>> One potential way around this specifically for hugetlb would be to use
>> __GFP_ZERO when allocating from the allocator and marking the fact in
>> the struct page while it is sitting in the pool. Page fault handler
>> could then skip the zeroying phase. Not an act of beauty TBH but it
>> fits into the existing model of the full control over initialization.
>> Btw. it would allow to implement init_on_free semantic as well. I
>> haven't implemented the actual two main methods
>> hugetlb_test_clear_pre_init_page and hugetlb_mark_pre_init_page because
>> I am not entirely sure about the current state of hugetlb struct page in
>> the pool. But there should be a lot of room in there (or in tail pages).
>> Mike will certainly know much better. But the skeleton of the patch
>> would look like something like this (not even compile tested).
>> [code...]
> 
> Thanks a lot for pointing the previous discussion for me! I should have
> done my homework properly and read all versions of the patchset...my
> bad! I'm glad to see this problem was discussed and considered early in
> the patch submission, I guess it only missed more real-world numbers.
> 
> Your approach seems interesting, but as per Mike's response (which seems
> to have anticipated all my arguments heheh) your approach is a bit
> reversed, solving a ""non-existent"" problem (of zeroing hugetlb pages
> in fault time), whereas the big problem hereby tentatively fixed is the
> massive delay on allocation time of the hugetlb pages.
> 
> I understand that your suggestion has no burden of introducing more GFP
> flags, and I agree that those are potentially dangerous if misused (and
> I totally agree with David that __GFP_NOINIT_ON_ALLOC is heinous, I'd
> rather go with the originally proposed __GFP_NO_AUTOINIT), but...
> wouldn't it be letting the code just drive a design decision? Like "oh,
> adding a flag is so bad..better just let this bug/perf issue to stay".

The main problem I have is that the page alloc code does some internal
page allocator things ("init_on_alloc" - "Fill newly allocated pages and
heap objects with zeroes"), and we're allowing users of the page alloc
code *that really shouldn't have to care* to override that behavior,
exposing unnecessary complexity. Mainly: other allocators.

"__GFP_NOINIT_ON_ALLOC" - what exactly does it do?
"__GFP_NO_AUTOINIT" - what exactly does it do?

__GFP_ZERO set: page always zero.
__GFP_ZERO not set: page zero with init_on_alloc, page not necessarily
                    zero without init_on_alloc. Users can find out by
                    looking at init_on_alloc.

IMHO, even something like __GFP_DONT_ZERO would be clearer. But I still
somewhat don't like letting users of the buddy allocator override
configured behavior. Yes, it could be used by other allocators (like
hugetlb) to optimize.

But it could also be used by any driver wanting to optimize the
"init_on_alloc" case, potentially introducing security issues because the
code tries to be smart.
Guilherme Piccoli Oct. 20, 2020, 8:19 p.m. UTC | #6
When I first wrote this, the design was a bit different: the flag was
called __GFP_HTLB_PAGE or something like that. The idea was to
signal/mark the component pages of hugetlb as exactly this: pages
composing a huge page of the hugetlb "type". Then, I skipped the
"init_on_alloc" thing for such pages.

If your concern is more about semantics (or about giving multiple users,
like drivers, the power to try to "optimize" their code and skip this
security feature), I think my first approach was better! That way, the
flag would be restricted to hugetlb usage only.
I changed my mind about that approach before submitting, for 2 reasons:

(a) It feels like a waste of resources to have a GFP flag *only* to
signal regular pages composing hugetlb pages - a single user, forever!
(b) Having 2 conditional settings in __GFP_BITS_SHIFT (LOCKDEP and
HUGETLB) started to make this define a bit tricky to code, since we'd
have 2 Kconfig-conditional bits to set (see the sketch below).
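
As a hypothetical illustration of point (b) (not from the original
mail), with two Kconfig-conditional flags both the bit values and the
bit count in include/linux/gfp.h would have to become conditional,
something like:

#ifdef CONFIG_HUGETLB_PAGE
#define ___GFP_HTLB_PAGE	0x800000u
#else
#define ___GFP_HTLB_PAGE	0
#endif

#ifdef CONFIG_LOCKDEP
/* the lockdep bit has to shift depending on whether the hugetlb bit exists */
#define ___GFP_NOLOCKDEP	(___GFP_HTLB_PAGE ? 0x1000000u : 0x800000u)
#else
#define ___GFP_NOLOCKDEP	0
#endif

#define __GFP_BITS_SHIFT \
	(23 + IS_ENABLED(CONFIG_HUGETLB_PAGE) + IS_ENABLED(CONFIG_LOCKDEP))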

So, I've moved to this other approach, hereby submitted.
Cheers,


Guilherme
David Hildenbrand Oct. 20, 2020, 8:28 p.m. UTC | #7
On 20.10.20 22:07, David Hildenbrand wrote:
> On 20.10.20 21:19, Guilherme G. Piccoli wrote:
>> Hi Michal, thanks a lot for your thorough response. I'll address the
>> comments inline, below. Thanks also David and Mike - in fact, I almost
>> don't need to respond here after Mike, he was right to the point I'm
>> going to discuss heh...
>>
>>
>> On 20/10/2020 05:20, Michal Hocko wrote:
>>>
>>> Yes zeroying is quite costly and that is to be expected when the feature
>>> is enabled. Hugetlb like other allocator users perform their own
>>> initialization rather than go through __GFP_ZERO path. More on that
>>> below.
>>>
>>> Could you be more specific about why this is a problem. Hugetlb pool is
>>> usualy preallocatd once during early boot. 24s for 65GB of 2MB pages
>>> is non trivial amount of time but it doens't look like a major disaster
>>> either. If the pool is allocated later it can take much more time due to
>>> memory fragmentation.
>>>
>>> I definitely do not want to downplay this but I would like to hear about
>>> the real life examples of the problem.
>>
>> Indeed, 24s of delay (!) is not so harmful for boot time, but...64G was
>> just my simple test in a guest, the real case is much worse! It aligns
>> with Mike's comment, we have complains of minute-like delays, due to a
>> very big pool of hugepages being allocated.
>>
>> Users have their own methodology for allocating pages, some would prefer
>> do that "later" for a variety of reasons, so early boot time allocations
>> are not always used, that shouldn't be the only focus of the discussion
>> here.
>> In the specific report I had, the user complains about more than 3
>> minutes to allocate ~542G of 2M hugetlb pages.
>>
>> Now, you'll ask why in the heck they are using init_on_alloc then -
>> right? So, the Kconfig option "CONFIG_INIT_ON_ALLOC_DEFAULT_ON" is set
>> by default in Ubuntu, for hardening reasons. So, the workaround for the
>> users complaining of delays in allocating hugetlb pages currently is to
>> set "init_on_alloc" to 0. It's a bit lame to ask users to disable such
>> hardening thing just because we have a double initialization in hugetlb...
>>
>>
>>>
>>>
>>> This has been discussed already (http://lkml.kernel.org/r/20190514143537.10435-4-glider@google.com.
>>> Previously it has been brought up in SLUB context AFAIR. Your numbers
>>> are quite clear here but do we really need a gfp flag with all the
>>> problems we tend to grow in with them?
>>>
>>> One potential way around this specifically for hugetlb would be to use
>>> __GFP_ZERO when allocating from the allocator and marking the fact in
>>> the struct page while it is sitting in the pool. Page fault handler
>>> could then skip the zeroying phase. Not an act of beauty TBH but it
>>> fits into the existing model of the full control over initialization.
>>> Btw. it would allow to implement init_on_free semantic as well. I
>>> haven't implemented the actual two main methods
>>> hugetlb_test_clear_pre_init_page and hugetlb_mark_pre_init_page because
>>> I am not entirely sure about the current state of hugetlb struct page in
>>> the pool. But there should be a lot of room in there (or in tail pages).
>>> Mike will certainly know much better. But the skeleton of the patch
>>> would look like something like this (not even compile tested).
>>> [code...]
>>
>> Thanks a lot for pointing the previous discussion for me! I should have
>> done my homework properly and read all versions of the patchset...my
>> bad! I'm glad to see this problem was discussed and considered early in
>> the patch submission, I guess it only missed more real-world numbers.
>>
>> Your approach seems interesting, but as per Mike's response (which seems
>> to have anticipated all my arguments heheh) your approach is a bit
>> reversed, solving a ""non-existent"" problem (of zeroing hugetlb pages
>> in fault time), whereas the big problem hereby tentatively fixed is the
>> massive delay on allocation time of the hugetlb pages.
>>
>> I understand that your suggestion has no burden of introducing more GFP
>> flags, and I agree that those are potentially dangerous if misused (and
>> I totally agree with David that __GFP_NOINIT_ON_ALLOC is heinous, I'd
>> rather go with the originally proposed __GFP_NO_AUTOINIT), but...
>> wouldn't it be letting the code just drive a design decision? Like "oh,
>> adding a flag is so bad..better just let this bug/perf issue to stay".
> 
> The main problem I have is that page alloc code does some internal page
> allocator things ("init_on_alloc" - "Fill newly allocated pages and heap
> objects with zeroes"), and we're allowing users of page alloc code *that
> really shouldn't have to care* to override that behavior, exposing
> unnecessary complexity. Mainly: other allocators.
> 
> "__GFP_NOINIT_ON_ALLOC" - what exactly does it do?
> "__GFP_NO_AUTOINIT" - what exactly does it do?
> 
> __GFP_ZERO set: page always zero.
> __GFP_ZERO not set: page zero with init_on_alloc, page not necessarily
>                     zero without init_on_alloc. Users can find out by 	
>                     looking at init_on_alloc.
> 
> IMHO, even something like __GFP_DONT_ZERO would be clearer. But I still
> somewhat don't like letting users of the buddy override configured
> behavior. Yes, it could be used by other alloactors (like hugetlb) to
> optimize.
> 
> But it could also be used by any driver wanting to optimize the
> "init_on_alloc" case, eventually introducing security issues because the
> code tries to be smart.
> 

BTW, there might be other users for something like __GFP_DONT_ZERO.

Especially memory ballooning drivers (and virtio-mem), where the
hypervisor is (WHP) going to zap the page either way after allocation.
You just cannot assume, when freeing such a page again, that it's
actually zero.

But then, somebody told the system to suffer ("init_on_alloc"), so there
isn't too much motivation to optimize such corner cases.
Michal Hocko Oct. 21, 2020, 6:15 a.m. UTC | #8
On Tue 20-10-20 16:19:06, Guilherme G. Piccoli wrote:
> On 20/10/2020 05:20, Michal Hocko wrote:
> > 
> > Yes zeroying is quite costly and that is to be expected when the feature
> > is enabled. Hugetlb like other allocator users perform their own
> > initialization rather than go through __GFP_ZERO path. More on that
> > below.
> > 
> > Could you be more specific about why this is a problem. Hugetlb pool is
> > usualy preallocatd once during early boot. 24s for 65GB of 2MB pages
> > is non trivial amount of time but it doens't look like a major disaster
> > either. If the pool is allocated later it can take much more time due to
> > memory fragmentation.
> > 
> > I definitely do not want to downplay this but I would like to hear about
> > the real life examples of the problem.
> 
> Indeed, 24s of delay (!) is not so harmful for boot time, but...64G was
> just my simple test in a guest, the real case is much worse! It aligns
> with Mike's comment, we have complains of minute-like delays, due to a
> very big pool of hugepages being allocated.

The cost of page clearing is mostly a constant overhead, so it is quite
natural to see the time scale with the number of pages. That overhead
has to happen at some point in time. Sure, it is more visible when
allocating at boot time or when doing pre-allocation during runtime.
The page fault path would then be faster. The overhead just moves to a
different place. So I am not sure this is really a strong argument to
hold.

[...]

> Now, you'll ask why in the heck they are using init_on_alloc then -
> right? So, the Kconfig option "CONFIG_INIT_ON_ALLOC_DEFAULT_ON" is set
> by default in Ubuntu, for hardening reasons.

This is not really that important as long as you properly explain to
your users what to expect from the default configuration. The hardening
aspect of this default might be really valuable, but it comes with a
price. A non-trivial one. The example you have highlighted is just one
of many - one we can actually work around, although the more I think
about it the less I am convinced this is a good idea.

Effectively _any_ !__GFP_ZERO allocation suffers from double
initialization in one form or another. In the worst case the whole
content of the page gets overwritten. Consider any page cache allocation
for a read, for example.  This is a GFP_*USER* request that gets
overwritten by the file content. Is this visible? Not really on a single
page, but consider how many pages you read from disk and you will get a
considerable overhead. Should we exempt these allocations as well to
reduce the overhead? I don't think so. That directly undermines the
objective of the hardening. AFAIU the whole point of init_on_alloc is to
prevent previous content from leaking to a new user without relying on
the new user doing the right thing and initializing everything properly.

Hugetlb is no different here. It just doesn't bother to implement what
init_on_{alloc,free} promises. If there is a bug in the hugetlb fault
handler then you can still potentially leak data from one user to
another. A GFP_I_WANT_TO_OPT_OUT flag effectively increases the
number of users who would like to reduce overhead and risk a data leak.
The power of unconditional initialization is in the fact that it is
trivial to check that nothing gets missed.
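
As an illustration of that page-cache example (sketch only, not part of
the original mail; the wrapper function name is hypothetical, the
helpers are from linux/pagemap.h):

static int example_read_one_page(struct address_space *mapping,
				 pgoff_t index, struct file *file)
{
	/*
	 * No __GFP_ZERO here: with init_on_alloc=1 the page allocator
	 * zeroes the freshly allocated page cache page, and the
	 * ->readpage() issued internally by read_mapping_page()
	 * immediately overwrites it with file data.
	 */
	struct page *page = read_mapping_page(mapping, index, file);

	if (IS_ERR(page))
		return PTR_ERR(page);

	put_page(page);	/* drop the page cache lookup reference */
	return 0;
}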

> So, the workaround for the
> users complaining of delays in allocating hugetlb pages currently is to
> set "init_on_alloc" to 0. It's a bit lame to ask users to disable such
> hardening thing just because we have a double initialization in hugetlb...

It is not lame. It expresses that people do not want to pay the
additional price for the hardening, or that they are not aware of the
cost/benefit trade-off here.
 
> Your approach seems interesting, but as per Mike's response (which seems
> to have anticipated all my arguments heheh) your approach is a bit
> reversed, solving a ""non-existent"" problem (of zeroing hugetlb pages
> in fault time), whereas the big problem hereby tentatively fixed is the
> massive delay on allocation time of the hugetlb pages.

Yes, my approach is not great either, because it breaks the core
assumption of init_on_alloc, which is that the initialization is done
at a well-defined spot. It has to go to all callers of the hugetlb
allocator (alloc_buddy_huge_page) and fix them up. The proper thing to
do would be to do the initialization (or opt out if the page is
pre-zeroed, either by the page allocator or by init_on_free) in
alloc_buddy_huge_page.

[...]

> About misuse of a GFP flag, this is a risk for every "API" on kernel,
> and we rely in the (knowingly great) kernel review process to block
> that.

Been there, done that. There were and still are many examples. I have
spent non-trivial time on cleanups. My experience is that the review
process simply doesn't stop random drivers or subsystems from doing
what they think is right.

> We could even have a more "terrifying" comment there around the
> flag, asking new users to CC all relevant involved people in the patch
> submission before using that...

Been there, done that. There is always room to improve, though.
Michal Hocko Oct. 21, 2020, 6:25 a.m. UTC | #9
On Tue 20-10-20 17:19:42, Guilherme Piccoli wrote:
> When I first wrote that, the design was a bit different, the flag was
> called __GFP_HTLB_PAGE or something like that. The design was to
> signal/mark the composing pages of hugetlb as exactly this: they are
> pages composing a huge page of hugetlb "type". Then, I skipped the
> "init_on_alloc" thing for such pages.

As pointed out in the other email, this is not about hugetlb, although
hugetlb might be more visible than the others, which each add just a
tiny bit to the overall overhead. Each page cache read, CoW and many,
many other !__GFP_ZERO users are in the same position when they double
initialize. A dedicated __GFP_HTLB_PAGE is really focusing on the wrong
side of the problem. We do have __GFP_ZERO for a good reason, and that
is to optimize the initialization. init_on_alloc effectively goes
against this approach with a "potentially broken code" philosophy in
mind, and that is good as a hardening measure indeed. But it comes with
an increased overhead and/or shifts the layer where the overhead
happens. Sure, there is some room to optimize the code here and there,
but the primary idea of the hardening is to make the initialization dead
trivial and make it clear that nothing can sneak out.
David Hildenbrand Oct. 21, 2020, 9:50 a.m. UTC | #10
On 21.10.20 08:15, Michal Hocko wrote:
> On Tue 20-10-20 16:19:06, Guilherme G. Piccoli wrote:
>> On 20/10/2020 05:20, Michal Hocko wrote:
>>>
>>> Yes zeroying is quite costly and that is to be expected when the feature
>>> is enabled. Hugetlb like other allocator users perform their own
>>> initialization rather than go through __GFP_ZERO path. More on that
>>> below.
>>>
>>> Could you be more specific about why this is a problem. Hugetlb pool is
>>> usualy preallocatd once during early boot. 24s for 65GB of 2MB pages
>>> is non trivial amount of time but it doens't look like a major disaster
>>> either. If the pool is allocated later it can take much more time due to
>>> memory fragmentation.
>>>
>>> I definitely do not want to downplay this but I would like to hear about
>>> the real life examples of the problem.
>>
>> Indeed, 24s of delay (!) is not so harmful for boot time, but...64G was
>> just my simple test in a guest, the real case is much worse! It aligns
>> with Mike's comment, we have complains of minute-like delays, due to a
>> very big pool of hugepages being allocated.
> 
> The cost of page clearing is mostly a constant overhead so it is quite
> natural to see the time scaling with the number of pages. That overhead
> has to happen at some point of time. Sure it is more visible when
> allocating during boot time resp. when doing pre-allocation during
> runtime. The page fault path would be then faster. The overhead just
> moves to a different place. So I am not sure this is really a strong
> argument to hold.

We have people complaining that starting VMs backed by hugetlbfs takes
too long; they would much rather have that initialization done when
booting the hypervisor ...

So it looks like there is no right or wrong.
Michal Hocko Oct. 21, 2020, 11:31 a.m. UTC | #11
On Wed 21-10-20 11:50:48, David Hildenbrand wrote:
> On 21.10.20 08:15, Michal Hocko wrote:
> > On Tue 20-10-20 16:19:06, Guilherme G. Piccoli wrote:
> >> On 20/10/2020 05:20, Michal Hocko wrote:
> >>>
> >>> Yes zeroying is quite costly and that is to be expected when the feature
> >>> is enabled. Hugetlb like other allocator users perform their own
> >>> initialization rather than go through __GFP_ZERO path. More on that
> >>> below.
> >>>
> >>> Could you be more specific about why this is a problem. Hugetlb pool is
> >>> usualy preallocatd once during early boot. 24s for 65GB of 2MB pages
> >>> is non trivial amount of time but it doens't look like a major disaster
> >>> either. If the pool is allocated later it can take much more time due to
> >>> memory fragmentation.
> >>>
> >>> I definitely do not want to downplay this but I would like to hear about
> >>> the real life examples of the problem.
> >>
> >> Indeed, 24s of delay (!) is not so harmful for boot time, but...64G was
> >> just my simple test in a guest, the real case is much worse! It aligns
> >> with Mike's comment, we have complains of minute-like delays, due to a
> >> very big pool of hugepages being allocated.
> > 
> > The cost of page clearing is mostly a constant overhead so it is quite
> > natural to see the time scaling with the number of pages. That overhead
> > has to happen at some point of time. Sure it is more visible when
> > allocating during boot time resp. when doing pre-allocation during
> > runtime. The page fault path would be then faster. The overhead just
> > moves to a different place. So I am not sure this is really a strong
> > argument to hold.
> 
> We have people complaining that starting VMs backed by hugetlbfs takes
> too long, they would much rather have that initialization be done when
> booting the hypervisor ...

I can imagine. Everybody would love to have a free lunch ;) But more
seriously, the overhead of the initialization is unavoidable. The memory
has to be zeroed out by definition, and somebody has to pay for that.
Sure, one can think of a deferred context to do it, but that just
spreads the cost out into the overall system overhead.

Even if the zeroing is done at allocation time, only the first user can
benefit from it. Any reuse of the hugetlb pool has to reinitialize
again.

One can still be creative - e.g. prefault hugetlb files from userspace
in parallel and reuse that - but I am not sure the kernel should try to
be clever here.
Mike Kravetz Oct. 21, 2020, 11:32 p.m. UTC | #12
On 10/21/20 4:31 AM, Michal Hocko wrote:
> On Wed 21-10-20 11:50:48, David Hildenbrand wrote:
>> On 21.10.20 08:15, Michal Hocko wrote:
>>> On Tue 20-10-20 16:19:06, Guilherme G. Piccoli wrote:
>>>> On 20/10/2020 05:20, Michal Hocko wrote:
>>>>>
>>>>> Yes zeroying is quite costly and that is to be expected when the feature
>>>>> is enabled. Hugetlb like other allocator users perform their own
>>>>> initialization rather than go through __GFP_ZERO path. More on that
>>>>> below.
>>>>>
>>>>> Could you be more specific about why this is a problem. Hugetlb pool is
>>>>> usualy preallocatd once during early boot. 24s for 65GB of 2MB pages
>>>>> is non trivial amount of time but it doens't look like a major disaster
>>>>> either. If the pool is allocated later it can take much more time due to
>>>>> memory fragmentation.
>>>>>
>>>>> I definitely do not want to downplay this but I would like to hear about
>>>>> the real life examples of the problem.
>>>>
>>>> Indeed, 24s of delay (!) is not so harmful for boot time, but...64G was
>>>> just my simple test in a guest, the real case is much worse! It aligns
>>>> with Mike's comment, we have complains of minute-like delays, due to a
>>>> very big pool of hugepages being allocated.
>>>
>>> The cost of page clearing is mostly a constant overhead so it is quite
>>> natural to see the time scaling with the number of pages. That overhead
>>> has to happen at some point of time. Sure it is more visible when
>>> allocating during boot time resp. when doing pre-allocation during
>>> runtime. The page fault path would be then faster. The overhead just
>>> moves to a different place. So I am not sure this is really a strong
>>> argument to hold.
>>
>> We have people complaining that starting VMs backed by hugetlbfs takes
>> too long, they would much rather have that initialization be done when
>> booting the hypervisor ...
> 
> I can imagine. Everybody would love to have a free lunch ;) But more
> seriously, the overhead of the initialization is unavoidable. The memory
> has to be zeroed out by definition and somebody has to pay for that.
> Sure one can think of a deferred context to do that but this just
> spreads  the overhead out to the overall system overhead.
> 
> Even if the zeroying is done during the allocation time then it is the
> first user who can benefit from that. Any reuse of the hugetlb pool has
> to reinitialize again.

I remember a conversation with some of our database people who thought
it best for their model if hugetlb pages in the pool were already clear
so that no initialization was done at fault time.  Of course, this requires
clearing at page free time.  In their model, they thought it better to pay
the price at allocation (pool creation) time and free time so that faults
would be as fast as possible.

I wonder if the VMs backed by hugetlbfs pages would benefit from this
behavior as well?

If we track the initialized state (clean or not) of huge pages in the
pool as suggested in Michal's skeleton of a patch, we 'could' then allow
users to choose when hugetlb page clearing is done.

None of that would address the original point of this thread, the global
init_on_alloc parameter.
David Hildenbrand Oct. 22, 2020, 8:04 a.m. UTC | #13
On 22.10.20 01:32, Mike Kravetz wrote:
> On 10/21/20 4:31 AM, Michal Hocko wrote:
>> On Wed 21-10-20 11:50:48, David Hildenbrand wrote:
>>> On 21.10.20 08:15, Michal Hocko wrote:
>>>> On Tue 20-10-20 16:19:06, Guilherme G. Piccoli wrote:
>>>>> On 20/10/2020 05:20, Michal Hocko wrote:
>>>>>>
>>>>>> Yes zeroying is quite costly and that is to be expected when the feature
>>>>>> is enabled. Hugetlb like other allocator users perform their own
>>>>>> initialization rather than go through __GFP_ZERO path. More on that
>>>>>> below.
>>>>>>
>>>>>> Could you be more specific about why this is a problem. Hugetlb pool is
>>>>>> usualy preallocatd once during early boot. 24s for 65GB of 2MB pages
>>>>>> is non trivial amount of time but it doens't look like a major disaster
>>>>>> either. If the pool is allocated later it can take much more time due to
>>>>>> memory fragmentation.
>>>>>>
>>>>>> I definitely do not want to downplay this but I would like to hear about
>>>>>> the real life examples of the problem.
>>>>>
>>>>> Indeed, 24s of delay (!) is not so harmful for boot time, but...64G was
>>>>> just my simple test in a guest, the real case is much worse! It aligns
>>>>> with Mike's comment, we have complains of minute-like delays, due to a
>>>>> very big pool of hugepages being allocated.
>>>>
>>>> The cost of page clearing is mostly a constant overhead so it is quite
>>>> natural to see the time scaling with the number of pages. That overhead
>>>> has to happen at some point of time. Sure it is more visible when
>>>> allocating during boot time resp. when doing pre-allocation during
>>>> runtime. The page fault path would be then faster. The overhead just
>>>> moves to a different place. So I am not sure this is really a strong
>>>> argument to hold.
>>>
>>> We have people complaining that starting VMs backed by hugetlbfs takes
>>> too long, they would much rather have that initialization be done when
>>> booting the hypervisor ...
>>
>> I can imagine. Everybody would love to have a free lunch ;) But more
>> seriously, the overhead of the initialization is unavoidable. The memory
>> has to be zeroed out by definition and somebody has to pay for that.
>> Sure one can think of a deferred context to do that but this just
>> spreads  the overhead out to the overall system overhead.
>>
>> Even if the zeroying is done during the allocation time then it is the
>> first user who can benefit from that. Any reuse of the hugetlb pool has
>> to reinitialize again.
> 
> I remember a conversation with some of our database people who thought
> it best for their model if hugetlb pages in the pool were already clear
> so that no initialization was done at fault time.  Of course, this requires
> clearing at page free time.  In their model, they thought it better to pay
> the price at allocation (pool creation) time and free time so that faults
> would be as fast as possible.
> 
> I wonder if the VMs backed by hugetlbfs pages would benefit from this
> behavior as well?

So what VMMs like qemu already do is prealloc/prefault all hugetlbfs
memory (if told to, because it's not desired when overcommitting memory)
- relevant for low-latency applications and similar.

https://github.com/qemu/qemu/blob/67e8498937866b49b513e3acadef985c15f44fb5/util/oslib-posix.c#L561

That's why starting a VM backed by a lot of huge pages is slow when
prefaulting: you wait until everything has been zeroed before booting the VM.

> 
> If we track the initialized state (clean or not) of huge pages in the
> pool as suggested in Michal's skeleton of a patch, we 'could' then allow
> users to choose when hugetlb page clearing is done.

Right, in the case of QEMU, if there are pre-zeroed pages:
a) prealloc would be faster
b) page faults would be faster

Also we could do hugetlb page clearing from a background thread/process,
as also mentioned by Michal.

> 
> None of that would address the original point of this thread, the global
> init_on_alloc parameter.

Yes, but I guess we're past that: whatever leaves the buddy shall be
zeroed out. That's the whole point of that security hardening mechanism.
Michal Hocko Oct. 22, 2020, 8:55 a.m. UTC | #14
On Thu 22-10-20 10:04:50, David Hildenbrand wrote:
[...]
> > None of that would address the original point of this thread, the global
> > init_on_alloc parameter.
> 
> Yes, but I guess we're past that: whatever leaves the buddy shall be
> zeroed out. That's the whole point of that security hardening mechanism.

Hugetlb can control its zeroing behavior via a mount option (for
MAP_HUGETLB, controlled by a command line parameter). If the page fault
handler can recognize pre-initialized pages then both init_on* options
can be implemented along with such a hugetlb-specific mechanism.
David Hildenbrand Oct. 23, 2020, 8:23 a.m. UTC | #15
On 22.10.20 10:55, Michal Hocko wrote:
> On Thu 22-10-20 10:04:50, David Hildenbrand wrote:
> [...]
>>> None of that would address the original point of this thread, the global
>>> init_on_alloc parameter.
>>
>> Yes, but I guess we're past that: whatever leaves the buddy shall be
>> zeroed out. That's the whole point of that security hardening mechanism.
> 
> Hugetlb can control its zeroying behavior via mount option (for
> MAP_HUGETLB controled by a command line parameter). If the page fault
> handler can recognize the pre-initialized pages then both init_on* can

Right, looking at init_on_alloc tells you whether you have to zero after
the allocation or whether it has already been done, even though you
didn't pass __GFP_ZERO.
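
A minimal caller-side sketch of that pattern (illustration only, not
from the original mail; the wrapper function below is hypothetical,
while alloc_pages(), want_init_on_alloc() and clear_highpage() are the
existing helpers):

/*
 * Allocate without __GFP_ZERO and clear manually only when the page
 * allocator did not already zero the page because of init_on_alloc.
 */
static struct page *example_alloc_cleared_pages(gfp_t gfp_mask,
						unsigned int order)
{
	struct page *page = alloc_pages(gfp_mask, order);

	if (page && !want_init_on_alloc(gfp_mask)) {
		unsigned int i;

		for (i = 0; i < (1U << order); i++)
			clear_highpage(page + i);
	}
	return page;
}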
Guilherme Piccoli Nov. 5, 2020, 7:37 p.m. UTC | #16
Thanks all for the valuable opinions and feedback!
Closing the loop here: the proposal wasn't accepted, but an interesting
idea to optimize later hugetlb allocations was discussed in the
thread - it doesn't help our case much, but it is a performance
optimization.

Cheers,


Guilherme

Patch

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index c603237e006c..c03909f8e7b6 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -39,8 +39,9 @@  struct vm_area_struct;
 #define ___GFP_HARDWALL		0x100000u
 #define ___GFP_THISNODE		0x200000u
 #define ___GFP_ACCOUNT		0x400000u
+#define ___GFP_NOINIT_ON_ALLOC	0x800000u
 #ifdef CONFIG_LOCKDEP
-#define ___GFP_NOLOCKDEP	0x800000u
+#define ___GFP_NOLOCKDEP	0x1000000u
 #else
 #define ___GFP_NOLOCKDEP	0
 #endif
@@ -215,16 +216,19 @@  struct vm_area_struct;
  * %__GFP_COMP address compound page metadata.
  *
  * %__GFP_ZERO returns a zeroed page on success.
+ *
+ * %__GFP_NOINIT_ON_ALLOC avoids userspace pages (like HugeTLB pages) being double-cleared
  */
-#define __GFP_NOWARN	((__force gfp_t)___GFP_NOWARN)
-#define __GFP_COMP	((__force gfp_t)___GFP_COMP)
-#define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)
+#define __GFP_NOWARN		((__force gfp_t)___GFP_NOWARN)
+#define __GFP_COMP		((__force gfp_t)___GFP_COMP)
+#define __GFP_ZERO		((__force gfp_t)___GFP_ZERO)
+#define __GFP_NOINIT_ON_ALLOC	((__force gfp_t)___GFP_NOINIT_ON_ALLOC)
 
 /* Disable lockdep for GFP context tracking */
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
 
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (24 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /**
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef360fe70aaf..7fa60d22a90a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2882,7 +2882,7 @@  DECLARE_STATIC_KEY_FALSE(init_on_alloc);
 static inline bool want_init_on_alloc(gfp_t flags)
 {
 	if (static_branch_unlikely(&init_on_alloc) &&
-	    !page_poisoning_enabled())
+	    !page_poisoning_enabled() && !(flags & __GFP_NOINIT_ON_ALLOC))
 		return true;
 	return flags & __GFP_ZERO;
 }
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 67018d367b9f..89b0c0ddcc52 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -48,7 +48,8 @@ 
 	{(unsigned long)__GFP_WRITE,		"__GFP_WRITE"},		\
 	{(unsigned long)__GFP_RECLAIM,		"__GFP_RECLAIM"},	\
 	{(unsigned long)__GFP_DIRECT_RECLAIM,	"__GFP_DIRECT_RECLAIM"},\
-	{(unsigned long)__GFP_KSWAPD_RECLAIM,	"__GFP_KSWAPD_RECLAIM"}\
+	{(unsigned long)__GFP_KSWAPD_RECLAIM,	"__GFP_KSWAPD_RECLAIM"},\
+	{(unsigned long)__GFP_NOINIT_ON_ALLOC,	"__GFP_NOINIT_ON_ALLOC"}\
 
 #define show_gfp_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fe76f8fd5a73..c60a6726b0be 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1742,6 +1742,13 @@  static struct page *alloc_fresh_huge_page(struct hstate *h,
 {
 	struct page *page;
 
+	/*
+	 * Signal the following page allocs so they are not cleared at
+	 * allocation time - since HugeTLB pages are *only* used as
+	 * userspace pages, they'll be cleared before use anyway.
+	 */
+	gfp_mask |= __GFP_NOINIT_ON_ALLOC;
+
 	if (hstate_is_gigantic(h))
 		page = alloc_gigantic_page(h, gfp_mask, nid, nmask);
 	else
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index a50dae2c4ae9..bef90d8bb7f6 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -660,6 +660,7 @@  static const struct {
 	{ "__GFP_RECLAIM",		"R" },
 	{ "__GFP_DIRECT_RECLAIM",	"DR" },
 	{ "__GFP_KSWAPD_RECLAIM",	"KR" },
+	{ "__GFP_NOINIT_ON_ALLOC",	"NIA" },
 };
 
 static size_t max_gfp_len;