
[v4,05/21] mm/hugetlb: Introduce pgtable allocation/freeing helpers

Message ID 20201113105952.11638-6-songmuchun@bytedance.com (mailing list archive)
State New, archived
Series: Free some vmemmap pages of hugetlb page

Commit Message

Muchun Song Nov. 13, 2020, 10:59 a.m. UTC
On x86_64, vmemmap is always PMD mapped if the machine has hugepages
support and if we have 2MB contiguos pages and PMD aligned. If we want
to free the unused vmemmap pages, we have to split the huge pmd firstly.
So we should pre-allocate pgtable to split PMD to PTE.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb_vmemmap.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/hugetlb_vmemmap.h | 12 +++++++++
 2 files changed, 85 insertions(+)
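
The PMD split itself is introduced by a later patch in this series; this
patch only adds the allocation/freeing helpers. As a rough illustration of
how one of the preallocated PTE pages could be consumed, here is a minimal
sketch — the helper name, the use of PAGE_KERNEL for the base-page
protection, and the TLB flush are assumptions for illustration, not code
taken from the series:

	/*
	 * Illustrative sketch: remap one vmemmap PMD with base pages using a
	 * preallocated PTE page (pgtable) taken from the huge page's lru list.
	 */
	static void split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start,
					   pte_t *pgtable)
	{
		unsigned long addr = start;
		unsigned long pfn = pmd_pfn(*pmd);
		int i;

		/* Fill the new PTE page with base-page mappings of the same range. */
		for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE, pfn++)
			set_pte_at(&init_mm, addr, pgtable + i,
				   pfn_pte(pfn, PAGE_KERNEL));

		/* Point the PMD at the new PTE page and flush the old huge mapping. */
		pmd_populate_kernel(&init_mm, pmd, pgtable);
		flush_tlb_kernel_range(start, start + PMD_SIZE);
	}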

Comments

Oscar Salvador Nov. 17, 2020, 3:06 p.m. UTC | #1
On Fri, Nov 13, 2020 at 06:59:36PM +0800, Muchun Song wrote:
> +#define page_huge_pte(page)		((page)->pmd_huge_pte)

Seems you do not need this one anymore.

> +void vmemmap_pgtable_free(struct page *page)
> +{
> +	struct page *pte_page, *t_page;
> +
> +	list_for_each_entry_safe(pte_page, t_page, &page->lru, lru) {
> +		list_del(&pte_page->lru);
> +		pte_free_kernel(&init_mm, page_to_virt(pte_page));
> +	}
> +}
> +
> +int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
> +{
> +	unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> +
> +	/* Store preallocated pages on huge page lru list */
> +	INIT_LIST_HEAD(&page->lru);
> +
> +	while (nr--) {
> +		pte_t *pte_p;
> +
> +		pte_p = pte_alloc_one_kernel(&init_mm);
> +		if (!pte_p)
> +			goto out;
> +		list_add(&virt_to_page(pte_p)->lru, &page->lru);
> +	}

Definitely this looks better and easier to handle.
Btw, did you explore Matthew's hint about instead of allocating a new page,
using one of the ones you are going to free to store the ptes?
I am not sure whether it is feasible at all though.


> --- a/mm/hugetlb_vmemmap.h
> +++ b/mm/hugetlb_vmemmap.h
> @@ -9,12 +9,24 @@
>  #ifndef _LINUX_HUGETLB_VMEMMAP_H
>  #define _LINUX_HUGETLB_VMEMMAP_H
>  #include <linux/hugetlb.h>
> +#include <linux/mm.h>

why do we need this here?
Muchun Song Nov. 17, 2020, 3:29 p.m. UTC | #2
On Tue, Nov 17, 2020 at 11:06 PM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Fri, Nov 13, 2020 at 06:59:36PM +0800, Muchun Song wrote:
> > +#define page_huge_pte(page)          ((page)->pmd_huge_pte)

Yeah, I forgot to remove it. Thanks.

>
> Seems you do not need this one anymore.
>
> > +void vmemmap_pgtable_free(struct page *page)
> > +{
> > +     struct page *pte_page, *t_page;
> > +
> > +     list_for_each_entry_safe(pte_page, t_page, &page->lru, lru) {
> > +             list_del(&pte_page->lru);
> > +             pte_free_kernel(&init_mm, page_to_virt(pte_page));
> > +     }
> > +}
> > +
> > +int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
> > +{
> > +     unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> > +
> > +     /* Store preallocated pages on huge page lru list */
> > +     INIT_LIST_HEAD(&page->lru);
> > +
> > +     while (nr--) {
> > +             pte_t *pte_p;
> > +
> > +             pte_p = pte_alloc_one_kernel(&init_mm);
> > +             if (!pte_p)
> > +                     goto out;
> > +             list_add(&virt_to_page(pte_p)->lru, &page->lru);
> > +     }
>
> Definitely this looks better and easier to handle.
> Btw, did you explore Matthew's hint about instead of allocating a new page,
> using one of the ones you are going to free to store the ptes?

Oh, sorry for missing his reply. It is a good idea. I will start an
investigation.
Thanks for reminding me.

> I am not sure whether it is feasible at all though.
>
>
> > --- a/mm/hugetlb_vmemmap.h
> > +++ b/mm/hugetlb_vmemmap.h
> > @@ -9,12 +9,24 @@
> >  #ifndef _LINUX_HUGETLB_VMEMMAP_H
> >  #define _LINUX_HUGETLB_VMEMMAP_H
> >  #include <linux/hugetlb.h>
> > +#include <linux/mm.h>
>
> why do we need this here?

Yeah, that can also be removed :).


>
> --
> Oscar Salvador
> SUSE L3



--
Yours,
Muchun
Muchun Song Nov. 19, 2020, 6:17 a.m. UTC | #3
On Tue, Nov 17, 2020 at 11:06 PM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Fri, Nov 13, 2020 at 06:59:36PM +0800, Muchun Song wrote:
> > +#define page_huge_pte(page)          ((page)->pmd_huge_pte)
>
> Seems you do not need this one anymore.
>
> > +void vmemmap_pgtable_free(struct page *page)
> > +{
> > +     struct page *pte_page, *t_page;
> > +
> > +     list_for_each_entry_safe(pte_page, t_page, &page->lru, lru) {
> > +             list_del(&pte_page->lru);
> > +             pte_free_kernel(&init_mm, page_to_virt(pte_page));
> > +     }
> > +}
> > +
> > +int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
> > +{
> > +     unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> > +
> > +     /* Store preallocated pages on huge page lru list */
> > +     INIT_LIST_HEAD(&page->lru);
> > +
> > +     while (nr--) {
> > +             pte_t *pte_p;
> > +
> > +             pte_p = pte_alloc_one_kernel(&init_mm);
> > +             if (!pte_p)
> > +                     goto out;
> > +             list_add(&virt_to_page(pte_p)->lru, &page->lru);
> > +     }
>
> Definitely this looks better and easier to handle.
> Btw, did you explore Matthew's hint about instead of allocating a new page,
> using one of the ones you are going to free to store the ptes?
> I am not sure whether it is feasible at all though.

Hi Oscar and Matthew,

I have investigated this. In the end, I think it may not be
feasible. If we reuse a vmemmap page frame as the page table
when we first split the PMD, at that point we need to set 512
pte entries in that vmemmap page frame. If someone reads a tail
struct page of the HugeTLB page at that moment, it can see an
arbitrary value (I am not sure this actually happens, maybe the
memory compaction module can do it). So to be on the safe side,
I think that allocating a new page is a good choice.

Thanks.

>
>
> > --- a/mm/hugetlb_vmemmap.h
> > +++ b/mm/hugetlb_vmemmap.h
> > @@ -9,12 +9,24 @@
> >  #ifndef _LINUX_HUGETLB_VMEMMAP_H
> >  #define _LINUX_HUGETLB_VMEMMAP_H
> >  #include <linux/hugetlb.h>
> > +#include <linux/mm.h>
>
> why do we need this here?
>
> --
> Oscar Salvador
> SUSE L3
Mike Kravetz Nov. 19, 2020, 11:21 p.m. UTC | #4
On 11/18/20 10:17 PM, Muchun Song wrote:
> On Tue, Nov 17, 2020 at 11:06 PM Oscar Salvador <osalvador@suse.de> wrote:
>>
>> On Fri, Nov 13, 2020 at 06:59:36PM +0800, Muchun Song wrote:
>>> +#define page_huge_pte(page)          ((page)->pmd_huge_pte)
>>
>> Seems you do not need this one anymore.
>>
>>> +void vmemmap_pgtable_free(struct page *page)
>>> +{
>>> +     struct page *pte_page, *t_page;
>>> +
>>> +     list_for_each_entry_safe(pte_page, t_page, &page->lru, lru) {
>>> +             list_del(&pte_page->lru);
>>> +             pte_free_kernel(&init_mm, page_to_virt(pte_page));
>>> +     }
>>> +}
>>> +
>>> +int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
>>> +{
>>> +     unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
>>> +
>>> +     /* Store preallocated pages on huge page lru list */
>>> +     INIT_LIST_HEAD(&page->lru);
>>> +
>>> +     while (nr--) {
>>> +             pte_t *pte_p;
>>> +
>>> +             pte_p = pte_alloc_one_kernel(&init_mm);
>>> +             if (!pte_p)
>>> +                     goto out;
>>> +             list_add(&virt_to_page(pte_p)->lru, &page->lru);
>>> +     }
>>
>> Definitely this looks better and easier to handle.
>> Btw, did you explore Matthew's hint about instead of allocating a new page,
>> using one of the ones you are going to free to store the ptes?
>> I am not sure whether it is feasible at all though.
> 
> Hi Oscar and Matthew,
> 
> I have investigated this. In the end, I think it may not be
> feasible. If we reuse a vmemmap page frame as the page table
> when we first split the PMD, at that point we need to set 512
> pte entries in that vmemmap page frame. If someone reads a tail
> struct page of the HugeTLB page at that moment, it can see an
> arbitrary value (I am not sure this actually happens, maybe the
> memory compaction module can do it). So to be on the safe side,
> I think that allocating a new page is a good choice.

Thanks for looking into this.

If I understand correctly, the issue is that you need the pte page to set
up the new mappings.  In your current code, this is done before removing
the pages of struct pages.  This keeps everything 'consistent' as things
are remapped.

If you want to use one of the 'pages of struct pages' for the new pte
page, then there will be a period of time when things are inconsistent.
Before setting up the mapping, some code could potentially access those
pages of struct pages.

I tend to agree that allocating a new page is the safest thing to do
here.  Or, perhaps someone can think of a way to make this safe.
Mike Kravetz Nov. 19, 2020, 11:37 p.m. UTC | #5
On 11/13/20 2:59 AM, Muchun Song wrote:
> On x86_64, vmemmap is always PMD mapped if the machine has hugepages
> support and if we have 2MB contiguos pages and PMD aligned. If we want
                             contiguous              alignment
> to free the unused vmemmap pages, we have to split the huge pmd firstly.
> So we should pre-allocate pgtable to split PMD to PTE.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/hugetlb_vmemmap.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/hugetlb_vmemmap.h | 12 +++++++++
>  2 files changed, 85 insertions(+)

Thanks for the cleanup.

Oscar made some other comments.  I only have one additional minor comment
below.

With those minor cleanups,
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>

> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
...
> +int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
> +{
> +	unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> +
> +	/* Store preallocated pages on huge page lru list */

Let's expand the above comment to something like this:

	/*
	 * Use the huge page lru list to temporarily store the preallocated
	 * pages.  The preallocated pages are used and the list is emptied
	 * before the huge page is put into use.  When the huge page is put
	 * into use by prep_new_huge_page() the list will be reinitialized.
	 */

> +	INIT_LIST_HEAD(&page->lru);
> +
> +	while (nr--) {
> +		pte_t *pte_p;
> +
> +		pte_p = pte_alloc_one_kernel(&init_mm);
> +		if (!pte_p)
> +			goto out;
> +		list_add(&virt_to_page(pte_p)->lru, &page->lru);
> +	}
> +
> +	return 0;
> +out:
> +	vmemmap_pgtable_free(page);
> +	return -ENOMEM;
> +}
Muchun Song Nov. 20, 2020, 2:52 a.m. UTC | #6
On Fri, Nov 20, 2020 at 7:22 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 11/18/20 10:17 PM, Muchun Song wrote:
> > On Tue, Nov 17, 2020 at 11:06 PM Oscar Salvador <osalvador@suse.de> wrote:
> >>
> >> On Fri, Nov 13, 2020 at 06:59:36PM +0800, Muchun Song wrote:
> >>> +#define page_huge_pte(page)          ((page)->pmd_huge_pte)
> >>
> >> Seems you do not need this one anymore.
> >>
> >>> +void vmemmap_pgtable_free(struct page *page)
> >>> +{
> >>> +     struct page *pte_page, *t_page;
> >>> +
> >>> +     list_for_each_entry_safe(pte_page, t_page, &page->lru, lru) {
> >>> +             list_del(&pte_page->lru);
> >>> +             pte_free_kernel(&init_mm, page_to_virt(pte_page));
> >>> +     }
> >>> +}
> >>> +
> >>> +int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
> >>> +{
> >>> +     unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
> >>> +
> >>> +     /* Store preallocated pages on huge page lru list */
> >>> +     INIT_LIST_HEAD(&page->lru);
> >>> +
> >>> +     while (nr--) {
> >>> +             pte_t *pte_p;
> >>> +
> >>> +             pte_p = pte_alloc_one_kernel(&init_mm);
> >>> +             if (!pte_p)
> >>> +                     goto out;
> >>> +             list_add(&virt_to_page(pte_p)->lru, &page->lru);
> >>> +     }
> >>
> >> Definitely this looks better and easier to handle.
> >> Btw, did you explore Matthew's hint about instead of allocating a new page,
> >> using one of the ones you are going to free to store the ptes?
> >> I am not sure whether it is feasible at all though.
> >
> > Hi Oscar and Matthew,
> >
> > I have investigated this. In the end, I think it may not be
> > feasible. If we reuse a vmemmap page frame as the page table
> > when we first split the PMD, at that point we need to set 512
> > pte entries in that vmemmap page frame. If someone reads a tail
> > struct page of the HugeTLB page at that moment, it can see an
> > arbitrary value (I am not sure this actually happens, maybe the
> > memory compaction module can do it). So to be on the safe side,
> > I think that allocating a new page is a good choice.
>
> Thanks for looking into this.
>
> If I understand correctly, the issue is that you need the pte page to set
> up the new mappings.  In your current code, this is done before removing
> the pages of struct pages.  This keeps everything 'consistent' as things
> are remapped.
>
> If you want to use one of the 'pages of struct pages' for the new pte
> page, then there will be a period of time when things are inconsistent.
> Before setting up the mapping, some code could potentially access those
> pages of struct pages.

Yeah, you are right.

>
> I tend to agree that allocating a new page is the safest thing to do
> here.  Or, perhaps someone can think of a way to make this safe.
> --
> Mike Kravetz
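
For context, a minimal sketch of the expected call pattern once the
remapping step exists (free_huge_page_vmemmap() below is the name used by
later patches in this series; treat the whole function as an assumption for
illustration rather than code from this patch):

	static int hugetlb_vmemmap_setup_sketch(struct hstate *h, struct page *page)
	{
		/* Preallocate the PTE pages needed to split the vmemmap PMDs. */
		if (vmemmap_pgtable_prealloc(h, page))
			return -ENOMEM;

		/*
		 * Remap the tail vmemmap pages (added by a later patch); it takes
		 * PTE pages off page->lru as it splits each PMD.
		 */
		free_huge_page_vmemmap(h, page);

		/* Release any preallocated PTE pages that were not consumed. */
		vmemmap_pgtable_free(page);
		return 0;
	}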

Patch

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index a6c9948302e2..b7dfa97b4ea9 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -71,6 +71,8 @@ 
  */
 #define pr_fmt(fmt)	"HugeTLB Vmemmap: " fmt
 
+#include <linux/list.h>
+#include <asm/pgalloc.h>
 #include "hugetlb_vmemmap.h"
 
 /*
@@ -83,6 +85,77 @@ 
  */
 #define RESERVE_VMEMMAP_NR		2U
 
+#ifndef VMEMMAP_HPAGE_SHIFT
+#define VMEMMAP_HPAGE_SHIFT		HPAGE_SHIFT
+#endif
+#define VMEMMAP_HPAGE_ORDER		(VMEMMAP_HPAGE_SHIFT - PAGE_SHIFT)
+#define VMEMMAP_HPAGE_NR		(1 << VMEMMAP_HPAGE_ORDER)
+#define VMEMMAP_HPAGE_SIZE		((1UL) << VMEMMAP_HPAGE_SHIFT)
+#define VMEMMAP_HPAGE_MASK		(~(VMEMMAP_HPAGE_SIZE - 1))
+
+#define page_huge_pte(page)		((page)->pmd_huge_pte)
+
+static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
+{
+	return h->nr_free_vmemmap_pages;
+}
+
+static inline unsigned int vmemmap_pages_per_hpage(struct hstate *h)
+{
+	return free_vmemmap_pages_per_hpage(h) + RESERVE_VMEMMAP_NR;
+}
+
+static inline unsigned long vmemmap_pages_size_per_hpage(struct hstate *h)
+{
+	return (unsigned long)vmemmap_pages_per_hpage(h) << PAGE_SHIFT;
+}
+
+static inline unsigned int pgtable_pages_to_prealloc_per_hpage(struct hstate *h)
+{
+	unsigned long vmemmap_size = vmemmap_pages_size_per_hpage(h);
+
+	/*
+	 * No need to pre-allocate page tables when there are no vmemmap pages
+	 * to free.
+	 */
+	if (!free_vmemmap_pages_per_hpage(h))
+		return 0;
+
+	return ALIGN(vmemmap_size, VMEMMAP_HPAGE_SIZE) >> VMEMMAP_HPAGE_SHIFT;
+}
+
+void vmemmap_pgtable_free(struct page *page)
+{
+	struct page *pte_page, *t_page;
+
+	list_for_each_entry_safe(pte_page, t_page, &page->lru, lru) {
+		list_del(&pte_page->lru);
+		pte_free_kernel(&init_mm, page_to_virt(pte_page));
+	}
+}
+
+int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
+{
+	unsigned int nr = pgtable_pages_to_prealloc_per_hpage(h);
+
+	/* Store preallocated pages on huge page lru list */
+	INIT_LIST_HEAD(&page->lru);
+
+	while (nr--) {
+		pte_t *pte_p;
+
+		pte_p = pte_alloc_one_kernel(&init_mm);
+		if (!pte_p)
+			goto out;
+		list_add(&virt_to_page(pte_p)->lru, &page->lru);
+	}
+
+	return 0;
+out:
+	vmemmap_pgtable_free(page);
+	return -ENOMEM;
+}
+
 void __init hugetlb_vmemmap_init(struct hstate *h)
 {
 	unsigned int order = huge_page_order(h);
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 40c0c7dfb60d..2a72d2f62411 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -9,12 +9,24 @@ 
 #ifndef _LINUX_HUGETLB_VMEMMAP_H
 #define _LINUX_HUGETLB_VMEMMAP_H
 #include <linux/hugetlb.h>
+#include <linux/mm.h>
 
 #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
 void __init hugetlb_vmemmap_init(struct hstate *h);
+int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page);
+void vmemmap_pgtable_free(struct page *page);
 #else
 static inline void hugetlb_vmemmap_init(struct hstate *h)
 {
 }
+
+static inline int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page)
+{
+	return 0;
+}
+
+static inline void vmemmap_pgtable_free(struct page *page)
+{
+}
 #endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */
 #endif /* _LINUX_HUGETLB_VMEMMAP_H */
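
For reference, a worked example of pgtable_pages_to_prealloc_per_hpage()
above, assuming 4KB base pages and a 64-byte struct page (typical on
x86_64, where VMEMMAP_HPAGE_SHIFT is 21):

	2MB HugeTLB page:  512 struct pages * 64B = 32KB -> 8 vmemmap pages
	                   vmemmap_pages_size_per_hpage = 8 * 4KB = 32KB
	                   ALIGN(32KB, 2MB) >> 21 = 1 PTE page preallocated

	1GB HugeTLB page:  262144 struct pages * 64B = 16MB -> 4096 vmemmap pages
	                   vmemmap_pages_size_per_hpage = 16MB
	                   ALIGN(16MB, 2MB) >> 21 = 8 PTE pages preallocated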