
[v5,11/21] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page

Message ID 20201120064325.34492-12-songmuchun@bytedance.com (mailing list archive)
State New, archived
Series Free some vmemmap pages of hugetlb page

Commit Message

Muchun Song Nov. 20, 2020, 6:43 a.m. UTC
When we free a hugetlb page to the buddy allocator, we should first allocate
the vmemmap pages associated with it. We can do that in __free_hugepage().

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c         |   2 ++
 mm/hugetlb_vmemmap.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/hugetlb_vmemmap.h |   5 +++
 3 files changed, 107 insertions(+)

Comments

Michal Hocko Nov. 20, 2020, 8:11 a.m. UTC | #1
On Fri 20-11-20 14:43:15, Muchun Song wrote:
[...]
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index eda7e3a0b67c..361c4174e222 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -117,6 +117,8 @@
>  #define RESERVE_VMEMMAP_NR		2U
>  #define RESERVE_VMEMMAP_SIZE		(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
>  #define TAIL_PAGE_REUSE			-1
> +#define GFP_VMEMMAP_PAGE		\
> +	(GFP_KERNEL | __GFP_NOFAIL | __GFP_MEMALLOC)

This is really dangerous! __GFP_MEMALLOC would allow a complete memory
depletion. I am not even sure triggering the OOM killer is a reasonable
behavior. It is just unexpected that shrinking a hugetlb pool can have
destructive side effects. I believe it would be more reasonable to
simply refuse to shrink the pool if we cannot free those pages up. This
sucks as well but it isn't destructive at least.
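
A minimal sketch of that non-destructive alternative, assuming alloc_huge_page_vmemmap()
is made fallible (hypothetical shape only; it reuses the mainline update_and_free_page()
and enqueue_huge_page() names and ignores hugetlb_lock and accounting details):

/*
 * Sketch only: assumes alloc_huge_page_vmemmap() returns 0 on success and
 * -ENOMEM when the vmemmap pages cannot be allocated.
 */
static void update_and_free_page(struct hstate *h, struct page *page)
{
	if (alloc_huge_page_vmemmap(h, page)) {
		/*
		 * Refuse to shrink the pool: keep the huge page instead of
		 * returning it to the buddy allocator.
		 */
		h->max_huge_pages++;
		enqueue_huge_page(h, page);
		return;
	}
	__free_hugepage(h, page);
}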
Muchun Song Nov. 20, 2020, 8:51 a.m. UTC | #2
On Fri, Nov 20, 2020 at 4:11 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 14:43:15, Muchun Song wrote:
> [...]
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index eda7e3a0b67c..361c4174e222 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -117,6 +117,8 @@
> >  #define RESERVE_VMEMMAP_NR           2U
> >  #define RESERVE_VMEMMAP_SIZE         (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
> >  #define TAIL_PAGE_REUSE                      -1
> > +#define GFP_VMEMMAP_PAGE             \
> > +     (GFP_KERNEL | __GFP_NOFAIL | __GFP_MEMALLOC)
>
> This is really dangerous! __GFP_MEMALLOC would allow a complete memory
> depletion. I am not even sure triggering the OOM killer is a reasonable
> behavior. It is just unexpected that shrinking a hugetlb pool can have
> destructive side effects. I believe it would be more reasonable to
> simply refuse to shrink the pool if we cannot free those pages up. This
> sucks as well but it isn't destructive at least.

I found the description of __GFP_MEMALLOC in the kernel doc:

%__GFP_MEMALLOC allows access to all memory. This should only be used when
the caller guarantees the allocation will allow more memory to be freed
very shortly.

Our situation is in line with the description above. We will shortly free a HugeTLB
page to the buddy allocator, which is much larger than what we are allocating here.

Thanks.

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Nov. 20, 2020, 9:28 a.m. UTC | #3
On Fri 20-11-20 16:51:59, Muchun Song wrote:
> On Fri, Nov 20, 2020 at 4:11 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 20-11-20 14:43:15, Muchun Song wrote:
> > [...]
> > > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > > index eda7e3a0b67c..361c4174e222 100644
> > > --- a/mm/hugetlb_vmemmap.c
> > > +++ b/mm/hugetlb_vmemmap.c
> > > @@ -117,6 +117,8 @@
> > >  #define RESERVE_VMEMMAP_NR           2U
> > >  #define RESERVE_VMEMMAP_SIZE         (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
> > >  #define TAIL_PAGE_REUSE                      -1
> > > +#define GFP_VMEMMAP_PAGE             \
> > > +     (GFP_KERNEL | __GFP_NOFAIL | __GFP_MEMALLOC)
> >
> > This is really dangerous! __GFP_MEMALLOC would allow a complete memory
> > depletion. I am not even sure triggering the OOM killer is a reasonable
> > behavior. It is just unexpected that shrinking a hugetlb pool can have
> > destructive side effects. I believe it would be more reasonable to
> > simply refuse to shrink the pool if we cannot free those pages up. This
> > sucks as well but it isn't destructive at least.
> 
> I found the description of __GFP_MEMALLOC in the kernel doc:
> 
> %__GFP_MEMALLOC allows access to all memory. This should only be used when
> the caller guarantees the allocation will allow more memory to be freed
> very shortly.
> 
> Our situation is in line with the description above. We will shortly free a HugeTLB
> page to the buddy allocator, which is much larger than what we are allocating here.

Yes that is a part of the description. But read it in its full entirety.
 * %__GFP_MEMALLOC allows access to all memory. This should only be used when
 * the caller guarantees the allocation will allow more memory to be freed
 * very shortly e.g. process exiting or swapping. Users either should
 * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
 * Users of this flag have to be extremely careful to not deplete the reserve
 * completely and implement a throttling mechanism which controls the
 * consumption of the reserve based on the amount of freed memory.
 * Usage of a pre-allocated pool (e.g. mempool) should be always considered
 * before using this flag.

GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_HIGH

sounds like a more reasonable fit to me.
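
For illustration only, a rough sketch of how that mask could be used in this patch,
assuming the allocation loop is also made fallible, since __GFP_RETRY_MAYFAIL (unlike
__GFP_NOFAIL) may return NULL (hypothetical code, not part of the posted series):

#define GFP_VMEMMAP_PAGE	(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_HIGH)

static int alloc_vmemmap_pages(struct hstate *h, struct list_head *list)
{
	struct page *page, *next;
	int i;

	for (i = 0; i < free_vmemmap_pages_per_hpage(h); i++) {
		/* With __GFP_RETRY_MAYFAIL the allocation can fail. */
		page = alloc_page(GFP_VMEMMAP_PAGE);
		if (!page)
			goto out;
		list_add_tail(&page->lru, list);
	}
	return 0;

out:
	/* Undo the partial allocation so the caller can refuse the free. */
	list_for_each_entry_safe(page, next, list, lru) {
		list_del(&page->lru);
		__free_page(page);
	}
	return -ENOMEM;
}

The caller would then have to propagate the error instead of assuming success.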
Muchun Song Nov. 20, 2020, 9:37 a.m. UTC | #4
On Fri, Nov 20, 2020 at 5:28 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 16:51:59, Muchun Song wrote:
> > On Fri, Nov 20, 2020 at 4:11 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 20-11-20 14:43:15, Muchun Song wrote:
> > > [...]
> > > > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > > > index eda7e3a0b67c..361c4174e222 100644
> > > > --- a/mm/hugetlb_vmemmap.c
> > > > +++ b/mm/hugetlb_vmemmap.c
> > > > @@ -117,6 +117,8 @@
> > > >  #define RESERVE_VMEMMAP_NR           2U
> > > >  #define RESERVE_VMEMMAP_SIZE         (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
> > > >  #define TAIL_PAGE_REUSE                      -1
> > > > +#define GFP_VMEMMAP_PAGE             \
> > > > +     (GFP_KERNEL | __GFP_NOFAIL | __GFP_MEMALLOC)
> > >
> > > This is really dangerous! __GFP_MEMALLOC would allow a complete memory
> > > depletion. I am not even sure triggering the OOM killer is a reasonable
> > > behavior. It is just unexpected that shrinking a hugetlb pool can have
> > > destructive side effects. I believe it would be more reasonable to
> > > simply refuse to shrink the pool if we cannot free those pages up. This
> > > sucks as well but it isn't destructive at least.
> >
> > I found the description of __GFP_MEMALLOC in the kernel doc:
> >
> > %__GFP_MEMALLOC allows access to all memory. This should only be used when
> > the caller guarantees the allocation will allow more memory to be freed
> > very shortly.
> >
> > Our situation is in line with the description above. We will shortly free a HugeTLB
> > page to the buddy allocator, which is much larger than what we are allocating here.
>
> Yes that is a part of the description. But read it in its full entirety.
>  * %__GFP_MEMALLOC allows access to all memory. This should only be used when
>  * the caller guarantees the allocation will allow more memory to be freed
>  * very shortly e.g. process exiting or swapping. Users either should
>  * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
>  * Users of this flag have to be extremely careful to not deplete the reserve
>  * completely and implement a throttling mechanism which controls the
>  * consumption of the reserve based on the amount of freed memory.
>  * Usage of a pre-allocated pool (e.g. mempool) should be always considered
>  * before using this flag.
>
> GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_HIGH

We want to free the HugeTLB page to the buddy allocator, but before that we
need to allocate some pages as vmemmap pages, so we cannot afford an allocation
failure here. I think that we should replace __GFP_RETRY_MAYFAIL with
__GFP_NOFAIL.

GFP_KERNEL | __GFP_NOFAIL | __GFP_HIGH

This meets our needs here. Thanks.

>
> sounds like a more reasonable fit to me.
>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Nov. 20, 2020, 11:10 a.m. UTC | #5
On Fri 20-11-20 17:37:09, Muchun Song wrote:
> On Fri, Nov 20, 2020 at 5:28 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 20-11-20 16:51:59, Muchun Song wrote:
> > > On Fri, Nov 20, 2020 at 4:11 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Fri 20-11-20 14:43:15, Muchun Song wrote:
> > > > [...]
> > > > > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > > > > index eda7e3a0b67c..361c4174e222 100644
> > > > > --- a/mm/hugetlb_vmemmap.c
> > > > > +++ b/mm/hugetlb_vmemmap.c
> > > > > @@ -117,6 +117,8 @@
> > > > >  #define RESERVE_VMEMMAP_NR           2U
> > > > >  #define RESERVE_VMEMMAP_SIZE         (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
> > > > >  #define TAIL_PAGE_REUSE                      -1
> > > > > +#define GFP_VMEMMAP_PAGE             \
> > > > > +     (GFP_KERNEL | __GFP_NOFAIL | __GFP_MEMALLOC)
> > > >
> > > > This is really dangerous! __GFP_MEMALLOC would allow a complete memory
> > > > depletion. I am not even sure triggering the OOM killer is a reasonable
> > > > behavior. It is just unexpected that shrinking a hugetlb pool can have
> > > > destructive side effects. I believe it would be more reasonable to
> > > > simply refuse to shrink the pool if we cannot free those pages up. This
> > > > sucks as well but it isn't destructive at least.
> > >
> > > I found the description of __GFP_MEMALLOC in the kernel doc:
> > >
> > > %__GFP_MEMALLOC allows access to all memory. This should only be used when
> > > the caller guarantees the allocation will allow more memory to be freed
> > > very shortly.
> > >
> > > Our situation is in line with the description above. We will shortly free a HugeTLB
> > > page to the buddy allocator, which is much larger than what we are allocating here.
> >
> > Yes that is a part of the description. But read it in its full entirety.
> >  * %__GFP_MEMALLOC allows access to all memory. This should only be used when
> >  * the caller guarantees the allocation will allow more memory to be freed
> >  * very shortly e.g. process exiting or swapping. Users either should
> >  * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
> >  * Users of this flag have to be extremely careful to not deplete the reserve
> >  * completely and implement a throttling mechanism which controls the
> >  * consumption of the reserve based on the amount of freed memory.
> >  * Usage of a pre-allocated pool (e.g. mempool) should be always considered
> >  * before using this flag.
> >
> > GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_HIGH
> 
> > We want to free the HugeTLB page to the buddy allocator, but before that we
> > need to allocate some pages as vmemmap pages, so we cannot afford an
> > allocation failure here.

Why can't you simply refuse to shrink the pool size?

> > I think that we should replace
> > __GFP_RETRY_MAYFAIL with __GFP_NOFAIL.
> 
> GFP_KERNEL | __GFP_NOFAIL | __GFP_HIGH
> 
> This meets our needs here. Thanks.

Please read my concern about the disruptive behavior again, or explain
why it is desirable.
Muchun Song Nov. 20, 2020, 11:56 a.m. UTC | #6
On Fri, Nov 20, 2020 at 7:10 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-11-20 17:37:09, Muchun Song wrote:
> > On Fri, Nov 20, 2020 at 5:28 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 20-11-20 16:51:59, Muchun Song wrote:
> > > > On Fri, Nov 20, 2020 at 4:11 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Fri 20-11-20 14:43:15, Muchun Song wrote:
> > > > > [...]
> > > > > > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > > > > > index eda7e3a0b67c..361c4174e222 100644
> > > > > > --- a/mm/hugetlb_vmemmap.c
> > > > > > +++ b/mm/hugetlb_vmemmap.c
> > > > > > @@ -117,6 +117,8 @@
> > > > > >  #define RESERVE_VMEMMAP_NR           2U
> > > > > >  #define RESERVE_VMEMMAP_SIZE         (RESERVE_VMEMMAP_NR << PAGE_SHIFT)
> > > > > >  #define TAIL_PAGE_REUSE                      -1
> > > > > > +#define GFP_VMEMMAP_PAGE             \
> > > > > > +     (GFP_KERNEL | __GFP_NOFAIL | __GFP_MEMALLOC)
> > > > >
> > > > > This is really dangerous! __GFP_MEMALLOC would allow a complete memory
> > > > > depletion. I am not even sure triggering the OOM killer is a reasonable
> > > > > behavior. It is just unexpected that shrinking a hugetlb pool can have
> > > > > destructive side effects. I believe it would be more reasonable to
> > > > > simply refuse to shrink the pool if we cannot free those pages up. This
> > > > > sucks as well but it isn't destructive at least.
> > > >
> > > > I found the description of __GFP_MEMALLOC in the kernel doc:
> > > >
> > > > %__GFP_MEMALLOC allows access to all memory. This should only be used when
> > > > the caller guarantees the allocation will allow more memory to be freed
> > > > very shortly.
> > > >
> > > > Our situation is in line with the description above. We will shortly free a HugeTLB
> > > > page to the buddy allocator, which is much larger than what we are allocating here.
> > >
> > > Yes that is a part of the description. But read it in its full entirety.
> > >  * %__GFP_MEMALLOC allows access to all memory. This should only be used when
> > >  * the caller guarantees the allocation will allow more memory to be freed
> > >  * very shortly e.g. process exiting or swapping. Users either should
> > >  * be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
> > >  * Users of this flag have to be extremely careful to not deplete the reserve
> > >  * completely and implement a throttling mechanism which controls the
> > >  * consumption of the reserve based on the amount of freed memory.
> > >  * Usage of a pre-allocated pool (e.g. mempool) should be always considered
> > >  * before using this flag.
> > >
> > > GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_HIGH
> >
> > We want to free the HugeTLB page to the buddy allocator, but before that we
> > need to allocate some pages as vmemmap pages, so we cannot afford an
> > allocation failure here.
>
> Why can't you simply refuse to shrink the pool size?
>
> > I think that we should replace
> > __GFP_RETRY_MAYFAIL with __GFP_NOFAIL.
> >
> > GFP_KERNEL | __GFP_NOFAIL | __GFP_HIGH
> >
> > This meets our needs here. Thanks.
>
> Please read my concern about the disruptive behavior again, or explain
> why it is desirable.

OK, I will come up with a solution which does not use
__GFP_NOFAIL. Thanks.

>
> --
> Michal Hocko
> SUSE Labs
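
For reference, one possible shape of such a __GFP_NOFAIL-free variant, assembled from
the helpers in the patch below (a hypothetical sketch: it assumes alloc_vmemmap_pages()
is changed to return an error, and it omits the pmd-merge TODO from the posted code):

int alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
{
	pmd_t *pmd;
	spinlock_t *ptl;
	LIST_HEAD(remap_pages);

	if (!free_vmemmap_pages_per_hpage(h))
		return 0;

	/* Fail the operation instead of dipping into emergency reserves. */
	if (alloc_vmemmap_pages(h, &remap_pages))
		return -ENOMEM;

	pmd = vmemmap_to_pmd((unsigned long)head);
	BUG_ON(!pmd);

	ptl = vmemmap_pmd_lock(pmd);
	__remap_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head,
				      &remap_pages);
	freed_vmemmap_hpage_dec(pmd_page(*pmd));
	spin_unlock(ptl);

	return 0;
}

The prototype in mm/hugetlb_vmemmap.h (and its no-op stub) would need to return int as
well, and __free_hugepage() would then have to handle the error, for example by keeping
the page in the pool.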

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4aabf12aca9b..ba927ae7f9bd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1382,6 +1382,8 @@  static void __free_hugepage(struct hstate *h, struct page *page)
 {
 	int i;
 
+	alloc_huge_page_vmemmap(h, page);
+
 	for (i = 0; i < pages_per_huge_page(h); i++) {
 		page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
 				1 << PG_referenced | 1 << PG_dirty |
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index eda7e3a0b67c..361c4174e222 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -117,6 +117,8 @@ 
 #define RESERVE_VMEMMAP_NR		2U
 #define RESERVE_VMEMMAP_SIZE		(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
 #define TAIL_PAGE_REUSE			-1
+#define GFP_VMEMMAP_PAGE		\
+	(GFP_KERNEL | __GFP_NOFAIL | __GFP_MEMALLOC)
 
 #ifndef VMEMMAP_HPAGE_SHIFT
 #define VMEMMAP_HPAGE_SHIFT		HPAGE_SHIFT
@@ -250,6 +252,104 @@  static inline int freed_vmemmap_hpage_dec(struct page *page)
 	return atomic_dec_return_relaxed(&page->_mapcount) + 1;
 }
 
+static void __remap_huge_page_pte_vmemmap(struct page *reuse, pte_t *ptep,
+					  unsigned long start,
+					  unsigned long end,
+					  struct list_head *remap_pages)
+{
+	pgprot_t pgprot = PAGE_KERNEL;
+	void *from = page_to_virt(reuse);
+	unsigned long addr;
+
+	for (addr = start; addr < end; addr += PAGE_SIZE) {
+		void *to;
+		struct page *page;
+		pte_t entry, old = *ptep;
+
+		page = list_first_entry_or_null(remap_pages, struct page, lru);
+		list_del(&page->lru);
+		to = page_to_virt(page);
+		copy_page(to, from);
+
+		/*
+		 * Make sure that any data that writes to the @to is made
+		 * visible to the physical page.
+		 */
+		flush_kernel_vmap_range(to, PAGE_SIZE);
+
+		prepare_vmemmap_page(page);
+
+		entry = mk_pte(page, pgprot);
+		set_pte_at(&init_mm, addr, ptep++, entry);
+
+		VM_BUG_ON(!pte_present(old) || pte_page(old) != reuse);
+	}
+}
+
+static void __remap_huge_page_pmd_vmemmap(struct hstate *h, pmd_t *pmd,
+					  unsigned long addr,
+					  struct list_head *remap_pages)
+{
+	unsigned long next;
+	unsigned long start = addr + RESERVE_VMEMMAP_SIZE;
+	unsigned long end = addr + vmemmap_pages_size_per_hpage(h);
+	struct page *reuse = NULL;
+
+	addr = start;
+	do {
+		pte_t *ptep;
+
+		ptep = pte_offset_kernel(pmd, addr);
+		if (!reuse)
+			reuse = pte_page(ptep[TAIL_PAGE_REUSE]);
+
+		next = vmemmap_hpage_addr_end(addr, end);
+		__remap_huge_page_pte_vmemmap(reuse, ptep, addr, next,
+					      remap_pages);
+	} while (pmd++, addr = next, addr != end);
+
+	flush_tlb_kernel_range(start, end);
+}
+
+static inline void alloc_vmemmap_pages(struct hstate *h, struct list_head *list)
+{
+	int i;
+
+	for (i = 0; i < free_vmemmap_pages_per_hpage(h); i++) {
+		struct page *page;
+
+		/* This should not fail */
+		page = alloc_page(GFP_VMEMMAP_PAGE);
+		list_add_tail(&page->lru, list);
+	}
+}
+
+void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+	pmd_t *pmd;
+	spinlock_t *ptl;
+	LIST_HEAD(remap_pages);
+
+	if (!free_vmemmap_pages_per_hpage(h))
+		return;
+
+	alloc_vmemmap_pages(h, &remap_pages);
+
+	pmd = vmemmap_to_pmd((unsigned long)head);
+	BUG_ON(!pmd);
+
+	ptl = vmemmap_pmd_lock(pmd);
+	__remap_huge_page_pmd_vmemmap(h, pmd, (unsigned long)head,
+				      &remap_pages);
+	if (!freed_vmemmap_hpage_dec(pmd_page(*pmd))) {
+		/*
+		 * Todo:
+		 * Merge pte to huge pmd if it has ever been split.
+		 */
+	}
+	spin_unlock(ptl);
+}
+
 static inline void free_vmemmap_page_list(struct list_head *list)
 {
 	struct page *page, *next;
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 4175b44f88bc..6dfa7ed6f88a 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -14,6 +14,7 @@ 
 void __init hugetlb_vmemmap_init(struct hstate *h);
 int vmemmap_pgtable_prealloc(struct hstate *h, struct page *page);
 void vmemmap_pgtable_free(struct page *page);
+void alloc_huge_page_vmemmap(struct hstate *h, struct page *head);
 void free_huge_page_vmemmap(struct hstate *h, struct page *head);
 
 static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
@@ -34,6 +35,10 @@  static inline void vmemmap_pgtable_free(struct page *page)
 {
 }
 
+static inline void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+}
+
 static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 {
 }