
[v4,12/15] mm,hwpoison: Rework soft offline for in-use pages

Message ID 20200716123810.25292-13-osalvador@suse.de (mailing list archive)
State New, archived
Series Hwpoison soft-offline rework

Commit Message

Oscar Salvador July 16, 2020, 12:38 p.m. UTC
This patch changes the way we set and handle in-use poisoned pages.
Until now, poisoned pages were released to the buddy allocator, trusting
that the checks that take place prior to delivering the page to its end
user would act as a safety net and skip that page.

This has proved to be wrong, as there are pfn walkers out there, like
compaction, that only care about the page being PageBuddy and sitting in
a freelist.
Compaction might not be the only such user, but either way, having
poisoned pages in the buddy allocator seems a bad idea, as we should
only have free pages that are ready and meant to be used as such.

Before explaining the approach taken, let us break down the kinds
of pages we can soft offline.

- Anonymous THP (after the split, they end up being 4K pages)
- Hugetlb
- Order-0 pages (that can be either migrated or invalidated)

* Normal pages (order-0 and anon-THP)

  - If they are clean and unmapped page cache pages, we invalidate
    them by means of invalidate_inode_page().
  - If they are mapped/dirty, we do the isolate-and-migrate dance.

  Either way, we do not call put_page() directly from those paths.
  Instead, we keep the page and hand it to page_handle_poison() to
  perform the right handling.

  Among other things, page_handle_poison() sets the HWPoison flag and
  performs the last put_page().
  That last put_page() is there mainly so that __page_cache_release()
  gets called, since that function is not exported.

  Down the chain, we placed a check for hwpoisoned pages in
  free_pages_prepare(), which simply skips any poisoned page, so those
  pages end up neither in a pcplist nor in a buddy freelist.

  After that, we set the refcount on the page to 1 and we increment
  the poisoned pages counter.
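
  For reference, page_handle_poison() as introduced in this patch boils
  down to the following (condensed from the mm/memory-failure.c hunk
  below):

	static void page_handle_poison(struct page *page, bool release,
				       bool set_flag, bool huge_flag)
	{
		if (set_flag)
			SetPageHWPoison(page);

		if (huge_flag)
			dissolve_free_huge_page(page);
		else if (release)
			put_page(page);

		/* hold the page with refcount 1 and account it */
		page_ref_inc(page);
		num_poisoned_pages_inc();
	}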

  We could do as we do for free pages:
  1) wait until the page hits buddy's freelists
  2) take it off
  3) flag it

  The problem is that we could race with an allocation: by the time we
  want to take the page off the buddy freelist, it may already have been
  allocated, and then we cannot soft-offline it.
  That is not fatal, of course, but it is better to close the race,
  since doing so does not require a lot of code.
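
  To illustrate the race (a hypothetical interleaving, not code from
  this series):

	/*
	 * CPU A (soft offline)           CPU B (allocation)
	 *
	 * put_page(page)
	 *   page lands on a buddy freelist
	 *                                rmqueue() grabs the page
	 * take_page_off_buddy(page)
	 *   fails: the page is no longer PageBuddy,
	 *   so we cannot soft-offline it
	 */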

* Hugetlb pages

  - We isolate-and-migrate them

  There is no magic here; we just isolate and migrate them.
  A new set of internal functions has been added to flag a hugetlb page
  as poisoned (SetPageHugePoisoned(), PageHugePoisoned(),
  ClearPageHugePoisoned()).
  This allows us to flag the page when we migrate it, back in
  move_hugetlb_state().
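
  The flag is stored in an otherwise unused field of one of the tail
  pages (taken from the mm/hugetlb.c hunk below):

	static inline bool PageHugePoisoned(struct page *page)
	{
		if (!PageHuge(page))
			return false;

		return (unsigned long)page[3].mapping == -1U;
	}
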
  Later on, we check whether the page is poisoned in __free_huge_page(),
  and in that case we bail out before sending the page back to, e.g.,
  the active free list.
  This gives us full control of the page, and we can handle it in
  page_handle_poison().

  In other words, we do not allow migrated hugepages to get back to the
  freelists.

  Since the page now has no user and has been migrated, we can call
  dissolve_free_huge_page(), which will end up calling
  update_and_free_page().
  In update_and_free_page(), we check whether the page is poisoned.
  If so, we handle it as we handle gigantic pages, i.e. we break the
  page down into order-0 pages and free them one by one, as shown in
  the condensed hunk below.
  Doing so allows free_pages_prepare() to skip the poisoned ones.
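
  Condensed from the mm/hugetlb.c hunk below, the non-gigantic branch
  of update_and_free_page() becomes:

	if (unlikely(poisoned)) {
		/*
		 * If the hugepage is poisoned, do as we do for
		 * gigantic pages and free the pages as order-0.
		 * free_pages_prepare will skip over the poisoned ones.
		 */
		destroy_compound_gigantic_page(h, page, order);
		free_contig_range(page_to_pfn(page), 1 << order);
	} else {
		__free_pages(page, huge_page_order(h));
	}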

Because of the way we now handle in-use pages, we no longer need the
put-as-isolation-migratetype dance that guarded against poisoned pages
ending up in pcplists.

Signed-off-by: Oscar Salvador <osalvador@suse.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 include/linux/page-flags.h |  5 ----
 mm/hugetlb.c               | 60 +++++++++++++++++++++++++++++++++-----
 mm/memory-failure.c        | 53 +++++++++++++--------------------
 mm/migrate.c               | 11 ++-----
 mm/page_alloc.c            | 38 +++++++-----------------
 5 files changed, 86 insertions(+), 81 deletions(-)

Comments

HORIGUCHI NAOYA(堀口 直也) July 17, 2020, 6:55 a.m. UTC | #1
On Thu, Jul 16, 2020 at 02:38:06PM +0200, Oscar Salvador wrote:
> [...]

I ran Qian Cai's test program (https://github.com/cailca/linux-mm) on a
small (4GB memory) VM, and weirdly found that (1) the target hugepages
are not always dissolved and (2) dissolved hugepages are still counted
in "HugePages_Total:". See below:

    $ ./random 1
    - start: migrate_huge_offline
    - use NUMA nodes 0,1.
    - mmap and free 8388608 bytes hugepages on node 0
    - mmap and free 8388608 bytes hugepages on node 1
    madvise: Cannot allocate memory

    $ cat /proc/meminfo
    MemTotal:        4026772 kB
    MemFree:          976300 kB
    MemAvailable:     892840 kB
    Buffers:           20936 kB
    Cached:            99768 kB
    SwapCached:         5904 kB
    Active:            84332 kB
    Inactive:         116328 kB
    Active(anon):      27944 kB
    Inactive(anon):    68524 kB
    Active(file):      56388 kB
    Inactive(file):    47804 kB
    Unevictable:        7532 kB
    Mlocked:               0 kB
    SwapTotal:       2621436 kB
    SwapFree:        2609844 kB
    Dirty:                56 kB
    Writeback:             0 kB
    AnonPages:         81764 kB
    Mapped:            54348 kB
    Shmem:              8948 kB
    KReclaimable:      22744 kB
    Slab:              52056 kB
    SReclaimable:      22744 kB
    SUnreclaim:        29312 kB
    KernelStack:        3888 kB
    PageTables:         2804 kB
    NFS_Unstable:          0 kB
    Bounce:                0 kB
    WritebackTmp:          0 kB
    CommitLimit:     3260612 kB
    Committed_AS:     828196 kB
    VmallocTotal:   34359738367 kB
    VmallocUsed:       19260 kB
    VmallocChunk:          0 kB
    Percpu:             5120 kB
    HardwareCorrupted:  5368 kB
    AnonHugePages:     18432 kB
    ShmemHugePages:        0 kB
    ShmemPmdMapped:        0 kB
    FileHugePages:         0 kB
    FilePmdMapped:         0 kB
    CmaTotal:              0 kB
    CmaFree:               0 kB
    HugePages_Total:    1342     // still counted as hugetlb pages.
    HugePages_Free:        0     // all hugepage are still allocated (or leaked?)
    HugePages_Rsvd:        0
    HugePages_Surp:      762     // some are counted in surplus.
    Hugepagesize:       2048 kB
    Hugetlb:         2748416 kB
    DirectMap4k:      112480 kB
    DirectMap2M:     4081664 kB


    $ page-types -b hwpoison
                 flags      page-count       MB  symbolic-flags                     long-symbolic-flags
    0x0000000000080008             421        1  ___U_______________X_______________________      uptodate,hwpoison
    0x00000000000a8018               1        0  ___UD__________H_G_X_______________________      uptodate,dirty,compound_head,huge,hwpoison
    0x00000000000a801c             920        3  __RUD__________H_G_X_______________________      referenced,uptodate,dirty,compound_head,huge,hwpoison
                 total            1342        5

This means that some hugepages are dissolved but others are not, which
is probably not desirable.
I'll dig into this more later, but let me share it first.

A few minor comments below ...

> 
> Signed-off-by: Oscar Salvador <osalvador@suse.com>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
>  include/linux/page-flags.h |  5 ----
>  mm/hugetlb.c               | 60 +++++++++++++++++++++++++++++++++-----
>  mm/memory-failure.c        | 53 +++++++++++++--------------------
>  mm/migrate.c               | 11 ++-----
>  mm/page_alloc.c            | 38 +++++++-----------------
>  5 files changed, 86 insertions(+), 81 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 01baf6d426ff..2ac8bfa0cf20 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -426,13 +426,8 @@ PAGEFLAG(HWPoison, hwpoison, PF_ANY)
>  TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
>  #define __PG_HWPOISON (1UL << PG_hwpoison)
>  extern bool take_page_off_buddy(struct page *page);
> -extern bool set_hwpoison_free_buddy_page(struct page *page);
>  #else
>  PAGEFLAG_FALSE(HWPoison)
> -static inline bool set_hwpoison_free_buddy_page(struct page *page)
> -{
> -	return 0;
> -}
>  #define __PG_HWPOISON 0
>  #endif
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7badb01d15e3..1c6397936512 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -29,6 +29,7 @@
>  #include <linux/numa.h>
>  #include <linux/llist.h>
>  #include <linux/cma.h>
> +#include <linux/migrate.h>
>  
>  #include <asm/page.h>
>  #include <asm/pgalloc.h>
> @@ -1209,9 +1210,26 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
>  		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
>  		nr_nodes--)
>  
> -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> -static void destroy_compound_gigantic_page(struct page *page,
> -					unsigned int order)
> +static inline bool PageHugePoisoned(struct page *page)
> +{
> +	if (!PageHuge(page))
> +		return false;
> +
> +	return (unsigned long)page[3].mapping == -1U;
> +}
> +
> +static inline void SetPageHugePoisoned(struct page *page)
> +{
> +	page[3].mapping = (void *)-1U;
> +}
> +
> +static inline void ClearPageHugePoisoned(struct page *page)
> +{
> +	page[3].mapping = NULL;
> +}
> +
> +static void destroy_compound_gigantic_page(struct hstate *h, struct page *page,
> +					   unsigned int order)
>  {
>  	int i;
>  	int nr_pages = 1 << order;
> @@ -1222,14 +1240,19 @@ static void destroy_compound_gigantic_page(struct page *page,
>  		atomic_set(compound_pincount_ptr(page), 0);
>  
>  	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
> +		if (!hstate_is_gigantic(h))
> +			 p->mapping = NULL;
>  		clear_compound_head(p);
>  		set_page_refcounted(p);
>  	}
>  
> +	if (PageHugePoisoned(page))
> +		ClearPageHugePoisoned(page);
>  	set_compound_order(page, 0);
>  	__ClearPageHead(page);
>  }
>  
> +#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>  static void free_gigantic_page(struct page *page, unsigned int order)
>  {
>  	/*
> @@ -1284,13 +1307,16 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
>  	return NULL;
>  }
>  static inline void free_gigantic_page(struct page *page, unsigned int order) { }
> -static inline void destroy_compound_gigantic_page(struct page *page,
> -						unsigned int order) { }
> +static inline void destroy_compound_gigantic_page(struct hstate *h,
> +						  struct page *page,
> +						  unsigned int order) { }
>  #endif
>  
>  static void update_and_free_page(struct hstate *h, struct page *page)
>  {
>  	int i;
> +	bool poisoned = PageHugePoisoned(page);
> +	unsigned int order = huge_page_order(h);
>  
>  	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>  		return;
> @@ -1313,11 +1339,21 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>  		 * we might block in free_gigantic_page().
>  		 */
>  		spin_unlock(&hugetlb_lock);
> -		destroy_compound_gigantic_page(page, huge_page_order(h));
> -		free_gigantic_page(page, huge_page_order(h));
> +		destroy_compound_gigantic_page(h, page, order);
> +		free_gigantic_page(page, order);
>  		spin_lock(&hugetlb_lock);
>  	} else {
> -		__free_pages(page, huge_page_order(h));
> +		if (unlikely(poisoned)) {
> +			/*
> +			 * If the hugepage is poisoned, do as we do for
> +			 * gigantic pages and free the pages as order-0.
> +			 * free_pages_prepare will skip over the poisoned ones.
> +			 */
> +			destroy_compound_gigantic_page(h, page, order);

Judging from its name, this function is for gigantic pages, so it
shouldn't be called for non-gigantic hugepages. Maybe renaming it and/or
introducing some inner function layer to factor out the common part
would be better.
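
For example (hypothetical names, just to sketch the suggestion):

	/* common tail-page teardown shared by all hugepage sizes */
	static void __destroy_compound_hugetlb_page(struct hstate *h,
						    struct page *page,
						    unsigned int order)
	{
		/* current body of destroy_compound_gigantic_page() */
	}

	static void destroy_compound_gigantic_page(struct hstate *h,
						   struct page *page,
						   unsigned int order)
	{
		__destroy_compound_hugetlb_page(h, page, order);
	}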

> +			free_contig_range(page_to_pfn(page), 1 << order);
> +		} else {
> +			__free_pages(page, huge_page_order(h));
> +		}
>  	}
>  }
>  
> @@ -1427,6 +1463,11 @@ static void __free_huge_page(struct page *page)
>  	if (restore_reserve)
>  		h->resv_huge_pages++;
>  
> +	if (PageHugePoisoned(page)) {
> +		spin_unlock(&hugetlb_lock);
> +		return;
> +	}
> +
>  	if (PageHugeTemporary(page)) {
>  		list_del(&page->lru);
>  		ClearPageHugeTemporary(page);
> @@ -5642,6 +5683,9 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
>  	hugetlb_cgroup_migrate(oldpage, newpage);
>  	set_page_owner_migrate_reason(newpage, reason);
>  
> +	if (reason == MR_MEMORY_FAILURE)
> +		SetPageHugePoisoned(oldpage);
> +
>  	/*
>  	 * transfer temporary state of the new huge page. This is
>  	 * reverse to other transitions because the newpage is going to
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index caf012d34607..c0ebab4eed4c 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -65,9 +65,17 @@ int sysctl_memory_failure_recovery __read_mostly = 1;
>  
>  atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
>  
> -static void page_handle_poison(struct page *page)
> +static void page_handle_poison(struct page *page, bool release, bool set_flag,
> +			       bool huge_flag)
>  {
> -	SetPageHWPoison(page);
> +	if (set_flag)
> +		SetPageHWPoison(page);
> +
> +        if (huge_flag)
> +		dissolve_free_huge_page(page);
> +        else if (release)
> +		put_page(page);
> +

Indentation seems to be broken; you can run checkpatch.pl to see the details.
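
For example, from the top of the kernel tree (the patch file name here
is just an example):

	$ ./scripts/checkpatch.pl --strict v4-0012-mm-hwpoison-rework-soft-offline-for-in-use-pages.patch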

Thanks,
Naoya Horiguchi

>  	page_ref_inc(page);
>  	num_poisoned_pages_inc();
>  }
> @@ -1717,7 +1725,7 @@ static int get_any_page(struct page *page, unsigned long pfn)
>  
>  static int soft_offline_huge_page(struct page *page)
>  {
> -	int ret;
> +	int ret = -EBUSY;
>  	unsigned long pfn = page_to_pfn(page);
>  	struct page *hpage = compound_head(page);
>  	LIST_HEAD(pagelist);
> @@ -1757,19 +1765,12 @@ static int soft_offline_huge_page(struct page *page)
>  			ret = -EIO;
>  	} else {
>  		/*
> -		 * We set PG_hwpoison only when the migration source hugepage
> -		 * was successfully dissolved, because otherwise hwpoisoned
> -		 * hugepage remains on free hugepage list, then userspace will
> -		 * find it as SIGBUS by allocation failure. That's not expected
> -		 * in soft-offlining.
> +		 * At this point the page cannot be in-use since we do not
> +		 * let the page to go back to hugetlb freelists.
> +		 * In that case we just need to dissolve it.
> +		 * page_handle_poison will take care of it.
>  		 */
> -		ret = dissolve_free_huge_page(page);
> -		if (!ret) {
> -			if (set_hwpoison_free_buddy_page(page))
> -				num_poisoned_pages_inc();
> -			else
> -				ret = -EBUSY;
> -		}
> +		page_handle_poison(page, true, true, true);
>  	}
>  	return ret;
>  }
> @@ -1804,10 +1805,8 @@ static int __soft_offline_page(struct page *page)
>  	 * would need to fix isolation locking first.
>  	 */
>  	if (ret == 1) {
> -		put_page(page);
>  		pr_info("soft_offline: %#lx: invalidated\n", pfn);
> -		SetPageHWPoison(page);
> -		num_poisoned_pages_inc();
> +		page_handle_poison(page, true, true, false);
>  		return 0;
>  	}
>  
> @@ -1838,7 +1837,9 @@ static int __soft_offline_page(struct page *page)
>  		list_add(&page->lru, &pagelist);
>  		ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
>  					MIGRATE_SYNC, MR_MEMORY_FAILURE);
> -		if (ret) {
> +		if (!ret) {
> +			page_handle_poison(page, true, true, false);
> +		} else {
>  			if (!list_empty(&pagelist))
>  				putback_movable_pages(&pagelist);
>  
> @@ -1857,37 +1858,25 @@ static int __soft_offline_page(struct page *page)
>  static int soft_offline_in_use_page(struct page *page)
>  {
>  	int ret;
> -	int mt;
>  	struct page *hpage = compound_head(page);
>  
>  	if (!PageHuge(page) && PageTransHuge(hpage))
>  		if (try_to_split_thp_page(page, "soft offline") < 0)
>  			return -EBUSY;
>  
> -	/*
> -	 * Setting MIGRATE_ISOLATE here ensures that the page will be linked
> -	 * to free list immediately (not via pcplist) when released after
> -	 * successful page migration. Otherwise we can't guarantee that the
> -	 * page is really free after put_page() returns, so
> -	 * set_hwpoison_free_buddy_page() highly likely fails.
> -	 */
> -	mt = get_pageblock_migratetype(page);
> -	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
>  	if (PageHuge(page))
>  		ret = soft_offline_huge_page(page);
>  	else
>  		ret = __soft_offline_page(page);
> -	set_pageblock_migratetype(page, mt);
>  	return ret;
>  }
>  
>  static int soft_offline_free_page(struct page *page)
>  {
> -	int rc = dissolve_free_huge_page(page);
> +	int rc = -EBUSY;
>  
>  	if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
> -		page_handle_poison(page);
> +		page_handle_poison(page, false, true, false);
>  		rc = 0;
>  	}
>  
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 75c10d81e833..a68d81d0ae6e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1222,16 +1222,11 @@ static int unmap_and_move(new_page_t get_new_page,
>  	 * we want to retry.
>  	 */
>  	if (rc == MIGRATEPAGE_SUCCESS) {
> -		put_page(page);
> -		if (reason == MR_MEMORY_FAILURE) {
> +		if (reason != MR_MEMORY_FAILURE)
>  			/*
> -			 * Set PG_HWPoison on just freed page
> -			 * intentionally. Although it's rather weird,
> -			 * it's how HWPoison flag works at the moment.
> +			 * We handle poisoned pages in page_handle_poison.
>  			 */
> -			if (set_hwpoison_free_buddy_page(page))
> -				num_poisoned_pages_inc();
> -		}
> +			put_page(page);
>  	} else {
>  		if (rc != -EAGAIN) {
>  			if (likely(!__PageMovable(page))) {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4fa0e0887c07..11df51fc2718 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1175,6 +1175,16 @@ static __always_inline bool free_pages_prepare(struct page *page,
>  
>  	trace_mm_page_free(page, order);
>  
> +	if (unlikely(PageHWPoison(page)) && !order) {
> +		/*
> +		 * Untie memcg state and reset page's owner
> +		 */
> +		if (memcg_kmem_enabled() && PageKmemcg(page))
> +			__memcg_kmem_uncharge_page(page, order);
> +		reset_page_owner(page, order);
> +		return false;
> +	}
> +
>  	/*
>  	 * Check tail pages before head page information is cleared to
>  	 * avoid checking PageCompound for order-0 pages.
> @@ -8844,32 +8854,4 @@ bool take_page_off_buddy(struct page *page)
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  	return ret;
>  }
> -
> -/*
> - * Set PG_hwpoison flag if a given page is confirmed to be a free page.  This
> - * test is performed under the zone lock to prevent a race against page
> - * allocation.
> - */
> -bool set_hwpoison_free_buddy_page(struct page *page)
> -{
> -	struct zone *zone = page_zone(page);
> -	unsigned long pfn = page_to_pfn(page);
> -	unsigned long flags;
> -	unsigned int order;
> -	bool hwpoisoned = false;
> -
> -	spin_lock_irqsave(&zone->lock, flags);
> -	for (order = 0; order < MAX_ORDER; order++) {
> -		struct page *page_head = page - (pfn & ((1 << order) - 1));
> -
> -		if (PageBuddy(page_head) && page_order(page_head) >= order) {
> -			if (!TestSetPageHWPoison(page))
> -				hwpoisoned = true;
> -			break;
> -		}
> -	}
> -	spin_unlock_irqrestore(&zone->lock, flags);
> -
> -	return hwpoisoned;
> -}
>  #endif
> -- 
> 2.26.2
>
Oscar Salvador July 20, 2020, 8:27 a.m. UTC | #2
On 2020-07-17 08:55, HORIGUCHI NAOYA wrote:
> I ran Qian Cai's test program (https://github.com/cailca/linux-mm) on a
> small (4GB memory) VM, and weirdly found that (1) the target hugepages
> are not always dissolved and (2) dissolved hugepages are still counted
> in "HugePages_Total:". See below:
> 
> [...]
> 
> This means that some hugepages are dissolved but others are not, which
> is probably not desirable.
> I'll dig into this more later, but let me share it first.
> 
> A few minor comments below ...

Uhm, weird.

I will be taking a look today.

Thanks
Oscar Salvador July 22, 2020, 8:08 a.m. UTC | #3
On 2020-07-20 10:27, osalvador@suse.de wrote:
> On 2020-07-17 08:55, HORIGUCHI NAOYA wrote:
>> I ran Qian Cai's test program (https://github.com/cailca/linux-mm) on
>> a small (4GB memory) VM, and weirdly found that (1) the target
>> hugepages are not always dissolved and (2) dissolved hugepages are
>> still counted in "HugePages_Total:". See below:
>> 
>> [...]
>> 
>> This means that some hugepages are dissolved but others are not,
>> which is probably not desirable.
>> I'll dig into this more later, but let me share it first.
>> 
>> A few minor comments below ...
> 
> 
> Uhm, weird.
> 
> I will be taking a look today.

After some digging I __think__ I found the problem.
I will try to fix it up and run the tests.

I might reach out to you once I am done because I remember you had a 
test-suite that worked quite well, so you can give it a spin there.

Thanks
Oscar Salvador July 23, 2020, 10:19 a.m. UTC | #4
On Wed, Jul 22, 2020 at 10:08:59AM +0200, osalvador@suse.de wrote:
> On 2020-07-20 10:27, osalvador@suse.de wrote:
> > > This means that some hugepages are dissolved but others are not,
> > > which is probably not desirable.
> > > I'll dig into this more later, but let me share it first.
> > > 
> > > A few minor comments below ...
> > 
> > 
> > Uhm, weird.
> > 
> > I will be taking a look today.
> 
> After some digging I __think__ I found the problem.
> I will try to fix it up and run the tests.

I found the problem.
I re-ran the tests with both small and large memory, and
the stats look correct this time.

After some more testing, I also fixed a list corruption that was happening
due to the same problem.

I am creating a git branch so you can re-run your tests on it as well.

Patch

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 01baf6d426ff..2ac8bfa0cf20 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -426,13 +426,8 @@  PAGEFLAG(HWPoison, hwpoison, PF_ANY)
 TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
 #define __PG_HWPOISON (1UL << PG_hwpoison)
 extern bool take_page_off_buddy(struct page *page);
-extern bool set_hwpoison_free_buddy_page(struct page *page);
 #else
 PAGEFLAG_FALSE(HWPoison)
-static inline bool set_hwpoison_free_buddy_page(struct page *page)
-{
-	return 0;
-}
 #define __PG_HWPOISON 0
 #endif
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7badb01d15e3..1c6397936512 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -29,6 +29,7 @@ 
 #include <linux/numa.h>
 #include <linux/llist.h>
 #include <linux/cma.h>
+#include <linux/migrate.h>
 
 #include <asm/page.h>
 #include <asm/pgalloc.h>
@@ -1209,9 +1210,26 @@  static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
 		nr_nodes--)
 
-#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-static void destroy_compound_gigantic_page(struct page *page,
-					unsigned int order)
+static inline bool PageHugePoisoned(struct page *page)
+{
+	if (!PageHuge(page))
+		return false;
+
+	return (unsigned long)page[3].mapping == -1U;
+}
+
+static inline void SetPageHugePoisoned(struct page *page)
+{
+	page[3].mapping = (void *)-1U;
+}
+
+static inline void ClearPageHugePoisoned(struct page *page)
+{
+	page[3].mapping = NULL;
+}
+
+static void destroy_compound_gigantic_page(struct hstate *h, struct page *page,
+					   unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
@@ -1222,14 +1240,19 @@  static void destroy_compound_gigantic_page(struct page *page,
 		atomic_set(compound_pincount_ptr(page), 0);
 
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
+		if (!hstate_is_gigantic(h))
+			 p->mapping = NULL;
 		clear_compound_head(p);
 		set_page_refcounted(p);
 	}
 
+	if (PageHugePoisoned(page))
+		ClearPageHugePoisoned(page);
 	set_compound_order(page, 0);
 	__ClearPageHead(page);
 }
 
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
 static void free_gigantic_page(struct page *page, unsigned int order)
 {
 	/*
@@ -1284,13 +1307,16 @@  static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 	return NULL;
 }
 static inline void free_gigantic_page(struct page *page, unsigned int order) { }
-static inline void destroy_compound_gigantic_page(struct page *page,
-						unsigned int order) { }
+static inline void destroy_compound_gigantic_page(struct hstate *h,
+						  struct page *page,
+						  unsigned int order) { }
 #endif
 
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
+	bool poisoned = PageHugePoisoned(page);
+	unsigned int order = huge_page_order(h);
 
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
@@ -1313,11 +1339,21 @@  static void update_and_free_page(struct hstate *h, struct page *page)
 		 * we might block in free_gigantic_page().
 		 */
 		spin_unlock(&hugetlb_lock);
-		destroy_compound_gigantic_page(page, huge_page_order(h));
-		free_gigantic_page(page, huge_page_order(h));
+		destroy_compound_gigantic_page(h, page, order);
+		free_gigantic_page(page, order);
 		spin_lock(&hugetlb_lock);
 	} else {
-		__free_pages(page, huge_page_order(h));
+		if (unlikely(poisoned)) {
+			/*
+			 * If the hugepage is poisoned, do as we do for
+			 * gigantic pages and free the pages as order-0.
+			 * free_pages_prepare will skip over the poisoned ones.
+			 */
+			destroy_compound_gigantic_page(h, page, order);
+			free_contig_range(page_to_pfn(page), 1 << order);
+		} else {
+			__free_pages(page, huge_page_order(h));
+		}
 	}
 }
 
@@ -1427,6 +1463,11 @@  static void __free_huge_page(struct page *page)
 	if (restore_reserve)
 		h->resv_huge_pages++;
 
+	if (PageHugePoisoned(page)) {
+		spin_unlock(&hugetlb_lock);
+		return;
+	}
+
 	if (PageHugeTemporary(page)) {
 		list_del(&page->lru);
 		ClearPageHugeTemporary(page);
@@ -5642,6 +5683,9 @@  void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
 	hugetlb_cgroup_migrate(oldpage, newpage);
 	set_page_owner_migrate_reason(newpage, reason);
 
+	if (reason == MR_MEMORY_FAILURE)
+		SetPageHugePoisoned(oldpage);
+
 	/*
 	 * transfer temporary state of the new huge page. This is
 	 * reverse to other transitions because the newpage is going to
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index caf012d34607..c0ebab4eed4c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -65,9 +65,17 @@  int sysctl_memory_failure_recovery __read_mostly = 1;
 
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
-static void page_handle_poison(struct page *page)
+static void page_handle_poison(struct page *page, bool release, bool set_flag,
+			       bool huge_flag)
 {
-	SetPageHWPoison(page);
+	if (set_flag)
+		SetPageHWPoison(page);
+
+        if (huge_flag)
+		dissolve_free_huge_page(page);
+        else if (release)
+		put_page(page);
+
 	page_ref_inc(page);
 	num_poisoned_pages_inc();
 }
@@ -1717,7 +1725,7 @@  static int get_any_page(struct page *page, unsigned long pfn)
 
 static int soft_offline_huge_page(struct page *page)
 {
-	int ret;
+	int ret = -EBUSY;
 	unsigned long pfn = page_to_pfn(page);
 	struct page *hpage = compound_head(page);
 	LIST_HEAD(pagelist);
@@ -1757,19 +1765,12 @@  static int soft_offline_huge_page(struct page *page)
 			ret = -EIO;
 	} else {
 		/*
-		 * We set PG_hwpoison only when the migration source hugepage
-		 * was successfully dissolved, because otherwise hwpoisoned
-		 * hugepage remains on free hugepage list, then userspace will
-		 * find it as SIGBUS by allocation failure. That's not expected
-		 * in soft-offlining.
+		 * At this point the page cannot be in-use since we do not
+		 * let the page to go back to hugetlb freelists.
+		 * In that case we just need to dissolve it.
+		 * page_handle_poison will take care of it.
 		 */
-		ret = dissolve_free_huge_page(page);
-		if (!ret) {
-			if (set_hwpoison_free_buddy_page(page))
-				num_poisoned_pages_inc();
-			else
-				ret = -EBUSY;
-		}
+		page_handle_poison(page, true, true, true);
 	}
 	return ret;
 }
@@ -1804,10 +1805,8 @@  static int __soft_offline_page(struct page *page)
 	 * would need to fix isolation locking first.
 	 */
 	if (ret == 1) {
-		put_page(page);
 		pr_info("soft_offline: %#lx: invalidated\n", pfn);
-		SetPageHWPoison(page);
-		num_poisoned_pages_inc();
+		page_handle_poison(page, true, true, false);
 		return 0;
 	}
 
@@ -1838,7 +1837,9 @@  static int __soft_offline_page(struct page *page)
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
 					MIGRATE_SYNC, MR_MEMORY_FAILURE);
-		if (ret) {
+		if (!ret) {
+			page_handle_poison(page, true, true, false);
+		} else {
 			if (!list_empty(&pagelist))
 				putback_movable_pages(&pagelist);
 
@@ -1857,37 +1858,25 @@  static int __soft_offline_page(struct page *page)
 static int soft_offline_in_use_page(struct page *page)
 {
 	int ret;
-	int mt;
 	struct page *hpage = compound_head(page);
 
 	if (!PageHuge(page) && PageTransHuge(hpage))
 		if (try_to_split_thp_page(page, "soft offline") < 0)
 			return -EBUSY;
 
-	/*
-	 * Setting MIGRATE_ISOLATE here ensures that the page will be linked
-	 * to free list immediately (not via pcplist) when released after
-	 * successful page migration. Otherwise we can't guarantee that the
-	 * page is really free after put_page() returns, so
-	 * set_hwpoison_free_buddy_page() highly likely fails.
-	 */
-	mt = get_pageblock_migratetype(page);
-	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 	if (PageHuge(page))
 		ret = soft_offline_huge_page(page);
 	else
 		ret = __soft_offline_page(page);
-	set_pageblock_migratetype(page, mt);
 	return ret;
 }
 
 static int soft_offline_free_page(struct page *page)
 {
-	int rc = dissolve_free_huge_page(page);
+	int rc = -EBUSY;
 
 	if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
-		page_handle_poison(page);
+		page_handle_poison(page, false, true, false);
 		rc = 0;
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 75c10d81e833..a68d81d0ae6e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1222,16 +1222,11 @@  static int unmap_and_move(new_page_t get_new_page,
 	 * we want to retry.
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		put_page(page);
-		if (reason == MR_MEMORY_FAILURE) {
+		if (reason != MR_MEMORY_FAILURE)
 			/*
-			 * Set PG_HWPoison on just freed page
-			 * intentionally. Although it's rather weird,
-			 * it's how HWPoison flag works at the moment.
+			 * We handle poisoned pages in page_handle_poison.
 			 */
-			if (set_hwpoison_free_buddy_page(page))
-				num_poisoned_pages_inc();
-		}
+			put_page(page);
 	} else {
 		if (rc != -EAGAIN) {
 			if (likely(!__PageMovable(page))) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4fa0e0887c07..11df51fc2718 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1175,6 +1175,16 @@  static __always_inline bool free_pages_prepare(struct page *page,
 
 	trace_mm_page_free(page, order);
 
+	if (unlikely(PageHWPoison(page)) && !order) {
+		/*
+		 * Untie memcg state and reset page's owner
+		 */
+		if (memcg_kmem_enabled() && PageKmemcg(page))
+			__memcg_kmem_uncharge_page(page, order);
+		reset_page_owner(page, order);
+		return false;
+	}
+
 	/*
 	 * Check tail pages before head page information is cleared to
 	 * avoid checking PageCompound for order-0 pages.
@@ -8844,32 +8854,4 @@  bool take_page_off_buddy(struct page *page)
 	spin_unlock_irqrestore(&zone->lock, flags);
 	return ret;
 }
-
-/*
- * Set PG_hwpoison flag if a given page is confirmed to be a free page.  This
- * test is performed under the zone lock to prevent a race against page
- * allocation.
- */
-bool set_hwpoison_free_buddy_page(struct page *page)
-{
-	struct zone *zone = page_zone(page);
-	unsigned long pfn = page_to_pfn(page);
-	unsigned long flags;
-	unsigned int order;
-	bool hwpoisoned = false;
-
-	spin_lock_irqsave(&zone->lock, flags);
-	for (order = 0; order < MAX_ORDER; order++) {
-		struct page *page_head = page - (pfn & ((1 << order) - 1));
-
-		if (PageBuddy(page_head) && page_order(page_head) >= order) {
-			if (!TestSetPageHWPoison(page))
-				hwpoisoned = true;
-			break;
-		}
-	}
-	spin_unlock_irqrestore(&zone->lock, flags);
-
-	return hwpoisoned;
-}
 #endif