
[v2,1/2] mm: fix race on soft-offlining free huge pages

Message ID 1531805552-19547-2-git-send-email-n-horiguchi@ah.jp.nec.com (mailing list archive)
State New, archived

Commit Message

Naoya Horiguchi July 17, 2018, 5:32 a.m. UTC
There's a race condition between soft offline and hugetlb_fault which
causes unexpected process killing and/or hugetlb allocation failure.

The process killing is caused by the following flow:

  CPU 0               CPU 1              CPU 2

  soft offline
    get_any_page
    // find the hugetlb is free
                      mmap a hugetlb file
                      page fault
                        ...
                          hugetlb_fault
                            hugetlb_no_page
                              alloc_huge_page
                              // succeed
      soft_offline_free_page
      // set hwpoison flag
                                         mmap the hugetlb file
                                         page fault
                                           ...
                                             hugetlb_fault
                                               hugetlb_no_page
                                                 find_lock_page
                                                   return VM_FAULT_HWPOISON
                                           mm_fault_error
                                             do_sigbus
                                             // kill the process


The hugetlb allocation failure comes from the following flow:

  CPU 0                          CPU 1

                                 mmap a hugetlb file
                                 // reserve all free pages but don't fault-in
  soft offline
    get_any_page
    // find the hugetlb is free
      soft_offline_free_page
      // set hwpoison flag
        dissolve_free_huge_page
        // fail because all free hugepages are reserved
                                 page fault
                                   ...
                                     hugetlb_fault
                                       hugetlb_no_page
                                         alloc_huge_page
                                           ...
                                             dequeue_huge_page_node_exact
                                             // ignore hwpoisoned hugepage
                                             // and finally fail due to no-mem

The root cause of this is that the current soft-offline code is written
on the assumption that the PageHWPoison flag should be set first to
avoid accessing the corrupted data.  This makes sense for memory_failure()
or hard offline, but not for soft offline, because soft offline is about
corrected (not uncorrected) errors and is safe from data loss.
This patch changes the soft offline semantics so that the PageHWPoison
flag is set only after containment of the error page completes successfully.
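
For reference, the process killing in the first flow comes from the hwpoison
check in hugetlb_no_page(); a simplified excerpt (trimmed from the
v4.18-era mm/hugetlb.c for illustration, not part of this patch):

	page = find_lock_page(mapping, idx);
	...
	/*
	 * If memory error occurs between mmap() and fault, some process
	 * don't have hwpoisoned swap entry for errored virtual address.
	 * So we need to block hugepage fault by PG_hwpoison bit check.
	 */
	if (unlikely(PageHWPoison(page))) {
		ret = VM_FAULT_HWPOISON |
			VM_FAULT_SET_HINDEX(hstate_index(h));
		goto backout_unlocked;
	}

Once the flag is set on a hugepage that was meanwhile allocated and mapped,
any later fault on the same file offset hits this check and the process is
killed with SIGBUS.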

Reported-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Suggested-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
changelog v1->v2:
- don't use set_hwpoison_free_buddy_page() (not defined yet)
- updated comment in soft_offline_huge_page()
---
 mm/hugetlb.c        | 11 +++++------
 mm/memory-failure.c | 24 ++++++++++++++++++------
 mm/migrate.c        |  2 --
 3 files changed, 23 insertions(+), 14 deletions(-)

Comments

Michal Hocko July 17, 2018, 2:27 p.m. UTC | #1
On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> There's a race condition between soft offline and hugetlb_fault which
> causes unexpected process killing and/or hugetlb allocation failure.
> 
> The process killing is caused by the following flow:
> 
>   CPU 0               CPU 1              CPU 2
> 
>   soft offline
>     get_any_page
>     // find the hugetlb is free
>                       mmap a hugetlb file
>                       page fault
>                         ...
>                           hugetlb_fault
>                             hugetlb_no_page
>                               alloc_huge_page
>                               // succeed
>       soft_offline_free_page
>       // set hwpoison flag
>                                          mmap the hugetlb file
>                                          page fault
>                                            ...
>                                              hugetlb_fault
>                                                hugetlb_no_page
>                                                  find_lock_page
>                                                    return VM_FAULT_HWPOISON
>                                            mm_fault_error
>                                              do_sigbus
>                                              // kill the process
> 
> 
> The hugetlb allocation failure comes from the following flow:
> 
>   CPU 0                          CPU 1
> 
>                                  mmap a hugetlb file
>                                  // reserve all free pages but don't fault-in
>   soft offline
>     get_any_page
>     // find the hugetlb is free
>       soft_offline_free_page
>       // set hwpoison flag
>         dissolve_free_huge_page
>         // fail because all free hugepages are reserved
>                                  page fault
>                                    ...
>                                      hugetlb_fault
>                                        hugetlb_no_page
>                                          alloc_huge_page
>                                            ...
>                                              dequeue_huge_page_node_exact
>                                              // ignore hwpoisoned hugepage
>                                              // and finally fail due to no-mem
> 
> The root cause of this is that the current soft-offline code is written
> on the assumption that the PageHWPoison flag should be set first to
> avoid accessing the corrupted data.  This makes sense for memory_failure()
> or hard offline, but not for soft offline, because soft offline is about
> corrected (not uncorrected) errors and is safe from data loss.
> This patch changes the soft offline semantics so that the PageHWPoison
> flag is set only after containment of the error page completes successfully.

Could you please expand on the workflow here? The code is really
hard to grasp. I must be missing something because the thing shouldn't
really be complicated. Either the page is in the free pool and you just
remove it from the allocator (with hugetlb asking for a new hugetlb page
to guarantee reserves) or it is used and you just migrate the content to
a new page (again with the hugetlb reserves consideration). Why should
the ordering of the PageHWPoison flag have any relevance?

Do I get it right that the only difference between the hard and soft
offlining is that hugetlb reserves might break for the former while not
for the latter and that the failed migration kills all owners for the
former while not for the latter?
 
> Reported-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
> Suggested-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
> changelog v1->v2:
> - don't use set_hwpoison_free_buddy_page() (not defined yet)
> - updated comment in soft_offline_huge_page()
> ---
>  mm/hugetlb.c        | 11 +++++------
>  mm/memory-failure.c | 24 ++++++++++++++++++------
>  mm/migrate.c        |  2 --
>  3 files changed, 23 insertions(+), 14 deletions(-)
> 
> diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/hugetlb.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/hugetlb.c
> index 430be42..937c142 100644
> --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/hugetlb.c
> +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/hugetlb.c
> @@ -1479,22 +1479,20 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  /*
>   * Dissolve a given free hugepage into free buddy pages. This function does
>   * nothing for in-use (including surplus) hugepages. Returns -EBUSY if the
> - * number of free hugepages would be reduced below the number of reserved
> - * hugepages.
> + * dissolution fails because a given page is not a free hugepage, or because
> + * free hugepages are fully reserved.
>   */
>  int dissolve_free_huge_page(struct page *page)
>  {
> -	int rc = 0;
> +	int rc = -EBUSY;
>  
>  	spin_lock(&hugetlb_lock);
>  	if (PageHuge(page) && !page_count(page)) {
>  		struct page *head = compound_head(page);
>  		struct hstate *h = page_hstate(head);
>  		int nid = page_to_nid(head);
> -		if (h->free_huge_pages - h->resv_huge_pages == 0) {
> -			rc = -EBUSY;
> +		if (h->free_huge_pages - h->resv_huge_pages == 0)
>  			goto out;
> -		}
>  		/*
>  		 * Move PageHWPoison flag from head page to the raw error page,
>  		 * which makes any subpages rather than the error page reusable.
> @@ -1508,6 +1506,7 @@ int dissolve_free_huge_page(struct page *page)
>  		h->free_huge_pages_node[nid]--;
>  		h->max_huge_pages--;
>  		update_and_free_page(h, head);
> +		rc = 0;
>  	}
>  out:
>  	spin_unlock(&hugetlb_lock);
> diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c
> index 9d142b9..9b77f85 100644
> --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c
> +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c
> @@ -1598,8 +1598,20 @@ static int soft_offline_huge_page(struct page *page, int flags)
>  		if (ret > 0)
>  			ret = -EIO;
>  	} else {
> -		if (PageHuge(page))
> -			dissolve_free_huge_page(page);
> +		/*
> +		 * We set PG_hwpoison only when the migration source hugepage
> +		 * was successfully dissolved, because otherwise hwpoisoned
> +		 * hugepage remains on free hugepage list. The allocator ignores
> +		 * such a hwpoisoned page so it's never allocated, but it could
> +		 * kill a process because of no-memory rather than hwpoison.
> +		 * Soft-offline never impacts the userspace, so this is
> +		 * undesired.
> +		 */
> +		ret = dissolve_free_huge_page(page);
> +		if (!ret) {
> +			if (!TestSetPageHWPoison(page))
> +				num_poisoned_pages_inc();
> +		}
>  	}
>  	return ret;
>  }
> @@ -1715,13 +1727,13 @@ static int soft_offline_in_use_page(struct page *page, int flags)
>  
>  static void soft_offline_free_page(struct page *page)
>  {
> +	int rc = 0;
>  	struct page *head = compound_head(page);
>  
> -	if (!TestSetPageHWPoison(head)) {
> +	if (PageHuge(head))
> +		rc = dissolve_free_huge_page(page);
> +	if (!rc && !TestSetPageHWPoison(page))
>  		num_poisoned_pages_inc();
> -		if (PageHuge(head))
> -			dissolve_free_huge_page(page);
> -	}
>  }
>  
>  /**
> diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/migrate.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/migrate.c
> index 198af42..3ae213b 100644
> --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/migrate.c
> +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/migrate.c
> @@ -1318,8 +1318,6 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>  out:
>  	if (rc != -EAGAIN)
>  		putback_active_hugepage(hpage);
> -	if (reason == MR_MEMORY_FAILURE && !test_set_page_hwpoison(hpage))
> -		num_poisoned_pages_inc();
>  
>  	/*
>  	 * If migration was not successful and there's a freeing callback, use
> -- 
> 2.7.0
Mike Kravetz July 17, 2018, 8:10 p.m. UTC | #2
On 07/17/2018 07:27 AM, Michal Hocko wrote:
> On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
>> There's a race condition between soft offline and hugetlb_fault which
>> causes unexpected process killing and/or hugetlb allocation failure.
>>
>> The process killing is caused by the following flow:
>>
>>   CPU 0               CPU 1              CPU 2
>>
>>   soft offline
>>     get_any_page
>>     // find the hugetlb is free
>>                       mmap a hugetlb file
>>                       page fault
>>                         ...
>>                           hugetlb_fault
>>                             hugetlb_no_page
>>                               alloc_huge_page
>>                               // succeed
>>       soft_offline_free_page
>>       // set hwpoison flag
>>                                          mmap the hugetlb file
>>                                          page fault
>>                                            ...
>>                                              hugetlb_fault
>>                                                hugetlb_no_page
>>                                                  find_lock_page
>>                                                    return VM_FAULT_HWPOISON
>>                                            mm_fault_error
>>                                              do_sigbus
>>                                              // kill the process
>>
>>
>> The hugetlb allocation failure comes from the following flow:
>>
>>   CPU 0                          CPU 1
>>
>>                                  mmap a hugetlb file
>>                                  // reserve all free pages but don't fault-in
>>   soft offline
>>     get_any_page
>>     // find the hugetlb is free
>>       soft_offline_free_page
>>       // set hwpoison flag
>>         dissolve_free_huge_page
>>         // fail because all free hugepages are reserved
>>                                  page fault
>>                                    ...
>>                                      hugetlb_fault
>>                                        hugetlb_no_page
>>                                          alloc_huge_page
>>                                            ...
>>                                              dequeue_huge_page_node_exact
>>                                              // ignore hwpoisoned hugepage
>>                                              // and finally fail due to no-mem
>>
>> The root cause of this is that the current soft-offline code is written
>> on the assumption that the PageHWPoison flag should be set first to
>> avoid accessing the corrupted data.  This makes sense for memory_failure()
>> or hard offline, but not for soft offline, because soft offline is about
>> corrected (not uncorrected) errors and is safe from data loss.
>> This patch changes the soft offline semantics so that the PageHWPoison
>> flag is set only after containment of the error page completes successfully.
> 
> Could you please expand on the workflow here? The code is really
> hard to grasp. I must be missing something because the thing shouldn't
> really be complicated. Either the page is in the free pool and you just
> remove it from the allocator (with hugetlb asking for a new hugetlb page
> to guarantee reserves) or it is used and you just migrate the content to
> a new page (again with the hugetlb reserves consideration). Why should
> the ordering of the PageHWPoison flag have any relevance?

My understanding may not be correct, but just looking at the current code
for soft_offline_free_page helps me understand:

static void soft_offline_free_page(struct page *page)
{
	struct page *head = compound_head(page);

	if (!TestSetPageHWPoison(head)) {
		num_poisoned_pages_inc();
		if (PageHuge(head))
			dissolve_free_huge_page(page);
	}
}

The HWPoison flag is set before even checking to determine if the huge
page can be dissolved.  So, someone could attempt to pull the page
off the free list (if free) or fault/map it (if already associated with
a file), which leads to the failures described above.  The patches ensure
that we only set HWPoison after successfully dissolving the page. At least
that is how I understand it.
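
As a concrete illustration of the "pull the page off the free list" side,
the hugetlb allocator already skips hwpoisoned pages when dequeueing; a
simplified sketch (based on the v4.18-era dequeue_huge_page_node_exact()
in mm/hugetlb.c, quoted from memory, so treat the details as assumptions):

static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
{
	struct page *page;

	/* Skip any hwpoisoned hugepage sitting on the free list. */
	list_for_each_entry(page, &h->hugepage_freelists[nid], lru)
		if (!PageHWPoison(page))
			break;
	/* No usable free hugepage was found: the allocation fails. */
	if (&h->hugepage_freelists[nid] == &page->lru)
		return NULL;
	list_move(&page->lru, &h->hugepage_activelist);
	set_page_refcounted(page);
	h->free_huge_pages--;
	h->free_huge_pages_node[nid]--;
	return page;
}

This is what turns a poisoned-but-still-enqueued hugepage into the
no-memory failure in the second flow above.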

It seems that soft_offline_free_page can be called for in-use pages.
Certainly, that is the case in the first workflow above.  With the
suggested changes, I think this is OK for huge pages.  However, it seems
that setting HWPoison on an in-use non-huge page could cause issues?

While looking at the code, I noticed this comment in __get_any_page()
        /*
         * When the target page is a free hugepage, just remove it
         * from free hugepage list.
         */
Did that apply to some code that was removed?  It does not seem to make
any sense in that routine.
Naoya Horiguchi July 18, 2018, 12:55 a.m. UTC | #3
On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote:
> On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> > There's a race condition between soft offline and hugetlb_fault which
> > causes unexpected process killing and/or hugetlb allocation failure.
> > 
> > The process killing is caused by the following flow:
> > 
> >   CPU 0               CPU 1              CPU 2
> > 
> >   soft offline
> >     get_any_page
> >     // find the hugetlb is free
> >                       mmap a hugetlb file
> >                       page fault
> >                         ...
> >                           hugetlb_fault
> >                             hugetlb_no_page
> >                               alloc_huge_page
> >                               // succeed
> >       soft_offline_free_page
> >       // set hwpoison flag
> >                                          mmap the hugetlb file
> >                                          page fault
> >                                            ...
> >                                              hugetlb_fault
> >                                                hugetlb_no_page
> >                                                  find_lock_page
> >                                                    return VM_FAULT_HWPOISON
> >                                            mm_fault_error
> >                                              do_sigbus
> >                                              // kill the process
> > 
> > 
> > The hugetlb allocation failure comes from the following flow:
> > 
> >   CPU 0                          CPU 1
> > 
> >                                  mmap a hugetlb file
> >                                  // reserve all free pages but don't fault-in
> >   soft offline
> >     get_any_page
> >     // find the hugetlb is free
> >       soft_offline_free_page
> >       // set hwpoison flag
> >         dissolve_free_huge_page
> >         // fail because all free hugepages are reserved
> >                                  page fault
> >                                    ...
> >                                      hugetlb_fault
> >                                        hugetlb_no_page
> >                                          alloc_huge_page
> >                                            ...
> >                                              dequeue_huge_page_node_exact
> >                                              // ignore hwpoisoned hugepage
> >                                              // and finally fail due to no-mem
> > 
> > The root cause of this is that the current soft-offline code is written
> > on the assumption that the PageHWPoison flag should be set first to
> > avoid accessing the corrupted data.  This makes sense for memory_failure()
> > or hard offline, but not for soft offline, because soft offline is about
> > corrected (not uncorrected) errors and is safe from data loss.
> > This patch changes the soft offline semantics so that the PageHWPoison
> > flag is set only after containment of the error page completes successfully.
> 
> Could you please expand on the workflow here? The code is really
> hard to grasp. I must be missing something because the thing shouldn't
> really be complicated. Either the page is in the free pool and you just
> remove it from the allocator (with hugetlb asking for a new hugetlb page
> to guarantee reserves) or it is used and you just migrate the content to
> a new page (again with the hugetlb reserves consideration). Why should
> the ordering of the PageHWPoison flag have any relevance?

(Considering soft offlining of a free hugepage,)
PageHWPoison is set first before this patch, which is racy with the
hugetlb fault code because it's not protected by hugetlb_lock.

Originally this was written in a similar manner to hard-offline, where
the race is accepted and the PageHWPoison flag is set as soon as possible.
But that has actually been found to be neither necessary nor correct,
because soft offline is supposed to be less aggressive and failure is OK.

So this patch is suggesting to make soft-offline less aggressive by
moving SetPageHWPoison into the lock.

> 
> Do I get it right that the only difference between the hard and soft
> offlining is that hugetlb reserves might break for the former while not
> for the latter

Correct.

> and that the failed migration kills all owners for the
> former while not for the latter?

Hard-offline doesn't cause any page migration because the data is already
lost, but yes it can kill the owners.
Soft-offline never kills processes even if it fails (due to migration failure
or some other reason).

I listed below some common points and differences between hard-offline
and soft-offline.

  common points
    - they are both contained by the PageHWPoison flag,
    - errors are injected via similar interfaces.

  differences
    - the data on the page is considered lost in hard offline, but is not
      in soft offline,
    - hard offline likely kills the affected processes, but soft offline
      never kills processes,
    - soft offline causes page migration, but hard offline does not,
    - hard offline prioritizes preventing consumption of broken data while
      accepting some race, and soft offline prioritizes not impacting
      userspace while accepting failure.

Looks to me like there are more differences than common points.

Thanks,
Naoya Horiguchi

>  
> > Reported-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
> > Suggested-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > ---
> > changelog v1->v2:
> > - don't use set_hwpoison_free_buddy_page() (not defined yet)
> > - updated comment in soft_offline_huge_page()
> > ---
> >  mm/hugetlb.c        | 11 +++++------
> >  mm/memory-failure.c | 24 ++++++++++++++++++------
> >  mm/migrate.c        |  2 --
> >  3 files changed, 23 insertions(+), 14 deletions(-)
> > 
> > diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/hugetlb.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/hugetlb.c
> > index 430be42..937c142 100644
> > --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/hugetlb.c
> > +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/hugetlb.c
> > @@ -1479,22 +1479,20 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
> >  /*
> >   * Dissolve a given free hugepage into free buddy pages. This function does
> >   * nothing for in-use (including surplus) hugepages. Returns -EBUSY if the
> > - * number of free hugepages would be reduced below the number of reserved
> > - * hugepages.
> > + * dissolution fails because a given page is not a free hugepage, or because
> > + * free hugepages are fully reserved.
> >   */
> >  int dissolve_free_huge_page(struct page *page)
> >  {
> > -	int rc = 0;
> > +	int rc = -EBUSY;
> >  
> >  	spin_lock(&hugetlb_lock);
> >  	if (PageHuge(page) && !page_count(page)) {
> >  		struct page *head = compound_head(page);
> >  		struct hstate *h = page_hstate(head);
> >  		int nid = page_to_nid(head);
> > -		if (h->free_huge_pages - h->resv_huge_pages == 0) {
> > -			rc = -EBUSY;
> > +		if (h->free_huge_pages - h->resv_huge_pages == 0)
> >  			goto out;
> > -		}
> >  		/*
> >  		 * Move PageHWPoison flag from head page to the raw error page,
> >  		 * which makes any subpages rather than the error page reusable.
> > @@ -1508,6 +1506,7 @@ int dissolve_free_huge_page(struct page *page)
> >  		h->free_huge_pages_node[nid]--;
> >  		h->max_huge_pages--;
> >  		update_and_free_page(h, head);
> > +		rc = 0;
> >  	}
> >  out:
> >  	spin_unlock(&hugetlb_lock);
> > diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c
> > index 9d142b9..9b77f85 100644
> > --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c
> > +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c
> > @@ -1598,8 +1598,20 @@ static int soft_offline_huge_page(struct page *page, int flags)
> >  		if (ret > 0)
> >  			ret = -EIO;
> >  	} else {
> > -		if (PageHuge(page))
> > -			dissolve_free_huge_page(page);
> > +		/*
> > +		 * We set PG_hwpoison only when the migration source hugepage
> > +		 * was successfully dissolved, because otherwise hwpoisoned
> > +		 * hugepage remains on free hugepage list. The allocator ignores
> > +		 * such a hwpoisoned page so it's never allocated, but it could
> > +		 * kill a process because of no-memory rather than hwpoison.
> > +		 * Soft-offline never impacts the userspace, so this is
> > +		 * undesired.
> > +		 */
> > +		ret = dissolve_free_huge_page(page);
> > +		if (!ret) {
> > +			if (!TestSetPageHWPoison(page))
> > +				num_poisoned_pages_inc();
> > +		}
> >  	}
> >  	return ret;
> >  }
> > @@ -1715,13 +1727,13 @@ static int soft_offline_in_use_page(struct page *page, int flags)
> >  
> >  static void soft_offline_free_page(struct page *page)
> >  {
> > +	int rc = 0;
> >  	struct page *head = compound_head(page);
> >  
> > -	if (!TestSetPageHWPoison(head)) {
> > +	if (PageHuge(head))
> > +		rc = dissolve_free_huge_page(page);
> > +	if (!rc && !TestSetPageHWPoison(page))
> >  		num_poisoned_pages_inc();
> > -		if (PageHuge(head))
> > -			dissolve_free_huge_page(page);
> > -	}
> >  }
> >  
> >  /**
> > diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/migrate.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/migrate.c
> > index 198af42..3ae213b 100644
> > --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/migrate.c
> > +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/migrate.c
> > @@ -1318,8 +1318,6 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> >  out:
> >  	if (rc != -EAGAIN)
> >  		putback_active_hugepage(hpage);
> > -	if (reason == MR_MEMORY_FAILURE && !test_set_page_hwpoison(hpage))
> > -		num_poisoned_pages_inc();
> >  
> >  	/*
> >  	 * If migration was not successful and there's a freeing callback, use
> > -- 
> > 2.7.0
> 
> -- 
> Michal Hocko
> SUSE Labs
>
Naoya Horiguchi July 18, 2018, 1:28 a.m. UTC | #4
On Tue, Jul 17, 2018 at 01:10:39PM -0700, Mike Kravetz wrote:
> On 07/17/2018 07:27 AM, Michal Hocko wrote:
> > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> >> There's a race condition between soft offline and hugetlb_fault which
> >> causes unexpected process killing and/or hugetlb allocation failure.
> >>
> >> The process killing is caused by the following flow:
> >>
> >>   CPU 0               CPU 1              CPU 2
> >>
> >>   soft offline
> >>     get_any_page
> >>     // find the hugetlb is free
> >>                       mmap a hugetlb file
> >>                       page fault
> >>                         ...
> >>                           hugetlb_fault
> >>                             hugetlb_no_page
> >>                               alloc_huge_page
> >>                               // succeed
> >>       soft_offline_free_page
> >>       // set hwpoison flag
> >>                                          mmap the hugetlb file
> >>                                          page fault
> >>                                            ...
> >>                                              hugetlb_fault
> >>                                                hugetlb_no_page
> >>                                                  find_lock_page
> >>                                                    return VM_FAULT_HWPOISON
> >>                                            mm_fault_error
> >>                                              do_sigbus
> >>                                              // kill the process
> >>
> >>
> >> The hugetlb allocation failure comes from the following flow:
> >>
> >>   CPU 0                          CPU 1
> >>
> >>                                  mmap a hugetlb file
> >>                                  // reserve all free pages but don't fault-in
> >>   soft offline
> >>     get_any_page
> >>     // find the hugetlb is free
> >>       soft_offline_free_page
> >>       // set hwpoison flag
> >>         dissolve_free_huge_page
> >>         // fail because all free hugepages are reserved
> >>                                  page fault
> >>                                    ...
> >>                                      hugetlb_fault
> >>                                        hugetlb_no_page
> >>                                          alloc_huge_page
> >>                                            ...
> >>                                              dequeue_huge_page_node_exact
> >>                                              // ignore hwpoisoned hugepage
> >>                                              // and finally fail due to no-mem
> >>
> >> The root cause of this is that the current soft-offline code is written
> >> on the assumption that the PageHWPoison flag should be set first to
> >> avoid accessing the corrupted data.  This makes sense for memory_failure()
> >> or hard offline, but not for soft offline, because soft offline is about
> >> corrected (not uncorrected) errors and is safe from data loss.
> >> This patch changes the soft offline semantics so that the PageHWPoison
> >> flag is set only after containment of the error page completes successfully.
> > 
> > Could you please expand on the workflow here? The code is really
> > hard to grasp. I must be missing something because the thing shouldn't
> > really be complicated. Either the page is in the free pool and you just
> > remove it from the allocator (with hugetlb asking for a new hugetlb page
> > to guarantee reserves) or it is used and you just migrate the content to
> > a new page (again with the hugetlb reserves consideration). Why should
> > the ordering of the PageHWPoison flag have any relevance?
> 
> My understanding may not be correct, but just looking at the current code
> for soft_offline_free_page helps me understand:
> 
> static void soft_offline_free_page(struct page *page)
> {
> 	struct page *head = compound_head(page);
> 
> 	if (!TestSetPageHWPoison(head)) {
> 		num_poisoned_pages_inc();
> 		if (PageHuge(head))
> 			dissolve_free_huge_page(page);
> 	}
> }
> 
> The HWPoison flag is set before even checking to determine if the huge
> page can be dissolved.  So, someone could attempt to pull the page
> off the free list (if free) or fault/map it (if already associated with
> a file), which leads to the failures described above.  The patches ensure
> that we only set HWPoison after successfully dissolving the page. At least
> that is how I understand it.

Thanks for elaborating, this is correct.

> 
> It seems that soft_offline_free_page can be called for in-use pages.
> Certainly, that is the case in the first workflow above.  With the
> suggested changes, I think this is OK for huge pages.  However, it seems
> that setting HWPoison on an in-use non-huge page could cause issues?

Just after dissolve_free_huge_page() returns, the target page is just a
free buddy page without PageHWPoison set. If this page is allocated
immediately, that's the "migration succeeded, but soft offline failed"
case, so no problem.
Certainly, there is also a race between checking TestSetPageHWPoison and
page allocation, so this issue is handled in patch 2/2.
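
(For readers without patch 2/2 at hand: the idea there is to set the flag
only while the page is confirmed to still be a free buddy page, under
zone->lock. A rough, hypothetical sketch of that shape; see patch 2/2 for
the actual helper:)

bool set_hwpoison_free_buddy_page(struct page *page)
{
	struct zone *zone = page_zone(page);
	unsigned long pfn = page_to_pfn(page);
	unsigned long flags;
	bool hwpoisoned = false;
	int order;

	/* zone->lock closes the race against page allocation. */
	spin_lock_irqsave(&zone->lock, flags);
	for (order = 0; order < MAX_ORDER; order++) {
		struct page *head = page - (pfn & ((1 << order) - 1));

		/* Poison only while the page still sits in a free buddy block. */
		if (PageBuddy(head) && page_order(head) >= order) {
			if (!TestSetPageHWPoison(page))
				hwpoisoned = true;
			break;
		}
	}
	spin_unlock_irqrestore(&zone->lock, flags);

	return hwpoisoned;
}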

> While looking at the code, I noticed this comment in __get_any_page()
>         /*
>          * When the target page is a free hugepage, just remove it
>          * from free hugepage list.
>          */
> Did that apply to some code that was removed?  It does not seem to make
> any sense in that routine.

This comment is completely obsolete; I'll remove it.

Thanks,
Naoya Horiguchi
Naoya Horiguchi July 18, 2018, 1:41 a.m. UTC | #5
On Wed, Jul 18, 2018 at 12:55:29AM +0000, Horiguchi Naoya(堀口 直也) wrote:
> On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote:
> > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> > > There's a race condition between soft offline and hugetlb_fault which
> > > causes unexpected process killing and/or hugetlb allocation failure.
> > > 
> > > The process killing is caused by the following flow:
> > > 
> > >   CPU 0               CPU 1              CPU 2
> > > 
> > >   soft offline
> > >     get_any_page
> > >     // find the hugetlb is free
> > >                       mmap a hugetlb file
> > >                       page fault
> > >                         ...
> > >                           hugetlb_fault
> > >                             hugetlb_no_page
> > >                               alloc_huge_page
> > >                               // succeed
> > >       soft_offline_free_page
> > >       // set hwpoison flag
> > >                                          mmap the hugetlb file
> > >                                          page fault
> > >                                            ...
> > >                                              hugetlb_fault
> > >                                                hugetlb_no_page
> > >                                                  find_lock_page
> > >                                                    return VM_FAULT_HWPOISON
> > >                                            mm_fault_error
> > >                                              do_sigbus
> > >                                              // kill the process
> > > 
> > > 
> > > The hugetlb allocation failure comes from the following flow:
> > > 
> > >   CPU 0                          CPU 1
> > > 
> > >                                  mmap a hugetlb file
> > >                                  // reserve all free pages but don't fault-in
> > >   soft offline
> > >     get_any_page
> > >     // find the hugetlb is free
> > >       soft_offline_free_page
> > >       // set hwpoison flag
> > >         dissolve_free_huge_page
> > >         // fail because all free hugepages are reserved
> > >                                  page fault
> > >                                    ...
> > >                                      hugetlb_fault
> > >                                        hugetlb_no_page
> > >                                          alloc_huge_page
> > >                                            ...
> > >                                              dequeue_huge_page_node_exact
> > >                                              // ignore hwpoisoned hugepage
> > >                                              // and finally fail due to no-mem
> > > 
> > > The root cause of this is that the current soft-offline code is written
> > > on the assumption that the PageHWPoison flag should be set first to
> > > avoid accessing the corrupted data.  This makes sense for memory_failure()
> > > or hard offline, but not for soft offline, because soft offline is about
> > > corrected (not uncorrected) errors and is safe from data loss.
> > > This patch changes the soft offline semantics so that the PageHWPoison
> > > flag is set only after containment of the error page completes successfully.
> > 
> > Could you please expand on the workflow here? The code is really
> > hard to grasp. I must be missing something because the thing shouldn't
> > really be complicated. Either the page is in the free pool and you just
> > remove it from the allocator (with hugetlb asking for a new hugetlb page
> > to guarantee reserves) or it is used and you just migrate the content to
> > a new page (again with the hugetlb reserves consideration). Why should
> > the ordering of the PageHWPoison flag have any relevance?
> 
> (Considering soft offlining of a free hugepage,)
> PageHWPoison is set first before this patch, which is racy with the
> hugetlb fault code because it's not protected by hugetlb_lock.
> 
> Originally this was written in a similar manner to hard-offline, where
> the race is accepted and the PageHWPoison flag is set as soon as possible.
> But that has actually been found to be neither necessary nor correct,
> because soft offline is supposed to be less aggressive and failure is OK.
> 
> So this patch is suggesting to make soft-offline less aggressive


> by moving SetPageHWPoison into the lock.

My apologies, this part of the reasoning was incorrect.  What patch 1/2
actually does is transform the issue into the similar race that exists for
normal pages, which is solved by patch 2/2.  After patch 1/2, soft offline
never sets PageHWPoison on a hugepage.

Thanks,
Naoya Horiguchi

> 
> > 
> > Do I get it right that the only difference between the hard and soft
> > offlining is that hugetlb reserves might break for the former while not
> > for the latter
> 
> Correct.
> 
> > and that the failed migration kills all owners for the
> > former while not for the latter?
> 
> Hard-offline doesn't cause any page migration because the data is already
> lost, but yes it can kill the owners.
> Soft-offline never kills processes even if it fails (due to migration failure
> or some other reason).
> 
> I listed below some common points and differences between hard-offline
> and soft-offline.
> 
>   common points
>     - they are both contained by the PageHWPoison flag,
>     - errors are injected via similar interfaces.
> 
>   differences
>     - the data on the page is considered lost in hard offline, but is not
>       in soft offline,
>     - hard offline likely kills the affected processes, but soft offline
>       never kills processes,
>     - soft offline causes page migration, but hard offline does not,
>     - hard offline prioritizes preventing consumption of broken data while
>       accepting some race, and soft offline prioritizes not impacting
>       userspace while accepting failure.
> 
> Looks to me like there are more differences than common points.
Mike Kravetz July 18, 2018, 2:36 a.m. UTC | #6
On 07/17/2018 06:28 PM, Naoya Horiguchi wrote:
> On Tue, Jul 17, 2018 at 01:10:39PM -0700, Mike Kravetz wrote:
>> It seems that soft_offline_free_page can be called for in-use pages.
>> Certainly, that is the case in the first workflow above.  With the
>> suggested changes, I think this is OK for huge pages.  However, it seems
>> that setting HWPoison on an in-use non-huge page could cause issues?
> 
> Just after dissolve_free_huge_page() returns, the target page is just a
> free buddy page without PageHWPoison set. If this page is allocated
> immediately, that's the "migration succeeded, but soft offline failed"
> case, so no problem.
> Certainly, there is also a race between checking TestSetPageHWPoison and
> page allocation, so this issue is handled in patch 2/2.

Yes, the issue of calling soft_offline_free_page() for an 'in use' page
is handled in the new routine set_hwpoison_free_buddy_page() of patch 2/2.

Thanks,
Michal Hocko July 18, 2018, 8:50 a.m. UTC | #7
On Wed 18-07-18 00:55:29, Naoya Horiguchi wrote:
> On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote:
> > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> > > There's a race condition between soft offline and hugetlb_fault which
> > > causes unexpected process killing and/or hugetlb allocation failure.
> > > 
> > > The process killing is caused by the following flow:
> > > 
> > >   CPU 0               CPU 1              CPU 2
> > > 
> > >   soft offline
> > >     get_any_page
> > >     // find the hugetlb is free
> > >                       mmap a hugetlb file
> > >                       page fault
> > >                         ...
> > >                           hugetlb_fault
> > >                             hugetlb_no_page
> > >                               alloc_huge_page
> > >                               // succeed
> > >       soft_offline_free_page
> > >       // set hwpoison flag
> > >                                          mmap the hugetlb file
> > >                                          page fault
> > >                                            ...
> > >                                              hugetlb_fault
> > >                                                hugetlb_no_page
> > >                                                  find_lock_page
> > >                                                    return VM_FAULT_HWPOISON
> > >                                            mm_fault_error
> > >                                              do_sigbus
> > >                                              // kill the process
> > > 
> > > 
> > > The hugetlb allocation failure comes from the following flow:
> > > 
> > >   CPU 0                          CPU 1
> > > 
> > >                                  mmap a hugetlb file
> > >                                  // reserve all free pages but don't fault-in
> > >   soft offline
> > >     get_any_page
> > >     // find the hugetlb is free
> > >       soft_offline_free_page
> > >       // set hwpoison flag
> > >         dissolve_free_huge_page
> > >         // fail because all free hugepages are reserved
> > >                                  page fault
> > >                                    ...
> > >                                      hugetlb_fault
> > >                                        hugetlb_no_page
> > >                                          alloc_huge_page
> > >                                            ...
> > >                                              dequeue_huge_page_node_exact
> > >                                              // ignore hwpoisoned hugepage
> > >                                              // and finally fail due to no-mem
> > > 
> > > The root cause of this is that the current soft-offline code is written
> > > on the assumption that the PageHWPoison flag should be set first to
> > > avoid accessing the corrupted data.  This makes sense for memory_failure()
> > > or hard offline, but not for soft offline, because soft offline is about
> > > corrected (not uncorrected) errors and is safe from data loss.
> > > This patch changes the soft offline semantics so that the PageHWPoison
> > > flag is set only after containment of the error page completes successfully.
> > 
> > Could you please expand on the workflow here? The code is really
> > hard to grasp. I must be missing something because the thing shouldn't
> > really be complicated. Either the page is in the free pool and you just
> > remove it from the allocator (with hugetlb asking for a new hugetlb page
> > to guarantee reserves) or it is used and you just migrate the content to
> > a new page (again with the hugetlb reserves consideration). Why should
> > the ordering of the PageHWPoison flag have any relevance?
> 
> (Considering soft offlining of a free hugepage,)
> PageHWPoison is set first before this patch, which is racy with the
> hugetlb fault code because it's not protected by hugetlb_lock.
> 
> Originally this was written in a similar manner to hard-offline, where
> the race is accepted and the PageHWPoison flag is set as soon as possible.
> But that has actually been found to be neither necessary nor correct,
> because soft offline is supposed to be less aggressive and failure is OK.

OK

> So this patch is suggesting to make soft-offline less aggressive by
> moving SetPageHWPoison into the lock.

I guess I still do not understand why we should even care about the
ordering of the HWPoison flag setting. Why can't we simply have the
following code flow? Or maybe we are doing that already and I just do
not follow the code:

	soft_offline
	  check page_count
	    - free - normal page - remove from the allocator
	           - hugetlb - allocate a new hugetlb page && remove from the pool
	    - used - migrate to a new page && never release the old one

Why do we even need the HWPoison flag here? Everything can be completely
transparent to the application. From what I understood, it shouldn't
fail.

> > Do I get it right that the only difference between the hard and soft
> > offlining is that hugetlb reserves might break for the former while not
> > for the latter
> 
> Correct.
> 
> > and that the failed migration kills all owners for the
> > former while not for the latter?
> 
> Hard-offline doesn't cause any page migration because the data is already
> lost, but yes it can kill the owners.
> Soft-offline never kills processes even if it fails (due to migration failure
> or some other reason).
> 
> I listed below some common points and differences between hard-offline
> and soft-offline.
> 
>   common points
>     - they are both contained by the PageHWPoison flag,
>     - errors are injected via similar interfaces.
> 
>   differences
>     - the data on the page is considered lost in hard offline, but is not
>       in soft offline,
>     - hard offline likely kills the affected processes, but soft offline
>       never kills processes,
>     - soft offline causes page migration, but hard offline does not,
>     - hard offline prioritizes preventing consumption of broken data while
>       accepting some race, and soft offline prioritizes not impacting
>       userspace while accepting failure.
> 
> Looks to me like there are more differences than common points.

Thanks for the summary. It certainly helped me
Naoya Horiguchi July 19, 2018, 6:19 a.m. UTC | #8
On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote:
> On Wed 18-07-18 00:55:29, Naoya Horiguchi wrote:
> > On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote:
> > > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> > > > There's a race condition between soft offline and hugetlb_fault which
> > > > causes unexpected process killing and/or hugetlb allocation failure.
> > > >
> > > > The process killing is caused by the following flow:
> > > >
> > > >   CPU 0               CPU 1              CPU 2
> > > >
> > > >   soft offline
> > > >     get_any_page
> > > >     // find the hugetlb is free
> > > >                       mmap a hugetlb file
> > > >                       page fault
> > > >                         ...
> > > >                           hugetlb_fault
> > > >                             hugetlb_no_page
> > > >                               alloc_huge_page
> > > >                               // succeed
> > > >       soft_offline_free_page
> > > >       // set hwpoison flag
> > > >                                          mmap the hugetlb file
> > > >                                          page fault
> > > >                                            ...
> > > >                                              hugetlb_fault
> > > >                                                hugetlb_no_page
> > > >                                                  find_lock_page
> > > >                                                    return VM_FAULT_HWPOISON
> > > >                                            mm_fault_error
> > > >                                              do_sigbus
> > > >                                              // kill the process
> > > >
> > > >
> > > > The hugetlb allocation failure comes from the following flow:
> > > >
> > > >   CPU 0                          CPU 1
> > > >
> > > >                                  mmap a hugetlb file
> > > >                                  // reserve all free pages but don't fault-in
> > > >   soft offline
> > > >     get_any_page
> > > >     // find the hugetlb is free
> > > >       soft_offline_free_page
> > > >       // set hwpoison flag
> > > >         dissolve_free_huge_page
> > > >         // fail because all free hugepages are reserved
> > > >                                  page fault
> > > >                                    ...
> > > >                                      hugetlb_fault
> > > >                                        hugetlb_no_page
> > > >                                          alloc_huge_page
> > > >                                            ...
> > > >                                              dequeue_huge_page_node_exact
> > > >                                              // ignore hwpoisoned hugepage
> > > >                                              // and finally fail due to no-mem
> > > >
> > > > The root cause of this is that the current soft-offline code is written
> > > > on the assumption that the PageHWPoison flag should be set first to
> > > > avoid accessing the corrupted data.  This makes sense for memory_failure()
> > > > or hard offline, but not for soft offline, because soft offline is about
> > > > corrected (not uncorrected) errors and is safe from data loss.
> > > > This patch changes the soft offline semantics so that the PageHWPoison
> > > > flag is set only after containment of the error page completes successfully.
> > >
> > > Could you please expand on the workflow here? The code is really
> > > hard to grasp. I must be missing something because the thing shouldn't
> > > really be complicated. Either the page is in the free pool and you just
> > > remove it from the allocator (with hugetlb asking for a new hugetlb page
> > > to guarantee reserves) or it is used and you just migrate the content to
> > > a new page (again with the hugetlb reserves consideration). Why should
> > > the ordering of the PageHWPoison flag have any relevance?
> >
> > (Considering soft offlining of a free hugepage,)
> > PageHWPoison is set first before this patch, which is racy with the
> > hugetlb fault code because it's not protected by hugetlb_lock.
> >
> > Originally this was written in a similar manner to hard-offline, where
> > the race is accepted and the PageHWPoison flag is set as soon as possible.
> > But that has actually been found to be neither necessary nor correct,
> > because soft offline is supposed to be less aggressive and failure is OK.
>
> OK
>
> > So this patch is suggesting to make soft-offline less aggressive by
> > moving SetPageHWPoison into the lock.
>
> I guess I still do not understand why we should even care about the
> ordering of the HWPoison flag setting. Why can't we simply have the
> following code flow? Or maybe we are doing that already and I just do
> not follow the code:
>
> 	soft_offline
> 	  check page_count
> 	    - free - normal page - remove from the allocator
> 	           - hugetlb - allocate a new hugetlb page && remove from the pool
> 	    - used - migrate to a new page && never release the old one
>
> Why do we even need the HWPoison flag here? Everything can be completely
> transparent to the application. From what I understood, it shouldn't
> fail.

The PageHWPoison flag is used for the 'remove from the allocator' part,
which looks like below:

  static inline
  struct page *rmqueue(
          ...
          do {
                  page = NULL;
                  if (alloc_flags & ALLOC_HARDER) {
                          page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
                          if (page)
                                  trace_mm_page_alloc_zone_locked(page, order, migratetype);
                  }
                  if (!page)
                          page = __rmqueue(zone, order, migratetype);
          } while (page && check_new_pages(page, order));

check_new_pages() returns true if the block taken from the free list
contains a hwpoisoned page, so the allocator iterates another round to
get another page.

There's no function that can be called from outside the allocator to
remove a page from the allocator.  So the actual page removal is done at
allocation time, not at error handling time. That's the reason why we
need PageHWPoison.
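
For completeness, the check that makes rmqueue() loop is roughly the
following (simplified from the v4.18-era mm/page_alloc.c; quoted from
memory, so treat the details as assumptions):

  static int check_new_page(struct page *page)
  {
          /*
           * __PG_HWPOISON makes a just-dequeued page "bad", so
           * rmqueue() discards it and retries with another page.
           */
          if (likely(page_expected_state(page,
                          PAGE_FLAGS_CHECK_AT_PREP | __PG_HWPOISON)))
                  return 0;

          check_new_page_bad(page);
          return 1;
  }

  static bool check_new_pages(struct page *page, unsigned int order)
  {
          int i;

          for (i = 0; i < (1 << order); i++) {
                  struct page *p = page + i;

                  if (unlikely(check_new_page(p)))
                          return true;
          }
          return false;
  }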

Thanks,
Naoya Horiguchi


> > > Do I get it right that the only difference between the hard and soft
> > > offlining is that hugetlb reserves might break for the former while not
> > > for the latter
> >
> > Correct.
> >
> > > and that the failed migration kills all owners for the
> > > former while not for the latter?
> >
> > Hard-offline doesn't cause any page migration because the data is already
> > lost, but yes it can kill the owners.
> > Soft-offline never kills processes even if it fails (due to migration failure
> > or some other reason).
> >
> > I listed below some common points and differences between hard-offline
> > and soft-offline.
> >
> >   common points
> >     - they are both contained by the PageHWPoison flag,
> >     - errors are injected via similar interfaces.
> >
> >   differences
> >     - the data on the page is considered lost in hard offline, but is not
> >       in soft offline,
> >     - hard offline likely kills the affected processes, but soft offline
> >       never kills processes,
> >     - soft offline causes page migration, but hard offline does not,
> >     - hard offline prioritizes preventing consumption of broken data while
> >       accepting some race, and soft offline prioritizes not impacting
> >       userspace while accepting failure.
> >
> > Looks to me like there are more differences than common points.
>
> Thanks for the summary. It certainly helped me
> --
> Michal Hocko
> SUSE Labs
>
Michal Hocko July 19, 2018, 7:15 a.m. UTC | #9
On Thu 19-07-18 06:19:45, Naoya Horiguchi wrote:
> On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote:
> > On Wed 18-07-18 00:55:29, Naoya Horiguchi wrote:
> > > On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote:
> > > > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> > > > > There's a race condition between soft offline and hugetlb_fault which
> > > > > causes unexpected process killing and/or hugetlb allocation failure.
> > > > >
> > > > > The process killing is caused by the following flow:
> > > > >
> > > > >   CPU 0               CPU 1              CPU 2
> > > > >
> > > > >   soft offline
> > > > >     get_any_page
> > > > >     // find the hugetlb is free
> > > > >                       mmap a hugetlb file
> > > > >                       page fault
> > > > >                         ...
> > > > >                           hugetlb_fault
> > > > >                             hugetlb_no_page
> > > > >                               alloc_huge_page
> > > > >                               // succeed
> > > > >       soft_offline_free_page
> > > > >       // set hwpoison flag
> > > > >                                          mmap the hugetlb file
> > > > >                                          page fault
> > > > >                                            ...
> > > > >                                              hugetlb_fault
> > > > >                                                hugetlb_no_page
> > > > >                                                  find_lock_page
> > > > >                                                    return VM_FAULT_HWPOISON
> > > > >                                            mm_fault_error
> > > > >                                              do_sigbus
> > > > >                                              // kill the process
> > > > >
> > > > >
> > > > > The hugetlb allocation failure comes from the following flow:
> > > > >
> > > > >   CPU 0                          CPU 1
> > > > >
> > > > >                                  mmap a hugetlb file
> > > > >                                  // reserve all free pages but don't fault-in
> > > > >   soft offline
> > > > >     get_any_page
> > > > >     // find the hugetlb is free
> > > > >       soft_offline_free_page
> > > > >       // set hwpoison flag
> > > > >         dissolve_free_huge_page
> > > > >         // fail because all free hugepages are reserved
> > > > >                                  page fault
> > > > >                                    ...
> > > > >                                      hugetlb_fault
> > > > >                                        hugetlb_no_page
> > > > >                                          alloc_huge_page
> > > > >                                            ...
> > > > >                                              dequeue_huge_page_node_exact
> > > > >                                              // ignore hwpoisoned hugepage
> > > > >                                              // and finally fail due to no-mem
> > > > >
> > > > > The root cause of this is that the current soft-offline code is written
> > > > > on the assumption that the PageHWPoison flag should be set first to
> > > > > avoid accessing the corrupted data.  This makes sense for memory_failure()
> > > > > or hard offline, but not for soft offline, because soft offline is about
> > > > > corrected (not uncorrected) errors and is safe from data loss.
> > > > > This patch changes the soft offline semantics so that the PageHWPoison
> > > > > flag is set only after containment of the error page completes successfully.
> > > >
> > > > Could you please expand on the workflow here? The code is really
> > > > hard to grasp. I must be missing something because the thing shouldn't
> > > > really be complicated. Either the page is in the free pool and you just
> > > > remove it from the allocator (with hugetlb asking for a new hugetlb page
> > > > to guarantee reserves) or it is used and you just migrate the content to
> > > > a new page (again with the hugetlb reserves consideration). Why should
> > > > the ordering of the PageHWPoison flag have any relevance?
> > >
> > > (Considering soft offlining of a free hugepage,)
> > > PageHWPoison is set first before this patch, which is racy with the
> > > hugetlb fault code because it's not protected by hugetlb_lock.
> > >
> > > Originally this was written in a similar manner to hard-offline, where
> > > the race is accepted and the PageHWPoison flag is set as soon as possible.
> > > But that has actually been found to be neither necessary nor correct,
> > > because soft offline is supposed to be less aggressive and failure is OK.
> >
> > OK
> >
> > > So this patch suggests making soft-offline less aggressive by moving
> > > SetPageHWPoison inside the lock.
> >
> > > I guess I still do not understand why we should even care about the
> > > ordering of the HWPoison flag setting. Why cannot we simply have the
> > > following code flow? Or maybe we are doing that already and I just do
> > > not follow the code:
> >
> > 	soft_offline
> > 	  check page_count
> > 	    - free - normal page - remove from the allocator
> > 	           - hugetlb - allocate a new hugetlb page && remove from the pool
> > 	    - used - migrate to a new page && never release the old one
> >
> > > Why do we even need the HWPoison flag here? Everything can be completely
> > > transparent to the application. It shouldn't fail from what I
> > > understood.
> 
> The PageHWPoison flag is used for the 'remove from the allocator' part,
> which looks like below:
> 
>   static inline
>   struct page *rmqueue(
>           ...
>           do {
>                   page = NULL;
>                   if (alloc_flags & ALLOC_HARDER) {
>                           page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
>                           if (page)
>                                   trace_mm_page_alloc_zone_locked(page, order, migratetype);
>                   }
>                   if (!page)
>                           page = __rmqueue(zone, order, migratetype);
>           } while (page && check_new_pages(page, order));
> 
> check_new_pages() returns true if the pages taken from the free list
> include a hwpoisoned page, so that the allocator iterates another round
> to get another page.
> 
> There's no function that can be called from outside the allocator to
> remove a page from it.  So the actual page removal is done at allocation
> time, not at error-handling time. That's the reason why we need PageHWPoison.

hwpoison is an internal mm functionality so why cannot we simply add a
function that would do that? I find the PageHWPoison usage here adding
more complication than real good. Or am I missing something?
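
Something like the following would keep the poison check out of the
allocation hot path. It is only an untested sketch: the name is made up,
it relies on allocator-internal helpers (page_order(), rmv_page_order()),
and it only handles a page that itself heads a free buddy block:

  /*
   * Untested sketch, not in the tree: detach a free page from the buddy
   * freelists at error-handling time, so the allocation fast path needs
   * no PageHWPoison check at all.  Would have to live in mm/page_alloc.c
   * because it pokes at allocator internals.
   */
  static bool take_page_off_buddy(struct page *page)
  {
  	struct zone *zone = page_zone(page);
  	unsigned long flags;
  	bool taken = false;

  	spin_lock_irqsave(&zone->lock, flags);
  	if (PageBuddy(page)) {
  		unsigned int order = page_order(page);

  		list_del(&page->lru);
  		zone->free_area[order].nr_free--;
  		rmv_page_order(page);
  		/*
  		 * For order > 0, the remaining subpages would have to be
  		 * put back on the lower-order freelists here (the same
  		 * splitting that expand() does at allocation time).
  		 */
  		taken = true;
  	}
  	spin_unlock_irqrestore(&zone->lock, flags);

  	return taken;
  }

The caller would still have to handle losing the race with an allocation,
but for soft offline that failure is acceptable.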
Naoya Horiguchi July 19, 2018, 8:08 a.m. UTC | #10
On Thu, Jul 19, 2018 at 09:15:16AM +0200, Michal Hocko wrote:
> On Thu 19-07-18 06:19:45, Naoya Horiguchi wrote:
> > On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote:
> > > On Wed 18-07-18 00:55:29, Naoya Horiguchi wrote:
> > > > On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote:
> > > > > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
[...]
> > There's no function that can be called from outside the allocator to
> > remove a page from it.  So the actual page removal is done at allocation
> > time, not at error-handling time. That's the reason why we need PageHWPoison.
> 
> hwpoison is an internal mm functionality so why cannot we simply add a
> function that would do that?

That's one possible solution.

I know about another downside in the current implementation.
If a hwpoisoned page is found during a high-order page allocation,
all 2^order pages (not only the hwpoisoned one) are removed from
the buddy allocator by the above quoted code, and these leaked pages
are never returned to the freelist even with unpoison_memory().
If we had a page removal function which properly splits high-order
free pages into lower-order pages, this problem would be avoided.
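
A minimal sketch of such a split (untested; the function name, the
MIGRATE_MOVABLE freelist choice and the direct use of allocator
internals like set_page_order() are illustrative only):

  /*
   * Untested sketch: break a free block of 2^order pages apart so that
   * only 'target' (the hwpoisoned page) is withheld and every other
   * subpage goes back to the freelists.  'page' is the first page of
   * the free block.  Caller holds zone->lock.
   */
  static void break_down_buddy_pages(struct zone *zone, struct page *page,
  				   struct page *target, unsigned int order)
  {
  	while (order > 0) {
  		unsigned long size;
  		struct page *half;

  		order--;
  		size = 1UL << order;

  		if (target < page + size) {
  			/* target is in the lower half: free the upper one */
  			half = page + size;
  		} else {
  			/* target is in the upper half: free the lower one */
  			half = page;
  			page += size;
  		}
  		set_page_order(half, order);
  		list_add(&half->lru,
  			 &zone->free_area[order].free_list[MIGRATE_MOVABLE]);
  		zone->free_area[order].nr_free++;
  	}
  	/* the block kept out of the freelists converges on 'target' */
  }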

OTOH PageHWPoison still has a role in reporting errors to userspace;
without it unpoison_memory() doesn't work.

Thanks,
Naoya Horiguchi

Michal Hocko July 19, 2018, 8:27 a.m. UTC | #11
On Thu 19-07-18 08:08:05, Naoya Horiguchi wrote:
> On Thu, Jul 19, 2018 at 09:15:16AM +0200, Michal Hocko wrote:
> > On Thu 19-07-18 06:19:45, Naoya Horiguchi wrote:
> > > On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote:
[...]
> > > There's no function that can be called from outside the allocator to
> > > remove a page from it.  So the actual page removal is done at allocation
> > > time, not at error-handling time. That's the reason why we need PageHWPoison.
> > 
> > hwpoison is an internal mm functionality so why cannot we simply add a
> > function that would do that?
> 
> That's one possible solution.

I would prefer that much more than adding an overhead (albeit small) into
the page allocator directly. HWPoison should be a really rare event, so
why should everybody pay the price? I would much rather see the poison
path pay the additional price.

> I know about another downside in the current implementation.
> If a hwpoisoned page is found during a high-order page allocation,
> all 2^order pages (not only the hwpoisoned one) are removed from
> the buddy allocator by the above quoted code, and these leaked pages
> are never returned to the freelist even with unpoison_memory().
> If we had a page removal function which properly splits high-order
> free pages into lower-order pages, this problem would be avoided.

Even more reason to move to a new scheme.

> OTOH PageHWPoison still has a role in reporting errors to userspace;
> without it unpoison_memory() doesn't work.

Sure, but we do not really need a special page flag for that. We know the
page is not reachable other than via pfn walkers. If you make the page
reserved and note the fact that it has been poisoned in the past, then
you can emulate the missing functionality.

Btw. do we really need unpoisoning functionality? Who is really using
it, other than some tests? How does the memory become OK again? Don't we
really need to go through physical hotremove & hotadd to clean the
poison status?

Thanks!
Naoya Horiguchi July 19, 2018, 9:22 a.m. UTC | #12
On Thu, Jul 19, 2018 at 10:27:43AM +0200, Michal Hocko wrote:
> On Thu 19-07-18 08:08:05, Naoya Horiguchi wrote:
> > On Thu, Jul 19, 2018 at 09:15:16AM +0200, Michal Hocko wrote:
> > > On Thu 19-07-18 06:19:45, Naoya Horiguchi wrote:
> > > > On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote:
> [...]
> > > 
> > > hwpoison is an internal mm functionality so why cannot we simply add a
> > > function that would do that?
> > 
> > That's one possible solution.
> 
> I would prefer that much more than adding an overhead (albeit small) into
> the page allocator directly. HWPoison should be a really rare event, so
> why should everybody pay the price? I would much rather see the poison
> path pay the additional price.

Yes, that's more maintainable.

> [...]
> 
> Btw. do we really need unpoisoning functionality? Who is really using
> it, other than some tests?

None, as long as I know.

> How does the memory become OK again?

For hard-offlined in-use pages, which are assumed to be pinned, we clear
the PageHWPoison flag and unpin the page to return it to the buddy
allocator. For other cases, we simply clear the PageHWPoison flag.
Unless the page has already been filtered out by check_new_pages()
before the unpoison, the page is reusable.

Sometimes error handling fails and the error page might end up in an
unexpected state (like an extra refcount or mapcount).
Unpoison simply fails on such pages.
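
Roughly, the unpoison path boils down to the following (heavily
simplified from unpoison_memory() in mm/memory-failure.c; locking, the
compound-page details and the pinned-by-hard-offline release are elided):

  /* Simplified sketch of unpoison_memory(), for illustration only. */
  static int unpoison_memory_sketch(unsigned long pfn)
  {
  	struct page *page = compound_head(pfn_to_page(pfn));

  	if (!PageHWPoison(page))
  		return 0;		/* nothing to undo */

  	/* unexpected extra references or mappings: just bail out */
  	if (page_count(page) > 1 || page_mapped(page))
  		return 0;

  	if (TestClearPageHWPoison(page))
  		num_poisoned_pages_dec();
  	/*
  	 * Unless check_new_pages() already filtered the page out of the
  	 * freelists, it is now reusable again.
  	 */
  	return 0;
  }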

> Don't we
> really need to go through physical hotremove & hotadd to clean the
> poison status?

hotremove/hotadd can be a user of unpoison, but I think simply
reinitializing the struct pages is easier.

Thanks,
Naoya Horiguchi
Michal Hocko July 19, 2018, 10:32 a.m. UTC | #13
On Thu 19-07-18 09:22:47, Naoya Horiguchi wrote:
> On Thu, Jul 19, 2018 at 10:27:43AM +0200, Michal Hocko wrote:
> > On Thu 19-07-18 08:08:05, Naoya Horiguchi wrote:
> > > On Thu, Jul 19, 2018 at 09:15:16AM +0200, Michal Hocko wrote:
> > > > On Thu 19-07-18 06:19:45, Naoya Horiguchi wrote:
> > > > > On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote:
> > [...]
> > 
> > Btw. do we really need unpoisoning functionality? Who is really using
> > it, other than some tests?
> 
> None, as long as I know.

Then why do we keep it?

> > How does the memory become OK again?
> 
> For hard-offlined in-use pages, which are assumed to be pinned, we clear
> the PageHWPoison flag and unpin the page to return it to the buddy
> allocator. For other cases, we simply clear the PageHWPoison flag.
> Unless the page has already been filtered out by check_new_pages()
> before the unpoison, the page is reusable.

No, I mean how does the memory become OK again. Is there any in-place
repair technique?

> Sometimes error handling fails and the error page might end up in an
> unexpected state (like an extra refcount or mapcount).
> Unpoison simply fails on such pages.
> 
> > Don't we
> > really need to go through physical hotremove & hotadd to clean the
> > poison status?
> 
> hotremove/hotadd can be a user of unpoison, but I think simply
> reinitializing the struct pages is easier.

Sure, that is what I meant actually. I do not really see any way to make
the memory OK again other than to replace the faulty DIMM and put it
back online. The state is not really preserved across that, so having a
sticky struct page state doesn't make much sense.

So from what I understood, we only need to track the poison state to
allow hotremoving that memory without stumbling over an unexpected
page or trying to do something with it. All other actions can be done
synchronously (migrate vs. kill users).
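
With the patch below applied, the containment-first ordering for a free
hugepage reads as follows (reconstructed from the soft_offline_free_page()
hunk, with comments added):

  static void soft_offline_free_page(struct page *page)
  {
  	int rc = 0;
  	struct page *head = compound_head(page);

  	/* containment first: dissolve a free hugepage into free buddy
  	 * pages; this can fail, e.g. when all free hugepages are reserved */
  	if (PageHuge(head))
  		rc = dissolve_free_huge_page(page);

  	/* only after successful containment mark the page hwpoisoned */
  	if (!rc && !TestSetPageHWPoison(page))
  		num_poisoned_pages_inc();
  }

soft_offline_huge_page() follows the same pattern for the migration path.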

Patch

diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/hugetlb.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/hugetlb.c
index 430be42..937c142 100644
--- v4.18-rc4-mmotm-2018-07-10-16-50/mm/hugetlb.c
+++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/hugetlb.c
@@ -1479,22 +1479,20 @@  static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 /*
  * Dissolve a given free hugepage into free buddy pages. This function does
  * nothing for in-use (including surplus) hugepages. Returns -EBUSY if the
- * number of free hugepages would be reduced below the number of reserved
- * hugepages.
+ * dissolution fails because a given page is not a free hugepage, or because
+ * free hugepages are fully reserved.
  */
 int dissolve_free_huge_page(struct page *page)
 {
-	int rc = 0;
+	int rc = -EBUSY;
 
 	spin_lock(&hugetlb_lock);
 	if (PageHuge(page) && !page_count(page)) {
 		struct page *head = compound_head(page);
 		struct hstate *h = page_hstate(head);
 		int nid = page_to_nid(head);
-		if (h->free_huge_pages - h->resv_huge_pages == 0) {
-			rc = -EBUSY;
+		if (h->free_huge_pages - h->resv_huge_pages == 0)
 			goto out;
-		}
 		/*
 		 * Move PageHWPoison flag from head page to the raw error page,
 		 * which makes any subpages rather than the error page reusable.
@@ -1508,6 +1506,7 @@  int dissolve_free_huge_page(struct page *page)
 		h->free_huge_pages_node[nid]--;
 		h->max_huge_pages--;
 		update_and_free_page(h, head);
+		rc = 0;
 	}
 out:
 	spin_unlock(&hugetlb_lock);
diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c
index 9d142b9..9b77f85 100644
--- v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c
+++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c
@@ -1598,8 +1598,20 @@  static int soft_offline_huge_page(struct page *page, int flags)
 		if (ret > 0)
 			ret = -EIO;
 	} else {
-		if (PageHuge(page))
-			dissolve_free_huge_page(page);
+		/*
+		 * We set PG_hwpoison only when the migration source hugepage
+		 * We set PG_hwpoison only when the migration source hugepage
+		 * was successfully dissolved, because otherwise the hwpoisoned
+		 * hugepage would remain on the free hugepage list. The
+		 * allocator ignores such a hwpoisoned page so it's never
+		 * allocated, but a process could then be killed due to
+		 * no-memory rather than hwpoison. Soft-offline is never
+		 * supposed to impact userspace, so this is undesired.
+		ret = dissolve_free_huge_page(page);
+		if (!ret) {
+			if (!TestSetPageHWPoison(page))
+				num_poisoned_pages_inc();
+		}
 	}
 	return ret;
 }
@@ -1715,13 +1727,13 @@  static int soft_offline_in_use_page(struct page *page, int flags)
 
 static void soft_offline_free_page(struct page *page)
 {
+	int rc = 0;
 	struct page *head = compound_head(page);
 
-	if (!TestSetPageHWPoison(head)) {
+	if (PageHuge(head))
+		rc = dissolve_free_huge_page(page);
+	if (!rc && !TestSetPageHWPoison(page))
 		num_poisoned_pages_inc();
-		if (PageHuge(head))
-			dissolve_free_huge_page(page);
-	}
 }
 
 /**
diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/migrate.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/migrate.c
index 198af42..3ae213b 100644
--- v4.18-rc4-mmotm-2018-07-10-16-50/mm/migrate.c
+++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/migrate.c
@@ -1318,8 +1318,6 @@  static int unmap_and_move_huge_page(new_page_t get_new_page,
 out:
 	if (rc != -EAGAIN)
 		putback_active_hugepage(hpage);
-	if (reason == MR_MEMORY_FAILURE && !test_set_page_hwpoison(hpage))
-		num_poisoned_pages_inc();
 
 	/*
 	 * If migration was not successful and there's a freeing callback, use