[RFC,v2,11/16] mm,hwpoison: Rework soft offline for in-use pages
diff mbox series

Message ID 20191017142123.24245-12-osalvador@suse.de
State New
Headers show
Series
  • Hwpoison rework {hard,soft}-offline
Related show

Commit Message

Oscar Salvador Oct. 17, 2019, 2:21 p.m. UTC
This patch changes the way we set and handle in-use poisoned pages.
Until now, poisoned pages were released to the buddy allocator, trusting
that the checks that take place prior to hand the page would act as a
safe net and would skip that page.

This has proved to be wrong, as we got some pfn walkers out there, like
compaction, that all they care is the page to be PageBuddy and be in a
freelist.
Although this might not be the only user, having poisoned pages
in the buddy allocator seems a bad idea as we should only have
free pages that are ready and meant to be used as such.

Before explainaing the taken approach, let us break down the kind
of pages we can soft offline.

- Anonymous THP (after the split, they end up being 4K pages)
- Hugetlb
- Order-0 pages (that can be either migrated or invalited)

* Normal pages (order-0 and anon-THP)

  - If they are clean and unmapped page cache pages, we invalidate
    then by means of invalidate_inode_page().
  - If they are mapped/dirty, we do the isolate-and-migrate dance.

  Either way, do not call put_page directly from those paths.
  Instead, we keep the page and send it to page_set_poison to perform the
  right handling.

  page_set_poison sets the HWPoison flag and does the last put_page.
  This call to put_page is mainly to be able to call __page_cache_release,
  since this function is not exported.

  Down the chain, we placed a check for HWPoison page in free_pages_prepare,
  that just skips any poisoned page, so those pages do not end up in any
  pcplist/freelist.

  After that, we set the refcount on the page to 1 and we increment
  the poisoned pages counter.

  We could do as we do for free pages:
  1) wait until the page hits buddy's freelists
  2) take it off
  3) flag it

  The problem is that we could race with an allocation, so by the time we
  want to take the page off the buddy, the page is already allocated, so we
  cannot soft-offline it.
  This is not fatal of course, but if it is better if we can close the race
  as does not require a lot of code.

* Hugetlb pages

  - We isolate-and-migrate them

  After the migration has been succesful, we call dissolve_free_huge_page,
  and we set HWPoison on the page if we succeed.
  Hugetlb has a slightly different handling though.

  While for non-hugetlb pages we cared about closing the race with an allocation,
  doing so for hugetlb pages requires quite some additional code (we would need to
  hook in free_huge_page and some other places).
  So I decided to not make the code overly complicated and just fail normally
  if the page we allocated in the meantime.

Because of the way we handle now in-use pages, we no longer need the
put-as-isolation-migratetype dance, that was guarding for poisoned pages
to end up in pcplists.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/page-flags.h |  5 -----
 mm/memory-failure.c        | 43 ++++++++++++++-----------------------------
 mm/migrate.c               | 11 +++--------
 mm/page_alloc.c            | 31 +++----------------------------
 4 files changed, 20 insertions(+), 70 deletions(-)

Comments

Michal Hocko Oct. 18, 2019, 12:39 p.m. UTC | #1
On Thu 17-10-19 16:21:18, Oscar Salvador wrote:
> This patch changes the way we set and handle in-use poisoned pages.
> Until now, poisoned pages were released to the buddy allocator, trusting
> that the checks that take place prior to hand the page would act as a
> safe net and would skip that page.
> 
> This has proved to be wrong, as we got some pfn walkers out there, like
> compaction, that all they care is the page to be PageBuddy and be in a
> freelist.
> Although this might not be the only user, having poisoned pages
> in the buddy allocator seems a bad idea as we should only have
> free pages that are ready and meant to be used as such.

Agreed on this part.

> Before explainaing the taken approach, let us break down the kind
> of pages we can soft offline.
> 
> - Anonymous THP (after the split, they end up being 4K pages)
> - Hugetlb
> - Order-0 pages (that can be either migrated or invalited)
> 
> * Normal pages (order-0 and anon-THP)
> 
>   - If they are clean and unmapped page cache pages, we invalidate
>     then by means of invalidate_inode_page().
>   - If they are mapped/dirty, we do the isolate-and-migrate dance.
> 
>   Either way, do not call put_page directly from those paths.
>   Instead, we keep the page and send it to page_set_poison to perform the
>   right handling.
> 
>   page_set_poison sets the HWPoison flag and does the last put_page.
>   This call to put_page is mainly to be able to call __page_cache_release,
>   since this function is not exported.
> 
>   Down the chain, we placed a check for HWPoison page in free_pages_prepare,
>   that just skips any poisoned page, so those pages do not end up in any
>   pcplist/freelist.
> 
>   After that, we set the refcount on the page to 1 and we increment
>   the poisoned pages counter.
> 
>   We could do as we do for free pages:
>   1) wait until the page hits buddy's freelists
>   2) take it off
>   3) flag it
> 
>   The problem is that we could race with an allocation, so by the time we
>   want to take the page off the buddy, the page is already allocated, so we
>   cannot soft-offline it.
>   This is not fatal of course, but if it is better if we can close the race
>   as does not require a lot of code.
> 
> * Hugetlb pages
> 
>   - We isolate-and-migrate them
> 
>   After the migration has been succesful, we call dissolve_free_huge_page,
>   and we set HWPoison on the page if we succeed.
>   Hugetlb has a slightly different handling though.
> 
>   While for non-hugetlb pages we cared about closing the race with an allocation,
>   doing so for hugetlb pages requires quite some additional code (we would need to
>   hook in free_huge_page and some other places).
>   So I decided to not make the code overly complicated and just fail normally
>   if the page we allocated in the meantime.
> 
> Because of the way we handle now in-use pages, we no longer need the
> put-as-isolation-migratetype dance, that was guarding for poisoned pages
> to end up in pcplists.

I am sorry but I got lost in the above description and I cannot really
make much sense from the code either. Let me try to outline the way how
I think about this.

Say we have a pfn to hwpoison. We have effectivelly three possibilities
- page is poisoned already - done nothing to do
- page is managed by the buddy allocator - excavate from there
- page is in use

The last category is the most interesting one. There are essentially
three classes of pages
- freeable
- migrateable
- others

We cannot do really much about the last one, right? Do we mark them
HWPoison anyway?
Freeable should be simply marked HWPoison and freed.
For all those migrateable, we simply do migrate and mark HWPoison.

Now the main question is how to handle HWPoison page when it is freed
- aka last reference is dropped. The main question is whether the last
reference is ever dropped. If yes then the free_pages_prepare should
never release it to the allocator (some compound destructors would have
to special case as well, e.g. hugetlb would have to hand over to the
allocator rather than a pool). If not then the page would be lingering
potentially with some state bound to it (e.g. memcg charge).  So I
suspect you want the former.

> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  include/linux/page-flags.h |  5 -----
>  mm/memory-failure.c        | 43 ++++++++++++++-----------------------------
>  mm/migrate.c               | 11 +++--------
>  mm/page_alloc.c            | 31 +++----------------------------
>  4 files changed, 20 insertions(+), 70 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index f91cb8898ff0..21df81c9ea57 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -414,13 +414,8 @@ PAGEFLAG_FALSE(Uncached)
>  PAGEFLAG(HWPoison, hwpoison, PF_ANY)
>  TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
>  #define __PG_HWPOISON (1UL << PG_hwpoison)
> -extern bool set_hwpoison_free_buddy_page(struct page *page);
>  #else
>  PAGEFLAG_FALSE(HWPoison)
> -static inline bool set_hwpoison_free_buddy_page(struct page *page)
> -{
> -	return 0;
> -}
>  #define __PG_HWPOISON 0
>  #endif
>  
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 1d986580522d..9b40cf1cb4fc 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -80,9 +80,11 @@ EXPORT_SYMBOL_GPL(hwpoison_filter_flags_value);
>  
>  extern bool take_page_off_buddy(struct page *page);
>  
> -static void page_handle_poison(struct page *page)
> +static void page_handle_poison(struct page *page, bool release)
>  {
>  	SetPageHWPoison(page);
> +	if (release)
> +		put_page(page);
>  	page_ref_inc(page);
>  	num_poisoned_pages_inc();
>  }
> @@ -1713,19 +1715,13 @@ static int soft_offline_huge_page(struct page *page)
>  			ret = -EIO;
>  	} else {
>  		/*
> -		 * We set PG_hwpoison only when the migration source hugepage
> -		 * was successfully dissolved, because otherwise hwpoisoned
> -		 * hugepage remains on free hugepage list, then userspace will
> -		 * find it as SIGBUS by allocation failure. That's not expected
> -		 * in soft-offlining.
> +		 * We set PG_hwpoison only when we were able to take the page
> +		 * off the buddy.
>  		 */
> -		ret = dissolve_free_huge_page(page);
> -		if (!ret) {
> -			if (set_hwpoison_free_buddy_page(page))
> -				num_poisoned_pages_inc();
> -			else
> -				ret = -EBUSY;
> -		}
> +		if (!dissolve_free_huge_page(page) && take_page_off_buddy(page))
> +			page_handle_poison(page, false);
> +		else
> +			ret = -EBUSY;
>  	}
>  	return ret;
>  }
> @@ -1760,10 +1756,8 @@ static int __soft_offline_page(struct page *page)
>  	 * would need to fix isolation locking first.
>  	 */
>  	if (ret == 1) {
> -		put_page(page);
>  		pr_info("soft_offline: %#lx: invalidated\n", pfn);
> -		SetPageHWPoison(page);
> -		num_poisoned_pages_inc();
> +		page_handle_poison(page, true);
>  		return 0;
>  	}
>  
> @@ -1794,7 +1788,9 @@ static int __soft_offline_page(struct page *page)
>  		list_add(&page->lru, &pagelist);
>  		ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
>  					MIGRATE_SYNC, MR_MEMORY_FAILURE);
> -		if (ret) {
> +		if (!ret) {
> +			page_handle_poison(page, true);
> +		} else {
>  			if (!list_empty(&pagelist))
>  				putback_movable_pages(&pagelist);
>  
> @@ -1813,27 +1809,16 @@ static int __soft_offline_page(struct page *page)
>  static int soft_offline_in_use_page(struct page *page)
>  {
>  	int ret;
> -	int mt;
>  	struct page *hpage = compound_head(page);
>  
>  	if (!PageHuge(page) && PageTransHuge(hpage))
>  		if (try_to_split_thp_page(page, "soft offline") < 0)
>  			return -EBUSY;
>  
> -	/*
> -	 * Setting MIGRATE_ISOLATE here ensures that the page will be linked
> -	 * to free list immediately (not via pcplist) when released after
> -	 * successful page migration. Otherwise we can't guarantee that the
> -	 * page is really free after put_page() returns, so
> -	 * set_hwpoison_free_buddy_page() highly likely fails.
> -	 */
> -	mt = get_pageblock_migratetype(page);
> -	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
>  	if (PageHuge(page))
>  		ret = soft_offline_huge_page(page);
>  	else
>  		ret = __soft_offline_page(page);
> -	set_pageblock_migratetype(page, mt);
>  	return ret;
>  }
>  
> @@ -1842,7 +1827,7 @@ static int soft_offline_free_page(struct page *page)
>  	int rc = -EBUSY;
>  
>  	if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
> -		page_handle_poison(page);
> +		page_handle_poison(page, false);
>  		rc = 0;
>  	}
>  
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4fe45d1428c8..71acece248d7 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1224,16 +1224,11 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
>  	 * we want to retry.
>  	 */
>  	if (rc == MIGRATEPAGE_SUCCESS) {
> -		put_page(page);
> -		if (reason == MR_MEMORY_FAILURE) {
> +		if (reason != MR_MEMORY_FAILURE)
>  			/*
> -			 * Set PG_HWPoison on just freed page
> -			 * intentionally. Although it's rather weird,
> -			 * it's how HWPoison flag works at the moment.
> +			 * We release the page in page_handle_poison.
>  			 */
> -			if (set_hwpoison_free_buddy_page(page))
> -				num_poisoned_pages_inc();
> -		}
> +			put_page(page);
>  	} else {
>  		if (rc != -EAGAIN) {
>  			if (likely(!__PageMovable(page))) {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 255df0c76a40..cb35a4c8b1f2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1132,6 +1132,9 @@ static __always_inline bool free_pages_prepare(struct page *page,
>  
>  	VM_BUG_ON_PAGE(PageTail(page), page);
>  
> +	if (unlikely(PageHWPoison(page)) && !order)
> +		return false;
> +
>  	trace_mm_page_free(page, order);
>  
>  	/*
> @@ -8698,32 +8701,4 @@ bool take_page_off_buddy(struct page *page)
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  	return ret;
>   }
> -
> -/*
> - * Set PG_hwpoison flag if a given page is confirmed to be a free page.  This
> - * test is performed under the zone lock to prevent a race against page
> - * allocation.
> - */
> -bool set_hwpoison_free_buddy_page(struct page *page)
> -{
> -	struct zone *zone = page_zone(page);
> -	unsigned long pfn = page_to_pfn(page);
> -	unsigned long flags;
> -	unsigned int order;
> -	bool hwpoisoned = false;
> -
> -	spin_lock_irqsave(&zone->lock, flags);
> -	for (order = 0; order < MAX_ORDER; order++) {
> -		struct page *page_head = page - (pfn & ((1 << order) - 1));
> -
> -		if (PageBuddy(page_head) && page_order(page_head) >= order) {
> -			if (!TestSetPageHWPoison(page))
> -				hwpoisoned = true;
> -			break;
> -		}
> -	}
> -	spin_unlock_irqrestore(&zone->lock, flags);
> -
> -	return hwpoisoned;
> -}
>  #endif
> -- 
> 2.12.3
Oscar Salvador Oct. 21, 2019, 1:48 p.m. UTC | #2
On Fri, Oct 18, 2019 at 02:39:01PM +0200, Michal Hocko wrote:
> 
> I am sorry but I got lost in the above description and I cannot really
> make much sense from the code either. Let me try to outline the way how
> I think about this.
> 
> Say we have a pfn to hwpoison. We have effectivelly three possibilities
> - page is poisoned already - done nothing to do
> - page is managed by the buddy allocator - excavate from there
> - page is in use
> 
> The last category is the most interesting one. There are essentially
> three classes of pages
> - freeable
> - migrateable
> - others
> 
> We cannot do really much about the last one, right? Do we mark them
> HWPoison anyway?

We can only perform actions on LRU/Movable pages or hugetlb pages.

So unless the page does not fall into those areas, we do not do anything
with them.

> Freeable should be simply marked HWPoison and freed.
> For all those migrateable, we simply do migrate and mark HWPoison.
> Now the main question is how to handle HWPoison page when it is freed
> - aka last reference is dropped. The main question is whether the last
> reference is ever dropped. If yes then the free_pages_prepare should
> never release it to the allocator (some compound destructors would have
> to special case as well, e.g. hugetlb would have to hand over to the
> allocator rather than a pool). If not then the page would be lingering
> potentially with some state bound to it (e.g. memcg charge).  So I
> suspect you want the former.

For non-hugetlb pages, we do not call put_page in the migration path,
but we do it in page_handle_poison, after the page has been flagged as
hwpoison.
Then the check in free_papes_prepare will see that the page is hwpoison
and will bail out, so the page is not released into the allocator/pcp lists.

Hugetlb pages follow a different methodology.
They are dissolved, and then we split the higher-order page and take the
page off the buddy.
The problem is that while it is easy to hold a non-hugetlb page,
doing the same for hugetlb pages is not that easy:

1) we would need to hook in enqueue_hugetlb_page so the page is not enqueued
   into hugetlb freelists
2) when trying to free a hugetlb page, we would need to do as we do for gigantic
   pages now, and that is breaking down the pages into order-0 pages and release
   them to the buddy (so the check in free_papges_prepare would skip the
   hwpoison page).
   Trying to handle a higher-order hwpoison page in free_pages_prepare is
   a bit complicated.
   
There is one thing I was unsure though.
Bailing out at the beginning of free_pages_prepare if the page is hwpoison
means that the calls to

- __memcg_kmem_uncharge
- page_cpupid_reset_last
- reset_page_owner
- ...

will not be performed.
I thought this is right because the page is not really "free", it is just unusable,
so.. it should be still charged to the memcg?
Michal Hocko Oct. 21, 2019, 2:06 p.m. UTC | #3
On Mon 21-10-19 15:48:48, Oscar Salvador wrote:
> On Fri, Oct 18, 2019 at 02:39:01PM +0200, Michal Hocko wrote:
> > 
> > I am sorry but I got lost in the above description and I cannot really
> > make much sense from the code either. Let me try to outline the way how
> > I think about this.
> > 
> > Say we have a pfn to hwpoison. We have effectivelly three possibilities
> > - page is poisoned already - done nothing to do
> > - page is managed by the buddy allocator - excavate from there
> > - page is in use
> > 
> > The last category is the most interesting one. There are essentially
> > three classes of pages
> > - freeable
> > - migrateable
> > - others
> > 
> > We cannot do really much about the last one, right? Do we mark them
> > HWPoison anyway?
> 
> We can only perform actions on LRU/Movable pages or hugetlb pages.

What would prevent other pages mapped via page tables to be handled as
well?

> So unless the page does not fall into those areas, we do not do anything
> with them.
> 
> > Freeable should be simply marked HWPoison and freed.
> > For all those migrateable, we simply do migrate and mark HWPoison.
> > Now the main question is how to handle HWPoison page when it is freed
> > - aka last reference is dropped. The main question is whether the last
> > reference is ever dropped. If yes then the free_pages_prepare should
> > never release it to the allocator (some compound destructors would have
> > to special case as well, e.g. hugetlb would have to hand over to the
> > allocator rather than a pool). If not then the page would be lingering
> > potentially with some state bound to it (e.g. memcg charge).  So I
> > suspect you want the former.
> 
> For non-hugetlb pages, we do not call put_page in the migration path,
> but we do it in page_handle_poison, after the page has been flagged as
> hwpoison.
> Then the check in free_papes_prepare will see that the page is hwpoison
> and will bail out, so the page is not released into the allocator/pcp lists.
> 
> Hugetlb pages follow a different methodology.
> They are dissolved, and then we split the higher-order page and take the
> page off the buddy.
> The problem is that while it is easy to hold a non-hugetlb page,
> doing the same for hugetlb pages is not that easy:
> 
> 1) we would need to hook in enqueue_hugetlb_page so the page is not enqueued
>    into hugetlb freelists
> 2) when trying to free a hugetlb page, we would need to do as we do for gigantic
>    pages now, and that is breaking down the pages into order-0 pages and release
>    them to the buddy (so the check in free_papges_prepare would skip the
>    hwpoison page).
>    Trying to handle a higher-order hwpoison page in free_pages_prepare is
>    a bit complicated.

I am not sure I see the problem. If you dissolve the hugetlb page then
there is no hugetlb page anymore and so you make it a regular high-order
page.

> There is one thing I was unsure though.
> Bailing out at the beginning of free_pages_prepare if the page is hwpoison
> means that the calls to
> 
> - __memcg_kmem_uncharge
> - page_cpupid_reset_last
> - reset_page_owner
> - ...
> 
> will not be performed.
> I thought this is right because the page is not really "free", it is just unusable,
> so.. it should be still charged to the memcg?

If the page is free then it shouldn't pin the memcg or any other state.
Oscar Salvador Oct. 22, 2019, 7:56 a.m. UTC | #4
On Mon, Oct 21, 2019 at 04:06:19PM +0200, Michal Hocko wrote:
> On Mon 21-10-19 15:48:48, Oscar Salvador wrote:
> > We can only perform actions on LRU/Movable pages or hugetlb pages.
> 
> What would prevent other pages mapped via page tables to be handled as
> well?

What kind of pages?
I mean, I guess it could be done, it was just not implemented, and I
did not want to add more "features" as my main goal was to re-work
the interface to be more deterministic.

> > 1) we would need to hook in enqueue_hugetlb_page so the page is not enqueued
> >    into hugetlb freelists
> > 2) when trying to free a hugetlb page, we would need to do as we do for gigantic
> >    pages now, and that is breaking down the pages into order-0 pages and release
> >    them to the buddy (so the check in free_papges_prepare would skip the
> >    hwpoison page).
> >    Trying to handle a higher-order hwpoison page in free_pages_prepare is
> >    a bit complicated.
> 
> I am not sure I see the problem. If you dissolve the hugetlb page then
> there is no hugetlb page anymore and so you make it a regular high-order
> page.

Yes, but the problem comes when trying to work with a hwpoison high-order page
in free_pages_prepare, it gets more complicated, and when I weigthed
code vs benefits, I was not really sure to go down that road.

If we get a hwpoison high-order page in free_pages_prepare, we need to
break it down to smaller pages, so we can skip the "bad" to not be sent
into buddy allocator.

> If the page is free then it shouldn't pin the memcg or any other state.

Well, it is not really free, as it is not usable, is it?
Anyway, I do agree that we should clean the bondings to other subsystems like memcg.
Michal Hocko Oct. 22, 2019, 8:30 a.m. UTC | #5
On Tue 22-10-19 09:56:27, Oscar Salvador wrote:
> On Mon, Oct 21, 2019 at 04:06:19PM +0200, Michal Hocko wrote:
> > On Mon 21-10-19 15:48:48, Oscar Salvador wrote:
> > > We can only perform actions on LRU/Movable pages or hugetlb pages.
> > 
> > What would prevent other pages mapped via page tables to be handled as
> > well?
> 
> What kind of pages?

Any pages mapped to the userspace. E.g. driver memory which is not on
LRU.

> I mean, I guess it could be done, it was just not implemented, and I
> did not want to add more "features" as my main goal was to re-work
> the interface to be more deterministic.

Fair enough. One step at the time sounds definitely good
 
> > > 1) we would need to hook in enqueue_hugetlb_page so the page is not enqueued
> > >    into hugetlb freelists
> > > 2) when trying to free a hugetlb page, we would need to do as we do for gigantic
> > >    pages now, and that is breaking down the pages into order-0 pages and release
> > >    them to the buddy (so the check in free_papges_prepare would skip the
> > >    hwpoison page).
> > >    Trying to handle a higher-order hwpoison page in free_pages_prepare is
> > >    a bit complicated.
> > 
> > I am not sure I see the problem. If you dissolve the hugetlb page then
> > there is no hugetlb page anymore and so you make it a regular high-order
> > page.
> 
> Yes, but the problem comes when trying to work with a hwpoison high-order page
> in free_pages_prepare, it gets more complicated, and when I weigthed
> code vs benefits, I was not really sure to go down that road.
> 
> If we get a hwpoison high-order page in free_pages_prepare, we need to
> break it down to smaller pages, so we can skip the "bad" to not be sent
> into buddy allocator.

But we have destructors for compound pages. Can we do the heavy lifting
there?

> > If the page is free then it shouldn't pin the memcg or any other state.
> 
> Well, it is not really free, as it is not usable, is it?

Sorry I meant to say the page is free from the memcg POV - aka no task
from the memcg is holding a reference to it. The page is not usable for
anybody, that is true but no particular memcg should pay a price for
that. This would mean that random memcgs would end up pinned for ever
without a good reason.
Oscar Salvador Oct. 22, 2019, 9:40 a.m. UTC | #6
On Tue, Oct 22, 2019 at 10:30:02AM +0200, Michal Hocko wrote:
> But we have destructors for compound pages. Can we do the heavy lifting
> there?

Yes, we could.
Actually, I tried that approach, but I thought it was simpler this way.
Since there is no hurry in this, I will try to take that up again and
see how it looks.

> > > If the page is free then it shouldn't pin the memcg or any other state.
> > 
> > Well, it is not really free, as it is not usable, is it?
> 
> Sorry I meant to say the page is free from the memcg POV - aka no task
> from the memcg is holding a reference to it. The page is not usable for
> anybody, that is true but no particular memcg should pay a price for
> that. This would mean that random memcgs would end up pinned for ever
> without a good reason.

Sure, I will re-work that.

Patch
diff mbox series

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f91cb8898ff0..21df81c9ea57 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -414,13 +414,8 @@  PAGEFLAG_FALSE(Uncached)
 PAGEFLAG(HWPoison, hwpoison, PF_ANY)
 TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
 #define __PG_HWPOISON (1UL << PG_hwpoison)
-extern bool set_hwpoison_free_buddy_page(struct page *page);
 #else
 PAGEFLAG_FALSE(HWPoison)
-static inline bool set_hwpoison_free_buddy_page(struct page *page)
-{
-	return 0;
-}
 #define __PG_HWPOISON 0
 #endif
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 1d986580522d..9b40cf1cb4fc 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -80,9 +80,11 @@  EXPORT_SYMBOL_GPL(hwpoison_filter_flags_value);
 
 extern bool take_page_off_buddy(struct page *page);
 
-static void page_handle_poison(struct page *page)
+static void page_handle_poison(struct page *page, bool release)
 {
 	SetPageHWPoison(page);
+	if (release)
+		put_page(page);
 	page_ref_inc(page);
 	num_poisoned_pages_inc();
 }
@@ -1713,19 +1715,13 @@  static int soft_offline_huge_page(struct page *page)
 			ret = -EIO;
 	} else {
 		/*
-		 * We set PG_hwpoison only when the migration source hugepage
-		 * was successfully dissolved, because otherwise hwpoisoned
-		 * hugepage remains on free hugepage list, then userspace will
-		 * find it as SIGBUS by allocation failure. That's not expected
-		 * in soft-offlining.
+		 * We set PG_hwpoison only when we were able to take the page
+		 * off the buddy.
 		 */
-		ret = dissolve_free_huge_page(page);
-		if (!ret) {
-			if (set_hwpoison_free_buddy_page(page))
-				num_poisoned_pages_inc();
-			else
-				ret = -EBUSY;
-		}
+		if (!dissolve_free_huge_page(page) && take_page_off_buddy(page))
+			page_handle_poison(page, false);
+		else
+			ret = -EBUSY;
 	}
 	return ret;
 }
@@ -1760,10 +1756,8 @@  static int __soft_offline_page(struct page *page)
 	 * would need to fix isolation locking first.
 	 */
 	if (ret == 1) {
-		put_page(page);
 		pr_info("soft_offline: %#lx: invalidated\n", pfn);
-		SetPageHWPoison(page);
-		num_poisoned_pages_inc();
+		page_handle_poison(page, true);
 		return 0;
 	}
 
@@ -1794,7 +1788,9 @@  static int __soft_offline_page(struct page *page)
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
 					MIGRATE_SYNC, MR_MEMORY_FAILURE);
-		if (ret) {
+		if (!ret) {
+			page_handle_poison(page, true);
+		} else {
 			if (!list_empty(&pagelist))
 				putback_movable_pages(&pagelist);
 
@@ -1813,27 +1809,16 @@  static int __soft_offline_page(struct page *page)
 static int soft_offline_in_use_page(struct page *page)
 {
 	int ret;
-	int mt;
 	struct page *hpage = compound_head(page);
 
 	if (!PageHuge(page) && PageTransHuge(hpage))
 		if (try_to_split_thp_page(page, "soft offline") < 0)
 			return -EBUSY;
 
-	/*
-	 * Setting MIGRATE_ISOLATE here ensures that the page will be linked
-	 * to free list immediately (not via pcplist) when released after
-	 * successful page migration. Otherwise we can't guarantee that the
-	 * page is really free after put_page() returns, so
-	 * set_hwpoison_free_buddy_page() highly likely fails.
-	 */
-	mt = get_pageblock_migratetype(page);
-	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 	if (PageHuge(page))
 		ret = soft_offline_huge_page(page);
 	else
 		ret = __soft_offline_page(page);
-	set_pageblock_migratetype(page, mt);
 	return ret;
 }
 
@@ -1842,7 +1827,7 @@  static int soft_offline_free_page(struct page *page)
 	int rc = -EBUSY;
 
 	if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
-		page_handle_poison(page);
+		page_handle_poison(page, false);
 		rc = 0;
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 4fe45d1428c8..71acece248d7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1224,16 +1224,11 @@  static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 	 * we want to retry.
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		put_page(page);
-		if (reason == MR_MEMORY_FAILURE) {
+		if (reason != MR_MEMORY_FAILURE)
 			/*
-			 * Set PG_HWPoison on just freed page
-			 * intentionally. Although it's rather weird,
-			 * it's how HWPoison flag works at the moment.
+			 * We release the page in page_handle_poison.
 			 */
-			if (set_hwpoison_free_buddy_page(page))
-				num_poisoned_pages_inc();
-		}
+			put_page(page);
 	} else {
 		if (rc != -EAGAIN) {
 			if (likely(!__PageMovable(page))) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 255df0c76a40..cb35a4c8b1f2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1132,6 +1132,9 @@  static __always_inline bool free_pages_prepare(struct page *page,
 
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
+	if (unlikely(PageHWPoison(page)) && !order)
+		return false;
+
 	trace_mm_page_free(page, order);
 
 	/*
@@ -8698,32 +8701,4 @@  bool take_page_off_buddy(struct page *page)
 	spin_unlock_irqrestore(&zone->lock, flags);
 	return ret;
  }
-
-/*
- * Set PG_hwpoison flag if a given page is confirmed to be a free page.  This
- * test is performed under the zone lock to prevent a race against page
- * allocation.
- */
-bool set_hwpoison_free_buddy_page(struct page *page)
-{
-	struct zone *zone = page_zone(page);
-	unsigned long pfn = page_to_pfn(page);
-	unsigned long flags;
-	unsigned int order;
-	bool hwpoisoned = false;
-
-	spin_lock_irqsave(&zone->lock, flags);
-	for (order = 0; order < MAX_ORDER; order++) {
-		struct page *page_head = page - (pfn & ((1 << order) - 1));
-
-		if (PageBuddy(page_head) && page_order(page_head) >= order) {
-			if (!TestSetPageHWPoison(page))
-				hwpoisoned = true;
-			break;
-		}
-	}
-	spin_unlock_irqrestore(&zone->lock, flags);
-
-	return hwpoisoned;
-}
 #endif