diff mbox series

mm/memory_hotplug: avoid consuming corrupted data when offline pages

Message ID 20220421135129.19767-1-linmiaohe@huawei.com (mailing list archive)
State New
Headers show
Series mm/memory_hotplug: avoid consuming corrupted data when offline pages | expand

Commit Message

Miaohe Lin April 21, 2022, 1:51 p.m. UTC
When trying to offline pages, HWPoisoned hugepage is migrated without
checking PageHWPoison first. So corrupted data could be consumed. Fix
it by deferring isolate_huge_page until PageHWPoison is handled.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/memory_hotplug.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

Comments

David Hildenbrand April 21, 2022, 2:20 p.m. UTC | #1
On 21.04.22 15:51, Miaohe Lin wrote:
> When trying to offline pages, HWPoisoned hugepage is migrated without
> checking PageHWPoison first. So corrupted data could be consumed. Fix
> it by deferring isolate_huge_page until PageHWPoison is handled.
> 

CCing Oscar, Mike and Naoya

> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/memory_hotplug.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 4c6065e5d274..093f85ec5c5c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1600,11 +1600,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>  		folio = page_folio(page);
>  		head = &folio->page;
>  
> -		if (PageHuge(page)) {
> +		if (PageHuge(page))
>  			pfn = page_to_pfn(head) + compound_nr(head) - 1;
> -			isolate_huge_page(head, &source);
> -			continue;
> -		} else if (PageTransHuge(page))
> +		else if (PageTransHuge(page))
>  			pfn = page_to_pfn(head) + thp_nr_pages(page) - 1;
>  
>  		/*
> @@ -1622,6 +1620,11 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>  			continue;
>  		}
>  
> +		if (PageHuge(page)) {
> +			isolate_huge_page(head, &source);
> +			continue;
> +		}
> +
>  		if (!get_page_unless_zero(page))
>  			continue;
>  		/*

The problem statement makes sense to me but I am not sure about the
details if we run into the "PageHWPoison" path with a huge page. I have
the gut feeling that we have to do more for huge pages in the
PageHWPoison() path, because we might be dealing with a free huge page
after unmap succeeds. I might be wrong.
Mike Kravetz April 21, 2022, 10:04 p.m. UTC | #2
On 4/21/22 07:20, David Hildenbrand wrote:
> On 21.04.22 15:51, Miaohe Lin wrote:
>> When trying to offline pages, HWPoisoned hugepage is migrated without
>> checking PageHWPoison first. So corrupted data could be consumed. Fix
>> it by deferring isolate_huge_page until PageHWPoison is handled.
>>
> 
> CCing Oscar, Mike and Naoya
> 
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> ---
>>  mm/memory_hotplug.c | 11 +++++++----
>>  1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 4c6065e5d274..093f85ec5c5c 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1600,11 +1600,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>>  		folio = page_folio(page);
>>  		head = &folio->page;
>>  
>> -		if (PageHuge(page)) {
>> +		if (PageHuge(page))
>>  			pfn = page_to_pfn(head) + compound_nr(head) - 1;
>> -			isolate_huge_page(head, &source);
>> -			continue;
>> -		} else if (PageTransHuge(page))
>> +		else if (PageTransHuge(page))
>>  			pfn = page_to_pfn(head) + thp_nr_pages(page) - 1;
>>  
>>  		/*
>> @@ -1622,6 +1620,11 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>>  			continue;
>>  		}
>>  
>> +		if (PageHuge(page)) {
>> +			isolate_huge_page(head, &source);
>> +			continue;
>> +		}
>> +
>>  		if (!get_page_unless_zero(page))
>>  			continue;
>>  		/*
> 
> The problem statement makes sense to me but I am not sure about the
> details if we run into the "PageHWPoison" path with a huge page. I have
> the gut feeling that we have to do more for huge pages in the
> PageHWPoison() path, because we might be dealing with a free huge page
> after unmap succeeds. I might be wrong.
> 

Thinking about memory errors always makes my head hurt :)

What 'state' could a poisoned hugetlb page be in here?
- Mapped into a process address space?
- On the hugetlb free lists?

IIUC, when poisoning a hugetlb page we try to dissolve those that are
free and unmap those which are mapped.  So, this means those operations
must have failed for some reason.  Is that correct?

Now, IF the above is true this implies there is a poisoned page somewhere
within the hugetlb page.  But, poison is only marked in the head page.
So, we do not really know where within the page the actual error exists.
Is my reasoning still correct?

If my reasoning is correct, then I am not sure what we can do here.
The code to handle poison is:

                 if (PageHWPoison(page)) {
                        if (WARN_ON(folio_test_lru(folio)))
                                folio_isolate_lru(folio);
                        if (folio_mapped(folio))
                                try_to_unmap(folio, TTU_IGNORE_MLOCK);
                        continue;
                }

My first observation is that if a hugetlb page is passed to try_to_unmap
as above we may BUG.  This is because we need to hold i_mmap_rwsem in
write mode because of the possibility of calling huge_pmd_unshare.  :(

I 'think' try_to_unmap could succeed on a poisoned hugetlb page once we
add correct locking.  So, the page would be unmapped.  I do not think anyone
is holding a ref, so the last unmap should put the hugetlb page on the
free list.  Correct?  We will not migrate the page, but ...

After the call to do_migrate_range() in offline_pages, we will call
dissolve_free_huge_pages.  For each hugetlb page, dissolve_free_huge_pages
will call dissolve_free_huge_page likely passing in the 'head' page.
When dissolve_free_huge_page is called for poisoned hugetlb pages from
the memory error handling code, it passes in the sub page which contains
the memory error.  Before freeing the hugetlb page to buddy, there is this
code:

                        /*
                         * Move PageHWPoison flag from head page to the raw
                         * error page, which makes any subpages rather than
                         * the error page reusable.
                         */
                        if (PageHWPoison(head) && page != head) {
                                SetPageHWPoison(page);
                                ClearPageHWPoison(head);
                        }
                        update_and_free_page(h, head, false)

As previously mentioned, outside the memory error handling code we do
not know what page within the hugetlb page contains poison.  So, this
will likely result with a page with errors on the free list and an OK
page marked with error.

In other places, we 'bail' if we encounter a hugetlb page with poison.
It would be unfortunate to prevent memory offline if the range contains
a hugetlb page with poison.  After all, offlining a section with poison
make sense.
Miaohe Lin April 22, 2022, 6:53 a.m. UTC | #3
On 2022/4/22 6:04, Mike Kravetz wrote:
> On 4/21/22 07:20, David Hildenbrand wrote:
>> On 21.04.22 15:51, Miaohe Lin wrote:
>>> When trying to offline pages, HWPoisoned hugepage is migrated without
>>> checking PageHWPoison first. So corrupted data could be consumed. Fix
>>> it by deferring isolate_huge_page until PageHWPoison is handled.
>>>
>>
>> CCing Oscar, Mike and Naoya
>>
>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>> ---
>>>  mm/memory_hotplug.c | 11 +++++++----
>>>  1 file changed, 7 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>> index 4c6065e5d274..093f85ec5c5c 100644
>>> --- a/mm/memory_hotplug.c
>>> +++ b/mm/memory_hotplug.c
>>> @@ -1600,11 +1600,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>>>  		folio = page_folio(page);
>>>  		head = &folio->page;
>>>  
>>> -		if (PageHuge(page)) {
>>> +		if (PageHuge(page))
>>>  			pfn = page_to_pfn(head) + compound_nr(head) - 1;
>>> -			isolate_huge_page(head, &source);
>>> -			continue;
>>> -		} else if (PageTransHuge(page))
>>> +		else if (PageTransHuge(page))
>>>  			pfn = page_to_pfn(head) + thp_nr_pages(page) - 1;
>>>  
>>>  		/*
>>> @@ -1622,6 +1620,11 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>>>  			continue;
>>>  		}
>>>  
>>> +		if (PageHuge(page)) {
>>> +			isolate_huge_page(head, &source);
>>> +			continue;
>>> +		}
>>> +
>>>  		if (!get_page_unless_zero(page))
>>>  			continue;
>>>  		/*
>>
>> The problem statement makes sense to me but I am not sure about the
>> details if we run into the "PageHWPoison" path with a huge page. I have
>> the gut feeling that we have to do more for huge pages in the
>> PageHWPoison() path, because we might be dealing with a free huge page
>> after unmap succeeds. I might be wrong.
>>
> 
> Thinking about memory errors always makes my head hurt :)

Me too. ;)

> 
> What 'state' could a poisoned hugetlb page be in here?
> - Mapped into a process address space?
> - On the hugetlb free lists?
> 
> IIUC, when poisoning a hugetlb page we try to dissolve those that are
> free and unmap those which are mapped.  So, this means those operations
> must have failed for some reason.  Is that correct?

I think you're right.

BTW: Now that PageHWPoison is set under hugetlb_lock and HPageFreed is also checked
under that lock, it should be more likely to handle the hugetlb page on the free lists
successfully because the possible race with hugetlb page allocation or demotion can be
prevented by PageHWPoison flag. So it is more likely to be a mapped hugetlb page here.
But that doesn't matter.

> 
> Now, IF the above is true this implies there is a poisoned page somewhere
> within the hugetlb page.  But, poison is only marked in the head page.
> So, we do not really know where within the page the actual error exists.
> Is my reasoning still correct?
> 

IIRC, Naoya once proposed to store the poisoned page info into subpage field.

> If my reasoning is correct, then I am not sure what we can do here.
> The code to handle poison is:
> 
>                  if (PageHWPoison(page)) {
>                         if (WARN_ON(folio_test_lru(folio)))
>                                 folio_isolate_lru(folio);
>                         if (folio_mapped(folio))
>                                 try_to_unmap(folio, TTU_IGNORE_MLOCK);
>                         continue;
>                 }
> 
> My first observation is that if a hugetlb page is passed to try_to_unmap
> as above we may BUG.  This is because we need to hold i_mmap_rwsem in
> write mode because of the possibility of calling huge_pmd_unshare.  :(

I'm sorry. I missed that case. :(

> 
> I 'think' try_to_unmap could succeed on a poisoned hugetlb page once we
> add correct locking.  So, the page would be unmapped.  I do not think anyone
> is holding a ref, so the last unmap should put the hugetlb page on the
> free list.  Correct?  We will not migrate the page, but ...

Might hugetlb page still be in the pagecache? But yes, the last unmap could put
the hugetlb page on the free list.

> 
> After the call to do_migrate_range() in offline_pages, we will call
> dissolve_free_huge_pages.  For each hugetlb page, dissolve_free_huge_pages
> will call dissolve_free_huge_page likely passing in the 'head' page.
> When dissolve_free_huge_page is called for poisoned hugetlb pages from
> the memory error handling code, it passes in the sub page which contains
> the memory error.  Before freeing the hugetlb page to buddy, there is this
> code:
> 
>                         /*
>                          * Move PageHWPoison flag from head page to the raw
>                          * error page, which makes any subpages rather than
>                          * the error page reusable.
>                          */
>                         if (PageHWPoison(head) && page != head) {
>                                 SetPageHWPoison(page);
>                                 ClearPageHWPoison(head);
>                         }
>                         update_and_free_page(h, head, false)
> 
> As previously mentioned, outside the memory error handling code we do
> not know what page within the hugetlb page contains poison.  So, this
> will likely result with a page with errors on the free list and an OK
> page marked with error.

Yes, this is possible. It's really a pity. Could we just reject to dissolve the
hugetlb page and keep the hugetlb page around? This will waste some healthy subpages
but can do the right things?

> 
> In other places, we 'bail' if we encounter a hugetlb page with poison.
> It would be unfortunate to prevent memory offline if the range contains
> a hugetlb page with poison.  After all, offlining a section with poison
> make sense.
> 

IIUC, a (hugetlb) page with poison could be offlined anytime even if it's still in the
pagecache, swapcache, i.e. still be referenced by somewhere, because memory offline will
just offline the poisoned page while ignoring the page refcnt totally. I can't figure
out a easy way to fix it yet. :(

Thanks!
Miaohe Lin April 26, 2022, 11:41 a.m. UTC | #4
On 2022/4/25 21:55, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Fri, Apr 22, 2022 at 02:53:48PM +0800, Miaohe Lin wrote:
>> On 2022/4/22 6:04, Mike Kravetz wrote:
>>> On 4/21/22 07:20, David Hildenbrand wrote:
>>>> On 21.04.22 15:51, Miaohe Lin wrote:
>>>>> When trying to offline pages, HWPoisoned hugepage is migrated without
>>>>> checking PageHWPoison first. So corrupted data could be consumed. Fix
>>>>> it by deferring isolate_huge_page until PageHWPoison is handled.
>>>>>
>>>>
>>>> CCing Oscar, Mike and Naoya
>>>>
>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>> ---
>>>>>  mm/memory_hotplug.c | 11 +++++++----
>>>>>  1 file changed, 7 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>>>> index 4c6065e5d274..093f85ec5c5c 100644
>>>>> --- a/mm/memory_hotplug.c
>>>>> +++ b/mm/memory_hotplug.c
>>>>> @@ -1600,11 +1600,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>>>>>  		folio = page_folio(page);
>>>>>  		head = &folio->page;
>>>>>  
>>>>> -		if (PageHuge(page)) {
>>>>> +		if (PageHuge(page))
>>>>>  			pfn = page_to_pfn(head) + compound_nr(head) - 1;
>>>>> -			isolate_huge_page(head, &source);
>>>>> -			continue;
>>>>> -		} else if (PageTransHuge(page))
>>>>> +		else if (PageTransHuge(page))
>>>>>  			pfn = page_to_pfn(head) + thp_nr_pages(page) - 1;
>>>>>  
>>>>>  		/*
>>>>> @@ -1622,6 +1620,11 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>>>>>  			continue;
>>>>>  		}
>>>>>  
>>>>> +		if (PageHuge(page)) {
>>>>> +			isolate_huge_page(head, &source);
>>>>> +			continue;
>>>>> +		}
>>>>> +
>>>>>  		if (!get_page_unless_zero(page))
>>>>>  			continue;
>>>>>  		/*
>>>>
>>>> The problem statement makes sense to me but I am not sure about the
>>>> details if we run into the "PageHWPoison" path with a huge page. I have
>>>> the gut feeling that we have to do more for huge pages in the
>>>> PageHWPoison() path, because we might be dealing with a free huge page
>>>> after unmap succeeds. I might be wrong.
>>>>
>>>
>>> Thinking about memory errors always makes my head hurt :)
>>
>> Me too. ;)
>>
>>>
>>> What 'state' could a poisoned hugetlb page be in here?
>>> - Mapped into a process address space?
>>> - On the hugetlb free lists?
> 
> There seems at least 2 cases where a hwpoisoned hugetlb page keeps undissolved.
>   (1) Hwpoisoned hugepage in free hugepage list
>   (2) Hwpoisoned hugepage belonging to a hugetlbfs file
> 
>>>
>>> IIUC, when poisoning a hugetlb page we try to dissolve those that are
>>> free and unmap those which are mapped.  So, this means those operations
>>> must have failed for some reason.  Is that correct?
> 
> Yes, (1) is made by failure in error handling, but (2) can be made even
> when memory error on in-use hugepage is successfully handled.

Could you please explain 2 more detailed? IIUC, in-use hugepage belonging to a hugetlbfs
file will be removed from pagecache via hugetlbfs_error_remove_page. So when successfully
handled, hugepage->mapping will be NULL and thus not belonging to a hugetlbfs file. Or am
I miss something?

> 
>>
>> I think you're right.
>>
>> BTW: Now that PageHWPoison is set under hugetlb_lock and HPageFreed is also checked
>> under that lock, it should be more likely to handle the hugetlb page on the free lists
>> successfully because the possible race with hugetlb page allocation or demotion can be
>> prevented by PageHWPoison flag. So it is more likely to be a mapped hugetlb page here.
>> But that doesn't matter.
>>
>>>
>>> Now, IF the above is true this implies there is a poisoned page somewhere
>>> within the hugetlb page.  But, poison is only marked in the head page.
>>> So, we do not really know where within the page the actual error exists.
>>> Is my reasoning still correct?
>>>
>>
>> IIRC, Naoya once proposed to store the poisoned page info into subpage field.
> 
> Yes, it's important to keep the offset of raw page.  I have a "storing raw
> error info" patch for this, but it's still immature to submit.
> 
>>
>>> If my reasoning is correct, then I am not sure what we can do here.
>>> The code to handle poison is:
>>>
>>>                  if (PageHWPoison(page)) {
>>>                         if (WARN_ON(folio_test_lru(folio)))
>>>                                 folio_isolate_lru(folio);
>>>                         if (folio_mapped(folio))
>>>                                 try_to_unmap(folio, TTU_IGNORE_MLOCK);
>>>                         continue;
>>>                 }
>>>
>>> My first observation is that if a hugetlb page is passed to try_to_unmap
>>> as above we may BUG.  This is because we need to hold i_mmap_rwsem in
>>> write mode because of the possibility of calling huge_pmd_unshare.  :(
>>
>> I'm sorry. I missed that case. :(
> 
> OK, so the above check should not be called for hugetlb pages.
> 
>>
>>>
>>> I 'think' try_to_unmap could succeed on a poisoned hugetlb page once we
>>> add correct locking.  So, the page would be unmapped.  I do not think anyone
>>> is holding a ref, so the last unmap should put the hugetlb page on the
>>> free list.  Correct?  We will not migrate the page, but ...
> 
> That seems depends on whether the hugepage is anonymous or file.
> If we consider anonymous hugepage, the above statement is correct.
> 
> As for what we can do, for (1) we can simply skip the hwpoisoned hugepage
> in do_migrate_range() (i.e. no need to call isolate_huge_page()), because in
> the outer while-loop in offline_pages() we call dissolve_free_huge_pages()
> so free hugepage can be handled here.  The above "storing raw error info"
> patch will also enable dissolve_free_huge_pages() to handle hwpoisoned free
> hugepage (by moving PG_hwpoison flag to the raw error page), so I assume
> to have the patch.

Yes, I think (1) will be fixed this way.

> 
> As for (2), I think that making such hugepages unmovable (by clearing
> HPageMigratable) could work.  With that change, scan_movable_pages() detects
> unmovable pages so do_migrate_range is not called for the hugepages.
> Moreover, scan_movable_pages() should return -EBUSY if an unmovable page is found,
> so offline_pages() fails without trying to touch the hwpoisoned hugepage.

Yes, it seems work but this will lead to the memory offline fails. Could this be too overkill?

> Anyway the reported problem (try to migrate hwpoisoned hugepage) would to be solved.
> 
> # Maybe we might optionally try to free the hwpoisoned hugepage by removing
> # the affected hugetlbfs file, but it's not unclear to me taht that's acceptable
> # (a file is gone when doing memory hotremove?).
> 
>>
>> Might hugetlb page still be in the pagecache? But yes, the last unmap could put
>> the hugetlb page on the free list.
>>
>>>
>>> After the call to do_migrate_range() in offline_pages, we will call
>>> dissolve_free_huge_pages.  For each hugetlb page, dissolve_free_huge_pages
>>> will call dissolve_free_huge_page likely passing in the 'head' page.
>>> When dissolve_free_huge_page is called for poisoned hugetlb pages from
>>> the memory error handling code, it passes in the sub page which contains
>>> the memory error.  Before freeing the hugetlb page to buddy, there is this
>>> code:
>>>
>>>                         /*
>>>                          * Move PageHWPoison flag from head page to the raw
>>>                          * error page, which makes any subpages rather than
>>>                          * the error page reusable.
>>>                          */
>>>                         if (PageHWPoison(head) && page != head) {
>>>                                 SetPageHWPoison(page);
>>>                                 ClearPageHWPoison(head);
>>>                         }
>>>                         update_and_free_page(h, head, false)
>>>
>>> As previously mentioned, outside the memory error handling code we do
>>> not know what page within the hugetlb page contains poison.  So, this
>>> will likely result with a page with errors on the free list and an OK
>>> page marked with error.
>>
>> Yes, this is possible. It's really a pity. Could we just reject to dissolve the
>> hugetlb page and keep the hugetlb page around? This will waste some healthy subpages
>> but can do the right things?
> 
> Could you let me try writing patches in my approach?
> I'll try to submit the fix with the "storing raw error info" patch.

Sure, I will wait for that. Many thanks for your hard work! :)

> 
>>
>>>
>>> In other places, we 'bail' if we encounter a hugetlb page with poison.
>>> It would be unfortunate to prevent memory offline if the range contains
>>> a hugetlb page with poison.  After all, offlining a section with poison
>>> make sense.
>>>
>>
>> IIUC, a (hugetlb) page with poison could be offlined anytime even if it's still in the
>> pagecache, swapcache, i.e. still be referenced by somewhere, because memory offline will
>> just offline the poisoned page while ignoring the page refcnt totally. I can't figure
>> out a easy way to fix it yet. :(
> 
> Yes, that's finally expected behavior, but unlike normal pagecache we need
> pay extra attention for hugetlbfs file because it's not storage-backed.
> 
> Thank you for valuable comments, everyone.

BTW: There is another possible problem about hwpoison. Think about the below scene:

offline_pages
do {
  scan_movable_pages
    __PageMovable /* Page is also hwpoisoned. */
      goto found; /* So ret is always 0. */
  do_migrate_range
     if (PageHWPoison(page)) /* PageMovable hwpoisoned page is not handled. */
        continue;
 } while (!ret); /* We will meet that page again and again. */

So we might busy-loop until a signal is pending. Or am I miss something? And I think this could be fixed by
deferring set hwpoisoned flag until page refcnt is successfully incremented for normal page in memory_failure()
which is already in your to-do list. So this possible issue could be left alone now?

Thanks!

> 
> - Naoya Horiguchi
>
HORIGUCHI NAOYA(堀口 直也) April 27, 2022, 3:13 a.m. UTC | #5
On Tue, Apr 26, 2022 at 07:41:52PM +0800, Miaohe Lin wrote:
> On 2022/4/25 21:55, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Fri, Apr 22, 2022 at 02:53:48PM +0800, Miaohe Lin wrote:
> >> On 2022/4/22 6:04, Mike Kravetz wrote:
> >>> On 4/21/22 07:20, David Hildenbrand wrote:
> >>>> On 21.04.22 15:51, Miaohe Lin wrote:
> >>>>> When trying to offline pages, HWPoisoned hugepage is migrated without
> >>>>> checking PageHWPoison first. So corrupted data could be consumed. Fix
> >>>>> it by deferring isolate_huge_page until PageHWPoison is handled.
> >>>>>
> >>>>
> >>>> CCing Oscar, Mike and Naoya
> >>>>
> >>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> >>>>> ---
> >>>>>  mm/memory_hotplug.c | 11 +++++++----
> >>>>>  1 file changed, 7 insertions(+), 4 deletions(-)
> >>>>>
> >>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> >>>>> index 4c6065e5d274..093f85ec5c5c 100644
> >>>>> --- a/mm/memory_hotplug.c
> >>>>> +++ b/mm/memory_hotplug.c
> >>>>> @@ -1600,11 +1600,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> >>>>>  		folio = page_folio(page);
> >>>>>  		head = &folio->page;
> >>>>>  
> >>>>> -		if (PageHuge(page)) {
> >>>>> +		if (PageHuge(page))
> >>>>>  			pfn = page_to_pfn(head) + compound_nr(head) - 1;
> >>>>> -			isolate_huge_page(head, &source);
> >>>>> -			continue;
> >>>>> -		} else if (PageTransHuge(page))
> >>>>> +		else if (PageTransHuge(page))
> >>>>>  			pfn = page_to_pfn(head) + thp_nr_pages(page) - 1;
> >>>>>  
> >>>>>  		/*
> >>>>> @@ -1622,6 +1620,11 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> >>>>>  			continue;
> >>>>>  		}
> >>>>>  
> >>>>> +		if (PageHuge(page)) {
> >>>>> +			isolate_huge_page(head, &source);
> >>>>> +			continue;
> >>>>> +		}
> >>>>> +
> >>>>>  		if (!get_page_unless_zero(page))
> >>>>>  			continue;
> >>>>>  		/*
> >>>>
> >>>> The problem statement makes sense to me but I am not sure about the
> >>>> details if we run into the "PageHWPoison" path with a huge page. I have
> >>>> the gut feeling that we have to do more for huge pages in the
> >>>> PageHWPoison() path, because we might be dealing with a free huge page
> >>>> after unmap succeeds. I might be wrong.
> >>>>
> >>>
> >>> Thinking about memory errors always makes my head hurt :)
> >>
> >> Me too. ;)
> >>
> >>>
> >>> What 'state' could a poisoned hugetlb page be in here?
> >>> - Mapped into a process address space?
> >>> - On the hugetlb free lists?
> > 
> > There seems at least 2 cases where a hwpoisoned hugetlb page keeps undissolved.
> >   (1) Hwpoisoned hugepage in free hugepage list
> >   (2) Hwpoisoned hugepage belonging to a hugetlbfs file
> > 
> >>>
> >>> IIUC, when poisoning a hugetlb page we try to dissolve those that are
> >>> free and unmap those which are mapped.  So, this means those operations
> >>> must have failed for some reason.  Is that correct?
> > 
> > Yes, (1) is made by failure in error handling, but (2) can be made even
> > when memory error on in-use hugepage is successfully handled.
> 
> Could you please explain 2 more detailed? IIUC, in-use hugepage belonging to a hugetlbfs
> file will be removed from pagecache via hugetlbfs_error_remove_page. So when successfully
> handled, hugepage->mapping will be NULL and thus not belonging to a hugetlbfs file. Or am
> I miss something?

No, you're right.  Sorry for my misunderstanding about who takes the last
refcount in case (2).

> 
> > 
> >>
> >> I think you're right.
> >>
> >> BTW: Now that PageHWPoison is set under hugetlb_lock and HPageFreed is also checked
> >> under that lock, it should be more likely to handle the hugetlb page on the free lists
> >> successfully because the possible race with hugetlb page allocation or demotion can be
> >> prevented by PageHWPoison flag. So it is more likely to be a mapped hugetlb page here.
> >> But that doesn't matter.
> >>
> >>>
> >>> Now, IF the above is true this implies there is a poisoned page somewhere
> >>> within the hugetlb page.  But, poison is only marked in the head page.
> >>> So, we do not really know where within the page the actual error exists.
> >>> Is my reasoning still correct?
> >>>
> >>
> >> IIRC, Naoya once proposed to store the poisoned page info into subpage field.
> > 
> > Yes, it's important to keep the offset of raw page.  I have a "storing raw
> > error info" patch for this, but it's still immature to submit.
> > 
> >>
> >>> If my reasoning is correct, then I am not sure what we can do here.
> >>> The code to handle poison is:
> >>>
> >>>                  if (PageHWPoison(page)) {
> >>>                         if (WARN_ON(folio_test_lru(folio)))
> >>>                                 folio_isolate_lru(folio);
> >>>                         if (folio_mapped(folio))
> >>>                                 try_to_unmap(folio, TTU_IGNORE_MLOCK);
> >>>                         continue;
> >>>                 }
> >>>
> >>> My first observation is that if a hugetlb page is passed to try_to_unmap
> >>> as above we may BUG.  This is because we need to hold i_mmap_rwsem in
> >>> write mode because of the possibility of calling huge_pmd_unshare.  :(
> >>
> >> I'm sorry. I missed that case. :(
> > 
> > OK, so the above check should not be called for hugetlb pages.
> > 
> >>
> >>>
> >>> I 'think' try_to_unmap could succeed on a poisoned hugetlb page once we
> >>> add correct locking.  So, the page would be unmapped.  I do not think anyone
> >>> is holding a ref, so the last unmap should put the hugetlb page on the
> >>> free list.  Correct?  We will not migrate the page, but ...
> > 
> > That seems depends on whether the hugepage is anonymous or file.
> > If we consider anonymous hugepage, the above statement is correct.
> > 
> > As for what we can do, for (1) we can simply skip the hwpoisoned hugepage
> > in do_migrate_range() (i.e. no need to call isolate_huge_page()), because in
> > the outer while-loop in offline_pages() we call dissolve_free_huge_pages()
> > so free hugepage can be handled here.  The above "storing raw error info"
> > patch will also enable dissolve_free_huge_pages() to handle hwpoisoned free
> > hugepage (by moving PG_hwpoison flag to the raw error page), so I assume
> > to have the patch.
> 
> Yes, I think (1) will be fixed this way.
> 
> > 
> > As for (2), I think that making such hugepages unmovable (by clearing
> > HPageMigratable) could work.  With that change, scan_movable_pages() detects
> > unmovable pages so do_migrate_range is not called for the hugepages.
> > Moreover, scan_movable_pages() should return -EBUSY if an unmovable page is found,
> > so offline_pages() fails without trying to touch the hwpoisoned hugepage.
> 
> Yes, it seems work but this will lead to the memory offline fails. Could this be too overkill?

I think so. And I found (in writing/testing patches) that releasing the last
refcount in dissolve_free_huge_pages() in offline_pages() could work, so
offline_pages() does not have to fail.  So that's better than what I wrote above.

> 
> > Anyway the reported problem (try to migrate hwpoisoned hugepage) would to be solved.
> > 
> > # Maybe we might optionally try to free the hwpoisoned hugepage by removing
> > # the affected hugetlbfs file, but it's not unclear to me taht that's acceptable
> > # (a file is gone when doing memory hotremove?).
> > 
> >>
> >> Might hugetlb page still be in the pagecache? But yes, the last unmap could put
> >> the hugetlb page on the free list.
> >>
> >>>
> >>> After the call to do_migrate_range() in offline_pages, we will call
> >>> dissolve_free_huge_pages.  For each hugetlb page, dissolve_free_huge_pages
> >>> will call dissolve_free_huge_page likely passing in the 'head' page.
> >>> When dissolve_free_huge_page is called for poisoned hugetlb pages from
> >>> the memory error handling code, it passes in the sub page which contains
> >>> the memory error.  Before freeing the hugetlb page to buddy, there is this
> >>> code:
> >>>
> >>>                         /*
> >>>                          * Move PageHWPoison flag from head page to the raw
> >>>                          * error page, which makes any subpages rather than
> >>>                          * the error page reusable.
> >>>                          */
> >>>                         if (PageHWPoison(head) && page != head) {
> >>>                                 SetPageHWPoison(page);
> >>>                                 ClearPageHWPoison(head);
> >>>                         }
> >>>                         update_and_free_page(h, head, false)
> >>>
> >>> As previously mentioned, outside the memory error handling code we do
> >>> not know what page within the hugetlb page contains poison.  So, this
> >>> will likely result with a page with errors on the free list and an OK
> >>> page marked with error.
> >>
> >> Yes, this is possible. It's really a pity. Could we just reject to dissolve the
> >> hugetlb page and keep the hugetlb page around? This will waste some healthy subpages
> >> but can do the right things?
> > 
> > Could you let me try writing patches in my approach?
> > I'll try to submit the fix with the "storing raw error info" patch.
> 
> Sure, I will wait for that. Many thanks for your hard work! :)

I have almost prepared this, so will send this afternoon.

> > 
> >>
> >>>
> >>> In other places, we 'bail' if we encounter a hugetlb page with poison.
> >>> It would be unfortunate to prevent memory offline if the range contains
> >>> a hugetlb page with poison.  After all, offlining a section with poison
> >>> make sense.
> >>>
> >>
> >> IIUC, a (hugetlb) page with poison could be offlined anytime even if it's still in the
> >> pagecache, swapcache, i.e. still be referenced by somewhere, because memory offline will
> >> just offline the poisoned page while ignoring the page refcnt totally. I can't figure
> >> out a easy way to fix it yet. :(
> > 
> > Yes, that's finally expected behavior, but unlike normal pagecache we need
> > pay extra attention for hugetlbfs file because it's not storage-backed.
> > 
> > Thank you for valuable comments, everyone.
> 
> BTW: There is another possible problem about hwpoison. Think about the below scene:
> 
> offline_pages
> do {
>   scan_movable_pages
>     __PageMovable /* Page is also hwpoisoned. */
>       goto found; /* So ret is always 0. */
>   do_migrate_range
>      if (PageHWPoison(page)) /* PageMovable hwpoisoned page is not handled. */
>         continue;
>  } while (!ret); /* We will meet that page again and again. */
> 
> So we might busy-loop until a signal is pending. Or am I miss something? And I think this could be fixed by
> deferring set hwpoisoned flag until page refcnt is successfully incremented for normal page in memory_failure()
> which is already in your to-do list. So this possible issue could be left alone now?

Thank you for finding this.  Maybe I saw similar issue (infinite loop) in
testing patches, but not sure about the detail.  I'll learn it more.

Thanks,
Naoya Horiguchi
diff mbox series

Patch

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4c6065e5d274..093f85ec5c5c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1600,11 +1600,9 @@  do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		folio = page_folio(page);
 		head = &folio->page;
 
-		if (PageHuge(page)) {
+		if (PageHuge(page))
 			pfn = page_to_pfn(head) + compound_nr(head) - 1;
-			isolate_huge_page(head, &source);
-			continue;
-		} else if (PageTransHuge(page))
+		else if (PageTransHuge(page))
 			pfn = page_to_pfn(head) + thp_nr_pages(page) - 1;
 
 		/*
@@ -1622,6 +1620,11 @@  do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			continue;
 		}
 
+		if (PageHuge(page)) {
+			isolate_huge_page(head, &source);
+			continue;
+		}
+
 		if (!get_page_unless_zero(page))
 			continue;
 		/*