Message ID | 1531805552-19547-2-git-send-email-n-horiguchi@ah.jp.nec.com (mailing list archive)
---|---
State | New, archived
On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> There's a race condition between soft offline and hugetlb_fault which
> causes unexpected process killing and/or hugetlb allocation failure.
>
> The process killing is caused by the following flow:
>
>   CPU 0               CPU 1                   CPU 2
>
>   soft offline
>     get_any_page
>     // find the hugetlb is free
>                       mmap a hugetlb file
>                       page fault
>                         ...
>                           hugetlb_fault
>                             hugetlb_no_page
>                               alloc_huge_page
>                               // succeed
>     soft_offline_free_page
>     // set hwpoison flag
>                                               mmap the hugetlb file
>                                               page fault
>                                                 ...
>                                                   hugetlb_fault
>                                                     hugetlb_no_page
>                                                       find_lock_page
>                                                       return VM_FAULT_HWPOISON
>                                                 mm_fault_error
>                                                   do_sigbus
>                                                   // kill the process
>
>
> The hugetlb allocation failure comes from the following flow:
>
>   CPU 0                                       CPU 1
>
>   mmap a hugetlb file
>   // reserve all free page but don't fault-in
>                                               soft offline
>                                                 get_any_page
>                                                 // find the hugetlb is free
>                                                 soft_offline_free_page
>                                                 // set hwpoison flag
>                                                 dissolve_free_huge_page
>                                                 // fail because all free
>                                                 // hugepages are reserved
>   page fault
>     ...
>       hugetlb_fault
>         hugetlb_no_page
>           alloc_huge_page
>             ...
>               dequeue_huge_page_node_exact
>               // ignore hwpoisoned hugepage
>               // and finally fail due to no-mem
>
> The root cause of this is that current soft-offline code is written
> based on an assumption that PageHWPoison flag should be set at first to
> avoid accessing the corrupted data. This makes sense for memory_failure()
> or hard offline, but does not for soft offline because soft offline is
> about corrected (not uncorrected) error and is safe from data loss.
> This patch changes soft offline semantics where it sets PageHWPoison flag
> only after containment of the error page completes successfully.

Could you please expand on the workflow here? The code is really hard
to grasp. I must be missing something because the thing shouldn't be
really complicated. Either the page is in the free pool and you just
remove it from the allocator (with hugetlb asking for a new hugetlb page
to guarantee reserves) or it is used and you just migrate the content to
a new page (again with the hugetlb reserves consideration). Why should
PageHWPoison flag ordering make any difference?

Do I get it right that the only difference between the hard and soft
offlining is that hugetlb reserves might break for the former while not
for the latter, and that the failed migration kills all owners for the
former while not for the latter?

> Reported-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
> Suggested-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
> changelog v1->v2:
> - don't use set_hwpoison_free_buddy_page() (not defined yet)
> - updated comment in soft_offline_huge_page()
> ---
>  mm/hugetlb.c        | 11 +++++------
>  mm/memory-failure.c | 24 ++++++++++++++++++------
>  mm/migrate.c        |  2 --
>  3 files changed, 23 insertions(+), 14 deletions(-)
>
> diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/hugetlb.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/hugetlb.c
> index 430be42..937c142 100644
> --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/hugetlb.c
> +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/hugetlb.c
> @@ -1479,22 +1479,20 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  /*
>   * Dissolve a given free hugepage into free buddy pages. This function does
>   * nothing for in-use (including surplus) hugepages. Returns -EBUSY if the
> - * number of free hugepages would be reduced below the number of reserved
> - * hugepages.
> + * dissolution fails because a given page is not a free hugepage, or because
> + * free hugepages are fully reserved.
>   */
>  int dissolve_free_huge_page(struct page *page)
>  {
> -        int rc = 0;
> +        int rc = -EBUSY;
>
>          spin_lock(&hugetlb_lock);
>          if (PageHuge(page) && !page_count(page)) {
>                  struct page *head = compound_head(page);
>                  struct hstate *h = page_hstate(head);
>                  int nid = page_to_nid(head);
> -                if (h->free_huge_pages - h->resv_huge_pages == 0) {
> -                        rc = -EBUSY;
> +                if (h->free_huge_pages - h->resv_huge_pages == 0)
>                          goto out;
> -                }
>                  /*
>                   * Move PageHWPoison flag from head page to the raw error page,
>                   * which makes any subpages rather than the error page reusable.
> @@ -1508,6 +1506,7 @@ int dissolve_free_huge_page(struct page *page)
>                  h->free_huge_pages_node[nid]--;
>                  h->max_huge_pages--;
>                  update_and_free_page(h, head);
> +                rc = 0;
>          }
>  out:
>          spin_unlock(&hugetlb_lock);
> diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c
> index 9d142b9..9b77f85 100644
> --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c
> +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c
> @@ -1598,8 +1598,20 @@ static int soft_offline_huge_page(struct page *page, int flags)
>                  if (ret > 0)
>                          ret = -EIO;
>          } else {
> -                if (PageHuge(page))
> -                        dissolve_free_huge_page(page);
> +                /*
> +                 * We set PG_hwpoison only when the migration source hugepage
> +                 * was successfully dissolved, because otherwise hwpoisoned
> +                 * hugepage remains on free hugepage list. The allocator ignores
> +                 * such a hwpoisoned page so it's never allocated, but it could
> +                 * kill a process because of no-memory rather than hwpoison.
> +                 * Soft-offline never impacts the userspace, so this is
> +                 * undesired.
> +                 */
> +                ret = dissolve_free_huge_page(page);
> +                if (!ret) {
> +                        if (!TestSetPageHWPoison(page))
> +                                num_poisoned_pages_inc();
> +                }
>          }
>          return ret;
>  }
> @@ -1715,13 +1727,13 @@ static int soft_offline_in_use_page(struct page *page, int flags)
>
>  static void soft_offline_free_page(struct page *page)
>  {
> +        int rc = 0;
>          struct page *head = compound_head(page);
>
> -        if (!TestSetPageHWPoison(head)) {
> +        if (PageHuge(head))
> +                rc = dissolve_free_huge_page(page);
> +        if (!rc && !TestSetPageHWPoison(page))
>                  num_poisoned_pages_inc();
> -                if (PageHuge(head))
> -                        dissolve_free_huge_page(page);
> -        }
>  }
>
>  /**
> diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/migrate.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/migrate.c
> index 198af42..3ae213b 100644
> --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/migrate.c
> +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/migrate.c
> @@ -1318,8 +1318,6 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>  out:
>          if (rc != -EAGAIN)
>                  putback_active_hugepage(hpage);
> -        if (reason == MR_MEMORY_FAILURE && !test_set_page_hwpoison(hpage))
> -                num_poisoned_pages_inc();
>
>          /*
>           * If migration was not successful and there's a freeing callback, use
> --
> 2.7.0
On 07/17/2018 07:27 AM, Michal Hocko wrote:
> On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
>> There's a race condition between soft offline and hugetlb_fault which
>> causes unexpected process killing and/or hugetlb allocation failure.
>>
>> [...]
>>
>> The root cause of this is that current soft-offline code is written
>> based on an assumption that PageHWPoison flag should be set at first to
>> avoid accessing the corrupted data. This makes sense for memory_failure()
>> or hard offline, but does not for soft offline because soft offline is
>> about corrected (not uncorrected) error and is safe from data loss.
>> This patch changes soft offline semantics where it sets PageHWPoison flag
>> only after containment of the error page completes successfully.
>
> Could you please expand on the workflow here? The code is really
> hard to grasp. I must be missing something because the thing shouldn't
> be really complicated. Either the page is in the free pool and you just
> remove it from the allocator (with hugetlb asking for a new hugetlb page
> to guarantee reserves) or it is used and you just migrate the content to
> a new page (again with the hugetlb reserves consideration). Why should
> PageHWPoison flag ordering make any difference?

My understanding may not be correct, but just looking at the current code
for soft_offline_free_page helps me understand:

static void soft_offline_free_page(struct page *page)
{
        struct page *head = compound_head(page);

        if (!TestSetPageHWPoison(head)) {
                num_poisoned_pages_inc();
                if (PageHuge(head))
                        dissolve_free_huge_page(page);
        }
}

The HWPoison flag is set before even checking to determine if the huge
page can be dissolved. So, someone could attempt to pull the page off
the free list (if free) or fault/map it (if already associated with a
file), which leads to the failures described above. The patches ensure
that we only set HWPoison after successfully dissolving the page. At
least that is how I understand it.

It seems that soft_offline_free_page can be called for in-use pages.
Certainly, that is the case in the first workflow above. With the
suggested changes, I think this is OK for huge pages. However, it seems
that setting HWPoison on an in-use non-huge page could cause issues?

While looking at the code, I noticed this comment in __get_any_page()

        /*
         * When the target page is a free hugepage, just remove it
         * from free hugepage list.
         */

Did that apply to some code that was removed? It does not seem to make
any sense in that routine.
On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote:
> On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> > There's a race condition between soft offline and hugetlb_fault which
> > causes unexpected process killing and/or hugetlb allocation failure.
> >
> > [...]
>
> Could you please expand on the workflow here? The code is really
> hard to grasp. I must be missing something because the thing shouldn't
> be really complicated. Either the page is in the free pool and you just
> remove it from the allocator (with hugetlb asking for a new hugetlb page
> to guarantee reserves) or it is used and you just migrate the content to
> a new page (again with the hugetlb reserves consideration). Why should
> PageHWPoison flag ordering make any difference?

(Considering soft offlining a free hugepage,) PageHWPoison is set first
before this patch, which is racy with the hugetlb fault code because it's
not protected by hugetlb_lock.

Originally this was written in a similar manner to hard-offline, where
the race is accepted and the PageHWPoison flag is set as soon as possible.
But actually that's found not necessary/correct because soft offline is
supposed to be less aggressive and failure is OK.

So this patch is suggesting to make soft-offline less aggressive by
moving SetPageHWPoison into the lock.

> Do I get it right that the only difference between the hard and soft
> offlining is that hugetlb reserves might break for the former while not
> for the latter

Correct.

> and that the failed migration kills all owners for the
> former while not for the latter?

Hard-offline doesn't cause any page migration because the data is already
lost, but yes it can kill the owners.
Soft-offline never kills processes even if it fails (due to migration
failure or some other reason).

I listed below some common points and differences between hard-offline
and soft-offline.

common points
  - they are both contained by the PageHWPoison flag,
  - errors are injected via similar interfaces.

differences
  - the data on the page is considered lost in hard offline, but is not
    in soft offline,
  - hard offline likely kills the affected processes, but soft offline
    never kills processes,
  - soft offline causes page migration, but hard offline does not,
  - hard offline prioritizes preventing consumption of broken data while
    accepting some race, and soft offline prioritizes not impacting
    userspace while accepting failure.

Looks to me that there are more differences than common points.

Thanks,
Naoya Horiguchi
On Tue, Jul 17, 2018 at 01:10:39PM -0700, Mike Kravetz wrote:
> On 07/17/2018 07:27 AM, Michal Hocko wrote:
> > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> >> There's a race condition between soft offline and hugetlb_fault which
> >> causes unexpected process killing and/or hugetlb allocation failure.
> >>
> >> [...]
>
> My understanding may not be correct, but just looking at the current code
> for soft_offline_free_page helps me understand:
>
> [...]
>
> The HWPoison flag is set before even checking to determine if the huge
> page can be dissolved. So, someone could attempt to pull the page off
> the free list (if free) or fault/map it (if already associated with a
> file), which leads to the failures described above. The patches ensure
> that we only set HWPoison after successfully dissolving the page. At
> least that is how I understand it.

Thanks for elaborating, this is correct.

> It seems that soft_offline_free_page can be called for in-use pages.
> Certainly, that is the case in the first workflow above. With the
> suggested changes, I think this is OK for huge pages. However, it seems
> that setting HWPoison on an in-use non-huge page could cause issues?

Just after dissolve_free_huge_page() returns, the target page is just a
free buddy page without PageHWPoison set. If this page is allocated
immediately, that's the "migration succeeded, but soft offline failed"
case, so no problem.
Certainly, there also is a race between checking TestSetPageHWPoison and
page allocation, so this issue is handled in patch 2/2.

> While looking at the code, I noticed this comment in __get_any_page()
>
>         /*
>          * When the target page is a free hugepage, just remove it
>          * from free hugepage list.
>          */
>
> Did that apply to some code that was removed? It does not seem to make
> any sense in that routine.

This comment is completely obsolete, I'll remove it.

Thanks,
Naoya Horiguchi
On Wed, Jul 18, 2018 at 12:55:29AM +0000, Horiguchi Naoya(堀口 直也) wrote:
> On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote:
> > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote:
> > > There's a race condition between soft offline and hugetlb_fault which
> > > causes unexpected process killing and/or hugetlb allocation failure.
> > >
> > > [...]
> >
> > Could you please expand on the workflow here? [...] Why should
> > PageHWPoison flag ordering make any difference?
>
> (Considering soft offlining a free hugepage,) PageHWPoison is set first
> before this patch, which is racy with the hugetlb fault code because it's
> not protected by hugetlb_lock.
>
> Originally this was written in a similar manner to hard-offline, where
> the race is accepted and the PageHWPoison flag is set as soon as possible.
> But actually that's found not necessary/correct because soft offline is
> supposed to be less aggressive and failure is OK.
>
> So this patch is suggesting to make soft-offline less aggressive
> by moving SetPageHWPoison into the lock.

My apology, this part of the reasoning was incorrect. What patch 1/2
actually does is transform the issue into the normal page's similar race
issue, which is solved by patch 2/2. After patch 1/2, soft offline never
sets PageHWPoison on a hugepage.

Thanks,
Naoya Horiguchi
On 07/17/2018 06:28 PM, Naoya Horiguchi wrote:
> On Tue, Jul 17, 2018 at 01:10:39PM -0700, Mike Kravetz wrote:
>> It seems that soft_offline_free_page can be called for in-use pages.
>> Certainly, that is the case in the first workflow above. With the
>> suggested changes, I think this is OK for huge pages. However, it seems
>> that setting HWPoison on an in-use non-huge page could cause issues?
>
> Just after dissolve_free_huge_page() returns, the target page is just a
> free buddy page without PageHWPoison set. If this page is allocated
> immediately, that's the "migration succeeded, but soft offline failed"
> case, so no problem.
> Certainly, there also is a race between checking TestSetPageHWPoison and
> page allocation, so this issue is handled in patch 2/2.

Yes, the issue of calling soft_offline_free_page() for an 'in use' page
is handled in the new routine set_hwpoison_free_buddy_page() of patch 2/2.

Thanks,
On Wed 18-07-18 00:55:29, Naoya Horiguchi wrote: > On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote: > > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote: > > > There's a race condition between soft offline and hugetlb_fault which > > > causes unexpected process killing and/or hugetlb allocation failure. > > > > > > The process killing is caused by the following flow: > > > > > > CPU 0 CPU 1 CPU 2 > > > > > > soft offline > > > get_any_page > > > // find the hugetlb is free > > > mmap a hugetlb file > > > page fault > > > ... > > > hugetlb_fault > > > hugetlb_no_page > > > alloc_huge_page > > > // succeed > > > soft_offline_free_page > > > // set hwpoison flag > > > mmap the hugetlb file > > > page fault > > > ... > > > hugetlb_fault > > > hugetlb_no_page > > > find_lock_page > > > return VM_FAULT_HWPOISON > > > mm_fault_error > > > do_sigbus > > > // kill the process > > > > > > > > > The hugetlb allocation failure comes from the following flow: > > > > > > CPU 0 CPU 1 > > > > > > mmap a hugetlb file > > > // reserve all free page but don't fault-in > > > soft offline > > > get_any_page > > > // find the hugetlb is free > > > soft_offline_free_page > > > // set hwpoison flag > > > dissolve_free_huge_page > > > // fail because all free hugepages are reserved > > > page fault > > > ... > > > hugetlb_fault > > > hugetlb_no_page > > > alloc_huge_page > > > ... > > > dequeue_huge_page_node_exact > > > // ignore hwpoisoned hugepage > > > // and finally fail due to no-mem > > > > > > The root cause of this is that current soft-offline code is written > > > based on an assumption that PageHWPoison flag should beset at first to > > > avoid accessing the corrupted data. This makes sense for memory_failure() > > > or hard offline, but does not for soft offline because soft offline is > > > about corrected (not uncorrected) error and is safe from data lost. 
> > > This patch changes soft offline semantics where it sets PageHWPoison flag > > > only after containment of the error page completes successfully. > > > > Could you please expand on the worklow here please? The code is really > > hard to grasp. I must be missing something because the thing shouldn't > > be really complicated. Either the page is in the free pool and you just > > remove it from the allocator (with hugetlb asking for a new hugeltb page > > to guaratee reserves) or it is used and you just migrate the content to > > a new page (again with the hugetlb reserves consideration). Why should > > PageHWPoison flag ordering make any relevance? > > (Considering soft offlining free hugepage,) > PageHWPoison is set at first before this patch, which is racy with > hugetlb fault code because it's not protected by hugetlb_lock. > > Originally this was written in the similar manner as hard-offline, where > the race is accepted and a PageHWPoison flag is set as soon as possible. > But actually that's found not necessary/correct because soft offline is > supposed to be less aggressive and failure is OK. OK > So this patch is suggesting to make soft-offline less aggressive by > moving SetPageHWPoison into the lock. I guess I still do not understand why we should even care about the ordering of the HWPoison flag setting. Why cannot we simply have the following code flow? Or maybe we are doing that already I just do not follow the code soft_offline check page_count - free - normal page - remove from the allocator - hugetlb - allocate a new hugetlb page && remove from the pool - used - migrate to a new page && never release the old one Why do we even need HWPoison flag here? Everything can be completely transparent to the application. It shouldn't fail from what I understood. > > Do I get it right that the only difference between the hard and soft > > offlining is that hugetlb reserves might break for the former while not > > for the latter > > Correct. 
> > > and that the failed migration kills all owners for the > > former while not for the latter? > > Hard-offline doesn't cause any page migration because the data is already > lost, but yes it can kill the owners. > Soft-offline never kills processes even if it fails (due to migration failure > or some other reason.) > > I listed below some common points and differences between hard-offline > and soft-offline. > > common points > - they are both contained by the PageHWPoison flag, > - error is injected via similar interfaces. > > differences > - the data on the page is considered lost in hard offline, but is not > in soft offline, > - hard offline likely kills the affected processes, but soft offline > never kills processes, > - soft offline causes page migration, but hard offline does not, > - hard offline prioritizes preventing consumption of broken data, > accepting some races, while soft offline prioritizes not impacting > userspace, accepting failure. > > Looks to me that there are more differences than common points. Thanks for the summary. It certainly helped me
On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote: > On Wed 18-07-18 00:55:29, Naoya Horiguchi wrote: > > On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote: > > > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote: > > > > There's a race condition between soft offline and hugetlb_fault which > > > > causes unexpected process killing and/or hugetlb allocation failure. > > > > > > > > The process killing is caused by the following flow: > > > > > > > > CPU 0 CPU 1 CPU 2 > > > > > > > > soft offline > > > > get_any_page > > > > // find the hugetlb is free > > > > mmap a hugetlb file > > > > page fault > > > > ... > > > > hugetlb_fault > > > > hugetlb_no_page > > > > alloc_huge_page > > > > // succeed > > > > soft_offline_free_page > > > > // set hwpoison flag > > > > mmap the hugetlb file > > > > page fault > > > > ... > > > > hugetlb_fault > > > > hugetlb_no_page > > > > find_lock_page > > > > return VM_FAULT_HWPOISON > > > > mm_fault_error > > > > do_sigbus > > > > // kill the process > > > > > > > > > > > > The hugetlb allocation failure comes from the following flow: > > > > > > > > CPU 0 CPU 1 > > > > > > > > mmap a hugetlb file > > > > // reserve all free page but don't fault-in > > > > soft offline > > > > get_any_page > > > > // find the hugetlb is free > > > > soft_offline_free_page > > > > // set hwpoison flag > > > > dissolve_free_huge_page > > > > // fail because all free hugepages are reserved > > > > page fault > > > > ... > > > > hugetlb_fault > > > > hugetlb_no_page > > > > alloc_huge_page > > > > ... > > > > dequeue_huge_page_node_exact > > > > // ignore hwpoisoned hugepage > > > > // and finally fail due to no-mem > > > > > > > > The root cause of this is that current soft-offline code is written > > > > based on an assumption that PageHWPoison flag should beset at first to > > > > avoid accessing the corrupted data. 
This makes sense for memory_failure() > > > > or hard offline, but does not for soft offline because soft offline is > > > > about corrected (not uncorrected) error and is safe from data lost. > > > > This patch changes soft offline semantics where it sets PageHWPoison flag > > > > only after containment of the error page completes successfully. > > > > > > Could you please expand on the worklow here please? The code is really > > > hard to grasp. I must be missing something because the thing shouldn't > > > be really complicated. Either the page is in the free pool and you just > > > remove it from the allocator (with hugetlb asking for a new hugeltb page > > > to guaratee reserves) or it is used and you just migrate the content to > > > a new page (again with the hugetlb reserves consideration). Why should > > > PageHWPoison flag ordering make any relevance? > > > > (Considering soft offlining free hugepage,) > > PageHWPoison is set at first before this patch, which is racy with > > hugetlb fault code because it's not protected by hugetlb_lock. > > > > Originally this was written in the similar manner as hard-offline, where > > the race is accepted and a PageHWPoison flag is set as soon as possible. > > But actually that's found not necessary/correct because soft offline is > > supposed to be less aggressive and failure is OK. > > OK > > > So this patch is suggesting to make soft-offline less aggressive by > > moving SetPageHWPoison into the lock. > > I guess I still do not understand why we should even care about the > ordering of the HWPoison flag setting. Why cannot we simply have the > following code flow? Or maybe we are doing that already I just do not > follow the code > > soft_offline > check page_count > - free - normal page - remove from the allocator > - hugetlb - allocate a new hugetlb page && remove from the pool > - used - migrate to a new page && never release the old one > > Why do we even need HWPoison flag here? 
Everything can be completely > transparent to the application. It shouldn't fail from what I > understood. PageHWPoison flag is used to the 'remove from the allocator' part which is like below: static inline struct page *rmqueue( ... do { page = NULL; if (alloc_flags & ALLOC_HARDER) { page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC); if (page) trace_mm_page_alloc_zone_locked(page, order, migratetype); } if (!page) page = __rmqueue(zone, order, migratetype); } while (page && check_new_pages(page, order)); check_new_pages() returns true if the page taken from free list has a hwpoison page so that the allocator iterates another round to get another page. There's no function that can be called from outside allocator to remove a page in allocator. So actual page removal is done at allocation time, not at error handling time. That's the reason why we need PageHWPoison. Thanks, Naoya Horiguchi > > > Do I get it right that the only difference between the hard and soft > > > offlining is that hugetlb reserves might break for the former while not > > > for the latter > > > > Correct. > > > > > and that the failed migration kills all owners for the > > > former while not for latter? > > > > Hard-offline doesn't cause any page migration because the data is already > > lost, but yes it can kill the owners. > > Soft-offline never kills processes even if it fails (due to migration failrue > > or some other reasons.) > > > > I listed below some common points and differences between hard-offline > > and soft-offline. > > > > common points > > - they are both contained by PageHWPoison flag, > > - error is injected via simliar interfaces. 
> > > > differences > > - the data on the page is considered lost in hard offline, but is not > > in soft offline, > > - hard offline likely kills the affected processes, but soft offline > > never kills processes, > > - soft offline causes page migration, but hard offline does not, > > - hard offline prioritizes to prevent consumption of broken data with > > accepting some race, and soft offline prioritizes not to impact > > userspace with accepting failure. > > > > Looks to me that there're more differences rather than commont points. > > Thanks for the summary. It certainly helped me > -- > Michal Hocko > SUSE Labs >
On Thu 19-07-18 06:19:45, Naoya Horiguchi wrote: > On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote: > > On Wed 18-07-18 00:55:29, Naoya Horiguchi wrote: > > > On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote: > > > > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote: > > > > > There's a race condition between soft offline and hugetlb_fault which > > > > > causes unexpected process killing and/or hugetlb allocation failure. > > > > > > > > > > The process killing is caused by the following flow: > > > > > > > > > > CPU 0 CPU 1 CPU 2 > > > > > > > > > > soft offline > > > > > get_any_page > > > > > // find the hugetlb is free > > > > > mmap a hugetlb file > > > > > page fault > > > > > ... > > > > > hugetlb_fault > > > > > hugetlb_no_page > > > > > alloc_huge_page > > > > > // succeed > > > > > soft_offline_free_page > > > > > // set hwpoison flag > > > > > mmap the hugetlb file > > > > > page fault > > > > > ... > > > > > hugetlb_fault > > > > > hugetlb_no_page > > > > > find_lock_page > > > > > return VM_FAULT_HWPOISON > > > > > mm_fault_error > > > > > do_sigbus > > > > > // kill the process > > > > > > > > > > > > > > > The hugetlb allocation failure comes from the following flow: > > > > > > > > > > CPU 0 CPU 1 > > > > > > > > > > mmap a hugetlb file > > > > > // reserve all free page but don't fault-in > > > > > soft offline > > > > > get_any_page > > > > > // find the hugetlb is free > > > > > soft_offline_free_page > > > > > // set hwpoison flag > > > > > dissolve_free_huge_page > > > > > // fail because all free hugepages are reserved > > > > > page fault > > > > > ... > > > > > hugetlb_fault > > > > > hugetlb_no_page > > > > > alloc_huge_page > > > > > ... 
> > > > > dequeue_huge_page_node_exact > > > > > // ignore hwpoisoned hugepage > > > > > // and finally fail due to no-mem > > > > > > > > > > The root cause of this is that current soft-offline code is written > > > > > based on an assumption that PageHWPoison flag should beset at first to > > > > > avoid accessing the corrupted data. This makes sense for memory_failure() > > > > > or hard offline, but does not for soft offline because soft offline is > > > > > about corrected (not uncorrected) error and is safe from data lost. > > > > > This patch changes soft offline semantics where it sets PageHWPoison flag > > > > > only after containment of the error page completes successfully. > > > > > > > > Could you please expand on the worklow here please? The code is really > > > > hard to grasp. I must be missing something because the thing shouldn't > > > > be really complicated. Either the page is in the free pool and you just > > > > remove it from the allocator (with hugetlb asking for a new hugeltb page > > > > to guaratee reserves) or it is used and you just migrate the content to > > > > a new page (again with the hugetlb reserves consideration). Why should > > > > PageHWPoison flag ordering make any relevance? > > > > > > (Considering soft offlining free hugepage,) > > > PageHWPoison is set at first before this patch, which is racy with > > > hugetlb fault code because it's not protected by hugetlb_lock. > > > > > > Originally this was written in the similar manner as hard-offline, where > > > the race is accepted and a PageHWPoison flag is set as soon as possible. > > > But actually that's found not necessary/correct because soft offline is > > > supposed to be less aggressive and failure is OK. > > > > OK > > > > > So this patch is suggesting to make soft-offline less aggressive by > > > moving SetPageHWPoison into the lock. > > > > I guess I still do not understand why we should even care about the > > ordering of the HWPoison flag setting. 
Why cannot we simply have the > > following code flow? Or maybe we are doing that already I just do not > > follow the code > > > > soft_offline > > check page_count > > - free - normal page - remove from the allocator > > - hugetlb - allocate a new hugetlb page && remove from the pool > > - used - migrate to a new page && never release the old one > > > > Why do we even need HWPoison flag here? Everything can be completely > > transparent to the application. It shouldn't fail from what I > > understood. > > PageHWPoison flag is used to the 'remove from the allocator' part > which is like below: > > static inline > struct page *rmqueue( > ... > do { > page = NULL; > if (alloc_flags & ALLOC_HARDER) { > page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC); > if (page) > trace_mm_page_alloc_zone_locked(page, order, migratetype); > } > if (!page) > page = __rmqueue(zone, order, migratetype); > } while (page && check_new_pages(page, order)); > > check_new_pages() returns true if the page taken from free list has > a hwpoison page so that the allocator iterates another round to get > another page. > > There's no function that can be called from outside allocator to remove > a page in allocator. So actual page removal is done at allocation time, > not at error handling time. That's the reason why we need PageHWPoison. hwpoison is an internal mm functionality so why cannot we simply add a function that would do that? I find the PageHWPoison usage here doing more complications than real good. Or am I missing something?
On Thu, Jul 19, 2018 at 09:15:16AM +0200, Michal Hocko wrote: > On Thu 19-07-18 06:19:45, Naoya Horiguchi wrote: > > On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote: > > > On Wed 18-07-18 00:55:29, Naoya Horiguchi wrote: > > > > On Tue, Jul 17, 2018 at 04:27:43PM +0200, Michal Hocko wrote: > > > > > On Tue 17-07-18 14:32:31, Naoya Horiguchi wrote: > > > > > > There's a race condition between soft offline and hugetlb_fault which > > > > > > causes unexpected process killing and/or hugetlb allocation failure. > > > > > > > > > > > > The process killing is caused by the following flow: > > > > > > > > > > > > CPU 0 CPU 1 CPU 2 > > > > > > > > > > > > soft offline > > > > > > get_any_page > > > > > > // find the hugetlb is free > > > > > > mmap a hugetlb file > > > > > > page fault > > > > > > ... > > > > > > hugetlb_fault > > > > > > hugetlb_no_page > > > > > > alloc_huge_page > > > > > > // succeed > > > > > > soft_offline_free_page > > > > > > // set hwpoison flag > > > > > > mmap the hugetlb file > > > > > > page fault > > > > > > ... > > > > > > hugetlb_fault > > > > > > hugetlb_no_page > > > > > > find_lock_page > > > > > > return VM_FAULT_HWPOISON > > > > > > mm_fault_error > > > > > > do_sigbus > > > > > > // kill the process > > > > > > > > > > > > > > > > > > The hugetlb allocation failure comes from the following flow: > > > > > > > > > > > > CPU 0 CPU 1 > > > > > > > > > > > > mmap a hugetlb file > > > > > > // reserve all free page but don't fault-in > > > > > > soft offline > > > > > > get_any_page > > > > > > // find the hugetlb is free > > > > > > soft_offline_free_page > > > > > > // set hwpoison flag > > > > > > dissolve_free_huge_page > > > > > > // fail because all free hugepages are reserved > > > > > > page fault > > > > > > ... > > > > > > hugetlb_fault > > > > > > hugetlb_no_page > > > > > > alloc_huge_page > > > > > > ... 
> > > > > > dequeue_huge_page_node_exact > > > > > > // ignore hwpoisoned hugepage > > > > > > // and finally fail due to no-mem > > > > > > > > > > > > The root cause of this is that current soft-offline code is written > > > > > > based on an assumption that PageHWPoison flag should beset at first to > > > > > > avoid accessing the corrupted data. This makes sense for memory_failure() > > > > > > or hard offline, but does not for soft offline because soft offline is > > > > > > about corrected (not uncorrected) error and is safe from data lost. > > > > > > This patch changes soft offline semantics where it sets PageHWPoison flag > > > > > > only after containment of the error page completes successfully. > > > > > > > > > > Could you please expand on the worklow here please? The code is really > > > > > hard to grasp. I must be missing something because the thing shouldn't > > > > > be really complicated. Either the page is in the free pool and you just > > > > > remove it from the allocator (with hugetlb asking for a new hugeltb page > > > > > to guaratee reserves) or it is used and you just migrate the content to > > > > > a new page (again with the hugetlb reserves consideration). Why should > > > > > PageHWPoison flag ordering make any relevance? > > > > > > > > (Considering soft offlining free hugepage,) > > > > PageHWPoison is set at first before this patch, which is racy with > > > > hugetlb fault code because it's not protected by hugetlb_lock. > > > > > > > > Originally this was written in the similar manner as hard-offline, where > > > > the race is accepted and a PageHWPoison flag is set as soon as possible. > > > > But actually that's found not necessary/correct because soft offline is > > > > supposed to be less aggressive and failure is OK. > > > > > > OK > > > > > > > So this patch is suggesting to make soft-offline less aggressive by > > > > moving SetPageHWPoison into the lock. 
> > > > > > I guess I still do not understand why we should even care about the > > > ordering of the HWPoison flag setting. Why cannot we simply have the > > > following code flow? Or maybe we are doing that already I just do not > > > follow the code > > > > > > soft_offline > > > check page_count > > > - free - normal page - remove from the allocator > > > - hugetlb - allocate a new hugetlb page && remove from the pool > > > - used - migrate to a new page && never release the old one > > > > > > Why do we even need HWPoison flag here? Everything can be completely > > > transparent to the application. It shouldn't fail from what I > > > understood. > > > > PageHWPoison flag is used to the 'remove from the allocator' part > > which is like below: > > > > static inline > > struct page *rmqueue( > > ... > > do { > > page = NULL; > > if (alloc_flags & ALLOC_HARDER) { > > page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC); > > if (page) > > trace_mm_page_alloc_zone_locked(page, order, migratetype); > > } > > if (!page) > > page = __rmqueue(zone, order, migratetype); > > } while (page && check_new_pages(page, order)); > > > > check_new_pages() returns true if the page taken from free list has > > a hwpoison page so that the allocator iterates another round to get > > another page. > > > > There's no function that can be called from outside allocator to remove > > a page in allocator. So actual page removal is done at allocation time, > > not at error handling time. That's the reason why we need PageHWPoison. > > hwpoison is an internal mm functionality so why cannot we simply add a > function that would do that? That's one possible solution. I know about another downside in current implementation. If a hwpoison page is found during high order page allocation, all 2^order pages (not only hwpoison page) are removed from buddy because of the above quoted code. And these leaked pages are never returned to freelist even with unpoison_memory(). 
If we have a page removal function which properly splits high-order free pages into lower-order pages, this problem is avoided. OTOH PageHWPoison still has a role in reporting errors to userspace. Without it unpoison_memory() doesn't work. Thanks, Naoya Horiguchi > I find the PageHWPoison usage here doing > more complications than real good. Or am I missing something?
On Thu 19-07-18 08:08:05, Naoya Horiguchi wrote: > On Thu, Jul 19, 2018 at 09:15:16AM +0200, Michal Hocko wrote: > > On Thu 19-07-18 06:19:45, Naoya Horiguchi wrote: > > > On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote: [...] > > > > Why do we even need HWPoison flag here? Everything can be completely > > > > transparent to the application. It shouldn't fail from what I > > > > understood. > > > > > > PageHWPoison flag is used to the 'remove from the allocator' part > > > which is like below: > > > > > > static inline > > > struct page *rmqueue( > > > ... > > > do { > > > page = NULL; > > > if (alloc_flags & ALLOC_HARDER) { > > > page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC); > > > if (page) > > > trace_mm_page_alloc_zone_locked(page, order, migratetype); > > > } > > > if (!page) > > > page = __rmqueue(zone, order, migratetype); > > > } while (page && check_new_pages(page, order)); > > > > > > check_new_pages() returns true if the page taken from free list has > > > a hwpoison page so that the allocator iterates another round to get > > > another page. > > > > > > There's no function that can be called from outside allocator to remove > > > a page in allocator. So actual page removal is done at allocation time, > > > not at error handling time. That's the reason why we need PageHWPoison. > > > > hwpoison is an internal mm functionality so why cannot we simply add a > > function that would do that? > > That's one possible solution. I would prefer that much more than add an overhead (albeit small) into the page allocator directly. HWPoison should be a really rare event so why should everybody pay the price? I would much rather see that the poison path pays the additional price. > I know about another downside in current implementation. > If a hwpoison page is found during high order page allocation, > all 2^order pages (not only hwpoison page) are removed from > buddy because of the above quoted code. 
And these leaked pages > are never returned to freelist even with unpoison_memory(). > If we have a page removal function which properly splits high order > free pages into lower order pages, this problem is avoided. Even more reason to move to a new scheme. > OTOH PageHWPoison still has a role to report error to userspace. > Without it unpoison_memory() doesn't work. Sure but we do not really need a special page flag for that. We know the page is not reachable other than via pfn walkers. If you make the page reserved and note the fact it has been poisoned in the past then you can emulate the missing functionality. Btw. do we really need unpoisoning functionality? Who is really using it, other than some tests? How does the memory become OK again? Don't we really need to go through physical hotremove & hotadd to clean the poison status? Thanks!
On Thu, Jul 19, 2018 at 10:27:43AM +0200, Michal Hocko wrote: > On Thu 19-07-18 08:08:05, Naoya Horiguchi wrote: > > On Thu, Jul 19, 2018 at 09:15:16AM +0200, Michal Hocko wrote: > > > On Thu 19-07-18 06:19:45, Naoya Horiguchi wrote: > > > > On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote: > [...] > > > > > Why do we even need HWPoison flag here? Everything can be completely > > > > > transparent to the application. It shouldn't fail from what I > > > > > understood. > > > > > > > > PageHWPoison flag is used to the 'remove from the allocator' part > > > > which is like below: > > > > > > > > static inline > > > > struct page *rmqueue( > > > > ... > > > > do { > > > > page = NULL; > > > > if (alloc_flags & ALLOC_HARDER) { > > > > page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC); > > > > if (page) > > > > trace_mm_page_alloc_zone_locked(page, order, migratetype); > > > > } > > > > if (!page) > > > > page = __rmqueue(zone, order, migratetype); > > > > } while (page && check_new_pages(page, order)); > > > > > > > > check_new_pages() returns true if the page taken from free list has > > > > a hwpoison page so that the allocator iterates another round to get > > > > another page. > > > > > > > > There's no function that can be called from outside allocator to remove > > > > a page in allocator. So actual page removal is done at allocation time, > > > > not at error handling time. That's the reason why we need PageHWPoison. > > > > > > hwpoison is an internal mm functionality so why cannot we simply add a > > > function that would do that? > > > > That's one possible solution. > > I would prefer that much more than add an overhead (albeit small) into > the page allocator directly. HWPoison should be a really rare event so > why should everybody pay the price? I would much rather see that the > poison path pays the additional price. Yes, that's more maintainable. > > > I know about another downside in current implementation. 
> > If a hwpoison page is found during high order page allocation, > > all 2^order pages (not only hwpoison page) are removed from > > buddy because of the above quoted code. And these leaked pages > > are never returned to freelist even with unpoison_memory(). > > If we have a page removal function which properly splits high order > > free pages into lower order pages, this problem is avoided. > > Even more reason to move to a new scheme. > > > OTOH PageHWPoison still has a role to report error to userspace. > > Without it unpoison_memory() doesn't work. > > Sure but we do not really need a special page flag for that. We know the > page is not reachable other than via pfn walkers. If you make the page > reserved and note the fact it has been poisoned in the past then you can > emulate the missing functionality. > > Btw. do we really need unpoisoning functionality? Who is really using > it, other than some tests? None, as long as I know. > How does the memory become OK again? For hard-offlined in-use pages which are assumed to be pinned, we clear the PageHWPoison flag and unpin the page to return it to buddy. For other cases, we simply clear the PageHWPoison flag. Unless the page is checked by check_new_pages() before unpoison, the page is reusable. Sometimes error handling fails and the error page might turn into unexpected state (like additional refcount/mapcount). Unpoison just fails on such pages. > Don't we > really need to go through physical hotremove & hotadd to clean the > poison status? hotremove/hotadd can be a user of unpoison, but I think simply reinitializing struct pages is easiler. Thanks, Naoya Horiguchi
On Thu 19-07-18 09:22:47, Naoya Horiguchi wrote: > On Thu, Jul 19, 2018 at 10:27:43AM +0200, Michal Hocko wrote: > > On Thu 19-07-18 08:08:05, Naoya Horiguchi wrote: > > > On Thu, Jul 19, 2018 at 09:15:16AM +0200, Michal Hocko wrote: > > > > On Thu 19-07-18 06:19:45, Naoya Horiguchi wrote: > > > > > On Wed, Jul 18, 2018 at 10:50:32AM +0200, Michal Hocko wrote: > > [...] > > > > > > Why do we even need HWPoison flag here? Everything can be completely > > > > > > transparent to the application. It shouldn't fail from what I > > > > > > understood. > > > > > > > > > > PageHWPoison flag is used to the 'remove from the allocator' part > > > > > which is like below: > > > > > > > > > > static inline > > > > > struct page *rmqueue( > > > > > ... > > > > > do { > > > > > page = NULL; > > > > > if (alloc_flags & ALLOC_HARDER) { > > > > > page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC); > > > > > if (page) > > > > > trace_mm_page_alloc_zone_locked(page, order, migratetype); > > > > > } > > > > > if (!page) > > > > > page = __rmqueue(zone, order, migratetype); > > > > > } while (page && check_new_pages(page, order)); > > > > > > > > > > check_new_pages() returns true if the page taken from free list has > > > > > a hwpoison page so that the allocator iterates another round to get > > > > > another page. > > > > > > > > > > There's no function that can be called from outside allocator to remove > > > > > a page in allocator. So actual page removal is done at allocation time, > > > > > not at error handling time. That's the reason why we need PageHWPoison. > > > > > > > > hwpoison is an internal mm functionality so why cannot we simply add a > > > > function that would do that? > > > > > > That's one possible solution. > > > > I would prefer that much more than add an overhead (albeit small) into > > the page allocator directly. HWPoison should be a really rare event so > > why should everybody pay the price? 
I would much rather see that the > > poison path pays the additional price. > > Yes, that's more maintainable. > > > > > > I know about another downside in current implementation. > > > If a hwpoison page is found during high order page allocation, > > > all 2^order pages (not only hwpoison page) are removed from > > > buddy because of the above quoted code. And these leaked pages > > > are never returned to freelist even with unpoison_memory(). > > > If we have a page removal function which properly splits high order > > > free pages into lower order pages, this problem is avoided. > > > > Even more reason to move to a new scheme. > > > > > OTOH PageHWPoison still has a role to report error to userspace. > > > Without it unpoison_memory() doesn't work. > > > > Sure but we do not really need a special page flag for that. We know the > > page is not reachable other than via pfn walkers. If you make the page > > reserved and note the fact it has been poisoned in the past then you can > > emulate the missing functionality. > > > > Btw. do we really need unpoisoning functionality? Who is really using > > it, other than some tests? > > None, as long as I know. Then why do we keep it? > > How does the memory become OK again? > > For hard-offlined in-use pages which are assumed to be pinned, > we clear the PageHWPoison flag and unpin the page to return it to buddy. > For other cases, we simply clear the PageHWPoison flag. > Unless the page is checked by check_new_pages() before unpoison, > the page is reusable. No I mean how does the memory becomes OK again. Is there any in place repair technique? > Sometimes error handling fails and the error page might turn into > unexpected state (like additional refcount/mapcount). > Unpoison just fails on such pages. > > > Don't we > > really need to go through physical hotremove & hotadd to clean the > > poison status? > > hotremove/hotadd can be a user of unpoison, but I think simply > reinitializing struct pages is easiler. 
Sure, that is what I meant actually. I do not really see any way to make the memory OK again other than to replace the faulty dimm and put it back online. The state is not really preserved for that so having a sticky struct page state doesn't make much sense. So from what I understood, we only need to track the poison state just allow to hotremove that memory and do not stumble over an unexpected page or try to do something with it. All other actions can be done synchronously (migrate vs. kill users).
diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/hugetlb.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/hugetlb.c index 430be42..937c142 100644 --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/hugetlb.c +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/hugetlb.c @@ -1479,22 +1479,20 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed, /* * Dissolve a given free hugepage into free buddy pages. This function does * nothing for in-use (including surplus) hugepages. Returns -EBUSY if the - * number of free hugepages would be reduced below the number of reserved - * hugepages. + * dissolution fails because a given page is not a free hugepage, or because + * free hugepages are fully reserved. */ int dissolve_free_huge_page(struct page *page) { - int rc = 0; + int rc = -EBUSY; spin_lock(&hugetlb_lock); if (PageHuge(page) && !page_count(page)) { struct page *head = compound_head(page); struct hstate *h = page_hstate(head); int nid = page_to_nid(head); - if (h->free_huge_pages - h->resv_huge_pages == 0) { - rc = -EBUSY; + if (h->free_huge_pages - h->resv_huge_pages == 0) goto out; - } /* * Move PageHWPoison flag from head page to the raw error page, * which makes any subpages rather than the error page reusable. 
@@ -1508,6 +1506,7 @@ int dissolve_free_huge_page(struct page *page) h->free_huge_pages_node[nid]--; h->max_huge_pages--; update_and_free_page(h, head); + rc = 0; } out: spin_unlock(&hugetlb_lock); diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c index 9d142b9..9b77f85 100644 --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/memory-failure.c +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/memory-failure.c @@ -1598,8 +1598,20 @@ static int soft_offline_huge_page(struct page *page, int flags) if (ret > 0) ret = -EIO; } else { - if (PageHuge(page)) - dissolve_free_huge_page(page); + /* + * We set PG_hwpoison only when the migration source hugepage + * was successfully dissolved, because otherwise hwpoisoned + * hugepage remains on free hugepage list. The allocator ignores + * such a hwpoisoned page so it's never allocated, but it could + * kill a process because of no-memory rather than hwpoison. + * Soft-offline never impacts the userspace, so this is + * undesired. 
+ */ + ret = dissolve_free_huge_page(page); + if (!ret) { + if (!TestSetPageHWPoison(page)) + num_poisoned_pages_inc(); + } } return ret; } @@ -1715,13 +1727,13 @@ static int soft_offline_in_use_page(struct page *page, int flags) static void soft_offline_free_page(struct page *page) { + int rc = 0; struct page *head = compound_head(page); - if (!TestSetPageHWPoison(head)) { + if (PageHuge(head)) + rc = dissolve_free_huge_page(page); + if (!rc && !TestSetPageHWPoison(page)) num_poisoned_pages_inc(); - if (PageHuge(head)) - dissolve_free_huge_page(page); - } } /** diff --git v4.18-rc4-mmotm-2018-07-10-16-50/mm/migrate.c v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/migrate.c index 198af42..3ae213b 100644 --- v4.18-rc4-mmotm-2018-07-10-16-50/mm/migrate.c +++ v4.18-rc4-mmotm-2018-07-10-16-50_patched/mm/migrate.c @@ -1318,8 +1318,6 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, out: if (rc != -EAGAIN) putback_active_hugepage(hpage); - if (reason == MR_MEMORY_FAILURE && !test_set_page_hwpoison(hpage)) - num_poisoned_pages_inc(); /* * If migration was not successful and there's a freeing callback, use
There's a race condition between soft offline and hugetlb_fault which causes unexpected process killing and/or hugetlb allocation failure. The process killing is caused by the following flow: CPU 0 CPU 1 CPU 2 soft offline get_any_page // find the hugetlb is free mmap a hugetlb file page fault ... hugetlb_fault hugetlb_no_page alloc_huge_page // succeed soft_offline_free_page // set hwpoison flag mmap the hugetlb file page fault ... hugetlb_fault hugetlb_no_page find_lock_page return VM_FAULT_HWPOISON mm_fault_error do_sigbus // kill the process The hugetlb allocation failure comes from the following flow: CPU 0 CPU 1 mmap a hugetlb file // reserve all free pages but don't fault-in soft offline get_any_page // find the hugetlb is free soft_offline_free_page // set hwpoison flag dissolve_free_huge_page // fail because all free hugepages are reserved page fault ... hugetlb_fault hugetlb_no_page alloc_huge_page ... dequeue_huge_page_node_exact // ignore hwpoisoned hugepage // and finally fail due to no-mem The root cause of this is that the current soft-offline code is written based on an assumption that the PageHWPoison flag should be set first to avoid accessing the corrupted data. This makes sense for memory_failure() or hard offline, but does not for soft offline because soft offline is about corrected (not uncorrected) errors and is safe from data loss. This patch changes soft offline semantics so that it sets the PageHWPoison flag only after containment of the error page completes successfully. Reported-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com> Suggested-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> --- changelog v1->v2: - don't use set_hwpoison_free_buddy_page() (not defined yet) - updated comment in soft_offline_huge_page() --- mm/hugetlb.c | 11 +++++------ mm/memory-failure.c | 24 ++++++++++++++++++------ mm/migrate.c | 2 -- 3 files changed, 23 insertions(+), 14 deletions(-)