Message ID | 20220623235153.2623702-5-naoya.horiguchi@linux.dev (mailing list archive) |
---|---|
State | New |
Series | mm, hwpoison: enable 1GB hugepage support (v2) |
On 2022/6/24 7:51, Naoya Horiguchi wrote: > From: Naoya Horiguchi <naoya.horiguchi@nec.com> > > When handling memory error on a hugetlb page, the error handler tries to > dissolve and turn it into 4kB pages. If it's successfully dissolved, > PageHWPoison flag is moved to the raw error page, so that's all right. > However, dissolve sometimes fails, then the error page is left as > hwpoisoned hugepage. It's useful if we can retry to dissolve it to save > healthy pages, but that's not possible now because the information about > where the raw error pages is lost. > > Use the private field of a few tail pages to keep that information. The > code path of shrinking hugepage pool uses this info to try delayed dissolve. > In order to remember multiple errors in a hugepage, a singly-linked list > originated from SUBPAGE_INDEX_HWPOISON-th tail page is constructed. Only > simple operations (adding an entry or clearing all) are required and the > list is assumed not to be very long, so this simple data structure should > be enough. > > If we failed to save raw error info, the hwpoison hugepage has errors on > unknown subpage, then this new saving mechanism does not work any more, > so disable saving new raw error info and freeing hwpoison hugepages. > > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Many thanks for your patch. This patch looks good to me. Some nits below. > <snip> > > +static inline int hugetlb_set_page_hwpoison(struct page *hpage, > + struct page *page) > +{ > + struct llist_head *head; > + struct raw_hwp_page *raw_hwp; > + struct llist_node *t, *tnode; > + int ret; > + > + /* > + * Once the hwpoison hugepage has lost reliable raw error info, > + * there is little mean to keep additional error info precisely, It should be s/mean/meaning/ ? > + * so skip to add additional raw error info. > + */ > + if (raw_hwp_unreliable(hpage)) > + return -EHWPOISON; > + head = raw_hwp_list_head(hpage); > + llist_for_each_safe(tnode, t, head->first) { > + struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node); > + > + if (p->page == page) > + return -EHWPOISON; > + } > + > + ret = TestSetPageHWPoison(hpage) ? -EHWPOISON : 0; > + /* the first error event will be counted in action_result(). */ > + if (ret) > + num_poisoned_pages_inc(); > + > + raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_KERNEL); > + if (raw_hwp) { > + raw_hwp->page = page; > + llist_add(&raw_hwp->node, head); > + } else { > + /* > + * Failed to save raw error info. We no longer trace all > + * hwpoisoned subpages, and we need refuse to free/dissolve > + * this hwpoisoned hugepage. > + */ > + set_raw_hwp_unreliable(hpage); > + return ret; This "return ret" can be combined into the below one? > + } > + return ret; > +} > + > +inline int hugetlb_clear_page_hwpoison(struct page *hpage) > +{ > + struct llist_head *head; > + struct llist_node *t, *tnode; > + > + if (raw_hwp_unreliable(hpage)) > + return -EBUSY; Can we try freeing the memory of raw_hwp_list to save possible memory? It seems raw_hwp_list becomes unneeded when raw_hwp_unreliable. Thanks. > + ClearPageHWPoison(hpage); > + head = raw_hwp_list_head(hpage); > + llist_for_each_safe(tnode, t, head->first) { > + struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node); > + > + SetPageHWPoison(p->page); > + kfree(p); > + } > + llist_del_all(head); > + return 0; > +} > + > /* > * Called from hugetlb code with hugetlb_lock held. 
> * > @@ -1533,7 +1624,7 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags) > goto out; > } > > - if (TestSetPageHWPoison(head)) { > + if (hugetlb_set_page_hwpoison(head, page)) { > ret = -EHWPOISON; > goto out; > } > @@ -1585,7 +1676,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb > lock_page(head); > > if (hwpoison_filter(p)) { > - ClearPageHWPoison(head); > + hugetlb_clear_page_hwpoison(head); > res = -EOPNOTSUPP; > goto out; > } >
On Mon, Jun 27, 2022 at 11:16:06AM +0800, Miaohe Lin wrote: > On 2022/6/24 7:51, Naoya Horiguchi wrote: > > From: Naoya Horiguchi <naoya.horiguchi@nec.com> > > > > When handling memory error on a hugetlb page, the error handler tries to > > dissolve and turn it into 4kB pages. If it's successfully dissolved, > > PageHWPoison flag is moved to the raw error page, so that's all right. > > However, dissolve sometimes fails, then the error page is left as > > hwpoisoned hugepage. It's useful if we can retry to dissolve it to save > > healthy pages, but that's not possible now because the information about > > where the raw error pages is lost. > > > > Use the private field of a few tail pages to keep that information. The > > code path of shrinking hugepage pool uses this info to try delayed dissolve. > > In order to remember multiple errors in a hugepage, a singly-linked list > > originated from SUBPAGE_INDEX_HWPOISON-th tail page is constructed. Only > > simple operations (adding an entry or clearing all) are required and the > > list is assumed not to be very long, so this simple data structure should > > be enough. > > > > If we failed to save raw error info, the hwpoison hugepage has errors on > > unknown subpage, then this new saving mechanism does not work any more, > > so disable saving new raw error info and freeing hwpoison hugepages. > > > > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> > > Many thanks for your patch. This patch looks good to me. Some nits below. > > > > <snip> > > > > +static inline int hugetlb_set_page_hwpoison(struct page *hpage, > > + struct page *page) > > +{ > > + struct llist_head *head; > > + struct raw_hwp_page *raw_hwp; > > + struct llist_node *t, *tnode; > > + int ret; > > + > > + /* > > + * Once the hwpoison hugepage has lost reliable raw error info, > > + * there is little mean to keep additional error info precisely, > > It should be s/mean/meaning/ ? Right, fixed. > > > + * so skip to add additional raw error info. > > + */ > > + if (raw_hwp_unreliable(hpage)) > > + return -EHWPOISON; > > + head = raw_hwp_list_head(hpage); > > + llist_for_each_safe(tnode, t, head->first) { > > + struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node); > > + > > + if (p->page == page) > > + return -EHWPOISON; > > + } > > + > > + ret = TestSetPageHWPoison(hpage) ? -EHWPOISON : 0; > > + /* the first error event will be counted in action_result(). */ > > + if (ret) > > + num_poisoned_pages_inc(); > > + > > + raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_KERNEL); > > + if (raw_hwp) { > > + raw_hwp->page = page; > > + llist_add(&raw_hwp->node, head); > > + } else { > > + /* > > + * Failed to save raw error info. We no longer trace all > > + * hwpoisoned subpages, and we need refuse to free/dissolve > > + * this hwpoisoned hugepage. > > + */ > > + set_raw_hwp_unreliable(hpage); > > + return ret; > > This "return ret" can be combined into the below one? > Right, fixed. > > + } > > + return ret; > > +} > > + > > +inline int hugetlb_clear_page_hwpoison(struct page *hpage) > > +{ > > + struct llist_head *head; > > + struct llist_node *t, *tnode; > > + > > + if (raw_hwp_unreliable(hpage)) > > + return -EBUSY; > > Can we try freeing the memory of raw_hwp_list to save possible memory? It seems > raw_hwp_list becomes unneeded when raw_hwp_unreliable. Yes, we can. I try to change like that. Thanks for the suggestion. - Naoya Horiguchi
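For illustration, a rough sketch of the change agreed above: freeing the saved raw_hwp_page entries even when the list has become unreliable, so the kmalloc()ed nodes are not leaked. This is not the code that was actually reposted; free_raw_hwp_pages() is a made-up helper name, while struct raw_hwp_page, raw_hwp_list_head() and raw_hwp_unreliable() come from the v2 patch under review.

/* Sketch only: drop all saved raw error entries of a hwpoisoned hugepage. */
static void free_raw_hwp_pages(struct page *hpage, bool move_flag)
{
        struct llist_head *head = raw_hwp_list_head(hpage);
        struct llist_node *t, *tnode;

        llist_for_each_safe(tnode, t, head->first) {
                struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);

                if (move_flag)
                        SetPageHWPoison(p->page);
                kfree(p);
        }
        llist_del_all(head);
}

int hugetlb_clear_page_hwpoison(struct page *hpage)
{
        if (raw_hwp_unreliable(hpage)) {
                /* Unreliable: keep the hugepage hwpoisoned, but still free the list. */
                free_raw_hwp_pages(hpage, false);
                return -EBUSY;
        }
        ClearPageHWPoison(hpage);
        free_raw_hwp_pages(hpage, true);
        return 0;
}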
On Fri, Jun 24, 2022 at 08:51:48AM +0900, Naoya Horiguchi wrote: > From: Naoya Horiguchi <naoya.horiguchi@nec.com> > > When handling memory error on a hugetlb page, the error handler tries to > dissolve and turn it into 4kB pages. If it's successfully dissolved, > PageHWPoison flag is moved to the raw error page, so that's all right. > However, dissolve sometimes fails, then the error page is left as > hwpoisoned hugepage. It's useful if we can retry to dissolve it to save > healthy pages, but that's not possible now because the information about > where the raw error pages is lost. > > Use the private field of a few tail pages to keep that information. The > code path of shrinking hugepage pool uses this info to try delayed dissolve. > In order to remember multiple errors in a hugepage, a singly-linked list > originated from SUBPAGE_INDEX_HWPOISON-th tail page is constructed. Only > simple operations (adding an entry or clearing all) are required and the > list is assumed not to be very long, so this simple data structure should > be enough. > > If we failed to save raw error info, the hwpoison hugepage has errors on > unknown subpage, then this new saving mechanism does not work any more, > so disable saving new raw error info and freeing hwpoison hugepages. > > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Thanks for your work on this. I have several quesions below. > --- > v1 -> v2: > - support hwpoison hugepage with multiple errors, > - moved the new interface functions to mm/memory-failure.c, > - define additional subpage index SUBPAGE_INDEX_HWPOISON_UNRELIABLE, > - stop freeing/dissolving hwpoison hugepages with unreliable raw error info, > - drop hugetlb_clear_page_hwpoison() in dissolve_free_huge_page() because > that's done in update_and_free_page(), > - move setting/clearing PG_hwpoison flag to the new interfaces, > - checking already hwpoisoned or not on a subpage basis. 
> > ChangeLog since previous post on 4/27: > - fixed typo in patch description (by Miaohe) > - fixed config value in #ifdef statement (by Miaohe) > - added sentences about "multiple hwpoison pages" scenario in patch > description > > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> > --- > include/linux/hugetlb.h | 13 ++++++ > mm/hugetlb.c | 39 +++++++++-------- > mm/memory-failure.c | 95 ++++++++++++++++++++++++++++++++++++++++- > 3 files changed, 125 insertions(+), 22 deletions(-) > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h > index e4cff27d1198..ac13c2022517 100644 > --- a/include/linux/hugetlb.h > +++ b/include/linux/hugetlb.h > @@ -42,6 +42,10 @@ enum { > SUBPAGE_INDEX_CGROUP, /* reuse page->private */ > SUBPAGE_INDEX_CGROUP_RSVD, /* reuse page->private */ > __MAX_CGROUP_SUBPAGE_INDEX = SUBPAGE_INDEX_CGROUP_RSVD, > +#endif > +#ifdef CONFIG_MEMORY_FAILURE > + SUBPAGE_INDEX_HWPOISON, > + SUBPAGE_INDEX_HWPOISON_UNRELIABLE, > #endif > __NR_USED_SUBPAGE, > }; > @@ -798,6 +802,15 @@ extern int dissolve_free_huge_page(struct page *page); > extern int dissolve_free_huge_pages(unsigned long start_pfn, > unsigned long end_pfn); > > +#ifdef CONFIG_MEMORY_FAILURE > +extern int hugetlb_clear_page_hwpoison(struct page *hpage); > +#else > +static inline int hugetlb_clear_page_hwpoison(struct page *hpage) > +{ > + return 0; > +} > +#endif > + > #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION > #ifndef arch_hugetlb_migration_supported > static inline bool arch_hugetlb_migration_supported(struct hstate *h) > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index b7ae5f73f3b2..19ef427aa1e8 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -1541,17 +1541,15 @@ static void __update_and_free_page(struct hstate *h, struct page *page) > if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) > return; > > - if (hugetlb_vmemmap_alloc(h, page)) { > - spin_lock_irq(&hugetlb_lock); > - /* > - * If we cannot allocate vmemmap pages, just refuse to free the > - * page and put the page back on the hugetlb free list and treat > - * as a surplus page. > - */ > - add_hugetlb_page(h, page, true); > - spin_unlock_irq(&hugetlb_lock); > - return; > - } > + if (hugetlb_vmemmap_alloc(h, page)) > + goto fail; > + > + /* > + * Move PageHWPoison flag from head page to the raw error pages, > + * which makes any healthy subpages reusable. > + */ > + if (unlikely(PageHWPoison(page) && hugetlb_clear_page_hwpoison(page))) > + goto fail; > > for (i = 0; i < pages_per_huge_page(h); > i++, subpage = mem_map_next(subpage, page, i)) { > @@ -1572,6 +1570,16 @@ static void __update_and_free_page(struct hstate *h, struct page *page) > } else { > __free_pages(page, huge_page_order(h)); > } > + return; > +fail: > + spin_lock_irq(&hugetlb_lock); > + /* > + * If we cannot allocate vmemmap pages or cannot identify raw hwpoison > + * subpages reliably, just refuse to free the page and put the page > + * back on the hugetlb free list and treat as a surplus page. > + */ > + add_hugetlb_page(h, page, true); > + spin_unlock_irq(&hugetlb_lock); > } > > /* > @@ -2115,15 +2123,6 @@ int dissolve_free_huge_page(struct page *page) > */ > rc = hugetlb_vmemmap_alloc(h, head); > if (!rc) { > - /* > - * Move PageHWPoison flag from head page to the raw > - * error page, which makes any subpages rather than > - * the error page reusable. 
> - */ > - if (PageHWPoison(head) && page != head) { > - SetPageHWPoison(page); > - ClearPageHWPoison(head); > - } > update_and_free_page(h, head, false); > } else { > spin_lock_irq(&hugetlb_lock); > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index fb3feb1f363e..af571fa6f2af 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -1499,6 +1499,97 @@ static int try_to_split_thp_page(struct page *page, const char *msg) > } > > #ifdef CONFIG_HUGETLB_PAGE > +/* > + * Struct raw_hwp_page represents information about "raw error page", > + * constructing singly linked list originated from ->private field of > + * SUBPAGE_INDEX_HWPOISON-th tail page. > + */ > +struct raw_hwp_page { > + struct llist_node node; > + struct page *page; > +}; > + > +static inline struct llist_head *raw_hwp_list_head(struct page *hpage) > +{ > + return (struct llist_head *)&page_private(hpage + SUBPAGE_INDEX_HWPOISON); > +} > + > +static inline int raw_hwp_unreliable(struct page *hpage) > +{ > + return page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE); > +} > + > +static inline void set_raw_hwp_unreliable(struct page *hpage) > +{ > + set_page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE, 1); > +} Why not use HPAGEFLAG(HwpoisonUnreliable, hwpoison_unreliable) directly? > + > +/* > + * about race consideration > + */ > +static inline int hugetlb_set_page_hwpoison(struct page *hpage, > + struct page *page) > +{ > + struct llist_head *head; > + struct raw_hwp_page *raw_hwp; > + struct llist_node *t, *tnode; > + int ret; > + > + /* > + * Once the hwpoison hugepage has lost reliable raw error info, > + * there is little mean to keep additional error info precisely, > + * so skip to add additional raw error info. > + */ > + if (raw_hwp_unreliable(hpage)) > + return -EHWPOISON; > + head = raw_hwp_list_head(hpage); > + llist_for_each_safe(tnode, t, head->first) { > + struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node); > + > + if (p->page == page) > + return -EHWPOISON; > + } > + > + ret = TestSetPageHWPoison(hpage) ? -EHWPOISON : 0; > + /* the first error event will be counted in action_result(). */ > + if (ret) > + num_poisoned_pages_inc(); > + > + raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_KERNEL); This function can be called in atomic context, GFP_ATOMIC should be used here. > + if (raw_hwp) { > + raw_hwp->page = page; > + llist_add(&raw_hwp->node, head); The maximum amount of items in the list is one, right? > + } else { > + /* > + * Failed to save raw error info. We no longer trace all > + * hwpoisoned subpages, and we need refuse to free/dissolve > + * this hwpoisoned hugepage. > + */ > + set_raw_hwp_unreliable(hpage); > + return ret; > + } > + return ret; > +} > + > +inline int hugetlb_clear_page_hwpoison(struct page *hpage) > +{ > + struct llist_head *head; > + struct llist_node *t, *tnode; > + > + if (raw_hwp_unreliable(hpage)) > + return -EBUSY; IIUC, we use head page's PageHWPoison to synchronize hugetlb_clear_page_hwpoison() and hugetlb_set_page_hwpoison(), right? If so, who can set hwp_unreliable here? > + ClearPageHWPoison(hpage); > + head = raw_hwp_list_head(hpage); > + llist_for_each_safe(tnode, t, head->first) { Is it possible that a new item is added hugetlb_set_page_hwpoison() and we do not traverse it (we have cleared page's PageHWPoison)? Then we ignored a real hwpoison page, right? 
> + struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node); > + > + SetPageHWPoison(p->page); > + kfree(p); > + } > + llist_del_all(head); If the above issue exists, moving ClearPageHWPoison(hpage) to here could fix it. We should clear PageHWPoison carefully since the head page is also can be poisoned. Thanks. > + return 0; > +} > + > /* > * Called from hugetlb code with hugetlb_lock held. > * > @@ -1533,7 +1624,7 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags) > goto out; > } > > - if (TestSetPageHWPoison(head)) { > + if (hugetlb_set_page_hwpoison(head, page)) { > ret = -EHWPOISON; > goto out; > } > @@ -1585,7 +1676,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb > lock_page(head); > > if (hwpoison_filter(p)) { > - ClearPageHWPoison(head); > + hugetlb_clear_page_hwpoison(head); > res = -EOPNOTSUPP; > goto out; > } > -- > 2.25.1 > >
On Mon, Jun 27, 2022 at 05:26:01PM +0800, Muchun Song wrote: > On Fri, Jun 24, 2022 at 08:51:48AM +0900, Naoya Horiguchi wrote: > > From: Naoya Horiguchi <naoya.horiguchi@nec.com> > > > > When handling memory error on a hugetlb page, the error handler tries to > > dissolve and turn it into 4kB pages. If it's successfully dissolved, > > PageHWPoison flag is moved to the raw error page, so that's all right. > > However, dissolve sometimes fails, then the error page is left as > > hwpoisoned hugepage. It's useful if we can retry to dissolve it to save > > healthy pages, but that's not possible now because the information about > > where the raw error pages is lost. > > > > Use the private field of a few tail pages to keep that information. The > > code path of shrinking hugepage pool uses this info to try delayed dissolve. > > In order to remember multiple errors in a hugepage, a singly-linked list > > originated from SUBPAGE_INDEX_HWPOISON-th tail page is constructed. Only > > simple operations (adding an entry or clearing all) are required and the > > list is assumed not to be very long, so this simple data structure should > > be enough. > > > > If we failed to save raw error info, the hwpoison hugepage has errors on > > unknown subpage, then this new saving mechanism does not work any more, > > so disable saving new raw error info and freeing hwpoison hugepages. > > > > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> > > Thanks for your work on this. I have several quesions below. > ... > > @@ -1499,6 +1499,97 @@ static int try_to_split_thp_page(struct page *page, const char *msg) > > } > > > > #ifdef CONFIG_HUGETLB_PAGE > > +/* > > + * Struct raw_hwp_page represents information about "raw error page", > > + * constructing singly linked list originated from ->private field of > > + * SUBPAGE_INDEX_HWPOISON-th tail page. > > + */ > > +struct raw_hwp_page { > > + struct llist_node node; > > + struct page *page; > > +}; > > + > > +static inline struct llist_head *raw_hwp_list_head(struct page *hpage) > > +{ > > + return (struct llist_head *)&page_private(hpage + SUBPAGE_INDEX_HWPOISON); > > +} > > + > > +static inline int raw_hwp_unreliable(struct page *hpage) > > +{ > > + return page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE); > > +} > > + > > +static inline void set_raw_hwp_unreliable(struct page *hpage) > > +{ > > + set_page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE, 1); > > +} > > Why not use HPAGEFLAG(HwpoisonUnreliable, hwpoison_unreliable) directly? > OK, that sounds better, I'll do it. > > + > > +/* > > + * about race consideration > > + */ > > +static inline int hugetlb_set_page_hwpoison(struct page *hpage, > > + struct page *page) > > +{ > > + struct llist_head *head; > > + struct raw_hwp_page *raw_hwp; > > + struct llist_node *t, *tnode; > > + int ret; > > + > > + /* > > + * Once the hwpoison hugepage has lost reliable raw error info, > > + * there is little mean to keep additional error info precisely, > > + * so skip to add additional raw error info. > > + */ > > + if (raw_hwp_unreliable(hpage)) > > + return -EHWPOISON; > > + head = raw_hwp_list_head(hpage); > > + llist_for_each_safe(tnode, t, head->first) { > > + struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node); > > + > > + if (p->page == page) > > + return -EHWPOISON; > > + } > > + > > + ret = TestSetPageHWPoison(hpage) ? -EHWPOISON : 0; > > + /* the first error event will be counted in action_result(). 
*/ > > + if (ret) > > + num_poisoned_pages_inc(); > > + > > + raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_KERNEL); > > This function can be called in atomic context, GFP_ATOMIC should be used > here. OK, I'll use GFP_ATOMIC. > > > + if (raw_hwp) { > > + raw_hwp->page = page; > > + llist_add(&raw_hwp->node, head); > > The maximum amount of items in the list is one, right? The maximum is the number of subpages in the hugepage (512 for 2MB hugepage, 262144 for 1GB hugepage). > > > + } else { > > + /* > > + * Failed to save raw error info. We no longer trace all > > + * hwpoisoned subpages, and we need refuse to free/dissolve > > + * this hwpoisoned hugepage. > > + */ > > + set_raw_hwp_unreliable(hpage); > > + return ret; > > + } > > + return ret; > > +} > > + > > +inline int hugetlb_clear_page_hwpoison(struct page *hpage) > > +{ > > + struct llist_head *head; > > + struct llist_node *t, *tnode; > > + > > + if (raw_hwp_unreliable(hpage)) > > + return -EBUSY; > > IIUC, we use head page's PageHWPoison to synchronize hugetlb_clear_page_hwpoison() > and hugetlb_set_page_hwpoison(), right? If so, who can set hwp_unreliable here? Sorry if I might miss your point, but raw_hwp_unreliable is set when allocating raw_hwp_page failed. hugetlb_set_page_hwpoison() can be called multiple times on a hugepage and if one of the calls fails, the hwpoisoned hugepage becomes unreliable. BTW, as you pointed out above, if we switch to passing GFP_ATOMIC to kmalloc(), the kmalloc() never fails, so we no longer have to implement this unreliable flag, so things get simpler. > > > + ClearPageHWPoison(hpage); > > + head = raw_hwp_list_head(hpage); > > + llist_for_each_safe(tnode, t, head->first) { > > Is it possible that a new item is added hugetlb_set_page_hwpoison() and we do not > traverse it (we have cleared page's PageHWPoison)? Then we ignored a real hwpoison > page, right? Maybe you are mentioning the race like below. Yes, that's possible. CPU 0 CPU 1 free_huge_page lock hugetlb_lock ClearHPageMigratable unlock hugetlb_lock get_huge_page_for_hwpoison lock hugetlb_lock __get_huge_page_for_hwpoison hugetlb_set_page_hwpoison allocate raw_hwp_page TestSetPageHWPoison update_and_free_page __update_and_free_page if (PageHWPoison) hugetlb_clear_page_hwpoison TestClearPageHWPoison // remove all list items llist_add unlock hugetlb_lock The end result seems not critical (leaking raced raw_hwp_page?), but we need fix. > > > + struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node); > > + > > + SetPageHWPoison(p->page); > > + kfree(p); > > + } > > + llist_del_all(head); > > If the above issue exists, moving ClearPageHWPoison(hpage) to here could > fix it. We should clear PageHWPoison carefully since the head page is > also can be poisoned. The reason why I put ClearPageHWPoison(hpage) before llist_for_each_safe() was that raw error page can be the head page. But this can be solved with some additional code to remember whether raw_hwp_page list has an item associated with the head page. Or another approach in my mind now is to call hugetlb_clear_page_hwpoison() with taking mf_mutex. > > Thanks. Thank you for valuable feedbacks. - Naoya Horiguchi > > > + return 0; > > +} > > + > > /* > > * Called from hugetlb code with hugetlb_lock held. 
> > * > > @@ -1533,7 +1624,7 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags) > > goto out; > > } > > > > - if (TestSetPageHWPoison(head)) { > > + if (hugetlb_set_page_hwpoison(head, page)) { > > ret = -EHWPOISON; > > goto out; > > } > > @@ -1585,7 +1676,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb > > lock_page(head); > > > > if (hwpoison_filter(p)) { > > - ClearPageHWPoison(head); > > + hugetlb_clear_page_hwpoison(head); > > res = -EOPNOTSUPP; > > goto out; > > } > > -- > > 2.25.1 > > > >
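As an aside, a hedged sketch of what the HPAGEFLAG() form suggested above could look like. HPG_raw_hwp_unreliable and the generated helpers are hypothetical names, not part of this v2 patch; HPAGEFLAG() is the existing hugetlb page-flag macro in include/linux/hugetlb.h, which keeps these flags in bits of the head page's page->private.

/* Hypothetical: a hugetlb-specific flag replacing the extra tail-page field. */
enum hugetlb_page_flags {
        HPG_restore_reserve = 0,
        /* ... existing flags ... */
        HPG_raw_hwp_unreliable,         /* raw error list is incomplete */
        __NR_HPAGEFLAGS,
};

HPAGEFLAG(RawHwpUnreliable, raw_hwp_unreliable)

/*
 * The macro generates HPageRawHwpUnreliable(), SetHPageRawHwpUnreliable()
 * and ClearHPageRawHwpUnreliable(), so SUBPAGE_INDEX_HWPOISON_UNRELIABLE
 * and the open-coded page_private() accessors would no longer be needed.
 */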
On Tue, Jun 28, 2022 at 02:41:22AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote: > On Mon, Jun 27, 2022 at 05:26:01PM +0800, Muchun Song wrote: > > On Fri, Jun 24, 2022 at 08:51:48AM +0900, Naoya Horiguchi wrote: > > > From: Naoya Horiguchi <naoya.horiguchi@nec.com> > > > > > > When handling memory error on a hugetlb page, the error handler tries to > > > dissolve and turn it into 4kB pages. If it's successfully dissolved, > > > PageHWPoison flag is moved to the raw error page, so that's all right. > > > However, dissolve sometimes fails, then the error page is left as > > > hwpoisoned hugepage. It's useful if we can retry to dissolve it to save > > > healthy pages, but that's not possible now because the information about > > > where the raw error pages is lost. > > > > > > Use the private field of a few tail pages to keep that information. The > > > code path of shrinking hugepage pool uses this info to try delayed dissolve. > > > In order to remember multiple errors in a hugepage, a singly-linked list > > > originated from SUBPAGE_INDEX_HWPOISON-th tail page is constructed. Only > > > simple operations (adding an entry or clearing all) are required and the > > > list is assumed not to be very long, so this simple data structure should > > > be enough. > > > > > > If we failed to save raw error info, the hwpoison hugepage has errors on > > > unknown subpage, then this new saving mechanism does not work any more, > > > so disable saving new raw error info and freeing hwpoison hugepages. > > > > > > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> > > > > Thanks for your work on this. I have several quesions below. > > > ... > > > > @@ -1499,6 +1499,97 @@ static int try_to_split_thp_page(struct page *page, const char *msg) > > > } > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > +/* > > > + * Struct raw_hwp_page represents information about "raw error page", > > > + * constructing singly linked list originated from ->private field of > > > + * SUBPAGE_INDEX_HWPOISON-th tail page. > > > + */ > > > +struct raw_hwp_page { > > > + struct llist_node node; > > > + struct page *page; > > > +}; > > > + > > > +static inline struct llist_head *raw_hwp_list_head(struct page *hpage) > > > +{ > > > + return (struct llist_head *)&page_private(hpage + SUBPAGE_INDEX_HWPOISON); > > > +} > > > + > > > +static inline int raw_hwp_unreliable(struct page *hpage) > > > +{ > > > + return page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE); > > > +} > > > + > > > +static inline void set_raw_hwp_unreliable(struct page *hpage) > > > +{ > > > + set_page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE, 1); > > > +} > > > > Why not use HPAGEFLAG(HwpoisonUnreliable, hwpoison_unreliable) directly? > > > > OK, that sounds better, I'll do it. > > > > + > > > +/* > > > + * about race consideration > > > + */ > > > +static inline int hugetlb_set_page_hwpoison(struct page *hpage, > > > + struct page *page) > > > +{ > > > + struct llist_head *head; > > > + struct raw_hwp_page *raw_hwp; > > > + struct llist_node *t, *tnode; > > > + int ret; > > > + > > > + /* > > > + * Once the hwpoison hugepage has lost reliable raw error info, > > > + * there is little mean to keep additional error info precisely, > > > + * so skip to add additional raw error info. 
> > > + */ > > > + if (raw_hwp_unreliable(hpage)) > > > + return -EHWPOISON; > > > + head = raw_hwp_list_head(hpage); > > > + llist_for_each_safe(tnode, t, head->first) { > > > + struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node); > > > + > > > + if (p->page == page) > > > + return -EHWPOISON; > > > + } > > > + > > > + ret = TestSetPageHWPoison(hpage) ? -EHWPOISON : 0; > > > + /* the first error event will be counted in action_result(). */ > > > + if (ret) > > > + num_poisoned_pages_inc(); > > > + > > > + raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_KERNEL); > > > > This function can be called in atomic context, GFP_ATOMIC should be used > > here. > > OK, I'll use GFP_ATOMIC. > > > > > > + if (raw_hwp) { > > > + raw_hwp->page = page; > > > + llist_add(&raw_hwp->node, head); > > > > The maximum amount of items in the list is one, right? > > The maximum is the number of subpages in the hugepage (512 for 2MB hugepage, > 262144 for 1GB hugepage). > You are right. I have missed something yesterday. > > > > > + } else { > > > + /* > > > + * Failed to save raw error info. We no longer trace all > > > + * hwpoisoned subpages, and we need refuse to free/dissolve > > > + * this hwpoisoned hugepage. > > > + */ > > > + set_raw_hwp_unreliable(hpage); > > > + return ret; > > > + } > > > + return ret; > > > +} > > > + > > > +inline int hugetlb_clear_page_hwpoison(struct page *hpage) > > > +{ > > > + struct llist_head *head; > > > + struct llist_node *t, *tnode; > > > + > > > + if (raw_hwp_unreliable(hpage)) > > > + return -EBUSY; > > > > IIUC, we use head page's PageHWPoison to synchronize hugetlb_clear_page_hwpoison() > > and hugetlb_set_page_hwpoison(), right? If so, who can set hwp_unreliable here? > > Sorry if I might miss your point, but raw_hwp_unreliable is set when > allocating raw_hwp_page failed. hugetlb_set_page_hwpoison() can be called Sorry. I have missed this. Thanks for your clarification. > multiple times on a hugepage and if one of the calls fails, the hwpoisoned > hugepage becomes unreliable. > > BTW, as you pointed out above, if we switch to passing GFP_ATOMIC to kmalloc(), > the kmalloc() never fails, so we no longer have to implement this unreliable No. kmalloc() with GFP_ATOMIC can fail unless I miss something important. > flag, so things get simpler. > > > > > > + ClearPageHWPoison(hpage); > > > + head = raw_hwp_list_head(hpage); > > > + llist_for_each_safe(tnode, t, head->first) { > > > > Is it possible that a new item is added hugetlb_set_page_hwpoison() and we do not > > traverse it (we have cleared page's PageHWPoison)? Then we ignored a real hwpoison > > page, right? > > Maybe you are mentioning the race like below. Yes, that's possible. > Sorry, ignore my previous comments, I'm thinking something wrong. > CPU 0 CPU 1 > > free_huge_page > lock hugetlb_lock > ClearHPageMigratable remove_hugetlb_page() // the page is non-HugeTLB now > unlock hugetlb_lock > get_huge_page_for_hwpoison > lock hugetlb_lock > __get_huge_page_for_hwpoison // cannot reach here since it is not a HugeTLB page now. // So this race is impossible. Then we fallback to normal // page handling. Seems there is a new issue here. // // memory_failure() // try_memory_failure_hugetlb() // if (hugetlb) // goto unlock_mutex; // if (TestSetPageHWPoison(p)) { // // This non-HugeTLB page's vmemmap is still optimized. Setting COMPOUND_PAGE_DTOR after hugetlb_vmemmap_restore() might fix this issue, but we will encounter this race as you mentioned below. Thanks. 
> hugetlb_set_page_hwpoison > allocate raw_hwp_page > TestSetPageHWPoison > update_and_free_page > __update_and_free_page > if (PageHWPoison) > hugetlb_clear_page_hwpoison > TestClearPageHWPoison > // remove all list items > llist_add > unlock hugetlb_lock > > > The end result seems not critical (leaking raced raw_hwp_page?), but > we need fix. > > > > > > + struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node); > > > + > > > + SetPageHWPoison(p->page); > > > + kfree(p); > > > + } > > > + llist_del_all(head); > > > > If the above issue exists, moving ClearPageHWPoison(hpage) to here could > > fix it. We should clear PageHWPoison carefully since the head page is > > also can be poisoned. > > The reason why I put ClearPageHWPoison(hpage) before llist_for_each_safe() > was that raw error page can be the head page. But this can be solved > with some additional code to remember whether raw_hwp_page list has an item > associated with the head page. > > Or another approach in my mind now is to call hugetlb_clear_page_hwpoison() > with taking mf_mutex. > > > > > Thanks. > > Thank you for valuable feedbacks. > > - Naoya Horiguchi > > > > > > + return 0; > > > +} > > > + > > > /* > > > * Called from hugetlb code with hugetlb_lock held. > > > * > > > @@ -1533,7 +1624,7 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags) > > > goto out; > > > } > > > > > > - if (TestSetPageHWPoison(head)) { > > > + if (hugetlb_set_page_hwpoison(head, page)) { > > > ret = -EHWPOISON; > > > goto out; > > > } > > > @@ -1585,7 +1676,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb > > > lock_page(head); > > > > > > if (hwpoison_filter(p)) { > > > - ClearPageHWPoison(head); > > > + hugetlb_clear_page_hwpoison(head); > > > res = -EOPNOTSUPP; > > > goto out; > > > } > > > -- > > > 2.25.1 > > > > > >
On Tue, Jun 28, 2022 at 02:26:47PM +0800, Muchun Song wrote: > On Tue, Jun 28, 2022 at 02:41:22AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote: > > On Mon, Jun 27, 2022 at 05:26:01PM +0800, Muchun Song wrote: > > > On Fri, Jun 24, 2022 at 08:51:48AM +0900, Naoya Horiguchi wrote: > > > > From: Naoya Horiguchi <naoya.horiguchi@nec.com> > > > > > > > > When handling memory error on a hugetlb page, the error handler tries to > > > > dissolve and turn it into 4kB pages. If it's successfully dissolved, > > > > PageHWPoison flag is moved to the raw error page, so that's all right. > > > > However, dissolve sometimes fails, then the error page is left as > > > > hwpoisoned hugepage. It's useful if we can retry to dissolve it to save > > > > healthy pages, but that's not possible now because the information about > > > > where the raw error pages is lost. > > > > > > > > Use the private field of a few tail pages to keep that information. The > > > > code path of shrinking hugepage pool uses this info to try delayed dissolve. > > > > In order to remember multiple errors in a hugepage, a singly-linked list > > > > originated from SUBPAGE_INDEX_HWPOISON-th tail page is constructed. Only > > > > simple operations (adding an entry or clearing all) are required and the > > > > list is assumed not to be very long, so this simple data structure should > > > > be enough. > > > > > > > > If we failed to save raw error info, the hwpoison hugepage has errors on > > > > unknown subpage, then this new saving mechanism does not work any more, > > > > so disable saving new raw error info and freeing hwpoison hugepages. > > > > > > > > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> > > > > > > Thanks for your work on this. I have several quesions below. > > > > > ... > > > > > > @@ -1499,6 +1499,97 @@ static int try_to_split_thp_page(struct page *page, const char *msg) > > > > } > > > > > > > > #ifdef CONFIG_HUGETLB_PAGE > > > > +/* > > > > + * Struct raw_hwp_page represents information about "raw error page", > > > > + * constructing singly linked list originated from ->private field of > > > > + * SUBPAGE_INDEX_HWPOISON-th tail page. > > > > + */ > > > > +struct raw_hwp_page { > > > > + struct llist_node node; > > > > + struct page *page; > > > > +}; > > > > + > > > > +static inline struct llist_head *raw_hwp_list_head(struct page *hpage) > > > > +{ > > > > + return (struct llist_head *)&page_private(hpage + SUBPAGE_INDEX_HWPOISON); > > > > +} > > > > + > > > > +static inline int raw_hwp_unreliable(struct page *hpage) > > > > +{ > > > > + return page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE); > > > > +} > > > > + > > > > +static inline void set_raw_hwp_unreliable(struct page *hpage) > > > > +{ > > > > + set_page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE, 1); > > > > +} > > > > > > Why not use HPAGEFLAG(HwpoisonUnreliable, hwpoison_unreliable) directly? > > > > > > > OK, that sounds better, I'll do it. > > > > > > + > > > > +/* > > > > + * about race consideration > > > > + */ > > > > +static inline int hugetlb_set_page_hwpoison(struct page *hpage, > > > > + struct page *page) > > > > +{ > > > > + struct llist_head *head; > > > > + struct raw_hwp_page *raw_hwp; > > > > + struct llist_node *t, *tnode; > > > > + int ret; > > > > + > > > > + /* > > > > + * Once the hwpoison hugepage has lost reliable raw error info, > > > > + * there is little mean to keep additional error info precisely, > > > > + * so skip to add additional raw error info. 
> > > > + */ > > > > + if (raw_hwp_unreliable(hpage)) > > > > + return -EHWPOISON; > > > > + head = raw_hwp_list_head(hpage); > > > > + llist_for_each_safe(tnode, t, head->first) { > > > > + struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node); > > > > + > > > > + if (p->page == page) > > > > + return -EHWPOISON; > > > > + } > > > > + > > > > + ret = TestSetPageHWPoison(hpage) ? -EHWPOISON : 0; > > > > + /* the first error event will be counted in action_result(). */ > > > > + if (ret) > > > > + num_poisoned_pages_inc(); > > > > + > > > > + raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_KERNEL); > > > > > > This function can be called in atomic context, GFP_ATOMIC should be used > > > here. > > > > OK, I'll use GFP_ATOMIC. > > > > > > > > > + if (raw_hwp) { > > > > + raw_hwp->page = page; > > > > + llist_add(&raw_hwp->node, head); > > > > > > The maximum amount of items in the list is one, right? > > > > The maximum is the number of subpages in the hugepage (512 for 2MB hugepage, > > 262144 for 1GB hugepage). > > > > You are right. I have missed something yesterday. > > > > > > > > + } else { > > > > + /* > > > > + * Failed to save raw error info. We no longer trace all > > > > + * hwpoisoned subpages, and we need refuse to free/dissolve > > > > + * this hwpoisoned hugepage. > > > > + */ > > > > + set_raw_hwp_unreliable(hpage); > > > > + return ret; > > > > + } > > > > + return ret; > > > > +} > > > > + > > > > +inline int hugetlb_clear_page_hwpoison(struct page *hpage) > > > > +{ > > > > + struct llist_head *head; > > > > + struct llist_node *t, *tnode; > > > > + > > > > + if (raw_hwp_unreliable(hpage)) > > > > + return -EBUSY; > > > > > > IIUC, we use head page's PageHWPoison to synchronize hugetlb_clear_page_hwpoison() > > > and hugetlb_set_page_hwpoison(), right? If so, who can set hwp_unreliable here? > > > > Sorry if I might miss your point, but raw_hwp_unreliable is set when > > allocating raw_hwp_page failed. hugetlb_set_page_hwpoison() can be called > > Sorry. I have missed this. Thanks for your clarification. > > > multiple times on a hugepage and if one of the calls fails, the hwpoisoned > > hugepage becomes unreliable. > > > > BTW, as you pointed out above, if we switch to passing GFP_ATOMIC to kmalloc(), > > the kmalloc() never fails, so we no longer have to implement this unreliable > > No. kmalloc() with GFP_ATOMIC can fail unless I miss something important. > > > flag, so things get simpler. > > > > > > > > > + ClearPageHWPoison(hpage); > > > > + head = raw_hwp_list_head(hpage); > > > > + llist_for_each_safe(tnode, t, head->first) { > > > > > > Is it possible that a new item is added hugetlb_set_page_hwpoison() and we do not > > > traverse it (we have cleared page's PageHWPoison)? Then we ignored a real hwpoison > > > page, right? > > > > Maybe you are mentioning the race like below. Yes, that's possible. > > > > Sorry, ignore my previous comments, I'm thinking something wrong. > > > CPU 0 CPU 1 > > > > free_huge_page > > lock hugetlb_lock > > ClearHPageMigratable > remove_hugetlb_page() > // the page is non-HugeTLB now > > unlock hugetlb_lock > > get_huge_page_for_hwpoison > > lock hugetlb_lock > > __get_huge_page_for_hwpoison > > // cannot reach here since it is not a HugeTLB page now. > // So this race is impossible. Then we fallback to normal > // page handling. Seems there is a new issue here. 
> // > // memory_failure() > // try_memory_failure_hugetlb() > // if (hugetlb) > // goto unlock_mutex; > // if (TestSetPageHWPoison(p)) { > // // This non-HugeTLB page's vmemmap is still optimized. > > Setting COMPOUND_PAGE_DTOR after hugetlb_vmemmap_restore() might fix this ^^^ I mean hugetlb_vmemmap_alloc(). > issue, but we will encounter this race as you mentioned below. > > Thanks. > > > hugetlb_set_page_hwpoison > > allocate raw_hwp_page > > TestSetPageHWPoison > > update_and_free_page > > __update_and_free_page > > if (PageHWPoison) > > hugetlb_clear_page_hwpoison > > TestClearPageHWPoison > > // remove all list items > > llist_add > > unlock hugetlb_lock > > > > > > The end result seems not critical (leaking raced raw_hwp_page?), but > > we need fix. > > > > > > > > > + struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node); > > > > + > > > > + SetPageHWPoison(p->page); > > > > + kfree(p); > > > > + } > > > > + llist_del_all(head); > > > > > > If the above issue exists, moving ClearPageHWPoison(hpage) to here could > > > fix it. We should clear PageHWPoison carefully since the head page is > > > also can be poisoned. > > > > The reason why I put ClearPageHWPoison(hpage) before llist_for_each_safe() > > was that raw error page can be the head page. But this can be solved > > with some additional code to remember whether raw_hwp_page list has an item > > associated with the head page. > > > > Or another approach in my mind now is to call hugetlb_clear_page_hwpoison() > > with taking mf_mutex. > > > > > > > > Thanks. > > > > Thank you for valuable feedbacks. > > > > - Naoya Horiguchi > > > > > > > > > + return 0; > > > > +} > > > > + > > > > /* > > > > * Called from hugetlb code with hugetlb_lock held. > > > > * > > > > @@ -1533,7 +1624,7 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags) > > > > goto out; > > > > } > > > > > > > > - if (TestSetPageHWPoison(head)) { > > > > + if (hugetlb_set_page_hwpoison(head, page)) { > > > > ret = -EHWPOISON; > > > > goto out; > > > > } > > > > @@ -1585,7 +1676,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb > > > > lock_page(head); > > > > > > > > if (hwpoison_filter(p)) { > > > > - ClearPageHWPoison(head); > > > > + hugetlb_clear_page_hwpoison(head); > > > > res = -EOPNOTSUPP; > > > > goto out; > > > > } > > > > -- > > > > 2.25.1 > > > > > > > > >
On Tue, Jun 28, 2022 at 02:26:47PM +0800, Muchun Song wrote: > On Tue, Jun 28, 2022 at 02:41:22AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote: > > On Mon, Jun 27, 2022 at 05:26:01PM +0800, Muchun Song wrote: > > > On Fri, Jun 24, 2022 at 08:51:48AM +0900, Naoya Horiguchi wrote: > > > > From: Naoya Horiguchi <naoya.horiguchi@nec.com> ... > > > > + } else { > > > > + /* > > > > + * Failed to save raw error info. We no longer trace all > > > > + * hwpoisoned subpages, and we need refuse to free/dissolve > > > > + * this hwpoisoned hugepage. > > > > + */ > > > > + set_raw_hwp_unreliable(hpage); > > > > + return ret; > > > > + } > > > > + return ret; > > > > +} > > > > + > > > > +inline int hugetlb_clear_page_hwpoison(struct page *hpage) > > > > +{ > > > > + struct llist_head *head; > > > > + struct llist_node *t, *tnode; > > > > + > > > > + if (raw_hwp_unreliable(hpage)) > > > > + return -EBUSY; > > > > > > IIUC, we use head page's PageHWPoison to synchronize hugetlb_clear_page_hwpoison() > > > and hugetlb_set_page_hwpoison(), right? If so, who can set hwp_unreliable here? > > > > Sorry if I might miss your point, but raw_hwp_unreliable is set when > > allocating raw_hwp_page failed. hugetlb_set_page_hwpoison() can be called > > Sorry. I have missed this. Thanks for your clarification. > > > multiple times on a hugepage and if one of the calls fails, the hwpoisoned > > hugepage becomes unreliable. > > > > BTW, as you pointed out above, if we switch to passing GFP_ATOMIC to kmalloc(), > > the kmalloc() never fails, so we no longer have to implement this unreliable > > No. kmalloc() with GFP_ATOMIC can fail unless I miss something important. OK, I've interpretted the comment about GFP_ATOMIC wrongly. * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower * watermark is applied to allow access to "atomic reserves". > > flag, so things get simpler. > > > > > > > > > + ClearPageHWPoison(hpage); > > > > + head = raw_hwp_list_head(hpage); > > > > + llist_for_each_safe(tnode, t, head->first) { > > > > > > Is it possible that a new item is added hugetlb_set_page_hwpoison() and we do not > > > traverse it (we have cleared page's PageHWPoison)? Then we ignored a real hwpoison > > > page, right? > > > > Maybe you are mentioning the race like below. Yes, that's possible. > > > > Sorry, ignore my previous comments, I'm thinking something wrong. > > > CPU 0 CPU 1 > > > > free_huge_page > > lock hugetlb_lock > > ClearHPageMigratable > remove_hugetlb_page() > // the page is non-HugeTLB now Oh, I missed that. > > unlock hugetlb_lock > > get_huge_page_for_hwpoison > > lock hugetlb_lock > > __get_huge_page_for_hwpoison > > // cannot reach here since it is not a HugeTLB page now. > // So this race is impossible. Then we fallback to normal > // page handling. Seems there is a new issue here. > // > // memory_failure() > // try_memory_failure_hugetlb() > // if (hugetlb) > // goto unlock_mutex; > // if (TestSetPageHWPoison(p)) { > // // This non-HugeTLB page's vmemmap is still optimized. > > Setting COMPOUND_PAGE_DTOR after hugetlb_vmemmap_restore() might fix this > issue, but we will encounter this race as you mentioned below. I don't have clear ideas about this now (I don't test vmemmap-optimized case yet), so I will think more about this case. Maybe memory_failure() need detect it because memory_failure() heaviliy depends on the status of struct page. Thanks, Naoya Horiguchi > > Thanks. 
> > > hugetlb_set_page_hwpoison > > allocate raw_hwp_page > > TestSetPageHWPoison > > update_and_free_page > > __update_and_free_page > > if (PageHWPoison) > > hugetlb_clear_page_hwpoison > > TestClearPageHWPoison > > // remove all list items > > llist_add > > unlock hugetlb_lock > > > > > > The end result seems not critical (leaking raced raw_hwp_page?), but > > we need fix.
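To make the GFP_ATOMIC point concrete, a sketch of the allocation being discussed; hugetlb_save_raw_hwp() is an invented helper name, and the other identifiers are the v2 patch's own. GFP_ATOMIC allows the call while hugetlb_lock is held, but as noted above it can still fail, so the "unreliable" fallback has to stay.

/* Sketch only: record one raw error subpage of a hwpoisoned hugepage. */
static int hugetlb_save_raw_hwp(struct page *hpage, struct page *page)
{
        struct raw_hwp_page *raw_hwp;

        /* No sleeping allowed: this may run with hugetlb_lock held. */
        raw_hwp = kmalloc(sizeof(*raw_hwp), GFP_ATOMIC);
        if (!raw_hwp) {
                /* Atomic allocation can still fail; stop tracking precisely. */
                set_raw_hwp_unreliable(hpage);
                return -ENOMEM;
        }
        raw_hwp->page = page;
        llist_add(&raw_hwp->node, raw_hwp_list_head(hpage));
        return 0;
}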
On Tue, Jun 28, 2022 at 08:17:55AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote: > On Tue, Jun 28, 2022 at 02:26:47PM +0800, Muchun Song wrote: > > On Tue, Jun 28, 2022 at 02:41:22AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote: > > > On Mon, Jun 27, 2022 at 05:26:01PM +0800, Muchun Song wrote: > > > > On Fri, Jun 24, 2022 at 08:51:48AM +0900, Naoya Horiguchi wrote: > > > > > From: Naoya Horiguchi <naoya.horiguchi@nec.com> > ... > > > > > + } else { > > > > > + /* > > > > > + * Failed to save raw error info. We no longer trace all > > > > > + * hwpoisoned subpages, and we need refuse to free/dissolve > > > > > + * this hwpoisoned hugepage. > > > > > + */ > > > > > + set_raw_hwp_unreliable(hpage); > > > > > + return ret; > > > > > + } > > > > > + return ret; > > > > > +} > > > > > + > > > > > +inline int hugetlb_clear_page_hwpoison(struct page *hpage) > > > > > +{ > > > > > + struct llist_head *head; > > > > > + struct llist_node *t, *tnode; > > > > > + > > > > > + if (raw_hwp_unreliable(hpage)) > > > > > + return -EBUSY; > > > > > > > > IIUC, we use head page's PageHWPoison to synchronize hugetlb_clear_page_hwpoison() > > > > and hugetlb_set_page_hwpoison(), right? If so, who can set hwp_unreliable here? > > > > > > Sorry if I might miss your point, but raw_hwp_unreliable is set when > > > allocating raw_hwp_page failed. hugetlb_set_page_hwpoison() can be called > > > > Sorry. I have missed this. Thanks for your clarification. > > > > > multiple times on a hugepage and if one of the calls fails, the hwpoisoned > > > hugepage becomes unreliable. > > > > > > BTW, as you pointed out above, if we switch to passing GFP_ATOMIC to kmalloc(), > > > the kmalloc() never fails, so we no longer have to implement this unreliable > > > > No. kmalloc() with GFP_ATOMIC can fail unless I miss something important. > > OK, I've interpretted the comment about GFP_ATOMIC wrongly. > > * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower > * watermark is applied to allow access to "atomic reserves". > > > > > flag, so things get simpler. > > > > > > > > > > > > + ClearPageHWPoison(hpage); > > > > > + head = raw_hwp_list_head(hpage); > > > > > + llist_for_each_safe(tnode, t, head->first) { > > > > > > > > Is it possible that a new item is added hugetlb_set_page_hwpoison() and we do not > > > > traverse it (we have cleared page's PageHWPoison)? Then we ignored a real hwpoison > > > > page, right? > > > > > > Maybe you are mentioning the race like below. Yes, that's possible. > > > > > > > Sorry, ignore my previous comments, I'm thinking something wrong. > > > > > CPU 0 CPU 1 > > > > > > free_huge_page > > > lock hugetlb_lock > > > ClearHPageMigratable > > remove_hugetlb_page() > > // the page is non-HugeTLB now > > Oh, I missed that. > > > > unlock hugetlb_lock > > > get_huge_page_for_hwpoison > > > lock hugetlb_lock > > > __get_huge_page_for_hwpoison > > > > // cannot reach here since it is not a HugeTLB page now. > > // So this race is impossible. Then we fallback to normal > > // page handling. Seems there is a new issue here. > > // > > // memory_failure() > > // try_memory_failure_hugetlb() > > // if (hugetlb) > > // goto unlock_mutex; > > // if (TestSetPageHWPoison(p)) { > > // // This non-HugeTLB page's vmemmap is still optimized. > > > > Setting COMPOUND_PAGE_DTOR after hugetlb_vmemmap_restore() might fix this > > issue, but we will encounter this race as you mentioned below. 
> > I don't have clear ideas about this now (I don't test vmemmap-optimized case > yet), so I will think more about this case. Maybe memory_failure() need > detect it because memory_failure() heaviliy depends on the status of struct > page. > Because HVO (HugeTLB Vmemmap Optimization) will map all tail vmemmap pages with read-only, we cannot write any data to some tail struct pages. It is a new issue unrelated to this patch. Thanks. > Thanks, > Naoya Horiguchi > > > > > Thanks. > > > > > hugetlb_set_page_hwpoison > > > allocate raw_hwp_page > > > TestSetPageHWPoison > > > update_and_free_page > > > __update_and_free_page > > > if (PageHWPoison) > > > hugetlb_clear_page_hwpoison > > > TestClearPageHWPoison > > > // remove all list items > > > llist_add > > > unlock hugetlb_lock > > > > > > > > > The end result seems not critical (leaking raced raw_hwp_page?), but > > > we need fix.
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e4cff27d1198..ac13c2022517 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -42,6 +42,10 @@ enum {
         SUBPAGE_INDEX_CGROUP,           /* reuse page->private */
         SUBPAGE_INDEX_CGROUP_RSVD,      /* reuse page->private */
         __MAX_CGROUP_SUBPAGE_INDEX = SUBPAGE_INDEX_CGROUP_RSVD,
+#endif
+#ifdef CONFIG_MEMORY_FAILURE
+        SUBPAGE_INDEX_HWPOISON,
+        SUBPAGE_INDEX_HWPOISON_UNRELIABLE,
 #endif
         __NR_USED_SUBPAGE,
 };
@@ -798,6 +802,15 @@ extern int dissolve_free_huge_page(struct page *page);
 extern int dissolve_free_huge_pages(unsigned long start_pfn,
                                     unsigned long end_pfn);
 
+#ifdef CONFIG_MEMORY_FAILURE
+extern int hugetlb_clear_page_hwpoison(struct page *hpage);
+#else
+static inline int hugetlb_clear_page_hwpoison(struct page *hpage)
+{
+        return 0;
+}
+#endif
+
 #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
 #ifndef arch_hugetlb_migration_supported
 static inline bool arch_hugetlb_migration_supported(struct hstate *h)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b7ae5f73f3b2..19ef427aa1e8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1541,17 +1541,15 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
         if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
                 return;
 
-        if (hugetlb_vmemmap_alloc(h, page)) {
-                spin_lock_irq(&hugetlb_lock);
-                /*
-                 * If we cannot allocate vmemmap pages, just refuse to free the
-                 * page and put the page back on the hugetlb free list and treat
-                 * as a surplus page.
-                 */
-                add_hugetlb_page(h, page, true);
-                spin_unlock_irq(&hugetlb_lock);
-                return;
-        }
+        if (hugetlb_vmemmap_alloc(h, page))
+                goto fail;
+
+        /*
+         * Move PageHWPoison flag from head page to the raw error pages,
+         * which makes any healthy subpages reusable.
+         */
+        if (unlikely(PageHWPoison(page) && hugetlb_clear_page_hwpoison(page)))
+                goto fail;
 
         for (i = 0; i < pages_per_huge_page(h);
              i++, subpage = mem_map_next(subpage, page, i)) {
@@ -1572,6 +1570,16 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
         } else {
                 __free_pages(page, huge_page_order(h));
         }
+        return;
+fail:
+        spin_lock_irq(&hugetlb_lock);
+        /*
+         * If we cannot allocate vmemmap pages or cannot identify raw hwpoison
+         * subpages reliably, just refuse to free the page and put the page
+         * back on the hugetlb free list and treat as a surplus page.
+         */
+        add_hugetlb_page(h, page, true);
+        spin_unlock_irq(&hugetlb_lock);
 }
 
 /*
@@ -2115,15 +2123,6 @@ int dissolve_free_huge_page(struct page *page)
                  */
                 rc = hugetlb_vmemmap_alloc(h, head);
                 if (!rc) {
-                        /*
-                         * Move PageHWPoison flag from head page to the raw
-                         * error page, which makes any subpages rather than
-                         * the error page reusable.
-                         */
-                        if (PageHWPoison(head) && page != head) {
-                                SetPageHWPoison(page);
-                                ClearPageHWPoison(head);
-                        }
                         update_and_free_page(h, head, false);
                 } else {
                         spin_lock_irq(&hugetlb_lock);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fb3feb1f363e..af571fa6f2af 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1499,6 +1499,97 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
+/*
+ * Struct raw_hwp_page represents information about "raw error page",
+ * constructing singly linked list originated from ->private field of
+ * SUBPAGE_INDEX_HWPOISON-th tail page.
+ */
+struct raw_hwp_page {
+        struct llist_node node;
+        struct page *page;
+};
+
+static inline struct llist_head *raw_hwp_list_head(struct page *hpage)
+{
+        return (struct llist_head *)&page_private(hpage + SUBPAGE_INDEX_HWPOISON);
+}
+
+static inline int raw_hwp_unreliable(struct page *hpage)
+{
+        return page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE);
+}
+
+static inline void set_raw_hwp_unreliable(struct page *hpage)
+{
+        set_page_private(hpage + SUBPAGE_INDEX_HWPOISON_UNRELIABLE, 1);
+}
+
+/*
+ * about race consideration
+ */
+static inline int hugetlb_set_page_hwpoison(struct page *hpage,
+                                struct page *page)
+{
+        struct llist_head *head;
+        struct raw_hwp_page *raw_hwp;
+        struct llist_node *t, *tnode;
+        int ret;
+
+        /*
+         * Once the hwpoison hugepage has lost reliable raw error info,
+         * there is little mean to keep additional error info precisely,
+         * so skip to add additional raw error info.
+         */
+        if (raw_hwp_unreliable(hpage))
+                return -EHWPOISON;
+        head = raw_hwp_list_head(hpage);
+        llist_for_each_safe(tnode, t, head->first) {
+                struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
+
+                if (p->page == page)
+                        return -EHWPOISON;
+        }
+
+        ret = TestSetPageHWPoison(hpage) ? -EHWPOISON : 0;
+        /* the first error event will be counted in action_result(). */
+        if (ret)
+                num_poisoned_pages_inc();
+
+        raw_hwp = kmalloc(sizeof(struct raw_hwp_page), GFP_KERNEL);
+        if (raw_hwp) {
+                raw_hwp->page = page;
+                llist_add(&raw_hwp->node, head);
+        } else {
+                /*
+                 * Failed to save raw error info. We no longer trace all
+                 * hwpoisoned subpages, and we need refuse to free/dissolve
+                 * this hwpoisoned hugepage.
+                 */
+                set_raw_hwp_unreliable(hpage);
+                return ret;
+        }
+        return ret;
+}
+
+inline int hugetlb_clear_page_hwpoison(struct page *hpage)
+{
+        struct llist_head *head;
+        struct llist_node *t, *tnode;
+
+        if (raw_hwp_unreliable(hpage))
+                return -EBUSY;
+        ClearPageHWPoison(hpage);
+        head = raw_hwp_list_head(hpage);
+        llist_for_each_safe(tnode, t, head->first) {
+                struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
+
+                SetPageHWPoison(p->page);
+                kfree(p);
+        }
+        llist_del_all(head);
+        return 0;
+}
+
 /*
  * Called from hugetlb code with hugetlb_lock held.
  *
@@ -1533,7 +1624,7 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
                 goto out;
         }
 
-        if (TestSetPageHWPoison(head)) {
+        if (hugetlb_set_page_hwpoison(head, page)) {
                 ret = -EHWPOISON;
                 goto out;
         }
@@ -1585,7 +1676,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
         lock_page(head);
 
         if (hwpoison_filter(p)) {
-                ClearPageHWPoison(head);
+                hugetlb_clear_page_hwpoison(head);
                 res = -EOPNOTSUPP;
                 goto out;
         }