Message ID | 20210421060259.67554-1-songmuchun@bytedance.com (mailing list archive) |
---|---|
State | New, archived |
Series | mm: hugetlb: fix a race between memory-failure/soft_offline and gather_surplus_pages |
[Cc Naoya]

On Wed 21-04-21 14:02:59, Muchun Song wrote:
> The possible bad scenario:
>
> CPU0:                           CPU1:
>
>                                 gather_surplus_pages()
>                                   page = alloc_surplus_huge_page()
> memory_failure_hugetlb()
>   get_hwpoison_page(page)
>     __get_hwpoison_page(page)
>       get_page_unless_zero(page)
>                                   zero = put_page_testzero(page)
>                                   VM_BUG_ON_PAGE(!zero, page)
>                                   enqueue_huge_page(h, page)
>   put_page(page)
>
> The refcount can possibly be increased by memory-failure or soft_offline
> handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
> hugetlb pool list.

The hwpoison side of this looks really suspicious to me. It shouldn't
really touch the reference count of hugetlb pages without being very
careful (and having hugetlb_lock held). What would happen if the
reference count was increased after the page has been enqueued into the
pool? This can just blow up later.

> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/hugetlb.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3476aa06da70..6c96332db34b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2145,17 +2145,14 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>
>  	/* Free the needed pages to the hugetlb pool */
>  	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
> -		int zeroed;
> -
>  		if ((--needed) < 0)
>  			break;
>  		/*
> -		 * This page is now managed by the hugetlb allocator and has
> -		 * no users -- drop the buddy allocator's reference.
> +		 * The refcount can possibly be increased by memory-failure or
> +		 * soft_offline handlers.
>  		 */
> -		zeroed = put_page_testzero(page);
> -		VM_BUG_ON_PAGE(!zeroed, page);
> -		enqueue_huge_page(h, page);
> +		if (likely(put_page_testzero(page)))
> +			enqueue_huge_page(h, page);
>  	}
>  free:
>  	spin_unlock_irq(&hugetlb_lock);
> --
> 2.11.0
>

On Wed, Apr 21, 2021 at 4:03 PM Michal Hocko <mhocko@suse.com> wrote:
>
> [Cc Naoya]
>
> On Wed 21-04-21 14:02:59, Muchun Song wrote:
> > The possible bad scenario:
> >
> > CPU0:                           CPU1:
> >
> >                                 gather_surplus_pages()
> >                                   page = alloc_surplus_huge_page()
> > memory_failure_hugetlb()
> >   get_hwpoison_page(page)
> >     __get_hwpoison_page(page)
> >       get_page_unless_zero(page)
> >                                   zero = put_page_testzero(page)
> >                                   VM_BUG_ON_PAGE(!zero, page)
> >                                   enqueue_huge_page(h, page)
> >   put_page(page)
> >
> > The refcount can possibly be increased by memory-failure or soft_offline
> > handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
> > hugetlb pool list.
>
> The hwpoison side of this looks really suspicious to me. It shouldn't
> really touch the reference count of hugetlb pages without being very
> careful (and having hugetlb_lock held). What would happen if the
> reference count was increased after the page has been enqueued into the
> pool? This can just blow up later.

If the page has been enqueued into the pool, then the page can be
allocated to other users. The page reference count will be reset to
1 in dequeue_huge_page_node_exact(). Then memory-failure
will free the page because of put_page(). This is wrong because
there is another user.

>
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  mm/hugetlb.c | 11 ++++-------
> >  1 file changed, 4 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 3476aa06da70..6c96332db34b 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2145,17 +2145,14 @@ static int gather_surplus_pages(struct hstate *h, long delta)
> >
> >  	/* Free the needed pages to the hugetlb pool */
> >  	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
> > -		int zeroed;
> > -
> >  		if ((--needed) < 0)
> >  			break;
> >  		/*
> > -		 * This page is now managed by the hugetlb allocator and has
> > -		 * no users -- drop the buddy allocator's reference.
> > +		 * The refcount can possibly be increased by memory-failure or
> > +		 * soft_offline handlers.
> >  		 */
> > -		zeroed = put_page_testzero(page);
> > -		VM_BUG_ON_PAGE(!zeroed, page);
> > -		enqueue_huge_page(h, page);
> > +		if (likely(put_page_testzero(page)))
> > +			enqueue_huge_page(h, page);
> >  	}
> >  free:
> >  	spin_unlock_irq(&hugetlb_lock);
> > --
> > 2.11.0
> >
>
> --
> Michal Hocko
> SUSE Labs

On Wed, Apr 21, 2021 at 04:15:00PM +0800, Muchun Song wrote:
> > The hwpoison side of this looks really suspicious to me. It shouldn't
> > really touch the reference count of hugetlb pages without being very
> > careful (and having hugetlb_lock held). What would happen if the
> > reference count was increased after the page has been enqueued into the
> > pool? This can just blow up later.
>
> If the page has been enqueued into the pool, then the page can be
> allocated to other users. The page reference count will be reset to
> 1 in dequeue_huge_page_node_exact(). Then memory-failure
> will free the page because of put_page(). This is wrong because
> there is another user.

Note that dequeue_huge_page_node_exact() will not hand over any pages
which are poisoned, so in this case it will not be allocated.

But it is true that we might need hugetlb lock, this needs some more
thought.

I will have a look.

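To make the point about poisoned pages concrete, here is a minimal userspace sketch of a free-list dequeue that skips poisoned entries and resets the reference count to 1 on the page it hands out. It is only a model of the behaviour described in this thread, not the kernel's dequeue_huge_page_node_exact(); `struct toy_hpage` and `toy_dequeue()` are hypothetical names, and the singly linked list stands in for the per-node free list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a free huge page; not the kernel's struct page. */
struct toy_hpage {
	bool hwpoison;           /* models the HWPoison page flag */
	int refcount;            /* 0 while the page sits in the free pool */
	struct toy_hpage *next;  /* next entry on the toy free list */
};

/*
 * Unlink and return the first non-poisoned page on the free list, with its
 * reference count set to 1, or NULL if every page is poisoned or the list
 * is empty. Poisoned pages are left on the list and never handed out.
 */
static struct toy_hpage *toy_dequeue(struct toy_hpage **free_list)
{
	assert(free_list != NULL);

	for (struct toy_hpage **link = free_list; *link; link = &(*link)->next) {
		struct toy_hpage *page = *link;

		if (page->hwpoison)
			continue;        /* skip: must not reach a new user */

		*link = page->next;      /* unlink from the free list */
		page->next = NULL;
		page->refcount = 1;      /* the new user's reference */
		return page;
	}
	return NULL;
}
```

In this model a poisoned page can never be handed to a new user, which matches the observation above. It says nothing about soft offline, where the flag is set late.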
On Wed 21-04-21 16:15:00, Muchun Song wrote:
> On Wed, Apr 21, 2021 at 4:03 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > [Cc Naoya]
> >
> > On Wed 21-04-21 14:02:59, Muchun Song wrote:
> > > The possible bad scenario:
> > >
> > > CPU0:                           CPU1:
> > >
> > >                                 gather_surplus_pages()
> > >                                   page = alloc_surplus_huge_page()
> > > memory_failure_hugetlb()
> > >   get_hwpoison_page(page)
> > >     __get_hwpoison_page(page)
> > >       get_page_unless_zero(page)
> > >                                   zero = put_page_testzero(page)
> > >                                   VM_BUG_ON_PAGE(!zero, page)
> > >                                   enqueue_huge_page(h, page)
> > >   put_page(page)
> > >
> > > The refcount can possibly be increased by memory-failure or soft_offline
> > > handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
> > > hugetlb pool list.
> >
> > The hwpoison side of this looks really suspicious to me. It shouldn't
> > really touch the reference count of hugetlb pages without being very
> > careful (and having hugetlb_lock held). What would happen if the
> > reference count was increased after the page has been enqueued into the
> > pool? This can just blow up later.
>
> If the page has been enqueued into the pool, then the page can be
> allocated to other users. The page reference count will be reset to
> 1 in dequeue_huge_page_node_exact(). Then memory-failure
> will free the page because of put_page(). This is wrong because
> there is another user.

Yes that is one of the scenarios but I suspect there are more lurking
there. That was my point that this should be addressed at the hwpoison
side.

On Wed, Apr 21, 2021 at 10:03:34AM +0200, Michal Hocko wrote:
> [Cc Naoya]
>
> On Wed 21-04-21 14:02:59, Muchun Song wrote:
> > The possible bad scenario:
> >
> > CPU0:                           CPU1:
> >
> >                                 gather_surplus_pages()
> >                                   page = alloc_surplus_huge_page()
> > memory_failure_hugetlb()
> >   get_hwpoison_page(page)
> >     __get_hwpoison_page(page)
> >       get_page_unless_zero(page)
> >                                   zero = put_page_testzero(page)
> >                                   VM_BUG_ON_PAGE(!zero, page)
> >                                   enqueue_huge_page(h, page)
> >   put_page(page)
> >
> > The refcount can possibly be increased by memory-failure or soft_offline
> > handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
> > hugetlb pool list.
>
> The hwpoison side of this looks really suspicious to me. It shouldn't
> really touch the reference count of hugetlb pages without being very
> careful (and having hugetlb_lock held).

I have the same feeling, there is a window where a hugepage is refcounted
during converting from buddy free pages into free hugepage, so refcount
alone is not enough to prevent the race.  hugetlb_lock is retaken after
alloc_surplus_huge_page returns, so simply holding hugetlb_lock in
get_hwpoison_page() does not seem to work.  Is there any status bit to show
that a hugepage is just being initialized (not in free hugepage pool or in
use)?

> What would happen if the
> reference count was increased after the page has been enqueued into the
> pool? This can just blow up later.

Yes, this is another concern.

Thanks,
Naoya Horiguchi

On Wed, Apr 21, 2021 at 4:21 PM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Wed, Apr 21, 2021 at 04:15:00PM +0800, Muchun Song wrote:
> > > The hwpoison side of this looks really suspicious to me. It shouldn't
> > > really touch the reference count of hugetlb pages without being very
> > > careful (and having hugetlb_lock held). What would happen if the
> > > reference count was increased after the page has been enqueued into the
> > > pool? This can just blow up later.
> >
> > If the page has been enqueued into the pool, then the page can be
> > allocated to other users. The page reference count will be reset to
> > 1 in dequeue_huge_page_node_exact(). Then memory-failure
> > will free the page because of put_page(). This is wrong because
> > there is another user.
>
> Note that dequeue_huge_page_node_exact() will not hand over any pages
> which are poisoned, so in this case it will not be allocated.

But softoffline does not set page hwpoison before
__get_hwpoison_page(). So the page still can be
allocated. Right?

> But it is true that we might need hugetlb lock, this needs some more
> thought.
>
> I will have a look.
>
> --
> Oscar Salvador
> SUSE L3

On Wed 21-04-21 10:21:03, Oscar Salvador wrote:
> On Wed, Apr 21, 2021 at 04:15:00PM +0800, Muchun Song wrote:
> > > The hwpoison side of this looks really suspicious to me. It shouldn't
> > > really touch the reference count of hugetlb pages without being very
> > > careful (and having hugetlb_lock held). What would happen if the
> > > reference count was increased after the page has been enqueued into the
> > > pool? This can just blow up later.
> >
> > If the page has been enqueued into the pool, then the page can be
> > allocated to other users. The page reference count will be reset to
> > 1 in dequeue_huge_page_node_exact(). Then memory-failure
> > will free the page because of put_page(). This is wrong because
> > there is another user.
>
> Note that dequeue_huge_page_node_exact() will not hand over any pages
> which are poisoned, so in this case it will not be allocated.

I have to say I have missed the HWPoison check, so this particular
scenario is indeed not possible.

> But it is true that we might need hugetlb lock, this needs some more
> thought.

Yes, nobody should be touching the reference count of hugetlb pool
pages outside of hugetlb proper.

> I will have a look.

Thanks!

On Wed, Apr 21, 2021 at 04:41:10PM +0800, Muchun Song wrote:
> But softoffline does not set page hwpoison before
> __get_hwpoison_page(). So the page still can be
> allocated. Right?

Yep, soft_offline() only marks the page as hwpoison once the page has been
fully contended and no other use is possible.
But yeah, hugetlb is a bit trickier in that regard.

This needs fixing in there.

On Wed, Apr 21, 2021 at 4:49 PM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Wed, Apr 21, 2021 at 04:41:10PM +0800, Muchun Song wrote:
>
> > But softoffline does not set page hwpoison before
> > __get_hwpoison_page(). So the page still can be
> > allocated. Right?
>
> Yep, soft_offline() only marks the page as hwpoison once the page has been
> fully contended and no other use is possible.
> But yeah, hugetlb is a bit trickier in that regard.
>
> This needs fixing in there.

It is OK to fix it in softoffline/memory-failure. I just want to
expose the race.

Thanks.

>
>
> --
> Oscar Salvador
> SUSE L3

On Wed, Apr 21, 2021 at 4:33 PM HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@nec.com> wrote:
>
> On Wed, Apr 21, 2021 at 10:03:34AM +0200, Michal Hocko wrote:
> > [Cc Naoya]
> >
> > On Wed 21-04-21 14:02:59, Muchun Song wrote:
> > > The possible bad scenario:
> > >
> > > CPU0:                           CPU1:
> > >
> > >                                 gather_surplus_pages()
> > >                                   page = alloc_surplus_huge_page()
> > > memory_failure_hugetlb()
> > >   get_hwpoison_page(page)
> > >     __get_hwpoison_page(page)
> > >       get_page_unless_zero(page)
> > >                                   zero = put_page_testzero(page)
> > >                                   VM_BUG_ON_PAGE(!zero, page)
> > >                                   enqueue_huge_page(h, page)
> > >   put_page(page)
> > >
> > > The refcount can possibly be increased by memory-failure or soft_offline
> > > handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
> > > hugetlb pool list.
> >
> > The hwpoison side of this looks really suspicious to me. It shouldn't
> > really touch the reference count of hugetlb pages without being very
> > careful (and having hugetlb_lock held).
>
> I have the same feeling, there is a window where a hugepage is refcounted
> during converting from buddy free pages into free hugepage, so refcount
> alone is not enough to prevent the race.  hugetlb_lock is retaken after
> alloc_surplus_huge_page returns, so simply holding hugetlb_lock in
> get_hwpoison_page() does not seem to work.  Is there any status bit to show
> that a hugepage is just being initialized (not in free hugepage pool or in
> use)?

HPageFreed() can indicate whether a page is on the
free pool list.

>
> > What would happen if the
> > reference count was increased after the page has been enqueued into the
> > pool? This can just blow up later.
>
> Yes, this is another concern.
>
> Thanks,
> Naoya Horiguchi

On 4/21/21 1:33 AM, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Wed, Apr 21, 2021 at 10:03:34AM +0200, Michal Hocko wrote:
>> [Cc Naoya]
>>
>> On Wed 21-04-21 14:02:59, Muchun Song wrote:
>>> The possible bad scenario:
>>>
>>> CPU0:                           CPU1:
>>>
>>>                                 gather_surplus_pages()
>>>                                   page = alloc_surplus_huge_page()
>>> memory_failure_hugetlb()
>>>   get_hwpoison_page(page)
>>>     __get_hwpoison_page(page)
>>>       get_page_unless_zero(page)
>>>                                   zero = put_page_testzero(page)
>>>                                   VM_BUG_ON_PAGE(!zero, page)
>>>                                   enqueue_huge_page(h, page)
>>>   put_page(page)
>>>
>>> The refcount can possibly be increased by memory-failure or soft_offline
>>> handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
>>> hugetlb pool list.
>>
>> The hwpoison side of this looks really suspicious to me. It shouldn't
>> really touch the reference count of hugetlb pages without being very
>> careful (and having hugetlb_lock held).
>
> I have the same feeling, there is a window where a hugepage is refcounted
> during converting from buddy free pages into free hugepage, so refcount
> alone is not enough to prevent the race.  hugetlb_lock is retaken after
> alloc_surplus_huge_page returns, so simply holding hugetlb_lock in
> get_hwpoison_page() does not seem to work.  Is there any status bit to show
> that a hugepage is just being initialized (not in free hugepage pool or in
> use)?
>

It seems we can also race with the code that makes a compound page a
hugetlb page.  The memory failure code could be called after allocating
pages from buddy and before setting compound page DTOR.  So, the memory
handling code will process it as a compound page.

Just thinking that this may not be limited to the hugetlb specific memory
failure handling?

On Wed, Apr 21, 2021 at 11:03:24AM -0700, Mike Kravetz wrote:
> On 4/21/21 1:33 AM, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Wed, Apr 21, 2021 at 10:03:34AM +0200, Michal Hocko wrote:
> >> [Cc Naoya]
> >>
> >> On Wed 21-04-21 14:02:59, Muchun Song wrote:
> >>> The possible bad scenario:
> >>>
> >>> CPU0:                           CPU1:
> >>>
> >>>                                 gather_surplus_pages()
> >>>                                   page = alloc_surplus_huge_page()
> >>> memory_failure_hugetlb()
> >>>   get_hwpoison_page(page)
> >>>     __get_hwpoison_page(page)
> >>>       get_page_unless_zero(page)
> >>>                                   zero = put_page_testzero(page)
> >>>                                   VM_BUG_ON_PAGE(!zero, page)
> >>>                                   enqueue_huge_page(h, page)
> >>>   put_page(page)
> >>>
> >>> The refcount can possibly be increased by memory-failure or soft_offline
> >>> handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
> >>> hugetlb pool list.
> >>
> >> The hwpoison side of this looks really suspicious to me. It shouldn't
> >> really touch the reference count of hugetlb pages without being very
> >> careful (and having hugetlb_lock held).
> >
> > I have the same feeling, there is a window where a hugepage is refcounted
> > during converting from buddy free pages into free hugepage, so refcount
> > alone is not enough to prevent the race.  hugetlb_lock is retaken after
> > alloc_surplus_huge_page returns, so simply holding hugetlb_lock in
> > get_hwpoison_page() does not seem to work.  Is there any status bit to show
> > that a hugepage is just being initialized (not in free hugepage pool or in
> > use)?
> >
>
> It seems we can also race with the code that makes a compound page a
> hugetlb page.  The memory failure code could be called after allocating
> pages from buddy and before setting compound page DTOR.  So, the memory
> handling code will process it as a compound page.

Yes, so get_hwpoison_page() has to call get_page_unless_zero()
only when memory_failure() can surely handle the error.

>
> Just thinking that this may not be limited to the hugetlb specific memory
> failure handling?

Currently hugetlb page is the only type of compound page supported by memory
failure. But I agree with you that other types of compound pages have the
same race window, and judging only with get_page_unless_zero() is dangerous.
So I think that __get_hwpoison_page() should have the following structure:

  if (PageCompound) {
      if (PageHuge) {
          if (PageHugeFreed || PageHugeActive) {
              if (get_page_unless_zero)
                  return 0;   // path for in-use hugetlb page
              else
                  return 1;   // path for free hugetlb page
          } else {
              return -EBUSY;  // any transient hugetlb page
          }
      } else {
          ... // any other compound page (like thp, slab, ...)
      }
  } else {
      ...   // any non-compound page
  }

Thanks,
Naoya Horiguchi

On Thu, Apr 22, 2021 at 08:27:46AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Wed, Apr 21, 2021 at 11:03:24AM -0700, Mike Kravetz wrote:
> > On 4/21/21 1:33 AM, HORIGUCHI NAOYA(堀口 直也) wrote:
> > > On Wed, Apr 21, 2021 at 10:03:34AM +0200, Michal Hocko wrote:
> > >> [Cc Naoya]
> > >>
> > >> On Wed 21-04-21 14:02:59, Muchun Song wrote:
> > >>> The possible bad scenario:
> > >>>
> > >>> CPU0:                           CPU1:
> > >>>
> > >>>                                 gather_surplus_pages()
> > >>>                                   page = alloc_surplus_huge_page()
> > >>> memory_failure_hugetlb()
> > >>>   get_hwpoison_page(page)
> > >>>     __get_hwpoison_page(page)
> > >>>       get_page_unless_zero(page)
> > >>>                                   zero = put_page_testzero(page)
> > >>>                                   VM_BUG_ON_PAGE(!zero, page)
> > >>>                                   enqueue_huge_page(h, page)
> > >>>   put_page(page)
> > >>>
> > >>> The refcount can possibly be increased by memory-failure or soft_offline
> > >>> handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
> > >>> hugetlb pool list.
> > >>
> > >> The hwpoison side of this looks really suspicious to me. It shouldn't
> > >> really touch the reference count of hugetlb pages without being very
> > >> careful (and having hugetlb_lock held).
> > >
> > > I have the same feeling, there is a window where a hugepage is refcounted
> > > during converting from buddy free pages into free hugepage, so refcount
> > > alone is not enough to prevent the race.  hugetlb_lock is retaken after
> > > alloc_surplus_huge_page returns, so simply holding hugetlb_lock in
> > > get_hwpoison_page() does not seem to work.  Is there any status bit to show
> > > that a hugepage is just being initialized (not in free hugepage pool or in
> > > use)?
> > >
> >
> > It seems we can also race with the code that makes a compound page a
> > hugetlb page.  The memory failure code could be called after allocating
> > pages from buddy and before setting compound page DTOR.  So, the memory
> > handling code will process it as a compound page.
>
> Yes, so get_hwpoison_page() has to call get_page_unless_zero()
> only when memory_failure() can surely handle the error.
>
> >
> > Just thinking that this may not be limited to the hugetlb specific memory
> > failure handling?
>
> Currently hugetlb page is the only type of compound page supported by memory
> failure. But I agree with you that other types of compound pages have the
> same race window, and judging only with get_page_unless_zero() is dangerous.
> So I think that __get_hwpoison_page() should have the following structure:
>
>   if (PageCompound) {
>       if (PageHuge) {
>           if (PageHugeFreed || PageHugeActive) {
>               if (get_page_unless_zero)
>                   return 0;   // path for in-use hugetlb page
>               else
>                   return 1;   // path for free hugetlb page
>           } else {
>               return -EBUSY;  // any transient hugetlb page
>           }
>       } else {
>           ... // any other compound page (like thp, slab, ...)
>       }
>   } else {
>       ...   // any non-compound page
>   }

The above pseudo code was wrong, so let me update my thought.
I'm now trying to solve the reported issue by changing __get_hwpoison_page()
like below:

  static int __get_hwpoison_page(struct page *page)
  {
          struct page *head = compound_head(page);

          if (PageCompound(page)) {
                  if (PageSlab(page)) {
                          return get_page_unless_zero(page);
                  } else if (PageHuge(head)) {
                          if (HPageFreed(head) || HPageMigratable(head))
                                  return get_page_unless_zero(head);
                  } else if (PageTransHuge(head)) {
                          /*
                           * Non anonymous thp exists only in allocation/free time. We
                           * can't handle such a case correctly, so let's give it up.
                           * This should be better than triggering BUG_ON when kernel
                           * tries to touch the "partially handled" page.
                           */
                          if (!PageAnon(head)) {
                                  pr_err("Memory failure: %#lx: non anonymous thp\n",
                                         page_to_pfn(page));
                                  return 0;
                          }
                          if (get_page_unless_zero(head)) {
                                  if (head == compound_head(page))
                                          return 1;
                                  pr_info("Memory failure: %#lx cannot catch tail\n",
                                          page_to_pfn(page));
                                  put_page(head);
                          }
                  }
                  return 0;
          }

          return get_page_unless_zero(page);
  }

Some notes:

  - in hugetlb path, new HPage* checks should avoid the reported race,
    but I still need more testing to confirm it,
  - PageSlab check is added because otherwise I found that
    "non anonymous thp" path is chosen, that's obviously wrong,
  - thp's branch has a known issue unrelated to the current issue,
    which will/should be improved later.

I'll send a patch next week.

Thanks,
Naoya Horiguchi

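The hugetlb branch of the proposal above boils down to "only try to take a reference when the page is in a known stable state". The following userspace sketch models just that rule under loud assumptions: `struct toy_huge`, `enum toy_state` and `toy_get_hwpoison_hugetlb()` are hypothetical names, the three states stand in for "neither flag set", HPageFreed and HPageMigratable, and the plain `int` refcount is single-threaded, so this shows the decision logic only and not the locking or atomicity the kernel needs.

```c
#include <assert.h>

/* Hypothetical lifecycle states of a toy huge page. */
enum toy_state {
	TOY_TRANSIENT,   /* being set up or torn down: neither flag set */
	TOY_FREED,       /* models HPageFreed: sitting in the free pool */
	TOY_MIGRATABLE   /* models HPageMigratable: in use by someone */
};

struct toy_huge {
	enum toy_state state;
	int refcount;    /* single-threaded model, deliberately not atomic */
};

/*
 * Returns 1 and takes a reference only when the page is in a stable state
 * AND its reference count is non-zero (the get_page_unless_zero() rule).
 * Returns 0 and leaves the page untouched otherwise.
 */
static int toy_get_hwpoison_hugetlb(struct toy_huge *h)
{
	assert(h != 0);

	if (h->state != TOY_FREED && h->state != TOY_MIGRATABLE)
		return 0;        /* transient page: do not touch the refcount */

	if (h->refcount == 0)
		return 0;        /* free pool page: nothing to pin */

	h->refcount++;
	return 1;
}
```

With this gate, a freshly allocated surplus page that still carries the allocator's reference but has not yet reached the pool is left alone, which is the window the report is about. Whether the real flags are sufficient is, as noted above, still to be confirmed by testing.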
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3476aa06da70..6c96332db34b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2145,17 +2145,14 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 
 	/* Free the needed pages to the hugetlb pool */
 	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
-		int zeroed;
-
 		if ((--needed) < 0)
 			break;
 		/*
-		 * This page is now managed by the hugetlb allocator and has
-		 * no users -- drop the buddy allocator's reference.
+		 * The refcount can possibly be increased by memory-failure or
+		 * soft_offline handlers.
 		 */
-		zeroed = put_page_testzero(page);
-		VM_BUG_ON_PAGE(!zeroed, page);
-		enqueue_huge_page(h, page);
+		if (likely(put_page_testzero(page)))
+			enqueue_huge_page(h, page);
 	}
 free:
 	spin_unlock_irq(&hugetlb_lock);

The possible bad scenario:

CPU0:                           CPU1:

                                gather_surplus_pages()
                                  page = alloc_surplus_huge_page()
memory_failure_hugetlb()
  get_hwpoison_page(page)
    __get_hwpoison_page(page)
      get_page_unless_zero(page)
                                  zero = put_page_testzero(page)
                                  VM_BUG_ON_PAGE(!zero, page)
                                  enqueue_huge_page(h, page)
  put_page(page)

The refcount can possibly be increased by memory-failure or soft_offline
handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
hugetlb pool list.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)
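The interleaving in the commit message can be replayed deterministically in userspace. The sketch below is a toy model under stated assumptions: `struct toy_page`, `toy_get_unless_zero()`, `toy_put_testzero()` and `toy_replay_race()` are hypothetical names built on C11 atomics, they only imitate the semantics of get_page_unless_zero() and put_page_testzero(), and the two "CPUs" are simply run one step after another in the order shown above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for struct page: only the refcount and pool state matter. */
struct toy_page {
	atomic_int refcount;
	bool in_pool;
};

/* Models get_page_unless_zero(): take a reference unless the count is 0. */
static bool toy_get_unless_zero(struct toy_page *p)
{
	int old = atomic_load(&p->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Models put_page_testzero(): drop a reference, report whether it hit 0. */
static bool toy_put_testzero(struct toy_page *p)
{
	return atomic_fetch_sub(&p->refcount, 1) == 1;
}

/*
 * Replays the reported order of events and returns what the allocation
 * side observed from its put_page_testzero() equivalent. The final
 * put_page() on the memory-failure side is left to the caller.
 */
static bool toy_replay_race(struct toy_page *p)
{
	bool pinned, zeroed;

	atomic_init(&p->refcount, 1);     /* CPU1: alloc_surplus_huge_page() */
	p->in_pool = false;

	pinned = toy_get_unless_zero(p);  /* CPU0: __get_hwpoison_page() */
	assert(pinned);

	zeroed = toy_put_testzero(p);     /* CPU1: drop allocator reference */
	if (zeroed)
		p->in_pool = true;        /* patched behaviour: enqueue only at 0 */
	return zeroed;
}
```

In this replay the allocation side sees a non-zero count after its put, which is exactly the condition the original VM_BUG_ON_PAGE() asserts against. The model follows the patch in skipping the enqueue in that case; it does not decide who should own the page afterwards, which is the open question in the thread.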