diff mbox series

[v3,4/6] mm: hugetlb: add return -EAGAIN for dissolve_free_huge_page

Message ID 20210110124017.86750-5-songmuchun@bytedance.com (mailing list archive)
State New, archived
Headers show
Series Fix some bugs about HugeTLB code | expand

Commit Message

Muchun Song Jan. 10, 2021, 12:40 p.m. UTC
There is a race between dissolve_free_huge_page() and put_page(),
and the race window is quite small. Theoretically, we should return
-EBUSY when we encounter this race. In fact, we have a chance to
successfully dissolve the page if we do a retry. Because the race
window is quite small. If we seize this opportunity, it is an
optimization for increasing the success rate of dissolving page.

If we free a HugeTLB page from a non-task context, it is deferred
through a workqueue. In this case, we need to flush the work.

The dissolve_free_huge_page() can be called from memory hotplug,
the caller aims to free the HugeTLB page to the buddy allocator
so that the caller can unplug the page successfully.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

Comments

Mike Kravetz Jan. 12, 2021, 1:20 a.m. UTC | #1
On 1/10/21 4:40 AM, Muchun Song wrote:
> There is a race between dissolve_free_huge_page() and put_page(),
> and the race window is quite small. Theoretically, we should return
> -EBUSY when we encounter this race. In fact, we have a chance to
> successfully dissolve the page if we do a retry. Because the race
> window is quite small. If we seize this opportunity, it is an
> optimization for increasing the success rate of dissolving page.
> 
> If we free a HugeTLB page from a non-task context, it is deferred
> through a workqueue. In this case, we need to flush the work.
> 
> The dissolve_free_huge_page() can be called from memory hotplug,
> the caller aims to free the HugeTLB page to the buddy allocator
> so that the caller can unplug the page successfully.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/hugetlb.c | 26 +++++++++++++++++++++-----
>  1 file changed, 21 insertions(+), 5 deletions(-)

I am unsure about the need for this patch.  The code is OK, there are no
issues with the code.

As mentioned in the commit message, this is an optimization and could
potentially cause a memory offline operation to succeed instead of fail.
However, we are very unlikely to ever exercise this code.  Adding an
optimization that is unlikely to be exercised is certainly questionable.

Memory offline is the only code that could benefit from this optimization.
As someone with more memory offline user experience, what is your opinion
Michal?
Michal Hocko Jan. 12, 2021, 8:33 a.m. UTC | #2
On Mon 11-01-21 17:20:51, Mike Kravetz wrote:
> On 1/10/21 4:40 AM, Muchun Song wrote:
> > There is a race between dissolve_free_huge_page() and put_page(),
> > and the race window is quite small. Theoretically, we should return
> > -EBUSY when we encounter this race. In fact, we have a chance to
> > successfully dissolve the page if we do a retry. Because the race
> > window is quite small. If we seize this opportunity, it is an
> > optimization for increasing the success rate of dissolving page.
> > 
> > If we free a HugeTLB page from a non-task context, it is deferred
> > through a workqueue. In this case, we need to flush the work.
> > 
> > The dissolve_free_huge_page() can be called from memory hotplug,
> > the caller aims to free the HugeTLB page to the buddy allocator
> > so that the caller can unplug the page successfully.
> > 
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> >  mm/hugetlb.c | 26 +++++++++++++++++++++-----
> >  1 file changed, 21 insertions(+), 5 deletions(-)
> 
> I am unsure about the need for this patch.  The code is OK, there are no
> issues with the code.
> 
> As mentioned in the commit message, this is an optimization and could
> potentially cause a memory offline operation to succeed instead of fail.
> However, we are very unlikely to ever exercise this code.  Adding an
> optimization that is unlikely to be exercised is certainly questionable.
> 
> Memory offline is the only code that could benefit from this optimization.
> As someone with more memory offline user experience, what is your opinion
> Michal?

I am not a great fun of optimizations without any data to back them up.
I do not see any sign this code has been actually tested and the
condition triggered.

Besides that I have requested to have an explanation of why blocking on
the WQ is safe and that hasn't happened.
Muchun Song Jan. 12, 2021, 9:51 a.m. UTC | #3
On Tue, Jan 12, 2021 at 4:33 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 11-01-21 17:20:51, Mike Kravetz wrote:
> > On 1/10/21 4:40 AM, Muchun Song wrote:
> > > There is a race between dissolve_free_huge_page() and put_page(),
> > > and the race window is quite small. Theoretically, we should return
> > > -EBUSY when we encounter this race. In fact, we have a chance to
> > > successfully dissolve the page if we do a retry. Because the race
> > > window is quite small. If we seize this opportunity, it is an
> > > optimization for increasing the success rate of dissolving page.
> > >
> > > If we free a HugeTLB page from a non-task context, it is deferred
> > > through a workqueue. In this case, we need to flush the work.
> > >
> > > The dissolve_free_huge_page() can be called from memory hotplug,
> > > the caller aims to free the HugeTLB page to the buddy allocator
> > > so that the caller can unplug the page successfully.
> > >
> > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > ---
> > >  mm/hugetlb.c | 26 +++++++++++++++++++++-----
> > >  1 file changed, 21 insertions(+), 5 deletions(-)
> >
> > I am unsure about the need for this patch.  The code is OK, there are no
> > issues with the code.
> >
> > As mentioned in the commit message, this is an optimization and could
> > potentially cause a memory offline operation to succeed instead of fail.
> > However, we are very unlikely to ever exercise this code.  Adding an
> > optimization that is unlikely to be exercised is certainly questionable.
> >
> > Memory offline is the only code that could benefit from this optimization.
> > As someone with more memory offline user experience, what is your opinion
> > Michal?
>
> I am not a great fun of optimizations without any data to back them up.
> I do not see any sign this code has been actually tested and the
> condition triggered.

This race is quite small. I only trigger this only once on my server.
And then the kernel panic. So I sent this patch series to fix some
bugs.

>
> Besides that I have requested to have an explanation of why blocking on
> the WQ is safe and that hasn't happened.

I have seen all the caller of dissolve_free_huge_page, some caller is under
page lock (via lock_page). Others are also under a sleep context.

So I think that blocking on the WQ is safe. Right?

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Jan. 12, 2021, 10:06 a.m. UTC | #4
On Tue 12-01-21 17:51:05, Muchun Song wrote:
> On Tue, Jan 12, 2021 at 4:33 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 11-01-21 17:20:51, Mike Kravetz wrote:
> > > On 1/10/21 4:40 AM, Muchun Song wrote:
> > > > There is a race between dissolve_free_huge_page() and put_page(),
> > > > and the race window is quite small. Theoretically, we should return
> > > > -EBUSY when we encounter this race. In fact, we have a chance to
> > > > successfully dissolve the page if we do a retry. Because the race
> > > > window is quite small. If we seize this opportunity, it is an
> > > > optimization for increasing the success rate of dissolving page.
> > > >
> > > > If we free a HugeTLB page from a non-task context, it is deferred
> > > > through a workqueue. In this case, we need to flush the work.
> > > >
> > > > The dissolve_free_huge_page() can be called from memory hotplug,
> > > > the caller aims to free the HugeTLB page to the buddy allocator
> > > > so that the caller can unplug the page successfully.
> > > >
> > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > > ---
> > > >  mm/hugetlb.c | 26 +++++++++++++++++++++-----
> > > >  1 file changed, 21 insertions(+), 5 deletions(-)
> > >
> > > I am unsure about the need for this patch.  The code is OK, there are no
> > > issues with the code.
> > >
> > > As mentioned in the commit message, this is an optimization and could
> > > potentially cause a memory offline operation to succeed instead of fail.
> > > However, we are very unlikely to ever exercise this code.  Adding an
> > > optimization that is unlikely to be exercised is certainly questionable.
> > >
> > > Memory offline is the only code that could benefit from this optimization.
> > > As someone with more memory offline user experience, what is your opinion
> > > Michal?
> >
> > I am not a great fun of optimizations without any data to back them up.
> > I do not see any sign this code has been actually tested and the
> > condition triggered.
> 
> This race is quite small. I only trigger this only once on my server.
> And then the kernel panic. So I sent this patch series to fix some
> bugs.

Memory hotplug shouldn't panic when this race happens. Are you sure you
have seen a race that is directly related to this patch?

> > Besides that I have requested to have an explanation of why blocking on
> > the WQ is safe and that hasn't happened.
> 
> I have seen all the caller of dissolve_free_huge_page, some caller is under
> page lock (via lock_page). Others are also under a sleep context.
> 
> So I think that blocking on the WQ is safe. Right?

I have requested to explicitly write your thinking why this is safe so
that we can double check it. Dependency on a work queue progress is much
more complex than any other locks because there is no guarantee that WQ
will make forward progress (all workers might be stuck, new workers not
able to be created etc.).
Muchun Song Jan. 12, 2021, 10:49 a.m. UTC | #5
On Tue, Jan 12, 2021 at 6:06 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 12-01-21 17:51:05, Muchun Song wrote:
> > On Tue, Jan 12, 2021 at 4:33 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 11-01-21 17:20:51, Mike Kravetz wrote:
> > > > On 1/10/21 4:40 AM, Muchun Song wrote:
> > > > > There is a race between dissolve_free_huge_page() and put_page(),
> > > > > and the race window is quite small. Theoretically, we should return
> > > > > -EBUSY when we encounter this race. In fact, we have a chance to
> > > > > successfully dissolve the page if we do a retry. Because the race
> > > > > window is quite small. If we seize this opportunity, it is an
> > > > > optimization for increasing the success rate of dissolving page.
> > > > >
> > > > > If we free a HugeTLB page from a non-task context, it is deferred
> > > > > through a workqueue. In this case, we need to flush the work.
> > > > >
> > > > > The dissolve_free_huge_page() can be called from memory hotplug,
> > > > > the caller aims to free the HugeTLB page to the buddy allocator
> > > > > so that the caller can unplug the page successfully.
> > > > >
> > > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > > > ---
> > > > >  mm/hugetlb.c | 26 +++++++++++++++++++++-----
> > > > >  1 file changed, 21 insertions(+), 5 deletions(-)
> > > >
> > > > I am unsure about the need for this patch.  The code is OK, there are no
> > > > issues with the code.
> > > >
> > > > As mentioned in the commit message, this is an optimization and could
> > > > potentially cause a memory offline operation to succeed instead of fail.
> > > > However, we are very unlikely to ever exercise this code.  Adding an
> > > > optimization that is unlikely to be exercised is certainly questionable.
> > > >
> > > > Memory offline is the only code that could benefit from this optimization.
> > > > As someone with more memory offline user experience, what is your opinion
> > > > Michal?
> > >
> > > I am not a great fun of optimizations without any data to back them up.
> > > I do not see any sign this code has been actually tested and the
> > > condition triggered.
> >
> > This race is quite small. I only trigger this only once on my server.
> > And then the kernel panic. So I sent this patch series to fix some
> > bugs.
>
> Memory hotplug shouldn't panic when this race happens. Are you sure you
> have seen a race that is directly related to this patch?

I mean the panic is fixed by:

  [PATCH v3 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

>
> > > Besides that I have requested to have an explanation of why blocking on
> > > the WQ is safe and that hasn't happened.
> >
> > I have seen all the caller of dissolve_free_huge_page, some caller is under
> > page lock (via lock_page). Others are also under a sleep context.
> >
> > So I think that blocking on the WQ is safe. Right?
>
> I have requested to explicitly write your thinking why this is safe so
> that we can double check it. Dependency on a work queue progress is much
> more complex than any other locks because there is no guarantee that WQ
> will make forward progress (all workers might be stuck, new workers not
> able to be created etc.).

OK. I know about your concern. How about setting the page as temporary
when hitting this race?

 int dissolve_free_huge_page(struct page *page)
 {
@@ -1793,8 +1794,10 @@ int dissolve_free_huge_page(struct page *page)
                 * We should make sure that the page is already on the free list
                 * when it is dissolved.
                 */
-               if (unlikely(!PageHugeFreed(head)))
+               if (unlikely(!PageHugeFreed(head))) {
+                       SetPageHugeTemporary(page)
                        goto out;
+               }

Setting the page as temporary and just return -EBUSY (do not flush
the work). __free_huge_page() will free it to the buddy allocator later.

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Jan. 12, 2021, 11:11 a.m. UTC | #6
On Tue 12-01-21 18:49:17, Muchun Song wrote:
> On Tue, Jan 12, 2021 at 6:06 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 12-01-21 17:51:05, Muchun Song wrote:
> > > On Tue, Jan 12, 2021 at 4:33 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 11-01-21 17:20:51, Mike Kravetz wrote:
> > > > > On 1/10/21 4:40 AM, Muchun Song wrote:
> > > > > > There is a race between dissolve_free_huge_page() and put_page(),
> > > > > > and the race window is quite small. Theoretically, we should return
> > > > > > -EBUSY when we encounter this race. In fact, we have a chance to
> > > > > > successfully dissolve the page if we do a retry. Because the race
> > > > > > window is quite small. If we seize this opportunity, it is an
> > > > > > optimization for increasing the success rate of dissolving page.
> > > > > >
> > > > > > If we free a HugeTLB page from a non-task context, it is deferred
> > > > > > through a workqueue. In this case, we need to flush the work.
> > > > > >
> > > > > > The dissolve_free_huge_page() can be called from memory hotplug,
> > > > > > the caller aims to free the HugeTLB page to the buddy allocator
> > > > > > so that the caller can unplug the page successfully.
> > > > > >
> > > > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > > > > ---
> > > > > >  mm/hugetlb.c | 26 +++++++++++++++++++++-----
> > > > > >  1 file changed, 21 insertions(+), 5 deletions(-)
> > > > >
> > > > > I am unsure about the need for this patch.  The code is OK, there are no
> > > > > issues with the code.
> > > > >
> > > > > As mentioned in the commit message, this is an optimization and could
> > > > > potentially cause a memory offline operation to succeed instead of fail.
> > > > > However, we are very unlikely to ever exercise this code.  Adding an
> > > > > optimization that is unlikely to be exercised is certainly questionable.
> > > > >
> > > > > Memory offline is the only code that could benefit from this optimization.
> > > > > As someone with more memory offline user experience, what is your opinion
> > > > > Michal?
> > > >
> > > > I am not a great fun of optimizations without any data to back them up.
> > > > I do not see any sign this code has been actually tested and the
> > > > condition triggered.
> > >
> > > This race is quite small. I only trigger this only once on my server.
> > > And then the kernel panic. So I sent this patch series to fix some
> > > bugs.
> >
> > Memory hotplug shouldn't panic when this race happens. Are you sure you
> > have seen a race that is directly related to this patch?
> 
> I mean the panic is fixed by:
> 
>   [PATCH v3 3/6] mm: hugetlb: fix a race between freeing and dissolving the page

OK, so the answer is that this is not really triggered by any real life
problem. Can you actually trigger it intentionally?
 
> > > > Besides that I have requested to have an explanation of why blocking on
> > > > the WQ is safe and that hasn't happened.
> > >
> > > I have seen all the caller of dissolve_free_huge_page, some caller is under
> > > page lock (via lock_page). Others are also under a sleep context.
> > >
> > > So I think that blocking on the WQ is safe. Right?
> >
> > I have requested to explicitly write your thinking why this is safe so
> > that we can double check it. Dependency on a work queue progress is much
> > more complex than any other locks because there is no guarantee that WQ
> > will make forward progress (all workers might be stuck, new workers not
> > able to be created etc.).
> 
> OK. I know about your concern. How about setting the page as temporary
> when hitting this race?
> 
>  int dissolve_free_huge_page(struct page *page)
>  {
> @@ -1793,8 +1794,10 @@ int dissolve_free_huge_page(struct page *page)
>                  * We should make sure that the page is already on the free list
>                  * when it is dissolved.
>                  */
> -               if (unlikely(!PageHugeFreed(head)))
> +               if (unlikely(!PageHugeFreed(head))) {
> +                       SetPageHugeTemporary(page)
>                         goto out;
> +               }
> 
> Setting the page as temporary and just return -EBUSY (do not flush
> the work). __free_huge_page() will free it to the buddy allocator later.

Can we stop these subtle hacks please? Temporary page is meant to
represent unaccounted temporary page for migration. This has nothing to
do with it.
Muchun Song Jan. 12, 2021, 11:34 a.m. UTC | #7
On Tue, Jan 12, 2021 at 7:12 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 12-01-21 18:49:17, Muchun Song wrote:
> > On Tue, Jan 12, 2021 at 6:06 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 12-01-21 17:51:05, Muchun Song wrote:
> > > > On Tue, Jan 12, 2021 at 4:33 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 11-01-21 17:20:51, Mike Kravetz wrote:
> > > > > > On 1/10/21 4:40 AM, Muchun Song wrote:
> > > > > > > There is a race between dissolve_free_huge_page() and put_page(),
> > > > > > > and the race window is quite small. Theoretically, we should return
> > > > > > > -EBUSY when we encounter this race. In fact, we have a chance to
> > > > > > > successfully dissolve the page if we do a retry. Because the race
> > > > > > > window is quite small. If we seize this opportunity, it is an
> > > > > > > optimization for increasing the success rate of dissolving page.
> > > > > > >
> > > > > > > If we free a HugeTLB page from a non-task context, it is deferred
> > > > > > > through a workqueue. In this case, we need to flush the work.
> > > > > > >
> > > > > > > The dissolve_free_huge_page() can be called from memory hotplug,
> > > > > > > the caller aims to free the HugeTLB page to the buddy allocator
> > > > > > > so that the caller can unplug the page successfully.
> > > > > > >
> > > > > > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > > > > > ---
> > > > > > >  mm/hugetlb.c | 26 +++++++++++++++++++++-----
> > > > > > >  1 file changed, 21 insertions(+), 5 deletions(-)
> > > > > >
> > > > > > I am unsure about the need for this patch.  The code is OK, there are no
> > > > > > issues with the code.
> > > > > >
> > > > > > As mentioned in the commit message, this is an optimization and could
> > > > > > potentially cause a memory offline operation to succeed instead of fail.
> > > > > > However, we are very unlikely to ever exercise this code.  Adding an
> > > > > > optimization that is unlikely to be exercised is certainly questionable.
> > > > > >
> > > > > > Memory offline is the only code that could benefit from this optimization.
> > > > > > As someone with more memory offline user experience, what is your opinion
> > > > > > Michal?
> > > > >
> > > > > I am not a great fun of optimizations without any data to back them up.
> > > > > I do not see any sign this code has been actually tested and the
> > > > > condition triggered.
> > > >
> > > > This race is quite small. I only trigger this only once on my server.
> > > > And then the kernel panic. So I sent this patch series to fix some
> > > > bugs.
> > >
> > > Memory hotplug shouldn't panic when this race happens. Are you sure you
> > > have seen a race that is directly related to this patch?
> >
> > I mean the panic is fixed by:
> >
> >   [PATCH v3 3/6] mm: hugetlb: fix a race between freeing and dissolving the page
>
> OK, so the answer is that this is not really triggered by any real life
> problem. Can you actually trigger it intentionally?
>
> > > > > Besides that I have requested to have an explanation of why blocking on
> > > > > the WQ is safe and that hasn't happened.
> > > >
> > > > I have seen all the caller of dissolve_free_huge_page, some caller is under
> > > > page lock (via lock_page). Others are also under a sleep context.
> > > >
> > > > So I think that blocking on the WQ is safe. Right?
> > >
> > > I have requested to explicitly write your thinking why this is safe so
> > > that we can double check it. Dependency on a work queue progress is much
> > > more complex than any other locks because there is no guarantee that WQ
> > > will make forward progress (all workers might be stuck, new workers not
> > > able to be created etc.).
> >
> > OK. I know about your concern. How about setting the page as temporary
> > when hitting this race?
> >
> >  int dissolve_free_huge_page(struct page *page)
> >  {
> > @@ -1793,8 +1794,10 @@ int dissolve_free_huge_page(struct page *page)
> >                  * We should make sure that the page is already on the free list
> >                  * when it is dissolved.
> >                  */
> > -               if (unlikely(!PageHugeFreed(head)))
> > +               if (unlikely(!PageHugeFreed(head))) {
> > +                       SetPageHugeTemporary(page)
> >                         goto out;
> > +               }
> >
> > Setting the page as temporary and just return -EBUSY (do not flush
> > the work). __free_huge_page() will free it to the buddy allocator later.
>
> Can we stop these subtle hacks please? Temporary page is meant to
> represent unaccounted temporary page for migration. This has nothing to
> do with it.

Sure. Can drop this patch.

> --
> Michal Hocko
> SUSE Labs
diff mbox series

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4a9011e12175..a176ceed55f1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1763,10 +1763,11 @@  static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
  * nothing for in-use hugepages and non-hugepages.
  * This function returns values like below:
  *
- *  -EBUSY: failed to dissolved free hugepages or the hugepage is in-use
- *          (allocated or reserved.)
- *       0: successfully dissolved free hugepages or the page is not a
- *          hugepage (considered as already dissolved)
+ *  -EAGAIN: race with __free_huge_page() and can do a retry
+ *  -EBUSY:  failed to dissolved free hugepages or the hugepage is in-use
+ *           (allocated or reserved.)
+ *       0:  successfully dissolved free hugepages or the page is not a
+ *           hugepage (considered as already dissolved)
  */
 int dissolve_free_huge_page(struct page *page)
 {
@@ -1793,8 +1794,10 @@  int dissolve_free_huge_page(struct page *page)
 		 * We should make sure that the page is already on the free list
 		 * when it is dissolved.
 		 */
-		if (unlikely(!PageHugeFreed(head)))
+		if (unlikely(!PageHugeFreed(head))) {
+			rc = -EAGAIN;
 			goto out;
+		}
 
 		/*
 		 * Move PageHWPoison flag from head page to the raw error page,
@@ -1813,6 +1816,14 @@  int dissolve_free_huge_page(struct page *page)
 	}
 out:
 	spin_unlock(&hugetlb_lock);
+
+	/*
+	 * If the freeing of the HugeTLB page is put on a work queue, we should
+	 * flush the work before retrying.
+	 */
+	if (unlikely(rc == -EAGAIN))
+		flush_work(&free_hpage_work);
+
 	return rc;
 }
 
@@ -1835,7 +1846,12 @@  int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) {
 		page = pfn_to_page(pfn);
+retry:
 		rc = dissolve_free_huge_page(page);
+		if (rc == -EAGAIN) {
+			cpu_relax();
+			goto retry;
+		}
 		if (rc)
 			break;
 	}