
[1/1] mm: memory-failure: Re-split hw-poisoned huge page on -EAGAIN

Message ID 20231215081204.8802-1-qiuxu.zhuo@intel.com (mailing list archive)
State New
Series [1/1] mm: memory-failure: Re-split hw-poisoned huge page on -EAGAIN

Commit Message

Qiuxu Zhuo Dec. 15, 2023, 8:12 a.m. UTC
During the process of splitting a hw-poisoned huge page, it is possible
for the reference count of the huge page to be increased by the threads
within the affected process, leading to a failure in splitting the
hw-poisoned huge page with an error code of -EAGAIN.

This issue can be reproduced by injecting a memory error into a multi-threaded
process such that the error lands within a huge page.
The call path that returned -EAGAIN during the testing is shown below:

  memory_failure()
    try_to_split_thp_page()
      split_huge_page()
        split_huge_page_to_list() {
          ...
          Step A: can_split_folio() - Checked that the thp can be split.
          Step B: unmap_folio()
          Step C: folio_ref_freeze() - Failed and returned -EAGAIN.
          ...
        }

The testing logs indicated that some huge pages were split successfully
via the call path above (Step C was successful for these huge pages).
However, some huge pages failed to split due to a failure at Step C, and
it was observed that the reference count of the huge page increased between
Step A and Step C.
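
For illustration only, a hypothetical, software-only sketch of this kind of test is
shown below. It is not the setup used for the results above: it assumes an anonymous
2M THP and uses madvise(MADV_HWPOISON) (CAP_SYS_ADMIN and CONFIG_MEMORY_FAILURE
required) instead of the actual error injection, and the constants and helper names
are illustrative. It only mimics the pattern of threads pinning the huge page while
memory_failure() tries to split it:

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define THP_SIZE   (2UL << 20)   /* assumes 2M THPs */
  #define PAGE_SZ    4096UL
  #define NR_THREADS 8

  static char *region;
  static volatile int stop;

  /* Each thread keeps touching the huge page, taking references on it. */
  static void *toucher(void *arg)
  {
      while (!stop)
          for (unsigned long off = 0; off < THP_SIZE; off += PAGE_SZ)
              region[off]++;
      return NULL;
  }

  int main(void)
  {
      pthread_t tids[NR_THREADS];
      int i;

      region = aligned_alloc(THP_SIZE, THP_SIZE);
      if (!region)
          return 1;
      madvise(region, THP_SIZE, MADV_HUGEPAGE);  /* ask for a THP */
      memset(region, 0, THP_SIZE);               /* fault it in */

      for (i = 0; i < NR_THREADS; i++)
          pthread_create(&tids[i], NULL, toucher, NULL);
      sleep(1);

      /*
       * Poison one 4K page inside the huge page; memory_failure() then
       * tries to split the THP while the threads above keep pinning it.
       * The process is expected to be killed (SIGBUS) shortly afterwards.
       */
      if (madvise(region + PAGE_SZ, PAGE_SZ, MADV_HWPOISON))
          perror("MADV_HWPOISON");

      stop = 1;
      for (i = 0; i < NR_THREADS; i++)
          pthread_join(tids[i], NULL);
      return 0;
  }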

Testing has shown that after receiving -EAGAIN, simply re-splitting the
hw-poisoned huge page within memory_failure() always results in the same
-EAGAIN. This is likely because memory_failure() is executed in the context
of the affected process. Before this process exits memory_failure() and
is terminated, its threads could increase the reference count of the
hw-poisoned page.

To address this issue, employ a kernel worker to re-split the hw-poisoned
huge page. By the time this worker begins re-splitting the hw-poisoned huge
page, the affected process has already been terminated, preventing its
threads from increasing the reference count. Experimental results have
consistently shown that this worker successfully re-splits these
hw-poisoned huge pages on its first attempt.

The kernel log (before):
  [ 1116.862895] Memory failure: 0x4097fa7: recovery action for unsplit thp: Ignored

The kernel log (after):
  [  793.573536] Memory failure: 0x2100dda: recovery action for unsplit thp: Delayed
  [  793.574666] Memory failure: 0x2100dda: split unsplit thp successfully.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
 mm/memory-failure.c | 73 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 71 insertions(+), 2 deletions(-)

Comments

Naoya Horiguchi Dec. 19, 2023, 2:17 a.m. UTC | #1
On Fri, Dec 15, 2023 at 04:12:04PM +0800, Qiuxu Zhuo wrote:
> During the process of splitting a hw-poisoned huge page, it is possible
> for the reference count of the huge page to be increased by the threads
> within the affected process, leading to a failure in splitting the
> hw-poisoned huge page with an error code of -EAGAIN.
> 
> This issue can be reproduced by injecting a memory error into a multi-threaded
> process such that the error lands within a huge page.
> The call path that returned -EAGAIN during the testing is shown below:
> 
>   memory_failure()
>     try_to_split_thp_page()
>       split_huge_page()
>         split_huge_page_to_list() {
>           ...
>           Step A: can_split_folio() - Checked that the thp can be split.
>           Step B: unmap_folio()
>           Step C: folio_ref_freeze() - Failed and returned -EAGAIN.
>           ...
>         }
> 
> The testing logs indicated that some huge pages were split successfully
> via the call path above (Step C was successful for these huge pages).
> However, some huge pages failed to split due to a failure at Step C, and
> it was observed that the reference count of the huge page increased between
> Step A and Step C.
> 
> Testing has shown that after receiving -EAGAIN, simply re-splitting the
> hw-poisoned huge page within memory_failure() always results in the same
> -EAGAIN. This is likely because memory_failure() is executed in the context
> of the affected process. Before this process exits memory_failure() and
> is terminated, its threads could increase the reference count of the
> hw-poisoned page.
> 
> To address this issue, employ a kernel worker to re-split the hw-poisoned
> huge page. By the time this worker begins re-splitting the hw-poisoned huge
> page, the affected process has already been terminated, preventing its
> threads from increasing the reference count. Experimental results have
> consistently shown that this worker successfully re-splits these
> hw-poisoned huge pages on its first attempt.
> 
> The kernel log (before):
>   [ 1116.862895] Memory failure: 0x4097fa7: recovery action for unsplit thp: Ignored
> 
> The kernel log (after):
>   [  793.573536] Memory failure: 0x2100dda: recovery action for unsplit thp: Delayed
>   [  793.574666] Memory failure: 0x2100dda: split unsplit thp successfully.

I'm unclear about the user-visible benefit of ensuring that the error thp is split.
So could you explain it?

I think that the raw error page is not unmapped (with a hwpoisoned entry) after the
delayed re-splitting, so the recovery action seems incomplete even with this patch.
This patch then seems to just convert a hwpoisoned, unrecovered thp into a hwpoisoned,
unrecovered raw page.

Thanks,
Naoya Horiguchi
Miaohe Lin Dec. 19, 2023, 11:50 a.m. UTC | #2
On 2023/12/15 16:12, Qiuxu Zhuo wrote:
> During the process of splitting a hw-poisoned huge page, it is possible
> for the reference count of the huge page to be increased by the threads
> within the affected process, leading to a failure in splitting the
> hw-poisoned huge page with an error code of -EAGAIN.
> 
> This issue can be reproduced by injecting a memory error into a multi-threaded
> process such that the error lands within a huge page.
> The call path that returned -EAGAIN during the testing is shown below:
> 
>   memory_failure()
>     try_to_split_thp_page()
>       split_huge_page()
>         split_huge_page_to_list() {
>           ...
>           Step A: can_split_folio() - Checked that the thp can be split.
>           Step B: unmap_folio()
>           Step C: folio_ref_freeze() - Failed and returned -EAGAIN.
>           ...
>         }
> 
> The testing logs indicated that some huge pages were split successfully
> via the call path above (Step C was successful for these huge pages).
> However, some huge pages failed to split due to a failure at Step C, and
> it was observed that the reference count of the huge page increased between
> Step A and Step C.
> 
> Testing has shown that after receiving -EAGAIN, simply re-splitting the
> hw-poisoned huge page within memory_failure() always results in the same
> -EAGAIN. This is likely because memory_failure() is executed in the context
> of the affected process. Before this process exits memory_failure() and
> is terminated, its threads could increase the reference count of the
> hw-poisoned page.
> 
> To address this issue, employ a kernel worker to re-split the hw-poisoned
> huge page. By the time this worker begins re-splitting the hw-poisoned huge
> page, the affected process has already been terminated, preventing its
> threads from increasing the reference count. Experimental results have
> consistently shown that this worker successfully re-splits these
> hw-poisoned huge pages on its first attempt.
> 
> The kernel log (before):
>   [ 1116.862895] Memory failure: 0x4097fa7: recovery action for unsplit thp: Ignored
> 
> The kernel log (after):
>   [  793.573536] Memory failure: 0x2100dda: recovery action for unsplit thp: Delayed
>   [  793.574666] Memory failure: 0x2100dda: split unsplit thp successfully.
> 
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>

Thanks for your patch. In addition to Naoya's comment, I have some questions about the code itself.

> ---
>  mm/memory-failure.c | 73 +++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 71 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 660c21859118..0db4cf712a78 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -72,6 +72,60 @@ atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
>  
>  static bool hw_memory_failure __read_mostly = false;
>  
> +#define SPLIT_THP_MAX_RETRY_CNT		10
> +#define SPLIT_THP_INIT_DELAYED_MS	1
> +
> +static bool split_thp_pending;
> +
> +struct split_thp_req {
> +	struct delayed_work work;
> +	struct page *thp;
> +	int retries;
> +};
> +
> +static void split_thp_work_fn(struct work_struct *work)
> +{
> +	struct split_thp_req *req = container_of(work, typeof(*req), work.work);
> +	int ret;
> +
> +	/* Split the thp. */
> +	get_page(req->thp);

Can req->thp be freed when split_thp_work_fn is scheduled?

> +	lock_page(req->thp);
> +	ret = split_huge_page(req->thp);
> +	unlock_page(req->thp);
> +	put_page(req->thp);
> +
> +	/* Retry with an exponential backoff. */
> +	if (ret && ++req->retries < SPLIT_THP_MAX_RETRY_CNT) {
> +		schedule_delayed_work(to_delayed_work(work),
> +				      msecs_to_jiffies(SPLIT_THP_INIT_DELAYED_MS << req->retries));
> +		return;
> +	}
> +
> +	pr_err("%#lx: split unsplit thp %ssuccessfully.\n", page_to_pfn(req->thp), ret ? "un" : "");
> +	kfree(req);
> +	split_thp_pending = false;

split_thp_pending is not protected against a concurrent split_thp_delayed()? Though this race should be benign.

Thanks.
Qiuxu Zhuo Dec. 20, 2023, 8:44 a.m. UTC | #3
Hi Naoya Horiguchi,

Thanks for the review. 
See the comments below.

> From: Naoya Horiguchi <naoya.horiguchi@linux.dev>
> Sent: Tuesday, December 19, 2023 10:17 AM
> ...
> > The kernel log (before):
> >   [ 1116.862895] Memory failure: 0x4097fa7: recovery action for
> > unsplit thp: Ignored
> >
> > The kernel log (after):
> >   [  793.573536] Memory failure: 0x2100dda: recovery action for unsplit thp:
> Delayed
> >   [  793.574666] Memory failure: 0x2100dda: split unsplit thp successfully.
> 
> I'm unclear about the user-visible benefit of ensuring that the error thp is
> split.
> So could you explain about it?

During our testing, we observed that the hardware-poisoned huge page had been
mapped for the victim application's text and was present in the file cache.
Unfortunately, when attempting to restart the application without splitting the thp,
the restart failed: the application's text was remapped to the hardware-poisoned huge
page still sitting in the file cache, leading to its swift termination by another MCE.

So, after successfully re-splitting the unsplit thp (which drops the text mapping),
the application restart succeeds. I'll also add this description to the commit message
in v2.

> I think that the raw error page is not unmapped (with hwpoisoned entry)
> after delayed re-splitting, so recovery action seems not complete even with
> this patch.
> So this patch seems to just convert a hwpoisoned unrecovered thp into a
> hwpoisoned unrecovered raw page.

You're correct. Thanks for catching this.
Instead of creating a new work item just to split the thp, I'll leverage the existing
memory_failure_queue() to re-split the thp in v2, which should make the recovery action
more complete.
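
Roughly, the direction I have in mind is sketched below (illustration only, not the
actual v2 change; the flags passed to memory_failure_queue() and the handling of the
already-set PageHWPoison flag on re-entry are glossed over here):

  res = try_to_split_thp_page(p);
  if (res < 0) {
          if (res == -EAGAIN) {
                  /*
                   * Requeue the pfn via the existing memory_failure_queue(), so
                   * memory_failure() is retried later from workqueue context,
                   * after the affected process has exited and can no longer pin
                   * the huge page.
                   */
                  memory_failure_queue(pfn, flags);
                  res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_DELAYED);
          } else {
                  res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
          }
          goto unlock_mutex;
  }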
 
-Qiuxu
Qiuxu Zhuo Dec. 20, 2023, 8:56 a.m. UTC | #4
Hi Miaohe,

Thanks for the review.
Please see the comments below.

> From: Miaohe Lin <linmiaohe@huawei.com>
> ...
> > +
> > +static void split_thp_work_fn(struct work_struct *work) {
> > +	struct split_thp_req *req = container_of(work, typeof(*req),
> work.work);
> > +	int ret;
> > +
> > +	/* Split the thp. */
> > +	get_page(req->thp);
> 
> Can req->thp be freed when split_thp_work_fn is scheduled ?

It's possible. Thanks for catching this.

Instead of adding a new work item to re-split the thp,
I'll leverage the existing memory_failure_queue() to re-split the thp in v2.

> 
> > +	lock_page(req->thp);
> > +	ret = split_huge_page(req->thp);
> > +	unlock_page(req->thp);
> > +	put_page(req->thp);
> > +
> > +	/* Retry with an exponential backoff. */
> > +	if (ret && ++req->retries < SPLIT_THP_MAX_RETRY_CNT) {
> > +		schedule_delayed_work(to_delayed_work(work),
> > +
> msecs_to_jiffies(SPLIT_THP_INIT_DELAYED_MS << req->retries));
> > +		return;
> > +	}
> > +
> > +	pr_err("%#lx: split unsplit thp %ssuccessfully.\n", page_to_pfn(req-
> >thp), ret ? "un" : "");
> > +	kfree(req);
> > +	split_thp_pending = false;
> 
> split_thp_pending is not protected against split_thp_delayed? Though this
> race should be benign.

Thanks for being concerned about this.

The read-check-modify of "split_thp_pending" is protected by mf_mutex, and the
worker only sets it to false (it never reads it), so in theory there is no race here.

I will leverage the existing memory_failure_queue() in v2, so there should be no
such concern about this race.

Patch

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 660c21859118..0db4cf712a78 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -72,6 +72,60 @@  atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
 static bool hw_memory_failure __read_mostly = false;
 
+#define SPLIT_THP_MAX_RETRY_CNT		10
+#define SPLIT_THP_INIT_DELAYED_MS	1
+
+static bool split_thp_pending;
+
+struct split_thp_req {
+	struct delayed_work work;
+	struct page *thp;
+	int retries;
+};
+
+static void split_thp_work_fn(struct work_struct *work)
+{
+	struct split_thp_req *req = container_of(work, typeof(*req), work.work);
+	int ret;
+
+	/* Split the thp. */
+	get_page(req->thp);
+	lock_page(req->thp);
+	ret = split_huge_page(req->thp);
+	unlock_page(req->thp);
+	put_page(req->thp);
+
+	/* Retry with an exponential backoff. */
+	if (ret && ++req->retries < SPLIT_THP_MAX_RETRY_CNT) {
+		schedule_delayed_work(to_delayed_work(work),
+				      msecs_to_jiffies(SPLIT_THP_INIT_DELAYED_MS << req->retries));
+		return;
+	}
+
+	pr_err("%#lx: split unsplit thp %ssuccessfully.\n", page_to_pfn(req->thp), ret ? "un" : "");
+	kfree(req);
+	split_thp_pending = false;
+}
+
+static bool split_thp_delayed(struct page *thp)
+{
+	struct split_thp_req *req;
+
+	if (split_thp_pending)
+		return false;
+
+	req = kmalloc(sizeof(*req), GFP_ATOMIC);
+	if (!req)
+		return false;
+
+	req->thp = thp;
+	req->retries = 0;
+	INIT_DELAYED_WORK(&req->work, split_thp_work_fn);
+	split_thp_pending = true;
+	schedule_delayed_work(&req->work, msecs_to_jiffies(SPLIT_THP_INIT_DELAYED_MS));
+	return true;
+}
+
 static DEFINE_MUTEX(mf_mutex);
 
 void num_poisoned_pages_inc(unsigned long pfn)
@@ -2275,8 +2329,23 @@  int memory_failure(unsigned long pfn, int flags)
 		 * page is a valid handlable page.
 		 */
 		SetPageHasHWPoisoned(hpage);
-		if (try_to_split_thp_page(p) < 0) {
-			res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
+		res = try_to_split_thp_page(p);
+		if (res < 0) {
+			/*
+			 * Re-attempting try_to_split_thp_page() here could consistently
+			 * yield -EAGAIN, as the threads of the process may increment the
+			 * reference count of the huge page before the process exits
+			 * memory_failure() and terminates.
+			 *
+			 * Employ the kernel worker to re-split the huge page. By the time
+			 * this worker initiates the re-splitting process, the affected
+			 * process has already been terminated, preventing its threads from
+			 * incrementing the reference count.
+			 */
+			if (res == -EAGAIN && split_thp_delayed(p))
+				res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_DELAYED);
+			else
+				res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
 			goto unlock_mutex;
 		}
 		VM_BUG_ON_PAGE(!page_count(p), p);