diff mbox series

mm,hwpoison: Return -EBUSY when migration fails

Message ID 20201209092818.30417-1-osalvador@suse.de (mailing list archive)
State New, archived
Headers show
Series mm,hwpoison: Return -EBUSY when migration fails | expand

Commit Message

Oscar Salvador Dec. 9, 2020, 9:28 a.m. UTC
Currently, we return -EIO when we fail to migrate the page.

Migrations' failures are rather transient as they can happen due to
several reasons, e.g: high page refcount bump, mapping->migrate_page
failing etc.
All meaning that at that time the page could not be migrated, but
that has nothing to do with an EIO error.

Let us return -EBUSY instead, as we do in case we failed to isolate
the page.

While are it, let us remove the "ret" print as its value does not change.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory-failure.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

HORIGUCHI NAOYA(堀口 直也) Dec. 9, 2020, 9:57 a.m. UTC | #1
On Wed, Dec 09, 2020 at 10:28:18AM +0100, Oscar Salvador wrote:
> Currently, we return -EIO when we fail to migrate the page.
> 
> Migrations' failures are rather transient as they can happen due to
> several reasons, e.g: high page refcount bump, mapping->migrate_page
> failing etc.
> All meaning that at that time the page could not be migrated, but
> that has nothing to do with an EIO error.
> 
> Let us return -EBUSY instead, as we do in case we failed to isolate
> the page.
> 
> While are it, let us remove the "ret" print as its value does not change.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
David Hildenbrand Dec. 9, 2020, 9:59 a.m. UTC | #2
On 09.12.20 10:28, Oscar Salvador wrote:
> Currently, we return -EIO when we fail to migrate the page.
> 
> Migrations' failures are rather transient as they can happen due to
> several reasons, e.g: high page refcount bump, mapping->migrate_page
> failing etc.
> All meaning that at that time the page could not be migrated, but
> that has nothing to do with an EIO error.
> 
> Let us return -EBUSY instead, as we do in case we failed to isolate
> the page.
> 
> While are it, let us remove the "ret" print as its value does not change.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  mm/memory-failure.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 428991e297e2..1942fb83ac64 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1849,11 +1849,11 @@ static int __soft_offline_page(struct page *page)
>  			pr_info("soft offline: %#lx: %s migration failed %d, type %lx (%pGp)\n",
>  				pfn, msg_page[huge], ret, page->flags, &page->flags);
>  			if (ret > 0)
> -				ret = -EIO;
> +				ret = -EBUSY;

Do we expect callers to retry immediately? -EAGAIN might make also
sense. But -EBUSY is an obvious improvement. Do we have callers relying
on this behavior?
Vlastimil Babka Dec. 9, 2020, 10:25 a.m. UTC | #3
On 12/9/20 10:28 AM, Oscar Salvador wrote:
> Currently, we return -EIO when we fail to migrate the page.
> 
> Migrations' failures are rather transient as they can happen due to
> several reasons, e.g: high page refcount bump, mapping->migrate_page
> failing etc.
> All meaning that at that time the page could not be migrated, but
> that has nothing to do with an EIO error.
> 
> Let us return -EBUSY instead, as we do in case we failed to isolate
> the page.
> 
> While are it, let us remove the "ret" print as its value does not change.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Technically this affects madvise(2) so let's cc linux-api. The manpage doesn't
document error codes of MADV_HWPOISON and MADV_SOFT_OFFLINE (besides EPERM)
though so nothing to adjust there. It's meant only for the hwpoison testing
suite anyway.

> ---
>  mm/memory-failure.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 428991e297e2..1942fb83ac64 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1849,11 +1849,11 @@ static int __soft_offline_page(struct page *page)
>  			pr_info("soft offline: %#lx: %s migration failed %d, type %lx (%pGp)\n",
>  				pfn, msg_page[huge], ret, page->flags, &page->flags);
>  			if (ret > 0)
> -				ret = -EIO;
> +				ret = -EBUSY;
>  		}
>  	} else {
> -		pr_info("soft offline: %#lx: %s isolation failed: %d, page count %d, type %lx (%pGp)\n",
> -			pfn, msg_page[huge], ret, page_count(page), page->flags, &page->flags);
> +		pr_info("soft offline: %#lx: %s isolation failed, page count %d, type %lx (%pGp)\n",
> +			pfn, msg_page[huge], page_count(page), page->flags, &page->flags);
>  		ret = -EBUSY;
>  	}
>  	return ret;
>
Oscar Salvador Dec. 9, 2020, 10:36 a.m. UTC | #4
On Wed, Dec 09, 2020 at 10:59:04AM +0100, David Hildenbrand wrote:
> On 09.12.20 10:28, Oscar Salvador wrote:
> Do we expect callers to retry immediately? -EAGAIN might make also
> sense. But -EBUSY is an obvious improvement. Do we have callers relying
> on this behavior?

Not really, unless something LTP takes a look at the error code in retries
in case EBUSY.

Take into account that most of the callers do not even really check the
return code (GHES, RAS/CEC, etc.)
Oscar Salvador Dec. 9, 2020, 10:38 a.m. UTC | #5
On Wed, Dec 09, 2020 at 11:25:31AM +0100, Vlastimil Babka wrote:
> On 12/9/20 10:28 AM, Oscar Salvador wrote:
> > Currently, we return -EIO when we fail to migrate the page.
> > 
> > Migrations' failures are rather transient as they can happen due to
> > several reasons, e.g: high page refcount bump, mapping->migrate_page
> > failing etc.
> > All meaning that at that time the page could not be migrated, but
> > that has nothing to do with an EIO error.
> > 
> > Let us return -EBUSY instead, as we do in case we failed to isolate
> > the page.
> > 
> > While are it, let us remove the "ret" print as its value does not change.
> > 
> > Signed-off-by: Oscar Salvador <osalvador@suse.de>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Technically this affects madvise(2) so let's cc linux-api. The manpage doesn't
> document error codes of MADV_HWPOISON and MADV_SOFT_OFFLINE (besides EPERM)
> though so nothing to adjust there. It's meant only for the hwpoison testing
> suite anyway.

Well, not only for hwpoison testing suite.

RAS/CEC and GHES also use soft_offline_page by means of memory_failure_queue
in case a page cec count goes beyond a certain thereshold, but they do not
really check the return code.

Only madvise does.
diff mbox series

Patch

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 428991e297e2..1942fb83ac64 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1849,11 +1849,11 @@  static int __soft_offline_page(struct page *page)
 			pr_info("soft offline: %#lx: %s migration failed %d, type %lx (%pGp)\n",
 				pfn, msg_page[huge], ret, page->flags, &page->flags);
 			if (ret > 0)
-				ret = -EIO;
+				ret = -EBUSY;
 		}
 	} else {
-		pr_info("soft offline: %#lx: %s isolation failed: %d, page count %d, type %lx (%pGp)\n",
-			pfn, msg_page[huge], ret, page_count(page), page->flags, &page->flags);
+		pr_info("soft offline: %#lx: %s isolation failed, page count %d, type %lx (%pGp)\n",
+			pfn, msg_page[huge], page_count(page), page->flags, &page->flags);
 		ret = -EBUSY;
 	}
 	return ret;