| Message ID | 20181114004059.1287439-1-pjaroszynski@nvidia.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | Fix do_move_pages_to_node() error handling |
On Tue 13-11-18 16:40:59, p.jaroszynski@gmail.com wrote:
> From: Piotr Jaroszynski <pjaroszynski@nvidia.com>
>
> migrate_pages() can return the number of pages that failed to migrate
> instead of 0 or an error code. If that happens, the positive return is
> treated as an error all the way up through the stack leading to the
> move_pages() syscall returning a positive number. I believe this
> regressed with commit a49bd4d71637 ("mm, numa: rework do_pages_move")
> that refactored a lot of this code.

Yes this is correct.

> Fix this by treating positive returns as success in
> do_move_pages_to_node() as that seems to most closely follow the
> previous code. This still leaves the question whether silently
> considering this case a success is the right thing to do as even the
> status of the pages will be set as if they were successfully migrated,
> but that seems to have been the case before as well.

Yes, I believe the previous semantic was just wrong and we want to fix
it. Jan has already brought this up [1]. I believe we want to update the
documentation rather than restore the previous hazy semantic.

Just wondering, how have you found out? Is there any real application
failing because of the change or is this a result of some test?

[1] http://lkml.kernel.org/r/0329efa0984b9b0252ef166abb4498c0795fab36.1535113317.git.jstancek@redhat.com

> Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move")
> Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com>
> ---
>  mm/migrate.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8baeb7ff2f6d..b42efef780d6 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1461,6 +1461,7 @@ static int store_status(int __user *status, int start, int value, int nr)
>  	return 0;
>  }
>
> +/* Returns 0 or an error code. */
>  static int do_move_pages_to_node(struct mm_struct *mm,
>  		struct list_head *pagelist, int node)
>  {
> @@ -1473,6 +1474,15 @@ static int do_move_pages_to_node(struct mm_struct *mm,
>  			MIGRATE_SYNC, MR_SYSCALL);
>  	if (err)
>  		putback_movable_pages(pagelist);
> +
> +	/*
> +	 * migrate_pages() can return the number of not migrated pages, but the
> +	 * callers of do_move_pages_to_node() only care about and handle hard
> +	 * failures.
> +	 */
> +	if (err > 0)
> +		err = 0;
> +
>  	return err;
>  }
>
> --
> 2.11.0.262.g4b0a5b2.dirty
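The regression and the effect of the fix can be modeled outside the kernel with a small self-contained sketch. The function below is hypothetical (the real logic lives in mm/migrate.c) and only mirrors the return-value handling under discussion:

```c
#include <assert.h>

/*
 * Hypothetical model of do_move_pages_to_node()'s return handling.
 * migrate_pages() returns 0 on full success, a negative errno on hard
 * failure, or the number of pages it failed to migrate. Before the
 * fix, a positive count leaked all the way up and became the
 * move_pages() syscall return; the fix squashes positives to 0 since
 * the callers only care about hard failures.
 */
static int model_do_move_pages_to_node(int migrate_pages_ret, int fixed)
{
	int err = migrate_pages_ret;

	if (fixed && err > 0)
		err = 0;	/* partial failure is not a hard error */
	return err;
}
```

With `fixed == 0` the model reproduces the regression: a caller that treats any non-zero return from move_pages() as an error would misreport a merely partial migration.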
On Wed 14-11-18 08:34:15, Michal Hocko wrote:
> On Tue 13-11-18 16:40:59, p.jaroszynski@gmail.com wrote:
> > From: Piotr Jaroszynski <pjaroszynski@nvidia.com>
> >
> > migrate_pages() can return the number of pages that failed to migrate
> > instead of 0 or an error code. If that happens, the positive return is
> > treated as an error all the way up through the stack leading to the
> > move_pages() syscall returning a positive number. I believe this
> > regressed with commit a49bd4d71637 ("mm, numa: rework do_pages_move")
> > that refactored a lot of this code.
>
> Yes this is correct.
>
> > Fix this by treating positive returns as success in
> > do_move_pages_to_node() as that seems to most closely follow the
> > previous code. This still leaves the question whether silently
> > considering this case a success is the right thing to do as even the
> > status of the pages will be set as if they were successfully migrated,
> > but that seems to have been the case before as well.
>
> Yes, I believe the previous semantic was just wrong and we want to fix
> it. Jan has already brought this up [1]. I believe we want to update the
> documentation rather than restore the previous hazy semantic.
>
> Just wondering, how have you found out? Is there any real application
> failing because of the change or is this a result of some test?
>
> [1] http://lkml.kernel.org/r/0329efa0984b9b0252ef166abb4498c0795fab36.1535113317.git.jstancek@redhat.com

Btw. this is what I was suggesting back then (along with the man page
update suggested by Jan)

From cfb88c266b645197135cde2905c2bfc82f6d82a9 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 14 Nov 2018 12:19:09 +0100
Subject: [PATCH] mm: fix do_pages_move error reporting

a49bd4d71637 ("mm, numa: rework do_pages_move") has changed the way how
we report error to layers above. As the changelog mentioned the semantic
was quite unclear previously because the return 0 could mean both
success and failure.

The above mentioned commit didn't get all the way down to fix this
completely because it doesn't report pages that we even haven't
attempted to migrate and therefore we cannot simply say that the
semantic is:
	- err < 0 - errno
	- err >= 0 - number of non-migrated pages

Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move")
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/migrate.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index f7e4bfdc13b7..aa53ebc523eb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1615,8 +1615,16 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 			goto out_flush;
 
 		err = do_move_pages_to_node(mm, &pagelist, current_node);
-		if (err)
+		if (err) {
+			/*
+			 * Positive err means the number of pages that failed
+			 * to migrate. Make sure to report the rest of
+			 * nr_pages as not migrated as well.
+			 */
+			if (err > 0)
+				err += nr_pages - i - 1;
 			goto out;
+		}
 		if (i > start) {
 			err = store_status(status, start, current_node, i - start);
 			if (err)
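The arithmetic in the hunk above can be exercised in isolation. This stand-alone mirror (function name hypothetical) shows how a partial failure at index i is widened to cover the never-attempted tail of the request:

```c
#include <assert.h>

/*
 * Mirror of the hunk above: do_move_pages_to_node() returned err while
 * the loop was at page i of nr_pages. A positive err counts the pages
 * of the current batch that failed to migrate; pages i+1..nr_pages-1
 * were never attempted, so they are added to the reported count.
 * Negative errnos pass through unchanged.
 */
static int model_adjust_err(int err, int nr_pages, int i)
{
	if (err > 0)
		err += nr_pages - i - 1;
	return err;
}
```

For example, two migration failures at index 4 of a 10-page request report 7 non-migrated pages: the 2 that failed plus the 5 never attempted.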
On 11/14/18 3:29 AM, Michal Hocko wrote:
> On Wed 14-11-18 08:34:15, Michal Hocko wrote:
>> On Tue 13-11-18 16:40:59, p.jaroszynski@gmail.com wrote:
>>> From: Piotr Jaroszynski <pjaroszynski@nvidia.com>
>>>
>>> migrate_pages() can return the number of pages that failed to migrate
>>> instead of 0 or an error code. If that happens, the positive return is
>>> treated as an error all the way up through the stack leading to the
>>> move_pages() syscall returning a positive number. I believe this
>>> regressed with commit a49bd4d71637 ("mm, numa: rework do_pages_move")
>>> that refactored a lot of this code.
>>
>> Yes this is correct.
>>
>>> Fix this by treating positive returns as success in
>>> do_move_pages_to_node() as that seems to most closely follow the
>>> previous code. This still leaves the question whether silently
>>> considering this case a success is the right thing to do as even the
>>> status of the pages will be set as if they were successfully migrated,
>>> but that seems to have been the case before as well.
>>
>> Yes, I believe the previous semantic was just wrong and we want to fix
>> it. Jan has already brought this up [1]. I believe we want to update the
>> documentation rather than restore the previous hazy semantic.

That's probably fair although at least some code we have will have to be
updated as it just checks for non-zero returns from move_pages() and
assumes errno is set when that happens.

>>
>> Just wondering, how have you found out? Is there any real application
>> failing because of the change or is this a result of some test?

I have a test that creates a tmp file, mmaps it as shared, memsets the
memory and then attempts to move it to a different node. It used to
work, but now fails. I suspect the filesystem's migratepage() callback
regressed and will look into it next. So far I have only tested this on
powerpc with the xfs filesystem.

>>
>> [1] http://lkml.kernel.org/r/0329efa0984b9b0252ef166abb4498c0795fab36.1535113317.git.jstancek@redhat.com
>
> Btw. this is what I was suggesting back then (along with the man page
> update suggested by Jan)
>
> From cfb88c266b645197135cde2905c2bfc82f6d82a9 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 14 Nov 2018 12:19:09 +0100
> Subject: [PATCH] mm: fix do_pages_move error reporting
>
> a49bd4d71637 ("mm, numa: rework do_pages_move") has changed the way how
> we report error to layers above. As the changelog mentioned the semantic
> was quite unclear previously because the return 0 could mean both
> success and failure.
>
> The above mentioned commit didn't get all the way down to fix this
> completely because it doesn't report pages that we even haven't
> attempted to migrate and therefore we cannot simply say that the
> semantic is:
> 	- err < 0 - errno
> 	- err >= 0 - number of non-migrated pages
>
> Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move")
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/migrate.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f7e4bfdc13b7..aa53ebc523eb 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1615,8 +1615,16 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
>  			goto out_flush;
>
>  		err = do_move_pages_to_node(mm, &pagelist, current_node);
> -		if (err)
> +		if (err) {
> +			/*
> +			 * Positive err means the number of pages that failed
> +			 * to migrate. Make sure to report the rest of
> +			 * nr_pages as not migrated as well.
> +			 */
> +			if (err > 0)
> +				err += nr_pages - i - 1;
>  			goto out;

Ok, so we give up after the first failure to migrate everything. That
probably makes sense although I don't have a good idea about how
frequent it is for the migration to give up in such a manner (short of
the issue I'm seeing that I suspect is a separate bug). In this case,
should the status of each page be updated to something instead of being
left undefined? Or should it be specified that page status is only valid
for the first N pages, the not-migrated ones excluded?

> +		}
>  		if (i > start) {
>  			err = store_status(status, start, current_node, i - start);
>  			if (err)
On Wed 14-11-18 10:04:45, Piotr Jaroszynski wrote:
> On 11/14/18 3:29 AM, Michal Hocko wrote:
> > On Wed 14-11-18 08:34:15, Michal Hocko wrote:
> >> On Tue 13-11-18 16:40:59, p.jaroszynski@gmail.com wrote:
> >>> From: Piotr Jaroszynski <pjaroszynski@nvidia.com>
> >>>
> >>> migrate_pages() can return the number of pages that failed to migrate
> >>> instead of 0 or an error code. If that happens, the positive return is
> >>> treated as an error all the way up through the stack leading to the
> >>> move_pages() syscall returning a positive number. I believe this
> >>> regressed with commit a49bd4d71637 ("mm, numa: rework do_pages_move")
> >>> that refactored a lot of this code.
> >>
> >> Yes this is correct.
> >>
> >>> Fix this by treating positive returns as success in
> >>> do_move_pages_to_node() as that seems to most closely follow the
> >>> previous code. This still leaves the question whether silently
> >>> considering this case a success is the right thing to do as even the
> >>> status of the pages will be set as if they were successfully migrated,
> >>> but that seems to have been the case before as well.
> >>
> >> Yes, I believe the previous semantic was just wrong and we want to fix
> >> it. Jan has already brought this up [1]. I believe we want to update the
> >> documentation rather than restore the previous hazy semantic.
>
> That's probably fair although at least some code we have will have to be
> updated as it just checks for non-zero returns from move_pages() and
> assumes errno is set when that happens.

Can you tell me more about your usecase please? I definitely do not want
to break any existing userspace. Making the syscall return code more
reasonable is still attractive. So if this new semantic can work better
for you it would be one argument more to keep it this way.

> >> Just wondering, how have you found out? Is there any real application
> >> failing because of the change or is this a result of some test?
>
> I have a test that creates a tmp file, mmaps it as shared, memsets the
> memory and then attempts to move it to a different node. It used to
> work, but now fails. I suspect the filesystem's migratepage() callback
> regressed and will look into it next. So far I have only tested this on
> powerpc with the xfs filesystem.

I would be surprised if the rework changed the migration behavior.

[...]

> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index f7e4bfdc13b7..aa53ebc523eb 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1615,8 +1615,16 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
> >  			goto out_flush;
> >
> >  		err = do_move_pages_to_node(mm, &pagelist, current_node);
> > -		if (err)
> > +		if (err) {
> > +			/*
> > +			 * Positive err means the number of pages that failed
> > +			 * to migrate. Make sure to report the rest of
> > +			 * nr_pages as not migrated as well.
> > +			 */
> > +			if (err > 0)
> > +				err += nr_pages - i - 1;
> >  			goto out;
>
> Ok, so we give up after the first failure to migrate everything. That
> probably makes sense although I don't have a good idea about how
> frequent it is for the migration to give up in such a manner (short of
> the issue I'm seeing that I suspect is a separate bug). In this case,
> should the status of each page be updated to something instead of being
> left undefined? Or should it be specified that page status is only valid
> for the first N pages, the not-migrated ones excluded?

I believe this is consistent with the previous behavior. I do not
remember in detail but I believe we haven't set the status for the
remaining pages. With this patch it seems straightforward to skip over
the exact number of pages that failed.
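With the semantics of Michal's patch, a caller can derive the split between the handled head of the request and the skipped tail from the return value alone. A sketch with a hypothetical helper (not kernel or libnuma API):

```c
#include <assert.h>

/*
 * Hypothetical helper for the proposed move_pages() convention:
 * ret < 0  -> -errno, a hard failure, nothing can be trusted;
 * ret >= 0 -> ret pages, counted from the tail of the request, were
 *             not migrated, so only the first nr_pages - ret entries
 *             of the status[] array are worth consulting.
 */
static long model_valid_status_entries(long ret, long nr_pages)
{
	if (ret < 0)
		return 0;
	if (ret > nr_pages)
		ret = nr_pages;	/* defensive clamp, should not happen */
	return nr_pages - ret;	/* length of the handled head */
}
```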
On 11/14/18 1:22 PM, Michal Hocko wrote:
> On Wed 14-11-18 10:04:45, Piotr Jaroszynski wrote:
>> On 11/14/18 3:29 AM, Michal Hocko wrote:
>>> On Wed 14-11-18 08:34:15, Michal Hocko wrote:
>>>> On Tue 13-11-18 16:40:59, p.jaroszynski@gmail.com wrote:
>>>>> From: Piotr Jaroszynski <pjaroszynski@nvidia.com>
>>>>>
>>>>> migrate_pages() can return the number of pages that failed to migrate
>>>>> instead of 0 or an error code. If that happens, the positive return is
>>>>> treated as an error all the way up through the stack leading to the
>>>>> move_pages() syscall returning a positive number. I believe this
>>>>> regressed with commit a49bd4d71637 ("mm, numa: rework do_pages_move")
>>>>> that refactored a lot of this code.
>>>>
>>>> Yes this is correct.
>>>>
>>>>> Fix this by treating positive returns as success in
>>>>> do_move_pages_to_node() as that seems to most closely follow the
>>>>> previous code. This still leaves the question whether silently
>>>>> considering this case a success is the right thing to do as even the
>>>>> status of the pages will be set as if they were successfully migrated,
>>>>> but that seems to have been the case before as well.
>>>>
>>>> Yes, I believe the previous semantic was just wrong and we want to fix
>>>> it. Jan has already brought this up [1]. I believe we want to update the
>>>> documentation rather than restore the previous hazy semantic.
>>
>> That's probably fair although at least some code we have will have to be
>> updated as it just checks for non-zero returns from move_pages() and
>> assumes errno is set when that happens.
>
> Can you tell me more about your usecase please? I definitely do not want
> to break any existing userspace. Making the syscall return code more
> reasonable is still attractive. So if this new semantic can work better
> for you it would be one argument more to keep it this way.
>

One of our APIs exposes a way to move a VA range to a GPU NUMA node or one of
the CPU NUMA nodes. The code keeps retrying move_pages() and relies on
the reported page status to decide whether each page is done, needs a
retry (EAGAIN or EBUSY) or possibly needs a fallback (EMEM).

With the previous behaviour we would get a success, but the page status
would be reported incorrectly. That's bad as we skip the migration
without knowing about it.

With the current code we get what we interpret as success as errno is 0,
but the page status is garbage/untouched. That's also bad.

The proposed solution adds a new case to handle, but it will just tell
us the page status is unusable and all we can do is just retry blindly.
If it was possible to plumb through the migration status for each page
accurately that would allow us to save redoing the call for pages that
actually worked. Perhaps we would need a special status for pages
skipped due to errors.

But maybe this is all a tiny corner case short of the bug I hit (see
more below) and it's not worth thinking too much about.

>>>> Just wondering, how have you found out? Is there any real application
>>>> failing because of the change or is this a result of some test?
>>
>> I have a test that creates a tmp file, mmaps it as shared, memsets the
>> memory and then attempts to move it to a different node. It used to
>> work, but now fails. I suspect the filesystem's migratepage() callback
>> regressed and will look into it next. So far I have only tested this on
>> powerpc with the xfs filesystem.
>
> I would be surprised if the rework changed the migration behavior.

It didn't, I tracked it down to the new fs/iomap.c code used by xfs not
being compatible with migrate_page_move_mapping() and prepared a perhaps
naive fix in [1].

>
> [...]
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index f7e4bfdc13b7..aa53ebc523eb 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1615,8 +1615,16 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
>>>  			goto out_flush;
>>>
>>>  		err = do_move_pages_to_node(mm, &pagelist, current_node);
>>> -		if (err)
>>> +		if (err) {
>>> +			/*
>>> +			 * Positive err means the number of pages that failed
>>> +			 * to migrate. Make sure to report the rest of
>>> +			 * nr_pages as not migrated as well.
>>> +			 */
>>> +			if (err > 0)
>>> +				err += nr_pages - i - 1;
>>>  			goto out;
>>
>> Ok, so we give up after the first failure to migrate everything. That
>> probably makes sense although I don't have a good idea about how
>> frequent it is for the migration to give up in such a manner (short of
>> the issue I'm seeing that I suspect is a separate bug). In this case,
>> should the status of each page be updated to something instead of being
>> left undefined? Or should it be specified that page status is only valid
>> for the first N pages, the not-migrated ones excluded?
>
> I believe this is consistent with the previous behavior. I do not
> remember in detail but I believe we haven't set the status for the
> remaining pages. With this patch it seems straightforward to skip over
> the exact number of pages that failed.
>

Yeah, I was just wondering whether the semantics could be improved
further, not arguing that the previous behaviour was any better.

[1] - https://lore.kernel.org/lkml/20181115003000.1358007-1-pjaroszynski@nvidia.com/

Thanks,
Piotr
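The retry loop Piotr describes can be sketched as a pure per-page classifier. Everything below is hypothetical user-space code, and the mail's "EMEM" is read here as ENOMEM:

```c
#include <assert.h>
#include <errno.h>

enum page_action { PAGE_DONE, PAGE_RETRY, PAGE_FALLBACK, PAGE_FAILED };

/*
 * Each status[] entry filled in by move_pages() is either the node the
 * page now resides on (>= 0) or a negative errno describing why the
 * page was not moved.
 */
static enum page_action classify_page_status(int status)
{
	if (status >= 0)
		return PAGE_DONE;	/* page sits on node 'status' */
	switch (-status) {
	case EAGAIN:
	case EBUSY:
		return PAGE_RETRY;	/* transient, retry the move */
	case ENOMEM:
		return PAGE_FALLBACK;	/* target node is out of memory */
	default:
		return PAGE_FAILED;	/* e.g. -EFAULT, -EACCES */
	}
}
```

The thread's complaint is precisely that, after a partial batch failure, the entries this classifier would inspect are left untouched, so the caller cannot tell the classes apart.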
On Wed 14-11-18 17:04:37, Piotr Jaroszynski wrote:
> On 11/14/18 1:22 PM, Michal Hocko wrote:
> > On Wed 14-11-18 10:04:45, Piotr Jaroszynski wrote:
> >> On 11/14/18 3:29 AM, Michal Hocko wrote:
> >>> On Wed 14-11-18 08:34:15, Michal Hocko wrote:
> >>>> On Tue 13-11-18 16:40:59, p.jaroszynski@gmail.com wrote:
> >>>>> From: Piotr Jaroszynski <pjaroszynski@nvidia.com>
> >>>>>
> >>>>> migrate_pages() can return the number of pages that failed to migrate
> >>>>> instead of 0 or an error code. If that happens, the positive return is
> >>>>> treated as an error all the way up through the stack leading to the
> >>>>> move_pages() syscall returning a positive number. I believe this
> >>>>> regressed with commit a49bd4d71637 ("mm, numa: rework do_pages_move")
> >>>>> that refactored a lot of this code.
> >>>>
> >>>> Yes this is correct.
> >>>>
> >>>>> Fix this by treating positive returns as success in
> >>>>> do_move_pages_to_node() as that seems to most closely follow the
> >>>>> previous code. This still leaves the question whether silently
> >>>>> considering this case a success is the right thing to do as even the
> >>>>> status of the pages will be set as if they were successfully migrated,
> >>>>> but that seems to have been the case before as well.
> >>>>
> >>>> Yes, I believe the previous semantic was just wrong and we want to fix
> >>>> it. Jan has already brought this up [1]. I believe we want to update the
> >>>> documentation rather than restore the previous hazy semantic.
> >>
> >> That's probably fair although at least some code we have will have to be
> >> updated as it just checks for non-zero returns from move_pages() and
> >> assumes errno is set when that happens.
> >
> > Can you tell me more about your usecase please? I definitely do not want
> > to break any existing userspace. Making the syscall return code more
> > reasonable is still attractive. So if this new semantic can work better
> > for you it would be one argument more to keep it this way.
> >
>
> One of our APIs exposes a way to move a VA range to a GPU NUMA node or one of
> the CPU NUMA nodes. The code keeps retrying move_pages() and relies on
> the reported page status to decide whether each page is done, needs a
> retry (EAGAIN or EBUSY) or possibly needs a fallback (EMEM).
>
> With the previous behaviour we would get a success, but the page status
> would be reported incorrectly. That's bad as we skip the migration
> without knowing about it.

Exactly.

> With the current code we get what we interpret as success as errno is 0,
> but the page status is garbage/untouched. That's also bad.

Agreed.

> The proposed solution adds a new case to handle, but it will just tell
> us the page status is unusable and all we can do is just retry blindly.
> If it was possible to plumb through the migration status for each page
> accurately that would allow us to save redoing the call for pages that
> actually worked. Perhaps we would need a special status for pages
> skipped due to errors.

This would be possible but with this patch applied you should know how
many pages to skip from the tail of the array.

> But maybe this is all a tiny corner case short of the bug I hit (see
> more below) and it's not worth thinking too much about.
>
> >>>> Just wondering, how have you found out? Is there any real application
> >>>> failing because of the change or is this a result of some test?
> >>
> >> I have a test that creates a tmp file, mmaps it as shared, memsets the
> >> memory and then attempts to move it to a different node. It used to
> >> work, but now fails. I suspect the filesystem's migratepage() callback
> >> regressed and will look into it next. So far I have only tested this on
> >> powerpc with the xfs filesystem.
> >>>
> >>> I would be surprised if the rework changed the migration behavior.
> >>
> > It didn't, I tracked it down to the new fs/iomap.c code used by xfs not
> > being compatible with migrate_page_move_mapping() and prepared a perhaps
> > naive fix in [1].

I am not familiar with iomap code much TBH so I cannot really judge your
fix.
On 11/15/18 12:47 AM, Michal Hocko wrote:
> On Wed 14-11-18 17:04:37, Piotr Jaroszynski wrote:
>> On 11/14/18 1:22 PM, Michal Hocko wrote:
>>> On Wed 14-11-18 10:04:45, Piotr Jaroszynski wrote:
>>>> On 11/14/18 3:29 AM, Michal Hocko wrote:
>>>>> On Wed 14-11-18 08:34:15, Michal Hocko wrote:
>>>>>> On Tue 13-11-18 16:40:59, p.jaroszynski@gmail.com wrote:
>>>>>>> From: Piotr Jaroszynski <pjaroszynski@nvidia.com>
>>>>>>>
>>>>>>> migrate_pages() can return the number of pages that failed to migrate
>>>>>>> instead of 0 or an error code. If that happens, the positive return is
>>>>>>> treated as an error all the way up through the stack leading to the
>>>>>>> move_pages() syscall returning a positive number. I believe this
>>>>>>> regressed with commit a49bd4d71637 ("mm, numa: rework do_pages_move")
>>>>>>> that refactored a lot of this code.
>>>>>>
>>>>>> Yes this is correct.
>>>>>>
>>>>>>> Fix this by treating positive returns as success in
>>>>>>> do_move_pages_to_node() as that seems to most closely follow the
>>>>>>> previous code. This still leaves the question whether silently
>>>>>>> considering this case a success is the right thing to do as even the
>>>>>>> status of the pages will be set as if they were successfully migrated,
>>>>>>> but that seems to have been the case before as well.
>>>>>>
>>>>>> Yes, I believe the previous semantic was just wrong and we want to fix
>>>>>> it. Jan has already brought this up [1]. I believe we want to update the
>>>>>> documentation rather than restore the previous hazy semantic.
>>>>
>>>> That's probably fair although at least some code we have will have to be
>>>> updated as it just checks for non-zero returns from move_pages() and
>>>> assumes errno is set when that happens.
>>>
>>> Can you tell me more about your usecase please? I definitely do not want
>>> to break any existing userspace. Making the syscall return code more
>>> reasonable is still attractive. So if this new semantic can work better
>>> for you it would be one argument more to keep it this way.
>>>
>>
>> One of our APIs exposes a way to move a VA range to a GPU NUMA node or one of
>> the CPU NUMA nodes. The code keeps retrying move_pages() and relies on
>> the reported page status to decide whether each page is done, needs a
>> retry (EAGAIN or EBUSY) or possibly needs a fallback (EMEM).
>>
>> With the previous behaviour we would get a success, but the page status
>> would be reported incorrectly. That's bad as we skip the migration
>> without knowing about it.
>
> Exactly.
>
>> With the current code we get what we interpret as success as errno is 0,
>> but the page status is garbage/untouched. That's also bad.
>
> Agreed.
>
>> The proposed solution adds a new case to handle, but it will just tell
>> us the page status is unusable and all we can do is just retry blindly.
>> If it was possible to plumb through the migration status for each page
>> accurately that would allow us to save redoing the call for pages that
>> actually worked. Perhaps we would need a special status for pages
>> skipped due to errors.
>
> This would be possible but with this patch applied you should know how
> many pages to skip from the tail of the array.

At least in our case the node target is the same for all the pages so we
would just learn that all the pages failed to migrate as they would all
be batched together into the do_move_pages_to_node() call.

>
>> But maybe this is all a tiny corner case short of the bug I hit (see
>> more below) and it's not worth thinking too much about.
>>
>>>>>> Just wondering, how have you found out? Is there any real application
>>>>>> failing because of the change or is this a result of some test?
>>>>
>>>> I have a test that creates a tmp file, mmaps it as shared, memsets the
>>>> memory and then attempts to move it to a different node. It used to
>>>> work, but now fails. I suspect the filesystem's migratepage() callback
>>>> regressed and will look into it next. So far I have only tested this on
>>>> powerpc with the xfs filesystem.
>>>
>>> I would be surprised if the rework changed the migration behavior.
>>
>> It didn't, I tracked it down to the new fs/iomap.c code used by xfs not
>> being compatible with migrate_page_move_mapping() and prepared a perhaps
>> naive fix in [1].
>
> I am not familiar with iomap code much TBH so I cannot really judge your
> fix.
>

Christoph reviewed it already (thanks!) so it should be good after all.
But in its context, I wanted to ask about the migrate_page_move_mapping()
page count checks it was hitting. Is it true that the count checks are
there to handle the case when a page might be temporarily pinned and
hence temporarily have a count that is too high? That would explain why
it returns EAGAIN in this case. But should having the count too low
(what the bug was hitting) be a fatal error, with a WARN maybe? Or are
there expected cases where the count is temporarily too low, too? I
could send a patch for that, but also just wanted to understand the
expectations.
On Thu 15-11-18 10:58:33, Piotr Jaroszynski wrote:
> On 11/15/18 12:47 AM, Michal Hocko wrote:
> > On Wed 14-11-18 17:04:37, Piotr Jaroszynski wrote:
[...]
> >> The proposed solution adds a new case to handle, but it will just tell
> >> us the page status is unusable and all we can do is just retry blindly.
> >> If it was possible to plumb through the migration status for each page
> >> accurately that would allow us to save redoing the call for pages that
> >> actually worked. Perhaps we would need a special status for pages
> >> skipped due to errors.
> >
> > This would be possible but with this patch applied you should know how
> > many pages to skip from the tail of the array.
>
> At least in our case the node target is the same for all the pages so we
> would just learn that all the pages failed to migrate as they would all
> be batched together into the do_move_pages_to_node() call.

Anyway, could you give this patch a try, please? I would appreciate some
Tested-bys to push this forward ;)

> >> But maybe this is all a tiny corner case short of the bug I hit (see
> >> more below) and it's not worth thinking too much about.
> >>
> >>>>>> Just wondering, how have you found out? Is there any real application
> >>>>>> failing because of the change or is this a result of some test?
> >>>>
> >>>> I have a test that creates a tmp file, mmaps it as shared, memsets the
> >>>> memory and then attempts to move it to a different node. It used to
> >>>> work, but now fails. I suspect the filesystem's migratepage() callback
> >>>> regressed and will look into it next. So far I have only tested this on
> >>>> powerpc with the xfs filesystem.
> >>>
> >>> I would be surprised if the rework changed the migration behavior.
> >>
> >> It didn't, I tracked it down to the new fs/iomap.c code used by xfs not
> >> being compatible with migrate_page_move_mapping() and prepared a perhaps
> >> naive fix in [1].
> >
> > I am not familiar with iomap code much TBH so I cannot really judge your
> > fix.
> >
>
> Christoph reviewed it already (thanks!) so it should be good after all.
> But in its context, I wanted to ask about the migrate_page_move_mapping()
> page count checks it was hitting. Is it true that the count checks are
> there to handle the case when a page might be temporarily pinned and
> hence temporarily have a count that is too high?

Yes. We cannot really migrate pinned pages.

> That would explain why
> it returns EAGAIN in this case. But should having the count too low
> (what the bug was hitting) be a fatal error, with a WARN maybe? Or are
> there expected cases where the count is temporarily too low, too?

Nope, page reference count too low is a bug.

> I could send a patch for that, but also just wanted to understand the
> expectations.
diff --git a/mm/migrate.c b/mm/migrate.c
index 8baeb7ff2f6d..b42efef780d6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1461,6 +1461,7 @@ static int store_status(int __user *status, int start, int value, int nr)
 	return 0;
 }
 
+/* Returns 0 or an error code. */
 static int do_move_pages_to_node(struct mm_struct *mm,
 		struct list_head *pagelist, int node)
 {
@@ -1473,6 +1474,15 @@ static int do_move_pages_to_node(struct mm_struct *mm,
 			MIGRATE_SYNC, MR_SYSCALL);
 	if (err)
 		putback_movable_pages(pagelist);
+
+	/*
+	 * migrate_pages() can return the number of not migrated pages, but the
+	 * callers of do_move_pages_to_node() only care about and handle hard
+	 * failures.
+	 */
+	if (err > 0)
+		err = 0;
+
 	return err;
 }