Message ID | 20230807063945.911582-1-apopple@nvidia.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [1/2] mm/migrate.c: Fix return code when migration fails | expand |
On Mon 07-08-23 16:39:44, Alistair Popple wrote: > When a page fails to migrate move_pages() returns the error code in a > per-page array of status values. The function call itself is also > supposed to return a summary error code indicating that a failure > occurred. > > This doesn't always happen. Instead success can be returned even > though some pages failed to migrate. This is due to incorrectly > returning the error code from store_status() rather than the code from > add_page_for_migration. Fix this by only returning an error from > store_status() if the store actually failed. Error reporting by this syscall has been really far from straightforward. Please read through a49bd4d71637 and the section "On a side note". Is there any specific reason you are trying to address this now or is this motivated by the code inspection? > Signed-off-by: Alistair Popple <apopple@nvidia.com> > Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move") > --- > mm/migrate.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/mm/migrate.c b/mm/migrate.c > index 24baad2571e3..bb3a37245e13 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -2222,7 +2222,9 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes, > * If the page is already on the target node (!err), store the > * node, otherwise, store the err. > */ > - err = store_status(status, i, err ? : current_node, 1); > + err1 = store_status(status, i, err ? : current_node, 1); > + if (err1) > + err = err1; > if (err) > goto out_flush; > > -- > 2.39.2
Michal Hocko <mhocko@suse.com> writes: > On Mon 07-08-23 16:39:44, Alistair Popple wrote: >> When a page fails to migrate move_pages() returns the error code in a >> per-page array of status values. The function call itself is also >> supposed to return a summary error code indicating that a failure >> occurred. >> >> This doesn't always happen. Instead success can be returned even >> though some pages failed to migrate. This is due to incorrectly >> returning the error code from store_status() rather than the code from >> add_page_for_migration. Fix this by only returning an error from >> store_status() if the store actually failed. > > Error reporting by this syscall has been really far from > straightforward. Please read through a49bd4d71637 and the section "On a > side note". > Is there any specific reason you are trying to address this now or is > this motivated by the code inspection? Thanks Michal. There was no specific reason to address this now other than I came across this behaviour when updating the migration selftest to inspect the status array and thought it was odd. I was seeing pages had failed to migrate according to the status argument even though move_pages() had returned 0 (ie. success) rather than a number of non-migrated pages. If I'm interpreting the side note correctly the behaviour you were concerned about was the opposite - returning a fail return code from move_pages() but not indicating failure in the status array. That said I'm happy to leave the behaviour as is, although in that case an update to the man page is in order to clarify a return value of 0 from move_pages() doesn't actually mean all pages were successfully migrated. >> Signed-off-by: Alistair Popple <apopple@nvidia.com> >> Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move") >> --- >> mm/migrate.c | 4 +++- >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> diff --git a/mm/migrate.c b/mm/migrate.c >> index 24baad2571e3..bb3a37245e13 100644 >> --- a/mm/migrate.c >> +++ b/mm/migrate.c >> @@ -2222,7 +2222,9 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes, >> * If the page is already on the target node (!err), store the >> * node, otherwise, store the err. >> */ >> - err = store_status(status, i, err ? : current_node, 1); >> + err1 = store_status(status, i, err ? : current_node, 1); >> + if (err1) >> + err = err1; >> if (err) >> goto out_flush; >> >> -- >> 2.39.2
On Mon 07-08-23 22:31:52, Alistair Popple wrote: > > Michal Hocko <mhocko@suse.com> writes: > > > On Mon 07-08-23 16:39:44, Alistair Popple wrote: > >> When a page fails to migrate move_pages() returns the error code in a > >> per-page array of status values. The function call itself is also > >> supposed to return a summary error code indicating that a failure > >> occurred. > >> > >> This doesn't always happen. Instead success can be returned even > >> though some pages failed to migrate. This is due to incorrectly > >> returning the error code from store_status() rather than the code from > >> add_page_for_migration. Fix this by only returning an error from > >> store_status() if the store actually failed. > > > > Error reporting by this syscall has been really far from > > straightforward. Please read through a49bd4d71637 and the section "On a > > side note". > > Is there any specific reason you are trying to address this now or is > > this motivated by the code inspection? > > Thanks Michal. There was no specific reason to address this now other > than I came across this behaviour when updating the migration selftest > to inspect the status array and thought it was odd. I was seeing pages > had failed to migrate according to the status argument even though > move_pages() had returned 0 (ie. success) rather than a number of > non-migrated pages. It is good to mention such a motivation in the changelog to make it clear. Also do we have a specific test case which trigger this case? > If I'm interpreting the side note correctly the behaviour you were > concerned about was the opposite - returning a fail return code from > move_pages() but not indicating failure in the status array. > > That said I'm happy to leave the behaviour as is, although in that case > an update to the man page is in order to clarify a return value of 0 > from move_pages() doesn't actually mean all pages were successfully > migrated. While I would say that it is better to let old dogs sleep I do not mind changing the behavior and see whether anything breaks. I suspect nobody except for couple of test cases hardcoded to the original behavior will notice. > >> Signed-off-by: Alistair Popple <apopple@nvidia.com> > >> Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move") The patch itself looks good. I am not sure the fixes tag is accurate. Has the reporting been correct before this change? I didn't have time to re-read the original code which was quite different. Anyway Acked-by: Michal Hocko <mhocko@suse.com> Anyway rewriting this function to clarify the error handling would be a nice exercise if somebody is interested. > >> --- > >> mm/migrate.c | 4 +++- > >> 1 file changed, 3 insertions(+), 1 deletion(-) > >> > >> diff --git a/mm/migrate.c b/mm/migrate.c > >> index 24baad2571e3..bb3a37245e13 100644 > >> --- a/mm/migrate.c > >> +++ b/mm/migrate.c > >> @@ -2222,7 +2222,9 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes, > >> * If the page is already on the target node (!err), store the > >> * node, otherwise, store the err. > >> */ > >> - err = store_status(status, i, err ? : current_node, 1); > >> + err1 = store_status(status, i, err ? : current_node, 1); > >> + if (err1) > >> + err = err1; > >> if (err) > >> goto out_flush; > >> > >> -- > >> 2.39.2
Michal Hocko <mhocko@suse.com> writes: > On Mon 07-08-23 22:31:52, Alistair Popple wrote: >> >> Michal Hocko <mhocko@suse.com> writes: >> >> > On Mon 07-08-23 16:39:44, Alistair Popple wrote: >> >> When a page fails to migrate move_pages() returns the error code in a >> >> per-page array of status values. The function call itself is also >> >> supposed to return a summary error code indicating that a failure >> >> occurred. >> >> >> >> This doesn't always happen. Instead success can be returned even >> >> though some pages failed to migrate. This is due to incorrectly >> >> returning the error code from store_status() rather than the code from >> >> add_page_for_migration. Fix this by only returning an error from >> >> store_status() if the store actually failed. >> > >> > Error reporting by this syscall has been really far from >> > straightforward. Please read through a49bd4d71637 and the section "On a >> > side note". >> > Is there any specific reason you are trying to address this now or is >> > this motivated by the code inspection? >> >> Thanks Michal. There was no specific reason to address this now other >> than I came across this behaviour when updating the migration selftest >> to inspect the status array and thought it was odd. I was seeing pages >> had failed to migrate according to the status argument even though >> move_pages() had returned 0 (ie. success) rather than a number of >> non-migrated pages. > > It is good to mention such a motivation in the changelog to make it > clear. Also do we have a specific test case which trigger this case? Not explicitly/reliably although I could write one. >> If I'm interpreting the side note correctly the behaviour you were >> concerned about was the opposite - returning a fail return code from >> move_pages() but not indicating failure in the status array. >> >> That said I'm happy to leave the behaviour as is, although in that case >> an update to the man page is in order to clarify a return value of 0 >> from move_pages() doesn't actually mean all pages were successfully >> migrated. > > While I would say that it is better to let old dogs sleep I do not mind > changing the behavior and see whether anything breaks. I suspect nobody > except for couple of test cases hardcoded to the original behavior will > notice. > >> >> Signed-off-by: Alistair Popple <apopple@nvidia.com> >> >> Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move") > > The patch itself looks good. I am not sure the fixes tag is accurate. > Has the reporting been correct before this change? I didn't have time to > re-read the original code which was quite different. I dug deeper into the history and the fixes tag is wrong. The behaviour was actually introduced way back in commit e78bbfa82624 ("mm: stop returning -ENOENT from sys_move_pages() if nothing got migrated"). As you may guess from the title it was intentional, so suspect it is better to update documentation. > Anyway > Acked-by: Michal Hocko <mhocko@suse.com> Thanks for looking, but I will drop this and see if I can get the man page updated. > Anyway rewriting this function to clarify the error handling would be a > nice exercise if somebody is interested. Yeah, everytime I look at this function I want to do that but haven't yet found the time. >> >> --- >> >> mm/migrate.c | 4 +++- >> >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> >> >> diff --git a/mm/migrate.c b/mm/migrate.c >> >> index 24baad2571e3..bb3a37245e13 100644 >> >> --- a/mm/migrate.c >> >> +++ b/mm/migrate.c >> >> @@ -2222,7 +2222,9 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes, >> >> * If the page is already on the target node (!err), store the >> >> * node, otherwise, store the err. >> >> */ >> >> - err = store_status(status, i, err ? : current_node, 1); >> >> + err1 = store_status(status, i, err ? : current_node, 1); >> >> + if (err1) >> >> + err = err1; >> >> if (err) >> >> goto out_flush; >> >> >> >> -- >> >> 2.39.2
Alistair Popple <apopple@nvidia.com> writes: > Michal Hocko <mhocko@suse.com> writes: > >> On Mon 07-08-23 22:31:52, Alistair Popple wrote: >>> >>> Michal Hocko <mhocko@suse.com> writes: >>> >>> > On Mon 07-08-23 16:39:44, Alistair Popple wrote: >>> >> When a page fails to migrate move_pages() returns the error code in a >>> >> per-page array of status values. The function call itself is also >>> >> supposed to return a summary error code indicating that a failure >>> >> occurred. >>> >> >>> >> This doesn't always happen. Instead success can be returned even >>> >> though some pages failed to migrate. This is due to incorrectly >>> >> returning the error code from store_status() rather than the code from >>> >> add_page_for_migration. Fix this by only returning an error from >>> >> store_status() if the store actually failed. >>> > >>> > Error reporting by this syscall has been really far from >>> > straightforward. Please read through a49bd4d71637 and the section "On a >>> > side note". >>> > Is there any specific reason you are trying to address this now or is >>> > this motivated by the code inspection? >>> >>> Thanks Michal. There was no specific reason to address this now other >>> than I came across this behaviour when updating the migration selftest >>> to inspect the status array and thought it was odd. I was seeing pages >>> had failed to migrate according to the status argument even though >>> move_pages() had returned 0 (ie. success) rather than a number of >>> non-migrated pages. >> >> It is good to mention such a motivation in the changelog to make it >> clear. Also do we have a specific test case which trigger this case? > > Not explicitly/reliably although I could write one. > >>> If I'm interpreting the side note correctly the behaviour you were >>> concerned about was the opposite - returning a fail return code from >>> move_pages() but not indicating failure in the status array. IIUC, we cannot avoid this even if leaving code untouched. In move_pages_and_store_status(), if some pages fails to be migrated, we will not store status. In fact, we don't know which pages failed to be migrated in kernel too. >>> That said I'm happy to leave the behaviour as is, although in that case >>> an update to the man page is in order to clarify a return value of 0 >>> from move_pages() doesn't actually mean all pages were successfully >>> migrated. IMHO, return value is more important than "status" as the reason I mentioned above. At least, we can make one thing correct. >> While I would say that it is better to let old dogs sleep I do not mind >> changing the behavior and see whether anything breaks. I suspect nobody >> except for couple of test cases hardcoded to the original behavior will >> notice. >> >>> >> Signed-off-by: Alistair Popple <apopple@nvidia.com> >>> >> Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move") >> >> The patch itself looks good. I am not sure the fixes tag is accurate. >> Has the reporting been correct before this change? I didn't have time to >> re-read the original code which was quite different. > > I dug deeper into the history and the fixes tag is wrong. The behaviour > was actually introduced way back in commit e78bbfa82624 ("mm: stop > returning -ENOENT from sys_move_pages() if nothing got migrated"). As > you may guess from the title it was intentional, so suspect it is better > to update documentation. Can we change the code but keep the -ENOENT behavior with some special case? -- Best Regards, Huang, Ying >> Anyway >> Acked-by: Michal Hocko <mhocko@suse.com> > > Thanks for looking, but I will drop this and see if I can get the man > page updated. > >> Anyway rewriting this function to clarify the error handling would be a >> nice exercise if somebody is interested. > > Yeah, everytime I look at this function I want to do that but haven't > yet found the time. > >>> >> --- >>> >> mm/migrate.c | 4 +++- >>> >> 1 file changed, 3 insertions(+), 1 deletion(-) >>> >> >>> >> diff --git a/mm/migrate.c b/mm/migrate.c >>> >> index 24baad2571e3..bb3a37245e13 100644 >>> >> --- a/mm/migrate.c >>> >> +++ b/mm/migrate.c >>> >> @@ -2222,7 +2222,9 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes, >>> >> * If the page is already on the target node (!err), store the >>> >> * node, otherwise, store the err. >>> >> */ >>> >> - err = store_status(status, i, err ? : current_node, 1); >>> >> + err1 = store_status(status, i, err ? : current_node, 1); >>> >> + if (err1) >>> >> + err = err1; >>> >> if (err) >>> >> goto out_flush; >>> >> >>> >> -- >>> >> 2.39.2
diff --git a/mm/migrate.c b/mm/migrate.c index 24baad2571e3..bb3a37245e13 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2222,7 +2222,9 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes, * If the page is already on the target node (!err), store the * node, otherwise, store the err. */ - err = store_status(status, i, err ? : current_node, 1); + err1 = store_status(status, i, err ? : current_node, 1); + if (err1) + err = err1; if (err) goto out_flush;
When a page fails to migrate move_pages() returns the error code in a per-page array of status values. The function call itself is also supposed to return a summary error code indicating that a failure occurred. This doesn't always happen. Instead success can be returned even though some pages failed to migrate. This is due to incorrectly returning the error code from store_status() rather than the code from add_page_for_migration. Fix this by only returning an error from store_status() if the store actually failed. Signed-off-by: Alistair Popple <apopple@nvidia.com> Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move") --- mm/migrate.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)