diff mbox series

[2/2] mm/mempolicy: Skip walking HUGETLB vma if MPOL_MF_STRICT is specified alone

Message ID 1578993378-10860-2-git-send-email-lixinhai.lxh@gmail.com (mailing list archive)
State New, archived
Headers show
Series [1/2] mm/mempolicy: Checking hstate for hugetlbfs page in vma_migratable | expand

Commit Message

Li Xinhai Jan. 14, 2020, 9:16 a.m. UTC
Checking MPOL_MF_STRICT is ignored for HUGETLB vma according to mbind man
page:

Notes
MPOL_MF_STRICT is ignored on huge page mappings.

If MPOL_MF_STRICT is specified alone without any MOVE flag, we should
indicate, from test_walk, that walking this vma should be skipped even if
there are misplaced pages.

Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/mempolicy.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Li Xinhai Jan. 14, 2020, 2:09 p.m. UTC | #1
Add cc to
Yang Shi <yang.shi@linux.alibaba.com>
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
, who has been worked on this part

On 2020-01-14 at 17:16 Li Xinhai wrote:
>Checking MPOL_MF_STRICT is ignored for HUGETLB vma according to mbind man
>page:
>
>Notes
>MPOL_MF_STRICT is ignored on huge page mappings.
>
>If MPOL_MF_STRICT is specified alone without any MOVE flag, we should
>indicate, from test_walk, that walking this vma should be skipped even if
>there are misplaced pages.
>
>Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
>Cc: Michal Hocko <mhocko@suse.com>
>Cc: Mike Kravetz <mike.kravetz@oracle.com>
>---
> mm/mempolicy.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
>diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>index 067cf7d..c117b5f 100644
>--- a/mm/mempolicy.c
>+++ b/mm/mempolicy.c
>@@ -656,6 +656,13 @@ static int queue_pages_test_walk(unsigned long start, unsigned long end,
> return 1;
> }
>
>+	/* MPOL_MF_STRICT is ignored for huge page, skip checking
>+	*  misplaced pages
>+	*/
>+	if ((flags & MPOL_MF_VALID) == MPOL_MF_STRICT &&
>+	is_vm_hugetlb_page(vma))
>+	return 1;
>+
> /* queue pages from current vma */
> if (flags & MPOL_MF_VALID)
> return 0;
>--
>1.8.3.1
>
Yang Shi Jan. 14, 2020, 6:27 p.m. UTC | #2
On 1/14/20 6:09 AM, Li Xinhai wrote:
> Add cc to
> Yang Shi <yang.shi@linux.alibaba.com>
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> , who has been worked on this part
>
> On 2020-01-14 at 17:16 Li Xinhai wrote:
>> Checking MPOL_MF_STRICT is ignored for HUGETLB vma according to mbind man
>> page:
>>
>> Notes
>> MPOL_MF_STRICT is ignored on huge page mappings.
>>
>> If MPOL_MF_STRICT is specified alone without any MOVE flag, we should
>> indicate, from test_walk, that walking this vma should be skipped even if
>> there are misplaced pages.
>>
>> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>> mm/mempolicy.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index 067cf7d..c117b5f 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -656,6 +656,13 @@ static int queue_pages_test_walk(unsigned long start, unsigned long end,
>> return 1;
>> }
>>
>> +	/* MPOL_MF_STRICT is ignored for huge page, skip checking
>> +	*  misplaced pages
>> +	*/
>> +	if ((flags & MPOL_MF_VALID) == MPOL_MF_STRICT &&
>> +	is_vm_hugetlb_page(vma))
>> +	return 1;

It makes some sense to me.  We do save some effort by not 
acquiring/releasing ptl and walking page tables.

Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>

>> /* queue pages from current vma */
>> if (flags & MPOL_MF_VALID)
>> return 0;
>> --
>> 1.8.3.1
> >
Mike Kravetz Jan. 15, 2020, 1:07 a.m. UTC | #3
On 1/14/20 6:09 AM, Li Xinhai wrote:
> Add cc to
> Yang Shi <yang.shi@linux.alibaba.com>
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> , who has been worked on this part
> 
> On 2020-01-14 at 17:16 Li Xinhai wrote:
>> Checking MPOL_MF_STRICT is ignored for HUGETLB vma according to mbind man
>> page:
>>
>> Notes
>> MPOL_MF_STRICT is ignored on huge page mappings.
>>
>> If MPOL_MF_STRICT is specified alone without any MOVE flag, we should
>> indicate, from test_walk, that walking this vma should be skipped even if
>> there are misplaced pages.
>>
>> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>

I do not necessarily disagree with the change.  However, this has made me
question a couple things:
1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings?
   - Is that leftover from the the days when huge page migration was not
     supported?
   - Is it just because huge page migration is more likely to fail than
     base page migration.
2) Does the mbind code function properly when unable to migrate a huge page
   MPOL_MF_STRICT is set?  A quick look at the code looks like it returns
   EIO.

I will look into these questions.  However, if someone already knows the
answer that would be appreciated.
Yang Shi Jan. 15, 2020, 1:24 a.m. UTC | #4
On 1/14/20 5:07 PM, Mike Kravetz wrote:
> On 1/14/20 6:09 AM, Li Xinhai wrote:
>> Add cc to
>> Yang Shi <yang.shi@linux.alibaba.com>
>> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> , who has been worked on this part
>>
>> On 2020-01-14 at 17:16 Li Xinhai wrote:
>>> Checking MPOL_MF_STRICT is ignored for HUGETLB vma according to mbind man
>>> page:
>>>
>>> Notes
>>> MPOL_MF_STRICT is ignored on huge page mappings.
>>>
>>> If MPOL_MF_STRICT is specified alone without any MOVE flag, we should
>>> indicate, from test_walk, that walking this vma should be skipped even if
>>> there are misplaced pages.
>>>
>>> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
>>> Cc: Michal Hocko <mhocko@suse.com>
>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> I do not necessarily disagree with the change.  However, this has made me
> question a couple things:
> 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings?
>     - Is that leftover from the the days when huge page migration was not
>       supported?
>     - Is it just because huge page migration is more likely to fail than
>       base page migration.
> 2) Does the mbind code function properly when unable to migrate a huge page
>     MPOL_MF_STRICT is set?  A quick look at the code looks like it returns
>     EIO.

I don't know the answer about question #1 I didn't dig into the history. 
The queue_pages_hugetlb() returns 0 unconditionally, I think this is 
what "MPOL_MF_STRICT is ignored on huge page mappings" means in code.

It would return -EIO for base pages or THP as what the manpage describes.

>
> I will look into these questions.  However, if someone already knows the
> answer that would be appreciated.
Mike Kravetz Jan. 15, 2020, 4:28 a.m. UTC | #5
On 1/14/20 5:24 PM, Yang Shi wrote:
> 
> 
> On 1/14/20 5:07 PM, Mike Kravetz wrote:
>> On 1/14/20 6:09 AM, Li Xinhai wrote:
>>> Add cc to
>>> Yang Shi <yang.shi@linux.alibaba.com>
>>> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>> , who has been worked on this part
>>>
>>> On 2020-01-14 at 17:16 Li Xinhai wrote:
>>>> Checking MPOL_MF_STRICT is ignored for HUGETLB vma according to mbind man
>>>> page:
>>>>
>>>> Notes
>>>> MPOL_MF_STRICT is ignored on huge page mappings.
>>>>
>>>> If MPOL_MF_STRICT is specified alone without any MOVE flag, we should
>>>> indicate, from test_walk, that walking this vma should be skipped even if
>>>> there are misplaced pages.
>>>>
>>>> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
>>>> Cc: Michal Hocko <mhocko@suse.com>
>>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> I do not necessarily disagree with the change.  However, this has made me
>> question a couple things:
>> 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings?
>>     - Is that leftover from the the days when huge page migration was not
>>       supported?
>>     - Is it just because huge page migration is more likely to fail than
>>       base page migration.
>> 2) Does the mbind code function properly when unable to migrate a huge page
>>     MPOL_MF_STRICT is set?  A quick look at the code looks like it returns
>>     EIO.
> 
> I don't know the answer about question #1 I didn't dig into the history. The queue_pages_hugetlb() returns 0 unconditionally, I think this is what "MPOL_MF_STRICT is ignored on huge page mappings" means in code.
> 
> It would return -EIO for base pages or THP as what the manpage describes.
> 

I was thinking about a migration failure after isolation.  This block of
code in do_mbind() after queue_pages_range() and mbind_range().

        if (!err) {
                int nr_failed = 0;

                if (!list_empty(&pagelist)) {
                        WARN_ON_ONCE(flags & MPOL_MF_LAZY);
                        nr_failed = migrate_pages(&pagelist, new_page, NULL,
                                start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
                        if (nr_failed)
                                putback_movable_pages(&pagelist);
                }

                if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
                        err = -EIO;
Yang Shi Jan. 15, 2020, 5:23 a.m. UTC | #6
On 1/14/20 8:28 PM, Mike Kravetz wrote:
> On 1/14/20 5:24 PM, Yang Shi wrote:
>>
>> On 1/14/20 5:07 PM, Mike Kravetz wrote:
>>> On 1/14/20 6:09 AM, Li Xinhai wrote:
>>>> Add cc to
>>>> Yang Shi <yang.shi@linux.alibaba.com>
>>>> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>>> , who has been worked on this part
>>>>
>>>> On 2020-01-14 at 17:16 Li Xinhai wrote:
>>>>> Checking MPOL_MF_STRICT is ignored for HUGETLB vma according to mbind man
>>>>> page:
>>>>>
>>>>> Notes
>>>>> MPOL_MF_STRICT is ignored on huge page mappings.
>>>>>
>>>>> If MPOL_MF_STRICT is specified alone without any MOVE flag, we should
>>>>> indicate, from test_walk, that walking this vma should be skipped even if
>>>>> there are misplaced pages.
>>>>>
>>>>> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
>>>>> Cc: Michal Hocko <mhocko@suse.com>
>>>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>>> I do not necessarily disagree with the change.  However, this has made me
>>> question a couple things:
>>> 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings?
>>>      - Is that leftover from the the days when huge page migration was not
>>>        supported?
>>>      - Is it just because huge page migration is more likely to fail than
>>>        base page migration.
>>> 2) Does the mbind code function properly when unable to migrate a huge page
>>>      MPOL_MF_STRICT is set?  A quick look at the code looks like it returns
>>>      EIO.
>> I don't know the answer about question #1 I didn't dig into the history. The queue_pages_hugetlb() returns 0 unconditionally, I think this is what "MPOL_MF_STRICT is ignored on huge page mappings" means in code.
>>
>> It would return -EIO for base pages or THP as what the manpage describes.
>>
> I was thinking about a migration failure after isolation.  This block of
> code in do_mbind() after queue_pages_range() and mbind_range().
>
>          if (!err) {
>                  int nr_failed = 0;
>
>                  if (!list_empty(&pagelist)) {
>                          WARN_ON_ONCE(flags & MPOL_MF_LAZY);
>                          nr_failed = migrate_pages(&pagelist, new_page, NULL,
>                                  start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
>                          if (nr_failed)
>                                  putback_movable_pages(&pagelist);
>                  }
>
>                  if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
>                          err = -EIO;

Hmm.. I agree this part in man page does look ambiguous. We may assume 
"MPOL_MF_STRICT is ignored on huge page mappings." implies if 
MPOL_MF_STRICT is specified alone? If MOVE flag is specified it should 
return -EIO if some pages could not be moved as what the man page describes.

I don't know what the intention was at the first place. We may have to 
dig into the history.

>
Li Xinhai Jan. 15, 2020, 7:36 a.m. UTC | #7
On 2020-01-15 at 13:23 Yang Shi wrote:
>
>
>On 1/14/20 8:28 PM, Mike Kravetz wrote:
>> On 1/14/20 5:24 PM, Yang Shi wrote:
>>>
>>> On 1/14/20 5:07 PM, Mike Kravetz wrote:
>>>> On 1/14/20 6:09 AM, Li Xinhai wrote:
>>>>> Add cc to
>>>>> Yang Shi <yang.shi@linux.alibaba.com>
>>>>> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>>>> , who has been worked on this part
>>>>>
>>>>> On 2020-01-14 at 17:16 Li Xinhai wrote:
>>>>>> Checking MPOL_MF_STRICT is ignored for HUGETLB vma according to mbind man
>>>>>> page:
>>>>>>
>>>>>> Notes
>>>>>> MPOL_MF_STRICT is ignored on huge page mappings.
>>>>>>
>>>>>> If MPOL_MF_STRICT is specified alone without any MOVE flag, we should
>>>>>> indicate, from test_walk, that walking this vma should be skipped even if
>>>>>> there are misplaced pages.
>>>>>>
>>>>>> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
>>>>>> Cc: Michal Hocko <mhocko@suse.com>
>>>>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>>>> I do not necessarily disagree with the change.  However, this has made me
>>>> question a couple things:
>>>> 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings?
>>>>      - Is that leftover from the the days when huge page migration was not
>>>>        supported?
>>>>      - Is it just because huge page migration is more likely to fail than
>>>>        base page migration.
>>>> 2) Does the mbind code function properly when unable to migrate a huge page
>>>>      MPOL_MF_STRICT is set?  A quick look at the code looks like it returns
>>>>      EIO. 
for question (2),
look into queue_pages_hugetlb(), the misplaced page would not
cause -EIO reported, for both STRICT set alone and STRICT set with MOVE*;
that means STRICT been effectively ignored during isolation phase.

In unmap and move phase, -EIO is reported if failed to move page and
STRICT is set.

>>> I don't know the answer about question #1 I didn't dig into the history. The queue_pages_hugetlb() returns 0 unconditionally, I think this is what "MPOL_MF_STRICT is ignored on huge page mappings" means in code.
>>>
>>> It would return -EIO for base pages or THP as what the manpage describes.
>>>
>> I was thinking about a migration failure after isolation.  This block of
>> code in do_mbind() after queue_pages_range() and mbind_range().
>>
>>          if (!err) {
>>                  int nr_failed = 0;
>>
>>                  if (!list_empty(&pagelist)) {
>>                          WARN_ON_ONCE(flags & MPOL_MF_LAZY);
>>                          nr_failed = migrate_pages(&pagelist, new_page, NULL,
>>                                  start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
>>                          if (nr_failed)
>>                                  putback_movable_pages(&pagelist);
>>                  }
>>
>>                  if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
>>                          err = -EIO;
>
>Hmm.. I agree this part in man page does look ambiguous. We may assume
>"MPOL_MF_STRICT is ignored on huge page mappings." implies if
>MPOL_MF_STRICT is specified alone? If MOVE flag is specified it should
>return -EIO if some pages could not be moved as what the man page describes.
> 
It looks to me that current code has no feasible way to ignore STRICT
flag for hugetlb page when failure happen in  unmap&move phase,
because mbind is about to handle multiple vma(i.e., hugetlb vma mixed with
other vma) in one call.

>I don't know what the intention was at the first place. We may have to
>dig into the history.
>
>>
>
>
Yang Shi Jan. 15, 2020, 5:16 p.m. UTC | #8
On 1/14/20 11:36 PM, Li Xinhai wrote:
> On 2020-01-15 at 13:23 Yang Shi wrote:
>>
>> On 1/14/20 8:28 PM, Mike Kravetz wrote:
>>> On 1/14/20 5:24 PM, Yang Shi wrote:
>>>> On 1/14/20 5:07 PM, Mike Kravetz wrote:
>>>>> On 1/14/20 6:09 AM, Li Xinhai wrote:
>>>>>> Add cc to
>>>>>> Yang Shi <yang.shi@linux.alibaba.com>
>>>>>> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>>>>> , who has been worked on this part
>>>>>>
>>>>>> On 2020-01-14 at 17:16 Li Xinhai wrote:
>>>>>>> Checking MPOL_MF_STRICT is ignored for HUGETLB vma according to mbind man
>>>>>>> page:
>>>>>>>
>>>>>>> Notes
>>>>>>> MPOL_MF_STRICT is ignored on huge page mappings.
>>>>>>>
>>>>>>> If MPOL_MF_STRICT is specified alone without any MOVE flag, we should
>>>>>>> indicate, from test_walk, that walking this vma should be skipped even if
>>>>>>> there are misplaced pages.
>>>>>>>
>>>>>>> Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
>>>>>>> Cc: Michal Hocko <mhocko@suse.com>
>>>>>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>>>>> I do not necessarily disagree with the change.  However, this has made me
>>>>> question a couple things:
>>>>> 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings?
>>>>>        - Is that leftover from the the days when huge page migration was not
>>>>>          supported?
>>>>>        - Is it just because huge page migration is more likely to fail than
>>>>>          base page migration.
>>>>> 2) Does the mbind code function properly when unable to migrate a huge page
>>>>>        MPOL_MF_STRICT is set?  A quick look at the code looks like it returns
>>>>>        EIO.
> for question (2),
> look into queue_pages_hugetlb(), the misplaced page would not
> cause -EIO reported, for both STRICT set alone and STRICT set with MOVE*;
> that means STRICT been effectively ignored during isolation phase.
>
> In unmap and move phase, -EIO is reported if failed to move page and
> STRICT is set.
>
>>>> I don't know the answer about question #1 I didn't dig into the history. The queue_pages_hugetlb() returns 0 unconditionally, I think this is what "MPOL_MF_STRICT is ignored on huge page mappings" means in code.
>>>>
>>>> It would return -EIO for base pages or THP as what the manpage describes.
>>>>
>>> I was thinking about a migration failure after isolation.  This block of
>>> code in do_mbind() after queue_pages_range() and mbind_range().
>>>
>>>            if (!err) {
>>>                    int nr_failed = 0;
>>>
>>>                    if (!list_empty(&pagelist)) {
>>>                            WARN_ON_ONCE(flags & MPOL_MF_LAZY);
>>>                            nr_failed = migrate_pages(&pagelist, new_page, NULL,
>>>                                    start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
>>>                            if (nr_failed)
>>>                                    putback_movable_pages(&pagelist);
>>>                    }
>>>
>>>                    if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
>>>                            err = -EIO;
>> Hmm.. I agree this part in man page does look ambiguous. We may assume
>> "MPOL_MF_STRICT is ignored on huge page mappings." implies if
>> MPOL_MF_STRICT is specified alone? If MOVE flag is specified it should
>> return -EIO if some pages could not be moved as what the man page describes.
>>
> It looks to me that current code has no feasible way to ignore STRICT
> flag for hugetlb page when failure happen in  unmap&move phase,
> because mbind is about to handle multiple vma(i.e., hugetlb vma mixed with
> other vma) in one call.

Yes, if you have both MPOL_MF_STRICT and MPOL_MF_MOVE_{ALL} specified, 
so I speculate the condition is MPOL_MF_STRICT is specified alone, but 
the man page doesn't elaborate this. This is also what the code does. 
But, I'm not sure if my understanding is correct or not.

>
>> I don't know what the intention was at the first place. We may have to
>> dig into the history.
>>
> >
Mike Kravetz Jan. 15, 2020, 9:07 p.m. UTC | #9
Naoya, would appreciate any comments you have as this seems to be related
to your changes to add huge page migration support.

On 1/14/20 5:07 PM, Mike Kravetz wrote:
> 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings?
>    - Is that leftover from the the days when huge page migration was not
>      supported?
>    - Is it just because huge page migration is more likely to fail than
>      base page migration.


According to man page, mbind was added to v2.6.7 kernel.  git history does
not go back that far, so I first looked at v2.6.12.

Definitions from include/linux/mempolicy.h
------------------------------------------
/* Policies */
#define MPOL_DEFAULT    0
#define MPOL_PREFERRED  1
#define MPOL_BIND       2
#define MPOL_INTERLEAVE 3

/* Flags for mbind */
#define MPOL_MF_STRICT  (1<<0)  /* Verify existing pages in the mapping */

Note the MPOL_MF_STRICT would simply verify current page placement.  At
this time, hugetlb pages were not part of this verification.

From v2.6.12 routine check_range()
...
	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
		if (!vma->vm_next && vma->vm_end < end)
			return ERR_PTR(-EFAULT);
		if (prev && prev->vm_end < vma->vm_start)
			return ERR_PTR(-EFAULT);
		if ((flags & MPOL_MF_STRICT) && !is_vm_hugetlb_page(vma)) {
			err = verify_pages(vma->vm_mm,
					   vma->vm_start, vma->vm_end, nodes);
			if (err) {
				first = ERR_PTR(err);
				break;
			}
		}
		prev = vma;
	}
...

man page history does not go back this far.  The first mbind man page has
some things that are interesting:
1) It DOES NOT have the note saying MPOL_MF_STRICT is ignored on huge
   page mappings (even though the code does this).
2) Support for huge page policy was added with 2.6.16
3) For MPOL_MF_STRICT, in 2.6.16 or later the kernel will also try to
   move pages to the requested node with this flag.

Next look at v2.6.16 kernel
==========================
This kernel introduces the MPOL_MF_MOVE* flags
#define MPOL_MF_MOVE    (1<<1)  /* Move pages owned by this process to conform
				   to mapping */
#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */

Once again, the up front check_range() routine will skip hugetlb vma's.
Note that check_range() will also isolate pages.  So since hugetlb vma's
are skipped, no hugetlb pages will be isolated.  Since no hugetlb pages
are isolated, none will be on the list to be migrate.  Therefore, hugetlb
pages are effectively ignored for the new MPOL_MF_MOVE* flags.  This makes
sense as huge page migration support was not added until the v3.12 kernel.

Note that at when MPOL_MF_MOVE* flags were added to the mbind man page,
the statement "MPOL_MF_STRICT is ignored on huge page mappings right now."
was added.  This was later changed to "MPOL_MF_STRICT is ignored on huge
page mappings."

Next look at v3.12 kernel
=========================
The v3.12 code looks similiar to today's code.  In the verify/isolation
phase, v3.12 routine queue_pages_hugetlb_pmd_range() looks very similiar
to queue_pages_hugetlb().  Neither will cause errors at this point in the
process.  And, hugetlb pages with a mapcount == 1 will potentially be
isolated and added to the list of pages to be migrated.  In addition, if
the subsequent call to migrate_pages() fails to migrate ANY page, we
return -EIO if MPOL_MF_STRICT is set.  This is true even if the only page(s)
that failed to migrate were hugetlb pages.

It should also be noted that no mbind man page updates were made WRT
hugetlb pages after migration support was added.

Summary
=======
It 'looks' like the statement "MPOL_MF_STRICT is ignored on huge page
mappings." is left over from the original mbind implementation.  When
the huge page migration support was added, I can not be sure if ignoring
MPOL_MF_STRICT for huge pages during the verify/isolation phase was
intentional.  It seems like it was as the return value from
isolate_huge_page() is ignored.

What should we do?
==================
1) Nothing more than optimizations by Li Xinhai.  Behavior that could be
   seen as conflicting with man page has existed since v3.12 and I am
   not aware of any complaints.
2) In addition to optimizations by Li Xinhai, modify code to truly ignore
   MPOL_MF_STRICT for huge page mappings.  This would be fairly easy to do
   after a failure of migrate_pages().  We could simply traverse the list
   of pages that were not migrated looking for any non-hugetlb page.
3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings."
   and modify code accordingly.

My suggestion would be for 1 or 2.  Thoughts?
Yang Shi Jan. 15, 2020, 9:30 p.m. UTC | #10
On 1/15/20 1:07 PM, Mike Kravetz wrote:
> Naoya, would appreciate any comments you have as this seems to be related
> to your changes to add huge page migration support.
>
> On 1/14/20 5:07 PM, Mike Kravetz wrote:
>> 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings?
>>     - Is that leftover from the the days when huge page migration was not
>>       supported?
>>     - Is it just because huge page migration is more likely to fail than
>>       base page migration.
>
> According to man page, mbind was added to v2.6.7 kernel.  git history does
> not go back that far, so I first looked at v2.6.12.
>
> Definitions from include/linux/mempolicy.h
> ------------------------------------------
> /* Policies */
> #define MPOL_DEFAULT    0
> #define MPOL_PREFERRED  1
> #define MPOL_BIND       2
> #define MPOL_INTERLEAVE 3
>
> /* Flags for mbind */
> #define MPOL_MF_STRICT  (1<<0)  /* Verify existing pages in the mapping */
>
> Note the MPOL_MF_STRICT would simply verify current page placement.  At
> this time, hugetlb pages were not part of this verification.
>
>  From v2.6.12 routine check_range()
> ...
> 	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
> 		if (!vma->vm_next && vma->vm_end < end)
> 			return ERR_PTR(-EFAULT);
> 		if (prev && prev->vm_end < vma->vm_start)
> 			return ERR_PTR(-EFAULT);
> 		if ((flags & MPOL_MF_STRICT) && !is_vm_hugetlb_page(vma)) {
> 			err = verify_pages(vma->vm_mm,
> 					   vma->vm_start, vma->vm_end, nodes);
> 			if (err) {
> 				first = ERR_PTR(err);
> 				break;
> 			}
> 		}
> 		prev = vma;
> 	}
> ...
>
> man page history does not go back this far.  The first mbind man page has
> some things that are interesting:
> 1) It DOES NOT have the note saying MPOL_MF_STRICT is ignored on huge
>     page mappings (even though the code does this).
> 2) Support for huge page policy was added with 2.6.16
> 3) For MPOL_MF_STRICT, in 2.6.16 or later the kernel will also try to
>     move pages to the requested node with this flag.
>
> Next look at v2.6.16 kernel
> ==========================
> This kernel introduces the MPOL_MF_MOVE* flags
> #define MPOL_MF_MOVE    (1<<1)  /* Move pages owned by this process to conform
> 				   to mapping */
> #define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */
>
> Once again, the up front check_range() routine will skip hugetlb vma's.
> Note that check_range() will also isolate pages.  So since hugetlb vma's
> are skipped, no hugetlb pages will be isolated.  Since no hugetlb pages
> are isolated, none will be on the list to be migrate.  Therefore, hugetlb
> pages are effectively ignored for the new MPOL_MF_MOVE* flags.  This makes
> sense as huge page migration support was not added until the v3.12 kernel.
>
> Note that at when MPOL_MF_MOVE* flags were added to the mbind man page,
> the statement "MPOL_MF_STRICT is ignored on huge page mappings right now."
> was added.  This was later changed to "MPOL_MF_STRICT is ignored on huge
> page mappings."
>
> Next look at v3.12 kernel
> =========================
> The v3.12 code looks similiar to today's code.  In the verify/isolation
> phase, v3.12 routine queue_pages_hugetlb_pmd_range() looks very similiar
> to queue_pages_hugetlb().  Neither will cause errors at this point in the
> process.  And, hugetlb pages with a mapcount == 1 will potentially be
> isolated and added to the list of pages to be migrated.  In addition, if
> the subsequent call to migrate_pages() fails to migrate ANY page, we
> return -EIO if MPOL_MF_STRICT is set.  This is true even if the only page(s)
> that failed to migrate were hugetlb pages.
>
> It should also be noted that no mbind man page updates were made WRT
> hugetlb pages after migration support was added.
>
> Summary
> =======
> It 'looks' like the statement "MPOL_MF_STRICT is ignored on huge page
> mappings." is left over from the original mbind implementation.  When
> the huge page migration support was added, I can not be sure if ignoring
> MPOL_MF_STRICT for huge pages during the verify/isolation phase was
> intentional.  It seems like it was as the return value from
> isolate_huge_page() is ignored.

Thanks a lot for digging into the history.

>
> What should we do?
> ==================
> 1) Nothing more than optimizations by Li Xinhai.  Behavior that could be
>     seen as conflicting with man page has existed since v3.12 and I am
>     not aware of any complaints.
> 2) In addition to optimizations by Li Xinhai, modify code to truly ignore
>     MPOL_MF_STRICT for huge page mappings.  This would be fairly easy to do
>     after a failure of migrate_pages().  We could simply traverse the list
>     of pages that were not migrated looking for any non-hugetlb page.

I don't think we can do this easily since migrate_pages() would put the 
migration failed hugetlb pages back to hugepage_activelist so there 
should not any hugetlb page reside on the pagelist regardless of failure 
if I read the code correctly.

Other than that traversing page list to look for a certain type page 
doesn't sound scalable to me.

> 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings."
>     and modify code accordingly.
>
> My suggestion would be for 1 or 2.  Thoughts?

By rethinking the history (thanks again for digging into it), it sounds 
#3 should be more reasonable. It sounds like the behavior was changed 
since hugetlb migration was added but the man page was not updated to 
reflect the change.
Mike Kravetz Jan. 15, 2020, 9:45 p.m. UTC | #11
On 1/15/20 1:30 PM, Yang Shi wrote:
> On 1/15/20 1:07 PM, Mike Kravetz wrote:
>> What should we do?
>> ==================
>> 1) Nothing more than optimizations by Li Xinhai.  Behavior that could be
>>     seen as conflicting with man page has existed since v3.12 and I am
>>     not aware of any complaints.
>> 2) In addition to optimizations by Li Xinhai, modify code to truly ignore
>>     MPOL_MF_STRICT for huge page mappings.  This would be fairly easy to do
>>     after a failure of migrate_pages().  We could simply traverse the list
>>     of pages that were not migrated looking for any non-hugetlb page.
> 
> I don't think we can do this easily since migrate_pages() would put the migration failed hugetlb pages back to hugepage_activelist so there should not any hugetlb page reside on the pagelist regardless of failure if I read the code correctly.
> 

You are correct.  I made an assumption without actually looking at the code. :(

> Other than that traversing page list to look for a certain type page doesn't sound scalable to me.
> 
>> 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings."
>>     and modify code accordingly.
>>
>> My suggestion would be for 1 or 2.  Thoughts?
> 
> By rethinking the history (thanks again for digging into it), it sounds #3 should be more reasonable. It sounds like the behavior was changed since hugetlb migration was added but the man page was not updated to reflect the change.
> 

Let's hope Naoya comments.  My only concern with #3 is that we will be changing
behavior.  I do not think many people (if any) depend on existing behavior.
However, you can never be sure.
Yang Shi Jan. 15, 2020, 9:59 p.m. UTC | #12
On 1/15/20 1:45 PM, Mike Kravetz wrote:
> On 1/15/20 1:30 PM, Yang Shi wrote:
>> On 1/15/20 1:07 PM, Mike Kravetz wrote:
>>> What should we do?
>>> ==================
>>> 1) Nothing more than optimizations by Li Xinhai.  Behavior that could be
>>>      seen as conflicting with man page has existed since v3.12 and I am
>>>      not aware of any complaints.
>>> 2) In addition to optimizations by Li Xinhai, modify code to truly ignore
>>>      MPOL_MF_STRICT for huge page mappings.  This would be fairly easy to do
>>>      after a failure of migrate_pages().  We could simply traverse the list
>>>      of pages that were not migrated looking for any non-hugetlb page.
>> I don't think we can do this easily since migrate_pages() would put the migration failed hugetlb pages back to hugepage_activelist so there should not any hugetlb page reside on the pagelist regardless of failure if I read the code correctly.
>>
> You are correct.  I made an assumption without actually looking at the code. :(
>
>> Other than that traversing page list to look for a certain type page doesn't sound scalable to me.
>>
>>> 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings."
>>>      and modify code accordingly.
>>>
>>> My suggestion would be for 1 or 2.  Thoughts?
>> By rethinking the history (thanks again for digging into it), it sounds #3 should be more reasonable. It sounds like the behavior was changed since hugetlb migration was added but the man page was not updated to reflect the change.
>>
> Let's hope Naoya comments.  My only concern with #3 is that we will be changing
> behavior.  I do not think many people (if any) depend on existing behavior.
> However, you can never be sure.

Yes, this would change the bahavior, but I don't see why we have to 
treat hugetlb specially nowadays with migration supported.
Michal Hocko Jan. 16, 2020, 7:59 a.m. UTC | #13
On Wed 15-01-20 13:07:17, Mike Kravetz wrote:
[...]
> Summary
> =======
> It 'looks' like the statement "MPOL_MF_STRICT is ignored on huge page
> mappings." is left over from the original mbind implementation.  When
> the huge page migration support was added, I can not be sure if ignoring
> MPOL_MF_STRICT for huge pages during the verify/isolation phase was
> intentional.  It seems like it was as the return value from
> isolate_huge_page() is ignored.

THanks for the tedious work of studying the mess^Whistory.
> 
> What should we do?
> ==================
> 1) Nothing more than optimizations by Li Xinhai.  Behavior that could be
>    seen as conflicting with man page has existed since v3.12 and I am
>    not aware of any complaints.
> 2) In addition to optimizations by Li Xinhai, modify code to truly ignore
>    MPOL_MF_STRICT for huge page mappings.  This would be fairly easy to do
>    after a failure of migrate_pages().  We could simply traverse the list
>    of pages that were not migrated looking for any non-hugetlb page.
> 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings."
>    and modify code accordingly.
> 
> My suggestion would be for 1 or 2.  Thoughts?

And why do we exactly need to do anything at all? There is an
inconsistency that has been there for years without anybody noticing.
NUMA API is a mess on its own and unfixable at this stage, there will
always be some corner cases. If there is no real workload hitting this
incosistency and suffering, I would rather not touch this at all.
Unless the change would clean up the code or make it more maintainable.
HORIGUCHI NAOYA(堀口 直也) Jan. 16, 2020, 8:07 a.m. UTC | #14
Hi everyone,

Thank you all for finding and digging the issue.

> Summary
> =======
> It 'looks' like the statement "MPOL_MF_STRICT is ignored on huge page
> mappings." is left over from the original mbind implementation.  When
> the huge page migration support was added, I can not be sure if ignoring
> MPOL_MF_STRICT for huge pages during the verify/isolation phase was
> intentional.  It seems like it was as the return value from
> isolate_huge_page() is ignored.

This summary is totally correct.  I've simply missed considering MPOL_MF_STRICT
flag when implementing hugetlb migration.  As you pointed out, the discrepacy
between the manpage and the code is also due to the lack of updates on the
"MPOL_MF_STRICT is ignored on huge page mappings." statement.

On Wed, Jan 15, 2020 at 01:59:14PM -0800, Yang Shi wrote:
> 
> On 1/15/20 1:45 PM, Mike Kravetz wrote:
> > On 1/15/20 1:30 PM, Yang Shi wrote:
> > > On 1/15/20 1:07 PM, Mike Kravetz wrote:
> > > > What should we do?
> > > > ==================
> > > > 1) Nothing more than optimizations by Li Xinhai.  Behavior that could be
> > > >      seen as conflicting with man page has existed since v3.12 and I am
> > > >      not aware of any complaints.
> > > > 2) In addition to optimizations by Li Xinhai, modify code to truly ignore
> > > >      MPOL_MF_STRICT for huge page mappings.  This would be fairly easy to do
> > > >      after a failure of migrate_pages().  We could simply traverse the list
> > > >      of pages that were not migrated looking for any non-hugetlb page.
> > > I don't think we can do this easily since migrate_pages() would put the migration failed hugetlb pages back to hugepage_activelist so there should not any hugetlb page reside on the pagelist regardless of failure if I read the code correctly.

Although this behavior seems to me not prevent from finding non-hugetlb
pages in migration list, this is a difference in migration behavior between
normal pages and hugepages that might be better to be optimized.
Maybe hugepages failed to migrate should remain in migration list after
migrate_pages() returns and the should be put back via putback_movable_pages().

> > > 
> > You are correct.  I made an assumption without actually looking at the code. :(
> > 
> > > Other than that traversing page list to look for a certain type page doesn't sound scalable to me.
> > > 
> > > > 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings."
> > > >      and modify code accordingly.
> > > > 
> > > > My suggestion would be for 1 or 2.  Thoughts?
> > > By rethinking the history (thanks again for digging into it), it sounds #3 should be more reasonable. It sounds like the behavior was changed since hugetlb migration was added but the man page was not updated to reflect the change.
> > > 
> > Let's hope Naoya comments.  My only concern with #3 is that we will be changing
> > behavior.  I do not think many people (if any) depend on existing behavior.
> > However, you can never be sure.
> 
> Yes, this would change the bahavior, but I don't see why we have to treat
> hugetlb specially nowadays with migration supported.

(Option #1 is good for short term solution, but eventually) I agree with option #3.
We have no reason to handle hugetlb differently about MPOL_MF_STRICT flag.

Thanks,
Naoya Horiguchi
Li Xinhai Jan. 16, 2020, 3:32 p.m. UTC | #15
On 2020-01-16 at 16:07 HORIGUCHI NAOYA(堀口 直也) wrote:
>Hi everyone,
>
>Thank you all for finding and digging the issue.
>
>> Summary
>> =======
>> It 'looks' like the statement "MPOL_MF_STRICT is ignored on huge page
>> mappings." is left over from the original mbind implementation.  When
>> the huge page migration support was added, I can not be sure if ignoring
>> MPOL_MF_STRICT for huge pages during the verify/isolation phase was
>> intentional.  It seems like it was as the return value from
>> isolate_huge_page() is ignored.
>
>This summary is totally correct.  I've simply missed considering MPOL_MF_STRICT
>flag when implementing hugetlb migration.  As you pointed out, the discrepacy
>between the manpage and the code is also due to the lack of updates on the
>"MPOL_MF_STRICT is ignored on huge page mappings." statement.
>
>On Wed, Jan 15, 2020 at 01:59:14PM -0800, Yang Shi wrote:
>>
>> On 1/15/20 1:45 PM, Mike Kravetz wrote:
>> > On 1/15/20 1:30 PM, Yang Shi wrote:
>> > > On 1/15/20 1:07 PM, Mike Kravetz wrote:
>> > > > What should we do?
>> > > > ==================
>> > > > 1) Nothing more than optimizations by Li Xinhai.  Behavior that could be
>> > > >      seen as conflicting with man page has existed since v3.12 and I am
>> > > >      not aware of any complaints.
>> > > > 2) In addition to optimizations by Li Xinhai, modify code to truly ignore
>> > > >      MPOL_MF_STRICT for huge page mappings.  This would be fairly easy to do
>> > > >      after a failure of migrate_pages().  We could simply traverse the list
>> > > >      of pages that were not migrated looking for any non-hugetlb page.
>> > > I don't think we can do this easily since migrate_pages() would put the migration failed hugetlb pages back to hugepage_activelist so there should not any hugetlb page reside on the pagelist regardless of failure if I read the code correctly.
>
>Although this behavior seems to me not prevent from finding non-hugetlb
>pages in migration list, this is a difference in migration behavior between
>normal pages and hugepages that might be better to be optimized.
>Maybe hugepages failed to migrate should remain in migration list after
>migrate_pages() returns and the should be put back via putback_movable_pages().
>
>> > >
>> > You are correct.  I made an assumption without actually looking at the code. :(
>> >
>> > > Other than that traversing page list to look for a certain type page doesn't sound scalable to me.
>> > >
>> > > > 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings."
>> > > >      and modify code accordingly.
>> > > >
>> > > > My suggestion would be for 1 or 2.  Thoughts?
>> > > By rethinking the history (thanks again for digging into it), it sounds #3 should be more reasonable. It sounds like the behavior was changed since hugetlb migration was added but the man page was not updated to reflect the change.
>> > >
>> > Let's hope Naoya comments.  My only concern with #3 is that we will be changing
>> > behavior.  I do not think many people (if any) depend on existing behavior.
>> > However, you can never be sure.
>>
>> Yes, this would change the bahavior, but I don't see why we have to treat
>> hugetlb specially nowadays with migration supported.
>
>(Option #1 is good for short term solution, but eventually) I agree with option #3.
>We have no reason to handle hugetlb differently about MPOL_MF_STRICT flag. 

Thanks. Same thoughts for option #3, but it seems better not change current
behavior. 
Add more about current behavior of code:
- In unmap&move phase, there is no different behavior of handling hugepage
and non-hugepage, that is when STRICT is set, report -EIO if any page
can't move, when STRICT is not set, don't report when failed to move;
- In isolation phase, STRICT is effective for non-hugepge, that means set
STRICT alone will cause -EIO if found misplaced pages, and set STRICT with
MOVE* will cause -EIO if failed to isolate pages; for hugepage, STRICT is
ignored, it don't detect misplaced pages nor report -EIO if isolation failed.

This patch don't change any part of current behavior, only avoids walking
page table, where currently do nothing if STRICT is set alone.

>
>Thanks,
>Naoya Horiguchi
Mike Kravetz Jan. 16, 2020, 7:22 p.m. UTC | #16
On 1/15/20 11:59 PM, Michal Hocko wrote:
> On Wed 15-01-20 13:07:17, Mike Kravetz wrote:
>> What should we do?
>> ==================
>> 1) Nothing more than optimizations by Li Xinhai.  Behavior that could be
>>    seen as conflicting with man page has existed since v3.12 and I am
>>    not aware of any complaints.
>> 2) In addition to optimizations by Li Xinhai, modify code to truly ignore
>>    MPOL_MF_STRICT for huge page mappings.  This would be fairly easy to do
>>    after a failure of migrate_pages().  We could simply traverse the list
>>    of pages that were not migrated looking for any non-hugetlb page.
>> 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings."
>>    and modify code accordingly.
>>
>> My suggestion would be for 1 or 2.  Thoughts?
> 
> And why do we exactly need to do anything at all? There is an
> inconsistency that has been there for years without anybody noticing.
> NUMA API is a mess on its own and unfixable at this stage, there will
> always be some corner cases. If there is no real workload hitting this
> incosistency and suffering, I would rather not touch this at all.
> Unless the change would clean up the code or make it more maintainable.

That is a very valid point.  Sometimes we as developers get focused on the
actual code changes and fail to ask the question "does this really need to
be changed?" or "what value do the code changes provide?".

Li Xinhai came up with two optimizations in how the mbind code deals with
hugetlb pages.  This 'sub-optimal' code has existed for more than 6 years.
Unless I am mistaken, nobody has actually complained or noticed this behavior.
I believe Li Xinhai noticed this inefficient code via code inspection.  Of
course, based on what we know today one could write a test program to show
the inefficient behavior.  However, no real users have noticed this during
the past 6 years.

The proposed code changes are fairly simple.  However, I would not say that
they clean up the code or make it more maintainable.  They essentially add
or modify two checks to bail out early for hugetlb vma's if the flag which
is documented to not apply to hugetlb pages (MPOL_MF_STRICT) is specified.
If one is trying to follow the entire mbind code path for hugetlb pages,
these patches will make that easier follow/understand.  That is simply
because one can ignore downstream code/functionality.

Based on Michal's criteria above, I now believe the code changes should not
be made.  Yes, they are fairly simple.  However, even simple changes have
the potential to break something (build breakage with v1 of patch).  We should
leave this code as is unless issues are reported by users.
Yang Shi Jan. 17, 2020, 2:32 a.m. UTC | #17
On 1/16/20 11:22 AM, Mike Kravetz wrote:
> On 1/15/20 11:59 PM, Michal Hocko wrote:
>> On Wed 15-01-20 13:07:17, Mike Kravetz wrote:
>>> What should we do?
>>> ==================
>>> 1) Nothing more than optimizations by Li Xinhai.  Behavior that could be
>>>     seen as conflicting with man page has existed since v3.12 and I am
>>>     not aware of any complaints.
>>> 2) In addition to optimizations by Li Xinhai, modify code to truly ignore
>>>     MPOL_MF_STRICT for huge page mappings.  This would be fairly easy to do
>>>     after a failure of migrate_pages().  We could simply traverse the list
>>>     of pages that were not migrated looking for any non-hugetlb page.
>>> 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings."
>>>     and modify code accordingly.
>>>
>>> My suggestion would be for 1 or 2.  Thoughts?
>> And why do we exactly need to do anything at all? There is an
>> inconsistency that has been there for years without anybody noticing.
>> NUMA API is a mess on its own and unfixable at this stage, there will
>> always be some corner cases. If there is no real workload hitting this
>> incosistency and suffering, I would rather not touch this at all.
>> Unless the change would clean up the code or make it more maintainable.
> That is a very valid point.  Sometimes we as developers get focused on the
> actual code changes and fail to ask the question "does this really need to
> be changed?" or "what value do the code changes provide?".
>
> Li Xinhai came up with two optimizations in how the mbind code deals with
> hugetlb pages.  This 'sub-optimal' code has existed for more than 6 years.
> Unless I am mistaken, nobody has actually complained or noticed this behavior.
> I believe Li Xinhai noticed this inefficient code via code inspection.  Of
> course, based on what we know today one could write a test program to show
> the inefficient behavior.  However, no real users have noticed this during
> the past 6 years.
>
> The proposed code changes are fairly simple.  However, I would not say that
> they clean up the code or make it more maintainable.  They essentially add
> or modify two checks to bail out early for hugetlb vma's if the flag which
> is documented to not apply to hugetlb pages (MPOL_MF_STRICT) is specified.
> If one is trying to follow the entire mbind code path for hugetlb pages,
> these patches will make that easier follow/understand.  That is simply
> because one can ignore downstream code/functionality.
>
> Based on Michal's criteria above, I now believe the code changes should not
> be made.  Yes, they are fairly simple.  However, even simple changes have
> the potential to break something (build breakage with v1 of patch).  We should
> leave this code as is unless issues are reported by users.

I tend to agree with you. And, according to what Horiguchi explained, 
the intention was not ignoring hugetlb mappings when hugetlb migration 
was added at the first place.  And, I'm supposed all of us agree hugetlb 
pages should be not treated specially although it is not a good timing 
or there is not strong motivation to fix it right now (we may correct 
the behavior in the future). The patch may convey the wrong information. 
And, the code path is definitely not a hot path, so I'm fine to drop it.

And, I'm wondering if we need add some comments in the code to explain 
the edge case just in case someone else repeat all the tedious history 
digging.
Li Xinhai Jan. 17, 2020, 2:38 a.m. UTC | #18
On 2020-01-17 at 03:22 Mike Kravetz wrote:
>On 1/15/20 11:59 PM, Michal Hocko wrote:
>> On Wed 15-01-20 13:07:17, Mike Kravetz wrote:
>>> What should we do?
>>> ==================
>>> 1) Nothing more than optimizations by Li Xinhai.  Behavior that could be
>>>    seen as conflicting with man page has existed since v3.12 and I am
>>>    not aware of any complaints.
>>> 2) In addition to optimizations by Li Xinhai, modify code to truly ignore
>>>    MPOL_MF_STRICT for huge page mappings.  This would be fairly easy to do
>>>    after a failure of migrate_pages().  We could simply traverse the list
>>>    of pages that were not migrated looking for any non-hugetlb page.
>>> 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings."
>>>    and modify code accordingly.
>>>
>>> My suggestion would be for 1 or 2.  Thoughts?
>>
>> And why do we exactly need to do anything at all? There is an
>> inconsistency that has been there for years without anybody noticing.
>> NUMA API is a mess on its own and unfixable at this stage, there will
>> always be some corner cases. If there is no real workload hitting this
>> incosistency and suffering, I would rather not touch this at all.
>> Unless the change would clean up the code or make it more maintainable.
>
>That is a very valid point.  Sometimes we as developers get focused on the
>actual code changes and fail to ask the question "does this really need to
>be changed?" or "what value do the code changes provide?".
>
>Li Xinhai came up with two optimizations in how the mbind code deals with
>hugetlb pages.  This 'sub-optimal' code has existed for more than 6 years.
>Unless I am mistaken, nobody has actually complained or noticed this behavior.
>I believe Li Xinhai noticed this inefficient code via code inspection.  Of
>course, based on what we know today one could write a test program to show
>the inefficient behavior.  However, no real users have noticed this during
>the past 6 years.
>
>The proposed code changes are fairly simple.  However, I would not say that
>they clean up the code or make it more maintainable.  They essentially add
>or modify two checks to bail out early for hugetlb vma's if the flag which
>is documented to not apply to hugetlb pages (MPOL_MF_STRICT) is specified.
>If one is trying to follow the entire mbind code path for hugetlb pages,
>these patches will make that easier follow/understand.  That is simply
>because one can ignore downstream code/functionality.
>
>Based on Michal's criteria above, I now believe the code changes should not
>be made.  Yes, they are fairly simple.  However, even simple changes have
>the potential to break something (build breakage with v1 of patch).  We should
>leave this code as is unless issues are reported by users. 

Acctually I am the user of this API, and when using STRICT alone to know if
pages are misplaced and take action later, it seems don't work consistantly
on hugepage. Initially, I assumed that I didn't use it correctly, that flag must
use with MOVE*? After check the code, found that flag didn't been handled,
and finally noticed that there is a NOTE about it in MAN page.

I'd like the STRICT been handled, but at present since this is not going to
be implemented, I have to handle differently on hugetlb mapping.

At meantime, I thought that this patch can be useful and posted it, because
for scenarios where hugetlb mapping are handled with other mappings, less
cost for the mbind call. (although it didn't help my current case)

I think if we could provid better performance, why not do that only because
that code has exists for 6 years?

>--
>Mike Kravetz
Michal Hocko Jan. 17, 2020, 7:57 a.m. UTC | #19
On Fri 17-01-20 10:38:21, Li Xinhai wrote:
> On 2020-01-17 at 03:22 Mike Kravetz wrote:
> >On 1/15/20 11:59 PM, Michal Hocko wrote:
> >> On Wed 15-01-20 13:07:17, Mike Kravetz wrote:
> >>> What should we do?
> >>> ==================
> >>> 1) Nothing more than optimizations by Li Xinhai.  Behavior that could be
> >>>    seen as conflicting with man page has existed since v3.12 and I am
> >>>    not aware of any complaints.
> >>> 2) In addition to optimizations by Li Xinhai, modify code to truly ignore
> >>>    MPOL_MF_STRICT for huge page mappings.  This would be fairly easy to do
> >>>    after a failure of migrate_pages().  We could simply traverse the list
> >>>    of pages that were not migrated looking for any non-hugetlb page.
> >>> 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings."
> >>>    and modify code accordingly.
> >>>
> >>> My suggestion would be for 1 or 2.  Thoughts?
> >>
> >> And why do we exactly need to do anything at all? There is an
> >> inconsistency that has been there for years without anybody noticing.
> >> NUMA API is a mess on its own and unfixable at this stage, there will
> >> always be some corner cases. If there is no real workload hitting this
> >> incosistency and suffering, I would rather not touch this at all.
> >> Unless the change would clean up the code or make it more maintainable.
> >
> >That is a very valid point.  Sometimes we as developers get focused on the
> >actual code changes and fail to ask the question "does this really need to
> >be changed?" or "what value do the code changes provide?".
> >
> >Li Xinhai came up with two optimizations in how the mbind code deals with
> >hugetlb pages.  This 'sub-optimal' code has existed for more than 6 years.
> >Unless I am mistaken, nobody has actually complained or noticed this behavior.
> >I believe Li Xinhai noticed this inefficient code via code inspection.  Of
> >course, based on what we know today one could write a test program to show
> >the inefficient behavior.  However, no real users have noticed this during
> >the past 6 years.
> >
> >The proposed code changes are fairly simple.  However, I would not say that
> >they clean up the code or make it more maintainable.  They essentially add
> >or modify two checks to bail out early for hugetlb vma's if the flag which
> >is documented to not apply to hugetlb pages (MPOL_MF_STRICT) is specified.
> >If one is trying to follow the entire mbind code path for hugetlb pages,
> >these patches will make that easier follow/understand.  That is simply
> >because one can ignore downstream code/functionality.
> >
> >Based on Michal's criteria above, I now believe the code changes should not
> >be made.  Yes, they are fairly simple.  However, even simple changes have
> >the potential to break something (build breakage with v1 of patch).  We should
> >leave this code as is unless issues are reported by users. 
> 
> Acctually I am the user of this API, and when using STRICT alone to know if
> pages are misplaced and take action later, it seems don't work consistantly
> on hugepage. Initially, I assumed that I didn't use it correctly, that flag must
> use with MOVE*? After check the code, found that flag didn't been handled,
> and finally noticed that there is a NOTE about it in MAN page.

This is the first time you are mentioning an actual use case. This
should have been expressed from the very begining. Including an
explanation of what the use case is and how it is affected by the
current behavior.

> I'd like the STRICT been handled, but at present since this is not going to
> be implemented, I have to handle differently on hugetlb mapping.
> 
> At meantime, I thought that this patch can be useful and posted it, because
> for scenarios where hugetlb mapping are handled with other mappings, less
> cost for the mbind call. (although it didn't help my current case)
> 
> I think if we could provid better performance, why not do that only because
> that code has exists for 6 years?

Do you have any numbers to prove performance improvements? I believe
arguments against the patch have been already presented.
Li Xinhai Jan. 17, 2020, 12:05 p.m. UTC | #20
On 2020-01-17 at 15:57 Michal Hocko wrote:
>On Fri 17-01-20 10:38:21, Li Xinhai wrote:
>> On 2020-01-17 at 03:22 Mike Kravetz wrote:
>> >On 1/15/20 11:59 PM, Michal Hocko wrote:
>> >> On Wed 15-01-20 13:07:17, Mike Kravetz wrote:
>> >>> What should we do?
>> >>> ==================
>> >>> 1) Nothing more than optimizations by Li Xinhai.  Behavior that could be
>> >>>    seen as conflicting with man page has existed since v3.12 and I am
>> >>>    not aware of any complaints.
>> >>> 2) In addition to optimizations by Li Xinhai, modify code to truly ignore
>> >>>    MPOL_MF_STRICT for huge page mappings.  This would be fairly easy to do
>> >>>    after a failure of migrate_pages().  We could simply traverse the list
>> >>>    of pages that were not migrated looking for any non-hugetlb page.
>> >>> 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings."
>> >>>    and modify code accordingly.
>> >>>
>> >>> My suggestion would be for 1 or 2.  Thoughts?
>> >>
>> >> And why do we exactly need to do anything at all? There is an
>> >> inconsistency that has been there for years without anybody noticing.
>> >> NUMA API is a mess on its own and unfixable at this stage, there will
>> >> always be some corner cases. If there is no real workload hitting this
>> >> incosistency and suffering, I would rather not touch this at all.
>> >> Unless the change would clean up the code or make it more maintainable.
>> >
>> >That is a very valid point.  Sometimes we as developers get focused on the
>> >actual code changes and fail to ask the question "does this really need to
>> >be changed?" or "what value do the code changes provide?".
>> >
>> >Li Xinhai came up with two optimizations in how the mbind code deals with
>> >hugetlb pages.  This 'sub-optimal' code has existed for more than 6 years.
>> >Unless I am mistaken, nobody has actually complained or noticed this behavior.
>> >I believe Li Xinhai noticed this inefficient code via code inspection.  Of
>> >course, based on what we know today one could write a test program to show
>> >the inefficient behavior.  However, no real users have noticed this during
>> >the past 6 years.
>> >
>> >The proposed code changes are fairly simple.  However, I would not say that
>> >they clean up the code or make it more maintainable.  They essentially add
>> >or modify two checks to bail out early for hugetlb vma's if the flag which
>> >is documented to not apply to hugetlb pages (MPOL_MF_STRICT) is specified.
>> >If one is trying to follow the entire mbind code path for hugetlb pages,
>> >these patches will make that easier follow/understand.  That is simply
>> >because one can ignore downstream code/functionality.
>> >
>> >Based on Michal's criteria above, I now believe the code changes should not
>> >be made.  Yes, they are fairly simple.  However, even simple changes have
>> >the potential to break something (build breakage with v1 of patch).  We should
>> >leave this code as is unless issues are reported by users.
>>
>> Acctually I am the user of this API, and when using STRICT alone to know if
>> pages are misplaced and take action later, it seems don't work consistantly
>> on hugepage. Initially, I assumed that I didn't use it correctly, that flag must
>> use with MOVE*? After check the code, found that flag didn't been handled,
>> and finally noticed that there is a NOTE about it in MAN page.
>
>This is the first time you are mentioning an actual use case. This
>should have been expressed from the very begining. Including an
>explanation of what the use case is and how it is affected by the
>current behavior.
> 
OK, that is better practice, thanks.

>> I'd like the STRICT been handled, but at present since this is not going to
>> be implemented, I have to handle differently on hugetlb mapping.
>>
>> At meantime, I thought that this patch can be useful and posted it, because
>> for scenarios where hugetlb mapping are handled with other mappings, less
>> cost for the mbind call. (although it didn't help my current case)
>>
>> I think if we could provid better performance, why not do that only because
>> that code has exists for 6 years?
>
>Do you have any numbers to prove performance improvements? I believe
>arguments against the patch have been already presented. 

I didn't test to know difference with this patch since it doesn't help for
my case. If walking the page table and checking the present PTE is
basically no cost according to existing experience, then it is not needed.

Now, for my case, I am doing workaround by call mbind then check the
numa_maps, instead of just call mbind. So, I'd like option #3 is there. That is
a simple change, and I am glad to add this function if no objection.

>--
>Michal Hocko
>SUSE Labs
Michal Hocko Jan. 17, 2020, 3:20 p.m. UTC | #21
On Fri 17-01-20 20:05:23, Li Xinhai wrote:
[...]
> Now, for my case, I am doing workaround

workaround for what?

> by call mbind then check the
> numa_maps, instead of just call mbind. So, I'd like option #3 is there. That is
> a simple change, and I am glad to add this function if no objection.

Any reason you cannot use move_pages syscall?
Li Xinhai Jan. 17, 2020, 3:46 p.m. UTC | #22
On 2020-01-17 at 23:20 Michal Hocko wrote:
>On Fri 17-01-20 20:05:23, Li Xinhai wrote:
>[...]
>> Now, for my case, I am doing workaround
>
>workaround for what?
> 
I mean now call mbind without MOVE* or STRICT, only apply policy,
because my purpose is to take action later on misplaced pages.
If STRICT give correct feedback, I don't need to check numa_maps.
For above description, I don't use MOVE*, as want to take action later.

>> by call mbind then check the
>> numa_maps, instead of just call mbind. So, I'd like option #3 is there. That is
>> a simple change, and I am glad to add this function if no objection.
>
>Any reason you cannot use move_pages syscall? 

I use it, that is the 'take action later' mentioned above, I'd like to
know if there are misplace pages before call it if mbind can help
to give answer.

>--
>Michal Hocko
>SUSE Labs
Michal Hocko Jan. 20, 2020, 12:45 p.m. UTC | #23
On Fri 17-01-20 23:46:03, Li Xinhai wrote:
[...]
> >> by call mbind then check the
> >> numa_maps, instead of just call mbind. So, I'd like option #3 is there. That is
> >> a simple change, and I am glad to add this function if no objection.
> >
> >Any reason you cannot use move_pages syscall? 
> 
> I use it, that is the 'take action later' mentioned above, I'd like to
> know if there are misplace pages before call it if mbind can help
> to give answer.

I am sorry but I do not follow? Why do you need to do two steps? Why
don't you simply call move_pages right away?
Li Xinhai Jan. 21, 2020, 2:15 p.m. UTC | #24
On 2020-01-20 at 20:45 Michal Hocko wrote:
>On Fri 17-01-20 23:46:03, Li Xinhai wrote:
>[...]
>> >> by call mbind then check the
>> >> numa_maps, instead of just call mbind. So, I'd like option #3 is there. That is
>> >> a simple change, and I am glad to add this function if no objection.
>> >
>> >Any reason you cannot use move_pages syscall?
>>
>> I use it, that is the 'take action later' mentioned above, I'd like to
>> know if there are misplace pages before call it if mbind can help
>> to give answer.
>
>I am sorry but I do not follow? Why do you need to do two steps? Why
>don't you simply call move_pages right away? 
I am trying to decrease the number of call to move_pages(). For
call move_pages(), we need prepare parameters in array for 'pages',
'nodes' (i.e., it do not support moving pages as specified through start addr
and length), which could be cost.

One way for handling my case is 
- mbind: set policy without detecting if any pages misplaced nor moving 
  pages.
(continue below part when timing is right)
- move_pages:  call on pages which are within range of above mbind.

The other way is:
- mbind: set policy and detects if any pages are misplaced.
(continue below part when timing is right)
- move_pages:  call on pages which are within range has misplaced pages.

Knowing about misplaced pages from mbind will give me flexibility for
choosing different solutions.

>
>--
>Michal Hocko
>SUSE Labs
Michal Hocko Jan. 21, 2020, 2:53 p.m. UTC | #25
On Tue 21-01-20 22:15:43, Li Xinhai wrote:
[...]
> Knowing about misplaced pages from mbind will give me flexibility for
> choosing different solutions.

Would /proc/pid/numa_maps help to give you rough idea which mappings
of interest have misplaced pages and move_pages for those? It is still
allocating arrays but is that really a big deal?
Li Xinhai Jan. 22, 2020, 1:55 p.m. UTC | #26
On 2020-01-21 at 22:53 Michal Hocko wrote:
>On Tue 21-01-20 22:15:43, Li Xinhai wrote:
>[...]
>> Knowing about misplaced pages from mbind will give me flexibility for
>> choosing different solutions.
>
>Would /proc/pid/numa_maps help to give you rough idea which mappings
>of interest have misplaced pages and move_pages for those? It is still
>allocating arrays but is that really a big deal? 

numa_maps can be checked about misplaced pages, and I mentioned it in
previous mail for this purpose.
For my case, using move_pages() is necessary for later action. The only difference
is about mbind() could give feedback of misplaced page on impacted range,
numa_mpas requres me iterate over maps (that will collect states on those
maps although don't been used) to reach impacted range and check
it. 
I may assume there is no much difference on overall performance, it is more
about coding effort in user sapce application.

I've being try to apply patch for mbind(i.e., apply STRICT on hugetlb mapping),
and may sharing it after verification and let's see if that is usable, thanks.

>--
>Michal Hocko
>SUSE Labs
diff mbox series

Patch

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 067cf7d..c117b5f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -656,6 +656,13 @@  static int queue_pages_test_walk(unsigned long start, unsigned long end,
 		return 1;
 	}
 
+	/* MPOL_MF_STRICT is ignored for huge page, skip checking
+	*  misplaced pages
+	*/
+	if ((flags & MPOL_MF_VALID) == MPOL_MF_STRICT &&
+		is_vm_hugetlb_page(vma))
+		return 1;
+
 	/* queue pages from current vma */
 	if (flags & MPOL_MF_VALID)
 		return 0;