diff mbox series

[v2] mm: thp: fix false negative of shmem vma's THP eligibility

Message ID 1556037781-57869-1-git-send-email-yang.shi@linux.alibaba.com (mailing list archive)
State New, archived
Headers show
Series [v2] mm: thp: fix false negative of shmem vma's THP eligibility | expand

Commit Message

Yang Shi April 23, 2019, 4:43 p.m. UTC
The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each
vma") introduced THPeligible bit for processes' smaps. But, when checking
the eligibility for shmem vma, __transparent_hugepage_enabled() is
called to override the result from shmem_huge_enabled().  It may result
in the anonymous vma's THP flag override shmem's.  For example, running a
simple test which create THP for shmem, but with anonymous THP disabled,
when reading the process's smaps, it may show:

7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
Size:               4096 kB
...
[snip]
...
ShmemPmdMapped:     4096 kB
...
[snip]
...
THPeligible:    0

And, /proc/meminfo does show THP allocated and PMD mapped too:

ShmemHugePages:     4096 kB
ShmemPmdMapped:     4096 kB

This doesn't make too much sense.  The anonymous THP flag should not
intervene shmem THP.  Calling shmem_huge_enabled() with checking
MMF_DISABLE_THP sounds good enough.  And, we could skip stack and
dax vma check since we already checked if the vma is shmem already.

Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma")
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
v2: Check VM_NOHUGEPAGE per Michal Hocko

 mm/huge_memory.c | 4 ++--
 mm/shmem.c       | 3 +++
 2 files changed, 5 insertions(+), 2 deletions(-)

Comments

Michal Hocko April 23, 2019, 5:52 p.m. UTC | #1
On Wed 24-04-19 00:43:01, Yang Shi wrote:
> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each
> vma") introduced THPeligible bit for processes' smaps. But, when checking
> the eligibility for shmem vma, __transparent_hugepage_enabled() is
> called to override the result from shmem_huge_enabled().  It may result
> in the anonymous vma's THP flag override shmem's.  For example, running a
> simple test which create THP for shmem, but with anonymous THP disabled,
> when reading the process's smaps, it may show:
> 
> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
> Size:               4096 kB
> ...
> [snip]
> ...
> ShmemPmdMapped:     4096 kB
> ...
> [snip]
> ...
> THPeligible:    0
> 
> And, /proc/meminfo does show THP allocated and PMD mapped too:
> 
> ShmemHugePages:     4096 kB
> ShmemPmdMapped:     4096 kB
> 
> This doesn't make too much sense.  The anonymous THP flag should not
> intervene shmem THP.  Calling shmem_huge_enabled() with checking
> MMF_DISABLE_THP sounds good enough.  And, we could skip stack and
> dax vma check since we already checked if the vma is shmem already.

Kirill, can we get a confirmation that this is really intended behavior
rather than an omission please? Is this documented? What is a global
knob to simply disable THP system wise?

I have to say that the THP tuning API is one giant mess :/

Btw. this patch also seem to fix khugepaged behavior because it previously
ignored both VM_NOHUGEPAGE and MMF_DISABLE_THP.

> Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma")
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> ---
> v2: Check VM_NOHUGEPAGE per Michal Hocko
> 
>  mm/huge_memory.c | 4 ++--
>  mm/shmem.c       | 3 +++
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 165ea46..5881e82 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -67,8 +67,8 @@ bool transparent_hugepage_enabled(struct vm_area_struct *vma)
>  {
>  	if (vma_is_anonymous(vma))
>  		return __transparent_hugepage_enabled(vma);
> -	if (vma_is_shmem(vma) && shmem_huge_enabled(vma))
> -		return __transparent_hugepage_enabled(vma);
> +	if (vma_is_shmem(vma))
> +		return shmem_huge_enabled(vma);
>  
>  	return false;
>  }
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 2275a0f..6f09a31 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -3873,6 +3873,9 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
>  	loff_t i_size;
>  	pgoff_t off;
>  
> +	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
> +		return false;
>  	if (shmem_huge == SHMEM_HUGE_FORCE)
>  		return true;
>  	if (shmem_huge == SHMEM_HUGE_DENY)
> -- 
> 1.8.3.1
>
Yang Shi April 23, 2019, 6:34 p.m. UTC | #2
On 4/23/19 10:52 AM, Michal Hocko wrote:
> On Wed 24-04-19 00:43:01, Yang Shi wrote:
>> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each
>> vma") introduced THPeligible bit for processes' smaps. But, when checking
>> the eligibility for shmem vma, __transparent_hugepage_enabled() is
>> called to override the result from shmem_huge_enabled().  It may result
>> in the anonymous vma's THP flag override shmem's.  For example, running a
>> simple test which create THP for shmem, but with anonymous THP disabled,
>> when reading the process's smaps, it may show:
>>
>> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
>> Size:               4096 kB
>> ...
>> [snip]
>> ...
>> ShmemPmdMapped:     4096 kB
>> ...
>> [snip]
>> ...
>> THPeligible:    0
>>
>> And, /proc/meminfo does show THP allocated and PMD mapped too:
>>
>> ShmemHugePages:     4096 kB
>> ShmemPmdMapped:     4096 kB
>>
>> This doesn't make too much sense.  The anonymous THP flag should not
>> intervene shmem THP.  Calling shmem_huge_enabled() with checking
>> MMF_DISABLE_THP sounds good enough.  And, we could skip stack and
>> dax vma check since we already checked if the vma is shmem already.
> Kirill, can we get a confirmation that this is really intended behavior
> rather than an omission please? Is this documented? What is a global
> knob to simply disable THP system wise?
>
> I have to say that the THP tuning API is one giant mess :/
>
> Btw. this patch also seem to fix khugepaged behavior because it previously
> ignored both VM_NOHUGEPAGE and MMF_DISABLE_THP.

Aha, I didn't notice this. It looks we need separate the patch to fix 
that khugepaged problem for both 5.1-rc and LTS.

>
>> Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma")
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Kirill A. Shutemov <kirill@shutemov.name>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>> ---
>> v2: Check VM_NOHUGEPAGE per Michal Hocko
>>
>>   mm/huge_memory.c | 4 ++--
>>   mm/shmem.c       | 3 +++
>>   2 files changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 165ea46..5881e82 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -67,8 +67,8 @@ bool transparent_hugepage_enabled(struct vm_area_struct *vma)
>>   {
>>   	if (vma_is_anonymous(vma))
>>   		return __transparent_hugepage_enabled(vma);
>> -	if (vma_is_shmem(vma) && shmem_huge_enabled(vma))
>> -		return __transparent_hugepage_enabled(vma);
>> +	if (vma_is_shmem(vma))
>> +		return shmem_huge_enabled(vma);
>>   
>>   	return false;
>>   }
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 2275a0f..6f09a31 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -3873,6 +3873,9 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
>>   	loff_t i_size;
>>   	pgoff_t off;
>>   
>> +	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>> +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>> +		return false;
>>   	if (shmem_huge == SHMEM_HUGE_FORCE)
>>   		return true;
>>   	if (shmem_huge == SHMEM_HUGE_DENY)
>> -- 
>> 1.8.3.1
>>
Yang Shi April 24, 2019, 12:22 a.m. UTC | #3
On 4/23/19 11:34 AM, Yang Shi wrote:
>
>
> On 4/23/19 10:52 AM, Michal Hocko wrote:
>> On Wed 24-04-19 00:43:01, Yang Shi wrote:
>>> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for 
>>> each
>>> vma") introduced THPeligible bit for processes' smaps. But, when 
>>> checking
>>> the eligibility for shmem vma, __transparent_hugepage_enabled() is
>>> called to override the result from shmem_huge_enabled().  It may result
>>> in the anonymous vma's THP flag override shmem's.  For example, 
>>> running a
>>> simple test which create THP for shmem, but with anonymous THP 
>>> disabled,
>>> when reading the process's smaps, it may show:
>>>
>>> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
>>> Size:               4096 kB
>>> ...
>>> [snip]
>>> ...
>>> ShmemPmdMapped:     4096 kB
>>> ...
>>> [snip]
>>> ...
>>> THPeligible:    0
>>>
>>> And, /proc/meminfo does show THP allocated and PMD mapped too:
>>>
>>> ShmemHugePages:     4096 kB
>>> ShmemPmdMapped:     4096 kB
>>>
>>> This doesn't make too much sense.  The anonymous THP flag should not
>>> intervene shmem THP.  Calling shmem_huge_enabled() with checking
>>> MMF_DISABLE_THP sounds good enough.  And, we could skip stack and
>>> dax vma check since we already checked if the vma is shmem already.
>> Kirill, can we get a confirmation that this is really intended behavior
>> rather than an omission please? Is this documented? What is a global
>> knob to simply disable THP system wise?
>>
>> I have to say that the THP tuning API is one giant mess :/
>>
>> Btw. this patch also seem to fix khugepaged behavior because it 
>> previously
>> ignored both VM_NOHUGEPAGE and MMF_DISABLE_THP.

Second look shows this is not ignored. hugepage_vma_check() would check 
this for both anonymous vma and shmem vma before scanning. It is called 
before shmem_huge_enabled().

>
> Aha, I didn't notice this. It looks we need separate the patch to fix 
> that khugepaged problem for both 5.1-rc and LTS.
>
>>
>>> Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each 
>>> vma")
>>> Cc: Michal Hocko <mhocko@suse.com>
>>> Cc: Vlastimil Babka <vbabka@suse.cz>
>>> Cc: David Rientjes <rientjes@google.com>
>>> Cc: Kirill A. Shutemov <kirill@shutemov.name>
>>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>>> ---
>>> v2: Check VM_NOHUGEPAGE per Michal Hocko
>>>
>>>   mm/huge_memory.c | 4 ++--
>>>   mm/shmem.c       | 3 +++
>>>   2 files changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 165ea46..5881e82 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -67,8 +67,8 @@ bool transparent_hugepage_enabled(struct 
>>> vm_area_struct *vma)
>>>   {
>>>       if (vma_is_anonymous(vma))
>>>           return __transparent_hugepage_enabled(vma);
>>> -    if (vma_is_shmem(vma) && shmem_huge_enabled(vma))
>>> -        return __transparent_hugepage_enabled(vma);
>>> +    if (vma_is_shmem(vma))
>>> +        return shmem_huge_enabled(vma);
>>>         return false;
>>>   }
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 2275a0f..6f09a31 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -3873,6 +3873,9 @@ bool shmem_huge_enabled(struct vm_area_struct 
>>> *vma)
>>>       loff_t i_size;
>>>       pgoff_t off;
>>>   +    if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>>> +        test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>>> +        return false;
>>>       if (shmem_huge == SHMEM_HUGE_FORCE)
>>>           return true;
>>>       if (shmem_huge == SHMEM_HUGE_DENY)
>>> -- 
>>> 1.8.3.1
>>>
>
Michal Hocko April 24, 2019, 7:58 a.m. UTC | #4
On Tue 23-04-19 17:22:36, Yang Shi wrote:
> 
> 
> On 4/23/19 11:34 AM, Yang Shi wrote:
> > 
> > 
> > On 4/23/19 10:52 AM, Michal Hocko wrote:
> > > On Wed 24-04-19 00:43:01, Yang Shi wrote:
> > > > The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility
> > > > for each
> > > > vma") introduced THPeligible bit for processes' smaps. But, when
> > > > checking
> > > > the eligibility for shmem vma, __transparent_hugepage_enabled() is
> > > > called to override the result from shmem_huge_enabled().  It may result
> > > > in the anonymous vma's THP flag override shmem's.  For example,
> > > > running a
> > > > simple test which create THP for shmem, but with anonymous THP
> > > > disabled,
> > > > when reading the process's smaps, it may show:
> > > > 
> > > > 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
> > > > Size:               4096 kB
> > > > ...
> > > > [snip]
> > > > ...
> > > > ShmemPmdMapped:     4096 kB
> > > > ...
> > > > [snip]
> > > > ...
> > > > THPeligible:    0
> > > > 
> > > > And, /proc/meminfo does show THP allocated and PMD mapped too:
> > > > 
> > > > ShmemHugePages:     4096 kB
> > > > ShmemPmdMapped:     4096 kB
> > > > 
> > > > This doesn't make too much sense.  The anonymous THP flag should not
> > > > intervene shmem THP.  Calling shmem_huge_enabled() with checking
> > > > MMF_DISABLE_THP sounds good enough.  And, we could skip stack and
> > > > dax vma check since we already checked if the vma is shmem already.
> > > Kirill, can we get a confirmation that this is really intended behavior
> > > rather than an omission please? Is this documented? What is a global
> > > knob to simply disable THP system wise?
> > > 
> > > I have to say that the THP tuning API is one giant mess :/
> > > 
> > > Btw. this patch also seem to fix khugepaged behavior because it
> > > previously
> > > ignored both VM_NOHUGEPAGE and MMF_DISABLE_THP.
> 
> Second look shows this is not ignored. hugepage_vma_check() would check this
> for both anonymous vma and shmem vma before scanning. It is called before
> shmem_huge_enabled().

Right. I have missed the earlier check. The main question remains
though.
Vlastimil Babka April 24, 2019, 1:10 p.m. UTC | #5
On 4/23/19 6:43 PM, Yang Shi wrote:
> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each
> vma") introduced THPeligible bit for processes' smaps. But, when checking
> the eligibility for shmem vma, __transparent_hugepage_enabled() is
> called to override the result from shmem_huge_enabled().  It may result
> in the anonymous vma's THP flag override shmem's.  For example, running a
> simple test which create THP for shmem, but with anonymous THP disabled,
> when reading the process's smaps, it may show:
> 
> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
> Size:               4096 kB
> ...
> [snip]
> ...
> ShmemPmdMapped:     4096 kB

But how does this happen in the first place?
In __handle_mm_fault() we do:

        if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
                ret = create_huge_pmd(&vmf);
                if (!(ret & VM_FAULT_FALLBACK))
                        return ret;

And __transparent_hugepage_enabled() checks the global THP settings.
If THP is not enabled / is only for madvise and the vma is not madvised,
then this should fail, and also khugepaged shouldn't either run at all,
or don't do its job for such non-madvised vma.

What am I missing?

> ...
> [snip]
> ...
> THPeligible:    0
> 
> And, /proc/meminfo does show THP allocated and PMD mapped too:
> 
> ShmemHugePages:     4096 kB
> ShmemPmdMapped:     4096 kB
> 
> This doesn't make too much sense.  The anonymous THP flag should not
> intervene shmem THP.  Calling shmem_huge_enabled() with checking
> MMF_DISABLE_THP sounds good enough.  And, we could skip stack and
> dax vma check since we already checked if the vma is shmem already.
> 
> Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma")
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> ---
> v2: Check VM_NOHUGEPAGE per Michal Hocko
> 
>  mm/huge_memory.c | 4 ++--
>  mm/shmem.c       | 3 +++
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 165ea46..5881e82 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -67,8 +67,8 @@ bool transparent_hugepage_enabled(struct vm_area_struct *vma)
>  {
>  	if (vma_is_anonymous(vma))
>  		return __transparent_hugepage_enabled(vma);
> -	if (vma_is_shmem(vma) && shmem_huge_enabled(vma))
> -		return __transparent_hugepage_enabled(vma);
> +	if (vma_is_shmem(vma))
> +		return shmem_huge_enabled(vma);
>  
>  	return false;
>  }
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 2275a0f..6f09a31 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -3873,6 +3873,9 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
>  	loff_t i_size;
>  	pgoff_t off;
>  
> +	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
> +		return false;
>  	if (shmem_huge == SHMEM_HUGE_FORCE)
>  		return true;
>  	if (shmem_huge == SHMEM_HUGE_DENY)
>
Yang Shi April 24, 2019, 3:47 p.m. UTC | #6
On 4/24/19 6:10 AM, Vlastimil Babka wrote:
> On 4/23/19 6:43 PM, Yang Shi wrote:
>> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each
>> vma") introduced THPeligible bit for processes' smaps. But, when checking
>> the eligibility for shmem vma, __transparent_hugepage_enabled() is
>> called to override the result from shmem_huge_enabled().  It may result
>> in the anonymous vma's THP flag override shmem's.  For example, running a
>> simple test which create THP for shmem, but with anonymous THP disabled,
>> when reading the process's smaps, it may show:
>>
>> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
>> Size:               4096 kB
>> ...
>> [snip]
>> ...
>> ShmemPmdMapped:     4096 kB
> But how does this happen in the first place?
> In __handle_mm_fault() we do:
>
>          if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
>                  ret = create_huge_pmd(&vmf);
>                  if (!(ret & VM_FAULT_FALLBACK))
>                          return ret;
>
> And __transparent_hugepage_enabled() checks the global THP settings.
> If THP is not enabled / is only for madvise and the vma is not madvised,
> then this should fail, and also khugepaged shouldn't either run at all,
> or don't do its job for such non-madvised vma.

If __transparent_hugepage_enabled() returns false, the code will not 
reach create_huge_pmd() at all. If it returns true, create_huge_pmd() 
actually will return VM_FAULT_FALLBACK for shmem since shmem doesn't 
have huge_fault (or pmd_fault in earlier versions) method.

Then it will get into handle_pte_fault(), finally shmem_fault() is 
called, which allocates THP by checking some global flag (i.e. 
VM_NOHUGEPAGE and MMF_DISABLE_THP) and  shmem THP knobs.

4.8 (the first version has shmem THP merged) behaves exactly in the same 
way. So, I suspect this may be intended behavior.

>
> What am I missing?
>
>> ...
>> [snip]
>> ...
>> THPeligible:    0
>>
>> And, /proc/meminfo does show THP allocated and PMD mapped too:
>>
>> ShmemHugePages:     4096 kB
>> ShmemPmdMapped:     4096 kB
>>
>> This doesn't make too much sense.  The anonymous THP flag should not
>> intervene shmem THP.  Calling shmem_huge_enabled() with checking
>> MMF_DISABLE_THP sounds good enough.  And, we could skip stack and
>> dax vma check since we already checked if the vma is shmem already.
>>
>> Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma")
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Kirill A. Shutemov <kirill@shutemov.name>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>> ---
>> v2: Check VM_NOHUGEPAGE per Michal Hocko
>>
>>   mm/huge_memory.c | 4 ++--
>>   mm/shmem.c       | 3 +++
>>   2 files changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 165ea46..5881e82 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -67,8 +67,8 @@ bool transparent_hugepage_enabled(struct vm_area_struct *vma)
>>   {
>>   	if (vma_is_anonymous(vma))
>>   		return __transparent_hugepage_enabled(vma);
>> -	if (vma_is_shmem(vma) && shmem_huge_enabled(vma))
>> -		return __transparent_hugepage_enabled(vma);
>> +	if (vma_is_shmem(vma))
>> +		return shmem_huge_enabled(vma);
>>   
>>   	return false;
>>   }
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 2275a0f..6f09a31 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -3873,6 +3873,9 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
>>   	loff_t i_size;
>>   	pgoff_t off;
>>   
>> +	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>> +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>> +		return false;
>>   	if (shmem_huge == SHMEM_HUGE_FORCE)
>>   		return true;
>>   	if (shmem_huge == SHMEM_HUGE_DENY)
>>
Vlastimil Babka April 24, 2019, 4:17 p.m. UTC | #7
On 4/24/19 5:47 PM, Yang Shi wrote:
> 
> 
> On 4/24/19 6:10 AM, Vlastimil Babka wrote:
>> On 4/23/19 6:43 PM, Yang Shi wrote:
>>> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each
>>> vma") introduced THPeligible bit for processes' smaps. But, when checking
>>> the eligibility for shmem vma, __transparent_hugepage_enabled() is
>>> called to override the result from shmem_huge_enabled().  It may result
>>> in the anonymous vma's THP flag override shmem's.  For example, running a
>>> simple test which create THP for shmem, but with anonymous THP disabled,
>>> when reading the process's smaps, it may show:
>>>
>>> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
>>> Size:               4096 kB
>>> ...
>>> [snip]
>>> ...
>>> ShmemPmdMapped:     4096 kB
>> But how does this happen in the first place?
>> In __handle_mm_fault() we do:
>>
>>          if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
>>                  ret = create_huge_pmd(&vmf);
>>                  if (!(ret & VM_FAULT_FALLBACK))
>>                          return ret;
>>
>> And __transparent_hugepage_enabled() checks the global THP settings.
>> If THP is not enabled / is only for madvise and the vma is not madvised,
>> then this should fail, and also khugepaged shouldn't either run at all,
>> or don't do its job for such non-madvised vma.
> 
> If __transparent_hugepage_enabled() returns false, the code will not 
> reach create_huge_pmd() at all. If it returns true, create_huge_pmd() 
> actually will return VM_FAULT_FALLBACK for shmem since shmem doesn't 
> have huge_fault (or pmd_fault in earlier versions) method.
> 
> Then it will get into handle_pte_fault(), finally shmem_fault() is 
> called, which allocates THP by checking some global flag (i.e. 
> VM_NOHUGEPAGE and MMF_DISABLE_THP) and  shmem THP knobs.

Aha, thanks! What a mess...

> 
> 4.8 (the first version has shmem THP merged) behaves exactly in the same 
> way. So, I suspect this may be intended behavior.

Still looks like an oversight to me. And it's inconsistent... it might
fault huge shmem pages when THPs are globally disabled, but khugepaged
is still not running. I think it should just check the global THP flags
as well...
Yang Shi April 25, 2019, 4:44 p.m. UTC | #8
On 4/24/19 9:17 AM, Vlastimil Babka wrote:
> On 4/24/19 5:47 PM, Yang Shi wrote:
>>
>> On 4/24/19 6:10 AM, Vlastimil Babka wrote:
>>> On 4/23/19 6:43 PM, Yang Shi wrote:
>>>> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each
>>>> vma") introduced THPeligible bit for processes' smaps. But, when checking
>>>> the eligibility for shmem vma, __transparent_hugepage_enabled() is
>>>> called to override the result from shmem_huge_enabled().  It may result
>>>> in the anonymous vma's THP flag override shmem's.  For example, running a
>>>> simple test which create THP for shmem, but with anonymous THP disabled,
>>>> when reading the process's smaps, it may show:
>>>>
>>>> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
>>>> Size:               4096 kB
>>>> ...
>>>> [snip]
>>>> ...
>>>> ShmemPmdMapped:     4096 kB
>>> But how does this happen in the first place?
>>> In __handle_mm_fault() we do:
>>>
>>>           if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
>>>                   ret = create_huge_pmd(&vmf);
>>>                   if (!(ret & VM_FAULT_FALLBACK))
>>>                           return ret;
>>>
>>> And __transparent_hugepage_enabled() checks the global THP settings.
>>> If THP is not enabled / is only for madvise and the vma is not madvised,
>>> then this should fail, and also khugepaged shouldn't either run at all,
>>> or don't do its job for such non-madvised vma.
>> If __transparent_hugepage_enabled() returns false, the code will not
>> reach create_huge_pmd() at all. If it returns true, create_huge_pmd()
>> actually will return VM_FAULT_FALLBACK for shmem since shmem doesn't
>> have huge_fault (or pmd_fault in earlier versions) method.
>>
>> Then it will get into handle_pte_fault(), finally shmem_fault() is
>> called, which allocates THP by checking some global flag (i.e.
>> VM_NOHUGEPAGE and MMF_DISABLE_THP) and  shmem THP knobs.
> Aha, thanks! What a mess...

Yes, it does look convoluted. I'm wondering we may consider refactor the 
shmem THP fault handling.

>
>> 4.8 (the first version has shmem THP merged) behaves exactly in the same
>> way. So, I suspect this may be intended behavior.
> Still looks like an oversight to me. And it's inconsistent... it might
> fault huge shmem pages when THPs are globally disabled, but khugepaged
> is still not running. I think it should just check the global THP flags
> as well...

It does looks inconsistent, particularly for the khugepaged part.
Yang Shi April 28, 2019, 7:13 p.m. UTC | #9
On 4/23/19 10:52 AM, Michal Hocko wrote:
> On Wed 24-04-19 00:43:01, Yang Shi wrote:
>> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each
>> vma") introduced THPeligible bit for processes' smaps. But, when checking
>> the eligibility for shmem vma, __transparent_hugepage_enabled() is
>> called to override the result from shmem_huge_enabled().  It may result
>> in the anonymous vma's THP flag override shmem's.  For example, running a
>> simple test which create THP for shmem, but with anonymous THP disabled,
>> when reading the process's smaps, it may show:
>>
>> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
>> Size:               4096 kB
>> ...
>> [snip]
>> ...
>> ShmemPmdMapped:     4096 kB
>> ...
>> [snip]
>> ...
>> THPeligible:    0
>>
>> And, /proc/meminfo does show THP allocated and PMD mapped too:
>>
>> ShmemHugePages:     4096 kB
>> ShmemPmdMapped:     4096 kB
>>
>> This doesn't make too much sense.  The anonymous THP flag should not
>> intervene shmem THP.  Calling shmem_huge_enabled() with checking
>> MMF_DISABLE_THP sounds good enough.  And, we could skip stack and
>> dax vma check since we already checked if the vma is shmem already.
> Kirill, can we get a confirmation that this is really intended behavior
> rather than an omission please? Is this documented? What is a global
> knob to simply disable THP system wise?

Hi Kirill,

Ping. Any comment?

Thanks,
Yang

>
> I have to say that the THP tuning API is one giant mess :/
>
> Btw. this patch also seem to fix khugepaged behavior because it previously
> ignored both VM_NOHUGEPAGE and MMF_DISABLE_THP.
>
>> Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma")
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Kirill A. Shutemov <kirill@shutemov.name>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>> ---
>> v2: Check VM_NOHUGEPAGE per Michal Hocko
>>
>>   mm/huge_memory.c | 4 ++--
>>   mm/shmem.c       | 3 +++
>>   2 files changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 165ea46..5881e82 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -67,8 +67,8 @@ bool transparent_hugepage_enabled(struct vm_area_struct *vma)
>>   {
>>   	if (vma_is_anonymous(vma))
>>   		return __transparent_hugepage_enabled(vma);
>> -	if (vma_is_shmem(vma) && shmem_huge_enabled(vma))
>> -		return __transparent_hugepage_enabled(vma);
>> +	if (vma_is_shmem(vma))
>> +		return shmem_huge_enabled(vma);
>>   
>>   	return false;
>>   }
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 2275a0f..6f09a31 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -3873,6 +3873,9 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
>>   	loff_t i_size;
>>   	pgoff_t off;
>>   
>> +	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>> +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>> +		return false;
>>   	if (shmem_huge == SHMEM_HUGE_FORCE)
>>   		return true;
>>   	if (shmem_huge == SHMEM_HUGE_DENY)
>> -- 
>> 1.8.3.1
>>
Yang Shi May 6, 2019, 11:37 p.m. UTC | #10
On 4/28/19 12:13 PM, Yang Shi wrote:
>
>
> On 4/23/19 10:52 AM, Michal Hocko wrote:
>> On Wed 24-04-19 00:43:01, Yang Shi wrote:
>>> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for 
>>> each
>>> vma") introduced THPeligible bit for processes' smaps. But, when 
>>> checking
>>> the eligibility for shmem vma, __transparent_hugepage_enabled() is
>>> called to override the result from shmem_huge_enabled().  It may result
>>> in the anonymous vma's THP flag override shmem's.  For example, 
>>> running a
>>> simple test which create THP for shmem, but with anonymous THP 
>>> disabled,
>>> when reading the process's smaps, it may show:
>>>
>>> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
>>> Size:               4096 kB
>>> ...
>>> [snip]
>>> ...
>>> ShmemPmdMapped:     4096 kB
>>> ...
>>> [snip]
>>> ...
>>> THPeligible:    0
>>>
>>> And, /proc/meminfo does show THP allocated and PMD mapped too:
>>>
>>> ShmemHugePages:     4096 kB
>>> ShmemPmdMapped:     4096 kB
>>>
>>> This doesn't make too much sense.  The anonymous THP flag should not
>>> intervene shmem THP.  Calling shmem_huge_enabled() with checking
>>> MMF_DISABLE_THP sounds good enough.  And, we could skip stack and
>>> dax vma check since we already checked if the vma is shmem already.
>> Kirill, can we get a confirmation that this is really intended behavior
>> rather than an omission please? Is this documented? What is a global
>> knob to simply disable THP system wise?
>
> Hi Kirill,
>
> Ping. Any comment?

Talked with Kirill at LSFMM, it sounds this is kind of intended behavior 
according to him. But, we all agree it looks inconsistent.

So, we may have two options:
     - Just fix the false negative issue as what the patch does
     - Change the behavior to make it more consistent

I'm not sure whether anyone relies on the behavior explicitly or 
implicitly or not.

If we would like to change the behavior, I may consider to take a step 
further to refactor the code a little bit to use huge_fault() to handle 
THP fault instead of falling back to handle_pte_fault() in the current 
implementation. This may make adding THP for other filesystems easier.

>
> Thanks,
> Yang
>
>>
>> I have to say that the THP tuning API is one giant mess :/
>>
>> Btw. this patch also seem to fix khugepaged behavior because it 
>> previously
>> ignored both VM_NOHUGEPAGE and MMF_DISABLE_THP.
>>
>>> Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each 
>>> vma")
>>> Cc: Michal Hocko <mhocko@suse.com>
>>> Cc: Vlastimil Babka <vbabka@suse.cz>
>>> Cc: David Rientjes <rientjes@google.com>
>>> Cc: Kirill A. Shutemov <kirill@shutemov.name>
>>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>>> ---
>>> v2: Check VM_NOHUGEPAGE per Michal Hocko
>>>
>>>   mm/huge_memory.c | 4 ++--
>>>   mm/shmem.c       | 3 +++
>>>   2 files changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 165ea46..5881e82 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -67,8 +67,8 @@ bool transparent_hugepage_enabled(struct 
>>> vm_area_struct *vma)
>>>   {
>>>       if (vma_is_anonymous(vma))
>>>           return __transparent_hugepage_enabled(vma);
>>> -    if (vma_is_shmem(vma) && shmem_huge_enabled(vma))
>>> -        return __transparent_hugepage_enabled(vma);
>>> +    if (vma_is_shmem(vma))
>>> +        return shmem_huge_enabled(vma);
>>>         return false;
>>>   }
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 2275a0f..6f09a31 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -3873,6 +3873,9 @@ bool shmem_huge_enabled(struct vm_area_struct 
>>> *vma)
>>>       loff_t i_size;
>>>       pgoff_t off;
>>>   +    if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>>> +        test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>>> +        return false;
>>>       if (shmem_huge == SHMEM_HUGE_FORCE)
>>>           return true;
>>>       if (shmem_huge == SHMEM_HUGE_DENY)
>>> -- 
>>> 1.8.3.1
>>>
>
Michal Hocko May 7, 2019, 10:47 a.m. UTC | #11
[Hmm, I thought, Hugh was CCed]

On Mon 06-05-19 16:37:42, Yang Shi wrote:
> 
> 
> On 4/28/19 12:13 PM, Yang Shi wrote:
> > 
> > 
> > On 4/23/19 10:52 AM, Michal Hocko wrote:
> > > On Wed 24-04-19 00:43:01, Yang Shi wrote:
> > > > The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility
> > > > for each
> > > > vma") introduced THPeligible bit for processes' smaps. But, when
> > > > checking
> > > > the eligibility for shmem vma, __transparent_hugepage_enabled() is
> > > > called to override the result from shmem_huge_enabled().  It may result
> > > > in the anonymous vma's THP flag override shmem's.  For example,
> > > > running a
> > > > simple test which create THP for shmem, but with anonymous THP
> > > > disabled,
> > > > when reading the process's smaps, it may show:
> > > > 
> > > > 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
> > > > Size:               4096 kB
> > > > ...
> > > > [snip]
> > > > ...
> > > > ShmemPmdMapped:     4096 kB
> > > > ...
> > > > [snip]
> > > > ...
> > > > THPeligible:    0
> > > > 
> > > > And, /proc/meminfo does show THP allocated and PMD mapped too:
> > > > 
> > > > ShmemHugePages:     4096 kB
> > > > ShmemPmdMapped:     4096 kB
> > > > 
> > > > This doesn't make too much sense.  The anonymous THP flag should not
> > > > intervene shmem THP.  Calling shmem_huge_enabled() with checking
> > > > MMF_DISABLE_THP sounds good enough.  And, we could skip stack and
> > > > dax vma check since we already checked if the vma is shmem already.
> > > Kirill, can we get a confirmation that this is really intended behavior
> > > rather than an omission please? Is this documented? What is a global
> > > knob to simply disable THP system wise?
> > 
> > Hi Kirill,
> > 
> > Ping. Any comment?
> 
> Talked with Kirill at LSFMM, it sounds this is kind of intended behavior
> according to him. But, we all agree it looks inconsistent.
> 
> So, we may have two options:
>     - Just fix the false negative issue as what the patch does
>     - Change the behavior to make it more consistent
> 
> I'm not sure whether anyone relies on the behavior explicitly or implicitly
> or not.

Well, I would be certainly more happy with a more consistent behavior.
Talked to Hugh at LSFMM about this and he finds treating shmem objects
separately from the anonymous memory. And that is already the case
partially when each mount point might have its own setup. So the primary
question is whether we need a one global knob to controll all THP
allocations. One argument to have that is that it might be helpful to
for an admin to simply disable source of THP at a single place rather
than crawling over all shmem mount points and remount them. Especially
in environments where shmem points are mounted in a container by a
non-root. Why would somebody wanted something like that? One example
would be to temporarily workaround high order allocations issues which
we have seen non trivial amount of in the past and we are likely not at
the end of the tunel.

That being said I would be in favor of treating the global sysfs knob to
be global for all THP allocations. I will not push back on that if there
is a general consensus that shmem and fs in general are a different
class of objects and a single global control is not desirable for
whatever reasons.

Kirill, Hugh othe folks?
Yang Shi May 7, 2019, 5:10 p.m. UTC | #12
On 5/7/19 3:47 AM, Michal Hocko wrote:
> [Hmm, I thought, Hugh was CCed]
>
> On Mon 06-05-19 16:37:42, Yang Shi wrote:
>>
>> On 4/28/19 12:13 PM, Yang Shi wrote:
>>>
>>> On 4/23/19 10:52 AM, Michal Hocko wrote:
>>>> On Wed 24-04-19 00:43:01, Yang Shi wrote:
>>>>> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility
>>>>> for each
>>>>> vma") introduced THPeligible bit for processes' smaps. But, when
>>>>> checking
>>>>> the eligibility for shmem vma, __transparent_hugepage_enabled() is
>>>>> called to override the result from shmem_huge_enabled().  It may result
>>>>> in the anonymous vma's THP flag override shmem's.  For example,
>>>>> running a
>>>>> simple test which create THP for shmem, but with anonymous THP
>>>>> disabled,
>>>>> when reading the process's smaps, it may show:
>>>>>
>>>>> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
>>>>> Size:               4096 kB
>>>>> ...
>>>>> [snip]
>>>>> ...
>>>>> ShmemPmdMapped:     4096 kB
>>>>> ...
>>>>> [snip]
>>>>> ...
>>>>> THPeligible:    0
>>>>>
>>>>> And, /proc/meminfo does show THP allocated and PMD mapped too:
>>>>>
>>>>> ShmemHugePages:     4096 kB
>>>>> ShmemPmdMapped:     4096 kB
>>>>>
>>>>> This doesn't make too much sense.  The anonymous THP flag should not
>>>>> intervene shmem THP.  Calling shmem_huge_enabled() with checking
>>>>> MMF_DISABLE_THP sounds good enough.  And, we could skip stack and
>>>>> dax vma check since we already checked if the vma is shmem already.
>>>> Kirill, can we get a confirmation that this is really intended behavior
>>>> rather than an omission please? Is this documented? What is a global
>>>> knob to simply disable THP system wise?
>>> Hi Kirill,
>>>
>>> Ping. Any comment?
>> Talked with Kirill at LSFMM, it sounds this is kind of intended behavior
>> according to him. But, we all agree it looks inconsistent.
>>
>> So, we may have two options:
>>      - Just fix the false negative issue as what the patch does
>>      - Change the behavior to make it more consistent
>>
>> I'm not sure whether anyone relies on the behavior explicitly or implicitly
>> or not.
> Well, I would be certainly more happy with a more consistent behavior.
> Talked to Hugh at LSFMM about this and he finds treating shmem objects
> separately from the anonymous memory. And that is already the case
> partially when each mount point might have its own setup. So the primary
> question is whether we need a one global knob to controll all THP
> allocations. One argument to have that is that it might be helpful to
> for an admin to simply disable source of THP at a single place rather
> than crawling over all shmem mount points and remount them. Especially
> in environments where shmem points are mounted in a container by a
> non-root. Why would somebody wanted something like that? One example
> would be to temporarily workaround high order allocations issues which
> we have seen non trivial amount of in the past and we are likely not at
> the end of the tunel.

Shmem has a global control for such use. Setting shmem_enabled to 
"force" or "deny" would enable or disable THP for shmem globally, 
including non-fs objects, i.e. memfd, SYS V shmem, etc.

>
> That being said I would be in favor of treating the global sysfs knob to
> be global for all THP allocations. I will not push back on that if there
> is a general consensus that shmem and fs in general are a different
> class of objects and a single global control is not desirable for
> whatever reasons.

OK, we need more inputs from Kirill, Hugh and other folks.

>
> Kirill, Hugh othe folks?
Yang Shi June 6, 2019, 6:59 p.m. UTC | #13
On 5/7/19 10:10 AM, Yang Shi wrote:
>
>
> On 5/7/19 3:47 AM, Michal Hocko wrote:
>> [Hmm, I thought, Hugh was CCed]
>>
>> On Mon 06-05-19 16:37:42, Yang Shi wrote:
>>>
>>> On 4/28/19 12:13 PM, Yang Shi wrote:
>>>>
>>>> On 4/23/19 10:52 AM, Michal Hocko wrote:
>>>>> On Wed 24-04-19 00:43:01, Yang Shi wrote:
>>>>>> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility
>>>>>> for each
>>>>>> vma") introduced THPeligible bit for processes' smaps. But, when
>>>>>> checking
>>>>>> the eligibility for shmem vma, __transparent_hugepage_enabled() is
>>>>>> called to override the result from shmem_huge_enabled().  It may 
>>>>>> result
>>>>>> in the anonymous vma's THP flag override shmem's.  For example,
>>>>>> running a
>>>>>> simple test which create THP for shmem, but with anonymous THP
>>>>>> disabled,
>>>>>> when reading the process's smaps, it may show:
>>>>>>
>>>>>> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
>>>>>> Size:               4096 kB
>>>>>> ...
>>>>>> [snip]
>>>>>> ...
>>>>>> ShmemPmdMapped:     4096 kB
>>>>>> ...
>>>>>> [snip]
>>>>>> ...
>>>>>> THPeligible:    0
>>>>>>
>>>>>> And, /proc/meminfo does show THP allocated and PMD mapped too:
>>>>>>
>>>>>> ShmemHugePages:     4096 kB
>>>>>> ShmemPmdMapped:     4096 kB
>>>>>>
>>>>>> This doesn't make too much sense.  The anonymous THP flag should not
>>>>>> intervene shmem THP.  Calling shmem_huge_enabled() with checking
>>>>>> MMF_DISABLE_THP sounds good enough.  And, we could skip stack and
>>>>>> dax vma check since we already checked if the vma is shmem already.
>>>>> Kirill, can we get a confirmation that this is really intended 
>>>>> behavior
>>>>> rather than an omission please? Is this documented? What is a global
>>>>> knob to simply disable THP system wise?
>>>> Hi Kirill,
>>>>
>>>> Ping. Any comment?
>>> Talked with Kirill at LSFMM, it sounds this is kind of intended 
>>> behavior
>>> according to him. But, we all agree it looks inconsistent.
>>>
>>> So, we may have two options:
>>>      - Just fix the false negative issue as what the patch does
>>>      - Change the behavior to make it more consistent
>>>
>>> I'm not sure whether anyone relies on the behavior explicitly or 
>>> implicitly
>>> or not.
>> Well, I would be certainly more happy with a more consistent behavior.
>> Talked to Hugh at LSFMM about this and he finds treating shmem objects
>> separately from the anonymous memory. And that is already the case
>> partially when each mount point might have its own setup. So the primary
>> question is whether we need a one global knob to controll all THP
>> allocations. One argument to have that is that it might be helpful to
>> for an admin to simply disable source of THP at a single place rather
>> than crawling over all shmem mount points and remount them. Especially
>> in environments where shmem points are mounted in a container by a
>> non-root. Why would somebody wanted something like that? One example
>> would be to temporarily workaround high order allocations issues which
>> we have seen non trivial amount of in the past and we are likely not at
>> the end of the tunel.
>
> Shmem has a global control for such use. Setting shmem_enabled to 
> "force" or "deny" would enable or disable THP for shmem globally, 
> including non-fs objects, i.e. memfd, SYS V shmem, etc.
>
>>
>> That being said I would be in favor of treating the global sysfs knob to
>> be global for all THP allocations. I will not push back on that if there
>> is a general consensus that shmem and fs in general are a different
>> class of objects and a single global control is not desirable for
>> whatever reasons.
>
> OK, we need more inputs from Kirill, Hugh and other folks.

[Forgot cc to mailing lists]

Hi guys,

How should we move forward for this one? Make the sysfs knob 
(/sys/kernel/mm/transparent_hugepage/enabled) to be global for both 
anonymous and tmpfs? Or just treat shmem objects separately from anon 
memory then fix the false-negative of THP eligibility by this patch?

>
>>
>> Kirill, Hugh othe folks?
>
Hugh Dickins June 7, 2019, 10:57 a.m. UTC | #14
On Thu, 6 Jun 2019, Yang Shi wrote:
> On 5/7/19 10:10 AM, Yang Shi wrote:
> > On 5/7/19 3:47 AM, Michal Hocko wrote:
> > > [Hmm, I thought, Hugh was CCed]
> > > 
> > > On Mon 06-05-19 16:37:42, Yang Shi wrote:
> > > > 
> > > > On 4/28/19 12:13 PM, Yang Shi wrote:
> > > > > 
> > > > > On 4/23/19 10:52 AM, Michal Hocko wrote:
> > > > > > On Wed 24-04-19 00:43:01, Yang Shi wrote:
> > > > > > > The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility
> > > > > > > for each
> > > > > > > vma") introduced THPeligible bit for processes' smaps. But, when
> > > > > > > checking
> > > > > > > the eligibility for shmem vma, __transparent_hugepage_enabled()
> > > > > > > is
> > > > > > > called to override the result from shmem_huge_enabled().  It may
> > > > > > > result
> > > > > > > in the anonymous vma's THP flag override shmem's.  For example,
> > > > > > > running a
> > > > > > > simple test which create THP for shmem, but with anonymous THP
> > > > > > > disabled,
> > > > > > > when reading the process's smaps, it may show:
> > > > > > > 
> > > > > > > 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
> > > > > > > Size:               4096 kB
> > > > > > > ...
> > > > > > > [snip]
> > > > > > > ...
> > > > > > > ShmemPmdMapped:     4096 kB
> > > > > > > ...
> > > > > > > [snip]
> > > > > > > ...
> > > > > > > THPeligible:    0
> > > > > > > 
> > > > > > > And, /proc/meminfo does show THP allocated and PMD mapped too:
> > > > > > > 
> > > > > > > ShmemHugePages:     4096 kB
> > > > > > > ShmemPmdMapped:     4096 kB
> > > > > > > 
> > > > > > > This doesn't make too much sense.  The anonymous THP flag should
> > > > > > > not
> > > > > > > intervene shmem THP.  Calling shmem_huge_enabled() with checking
> > > > > > > MMF_DISABLE_THP sounds good enough.  And, we could skip stack and
> > > > > > > dax vma check since we already checked if the vma is shmem
> > > > > > > already.
> > > > > > Kirill, can we get a confirmation that this is really intended
> > > > > > behavior
> > > > > > rather than an omission please? Is this documented? What is a
> > > > > > global
> > > > > > knob to simply disable THP system wise?
> > > > > Hi Kirill,
> > > > > 
> > > > > Ping. Any comment?
> > > > Talked with Kirill at LSFMM, it sounds this is kind of intended
> > > > behavior
> > > > according to him. But, we all agree it looks inconsistent.
> > > > 
> > > > So, we may have two options:
> > > >      - Just fix the false negative issue as what the patch does
> > > >      - Change the behavior to make it more consistent
> > > > 
> > > > I'm not sure whether anyone relies on the behavior explicitly or
> > > > implicitly
> > > > or not.
> > > Well, I would be certainly more happy with a more consistent behavior.
> > > Talked to Hugh at LSFMM about this and he finds treating shmem objects
> > > separately from the anonymous memory. And that is already the case
> > > partially when each mount point might have its own setup. So the primary
> > > question is whether we need a one global knob to controll all THP
> > > allocations. One argument to have that is that it might be helpful to
> > > for an admin to simply disable source of THP at a single place rather
> > > than crawling over all shmem mount points and remount them. Especially
> > > in environments where shmem points are mounted in a container by a
> > > non-root. Why would somebody wanted something like that? One example
> > > would be to temporarily workaround high order allocations issues which
> > > we have seen non trivial amount of in the past and we are likely not at
> > > the end of the tunel.
> > 
> > Shmem has a global control for such use. Setting shmem_enabled to "force"
> > or "deny" would enable or disable THP for shmem globally, including non-fs
> > objects, i.e. memfd, SYS V shmem, etc.
> > 
> > > 
> > > That being said I would be in favor of treating the global sysfs knob to
> > > be global for all THP allocations. I will not push back on that if there
> > > is a general consensus that shmem and fs in general are a different
> > > class of objects and a single global control is not desirable for
> > > whatever reasons.
> > 
> > OK, we need more inputs from Kirill, Hugh and other folks.
> 
> [Forgot cc to mailing lists]
> 
> Hi guys,
> 
> How should we move forward for this one? Make the sysfs knob
> (/sys/kernel/mm/transparent_hugepage/enabled) to be global for both anonymous
> and tmpfs? Or just treat shmem objects separately from anon memory then fix
> the false-negative of THP eligibility by this patch?

Sorry for not getting back to you sooner on this.

I don't like to drive design by smaps. I agree with the word "mess" used
several times of THP tunings in this thread, but it's too easy to make
that mess worse by unnecessary changes, so I'm very cautious here.

The addition of "THPeligible" without an "Anon" in its name was
unfortunate. I suppose we're two releases too late to change that.

Applying process (PR_SET_THP_DISABLE) and mm (MADV_*HUGEPAGE)
limitations to shared filesystem objects doesn't work all that well.

I recommend that you continue to treat shmem objects separately from
anon memory, and just make the smaps "THPeligible" more often accurate.

Is your v2 patch earlier in this thread the best for that?
No answer tonight, I'll re-examine later in the day.

Hugh
Michal Hocko June 7, 2019, 2:25 p.m. UTC | #15
On Fri 07-06-19 03:57:18, Hugh Dickins wrote:
[...]
> The addition of "THPeligible" without an "Anon" in its name was
> unfortunate. I suppose we're two releases too late to change that.

Well, I do not really see any reason why THPeligible should be Anon
specific at all. Even if ...

> Applying process (PR_SET_THP_DISABLE) and mm (MADV_*HUGEPAGE)
> limitations to shared filesystem objects doesn't work all that well.

... this is what we are going with then it is really important to have a
single place to query the eligibility IMHO.

> I recommend that you continue to treat shmem objects separately from
> anon memory, and just make the smaps "THPeligible" more often accurate.

Agreed on this.
Yang Shi June 7, 2019, 6:51 p.m. UTC | #16
On 6/7/19 3:57 AM, Hugh Dickins wrote:
> On Thu, 6 Jun 2019, Yang Shi wrote:
>> On 5/7/19 10:10 AM, Yang Shi wrote:
>>> On 5/7/19 3:47 AM, Michal Hocko wrote:
>>>> [Hmm, I thought, Hugh was CCed]
>>>>
>>>> On Mon 06-05-19 16:37:42, Yang Shi wrote:
>>>>> On 4/28/19 12:13 PM, Yang Shi wrote:
>>>>>> On 4/23/19 10:52 AM, Michal Hocko wrote:
>>>>>>> On Wed 24-04-19 00:43:01, Yang Shi wrote:
>>>>>>>> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility
>>>>>>>> for each
>>>>>>>> vma") introduced THPeligible bit for processes' smaps. But, when
>>>>>>>> checking
>>>>>>>> the eligibility for shmem vma, __transparent_hugepage_enabled()
>>>>>>>> is
>>>>>>>> called to override the result from shmem_huge_enabled().  It may
>>>>>>>> result
>>>>>>>> in the anonymous vma's THP flag override shmem's.  For example,
>>>>>>>> running a
>>>>>>>> simple test which create THP for shmem, but with anonymous THP
>>>>>>>> disabled,
>>>>>>>> when reading the process's smaps, it may show:
>>>>>>>>
>>>>>>>> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
>>>>>>>> Size:               4096 kB
>>>>>>>> ...
>>>>>>>> [snip]
>>>>>>>> ...
>>>>>>>> ShmemPmdMapped:     4096 kB
>>>>>>>> ...
>>>>>>>> [snip]
>>>>>>>> ...
>>>>>>>> THPeligible:    0
>>>>>>>>
>>>>>>>> And, /proc/meminfo does show THP allocated and PMD mapped too:
>>>>>>>>
>>>>>>>> ShmemHugePages:     4096 kB
>>>>>>>> ShmemPmdMapped:     4096 kB
>>>>>>>>
>>>>>>>> This doesn't make too much sense.  The anonymous THP flag should
>>>>>>>> not
>>>>>>>> intervene shmem THP.  Calling shmem_huge_enabled() with checking
>>>>>>>> MMF_DISABLE_THP sounds good enough.  And, we could skip stack and
>>>>>>>> dax vma check since we already checked if the vma is shmem
>>>>>>>> already.
>>>>>>> Kirill, can we get a confirmation that this is really intended
>>>>>>> behavior
>>>>>>> rather than an omission please? Is this documented? What is a
>>>>>>> global
>>>>>>> knob to simply disable THP system wise?
>>>>>> Hi Kirill,
>>>>>>
>>>>>> Ping. Any comment?
>>>>> Talked with Kirill at LSFMM, it sounds this is kind of intended
>>>>> behavior
>>>>> according to him. But, we all agree it looks inconsistent.
>>>>>
>>>>> So, we may have two options:
>>>>>       - Just fix the false negative issue as what the patch does
>>>>>       - Change the behavior to make it more consistent
>>>>>
>>>>> I'm not sure whether anyone relies on the behavior explicitly or
>>>>> implicitly
>>>>> or not.
>>>> Well, I would be certainly more happy with a more consistent behavior.
>>>> Talked to Hugh at LSFMM about this and he finds treating shmem objects
>>>> separately from the anonymous memory. And that is already the case
>>>> partially when each mount point might have its own setup. So the primary
>>>> question is whether we need a one global knob to controll all THP
>>>> allocations. One argument to have that is that it might be helpful to
>>>> for an admin to simply disable source of THP at a single place rather
>>>> than crawling over all shmem mount points and remount them. Especially
>>>> in environments where shmem points are mounted in a container by a
>>>> non-root. Why would somebody wanted something like that? One example
>>>> would be to temporarily workaround high order allocations issues which
>>>> we have seen non trivial amount of in the past and we are likely not at
>>>> the end of the tunel.
>>> Shmem has a global control for such use. Setting shmem_enabled to "force"
>>> or "deny" would enable or disable THP for shmem globally, including non-fs
>>> objects, i.e. memfd, SYS V shmem, etc.
>>>
>>>> That being said I would be in favor of treating the global sysfs knob to
>>>> be global for all THP allocations. I will not push back on that if there
>>>> is a general consensus that shmem and fs in general are a different
>>>> class of objects and a single global control is not desirable for
>>>> whatever reasons.
>>> OK, we need more inputs from Kirill, Hugh and other folks.
>> [Forgot cc to mailing lists]
>>
>> Hi guys,
>>
>> How should we move forward for this one? Make the sysfs knob
>> (/sys/kernel/mm/transparent_hugepage/enabled) to be global for both anonymous
>> and tmpfs? Or just treat shmem objects separately from anon memory then fix
>> the false-negative of THP eligibility by this patch?
> Sorry for not getting back to you sooner on this.
>
> I don't like to drive design by smaps. I agree with the word "mess" used
> several times of THP tunings in this thread, but it's too easy to make
> that mess worse by unnecessary changes, so I'm very cautious here.
>
> The addition of "THPeligible" without an "Anon" in its name was
> unfortunate. I suppose we're two releases too late to change that.

The smaps shows it is anon vma or shmem vma for the most cases.

>
> Applying process (PR_SET_THP_DISABLE) and mm (MADV_*HUGEPAGE)
> limitations to shared filesystem objects doesn't work all that well.

The THP eligibility indicator is per vma, it just reports whether THP is 
eligible for a specific vma. So, I'm supposed it should keep consistent 
with MMF_DISABLE_THP and MADV_*HUGEPAGE setting.

The current implementation in shmem and kuhugepaged also checks these.

>
> I recommend that you continue to treat shmem objects separately from
> anon memory, and just make the smaps "THPeligible" more often accurate.
>
> Is your v2 patch earlier in this thread the best for that?

The v2 patch treats shmem objects separately from anon memory and it 
makes the "THPeligible" more often accurate.

> No answer tonight, I'll re-examine later in the day.
>
> Hugh
Hugh Dickins June 8, 2019, 3:58 a.m. UTC | #17
On Wed, 24 Apr 2019, Yang Shi wrote:

> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each
> vma") introduced THPeligible bit for processes' smaps. But, when checking
> the eligibility for shmem vma, __transparent_hugepage_enabled() is
> called to override the result from shmem_huge_enabled().  It may result
> in the anonymous vma's THP flag override shmem's.  For example, running a
> simple test which create THP for shmem, but with anonymous THP disabled,
> when reading the process's smaps, it may show:
> 
> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
> Size:               4096 kB
> ...
> [snip]
> ...
> ShmemPmdMapped:     4096 kB
> ...
> [snip]
> ...
> THPeligible:    0
> 
> And, /proc/meminfo does show THP allocated and PMD mapped too:
> 
> ShmemHugePages:     4096 kB
> ShmemPmdMapped:     4096 kB
> 
> This doesn't make too much sense.  The anonymous THP flag should not
> intervene shmem THP.  Calling shmem_huge_enabled() with checking
> MMF_DISABLE_THP sounds good enough.  And, we could skip stack and
> dax vma check since we already checked if the vma is shmem already.
> 
> Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma")
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Kirill A. Shutemov <kirill@shutemov.name>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> ---
> v2: Check VM_NOHUGEPAGE per Michal Hocko
> 
>  mm/huge_memory.c | 4 ++--
>  mm/shmem.c       | 3 +++
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 165ea46..5881e82 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -67,8 +67,8 @@ bool transparent_hugepage_enabled(struct vm_area_struct *vma)
>  {
>  	if (vma_is_anonymous(vma))
>  		return __transparent_hugepage_enabled(vma);
> -	if (vma_is_shmem(vma) && shmem_huge_enabled(vma))
> -		return __transparent_hugepage_enabled(vma);
> +	if (vma_is_shmem(vma))
> +		return shmem_huge_enabled(vma);
>  
>  	return false;
>  }
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 2275a0f..6f09a31 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -3873,6 +3873,9 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
>  	loff_t i_size;
>  	pgoff_t off;
>  
> +	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
> +		return false;

Yes, that is correct; and correctly placed. But a little more is needed:
see how mm/memory.c's transhuge_vma_suitable() will only allow a pmd to
be used instead of a pte if the vma offset and size permit. smaps should
not report a shmem vma as THPeligible if its offset or size prevent it.

And I see that should also be fixed on anon vmas: at present smaps
reports even a 4kB anon vma as THPeligible, which is not right.
Maybe a test like transhuge_vma_suitable() can be added into
transparent_hugepage_enabled(), to handle anon and shmem together.
I say "like transhuge_vma_suitable()", because that function needs
an address, which here you don't have.

The anon offset situation is interesting: usually anon vm_pgoff is
initialized to fit with its vm_start, so the anon offset check passes;
but I wonder what happens after mremap to a different address - does
transhuge_vma_suitable() then prevent the use of pmds where they could
actually be used? Not a Number#1 priority to investigate or fix here!
but a curiosity someone might want to look into.

>  	if (shmem_huge == SHMEM_HUGE_FORCE)
>  		return true;
>  	if (shmem_huge == SHMEM_HUGE_DENY)
> -- 
> 1.8.3.1


Even with your changes
ShmemPmdMapped:     4096 kB
THPeligible:    0
will easily be seen: THPeligible reflects whether a huge page can be
allocated and mapped by pmd in that vma; but if something else already
allocated the huge page earlier, it will be mapped by pmd in this vma
if offset and size allow, whatever THPeligible says. We could change
transhuge_vma_suitable() to force ptes in that case, but it would be
a silly change, just to make what smaps shows easier to explain.

Hugh
Yang Shi June 10, 2019, 5:33 p.m. UTC | #18
On 6/7/19 8:58 PM, Hugh Dickins wrote:
> On Wed, 24 Apr 2019, Yang Shi wrote:
>
>> The commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each
>> vma") introduced THPeligible bit for processes' smaps. But, when checking
>> the eligibility for shmem vma, __transparent_hugepage_enabled() is
>> called to override the result from shmem_huge_enabled().  It may result
>> in the anonymous vma's THP flag override shmem's.  For example, running a
>> simple test which create THP for shmem, but with anonymous THP disabled,
>> when reading the process's smaps, it may show:
>>
>> 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
>> Size:               4096 kB
>> ...
>> [snip]
>> ...
>> ShmemPmdMapped:     4096 kB
>> ...
>> [snip]
>> ...
>> THPeligible:    0
>>
>> And, /proc/meminfo does show THP allocated and PMD mapped too:
>>
>> ShmemHugePages:     4096 kB
>> ShmemPmdMapped:     4096 kB
>>
>> This doesn't make too much sense.  The anonymous THP flag should not
>> intervene shmem THP.  Calling shmem_huge_enabled() with checking
>> MMF_DISABLE_THP sounds good enough.  And, we could skip stack and
>> dax vma check since we already checked if the vma is shmem already.
>>
>> Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma")
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Kirill A. Shutemov <kirill@shutemov.name>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>> ---
>> v2: Check VM_NOHUGEPAGE per Michal Hocko
>>
>>   mm/huge_memory.c | 4 ++--
>>   mm/shmem.c       | 3 +++
>>   2 files changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 165ea46..5881e82 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -67,8 +67,8 @@ bool transparent_hugepage_enabled(struct vm_area_struct *vma)
>>   {
>>   	if (vma_is_anonymous(vma))
>>   		return __transparent_hugepage_enabled(vma);
>> -	if (vma_is_shmem(vma) && shmem_huge_enabled(vma))
>> -		return __transparent_hugepage_enabled(vma);
>> +	if (vma_is_shmem(vma))
>> +		return shmem_huge_enabled(vma);
>>   
>>   	return false;
>>   }
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 2275a0f..6f09a31 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -3873,6 +3873,9 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
>>   	loff_t i_size;
>>   	pgoff_t off;
>>   
>> +	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>> +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>> +		return false;
> Yes, that is correct; and correctly placed. But a little more is needed:
> see how mm/memory.c's transhuge_vma_suitable() will only allow a pmd to
> be used instead of a pte if the vma offset and size permit. smaps should
> not report a shmem vma as THPeligible if its offset or size prevent it.
>
> And I see that should also be fixed on anon vmas: at present smaps
> reports even a 4kB anon vma as THPeligible, which is not right.
> Maybe a test like transhuge_vma_suitable() can be added into
> transparent_hugepage_enabled(), to handle anon and shmem together.
> I say "like transhuge_vma_suitable()", because that function needs
> an address, which here you don't have.

Thanks for the remind. Since we don't have an address I'm supposed we 
just need check if the vma's size is big enough or not other than other 
alignment check.

And, I'm wondering whether we could reuse transhuge_vma_suitable() by 
passing in an impossible address, i.e. -1 since it is not a valid 
userspace address. It can be used as and indicator that this call is 
from THPeligible context.

>
> The anon offset situation is interesting: usually anon vm_pgoff is
> initialized to fit with its vm_start, so the anon offset check passes;
> but I wonder what happens after mremap to a different address - does
> transhuge_vma_suitable() then prevent the use of pmds where they could
> actually be used? Not a Number#1 priority to investigate or fix here!
> but a curiosity someone might want to look into.

Will mark on my TODO list.

>
>>   	if (shmem_huge == SHMEM_HUGE_FORCE)
>>   		return true;
>>   	if (shmem_huge == SHMEM_HUGE_DENY)
>> -- 
>> 1.8.3.1
>
> Even with your changes
> ShmemPmdMapped:     4096 kB
> THPeligible:    0
> will easily be seen: THPeligible reflects whether a huge page can be
> allocated and mapped by pmd in that vma; but if something else already
> allocated the huge page earlier, it will be mapped by pmd in this vma
> if offset and size allow, whatever THPeligible says. We could change
> transhuge_vma_suitable() to force ptes in that case, but it would be
> a silly change, just to make what smaps shows easier to explain.

Where did this come from? From the commit log? If so it is the example 
for the wrong smap output. If that case really happens, I think we could 
document it since THPeligible should just show the current status.

>
> Hugh
Hugh Dickins June 12, 2019, 6:44 p.m. UTC | #19
On Mon, 10 Jun 2019, Yang Shi wrote:
> On 6/7/19 8:58 PM, Hugh Dickins wrote:
> > Yes, that is correct; and correctly placed. But a little more is needed:
> > see how mm/memory.c's transhuge_vma_suitable() will only allow a pmd to
> > be used instead of a pte if the vma offset and size permit. smaps should
> > not report a shmem vma as THPeligible if its offset or size prevent it.
> > 
> > And I see that should also be fixed on anon vmas: at present smaps
> > reports even a 4kB anon vma as THPeligible, which is not right.
> > Maybe a test like transhuge_vma_suitable() can be added into
> > transparent_hugepage_enabled(), to handle anon and shmem together.
> > I say "like transhuge_vma_suitable()", because that function needs
> > an address, which here you don't have.
> 
> Thanks for the remind. Since we don't have an address I'm supposed we just
> need check if the vma's size is big enough or not other than other alignment
> check.
> 
> And, I'm wondering whether we could reuse transhuge_vma_suitable() by passing
> in an impossible address, i.e. -1 since it is not a valid userspace address.
> It can be used as and indicator that this call is from THPeligible context.

Perhaps, but sounds like it will abuse and uglify transhuge_vma_suitable()
just for smaps. Would passing transhuge_vma_suitable() the address
    ((vma->vm_end & HPAGE_PMD_MASK) - HPAGE_PMD_SIZE)
give the the correct answer in all cases?

> > 
> > The anon offset situation is interesting: usually anon vm_pgoff is
> > initialized to fit with its vm_start, so the anon offset check passes;
> > but I wonder what happens after mremap to a different address - does
> > transhuge_vma_suitable() then prevent the use of pmds where they could
> > actually be used? Not a Number#1 priority to investigate or fix here!
> > but a curiosity someone might want to look into.
> 
> Will mark on my TODO list.
> 
> > Even with your changes
> > ShmemPmdMapped:     4096 kB
> > THPeligible:    0
> > will easily be seen: THPeligible reflects whether a huge page can be
> > allocated and mapped by pmd in that vma; but if something else already
> > allocated the huge page earlier, it will be mapped by pmd in this vma
> > if offset and size allow, whatever THPeligible says. We could change
> > transhuge_vma_suitable() to force ptes in that case, but it would be
> > a silly change, just to make what smaps shows easier to explain.
> 
> Where did this come from? From the commit log? If so it is the example for
> the wrong smap output. If that case really happens, I think we could document
> it since THPeligible should just show the current status.

Please read again what I explained there: it's not necessarily an example
of wrong smaps output, it's reasonable smaps output for a reasonable case.

Yes, maybe Documentation/filesystems/proc.txt should explain "THPeligble"
a little better - "eligible for allocating THP pages" rather than just
"eligible for THP pages" would be good enough? we don't want to write
a book about the various cases.

Oh, and the "THPeligible" output lines up very nicely there in proc.txt:
could the actual alignment of that 0 or 1 be fixed in smaps itself too?

Thanks,
Hugh
Yang Shi June 12, 2019, 7:59 p.m. UTC | #20
On 6/12/19 11:44 AM, Hugh Dickins wrote:
> On Mon, 10 Jun 2019, Yang Shi wrote:
>> On 6/7/19 8:58 PM, Hugh Dickins wrote:
>>> Yes, that is correct; and correctly placed. But a little more is needed:
>>> see how mm/memory.c's transhuge_vma_suitable() will only allow a pmd to
>>> be used instead of a pte if the vma offset and size permit. smaps should
>>> not report a shmem vma as THPeligible if its offset or size prevent it.
>>>
>>> And I see that should also be fixed on anon vmas: at present smaps
>>> reports even a 4kB anon vma as THPeligible, which is not right.
>>> Maybe a test like transhuge_vma_suitable() can be added into
>>> transparent_hugepage_enabled(), to handle anon and shmem together.
>>> I say "like transhuge_vma_suitable()", because that function needs
>>> an address, which here you don't have.
>> Thanks for the remind. Since we don't have an address I'm supposed we just
>> need check if the vma's size is big enough or not other than other alignment
>> check.
>>
>> And, I'm wondering whether we could reuse transhuge_vma_suitable() by passing
>> in an impossible address, i.e. -1 since it is not a valid userspace address.
>> It can be used as and indicator that this call is from THPeligible context.
> Perhaps, but sounds like it will abuse and uglify transhuge_vma_suitable()
> just for smaps. Would passing transhuge_vma_suitable() the address
>      ((vma->vm_end & HPAGE_PMD_MASK) - HPAGE_PMD_SIZE)
> give the the correct answer in all cases?

Yes, it looks better.

>
>>> The anon offset situation is interesting: usually anon vm_pgoff is
>>> initialized to fit with its vm_start, so the anon offset check passes;
>>> but I wonder what happens after mremap to a different address - does
>>> transhuge_vma_suitable() then prevent the use of pmds where they could
>>> actually be used? Not a Number#1 priority to investigate or fix here!
>>> but a curiosity someone might want to look into.
>> Will mark on my TODO list.
>>
>>> Even with your changes
>>> ShmemPmdMapped:     4096 kB
>>> THPeligible:    0
>>> will easily be seen: THPeligible reflects whether a huge page can be
>>> allocated and mapped by pmd in that vma; but if something else already
>>> allocated the huge page earlier, it will be mapped by pmd in this vma
>>> if offset and size allow, whatever THPeligible says. We could change
>>> transhuge_vma_suitable() to force ptes in that case, but it would be
>>> a silly change, just to make what smaps shows easier to explain.
>> Where did this come from? From the commit log? If so it is the example for
>> the wrong smap output. If that case really happens, I think we could document
>> it since THPeligible should just show the current status.
> Please read again what I explained there: it's not necessarily an example
> of wrong smaps output, it's reasonable smaps output for a reasonable case.
>
> Yes, maybe Documentation/filesystems/proc.txt should explain "THPeligble"
> a little better - "eligible for allocating THP pages" rather than just
> "eligible for THP pages" would be good enough? we don't want to write
> a book about the various cases.

Yes, I agree.

>
> Oh, and the "THPeligible" output lines up very nicely there in proc.txt:
> could the actual alignment of that 0 or 1 be fixed in smaps itself too?

Sure.

Thanks,
Yang

>
> Thanks,
> Hugh
diff mbox series

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 165ea46..5881e82 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -67,8 +67,8 @@  bool transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
 	if (vma_is_anonymous(vma))
 		return __transparent_hugepage_enabled(vma);
-	if (vma_is_shmem(vma) && shmem_huge_enabled(vma))
-		return __transparent_hugepage_enabled(vma);
+	if (vma_is_shmem(vma))
+		return shmem_huge_enabled(vma);
 
 	return false;
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index 2275a0f..6f09a31 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3873,6 +3873,9 @@  bool shmem_huge_enabled(struct vm_area_struct *vma)
 	loff_t i_size;
 	pgoff_t off;
 
+	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
+	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
+		return false;
 	if (shmem_huge == SHMEM_HUGE_FORCE)
 		return true;
 	if (shmem_huge == SHMEM_HUGE_DENY)