diff mbox series

mm/thp: Correctly differentiate between mapped THP and PMD migration entry

Message ID 1539057538-27446-1-git-send-email-anshuman.khandual@arm.com (mailing list archive)
State New, archived
Headers show
Series mm/thp: Correctly differentiate between mapped THP and PMD migration entry | expand

Commit Message

Anshuman Khandual Oct. 9, 2018, 3:58 a.m. UTC
A normal mapped THP page at PMD level should be correctly differentiated
from a PMD migration entry while walking the page table. A mapped THP would
additionally check positive for pmd_present() along with pmd_trans_huge()
as compared to a PMD migration entry. This just adds a new conditional test
differentiating the two while walking the page table.

Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path")
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually
exclusive which makes the current conditional block work for both mapped
and migration entries. This is not same with arm64 where pmd_trans_huge()
returns positive for both mapped and migration entries. Could some one
please explain why pmd_trans_huge() has to return false for migration
entries which just install swap bits and its still a PMD ? Nonetheless
pmd_present() seems to be a better check to distinguish between mapped
and (non-mapped non-present) migration entries without any ambiguity.

 mm/page_vma_mapped.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Kirill A . Shutemov Oct. 9, 2018, 1:04 p.m. UTC | #1
On Tue, Oct 09, 2018 at 09:28:58AM +0530, Anshuman Khandual wrote:
> A normal mapped THP page at PMD level should be correctly differentiated
> from a PMD migration entry while walking the page table. A mapped THP would
> additionally check positive for pmd_present() along with pmd_trans_huge()
> as compared to a PMD migration entry. This just adds a new conditional test
> differentiating the two while walking the page table.
> 
> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path")
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually
> exclusive which makes the current conditional block work for both mapped
> and migration entries. This is not same with arm64 where pmd_trans_huge()
> returns positive for both mapped and migration entries. Could some one
> please explain why pmd_trans_huge() has to return false for migration
> entries which just install swap bits and its still a PMD ?

I guess it's just a design choice. Any reason why arm64 cannot do the
same?

> Nonetheless pmd_present() seems to be a better check to distinguish
> between mapped and (non-mapped non-present) migration entries without
> any ambiguity.

Can we instead reverse order of check:

if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
	pvmw->ptl = pmd_lock(mm, pvmw->pmd);
	if (!pmd_present(*pvmw->pmd)) {
		...
	} else if (likely(pmd_trans_huge(*pvmw->pmd))) {
		...
	} else {
		...
	}
...

This should cover both imeplementations of pmd_trans_huge().
Will Deacon Oct. 9, 2018, 1:18 p.m. UTC | #2
On Tue, Oct 09, 2018 at 04:04:21PM +0300, Kirill A. Shutemov wrote:
> On Tue, Oct 09, 2018 at 09:28:58AM +0530, Anshuman Khandual wrote:
> > A normal mapped THP page at PMD level should be correctly differentiated
> > from a PMD migration entry while walking the page table. A mapped THP would
> > additionally check positive for pmd_present() along with pmd_trans_huge()
> > as compared to a PMD migration entry. This just adds a new conditional test
> > differentiating the two while walking the page table.
> > 
> > Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path")
> > Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> > ---
> > On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually
> > exclusive which makes the current conditional block work for both mapped
> > and migration entries. This is not same with arm64 where pmd_trans_huge()
> > returns positive for both mapped and migration entries. Could some one
> > please explain why pmd_trans_huge() has to return false for migration
> > entries which just install swap bits and its still a PMD ?
> 
> I guess it's just a design choice. Any reason why arm64 cannot do the
> same?

Anshuman, would it work to:

#define pmd_trans_huge(pmd)     (pmd_present(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT))

?

> > Nonetheless pmd_present() seems to be a better check to distinguish
> > between mapped and (non-mapped non-present) migration entries without
> > any ambiguity.
> 
> Can we instead reverse order of check:
> 
> if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
> 	pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> 	if (!pmd_present(*pvmw->pmd)) {
> 		...
> 	} else if (likely(pmd_trans_huge(*pvmw->pmd))) {
> 		...
> 	} else {
> 		...
> 	}
> ...
> 
> This should cover both imeplementations of pmd_trans_huge().

I'd much rather have portable semantics for pmd_trans_huge(), if we can
achieve that efficiently. But that would be fast /and/ correct, so perhaps
I'm being too hopeful :)

Will
Anshuman Khandual Oct. 9, 2018, 1:42 p.m. UTC | #3
On 10/09/2018 06:34 PM, Kirill A. Shutemov wrote:
> On Tue, Oct 09, 2018 at 09:28:58AM +0530, Anshuman Khandual wrote:
>> A normal mapped THP page at PMD level should be correctly differentiated
>> from a PMD migration entry while walking the page table. A mapped THP would
>> additionally check positive for pmd_present() along with pmd_trans_huge()
>> as compared to a PMD migration entry. This just adds a new conditional test
>> differentiating the two while walking the page table.
>>
>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> ---
>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually
>> exclusive which makes the current conditional block work for both mapped
>> and migration entries. This is not same with arm64 where pmd_trans_huge()
>> returns positive for both mapped and migration entries. Could some one
>> please explain why pmd_trans_huge() has to return false for migration
>> entries which just install swap bits and its still a PMD ?
> 
> I guess it's just a design choice. Any reason why arm64 cannot do the
> same?
>
I think probably it can do. I am happy to look into these in detail what
will make pmd_trans_huge() return false on migration entries but it does
not quite sound like a right semantic at the moment.

>> Nonetheless pmd_present() seems to be a better check to distinguish
>> between mapped and (non-mapped non-present) migration entries without
>> any ambiguity.
> 
> Can we instead reverse order of check:
> 
> if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
> 	pvmw->ptl = pmd_lock(mm, pvmw->pmd);
> 	if (!pmd_present(*pvmw->pmd)) {
> 		...
> 	} else if (likely(pmd_trans_huge(*pvmw->pmd))) {
> 		...
> 	} else {
> 		...
> 	}
> ...
> 
> This should cover both imeplementations of pmd_trans_huge().

Yeah it does cover and I have tested it first before proposing the current
patch. The only problem is that the order saves the code :) Having another
reasonable check like pmd_present() prevents it from being broken if the
code block moves around for some reason. But I am happy to do either way.
Zi Yan Oct. 9, 2018, 1:58 p.m. UTC | #4
cc: Naoya Horiguchi (who proposed to use !_PAGE_PRESENT && !_PAGE_PSE for x86
PMD migration entry check)

On 8 Oct 2018, at 23:58, Anshuman Khandual wrote:

> A normal mapped THP page at PMD level should be correctly differentiated
> from a PMD migration entry while walking the page table. A mapped THP would
> additionally check positive for pmd_present() along with pmd_trans_huge()
> as compared to a PMD migration entry. This just adds a new conditional test
> differentiating the two while walking the page table.
>
> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path")
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually
> exclusive which makes the current conditional block work for both mapped
> and migration entries. This is not same with arm64 where pmd_trans_huge()

!pmd_present() && pmd_trans_huge() is used to represent THPs under splitting,
since _PAGE_PRESENT is cleared during THP splitting but _PAGE_PSE is not.
See the comment in pmd_present() for x86, in arch/x86/include/asm/pgtable.h

> returns positive for both mapped and migration entries. Could some one
> please explain why pmd_trans_huge() has to return false for migration
> entries which just install swap bits and its still a PMD ? Nonetheless
> pmd_present() seems to be a better check to distinguish between mapped
> and (non-mapped non-present) migration entries without any ambiguity.

If arm64 does it differently, I just wonder how THP splitting is handled
in arm64.


--
Best Regards
Yan Zi
Anshuman Khandual Oct. 10, 2018, 4:05 a.m. UTC | #5
On 10/09/2018 07:28 PM, Zi Yan wrote:
> cc: Naoya Horiguchi (who proposed to use !_PAGE_PRESENT && !_PAGE_PSE for x86
> PMD migration entry check)
> 
> On 8 Oct 2018, at 23:58, Anshuman Khandual wrote:
> 
>> A normal mapped THP page at PMD level should be correctly differentiated
>> from a PMD migration entry while walking the page table. A mapped THP would
>> additionally check positive for pmd_present() along with pmd_trans_huge()
>> as compared to a PMD migration entry. This just adds a new conditional test
>> differentiating the two while walking the page table.
>>
>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> ---
>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually
>> exclusive which makes the current conditional block work for both mapped
>> and migration entries. This is not same with arm64 where pmd_trans_huge()
> 
> !pmd_present() && pmd_trans_huge() is used to represent THPs under splitting,

Not really if we just look at code in the conditional blocks.

> since _PAGE_PRESENT is cleared during THP splitting but _PAGE_PSE is not.
> See the comment in pmd_present() for x86, in arch/x86/include/asm/pgtable.h


        if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
                pvmw->ptl = pmd_lock(mm, pvmw->pmd);
                if (likely(pmd_trans_huge(*pvmw->pmd))) {
                        if (pvmw->flags & PVMW_MIGRATION)
                                return not_found(pvmw);
                        if (pmd_page(*pvmw->pmd) != page)
                                return not_found(pvmw);
                        return true;
                } else if (!pmd_present(*pvmw->pmd)) {
                        if (thp_migration_supported()) {
                                if (!(pvmw->flags & PVMW_MIGRATION))
                                        return not_found(pvmw);
                                if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
                                        swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);

                                        if (migration_entry_to_page(entry) != page)
                                                return not_found(pvmw);
                                        return true;
                                }
                        }
                        return not_found(pvmw);
                } else {
                        /* THP pmd was split under us: handle on pte level */
                        spin_unlock(pvmw->ptl);
                        pvmw->ptl = NULL;
                }
        } else if (!pmd_present(pmde)) { ---> Outer 'else if'
                return false;
        }

Looking at the above code, it seems the conditional check for a THP
splitting case would be (!pmd_trans_huge && pmd_present) instead as
it has skipped the first two conditions. But THP splitting must have
been initiated once it has cleared the outer check (else it would not
have cleared otherwise)

if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)).


BTW what PMD state does the outer 'else if' block identify which must
have cleared the following condition to get there.

(!pmd_present && !pmd_trans_huge && !is_pmd_migration_entry)
Zi Yan Oct. 10, 2018, 12:43 p.m. UTC | #6
On 10 Oct 2018, at 0:05, Anshuman Khandual wrote:

> On 10/09/2018 07:28 PM, Zi Yan wrote:
>> cc: Naoya Horiguchi (who proposed to use !_PAGE_PRESENT && !_PAGE_PSE for x86
>> PMD migration entry check)
>>
>> On 8 Oct 2018, at 23:58, Anshuman Khandual wrote:
>>
>>> A normal mapped THP page at PMD level should be correctly differentiated
>>> from a PMD migration entry while walking the page table. A mapped THP would
>>> additionally check positive for pmd_present() along with pmd_trans_huge()
>>> as compared to a PMD migration entry. This just adds a new conditional test
>>> differentiating the two while walking the page table.
>>>
>>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>> ---
>>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually
>>> exclusive which makes the current conditional block work for both mapped
>>> and migration entries. This is not same with arm64 where pmd_trans_huge()
>>
>> !pmd_present() && pmd_trans_huge() is used to represent THPs under splitting,
>
> Not really if we just look at code in the conditional blocks.

Yeah, I explained it wrong above. Sorry about that.

In x86, pmd_present() checks (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE),
thus, it returns true even if the present bit is cleared but PSE bit is set.
This is done so, because THPs under splitting are regarded as present in the kernel
but not present when a hardware page table walker checks it.

For PMD migration entry, which should be regarded as not present, if PSE bit
is set, which makes pmd_trans_huge() returns true, like ARM64 does, all
PMD migration entries will be regarded as present.

My concern is that if ARM64’s pmd_trans_huge() returns true for migration
entries, unlike x86, there might be bugs triggered in the kernel when
THP migration is enabled in ARM64.

Let me know if I explain this clear to you.

>
>> since _PAGE_PRESENT is cleared during THP splitting but _PAGE_PSE is not.
>> See the comment in pmd_present() for x86, in arch/x86/include/asm/pgtable.h
>
>
>         if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
>                 pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>                 if (likely(pmd_trans_huge(*pvmw->pmd))) {
>                         if (pvmw->flags & PVMW_MIGRATION)
>                                 return not_found(pvmw);
>                         if (pmd_page(*pvmw->pmd) != page)
>                                 return not_found(pvmw);
>                         return true;
>                 } else if (!pmd_present(*pvmw->pmd)) {
>                         if (thp_migration_supported()) {
>                                 if (!(pvmw->flags & PVMW_MIGRATION))
>                                         return not_found(pvmw);
>                                 if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
>                                         swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
>
>                                         if (migration_entry_to_page(entry) != page)
>                                                 return not_found(pvmw);
>                                         return true;
>                                 }
>                         }
>                         return not_found(pvmw);
>                 } else {
>                         /* THP pmd was split under us: handle on pte level */
>                         spin_unlock(pvmw->ptl);
>                         pvmw->ptl = NULL;
>                 }
>         } else if (!pmd_present(pmde)) { ---> Outer 'else if'
>                 return false;
>         }
>
> Looking at the above code, it seems the conditional check for a THP
> splitting case would be (!pmd_trans_huge && pmd_present) instead as
> it has skipped the first two conditions. But THP splitting must have
> been initiated once it has cleared the outer check (else it would not
> have cleared otherwise)
>
> if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)).

If a THP is under splitting, both pmd_present() and pmd_trans_huge() return
true in x86. The else part (/* THP pmd was split under us … */) happens
after splitting is done.

> BTW what PMD state does the outer 'else if' block identify which must
> have cleared the following condition to get there.
>
> (!pmd_present && !pmd_trans_huge && !is_pmd_migration_entry)

I think it is the case that the PMD is gone or equivalently pmd_none().
This PMD entry is not in use.


—
Best Regards,
Yan Zi
Anshuman Khandual Oct. 12, 2018, 8 a.m. UTC | #7
On 10/10/2018 06:13 PM, Zi Yan wrote:
> On 10 Oct 2018, at 0:05, Anshuman Khandual wrote:
> 
>> On 10/09/2018 07:28 PM, Zi Yan wrote:
>>> cc: Naoya Horiguchi (who proposed to use !_PAGE_PRESENT && !_PAGE_PSE for x86
>>> PMD migration entry check)
>>>
>>> On 8 Oct 2018, at 23:58, Anshuman Khandual wrote:
>>>
>>>> A normal mapped THP page at PMD level should be correctly differentiated
>>>> from a PMD migration entry while walking the page table. A mapped THP would
>>>> additionally check positive for pmd_present() along with pmd_trans_huge()
>>>> as compared to a PMD migration entry. This just adds a new conditional test
>>>> differentiating the two while walking the page table.
>>>>
>>>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>>> ---
>>>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually
>>>> exclusive which makes the current conditional block work for both mapped
>>>> and migration entries. This is not same with arm64 where pmd_trans_huge()
>>>
>>> !pmd_present() && pmd_trans_huge() is used to represent THPs under splitting,
>>
>> Not really if we just look at code in the conditional blocks.
> 
> Yeah, I explained it wrong above. Sorry about that.
> 
> In x86, pmd_present() checks (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE),
> thus, it returns true even if the present bit is cleared but PSE bit is set.

Okay.

> This is done so, because THPs under splitting are regarded as present in the kernel
> but not present when a hardware page table walker checks it.

Okay.

> 
> For PMD migration entry, which should be regarded as not present, if PSE bit
> is set, which makes pmd_trans_huge() returns true, like ARM64 does, all
> PMD migration entries will be regarded as present

Okay to make pmd_present() return false pmd_trans_huge() has to return false
as well. Is there anything which can be done to get around this problem on
X86 ? pmd_trans_huge() returning true for a migration entry sounds logical.
Otherwise we would revert the condition block order to accommodate both the
implementation for pmd_trans_huge() as suggested by Kirill before or just
consider this patch forward.

Because I am not really sure yet about the idea of getting pmd_present()
check into pmd_trans_huge() on arm64 just to make it fit into this semantics
as suggested by Will. If a PMD is trans huge page or not should not depend on
whether it is present or not.

> 
> My concern is that if ARM64’s pmd_trans_huge() returns true for migration
> entries, unlike x86, there might be bugs triggered in the kernel when
> THP migration is enabled in ARM64.

Right and that is exactly what we are trying to fix with this patch.

>
> Let me know if I explain this clear to you.
> 
>>
>>> since _PAGE_PRESENT is cleared during THP splitting but _PAGE_PSE is not.
>>> See the comment in pmd_present() for x86, in arch/x86/include/asm/pgtable.h
>>
>>
>>         if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
>>                 pvmw->ptl = pmd_lock(mm, pvmw->pmd);
>>                 if (likely(pmd_trans_huge(*pvmw->pmd))) {
>>                         if (pvmw->flags & PVMW_MIGRATION)
>>                                 return not_found(pvmw);
>>                         if (pmd_page(*pvmw->pmd) != page)
>>                                 return not_found(pvmw);
>>                         return true;
>>                 } else if (!pmd_present(*pvmw->pmd)) {
>>                         if (thp_migration_supported()) {
>>                                 if (!(pvmw->flags & PVMW_MIGRATION))
>>                                         return not_found(pvmw);
>>                                 if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
>>                                         swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
>>
>>                                         if (migration_entry_to_page(entry) != page)
>>                                                 return not_found(pvmw);
>>                                         return true;
>>                                 }
>>                         }
>>                         return not_found(pvmw);
>>                 } else {
>>                         /* THP pmd was split under us: handle on pte level */
>>                         spin_unlock(pvmw->ptl);
>>                         pvmw->ptl = NULL;
>>                 }
>>         } else if (!pmd_present(pmde)) { ---> Outer 'else if'
>>                 return false;
>>         }
>>
>> Looking at the above code, it seems the conditional check for a THP
>> splitting case would be (!pmd_trans_huge && pmd_present) instead as
>> it has skipped the first two conditions. But THP splitting must have
>> been initiated once it has cleared the outer check (else it would not
>> have cleared otherwise)
>>
>> if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)).
> 
> If a THP is under splitting, both pmd_present() and pmd_trans_huge() return
> true in x86. The else part (/* THP pmd was split under us … */) happens
> after splitting is done.

Okay, got it.

> 
>> BTW what PMD state does the outer 'else if' block identify which must
>> have cleared the following condition to get there.
>>
>> (!pmd_present && !pmd_trans_huge && !is_pmd_migration_entry)
> 
> I think it is the case that the PMD is gone or equivalently pmd_none().
> This PMD entry is not in use.

Okay, got it.
Anshuman Khandual Oct. 12, 2018, 8:02 a.m. UTC | #8
On 10/09/2018 06:48 PM, Will Deacon wrote:
> On Tue, Oct 09, 2018 at 04:04:21PM +0300, Kirill A. Shutemov wrote:
>> On Tue, Oct 09, 2018 at 09:28:58AM +0530, Anshuman Khandual wrote:
>>> A normal mapped THP page at PMD level should be correctly differentiated
>>> from a PMD migration entry while walking the page table. A mapped THP would
>>> additionally check positive for pmd_present() along with pmd_trans_huge()
>>> as compared to a PMD migration entry. This just adds a new conditional test
>>> differentiating the two while walking the page table.
>>>
>>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>> ---
>>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually
>>> exclusive which makes the current conditional block work for both mapped
>>> and migration entries. This is not same with arm64 where pmd_trans_huge()
>>> returns positive for both mapped and migration entries. Could some one
>>> please explain why pmd_trans_huge() has to return false for migration
>>> entries which just install swap bits and its still a PMD ?
>>
>> I guess it's just a design choice. Any reason why arm64 cannot do the
>> same?
> 
> Anshuman, would it work to:
> 
> #define pmd_trans_huge(pmd)     (pmd_present(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT))
yeah this works but some how does not seem like the right thing to do
but can be the very last option.
Zi Yan Oct. 15, 2018, 12:53 a.m. UTC | #9
On 12 Oct 2018, at 4:00, Anshuman Khandual wrote:

> On 10/10/2018 06:13 PM, Zi Yan wrote:
>> On 10 Oct 2018, at 0:05, Anshuman Khandual wrote:
>>
>>> On 10/09/2018 07:28 PM, Zi Yan wrote:
>>>> cc: Naoya Horiguchi (who proposed to use !_PAGE_PRESENT && !_PAGE_PSE for x86
>>>> PMD migration entry check)
>>>>
>>>> On 8 Oct 2018, at 23:58, Anshuman Khandual wrote:
>>>>
>>>>> A normal mapped THP page at PMD level should be correctly differentiated
>>>>> from a PMD migration entry while walking the page table. A mapped THP would
>>>>> additionally check positive for pmd_present() along with pmd_trans_huge()
>>>>> as compared to a PMD migration entry. This just adds a new conditional test
>>>>> differentiating the two while walking the page table.
>>>>>
>>>>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>>>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>>>> ---
>>>>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually
>>>>> exclusive which makes the current conditional block work for both mapped
>>>>> and migration entries. This is not same with arm64 where pmd_trans_huge()
>>>>
>>>> !pmd_present() && pmd_trans_huge() is used to represent THPs under splitting,
>>>
>>> Not really if we just look at code in the conditional blocks.
>>
>> Yeah, I explained it wrong above. Sorry about that.
>>
>> In x86, pmd_present() checks (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE),
>> thus, it returns true even if the present bit is cleared but PSE bit is set.
>
> Okay.
>
>> This is done so, because THPs under splitting are regarded as present in the kernel
>> but not present when a hardware page table walker checks it.
>
> Okay.
>
>>
>> For PMD migration entry, which should be regarded as not present, if PSE bit
>> is set, which makes pmd_trans_huge() returns true, like ARM64 does, all
>> PMD migration entries will be regarded as present
>
> Okay to make pmd_present() return false pmd_trans_huge() has to return false
> as well. Is there anything which can be done to get around this problem on
> X86 ? pmd_trans_huge() returning true for a migration entry sounds logical.
> Otherwise we would revert the condition block order to accommodate both the
> implementation for pmd_trans_huge() as suggested by Kirill before or just
> consider this patch forward.
>
> Because I am not really sure yet about the idea of getting pmd_present()
> check into pmd_trans_huge() on arm64 just to make it fit into this semantics
> as suggested by Will. If a PMD is trans huge page or not should not depend on
> whether it is present or not.

In terms of THPs, we have three cases: a present THP, a THP under splitting,
and a THP under migration. pmd_present() and pmd_trans_huge() both return true
for a present THP and a THP under splitting, because they discover _PAGE_PSE bit
is set for both cases, whereas they both return false for a THP under migration.
You want to change them to make pmd_trans_huge() returns true for a THP under migration
instead of false to help ARM64’s support for THP migration.

For x86, this change requires:
1. changing the condition in pmd_trans_huge(), so that it returns true for
PMD migration entries;
2. changing the code, which calls pmd_trans_huge(), to match the new logic.

Another problem I see is that x86’s pmd_present() returns true for a THP under
splitting but ARM64’s pmd_present() returns false for a THP under splitting.
I do not know if there is any correctness issue with this. So I copy Andrea
here, since he made x86’s pmd_present() returns true for a THP under splitting
as an optimization. I want to understand more about it and potentially make
x86 and ARM64 (maybe all other architectures, too) return the same value
for all three cases mentioned above.


Hi Andrea, what is the purpose/benefit of making x86’s pmd_present() returns true
for a THP under splitting? Does it cause problems when ARM64’s pmd_present()
returns false in the same situation?


>>
>> My concern is that if ARM64’s pmd_trans_huge() returns true for migration
>> entries, unlike x86, there might be bugs triggered in the kernel when
>> THP migration is enabled in ARM64.
>
> Right and that is exactly what we are trying to fix with this patch.
>

I am not sure this patch can fix the problem in ARM64, because many other places
in the kernel, pmd_trans_huge() still returns false for a THP under migration.
We may need more comprehensive fixes for ARM64.

Thanks.

—
Best Regards,
Yan Zi
Anshuman Khandual Oct. 15, 2018, 4:06 a.m. UTC | #10
On 10/15/2018 06:23 AM, Zi Yan wrote:
> On 12 Oct 2018, at 4:00, Anshuman Khandual wrote:
> 
>> On 10/10/2018 06:13 PM, Zi Yan wrote:
>>> On 10 Oct 2018, at 0:05, Anshuman Khandual wrote:
>>>
>>>> On 10/09/2018 07:28 PM, Zi Yan wrote:
>>>>> cc: Naoya Horiguchi (who proposed to use !_PAGE_PRESENT && !_PAGE_PSE for x86
>>>>> PMD migration entry check)
>>>>>
>>>>> On 8 Oct 2018, at 23:58, Anshuman Khandual wrote:
>>>>>
>>>>>> A normal mapped THP page at PMD level should be correctly differentiated
>>>>>> from a PMD migration entry while walking the page table. A mapped THP would
>>>>>> additionally check positive for pmd_present() along with pmd_trans_huge()
>>>>>> as compared to a PMD migration entry. This just adds a new conditional test
>>>>>> differentiating the two while walking the page table.
>>>>>>
>>>>>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>>>>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>>>>> ---
>>>>>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually
>>>>>> exclusive which makes the current conditional block work for both mapped
>>>>>> and migration entries. This is not same with arm64 where pmd_trans_huge()
>>>>>
>>>>> !pmd_present() && pmd_trans_huge() is used to represent THPs under splitting,
>>>>
>>>> Not really if we just look at code in the conditional blocks.
>>>
>>> Yeah, I explained it wrong above. Sorry about that.
>>>
>>> In x86, pmd_present() checks (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE),
>>> thus, it returns true even if the present bit is cleared but PSE bit is set.
>>
>> Okay.
>>
>>> This is done so, because THPs under splitting are regarded as present in the kernel
>>> but not present when a hardware page table walker checks it.
>>
>> Okay.
>>
>>>
>>> For PMD migration entry, which should be regarded as not present, if PSE bit
>>> is set, which makes pmd_trans_huge() returns true, like ARM64 does, all
>>> PMD migration entries will be regarded as present
>>
>> Okay to make pmd_present() return false pmd_trans_huge() has to return false
>> as well. Is there anything which can be done to get around this problem on
>> X86 ? pmd_trans_huge() returning true for a migration entry sounds logical.
>> Otherwise we would revert the condition block order to accommodate both the
>> implementation for pmd_trans_huge() as suggested by Kirill before or just
>> consider this patch forward.
>>
>> Because I am not really sure yet about the idea of getting pmd_present()
>> check into pmd_trans_huge() on arm64 just to make it fit into this semantics
>> as suggested by Will. If a PMD is trans huge page or not should not depend on
>> whether it is present or not.
> 
> In terms of THPs, we have three cases: a present THP, a THP under splitting,
> and a THP under migration. pmd_present() and pmd_trans_huge() both return true
> for a present THP and a THP under splitting, because they discover _PAGE_PSE bit

Then how do we differentiate between a mapped THP and a splitting THP.

> is set for both cases, whereas they both return false for a THP under migration.
> You want to change them to make pmd_trans_huge() returns true for a THP under migration
> instead of false to help ARM64’s support for THP migration.
I am just trying to understand the rationale behind this semantics and see where
it should be fixed.

I think the fundamental problem here is that THP under split has been difficult
to be re-presented through the available helper functions and in turn PTE bits.

The following checks

1) pmd_present()
2) pmd_trans_huge()

Represent three THP states

1) Mapped THP		(pmd_present && pmd_trans_huge)
2) Splitting THP	(pmd_present && pmd_trans_huge)
3) Migrating THP	(!pmd_present && !pmd_trans_huge)

The problem is if we make pmd_trans_huge() return true for all the three states
which sounds logical because they are all still trans huge PMD, then pmd_present()
can only represent two states not three as required.

> 
> For x86, this change requires:
> 1. changing the condition in pmd_trans_huge(), so that it returns true for
> PMD migration entries;
> 2. changing the code, which calls pmd_trans_huge(), to match the new logic.
Can those be fixed with an additional check for pmd_present() as suggested here
in this patch ? Asking because in case we could not get common semantics for
these helpers on all arch that would be a fall back option for the moment.

> 
> Another problem I see is that x86’s pmd_present() returns true for a THP under
> splitting but ARM64’s pmd_present() returns false for a THP under splitting.

But how did you conclude this ? I dont see any explicit helper for splitting
THP. Could you please point me in the code ?

> I do not know if there is any correctness issue with this. So I copy Andrea
> here, since he made x86’s pmd_present() returns true for a THP under splitting
> as an optimization. I want to understand more about it and potentially make
> x86 and ARM64 (maybe all other architectures, too) return the same value
> for all three cases mentioned above.

I agree. Fixing the semantics is the right thing to do. I am kind of wondering if
it would be a good idea to have explicit helpers for (1) mapped THP, (2) splitting
THP like the one for (3) migrating THP (e.g is_pmd_migration_entry) and use them
in various conditional blocks instead of looking out for multiple checks like
pmd_trans_huge(), pmd_present() etc. It will help unify the semantics as well.

> 
> 
> Hi Andrea, what is the purpose/benefit of making x86’s pmd_present() returns true
> for a THP under splitting? Does it cause problems when ARM64’s pmd_present()
> returns false in the same situation?
> 
> 
>>>
>>> My concern is that if ARM64’s pmd_trans_huge() returns true for migration
>>> entries, unlike x86, there might be bugs triggered in the kernel when
>>> THP migration is enabled in ARM64.
>>
>> Right and that is exactly what we are trying to fix with this patch.
>>
> 
> I am not sure this patch can fix the problem in ARM64, because many other places
> in the kernel, pmd_trans_huge() still returns false for a THP under migration.
> We may need more comprehensive fixes for ARM64.
Are there more places where semantics needs to be fixed than what was originally
added through 616b8371539a ("mm: thp: enable thp migration in generic path").
Aneesh Kumar K.V Oct. 15, 2018, 8:32 a.m. UTC | #11
On 10/12/18 1:32 PM, Anshuman Khandual wrote:
> 
> 
> On 10/09/2018 06:48 PM, Will Deacon wrote:
>> On Tue, Oct 09, 2018 at 04:04:21PM +0300, Kirill A. Shutemov wrote:
>>> On Tue, Oct 09, 2018 at 09:28:58AM +0530, Anshuman Khandual wrote:
>>>> A normal mapped THP page at PMD level should be correctly differentiated
>>>> from a PMD migration entry while walking the page table. A mapped THP would
>>>> additionally check positive for pmd_present() along with pmd_trans_huge()
>>>> as compared to a PMD migration entry. This just adds a new conditional test
>>>> differentiating the two while walking the page table.
>>>>
>>>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>>> ---
>>>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually
>>>> exclusive which makes the current conditional block work for both mapped
>>>> and migration entries. This is not same with arm64 where pmd_trans_huge()
>>>> returns positive for both mapped and migration entries. Could some one
>>>> please explain why pmd_trans_huge() has to return false for migration
>>>> entries which just install swap bits and its still a PMD ?
>>>
>>> I guess it's just a design choice. Any reason why arm64 cannot do the
>>> same?
>>
>> Anshuman, would it work to:
>>
>> #define pmd_trans_huge(pmd)     (pmd_present(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT))
> yeah this works but some how does not seem like the right thing to do
> but can be the very last option.
> 


There can be other code paths that makes that assumption. I ended up 
doing the below for pmd_trans_huge on ppc64.

/*
  * Only returns true for a THP. False for pmd migration entry.
  * We also need to return true when we come across a pte that
  * in between a thp split. While splitting THP, we mark the pmd
  * invalid (pmdp_invalidate()) before we set it with pte page
  * address. A pmd_trans_huge() check against a pmd entry during that time
  * should return true.
  * We should not call this on a hugetlb entry. We should check for HugeTLB
  * entry using vma->vm_flags
  * The page table walk rule is explained in Documentation/vm/transhuge.rst
  */
static inline int pmd_trans_huge(pmd_t pmd)
{
	if (!pmd_present(pmd))
		return false;

	if (radix_enabled())
		return radix__pmd_trans_huge(pmd);
	return hash__pmd_trans_huge(pmd);
}

-aneesh
Anshuman Khandual Oct. 16, 2018, 1:16 p.m. UTC | #12
On 10/15/2018 02:02 PM, Aneesh Kumar K.V wrote:
> On 10/12/18 1:32 PM, Anshuman Khandual wrote:
>>
>>
>> On 10/09/2018 06:48 PM, Will Deacon wrote:
>>> On Tue, Oct 09, 2018 at 04:04:21PM +0300, Kirill A. Shutemov wrote:
>>>> On Tue, Oct 09, 2018 at 09:28:58AM +0530, Anshuman Khandual wrote:
>>>>> A normal mapped THP page at PMD level should be correctly differentiated
>>>>> from a PMD migration entry while walking the page table. A mapped THP would
>>>>> additionally check positive for pmd_present() along with pmd_trans_huge()
>>>>> as compared to a PMD migration entry. This just adds a new conditional test
>>>>> differentiating the two while walking the page table.
>>>>>
>>>>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>>>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>>>> ---
>>>>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually
>>>>> exclusive which makes the current conditional block work for both mapped
>>>>> and migration entries. This is not same with arm64 where pmd_trans_huge()
>>>>> returns positive for both mapped and migration entries. Could some one
>>>>> please explain why pmd_trans_huge() has to return false for migration
>>>>> entries which just install swap bits and its still a PMD ?
>>>>
>>>> I guess it's just a design choice. Any reason why arm64 cannot do the
>>>> same?
>>>
>>> Anshuman, would it work to:
>>>
>>> #define pmd_trans_huge(pmd)     (pmd_present(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT))
>> yeah this works but some how does not seem like the right thing to do
>> but can be the very last option.
>>
> 
> 
> There can be other code paths that makes that assumption. I ended up doing the below for pmd_trans_huge on ppc64.
>

Yeah, did see that in one of the previous proposals.

https://patchwork.kernel.org/patch/10544291/

But the existing semantics does not look right and makes vague assumptions.
Zi Yan has already asked Andrea for his input in this regard on the next
thread. So I guess while being here, its a good idea to revisit existing
semantics and it's assumptions before fixing it in arch specific helpers.

- Anshuman 


> /*
>  * Only returns true for a THP. False for pmd migration entry.
>  * We also need to return true when we come across a pte that
>  * in between a thp split. While splitting THP, we mark the pmd
>  * invalid (pmdp_invalidate()) before we set it with pte page
>  * address. A pmd_trans_huge() check against a pmd entry during that time
>  * should return true.
>  * We should not call this on a hugetlb entry. We should check for HugeTLB
>  * entry using vma->vm_flags
>  * The page table walk rule is explained in Documentation/vm/transhuge.rst
>  */
> static inline int pmd_trans_huge(pmd_t pmd)
> {
>     if (!pmd_present(pmd))
>         return false;
> 
>     if (radix_enabled())
>         return radix__pmd_trans_huge(pmd);
>     return hash__pmd_trans_huge(pmd);
> }
> 
> -aneesh
>
Zi Yan Oct. 16, 2018, 2:31 p.m. UTC | #13
On 15 Oct 2018, at 0:06, Anshuman Khandual wrote:

> On 10/15/2018 06:23 AM, Zi Yan wrote:
>> On 12 Oct 2018, at 4:00, Anshuman Khandual wrote:
>>
>>> On 10/10/2018 06:13 PM, Zi Yan wrote:
>>>> On 10 Oct 2018, at 0:05, Anshuman Khandual wrote:
>>>>
>>>>> On 10/09/2018 07:28 PM, Zi Yan wrote:
>>>>>> cc: Naoya Horiguchi (who proposed to use !_PAGE_PRESENT && !_PAGE_PSE for x86
>>>>>> PMD migration entry check)
>>>>>>
>>>>>> On 8 Oct 2018, at 23:58, Anshuman Khandual wrote:
>>>>>>
>>>>>>> A normal mapped THP page at PMD level should be correctly differentiated
>>>>>>> from a PMD migration entry while walking the page table. A mapped THP would
>>>>>>> additionally check positive for pmd_present() along with pmd_trans_huge()
>>>>>>> as compared to a PMD migration entry. This just adds a new conditional test
>>>>>>> differentiating the two while walking the page table.
>>>>>>>
>>>>>>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>>>>>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>>>>>> ---
>>>>>>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually
>>>>>>> exclusive which makes the current conditional block work for both mapped
>>>>>>> and migration entries. This is not same with arm64 where pmd_trans_huge()
>>>>>>
>>>>>> !pmd_present() && pmd_trans_huge() is used to represent THPs under splitting,
>>>>>
>>>>> Not really if we just look at code in the conditional blocks.
>>>>
>>>> Yeah, I explained it wrong above. Sorry about that.
>>>>
>>>> In x86, pmd_present() checks (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE),
>>>> thus, it returns true even if the present bit is cleared but PSE bit is set.
>>>
>>> Okay.
>>>
>>>> This is done so, because THPs under splitting are regarded as present in the kernel
>>>> but not present when a hardware page table walker checks it.
>>>
>>> Okay.
>>>
>>>>
>>>> For PMD migration entry, which should be regarded as not present, if PSE bit
>>>> is set, which makes pmd_trans_huge() returns true, like ARM64 does, all
>>>> PMD migration entries will be regarded as present
>>>
>>> Okay to make pmd_present() return false pmd_trans_huge() has to return false
>>> as well. Is there anything which can be done to get around this problem on
>>> X86 ? pmd_trans_huge() returning true for a migration entry sounds logical.
>>> Otherwise we would revert the condition block order to accommodate both the
>>> implementation for pmd_trans_huge() as suggested by Kirill before or just
>>> consider this patch forward.
>>>
>>> Because I am not really sure yet about the idea of getting pmd_present()
>>> check into pmd_trans_huge() on arm64 just to make it fit into this semantics
>>> as suggested by Will. If a PMD is trans huge page or not should not depend on
>>> whether it is present or not.
>>
>> In terms of THPs, we have three cases: a present THP, a THP under splitting,
>> and a THP under migration. pmd_present() and pmd_trans_huge() both return true
>> for a present THP and a THP under splitting, because they discover _PAGE_PSE bit
>
> Then how do we differentiate between a mapped THP and a splitting THP.

AFAIK, in x86, there is no distinction between a mapped THP and a splitting THP
using helper functions.

A mapped THP has _PAGE_PRESENT bit and _PAGE_PSE bit set, whereas a splitting THP
has only _PAGE_PSE bit set. But both pmd_present() and pmd_trans_huge() return
true as long as _PAGE_PSE bit is set.

>
>> is set for both cases, whereas they both return false for a THP under migration.
>> You want to change them to make pmd_trans_huge() returns true for a THP under migration
>> instead of false to help ARM64’s support for THP migration.
> I am just trying to understand the rationale behind this semantics and see where
> it should be fixed.
>
> I think the fundamental problem here is that THP under split has been difficult
> to be re-presented through the available helper functions and in turn PTE bits.
>
> The following checks
>
> 1) pmd_present()
> 2) pmd_trans_huge()
>
> Represent three THP states
>
> 1) Mapped THP		(pmd_present && pmd_trans_huge)
> 2) Splitting THP	(pmd_present && pmd_trans_huge)
> 3) Migrating THP	(!pmd_present && !pmd_trans_huge)
>
> The problem is if we make pmd_trans_huge() return true for all the three states
> which sounds logical because they are all still trans huge PMD, then pmd_present()
> can only represent two states not three as required.

We are on the same page about representing three THP states in x86.
I also agree with you that it is logical to use three distinct representations
for these three states, i.e. splitting THP could be changed to (!pmd_present && pmd_trans_huge).


>>
>> For x86, this change requires:
>> 1. changing the condition in pmd_trans_huge(), so that it returns true for
>> PMD migration entries;
>> 2. changing the code, which calls pmd_trans_huge(), to match the new logic.
> Can those be fixed with an additional check for pmd_present() as suggested here
> in this patch ? Asking because in case we could not get common semantics for
> these helpers on all arch that would be a fall back option for the moment.

It would be OK for x86, since pmd_trans_huge() implies pmd_present() and hence
adding pmd_present() to pmd_trans_huge() makes no difference. But for ARM64,
from my understanding of the code described below, adding pmd_present() to
pmd_trans_huge() seems to exclude splitting THPs from the original semantic.


>>
>> Another problem I see is that x86’s pmd_present() returns true for a THP under
>> splitting but ARM64’s pmd_present() returns false for a THP under splitting.
>
> But how did you conclude this ? I dont see any explicit helper for splitting
> THP. Could you please point me in the code ?

From the code I read for ARM64
(https://elixir.bootlin.com/linux/v4.19-rc8/source/arch/arm64/include/asm/pgtable.h#L360
and https://elixir.bootlin.com/linux/v4.19-rc8/source/arch/arm64/include/asm/pgtable.h#L86),
pmd_present() only checks _PAGE_PRESENT and _PAGE_PROTONE. During a THP splitting,
pmdp_invalidate() clears _PAGE_PRESENT (https://elixir.bootlin.com/linux/v4.19-rc8/source/mm/huge_memory.c#L2130). So pmd_present() returns false in ARM64. Let me know
if I got anything wrong.



>> I do not know if there is any correctness issue with this. So I copy Andrea
>> here, since he made x86’s pmd_present() returns true for a THP under splitting
>> as an optimization. I want to understand more about it and potentially make
>> x86 and ARM64 (maybe all other architectures, too) return the same value
>> for all three cases mentioned above.
>
> I agree. Fixing the semantics is the right thing to do. I am kind of wondering if
> it would be a good idea to have explicit helpers for (1) mapped THP, (2) splitting
> THP like the one for (3) migrating THP (e.g is_pmd_migration_entry) and use them
> in various conditional blocks instead of looking out for multiple checks like
> pmd_trans_huge(), pmd_present() etc. It will help unify the semantics as well.
>

I agree that explicit and distinct helpers for all three THP states would be helpful.

>>
>>
>> Hi Andrea, what is the purpose/benefit of making x86’s pmd_present() returns true
>> for a THP under splitting? Does it cause problems when ARM64’s pmd_present()
>> returns false in the same situation?
>>
>>
>>>>
>>>> My concern is that if ARM64’s pmd_trans_huge() returns true for migration
>>>> entries, unlike x86, there might be bugs triggered in the kernel when
>>>> THP migration is enabled in ARM64.
>>>
>>> Right and that is exactly what we are trying to fix with this patch.
>>>
>>
>> I am not sure this patch can fix the problem in ARM64, because many other places
>> in the kernel, pmd_trans_huge() still returns false for a THP under migration.
>> We may need more comprehensive fixes for ARM64.
> Are there more places where semantics needs to be fixed than what was originally
> added through 616b8371539a ("mm: thp: enable thp migration in generic path").

I guess not, but it would be safer to grep for all pmd_trans_huge() and pmd_present().

--
Best Regards
Yan Zi
Andrea Arcangeli Oct. 17, 2018, 2:09 a.m. UTC | #14
Hello Zi,

On Sun, Oct 14, 2018 at 08:53:55PM -0400, Zi Yan wrote:
> Hi Andrea, what is the purpose/benefit of making x86’s pmd_present() returns true
> for a THP under splitting? Does it cause problems when ARM64’s pmd_present()
> returns false in the same situation?

!pmd_present means it's a migration entry or swap entry and doesn't
point to RAM. It means if you do pmd_to_page(*pmd) it will return you
an undefined result.

During splitting the physical page is still very well pointed by the
pmd as long as pmd_trans_huge returns true and you hold the
pmd_lock.

pmd_trans_huge must be true at all times for a transhuge pmd that
points to a hugepage, or all VM fast paths won't serialize with the
pmd_lock, that is the only reason why, and it's a very good reason
because it avoids to take the pmd_lock when walking over non transhuge
pmds (i.e. when there are no THP allocated).

Now if we've to keep _PAGE_PSE set and return true in pmd_trans_huge
at all times, why would you want to make pmd_present return false? How
could it help if pmd_trans_huge returns true, but pmd_present returns
false despite pmd_to_page works fine and the pmd is really still
pointing to the page?

When userland faults on such pmd !pmd_present it will make the page
fault take a swap or migration path, but that's the wrong path if the
pmd points to RAM.

What we need to do during split is an invalidate of the huge TLB.
There's no pmd_trans_splitting anymore, so we only clear the present
bit in the PTE despite pmd_present still returns true (just like
PROT_NONE, nothing new in this respect). pmd_present never meant the
real present bit in the pte was set, it just means the pmd points to
RAM. It means it doesn't point to swap or migration entry and you can
do pmd_to_page and it works fine.

We need to invalidate the TLB by clearing the present bit and by
flushing the TLB before overwriting the transhuge pmd with the regular
pte (i.e. to make it non huge). That is actually required by an errata
(l1 cache aliasing of the same mapping through two different TLB of
two different sizes broke some old CPU and triggered machine checks).
It's not something fundamentally necessary from a common code point of
view. It's more risky from an hardware (not software) standpoint and
before you can get rid of the pmd you need to do a TLB flush anyway to
be sure CPUs stops using it, so better clear the present bit before
doing the real costly thing (the tlb flush with IPIs). Clearing the
present bit during the TLB flush is a cost that gets lost in the noise.

The clear of the real present bit during pmd (virtual) splitting is
done with pmdp_invalidate, that is created specifically to keeps
pmd_trans_huge=true, pmd_present=true despite the present bit is not
set. So you could imagine _PAGE_PSE as the real present bit.

Before the physical split was deferred and decoupled from the virtual
memory pmd split, pmd_trans_splitting allowed to wait the split to
finish and to keep all gup_fast at bay during it (while the page was
still mapped readable and writable in userland by other CPUs). Now the
physical split is deferred so you just split the pmd locally and only
a physical split invoked on the page (not the virtual split invoked on
the pmd with split_huge_pmd) has to keep gup at bay, and it does so by
freezing the refcount so all gup_fast fail with the
page_cache_get_speculative during the freeze. This removed the need of
the pmd_splitting flag in gup_fast (when pmd_splitting was set gup
fast had to go through the non-fast gup), but it means that now a
hugepage cannot be physically splitted if it's gup pinned. The main
difference is that freezing the refcount can fail, so the code must
learn to cope with such failure and defer it. Decoupling the physical
and virtual splits introduced the need of tracking the doublemap case
with a new PG_double_map flag too. It makes the refcounting of
hugepages trivial in comparison (identical to hugetlbfs in fact), but
it requires total_mapcount to account for all those huge and non huge
mappings. It primarily pays off to add THP to tmpfs where the physical
split may have to be deferred for pagecache reasons anyway.

Thanks,
Andrea
Naoya Horiguchi Oct. 18, 2018, 2:17 a.m. UTC | #15
On Tue, Oct 16, 2018 at 10:31:50AM -0400, Zi Yan wrote:
> On 15 Oct 2018, at 0:06, Anshuman Khandual wrote:
> 
> > On 10/15/2018 06:23 AM, Zi Yan wrote:
> >> On 12 Oct 2018, at 4:00, Anshuman Khandual wrote:
> >>
> >>> On 10/10/2018 06:13 PM, Zi Yan wrote:
> >>>> On 10 Oct 2018, at 0:05, Anshuman Khandual wrote:
> >>>>
> >>>>> On 10/09/2018 07:28 PM, Zi Yan wrote:
> >>>>>> cc: Naoya Horiguchi (who proposed to use !_PAGE_PRESENT && !_PAGE_PSE for x86
> >>>>>> PMD migration entry check)
> >>>>>>
> >>>>>> On 8 Oct 2018, at 23:58, Anshuman Khandual wrote:
> >>>>>>
> >>>>>>> A normal mapped THP page at PMD level should be correctly differentiated
> >>>>>>> from a PMD migration entry while walking the page table. A mapped THP would
> >>>>>>> additionally check positive for pmd_present() along with pmd_trans_huge()
> >>>>>>> as compared to a PMD migration entry. This just adds a new conditional test
> >>>>>>> differentiating the two while walking the page table.
> >>>>>>>
> >>>>>>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path")
> >>>>>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> >>>>>>> ---
> >>>>>>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually
> >>>>>>> exclusive which makes the current conditional block work for both mapped
> >>>>>>> and migration entries. This is not same with arm64 where pmd_trans_huge()
> >>>>>>
> >>>>>> !pmd_present() && pmd_trans_huge() is used to represent THPs under splitting,
> >>>>>
> >>>>> Not really if we just look at code in the conditional blocks.
> >>>>
> >>>> Yeah, I explained it wrong above. Sorry about that.
> >>>>
> >>>> In x86, pmd_present() checks (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE),
> >>>> thus, it returns true even if the present bit is cleared but PSE bit is set.
> >>>
> >>> Okay.
> >>>
> >>>> This is done so, because THPs under splitting are regarded as present in the kernel
> >>>> but not present when a hardware page table walker checks it.
> >>>
> >>> Okay.
> >>>
> >>>>
> >>>> For PMD migration entry, which should be regarded as not present, if PSE bit
> >>>> is set, which makes pmd_trans_huge() returns true, like ARM64 does, all
> >>>> PMD migration entries will be regarded as present
> >>>
> >>> Okay to make pmd_present() return false pmd_trans_huge() has to return false
> >>> as well. Is there anything which can be done to get around this problem on
> >>> X86 ? pmd_trans_huge() returning true for a migration entry sounds logical.
> >>> Otherwise we would revert the condition block order to accommodate both the
> >>> implementation for pmd_trans_huge() as suggested by Kirill before or just
> >>> consider this patch forward.
> >>>
> >>> Because I am not really sure yet about the idea of getting pmd_present()
> >>> check into pmd_trans_huge() on arm64 just to make it fit into this semantics
> >>> as suggested by Will. If a PMD is trans huge page or not should not depend on
> >>> whether it is present or not.
> >>
> >> In terms of THPs, we have three cases: a present THP, a THP under splitting,
> >> and a THP under migration. pmd_present() and pmd_trans_huge() both return true
> >> for a present THP and a THP under splitting, because they discover _PAGE_PSE bit
> >
> > Then how do we differentiate between a mapped THP and a splitting THP.
> 
> AFAIK, in x86, there is no distinction between a mapped THP and a splitting THP
> using helper functions.
> 
> A mapped THP has _PAGE_PRESENT bit and _PAGE_PSE bit set, whereas a splitting THP
> has only _PAGE_PSE bit set. But both pmd_present() and pmd_trans_huge() return
> true as long as _PAGE_PSE bit is set.
> 
> >
> >> is set for both cases, whereas they both return false for a THP under migration.
> >> You want to change them to make pmd_trans_huge() returns true for a THP under migration
> >> instead of false to help ARM64’s support for THP migration.
> > I am just trying to understand the rationale behind this semantics and see where
> > it should be fixed.
> >
> > I think the fundamental problem here is that THP under split has been difficult
> > to be re-presented through the available helper functions and in turn PTE bits.
> >
> > The following checks
> >
> > 1) pmd_present()
> > 2) pmd_trans_huge()
> >
> > Represent three THP states
> >
> > 1) Mapped THP		(pmd_present && pmd_trans_huge)
> > 2) Splitting THP	(pmd_present && pmd_trans_huge)
> > 3) Migrating THP	(!pmd_present && !pmd_trans_huge)
> >
> > The problem is if we make pmd_trans_huge() return true for all the three states
> > which sounds logical because they are all still trans huge PMD, then pmd_present()
> > can only represent two states not three as required.
> 
> We are on the same page about representing three THP states in x86.
> I also agree with you that it is logical to use three distinct representations
> for these three states, i.e. splitting THP could be changed to (!pmd_present && pmd_trans_huge).

I think that the behavior of pmd_trans_huge() for non-present pmd is
undefined by its nature. IOW, it's no use determining whether it's thp or
not for non-existing pages because it does not exist :)

So I think that the right direction is to make sure that pmd_trans_huge() is
never checked for non-present pmd, just like Kirill's suggestion.  And maybe
we have some room for engineering to ensure it (rather than just commenting it).

Thanks,
Naoya Horiguchi
Zi Yan Oct. 22, 2018, 2 p.m. UTC | #16
Hi Andrea,

On 16 Oct 2018, at 22:09, Andrea Arcangeli wrote:

> Hello Zi,
>
> On Sun, Oct 14, 2018 at 08:53:55PM -0400, Zi Yan wrote:
>> Hi Andrea, what is the purpose/benefit of making x86’s pmd_present() returns true
>> for a THP under splitting? Does it cause problems when ARM64’s pmd_present()
>> returns false in the same situation?
>
> !pmd_present means it's a migration entry or swap entry and doesn't
> point to RAM. It means if you do pmd_to_page(*pmd) it will return you
> an undefined result.
>
> During splitting the physical page is still very well pointed by the
> pmd as long as pmd_trans_huge returns true and you hold the
> pmd_lock.
>
> pmd_trans_huge must be true at all times for a transhuge pmd that
> points to a hugepage, or all VM fast paths won't serialize with the
> pmd_lock, that is the only reason why, and it's a very good reason
> because it avoids to take the pmd_lock when walking over non transhuge
> pmds (i.e. when there are no THP allocated).
>
> Now if we've to keep _PAGE_PSE set and return true in pmd_trans_huge
> at all times, why would you want to make pmd_present return false? How
> could it help if pmd_trans_huge returns true, but pmd_present returns
> false despite pmd_to_page works fine and the pmd is really still
> pointing to the page?
>
> When userland faults on such pmd !pmd_present it will make the page
> fault take a swap or migration path, but that's the wrong path if the
> pmd points to RAM.
>
> What we need to do during split is an invalidate of the huge TLB.
> There's no pmd_trans_splitting anymore, so we only clear the present
> bit in the PTE despite pmd_present still returns true (just like
> PROT_NONE, nothing new in this respect). pmd_present never meant the
> real present bit in the pte was set, it just means the pmd points to
> RAM. It means it doesn't point to swap or migration entry and you can
> do pmd_to_page and it works fine.
>
> We need to invalidate the TLB by clearing the present bit and by
> flushing the TLB before overwriting the transhuge pmd with the regular
> pte (i.e. to make it non huge). That is actually required by an errata
> (l1 cache aliasing of the same mapping through two different TLB of
> two different sizes broke some old CPU and triggered machine checks).
> It's not something fundamentally necessary from a common code point of
> view. It's more risky from an hardware (not software) standpoint and
> before you can get rid of the pmd you need to do a TLB flush anyway to
> be sure CPUs stops using it, so better clear the present bit before
> doing the real costly thing (the tlb flush with IPIs). Clearing the
> present bit during the TLB flush is a cost that gets lost in the noise.
>
> The clear of the real present bit during pmd (virtual) splitting is
> done with pmdp_invalidate, that is created specifically to keeps
> pmd_trans_huge=true, pmd_present=true despite the present bit is not
> set. So you could imagine _PAGE_PSE as the real present bit.
>
> Before the physical split was deferred and decoupled from the virtual
> memory pmd split, pmd_trans_splitting allowed to wait the split to
> finish and to keep all gup_fast at bay during it (while the page was
> still mapped readable and writable in userland by other CPUs). Now the
> physical split is deferred so you just split the pmd locally and only
> a physical split invoked on the page (not the virtual split invoked on
> the pmd with split_huge_pmd) has to keep gup at bay, and it does so by
> freezing the refcount so all gup_fast fail with the
> page_cache_get_speculative during the freeze. This removed the need of
> the pmd_splitting flag in gup_fast (when pmd_splitting was set gup
> fast had to go through the non-fast gup), but it means that now a
> hugepage cannot be physically splitted if it's gup pinned. The main
> difference is that freezing the refcount can fail, so the code must
> learn to cope with such failure and defer it. Decoupling the physical
> and virtual splits introduced the need of tracking the doublemap case
> with a new PG_double_map flag too. It makes the refcounting of
> hugepages trivial in comparison (identical to hugetlbfs in fact), but
> it requires total_mapcount to account for all those huge and non huge
> mappings. It primarily pays off to add THP to tmpfs where the physical
> split may have to be deferred for pagecache reasons anyway.

Thanks for your detailed explanation!

Do you think it is worth documenting what you have said? At least on
why we want pmd_present() and pmd_trans_huge() both return true when
a THP is under splitting, so that we can avoid some confusion in the future.
I can send a patch to add it to Document/vm/transhuge.rst.

--
Best Regards
Yan Zi
Anshuman Khandual Oct. 25, 2018, 8:10 a.m. UTC | #17
On 10/16/2018 08:01 PM, Zi Yan wrote:
> On 15 Oct 2018, at 0:06, Anshuman Khandual wrote:
> 
>> On 10/15/2018 06:23 AM, Zi Yan wrote:
>>> On 12 Oct 2018, at 4:00, Anshuman Khandual wrote:
>>>
>>>> On 10/10/2018 06:13 PM, Zi Yan wrote:
>>>>> On 10 Oct 2018, at 0:05, Anshuman Khandual wrote:
>>>>>
>>>>>> On 10/09/2018 07:28 PM, Zi Yan wrote:
>>>>>>> cc: Naoya Horiguchi (who proposed to use !_PAGE_PRESENT && !_PAGE_PSE for x86
>>>>>>> PMD migration entry check)
>>>>>>>
>>>>>>> On 8 Oct 2018, at 23:58, Anshuman Khandual wrote:
>>>>>>>
>>>>>>>> A normal mapped THP page at PMD level should be correctly differentiated
>>>>>>>> from a PMD migration entry while walking the page table. A mapped THP would
>>>>>>>> additionally check positive for pmd_present() along with pmd_trans_huge()
>>>>>>>> as compared to a PMD migration entry. This just adds a new conditional test
>>>>>>>> differentiating the two while walking the page table.
>>>>>>>>
>>>>>>>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>>>>>>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>>>>>>> ---
>>>>>>>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually
>>>>>>>> exclusive which makes the current conditional block work for both mapped
>>>>>>>> and migration entries. This is not same with arm64 where pmd_trans_huge()
>>>>>>>
>>>>>>> !pmd_present() && pmd_trans_huge() is used to represent THPs under splitting,
>>>>>>
>>>>>> Not really if we just look at code in the conditional blocks.
>>>>>
>>>>> Yeah, I explained it wrong above. Sorry about that.
>>>>>
>>>>> In x86, pmd_present() checks (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE),
>>>>> thus, it returns true even if the present bit is cleared but PSE bit is set.
>>>>
>>>> Okay.
>>>>
>>>>> This is done so, because THPs under splitting are regarded as present in the kernel
>>>>> but not present when a hardware page table walker checks it.
>>>>
>>>> Okay.
>>>>
>>>>>
>>>>> For PMD migration entry, which should be regarded as not present, if PSE bit
>>>>> is set, which makes pmd_trans_huge() returns true, like ARM64 does, all
>>>>> PMD migration entries will be regarded as present
>>>>
>>>> Okay to make pmd_present() return false pmd_trans_huge() has to return false
>>>> as well. Is there anything which can be done to get around this problem on
>>>> X86 ? pmd_trans_huge() returning true for a migration entry sounds logical.
>>>> Otherwise we would revert the condition block order to accommodate both the
>>>> implementation for pmd_trans_huge() as suggested by Kirill before or just
>>>> consider this patch forward.
>>>>
>>>> Because I am not really sure yet about the idea of getting pmd_present()
>>>> check into pmd_trans_huge() on arm64 just to make it fit into this semantics
>>>> as suggested by Will. If a PMD is trans huge page or not should not depend on
>>>> whether it is present or not.
>>>
>>> In terms of THPs, we have three cases: a present THP, a THP under splitting,
>>> and a THP under migration. pmd_present() and pmd_trans_huge() both return true
>>> for a present THP and a THP under splitting, because they discover _PAGE_PSE bit
>>
>> Then how do we differentiate between a mapped THP and a splitting THP.
> 
> AFAIK, in x86, there is no distinction between a mapped THP and a splitting THP
> using helper functions.
> 
> A mapped THP has _PAGE_PRESENT bit and _PAGE_PSE bit set, whereas a splitting THP
> has only _PAGE_PSE bit set. But both pmd_present() and pmd_trans_huge() return
> true as long as _PAGE_PSE bit is set.

I understand that. What I was wondering was since there is a need to differentiate
between a mapped THP and a splitting THP at various places in generic THP, we would
need to way to identify each of them unambiguously some how. Is that particular
assumption wrong ? Dont we need to differentiate between a mapped THP and THP under
splitting ?

> 
>>
>>> is set for both cases, whereas they both return false for a THP under migration.
>>> You want to change them to make pmd_trans_huge() returns true for a THP under migration
>>> instead of false to help ARM64’s support for THP migration.
>> I am just trying to understand the rationale behind this semantics and see where
>> it should be fixed.
>>
>> I think the fundamental problem here is that THP under split has been difficult
>> to be re-presented through the available helper functions and in turn PTE bits.
>>
>> The following checks
>>
>> 1) pmd_present()
>> 2) pmd_trans_huge()
>>
>> Represent three THP states
>>
>> 1) Mapped THP		(pmd_present && pmd_trans_huge)
>> 2) Splitting THP	(pmd_present && pmd_trans_huge)
>> 3) Migrating THP	(!pmd_present && !pmd_trans_huge)
>>
>> The problem is if we make pmd_trans_huge() return true for all the three states
>> which sounds logical because they are all still trans huge PMD, then pmd_present()
>> can only represent two states not three as required.
> 
> We are on the same page about representing three THP states in x86.
> I also agree with you that it is logical to use three distinct representations
> for these three states, i.e. splitting THP could be changed to (!pmd_present && pmd_trans_huge

Right. Also we need clear wrapper around them in line with is_pmd_migration_entry() to
represent three states all of which calling pmd_present() and pmd_trans_huge() which
are exported by various architectures with exact same semantics without any ambiguity.

1) is_pmd_mapped_entry()
2) is_pmd_splitting_entry()
3) is_pmd_migration_entry()

> 
> 
>>>
>>> For x86, this change requires:
>>> 1. changing the condition in pmd_trans_huge(), so that it returns true for
>>> PMD migration entries;
>>> 2. changing the code, which calls pmd_trans_huge(), to match the new logic.
>> Can those be fixed with an additional check for pmd_present() as suggested here
>> in this patch ? Asking because in case we could not get common semantics for
>> these helpers on all arch that would be a fall back option for the moment.
> 
> It would be OK for x86, since pmd_trans_huge() implies pmd_present() and hence
> adding pmd_present() to pmd_trans_huge() makes no difference. But for ARM64,
> from my understanding of the code described below, adding pmd_present() to
> pmd_trans_huge() seems to exclude splitting THPs from the original semantic.
> 
> 
>>>
>>> Another problem I see is that x86’s pmd_present() returns true for a THP under
>>> splitting but ARM64’s pmd_present() returns false for a THP under splitting.
>>
>> But how did you conclude this ? I dont see any explicit helper for splitting
>> THP. Could you please point me in the code ?
> 
> From the code I read for ARM64
> (https://elixir.bootlin.com/linux/v4.19-rc8/source/arch/arm64/include/asm/pgtable.h#L360
> and https://elixir.bootlin.com/linux/v4.19-rc8/source/arch/arm64/include/asm/pgtable.h#L86),
> pmd_present() only checks _PAGE_PRESENT and _PAGE_PROTONE. During a THP splitting,

These are PTE_VALID and PTE_PROT_NONE instead on arm64. But yes, they are equivalent
to __PAGE_PRESENT and __PAGE_PROTNONE on other archs.

#define pmd_present(pmd)        pte_present(pmd_pte(pmd))
#define pte_present(pte)        (!!(pte_val(pte) & (PTE_VALID | PTE_PROT_NONE)))

> pmdp_invalidate() clears _PAGE_PRESENT (https://elixir.bootlin.com/linux/v4.19-rc8/source/mm/huge_memory.c#L2130). So pmd_present() returns false in ARM64. Let me know
> if I got anything wrong.
>

old_pmd = pmdp_invalidate(vma, haddr, pmd);

__split_huge_pmd_locked -> pmdp_invalidate (the above mentioned instance)
pmdp_invalidate -> pmd_mknotpresent

#define pmd_mknotpresent(pmd)   (__pmd(pmd_val(pmd) & ~PMD_SECT_VALID)

Generic pmdp invalidation removes PMD_SECT_VALID from a mapped PMD entry.
PMD_SECT_VALID is similar to PTE_VALID through identified separately. So you
are right, on arm64 pmd_present() return false for THP under splitting.

> 
> 
>>> I do not know if there is any correctness issue with this. So I copy Andrea
>>> here, since he made x86’s pmd_present() returns true for a THP under splitting
>>> as an optimization. I want to understand more about it and potentially make
>>> x86 and ARM64 (maybe all other architectures, too) return the same value
>>> for all three cases mentioned above.
>>
>> I agree. Fixing the semantics is the right thing to do. I am kind of wondering if
>> it would be a good idea to have explicit helpers for (1) mapped THP, (2) splitting
>> THP like the one for (3) migrating THP (e.g is_pmd_migration_entry) and use them
>> in various conditional blocks instead of looking out for multiple checks like
>> pmd_trans_huge(), pmd_present() etc. It will help unify the semantics as well.
>>
> 
> I agree that explicit and distinct helpers for all three THP states would be helpful.
> 

Right.

>>>
>>>
>>> Hi Andrea, what is the purpose/benefit of making x86’s pmd_present() returns true
>>> for a THP under splitting? Does it cause problems when ARM64’s pmd_present()
>>> returns false in the same situation?
>>>
>>>
>>>>>
>>>>> My concern is that if ARM64’s pmd_trans_huge() returns true for migration
>>>>> entries, unlike x86, there might be bugs triggered in the kernel when
>>>>> THP migration is enabled in ARM64.
>>>>
>>>> Right and that is exactly what we are trying to fix with this patch.
>>>>
>>>
>>> I am not sure this patch can fix the problem in ARM64, because many other places
>>> in the kernel, pmd_trans_huge() still returns false for a THP under migration.
>>> We may need more comprehensive fixes for ARM64.
>> Are there more places where semantics needs to be fixed than what was originally
>> added through 616b8371539a ("mm: thp: enable thp migration in generic path").
> 
> I guess not, but it would be safer to grep for all pmd_trans_huge() and pmd_present().
Sure.
Zi Yan Oct. 25, 2018, 6:45 p.m. UTC | #18
On 25 Oct 2018, at 4:10, Anshuman Khandual wrote:

> On 10/16/2018 08:01 PM, Zi Yan wrote:
>> On 15 Oct 2018, at 0:06, Anshuman Khandual wrote:
>>
>>> On 10/15/2018 06:23 AM, Zi Yan wrote:
>>>> On 12 Oct 2018, at 4:00, Anshuman Khandual wrote:
>>>>
>>>>> On 10/10/2018 06:13 PM, Zi Yan wrote:
>>>>>> On 10 Oct 2018, at 0:05, Anshuman Khandual wrote:
>>>>>>
>>>>>>> On 10/09/2018 07:28 PM, Zi Yan wrote:
>>>>>>>> cc: Naoya Horiguchi (who proposed to use !_PAGE_PRESENT && !_PAGE_PSE for x86
>>>>>>>> PMD migration entry check)
>>>>>>>>
>>>>>>>> On 8 Oct 2018, at 23:58, Anshuman Khandual wrote:
>>>>>>>>
>>>>>>>>> A normal mapped THP page at PMD level should be correctly differentiated
>>>>>>>>> from a PMD migration entry while walking the page table. A mapped THP would
>>>>>>>>> additionally check positive for pmd_present() along with pmd_trans_huge()
>>>>>>>>> as compared to a PMD migration entry. This just adds a new conditional test
>>>>>>>>> differentiating the two while walking the page table.
>>>>>>>>>
>>>>>>>>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>>>>>>>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>>>>>>>> ---
>>>>>>>>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually
>>>>>>>>> exclusive which makes the current conditional block work for both mapped
>>>>>>>>> and migration entries. This is not same with arm64 where pmd_trans_huge()
>>>>>>>>
>>>>>>>> !pmd_present() && pmd_trans_huge() is used to represent THPs under splitting,
>>>>>>>
>>>>>>> Not really if we just look at code in the conditional blocks.
>>>>>>
>>>>>> Yeah, I explained it wrong above. Sorry about that.
>>>>>>
>>>>>> In x86, pmd_present() checks (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE),
>>>>>> thus, it returns true even if the present bit is cleared but PSE bit is set.
>>>>>
>>>>> Okay.
>>>>>
>>>>>> This is done so, because THPs under splitting are regarded as present in the kernel
>>>>>> but not present when a hardware page table walker checks it.
>>>>>
>>>>> Okay.
>>>>>
>>>>>>
>>>>>> For PMD migration entry, which should be regarded as not present, if PSE bit
>>>>>> is set, which makes pmd_trans_huge() returns true, like ARM64 does, all
>>>>>> PMD migration entries will be regarded as present
>>>>>
>>>>> Okay to make pmd_present() return false pmd_trans_huge() has to return false
>>>>> as well. Is there anything which can be done to get around this problem on
>>>>> X86 ? pmd_trans_huge() returning true for a migration entry sounds logical.
>>>>> Otherwise we would revert the condition block order to accommodate both the
>>>>> implementation for pmd_trans_huge() as suggested by Kirill before or just
>>>>> consider this patch forward.
>>>>>
>>>>> Because I am not really sure yet about the idea of getting pmd_present()
>>>>> check into pmd_trans_huge() on arm64 just to make it fit into this semantics
>>>>> as suggested by Will. If a PMD is trans huge page or not should not depend on
>>>>> whether it is present or not.
>>>>
>>>> In terms of THPs, we have three cases: a present THP, a THP under splitting,
>>>> and a THP under migration. pmd_present() and pmd_trans_huge() both return true
>>>> for a present THP and a THP under splitting, because they discover _PAGE_PSE bit
>>>
>>> Then how do we differentiate between a mapped THP and a splitting THP.
>>
>> AFAIK, in x86, there is no distinction between a mapped THP and a splitting THP
>> using helper functions.
>>
>> A mapped THP has _PAGE_PRESENT bit and _PAGE_PSE bit set, whereas a splitting THP
>> has only _PAGE_PSE bit set. But both pmd_present() and pmd_trans_huge() return
>> true as long as _PAGE_PSE bit is set.
>
> I understand that. What I was wondering was since there is a need to differentiate
> between a mapped THP and a splitting THP at various places in generic THP, we would
> need to way to identify each of them unambiguously some how. Is that particular
> assumption wrong ? Dont we need to differentiate between a mapped THP and THP under
> splitting ?

According to Andrea's explanation here: https://lore.kernel.org/patchwork/patch/997412/#1184298,
we do not distinguish between a mapped THP and a splitting THP, because pmd_to_page()
can return valid pages for both cases.

>>
>>>
>>>> is set for both cases, whereas they both return false for a THP under migration.
>>>> You want to change them to make pmd_trans_huge() returns true for a THP under migration
>>>> instead of false to help ARM64’s support for THP migration.
>>> I am just trying to understand the rationale behind this semantics and see where
>>> it should be fixed.
>>>
>>> I think the fundamental problem here is that THP under split has been difficult
>>> to be re-presented through the available helper functions and in turn PTE bits.
>>>
>>> The following checks
>>>
>>> 1) pmd_present()
>>> 2) pmd_trans_huge()
>>>
>>> Represent three THP states
>>>
>>> 1) Mapped THP		(pmd_present && pmd_trans_huge)
>>> 2) Splitting THP	(pmd_present && pmd_trans_huge)
>>> 3) Migrating THP	(!pmd_present && !pmd_trans_huge)
>>>
>>> The problem is if we make pmd_trans_huge() return true for all the three states
>>> which sounds logical because they are all still trans huge PMD, then pmd_present()
>>> can only represent two states not three as required.
>>
>> We are on the same page about representing three THP states in x86.
>> I also agree with you that it is logical to use three distinct representations
>> for these three states, i.e. splitting THP could be changed to (!pmd_present && pmd_trans_huge
>
> Right. Also we need clear wrapper around them in line with is_pmd_migration_entry() to
> represent three states all of which calling pmd_present() and pmd_trans_huge() which
> are exported by various architectures with exact same semantics without any ambiguity.
>
> 1) is_pmd_mapped_entry()
> 2) is_pmd_splitting_entry()
> 3) is_pmd_migration_entry()

I think the semantics of pmd_trans_huge() is that the pmd entry is pointing to
a huge page. So is_pmd_mapped_entry() is the same as is_pmd_splitting_entry()
in terms of that.

According to Andrea's explanation:https://lore.kernel.org/patchwork/patch/997412/#1184298,
the semantics can avoid pmd_lock serializations on all VM fast paths, which
is valid IMHO.


>>
>>
>>>>
>>>> For x86, this change requires:
>>>> 1. changing the condition in pmd_trans_huge(), so that it returns true for
>>>> PMD migration entries;
>>>> 2. changing the code, which calls pmd_trans_huge(), to match the new logic.
>>> Can those be fixed with an additional check for pmd_present() as suggested here
>>> in this patch ? Asking because in case we could not get common semantics for
>>> these helpers on all arch that would be a fall back option for the moment.
>>
>> It would be OK for x86, since pmd_trans_huge() implies pmd_present() and hence
>> adding pmd_present() to pmd_trans_huge() makes no difference. But for ARM64,
>> from my understanding of the code described below, adding pmd_present() to
>> pmd_trans_huge() seems to exclude splitting THPs from the original semantic.
>>
>>
>>>>
>>>> Another problem I see is that x86’s pmd_present() returns true for a THP under
>>>> splitting but ARM64’s pmd_present() returns false for a THP under splitting.
>>>
>>> But how did you conclude this ? I dont see any explicit helper for splitting
>>> THP. Could you please point me in the code ?
>>
>> From the code I read for ARM64
>> (https://elixir.bootlin.com/linux/v4.19-rc8/source/arch/arm64/include/asm/pgtable.h#L360
>> and https://elixir.bootlin.com/linux/v4.19-rc8/source/arch/arm64/include/asm/pgtable.h#L86),
>> pmd_present() only checks _PAGE_PRESENT and _PAGE_PROTONE. During a THP splitting,
>
> These are PTE_VALID and PTE_PROT_NONE instead on arm64. But yes, they are equivalent
> to __PAGE_PRESENT and __PAGE_PROTNONE on other archs.
>
> #define pmd_present(pmd)        pte_present(pmd_pte(pmd))
> #define pte_present(pte)        (!!(pte_val(pte) & (PTE_VALID | PTE_PROT_NONE)))
>
>> pmdp_invalidate() clears _PAGE_PRESENT (https://elixir.bootlin.com/linux/v4.19-rc8/source/mm/huge_memory.c#L2130). So pmd_present() returns false in ARM64. Let me know
>> if I got anything wrong.
>>
>
> old_pmd = pmdp_invalidate(vma, haddr, pmd);
>
> __split_huge_pmd_locked -> pmdp_invalidate (the above mentioned instance)
> pmdp_invalidate -> pmd_mknotpresent
>
> #define pmd_mknotpresent(pmd)   (__pmd(pmd_val(pmd) & ~PMD_SECT_VALID)
>
> Generic pmdp invalidation removes PMD_SECT_VALID from a mapped PMD entry.
> PMD_SECT_VALID is similar to PTE_VALID through identified separately. So you
> are right, on arm64 pmd_present() return false for THP under splitting.

This may actually cause problems in arm64, since the kernel will miss all splitting THPs.

In sum, according to Andrea's explanation, I think it is better to adjust
arm64's pmd_present() and pmd_trans_huge() to match what x86's semantics.
Otherwise, arm64 might hit bugs while handling THPs.

--
Best Regards
Yan Zi
Anshuman Khandual Oct. 26, 2018, 1:39 a.m. UTC | #19
On 10/26/2018 12:15 AM, Zi Yan wrote:
> On 25 Oct 2018, at 4:10, Anshuman Khandual wrote:
> 
>> On 10/16/2018 08:01 PM, Zi Yan wrote:
>>> On 15 Oct 2018, at 0:06, Anshuman Khandual wrote:
>>>
>>>> On 10/15/2018 06:23 AM, Zi Yan wrote:
>>>>> On 12 Oct 2018, at 4:00, Anshuman Khandual wrote:
>>>>>
>>>>>> On 10/10/2018 06:13 PM, Zi Yan wrote:
>>>>>>> On 10 Oct 2018, at 0:05, Anshuman Khandual wrote:
>>>>>>>
>>>>>>>> On 10/09/2018 07:28 PM, Zi Yan wrote:
>>>>>>>>> cc: Naoya Horiguchi (who proposed to use !_PAGE_PRESENT && !_PAGE_PSE for x86
>>>>>>>>> PMD migration entry check)
>>>>>>>>>
>>>>>>>>> On 8 Oct 2018, at 23:58, Anshuman Khandual wrote:
>>>>>>>>>
>>>>>>>>>> A normal mapped THP page at PMD level should be correctly differentiated
>>>>>>>>>> from a PMD migration entry while walking the page table. A mapped THP would
>>>>>>>>>> additionally check positive for pmd_present() along with pmd_trans_huge()
>>>>>>>>>> as compared to a PMD migration entry. This just adds a new conditional test
>>>>>>>>>> differentiating the two while walking the page table.
>>>>>>>>>>
>>>>>>>>>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>>>>>>>>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>>>>>>>>> ---
>>>>>>>>>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually
>>>>>>>>>> exclusive which makes the current conditional block work for both mapped
>>>>>>>>>> and migration entries. This is not same with arm64 where pmd_trans_huge()
>>>>>>>>>
>>>>>>>>> !pmd_present() && pmd_trans_huge() is used to represent THPs under splitting,
>>>>>>>>
>>>>>>>> Not really if we just look at code in the conditional blocks.
>>>>>>>
>>>>>>> Yeah, I explained it wrong above. Sorry about that.
>>>>>>>
>>>>>>> In x86, pmd_present() checks (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE),
>>>>>>> thus, it returns true even if the present bit is cleared but PSE bit is set.
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>> This is done so, because THPs under splitting are regarded as present in the kernel
>>>>>>> but not present when a hardware page table walker checks it.
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>>
>>>>>>> For PMD migration entry, which should be regarded as not present, if PSE bit
>>>>>>> is set, which makes pmd_trans_huge() returns true, like ARM64 does, all
>>>>>>> PMD migration entries will be regarded as present
>>>>>>
>>>>>> Okay to make pmd_present() return false pmd_trans_huge() has to return false
>>>>>> as well. Is there anything which can be done to get around this problem on
>>>>>> X86 ? pmd_trans_huge() returning true for a migration entry sounds logical.
>>>>>> Otherwise we would revert the condition block order to accommodate both the
>>>>>> implementation for pmd_trans_huge() as suggested by Kirill before or just
>>>>>> consider this patch forward.
>>>>>>
>>>>>> Because I am not really sure yet about the idea of getting pmd_present()
>>>>>> check into pmd_trans_huge() on arm64 just to make it fit into this semantics
>>>>>> as suggested by Will. If a PMD is trans huge page or not should not depend on
>>>>>> whether it is present or not.
>>>>>
>>>>> In terms of THPs, we have three cases: a present THP, a THP under splitting,
>>>>> and a THP under migration. pmd_present() and pmd_trans_huge() both return true
>>>>> for a present THP and a THP under splitting, because they discover _PAGE_PSE bit
>>>>
>>>> Then how do we differentiate between a mapped THP and a splitting THP.
>>>
>>> AFAIK, in x86, there is no distinction between a mapped THP and a splitting THP
>>> using helper functions.
>>>
>>> A mapped THP has _PAGE_PRESENT bit and _PAGE_PSE bit set, whereas a splitting THP
>>> has only _PAGE_PSE bit set. But both pmd_present() and pmd_trans_huge() return
>>> true as long as _PAGE_PSE bit is set.
>>
>> I understand that. What I was wondering was since there is a need to differentiate
>> between a mapped THP and a splitting THP at various places in generic THP, we would
>> need to way to identify each of them unambiguously some how. Is that particular
>> assumption wrong ? Dont we need to differentiate between a mapped THP and THP under
>> splitting ?
> 
> According to Andrea's explanation here: https://lore.kernel.org/patchwork/patch/997412/#1184298,
> we do not distinguish between a mapped THP and a splitting THP, because pmd_to_page()
> can return valid pages for both cases.

I have gone through Andrea's explanation but still trying to understand all those
details in there. Will get back on this.

> 
>>>
>>>>
>>>>> is set for both cases, whereas they both return false for a THP under migration.
>>>>> You want to change them to make pmd_trans_huge() returns true for a THP under migration
>>>>> instead of false to help ARM64’s support for THP migration.
>>>> I am just trying to understand the rationale behind this semantics and see where
>>>> it should be fixed.
>>>>
>>>> I think the fundamental problem here is that THP under split has been difficult
>>>> to be re-presented through the available helper functions and in turn PTE bits.
>>>>
>>>> The following checks
>>>>
>>>> 1) pmd_present()
>>>> 2) pmd_trans_huge()
>>>>
>>>> Represent three THP states
>>>>
>>>> 1) Mapped THP		(pmd_present && pmd_trans_huge)
>>>> 2) Splitting THP	(pmd_present && pmd_trans_huge)
>>>> 3) Migrating THP	(!pmd_present && !pmd_trans_huge)
>>>>
>>>> The problem is if we make pmd_trans_huge() return true for all the three states
>>>> which sounds logical because they are all still trans huge PMD, then pmd_present()
>>>> can only represent two states not three as required.
>>>
>>> We are on the same page about representing three THP states in x86.
>>> I also agree with you that it is logical to use three distinct representations
>>> for these three states, i.e. splitting THP could be changed to (!pmd_present && pmd_trans_huge
>>
>> Right. Also we need clear wrapper around them in line with is_pmd_migration_entry() to
>> represent three states all of which calling pmd_present() and pmd_trans_huge() which
>> are exported by various architectures with exact same semantics without any ambiguity.
>>
>> 1) is_pmd_mapped_entry()
>> 2) is_pmd_splitting_entry()
>> 3) is_pmd_migration_entry()
> 
> I think the semantics of pmd_trans_huge() is that the pmd entry is pointing to
> a huge page. So is_pmd_mapped_entry() is the same as is_pmd_splitting_entry()
> in terms of that.

Because in both the situations pmd_page() returns true which means PMD really points
to a huge page in the RAM which is true for a mapped and a splitting PMD entry.

> 
> According to Andrea's explanation:https://lore.kernel.org/patchwork/patch/997412/#1184298,
> the semantics can avoid pmd_lock serializations on all VM fast paths, which
> is valid IMHO.

But still I do not understand how not distinguishing between a mapped PMD entry
and splitting PMD entry can help speed up core VM paths by not taking pmd_lock.
IIUC pmd_lock has to be taken before splitting the PMD entry into constituent
pages and generic page table walk does not take the lock. Hence trying to see
how not distinguishing mapped and splitting PMD entry helps here.

> 
> 
>>>
>>>
>>>>>
>>>>> For x86, this change requires:
>>>>> 1. changing the condition in pmd_trans_huge(), so that it returns true for
>>>>> PMD migration entries;
>>>>> 2. changing the code, which calls pmd_trans_huge(), to match the new logic.
>>>> Can those be fixed with an additional check for pmd_present() as suggested here
>>>> in this patch ? Asking because in case we could not get common semantics for
>>>> these helpers on all arch that would be a fall back option for the moment.
>>>
>>> It would be OK for x86, since pmd_trans_huge() implies pmd_present() and hence
>>> adding pmd_present() to pmd_trans_huge() makes no difference. But for ARM64,
>>> from my understanding of the code described below, adding pmd_present() to
>>> pmd_trans_huge() seems to exclude splitting THPs from the original semantic.
>>>
>>>
>>>>>
>>>>> Another problem I see is that x86’s pmd_present() returns true for a THP under
>>>>> splitting but ARM64’s pmd_present() returns false for a THP under splitting.
>>>>
>>>> But how did you conclude this ? I dont see any explicit helper for splitting
>>>> THP. Could you please point me in the code ?
>>>
>>> From the code I read for ARM64
>>> (https://elixir.bootlin.com/linux/v4.19-rc8/source/arch/arm64/include/asm/pgtable.h#L360
>>> and https://elixir.bootlin.com/linux/v4.19-rc8/source/arch/arm64/include/asm/pgtable.h#L86),
>>> pmd_present() only checks _PAGE_PRESENT and _PAGE_PROTONE. During a THP splitting,
>>
>> These are PTE_VALID and PTE_PROT_NONE instead on arm64. But yes, they are equivalent
>> to __PAGE_PRESENT and __PAGE_PROTNONE on other archs.
>>
>> #define pmd_present(pmd)        pte_present(pmd_pte(pmd))
>> #define pte_present(pte)        (!!(pte_val(pte) & (PTE_VALID | PTE_PROT_NONE)))
>>
>>> pmdp_invalidate() clears _PAGE_PRESENT (https://elixir.bootlin.com/linux/v4.19-rc8/source/mm/huge_memory.c#L2130). So pmd_present() returns false in ARM64. Let me know
>>> if I got anything wrong.
>>>
>>
>> old_pmd = pmdp_invalidate(vma, haddr, pmd);
>>
>> __split_huge_pmd_locked -> pmdp_invalidate (the above mentioned instance)
>> pmdp_invalidate -> pmd_mknotpresent
>>
>> #define pmd_mknotpresent(pmd)   (__pmd(pmd_val(pmd) & ~PMD_SECT_VALID)
>>
>> Generic pmdp invalidation removes PMD_SECT_VALID from a mapped PMD entry.
>> PMD_SECT_VALID is similar to PTE_VALID through identified separately. So you
>> are right, on arm64 pmd_present() return false for THP under splitting.
> 
> This may actually cause problems in arm64, since the kernel will miss all splitting THPs.

Now I understand why pmd_present() needs to return true for a splitting PMD entry.
Yeah, this needs to be fixed on arm64.

> 
> In sum, according to Andrea's explanation, I think it is better to adjust
> arm64's pmd_present() and pmd_trans_huge() to match what x86's semantics.
> Otherwise, arm64 might hit bugs while handling THPs.

I will look into this further. Thanks for the explanation.
Anshuman Khandual Nov. 2, 2018, 5:22 a.m. UTC | #20
On 10/18/2018 07:47 AM, Naoya Horiguchi wrote:
> On Tue, Oct 16, 2018 at 10:31:50AM -0400, Zi Yan wrote:
>> On 15 Oct 2018, at 0:06, Anshuman Khandual wrote:
>>
>>> On 10/15/2018 06:23 AM, Zi Yan wrote:
>>>> On 12 Oct 2018, at 4:00, Anshuman Khandual wrote:
>>>>
>>>>> On 10/10/2018 06:13 PM, Zi Yan wrote:
>>>>>> On 10 Oct 2018, at 0:05, Anshuman Khandual wrote:
>>>>>>
>>>>>>> On 10/09/2018 07:28 PM, Zi Yan wrote:
>>>>>>>> cc: Naoya Horiguchi (who proposed to use !_PAGE_PRESENT && !_PAGE_PSE for x86
>>>>>>>> PMD migration entry check)
>>>>>>>>
>>>>>>>> On 8 Oct 2018, at 23:58, Anshuman Khandual wrote:
>>>>>>>>
>>>>>>>>> A normal mapped THP page at PMD level should be correctly differentiated
>>>>>>>>> from a PMD migration entry while walking the page table. A mapped THP would
>>>>>>>>> additionally check positive for pmd_present() along with pmd_trans_huge()
>>>>>>>>> as compared to a PMD migration entry. This just adds a new conditional test
>>>>>>>>> differentiating the two while walking the page table.
>>>>>>>>>
>>>>>>>>> Fixes: 616b8371539a6 ("mm: thp: enable thp migration in generic path")
>>>>>>>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>>>>>>>> ---
>>>>>>>>> On X86, pmd_trans_huge() and is_pmd_migration_entry() are always mutually
>>>>>>>>> exclusive which makes the current conditional block work for both mapped
>>>>>>>>> and migration entries. This is not same with arm64 where pmd_trans_huge()
>>>>>>>>
>>>>>>>> !pmd_present() && pmd_trans_huge() is used to represent THPs under splitting,
>>>>>>>
>>>>>>> Not really if we just look at code in the conditional blocks.
>>>>>>
>>>>>> Yeah, I explained it wrong above. Sorry about that.
>>>>>>
>>>>>> In x86, pmd_present() checks (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE),
>>>>>> thus, it returns true even if the present bit is cleared but PSE bit is set.
>>>>>
>>>>> Okay.
>>>>>
>>>>>> This is done so, because THPs under splitting are regarded as present in the kernel
>>>>>> but not present when a hardware page table walker checks it.
>>>>>
>>>>> Okay.
>>>>>
>>>>>>
>>>>>> For PMD migration entry, which should be regarded as not present, if PSE bit
>>>>>> is set, which makes pmd_trans_huge() returns true, like ARM64 does, all
>>>>>> PMD migration entries will be regarded as present
>>>>>
>>>>> Okay to make pmd_present() return false pmd_trans_huge() has to return false
>>>>> as well. Is there anything which can be done to get around this problem on
>>>>> X86 ? pmd_trans_huge() returning true for a migration entry sounds logical.
>>>>> Otherwise we would revert the condition block order to accommodate both the
>>>>> implementation for pmd_trans_huge() as suggested by Kirill before or just
>>>>> consider this patch forward.
>>>>>
>>>>> Because I am not really sure yet about the idea of getting pmd_present()
>>>>> check into pmd_trans_huge() on arm64 just to make it fit into this semantics
>>>>> as suggested by Will. If a PMD is trans huge page or not should not depend on
>>>>> whether it is present or not.
>>>>
>>>> In terms of THPs, we have three cases: a present THP, a THP under splitting,
>>>> and a THP under migration. pmd_present() and pmd_trans_huge() both return true
>>>> for a present THP and a THP under splitting, because they discover _PAGE_PSE bit
>>>
>>> Then how do we differentiate between a mapped THP and a splitting THP.
>>
>> AFAIK, in x86, there is no distinction between a mapped THP and a splitting THP
>> using helper functions.
>>
>> A mapped THP has _PAGE_PRESENT bit and _PAGE_PSE bit set, whereas a splitting THP
>> has only _PAGE_PSE bit set. But both pmd_present() and pmd_trans_huge() return
>> true as long as _PAGE_PSE bit is set.
>>
>>>
>>>> is set for both cases, whereas they both return false for a THP under migration.
>>>> You want to change them to make pmd_trans_huge() returns true for a THP under migration
>>>> instead of false to help ARM64’s support for THP migration.
>>> I am just trying to understand the rationale behind this semantics and see where
>>> it should be fixed.
>>>
>>> I think the fundamental problem here is that THP under split has been difficult
>>> to be re-presented through the available helper functions and in turn PTE bits.
>>>
>>> The following checks
>>>
>>> 1) pmd_present()
>>> 2) pmd_trans_huge()
>>>
>>> Represent three THP states
>>>
>>> 1) Mapped THP		(pmd_present && pmd_trans_huge)
>>> 2) Splitting THP	(pmd_present && pmd_trans_huge)
>>> 3) Migrating THP	(!pmd_present && !pmd_trans_huge)
>>>
>>> The problem is if we make pmd_trans_huge() return true for all the three states
>>> which sounds logical because they are all still trans huge PMD, then pmd_present()
>>> can only represent two states not three as required.
>>
>> We are on the same page about representing three THP states in x86.
>> I also agree with you that it is logical to use three distinct representations
>> for these three states, i.e. splitting THP could be changed to (!pmd_present && pmd_trans_huge).
> 
> I think that the behavior of pmd_trans_huge() for non-present pmd is
> undefined by its nature. IOW, it's no use determining whether it's thp or
> not for non-existing pages because it does not exist :)> 
> So I think that the right direction is to make sure that pmd_trans_huge() is
> never checked for non-present pmd, just like Kirill's suggestion.  And maybe
> we have some room for engineering to ensure it (rather than just commenting it).

Agreed, pmd_trans_huge() does not make sense for a migration or a swap entry
and should not be checked on them.
Anshuman Khandual Nov. 2, 2018, 6:15 a.m. UTC | #21
On 10/17/2018 07:39 AM, Andrea Arcangeli wrote:
> Hello Zi,
> 
> On Sun, Oct 14, 2018 at 08:53:55PM -0400, Zi Yan wrote:
>> Hi Andrea, what is the purpose/benefit of making x86’s pmd_present() returns true
>> for a THP under splitting? Does it cause problems when ARM64’s pmd_present()
>> returns false in the same situation?
Thank you Andrea for a such a detailed explanation. It really helped us in
understanding certain subtle details about pmd_present() & pmd_trans_huge().

> 
> !pmd_present means it's a migration entry or swap entry and doesn't
> point to RAM. It means if you do pmd_to_page(*pmd) it will return you
> an undefined result.

Sure but this needs to be made clear some where. Not sure whether its better
just by adding some in-code documentation or enforcing it in generic paths.

> 
> During splitting the physical page is still very well pointed by the
> pmd as long as pmd_trans_huge returns true and you hold the
> pmd_lock.

Agreed, it still does point to a huge page in RAM. So pmd_present() should
just return true in such cases as you have explained above.

> 
> pmd_trans_huge must be true at all times for a transhuge pmd that
> points to a hugepage, or all VM fast paths won't serialize with the

But as Naoya mentioned we should not check for pmd_trans_huge() on swap or
migration entries. If this makes sense, I will be happy to look into this
further and remove/replace pmd_trans_huge() check from affected code paths.

> pmd_lock, that is the only reason why, and it's a very good reason
> because it avoids to take the pmd_lock when walking over non transhuge
> pmds (i.e. when there are no THP allocated).
> 
> Now if we've to keep _PAGE_PSE set and return true in pmd_trans_huge
> at all times, why would you want to make pmd_present return false? How
> could it help if pmd_trans_huge returns true, but pmd_present returns
> false despite pmd_to_page works fine and the pmd is really still
> pointing to the page?

Then what is the difference between pmd_trans_huge() and pmd_present()
if both should return true if the PMD points to a huge page in RAM and
pmd_page() also returns a valid huge page in RAM.

> 
> When userland faults on such pmd !pmd_present it will make the page
> fault take a swap or migration path, but that's the wrong path if the
> pmd points to RAM.
This is a real concern. __handle_mm_fault() does check for a swap entry
(which can only be a migration entry at the moment) and then wait on
till the migration is completed.

                if (unlikely(is_swap_pmd(orig_pmd))) {
                        VM_BUG_ON(thp_migration_supported() &&
                                          !is_pmd_migration_entry(orig_pmd));
                        if (is_pmd_migration_entry(orig_pmd))
                                pmd_migration_entry_wait(mm, vmf.pmd);
                        return 0;
                }

> 
> What we need to do during split is an invalidate of the huge TL> There's no pmd_trans_splitting anymore, so we only clear the present
> bit in the PTE despite pmd_present still returns true (just like
> PROT_NONE, nothing new in this respect). pmd_present never meant the

On arm64, the problem is that pmd_present() is tied with pte_present() which
checks for PTE_VALID (also PTE_PROT_NONE) but which gets cleared during PTE
invalidation. pmd_present() returns false just after the first step of PMD
splitting. So pmd_present() needs to be decoupled from PTE_VALID which is
same as PMD_SECT_VALID and instead should depend upon a pte bit which sticks
around like PAGE_PSE as in case of x86. I am working towards a solution.

> real present bit in the pte was set, it just means the pmd points to
> RAM. It means it doesn't point to swap or migration entry and you can
> do pmd_to_page and it works fine
> We need to invalidate the TLB by clearing the present bit and by
> flushing the TLB before overwriting the transhuge pmd with the regular
> pte (i.e. to make it non huge). That is actually required by an errata
> (l1 cache aliasing of the same mapping through two different TLB of
> two different sizes broke some old CPU and triggered machine checks).
> It's not something fundamentally necessary from a common code point of

TLB entries mapping same VA -> PA space with different pages sizes might
not co-exist with each other which requires TLB invalidation. PMD split
phase initiating a TLB invalidation is not like getting around a CPU HW
problem but its just that SW should not assume behavior on behalf of the
architecture regarding which TLB entries can co-exist at any point.

> view. It's more risky from an hardware (not software) standpoint and
> before you can get rid of the pmd you need to do a TLB flush anyway to
> be sure CPUs stops using it, so better clear the present bit before
> doing the real costly thing (the tlb flush with IPIs). Clearing the
> present bit during the TLB flush is a cost that gets lost in the noise.

Doing TLB invalidation is not tied to whether present bit is marked on
the PTE or not. If I am not mistaken, a TLB invalidation can still get
started on a PTE with it's present bit marked on. IIUC the reason we
clear the present bit on the PMD entry to prevent further MMU HW walk
of the table and creation of new TLB entries reflecting older mapping
while flushing the older ones.

> 
> The clear of the real present bit during pmd (virtual) splitting is
> done with pmdp_invalidate, that is created specifically to keeps
> pmd_trans_huge=true, pmd_present=true despite the present bit is not
> set. So you could imagine _PAGE_PSE as the real present bit.
> 
I understand. In conclusion we would need to make some changes to
pmd_present() and pmd_trans_huge() on arm64 in accordance with the
semantics explained above. pmd_present() is very clear though I am
still wondering how pmd_trans_huge() is different than pmd_present().
Will Deacon Nov. 6, 2018, 12:35 a.m. UTC | #22
On Fri, Nov 02, 2018 at 11:45:00AM +0530, Anshuman Khandual wrote:
> On 10/17/2018 07:39 AM, Andrea Arcangeli wrote:
> > What we need to do during split is an invalidate of the huge TLB.
> > There's no pmd_trans_splitting anymore, so we only clear the present
> > bit in the PTE despite pmd_present still returns true (just like
> > PROT_NONE, nothing new in this respect). pmd_present never meant the
> 
> On arm64, the problem is that pmd_present() is tied with pte_present() which
> checks for PTE_VALID (also PTE_PROT_NONE) but which gets cleared during PTE
> invalidation. pmd_present() returns false just after the first step of PMD
> splitting. So pmd_present() needs to be decoupled from PTE_VALID which is
> same as PMD_SECT_VALID and instead should depend upon a pte bit which sticks
> around like PAGE_PSE as in case of x86. I am working towards a solution.

Could we not just go via a PROT_NONE mapping during the split, instead of
having to allocate a new software bit to treat these invalid ptes as
present?

Will
Anshuman Khandual Nov. 6, 2018, 9:51 a.m. UTC | #23
On 11/06/2018 06:05 AM, Will Deacon wrote:
> On Fri, Nov 02, 2018 at 11:45:00AM +0530, Anshuman Khandual wrote:
>> On 10/17/2018 07:39 AM, Andrea Arcangeli wrote:
>>> What we need to do during split is an invalidate of the huge TLB.
>>> There's no pmd_trans_splitting anymore, so we only clear the present
>>> bit in the PTE despite pmd_present still returns true (just like
>>> PROT_NONE, nothing new in this respect). pmd_present never meant the
>>
>> On arm64, the problem is that pmd_present() is tied with pte_present() which
>> checks for PTE_VALID (also PTE_PROT_NONE) but which gets cleared during PTE
>> invalidation. pmd_present() returns false just after the first step of PMD
>> splitting. So pmd_present() needs to be decoupled from PTE_VALID which is
>> same as PMD_SECT_VALID and instead should depend upon a pte bit which sticks
>> around like PAGE_PSE as in case of x86. I am working towards a solution.
> 
> Could we not just go via a PROT_NONE mapping during the split, instead of
> having to allocate a new software bit to treat these invalid ptes as
> present?

The problem might occur during page fault (i.e __handle_mm_fault). As discussed
previously on this thread any potential PTE sticky bit would be used for both
pmd_trans_huge() and pmd_present() wrappers to maintain existing semantics. At
present, PMD state analysis during page fault has conditional block like this.

                if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
                        if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
                                return do_huge_pmd_numa_page(&vmf, orig_pmd);

Using PROT_NONE for pmd_trans_huge() might force PMD page fault to go through
NUMA fault handling all the time as both pmd_trans_huge() and pmd_protnone()
will return true in that situation.
diff mbox series

Patch

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index ae3c2a3..b384396 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -161,7 +161,8 @@  bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	pmde = READ_ONCE(*pvmw->pmd);
 	if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
 		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
-		if (likely(pmd_trans_huge(*pvmw->pmd))) {
+		if (likely(pmd_trans_huge(*pvmw->pmd) &&
+					pmd_present(*pvmw->pmd))) {
 			if (pvmw->flags & PVMW_MIGRATION)
 				return not_found(pvmw);
 			if (pmd_page(*pvmw->pmd) != page)