diff mbox series

[RFC,1/2] arm64/mm: Change THP helpers to comply with generic MM semantics

Message ID 1561639696-16361-2-git-send-email-anshuman.khandual@arm.com (mailing list archive)
State RFC
Headers show
Series arm64/mm: Enable THP migration | expand

Commit Message

Anshuman Khandual June 27, 2019, 12:48 p.m. UTC
pmd_present() and pmd_trans_huge() are expected to behave in the following
manner during various phases of a given PMD. It is derived from a previous
detailed discussion on this topic [1] and present THP documentation [2].

pmd_present(pmd):

- Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
- Returns false if pmd does not refer to system RAM - Invalid pmd_page(pmd)

pmd_trans_huge(pmd):

- Returns true if pmd refers to system RAM and is a trans huge mapping

-------------------------------------------------------------------------
|	PMD states	|	pmd_present	|	pmd_trans_huge	|
-------------------------------------------------------------------------
|	Mapped		|	Yes		|	Yes		|
-------------------------------------------------------------------------
|	Splitting	|	Yes		|	Yes		|
-------------------------------------------------------------------------
|	Migration/Swap	|	No		|	No		|
-------------------------------------------------------------------------

The problem:

PMD is first invalidated with pmdp_invalidate() before it's splitting. This
invalidation clears PMD_SECT_VALID as below.

PMD Split -> pmdp_invalidate() -> pmd_mknotpresent -> Clears PMD_SECT_VALID

Once PMD_SECT_VALID gets cleared, it results in pmd_present() return false
on the PMD entry. It will need another bit apart from PMD_SECT_VALID to re-
affirm pmd_present() as true during the THP split process. To comply with
above mentioned semantics, pmd_trans_huge() should also check pmd_present()
first before testing presence of an actual transparent huge mapping.

The solution:

Ideally PMD_TYPE_SECT should have been used here instead. But it shares the
bit position with PMD_SECT_VALID which is used for THP invalidation. Hence
it will not be there for pmd_present() check after pmdp_invalidate().

PTE_SPECIAL never gets used for PMD mapping i.e there is no pmd_special().
Hence this bit can be set on the PMD entry during invalidation which can
help in making pmd_present() return true and in recognizing the fact that
it still points to memory.

This bit is transient. During the split is process it will be overridden
by a page table page representing the normal pages in place of erstwhile
huge page. Other pmdp_invalidate() callers always write a fresh PMD value
on the entry overriding this transient PTE_SPECIAL making it safe. In the
past former pmd_[mk]splitting() functions used PTE_SPECIAL.

[1]: https://lkml.org/lkml/2018/10/17/231
[2]: https://www.kernel.org/doc/Documentation/vm/transhuge.txt

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Suzuki Poulose <suzuki.poulose@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)

Comments

Zi Yan June 27, 2019, 3:31 p.m. UTC | #1
On 27 Jun 2019, at 8:48, Anshuman Khandual wrote:

> pmd_present() and pmd_trans_huge() are expected to behave in the following
> manner during various phases of a given PMD. It is derived from a previous
> detailed discussion on this topic [1] and present THP documentation [2].
>
> pmd_present(pmd):
>
> - Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
> - Returns false if pmd does not refer to system RAM - Invalid pmd_page(pmd)
>
> pmd_trans_huge(pmd):
>
> - Returns true if pmd refers to system RAM and is a trans huge mapping
>
> -------------------------------------------------------------------------
> |	PMD states	|	pmd_present	|	pmd_trans_huge	|
> -------------------------------------------------------------------------
> |	Mapped		|	Yes		|	Yes		|
> -------------------------------------------------------------------------
> |	Splitting	|	Yes		|	Yes		|
> -------------------------------------------------------------------------
> |	Migration/Swap	|	No		|	No		|
> -------------------------------------------------------------------------
>
> The problem:
>
> PMD is first invalidated with pmdp_invalidate() before it's splitting. This
> invalidation clears PMD_SECT_VALID as below.
>
> PMD Split -> pmdp_invalidate() -> pmd_mknotpresent -> Clears PMD_SECT_VALID
>
> Once PMD_SECT_VALID gets cleared, it results in pmd_present() return false
> on the PMD entry. It will need another bit apart from PMD_SECT_VALID to re-
> affirm pmd_present() as true during the THP split process. To comply with
> above mentioned semantics, pmd_trans_huge() should also check pmd_present()
> first before testing presence of an actual transparent huge mapping.
>
> The solution:
>
> Ideally PMD_TYPE_SECT should have been used here instead. But it shares the
> bit position with PMD_SECT_VALID which is used for THP invalidation. Hence
> it will not be there for pmd_present() check after pmdp_invalidate().
>
> PTE_SPECIAL never gets used for PMD mapping i.e there is no pmd_special().
> Hence this bit can be set on the PMD entry during invalidation which can
> help in making pmd_present() return true and in recognizing the fact that
> it still points to memory.
>
> This bit is transient. During the split is process it will be overridden
> by a page table page representing the normal pages in place of erstwhile
> huge page. Other pmdp_invalidate() callers always write a fresh PMD value
> on the entry overriding this transient PTE_SPECIAL making it safe. In the
> past former pmd_[mk]splitting() functions used PTE_SPECIAL.
>
> [1]: https://lkml.org/lkml/2018/10/17/231

Just want to point out that lkml.org link might not be stable.
This one would be better: https://lore.kernel.org/linux-mm/20181017020930.GN30832@redhat.com/


—
Best Regards,
Yan Zi
Anshuman Khandual June 28, 2019, 3:46 a.m. UTC | #2
On 06/27/2019 09:01 PM, Zi Yan wrote:
> On 27 Jun 2019, at 8:48, Anshuman Khandual wrote:
> 
>> pmd_present() and pmd_trans_huge() are expected to behave in the following
>> manner during various phases of a given PMD. It is derived from a previous
>> detailed discussion on this topic [1] and present THP documentation [2].
>>
>> pmd_present(pmd):
>>
>> - Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
>> - Returns false if pmd does not refer to system RAM - Invalid pmd_page(pmd)
>>
>> pmd_trans_huge(pmd):
>>
>> - Returns true if pmd refers to system RAM and is a trans huge mapping
>>
>> -------------------------------------------------------------------------
>> |	PMD states	|	pmd_present	|	pmd_trans_huge	|
>> -------------------------------------------------------------------------
>> |	Mapped		|	Yes		|	Yes		|
>> -------------------------------------------------------------------------
>> |	Splitting	|	Yes		|	Yes		|
>> -------------------------------------------------------------------------
>> |	Migration/Swap	|	No		|	No		|
>> -------------------------------------------------------------------------
>>
>> The problem:
>>
>> PMD is first invalidated with pmdp_invalidate() before it's splitting. This
>> invalidation clears PMD_SECT_VALID as below.
>>
>> PMD Split -> pmdp_invalidate() -> pmd_mknotpresent -> Clears PMD_SECT_VALID
>>
>> Once PMD_SECT_VALID gets cleared, it results in pmd_present() return false
>> on the PMD entry. It will need another bit apart from PMD_SECT_VALID to re-
>> affirm pmd_present() as true during the THP split process. To comply with
>> above mentioned semantics, pmd_trans_huge() should also check pmd_present()
>> first before testing presence of an actual transparent huge mapping.
>>
>> The solution:
>>
>> Ideally PMD_TYPE_SECT should have been used here instead. But it shares the
>> bit position with PMD_SECT_VALID which is used for THP invalidation. Hence
>> it will not be there for pmd_present() check after pmdp_invalidate().
>>
>> PTE_SPECIAL never gets used for PMD mapping i.e there is no pmd_special().
>> Hence this bit can be set on the PMD entry during invalidation which can
>> help in making pmd_present() return true and in recognizing the fact that
>> it still points to memory.
>>
>> This bit is transient. During the split is process it will be overridden
>> by a page table page representing the normal pages in place of erstwhile
>> huge page. Other pmdp_invalidate() callers always write a fresh PMD value
>> on the entry overriding this transient PTE_SPECIAL making it safe. In the
>> past former pmd_[mk]splitting() functions used PTE_SPECIAL.
>>
>> [1]: https://lkml.org/lkml/2018/10/17/231
> 
> Just want to point out that lkml.org link might not be stable.
> This one would be better: https://lore.kernel.org/linux-mm/20181017020930.GN30832@redhat.com/

Sure will update the link in the commit. Thanks !
Catalin Marinas June 28, 2019, 10:20 a.m. UTC | #3
Hi Anshuman,

On Thu, Jun 27, 2019 at 06:18:15PM +0530, Anshuman Khandual wrote:
> pmd_present() and pmd_trans_huge() are expected to behave in the following
> manner during various phases of a given PMD. It is derived from a previous
> detailed discussion on this topic [1] and present THP documentation [2].
> 
> pmd_present(pmd):
> 
> - Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
> - Returns false if pmd does not refer to system RAM - Invalid pmd_page(pmd)
> 
> pmd_trans_huge(pmd):
> 
> - Returns true if pmd refers to system RAM and is a trans huge mapping
> 
> -------------------------------------------------------------------------
> |	PMD states	|	pmd_present	|	pmd_trans_huge	|
> -------------------------------------------------------------------------
> |	Mapped		|	Yes		|	Yes		|
> -------------------------------------------------------------------------
> |	Splitting	|	Yes		|	Yes		|
> -------------------------------------------------------------------------
> |	Migration/Swap	|	No		|	No		|
> -------------------------------------------------------------------------

Before we actually start fixing this, I would strongly suggest that you
add a boot selftest (see lib/Kconfig.debug for other similar cases)
which checks the consistency of the page table macros w.r.t. the
expected mm semantics. Once the mm maintainers agreed with the
semantics, it will really help architecture maintainers in implementing
them correctly.

You wouldn't need actual page tables, just things like assertions on
pmd_trans_huge(pmd_mkhuge(pmd)) == true. You could go further and have
checks on pmdp_invalidate(&dummy_vma, dummy_addr, &dummy_pmd) with the
dummy_* variables on the stack.

> The problem:
> 
> PMD is first invalidated with pmdp_invalidate() before it's splitting. This
> invalidation clears PMD_SECT_VALID as below.
> 
> PMD Split -> pmdp_invalidate() -> pmd_mknotpresent -> Clears PMD_SECT_VALID
> 
> Once PMD_SECT_VALID gets cleared, it results in pmd_present() return false
> on the PMD entry.

I think that's an inconsistency in the expected semantics here. Do you
mean that pmd_present(pmd_mknotpresent(pmd)) should be true? If not, do
we need to implement our own pmdp_invalidate() or change the generic one
to set a "special" bit instead of just a pmd_mknotpresent?

> +static inline int pmd_present(pmd_t pmd)
> +{
> +	if (pte_present(pmd_pte(pmd)))
> +		return 1;
> +
> +	return pte_special(pmd_pte(pmd));
> +}
[...]
> +static inline pmd_t pmd_mknotpresent(pmd_t pmd)
> +{
> +	pmd = pte_pmd(pte_mkspecial(pmd_pte(pmd)));
> +	return __pmd(pmd_val(pmd) & ~PMD_SECT_VALID);
> +}

I'm not sure I agree with the semantics here where pmd_mknotpresent()
does not actually make pmd_present() == false.
Anshuman Khandual July 2, 2019, 3:37 a.m. UTC | #4
On 06/28/2019 03:50 PM, Catalin Marinas wrote:
> Hi Anshuman,

Hello Catalin,

> 
> On Thu, Jun 27, 2019 at 06:18:15PM +0530, Anshuman Khandual wrote:
>> pmd_present() and pmd_trans_huge() are expected to behave in the following
>> manner during various phases of a given PMD. It is derived from a previous
>> detailed discussion on this topic [1] and present THP documentation [2].
>>
>> pmd_present(pmd):
>>
>> - Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
>> - Returns false if pmd does not refer to system RAM - Invalid pmd_page(pmd)
>>
>> pmd_trans_huge(pmd):
>>
>> - Returns true if pmd refers to system RAM and is a trans huge mapping
>>
>> -------------------------------------------------------------------------
>> |	PMD states	|	pmd_present	|	pmd_trans_huge	|
>> -------------------------------------------------------------------------
>> |	Mapped		|	Yes		|	Yes		|
>> -------------------------------------------------------------------------
>> |	Splitting	|	Yes		|	Yes		|
>> -------------------------------------------------------------------------
>> |	Migration/Swap	|	No		|	No		|
>> -------------------------------------------------------------------------
> 
> Before we actually start fixing this, I would strongly suggest that you
> add a boot selftest (see lib/Kconfig.debug for other similar cases)
> which checks the consistency of the page table macros w.r.t. the
> expected mm semantics. Once the mm maintainers agreed with the
> semantics, it will really help architecture maintainers in implementing
> them correctly.

Sure and it will help all architectures to be in sync wrt semantics.

> 
> You wouldn't need actual page tables, just things like assertions on
> pmd_trans_huge(pmd_mkhuge(pmd)) == true. You could go further and have
> checks on pmdp_invalidate(&dummy_vma, dummy_addr, &dummy_pmd) with the
> dummy_* variables on the stack.

Hmm. I guess macros which operate directly on a page table entry will be
okay but the ones which check on specific states for VMA or MM might be
bit tricky. Try to emulate VMA/MM states while on stack ?. But sure, will
explore adding such a test.

> 
>> The problem:
>>
>> PMD is first invalidated with pmdp_invalidate() before it's splitting. This
>> invalidation clears PMD_SECT_VALID as below.
>>
>> PMD Split -> pmdp_invalidate() -> pmd_mknotpresent -> Clears PMD_SECT_VALID
>>
>> Once PMD_SECT_VALID gets cleared, it results in pmd_present() return false
>> on the PMD entry.
> 
> I think that's an inconsistency in the expected semantics here. Do you
> mean that pmd_present(pmd_mknotpresent(pmd)) should be true? If not, do

Actually that is true (more so if we are using generic pmdp_invalidate). Else
in general pmd_present(pmdp_invalidate(pmd)) needs to be true to successfully
represent a splitting THP. That is what Andrea explained back on this thread
(https://lkml.org/lkml/2018/10/17/231).

Extracting relevant sections from that thread -

"pmd_present never meant the real present bit in the pte was set, it just means
the pmd points to RAM. It means it doesn't point to swap or migration entry and
you can do pmd_to_page and it works fine."

"The clear of the real present bit during pmd (virtual) splitting is done with
pmdp_invalidate, that is created specifically to keeps pmd_trans_huge=true,
pmd_present=true despite the present bit is not set. So you could imagine
_PAGE_PSE as the real present bit."

pmd_present() and pmd_mknotpresent() are not exact inverse.

Problem is all platforms using generic pmdp_invalidate() calls pmd_mknotpresent()
which invariably across platforms remove the valid or present bit from the entry.
The point to note here is that pmd_mknotpresent() invalidates the entry from MMU
point of view but pmd_present() does not check for a MMU valid PMD entry. Hence
pmd_present(pmd_mknotpresent(pmd)) can still be true.

In absence of a positive section mapping bit on arm64, PTE_SPECIAL is being set
temporarily to remember that it was a mapped PMD which got invalidated recently
but which still points to memory. Hence pmd_present() must evaluate true.

pmd_mknotpresent() does not make !pmd_present() it just invalidates the entry.

> we need to implement our own pmdp_invalidate() or change the generic one
> to set a "special" bit instead of just a pmd_mknotpresent?

Though arm64 can subscribe __HAVE_ARCH_PMDP_INVALIDATE and implement it's own
pmdp_invalidate() in order to not call pmd_mknotpresent() and instead operate
on the invalid and special bits directly. But its not going to alter relevant
semantics here. AFAICS it might be bit better as it saves pmd_mknotpresent()
from putting in that special bit in there which it is not supposed do.

IFAICS there is no compelling reason for generic pmdp_invalidate() to change
either. It calls pmd_mknotpresent() which invalidates the entry through valid
or present bit and platforms which have dedicated huge page bit can still test
positive for pmd_present() after it's invalidation. It works for such platforms.
Platform specific override is required when invalidation via pmd_mknotpresent()
is not enough.

> 
>> +static inline int pmd_present(pmd_t pmd)
>> +{
>> +	if (pte_present(pmd_pte(pmd)))
>> +		return 1;
>> +
>> +	return pte_special(pmd_pte(pmd));
>> +}
> [...]
>> +static inline pmd_t pmd_mknotpresent(pmd_t pmd)
>> +{
>> +	pmd = pte_pmd(pte_mkspecial(pmd_pte(pmd)));
>> +	return __pmd(pmd_val(pmd) & ~PMD_SECT_VALID);
>> +}
> 
> I'm not sure I agree with the semantics here where pmd_mknotpresent()
> does not actually make pmd_present() == false.

As Andrea explained, pmd_present() does not check validity of the PMD entry
from MMU perspective but the presence of a valid pmd_page() which still refers
to a valid struct page in the memory. It is irrespective of whether the entry
in itself is valid for MMU walk or not.

+ Cc: Andrea Arcangeli <aarcange@redhat.com>

I have added Andrea on this thread if he would like to add something.

- Anshuman
Catalin Marinas July 3, 2019, 5:52 p.m. UTC | #5
On Tue, Jul 02, 2019 at 09:07:28AM +0530, Anshuman Khandual wrote:
> On 06/28/2019 03:50 PM, Catalin Marinas wrote:
> > On Thu, Jun 27, 2019 at 06:18:15PM +0530, Anshuman Khandual wrote:
> >> pmd_present() and pmd_trans_huge() are expected to behave in the following
> >> manner during various phases of a given PMD. It is derived from a previous
> >> detailed discussion on this topic [1] and present THP documentation [2].
> >>
> >> pmd_present(pmd):
> >>
> >> - Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
> >> - Returns false if pmd does not refer to system RAM - Invalid pmd_page(pmd)
> >>
> >> pmd_trans_huge(pmd):
> >>
> >> - Returns true if pmd refers to system RAM and is a trans huge mapping
[...]
> > Before we actually start fixing this, I would strongly suggest that you
> > add a boot selftest (see lib/Kconfig.debug for other similar cases)
> > which checks the consistency of the page table macros w.r.t. the
> > expected mm semantics. Once the mm maintainers agreed with the
> > semantics, it will really help architecture maintainers in implementing
> > them correctly.
> 
> Sure and it will help all architectures to be in sync wrt semantics.
> 
> > You wouldn't need actual page tables, just things like assertions on
> > pmd_trans_huge(pmd_mkhuge(pmd)) == true. You could go further and have
> > checks on pmdp_invalidate(&dummy_vma, dummy_addr, &dummy_pmd) with the
> > dummy_* variables on the stack.
> 
> Hmm. I guess macros which operate directly on a page table entry will be
> okay but the ones which check on specific states for VMA or MM might be
> bit tricky. Try to emulate VMA/MM states while on stack ?. But sure, will
> explore adding such a test.

You can pretend that the page table is on the stack. See the _pmd
variable in do_huge_pmd_wp_page_fallback() and
__split_huge_zero_page_pmd(). Similarly, the vma and even the mm can be
faked on the stack (see the arm64 tlb_flush()).

> >> The problem:
> >>
> >> PMD is first invalidated with pmdp_invalidate() before it's splitting. This
> >> invalidation clears PMD_SECT_VALID as below.
> >>
> >> PMD Split -> pmdp_invalidate() -> pmd_mknotpresent -> Clears PMD_SECT_VALID
> >>
> >> Once PMD_SECT_VALID gets cleared, it results in pmd_present() return false
> >> on the PMD entry.
> > 
> > I think that's an inconsistency in the expected semantics here. Do you
> > mean that pmd_present(pmd_mknotpresent(pmd)) should be true? If not, do
[...]
> pmd_present() and pmd_mknotpresent() are not exact inverse.

I find this very confusing (not your fault, just the semantics expected
by the core code). I can see that x86 is using _PAGE_PSE to make
pmd_present(pmd_mknotpresent()) == true. However, for pud that's not the
case (because it's not used for transhuge).

I'd rather have this renamed to pmd_mknotvalid().

> In absence of a positive section mapping bit on arm64, PTE_SPECIAL is being set
> temporarily to remember that it was a mapped PMD which got invalidated recently
> but which still points to memory. Hence pmd_present() must evaluate true.

I wonder if we can encode this safely for arm64 in the bottom two bits
of a pmd :

0b00 - not valid, not present
0b10 - not valid, present, huge
0b01 - valid, present, huge
0b11 - valid, table (not huge)

Do we ever call pmdp_invalidate() on a table entry? I don't think we do.

So a pte_mknotvalid would set bit 1 and I think swp_entry_to_pmd() would
have to clear it so that pmd_present() actually returns false for a swp
pmd entry.

> > we need to implement our own pmdp_invalidate() or change the generic one
> > to set a "special" bit instead of just a pmd_mknotpresent?
> 
> Though arm64 can subscribe __HAVE_ARCH_PMDP_INVALIDATE and implement it's own
> pmdp_invalidate() in order to not call pmd_mknotpresent() and instead operate
> on the invalid and special bits directly. But its not going to alter relevant
> semantics here. AFAICS it might be bit better as it saves pmd_mknotpresent()
> from putting in that special bit in there which it is not supposed do.
> 
> IFAICS there is no compelling reason for generic pmdp_invalidate() to change
> either. It calls pmd_mknotpresent() which invalidates the entry through valid
> or present bit and platforms which have dedicated huge page bit can still test
> positive for pmd_present() after it's invalidation. It works for such platforms.
> Platform specific override is required when invalidation via pmd_mknotpresent()
> is not enough.

I'd really like the mknotpresent to be renamed to mknotvalid and then we
can keep pmdp_invalidate unchanged (well, calling mknotvalid instead).
Anshuman Khandual July 8, 2019, 4:27 a.m. UTC | #6
On 07/03/2019 11:22 PM, Catalin Marinas wrote:
> On Tue, Jul 02, 2019 at 09:07:28AM +0530, Anshuman Khandual wrote:
>> On 06/28/2019 03:50 PM, Catalin Marinas wrote:
>>> On Thu, Jun 27, 2019 at 06:18:15PM +0530, Anshuman Khandual wrote:
>>>> pmd_present() and pmd_trans_huge() are expected to behave in the following
>>>> manner during various phases of a given PMD. It is derived from a previous
>>>> detailed discussion on this topic [1] and present THP documentation [2].
>>>>
>>>> pmd_present(pmd):
>>>>
>>>> - Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
>>>> - Returns false if pmd does not refer to system RAM - Invalid pmd_page(pmd)
>>>>
>>>> pmd_trans_huge(pmd):
>>>>
>>>> - Returns true if pmd refers to system RAM and is a trans huge mapping
> [...]
>>> Before we actually start fixing this, I would strongly suggest that you
>>> add a boot selftest (see lib/Kconfig.debug for other similar cases)
>>> which checks the consistency of the page table macros w.r.t. the
>>> expected mm semantics. Once the mm maintainers agreed with the
>>> semantics, it will really help architecture maintainers in implementing
>>> them correctly.
>>
>> Sure and it will help all architectures to be in sync wrt semantics.
>>
>>> You wouldn't need actual page tables, just things like assertions on
>>> pmd_trans_huge(pmd_mkhuge(pmd)) == true. You could go further and have
>>> checks on pmdp_invalidate(&dummy_vma, dummy_addr, &dummy_pmd) with the
>>> dummy_* variables on the stack.
>>
>> Hmm. I guess macros which operate directly on a page table entry will be
>> okay but the ones which check on specific states for VMA or MM might be
>> bit tricky. Try to emulate VMA/MM states while on stack ?. But sure, will
>> explore adding such a test.
> 
> You can pretend that the page table is on the stack. See the _pmd
> variable in do_huge_pmd_wp_page_fallback() and
> __split_huge_zero_page_pmd(). Similarly, the vma and even the mm can be
> faked on the stack (see the arm64 tlb_flush()).

Sure will explore them and other similar examples. I am already working on a module
which will test various architecture page table accessors semantics as expected from
generic MM. This should help us making sure that all architectures are on same page.

> 
>>>> The problem:
>>>>
>>>> PMD is first invalidated with pmdp_invalidate() before it's splitting. This
>>>> invalidation clears PMD_SECT_VALID as below.
>>>>
>>>> PMD Split -> pmdp_invalidate() -> pmd_mknotpresent -> Clears PMD_SECT_VALID
>>>>
>>>> Once PMD_SECT_VALID gets cleared, it results in pmd_present() return false
>>>> on the PMD entry.
>>>
>>> I think that's an inconsistency in the expected semantics here. Do you
>>> mean that pmd_present(pmd_mknotpresent(pmd)) should be true? If not, do
> [...]
>> pmd_present() and pmd_mknotpresent() are not exact inverse.
> 
> I find this very confusing (not your fault, just the semantics expected
> by the core code). I can see that x86 is using _PAGE_PSE to make
> pmd_present(pmd_mknotpresent()) == true. However, for pud that's not the
> case (because it's not used for transhuge).
> 
> I'd rather have this renamed to pmd_mknotvalid().

Right, it makes sense to do the renaming even without considering this proposal.

> 
>> In absence of a positive section mapping bit on arm64, PTE_SPECIAL is being set
>> temporarily to remember that it was a mapped PMD which got invalidated recently
>> but which still points to memory. Hence pmd_present() must evaluate true.
> 
> I wonder if we can encode this safely for arm64 in the bottom two bits
> of a pmd :
> 
> 0b00 - not valid, not present
> 0b10 - not valid, present, huge
> 0b01 - valid, present, huge
> 0b11 - valid, table (not huge)
> 
> Do we ever call pmdp_invalidate() on a table entry? I don't think we do.
> 
> So a pte_mknotvalid would set bit 1 and I think swp_entry_to_pmd() would
> have to clear it so that pmd_present() actually returns false for a swp
> pmd entry.

All these makes it riskier for collision with other core MM paths as compared to
using a an isolated SW bit like PTE_SPECIAL exclusively for this purpose. This
is in line with using PTE_PROTNONE. PTE_SPECIAL seems to be well away from core
PMD path. Is there any particular concern about using PTE_SPECIAL ? Nonetheless
I will evaluate above proposal of using (0b10) to represent invalid but present
huge PMD entry during splitting.

> 
>>> we need to implement our own pmdp_invalidate() or change the generic one
>>> to set a "special" bit instead of just a pmd_mknotpresent?
>>
>> Though arm64 can subscribe __HAVE_ARCH_PMDP_INVALIDATE and implement it's own
>> pmdp_invalidate() in order to not call pmd_mknotpresent() and instead operate
>> on the invalid and special bits directly. But its not going to alter relevant
>> semantics here. AFAICS it might be bit better as it saves pmd_mknotpresent()
>> from putting in that special bit in there which it is not supposed do.
>>
>> IFAICS there is no compelling reason for generic pmdp_invalidate() to change
>> either. It calls pmd_mknotpresent() which invalidates the entry through valid
>> or present bit and platforms which have dedicated huge page bit can still test
>> positive for pmd_present() after it's invalidation. It works for such platforms.
>> Platform specific override is required when invalidation via pmd_mknotpresent()
>> is not enough.
> 
> I'd really like the mknotpresent to be renamed to mknotvalid and then we
> can keep pmdp_invalidate unchanged (well, calling mknotvalid instead).
> 

Though this change really makes sense just from fixing generic pmdp_invalidate()
perspective as all it asks is to invalidate the PMD entry not mark them non-present
and currently calling pmd_mknotpresent() in that sense is bit misleading.

But for arm64 I believe implementing arch specific pmdp_invalidate() via subscribing
__HAVE_ARCH_PMDP_INVALIDATE is bit better. Because the implementation needs more than
just a PMD entry invalidation even with above proposed 0b10 method or with PTE_SPECIAL.
pmd_mknotvalid() should not do that additional stuff but instead a platform specific 
pmdp_invalidate() can incorporate that after doing the real invalidation i.e clearing
the bit 0 in pmd_mknotvalid().
Anshuman Khandual April 1, 2020, 8:14 a.m. UTC | #7
On 07/08/2019 09:57 AM, Anshuman Khandual wrote:
> 
> On 07/03/2019 11:22 PM, Catalin Marinas wrote:
>> On Tue, Jul 02, 2019 at 09:07:28AM +0530, Anshuman Khandual wrote:
>>> On 06/28/2019 03:50 PM, Catalin Marinas wrote:
>>>> On Thu, Jun 27, 2019 at 06:18:15PM +0530, Anshuman Khandual wrote:
>>>>> pmd_present() and pmd_trans_huge() are expected to behave in the following
>>>>> manner during various phases of a given PMD. It is derived from a previous
>>>>> detailed discussion on this topic [1] and present THP documentation [2].
>>>>>
>>>>> pmd_present(pmd):
>>>>>
>>>>> - Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
>>>>> - Returns false if pmd does not refer to system RAM - Invalid pmd_page(pmd)
>>>>>
>>>>> pmd_trans_huge(pmd):
>>>>>
>>>>> - Returns true if pmd refers to system RAM and is a trans huge mapping
>> [...]
>>>> Before we actually start fixing this, I would strongly suggest that you
>>>> add a boot selftest (see lib/Kconfig.debug for other similar cases)
>>>> which checks the consistency of the page table macros w.r.t. the
>>>> expected mm semantics. Once the mm maintainers agreed with the
>>>> semantics, it will really help architecture maintainers in implementing
>>>> them correctly.
>>> Sure and it will help all architectures to be in sync wrt semantics.
>>>
>>>> You wouldn't need actual page tables, just things like assertions on
>>>> pmd_trans_huge(pmd_mkhuge(pmd)) == true. You could go further and have
>>>> checks on pmdp_invalidate(&dummy_vma, dummy_addr, &dummy_pmd) with the
>>>> dummy_* variables on the stack.
>>> Hmm. I guess macros which operate directly on a page table entry will be
>>> okay but the ones which check on specific states for VMA or MM might be
>>> bit tricky. Try to emulate VMA/MM states while on stack ?. But sure, will
>>> explore adding such a test.
>> You can pretend that the page table is on the stack. See the _pmd
>> variable in do_huge_pmd_wp_page_fallback() and
>> __split_huge_zero_page_pmd(). Similarly, the vma and even the mm can be
>> faked on the stack (see the arm64 tlb_flush()).
> Sure will explore them and other similar examples. I am already working on a module
> which will test various architecture page table accessors semantics as expected from
> generic MM. This should help us making sure that all architectures are on same page.
> 
>>>>> The problem:
>>>>>
>>>>> PMD is first invalidated with pmdp_invalidate() before it's splitting. This
>>>>> invalidation clears PMD_SECT_VALID as below.
>>>>>
>>>>> PMD Split -> pmdp_invalidate() -> pmd_mknotpresent -> Clears PMD_SECT_VALID
>>>>>
>>>>> Once PMD_SECT_VALID gets cleared, it results in pmd_present() return false
>>>>> on the PMD entry.
>>>> I think that's an inconsistency in the expected semantics here. Do you
>>>> mean that pmd_present(pmd_mknotpresent(pmd)) should be true? If not, do
>> [...]
>>> pmd_present() and pmd_mknotpresent() are not exact inverse.
>> I find this very confusing (not your fault, just the semantics expected
>> by the core code). I can see that x86 is using _PAGE_PSE to make
>> pmd_present(pmd_mknotpresent()) == true. However, for pud that's not the
>> case (because it's not used for transhuge).
>>
>> I'd rather have this renamed to pmd_mknotvalid().
> Right, it makes sense to do the renaming even without considering this proposal.
> 
>>> In absence of a positive section mapping bit on arm64, PTE_SPECIAL is being set
>>> temporarily to remember that it was a mapped PMD which got invalidated recently
>>> but which still points to memory. Hence pmd_present() must evaluate true.
>> I wonder if we can encode this safely for arm64 in the bottom two bits
>> of a pmd :
>>
>> 0b00 - not valid, not present
>> 0b10 - not valid, present, huge
>> 0b01 - valid, present, huge
>> 0b11 - valid, table (not huge)
>>
>> Do we ever call pmdp_invalidate() on a table entry? I don't think we do.
>>
>> So a pte_mknotvalid would set bit 1 and I think swp_entry_to_pmd() would
>> have to clear it so that pmd_present() actually returns false for a swp
>> pmd entry.
> All these makes it riskier for collision with other core MM paths as compared to
> using a an isolated SW bit like PTE_SPECIAL exclusively for this purpose. This
> is in line with using PTE_PROTNONE. PTE_SPECIAL seems to be well away from core
> PMD path. Is there any particular concern about using PTE_SPECIAL ? Nonetheless
> I will evaluate above proposal of using (0b10) to represent invalid but present
> huge PMD entry during splitting.

Tried to implement the proposed encoding scheme from Catalin and it does
seem to work with (or even without) clearing the PMD_TABLE_BIT bit during
__swp_entry_to_pmd(). It clears basic memory stress test with THP enabled.

0b00 - not valid, not present
0b10 - not valid, present, huge		/* Invalidated splitting PMD */
0b01 - valid, present, huge		/* Valid mapped PMD */
0b11 - valid, table (not huge)

Will continue testing this change.

----------->
From 9593fb80eb41984de484fd151cd1140f4cbead7e Mon Sep 17 00:00:00 2001
From: Anshuman Khandual <anshuman.khandual@arm.com>
Date: Tue, 31 Mar 2020 11:38:53 +0100
Subject: [PATCH] arm64/mm: Change THP helpers to comply with generic MM
 semantics

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 74 ++++++++++++++++++++++++++++----
 arch/arm64/mm/hugetlbpage.c      |  2 +-
 2 files changed, 67 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 44883038dbe6..86c22a4fa427 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -348,15 +348,72 @@ static inline int pmd_protnone(pmd_t pmd)
 }
 #endif
 
+#define pmd_table(pmd)		((pmd_val(pmd) & PMD_TYPE_MASK) == \
+				 PMD_TYPE_TABLE)
+#define pmd_sect(pmd)		((pmd_val(pmd) & PMD_TYPE_MASK) == \
+				 PMD_TYPE_SECT)
+static pmd_t pmd_mksplitting(pmd_t pmd)
+{
+	unsigned long mask = PMD_TYPE_MASK;
+	unsigned long val = pmd_val(pmd);
+
+	val = (val & ~mask) | PMD_TABLE_BIT;
+
+	return __pmd(val);
+}
+
+static bool pmd_splitting(pmd_t pmd)
+{
+	unsigned long mask = PMD_TYPE_MASK;
+	unsigned long val = pmd_val(pmd);
+
+	if ((val & mask) == PMD_TABLE_BIT)
+		return true;
+
+	return false;
+}
+
+static bool pmd_mapped(pmd_t pmd)
+{
+	return pmd_sect(pmd);
+}
+
+static inline int pmd_present(pmd_t pmd)
+{
+	pte_t pte = pmd_pte(pmd);
+
+	if (pte_present(pte))
+		return 1;
+
+	if (pmd_splitting(pmd))
+		return 1;
+
+	return 0;
+}
+
 /*
  * THP definitions.
  */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define pmd_trans_huge(pmd)	(pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT))
+static inline int pmd_trans_huge(pmd_t pmd)
+{
+	if (!pmd_present(pmd))
+		return 0;
+
+	if (!pmd_val(pmd))
+		return 0;
+
+	if (pmd_mapped(pmd))
+		return 1;
+
+	if (pmd_splitting(pmd))
+		return 1;
+
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-#define pmd_present(pmd)	pte_present(pmd_pte(pmd))
 #define pmd_dirty(pmd)		pte_dirty(pmd_pte(pmd))
 #define pmd_young(pmd)		pte_young(pmd_pte(pmd))
 #define pmd_valid(pmd)		pte_valid(pmd_pte(pmd))
@@ -366,7 +423,12 @@ static inline int pmd_protnone(pmd_t pmd)
 #define pmd_mkclean(pmd)	pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
-#define pmd_mknotvalid(pmd)	(__pmd(pmd_val(pmd) & ~PMD_SECT_VALID))
+
+static inline pmd_t pmd_mknotvalid(pmd_t pmd)
+{
+	BUG_ON(pmd_table(pmd));
+	return pmd_mksplitting(pmd);
+}
 
 #define pmd_thp_or_huge(pmd)	(pmd_huge(pmd) || pmd_trans_huge(pmd))
 
@@ -437,10 +499,6 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
 
 #define pmd_bad(pmd)		(!(pmd_val(pmd) & PMD_TABLE_BIT))
 
-#define pmd_table(pmd)		((pmd_val(pmd) & PMD_TYPE_MASK) == \
-				 PMD_TYPE_TABLE)
-#define pmd_sect(pmd)		((pmd_val(pmd) & PMD_TYPE_MASK) == \
-				 PMD_TYPE_SECT)
 #define pmd_leaf(pmd)		pmd_sect(pmd)
 
 #if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
@@ -834,7 +892,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 #define __pmd_to_swp_entry(pmd)        ((swp_entry_t) { pmd_val(pmd) })
-#define __swp_entry_to_pmd(swp)        __pmd((swp).val)
+#define __swp_entry_to_pmd(swp)        __pmd((swp).val & ~PMD_TABLE_BIT)
 #endif
 
 /*
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index bbeb6a5a6ba6..056fe716b9f8 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -283,7 +283,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
 	if (!(sz == PMD_SIZE || sz == CONT_PMD_SIZE) &&
 	    pmd_none(pmd))
 		return NULL;
-	if (pmd_huge(pmd) || !pmd_present(pmd))
+	if (pmd_huge(pmd) || !pte_present(pmd_pte(pmd)))
 		return (pte_t *)pmdp;
 
 	if (sz == CONT_PTE_SIZE)
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 87a4b2d..90d4e24 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -368,15 +368,31 @@  static inline int pmd_protnone(pmd_t pmd)
 }
 #endif
 
+static inline int pmd_present(pmd_t pmd)
+{
+	if (pte_present(pmd_pte(pmd)))
+		return 1;
+
+	return pte_special(pmd_pte(pmd));
+}
+
 /*
  * THP definitions.
  */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define pmd_trans_huge(pmd)	(pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT))
+static inline int pmd_trans_huge(pmd_t pmd)
+{
+	if (!pmd_val(pmd))
+		return 0;
+
+	if (!pmd_present(pmd))
+		return 0;
+
+	return !(pmd_val(pmd) & PMD_TABLE_BIT);
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-#define pmd_present(pmd)	pte_present(pmd_pte(pmd))
 #define pmd_dirty(pmd)		pte_dirty(pmd_pte(pmd))
 #define pmd_young(pmd)		pte_young(pmd_pte(pmd))
 #define pmd_valid(pmd)		pte_valid(pmd_pte(pmd))
@@ -386,7 +402,12 @@  static inline int pmd_protnone(pmd_t pmd)
 #define pmd_mkclean(pmd)	pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
-#define pmd_mknotpresent(pmd)	(__pmd(pmd_val(pmd) & ~PMD_SECT_VALID))
+
+static inline pmd_t pmd_mknotpresent(pmd_t pmd)
+{
+	pmd = pte_pmd(pte_mkspecial(pmd_pte(pmd)));
+	return __pmd(pmd_val(pmd) & ~PMD_SECT_VALID);
+}
 
 #define pmd_thp_or_huge(pmd)	(pmd_huge(pmd) || pmd_trans_huge(pmd))