[v1,00/11] mm/memory: optimize fork() with PTE-mapped THP

Message ID 20240122194200.381241-1-david@redhat.com (mailing list archive)

Message

David Hildenbrand Jan. 22, 2024, 7:41 p.m. UTC
Now that the rmap overhaul[1] is upstream that provides a clean interface
for rmap batching, let's implement PTE batching during fork when processing
PTE-mapped THPs.

This series is partially based on Ryan's previous work[2] to implement
cont-pte support on arm64, but it's a complete rewrite based on [1] to
optimize all architectures independent of any such PTE bits, and to
use the new rmap batching functions that simplify the code and prepare
for further rmap accounting changes.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch and (c) perform batch PTE setting/updates.

While this series should be beneficial for adding cont-pte support on
ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
for large folios with minimal added overhead and further changes[4] that
build up on top of the total mapcount.

Independent of all that, this series results in a speedup during fork with
PTE-mapped THP, which is the default with THPs that are smaller than a PMD
(for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
of the same size (stddev < 1%) results in the following runtimes
for fork() (shorter is better):

Folio Size | v6.8-rc1 |      New | Change
------------------------------------------
      4KiB | 0.014328 | 0.014265 |     0%
     16KiB | 0.014263 | 0.013293 |    -7%
     32KiB | 0.014334 | 0.012355 |   -14%
     64KiB | 0.014046 | 0.011837 |   -16%
    128KiB | 0.014011 | 0.011536 |   -18%
    256KiB | 0.013993 | 0.01134  |   -19%
    512KiB | 0.013983 | 0.011311 |   -19%
   1024KiB | 0.013986 | 0.011282 |   -19%
   2048KiB | 0.014305 | 0.011496 |   -20%

Next up is PTE batching when unmapping, which I'll probably send out
based on this series this or next week.

Only tested on x86-64. Compile-tested on most other architectures. Will
do more testing and double-check the arch changes while this is getting
some review.

[1] https://lkml.kernel.org/r/20231220224504.646757-1-david@redhat.com
[2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.roberts@arm.com
[3] https://lkml.kernel.org/r/20230809083256.699513-1-david@redhat.com
[4] https://lkml.kernel.org/r/20231124132626.235350-1-david@redhat.com
[5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.roberts@arm.com

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: sparclinux@vger.kernel.org

David Hildenbrand (11):
  arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  nios2/pgtable: define PFN_PTE_SHIFT
  powerpc/pgtable: define PFN_PTE_SHIFT
  riscv/pgtable: define PFN_PTE_SHIFT
  s390/pgtable: define PFN_PTE_SHIFT
  sparc/pgtable: define PFN_PTE_SHIFT
  mm/memory: factor out copying the actual PTE in copy_present_pte()
  mm/memory: pass PTE to copy_present_pte()
  mm/memory: optimize fork() with PTE-mapped THP
  mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
  mm/memory: ignore writable bit in folio_pte_batch()

 arch/arm/include/asm/pgtable.h      |   2 +
 arch/arm64/include/asm/pgtable.h    |   2 +
 arch/nios2/include/asm/pgtable.h    |   2 +
 arch/powerpc/include/asm/pgtable.h  |   2 +
 arch/riscv/include/asm/pgtable.h    |   2 +
 arch/s390/include/asm/pgtable.h     |   2 +
 arch/sparc/include/asm/pgtable_64.h |   2 +
 include/linux/pgtable.h             |  17 ++-
 mm/memory.c                         | 188 +++++++++++++++++++++-------
 9 files changed, 173 insertions(+), 46 deletions(-)


base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d

Comments

Ryan Roberts Jan. 23, 2024, 7:15 p.m. UTC | #1
On 22/01/2024 19:41, David Hildenbrand wrote:
> Now that the rmap overhaul[1] is upstream that provides a clean interface
> for rmap batching, let's implement PTE batching during fork when processing
> PTE-mapped THPs.
> 
> This series is partially based on Ryan's previous work[2] to implement
> cont-pte support on arm64, but its a complete rewrite based on [1] to
> optimize all architectures independent of any such PTE bits, and to
> use the new rmap batching functions that simplify the code and prepare
> for further rmap accounting changes.
> 
> We collect consecutive PTEs that map consecutive pages of the same large
> folio, making sure that the other PTE bits are compatible, and (a) adjust
> the refcount only once per batch, (b) call rmap handling functions only
> once per batch and (c) perform batch PTE setting/updates.
> 
> While this series should be beneficial for adding cont-pte support on
> ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
> for large folios with minimal added overhead and further changes[4] that
> build up on top of the total mapcount.

I'm currently rebasing my contpte work onto this series, and have hit a problem.
I need to expose the "size" of a pte (pte_size()) and skip forward to the start
of the next (cont)pte every time through the folio_pte_batch() loop. But
pte_next_pfn() only allows advancing by 1 pfn; I need to advance by nr pfns:


static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
		pte_t *start_ptep, pte_t pte, int max_nr, bool *any_writable)
{
	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
	const pte_t *end_ptep = start_ptep + max_nr;
	pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte));
-	pte_t *ptep = start_ptep + 1;
+	pte_t *ptep = start_ptep;
+	int vfn, nr, i;
	bool writable;

	if (any_writable)
		*any_writable = false;

	VM_WARN_ON_FOLIO(!pte_present(pte), folio);

+	vfn = addr >> PAGE_SHIFT;
+	nr = pte_size(pte);
+	nr = ALIGN_DOWN(vfn + nr, nr) - vfn;
+	ptep += nr;
+
	while (ptep != end_ptep) {
+		pte = ptep_get(ptep);
		nr = pte_size(pte);
		if (any_writable)
			writable = !!pte_write(pte);
		pte = __pte_batch_clear_ignored(pte);

		if (!pte_same(pte, expected_pte))
			break;

		/*
		 * Stop immediately once we reached the end of the folio. In
		 * corner cases the next PFN might fall into a different
		 * folio.
		 */
-		if (pte_pfn(pte) == folio_end_pfn)
+		if (pte_pfn(pte) >= folio_end_pfn)
			break;

		if (any_writable)
			*any_writable |= writable;

-		expected_pte = pte_next_pfn(expected_pte);
-		ptep++;
+		for (i = 0; i < nr; i++)
+			expected_pte = pte_next_pfn(expected_pte);
+		ptep += nr;
	}

	return ptep - start_ptep;
}


So I'm wondering if instead of enabling pte_next_pfn() for all the arches,
perhaps it's actually better to expose pte_pgprot() for all the arches. Then we
can be much more flexible about generating ptes with pfn_pte(pfn, pgprot).

What do you think?
David Hildenbrand Jan. 23, 2024, 7:33 p.m. UTC | #2
On 23.01.24 20:15, Ryan Roberts wrote:
> On 22/01/2024 19:41, David Hildenbrand wrote:
>> Now that the rmap overhaul[1] is upstream that provides a clean interface
>> for rmap batching, let's implement PTE batching during fork when processing
>> PTE-mapped THPs.
>>
>> This series is partially based on Ryan's previous work[2] to implement
>> cont-pte support on arm64, but its a complete rewrite based on [1] to
>> optimize all architectures independent of any such PTE bits, and to
>> use the new rmap batching functions that simplify the code and prepare
>> for further rmap accounting changes.
>>
>> We collect consecutive PTEs that map consecutive pages of the same large
>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>> the refcount only once per batch, (b) call rmap handling functions only
>> once per batch and (c) perform batch PTE setting/updates.
>>
>> While this series should be beneficial for adding cont-pte support on
>> ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
>> for large folios with minimal added overhead and further changes[4] that
>> build up on top of the total mapcount.
> 
> I'm currently rebasing my contpte work onto this series, and have hit a problem.
> I need to expose the "size" of a pte (pte_size()) and skip forward to the start
> of the next (cont)pte every time through the folio_pte_batch() loop. But
> pte_next_pfn() only allows advancing by 1 pfn; I need to advance by nr pfns:
> 
> 
> static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> 		pte_t *start_ptep, pte_t pte, int max_nr, bool *any_writable)
> {
> 	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
> 	const pte_t *end_ptep = start_ptep + max_nr;
> 	pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte));
> -	pte_t *ptep = start_ptep + 1;
> +	pte_t *ptep = start_ptep;
> +	int vfn, nr, i;
> 	bool writable;
> 
> 	if (any_writable)
> 		*any_writable = false;
> 
> 	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
> 
> +	vfn = addr >> PAGE_SIZE;
> +	nr = pte_size(pte);
> +	nr = ALIGN_DOWN(vfn + nr, nr) - vfn;
> +	ptep += nr;
> +
> 	while (ptep != end_ptep) {
> +		pte = ptep_get(ptep);
> 		nr = pte_size(pte);
> 		if (any_writable)
> 			writable = !!pte_write(pte);
> 		pte = __pte_batch_clear_ignored(pte);
> 
> 		if (!pte_same(pte, expected_pte))
> 			break;
> 
> 		/*
> 		 * Stop immediately once we reached the end of the folio. In
> 		 * corner cases the next PFN might fall into a different
> 		 * folio.
> 		 */
> -		if (pte_pfn(pte) == folio_end_pfn)
> +		if (pte_pfn(pte) >= folio_end_pfn)
> 			break;
> 
> 		if (any_writable)
> 			*any_writable |= writable;
> 
> -		expected_pte = pte_next_pfn(expected_pte);
> -		ptep++;
> +		for (i = 0; i < nr; i++)
> +			expected_pte = pte_next_pfn(expected_pte);
> +		ptep += nr;
> 	}
> 
> 	return ptep - start_ptep;
> }
> 
> 
> So I'm wondering if instead of enabling pte_next_pfn() for all the arches,
> perhaps its actually better to expose pte_pgprot() for all the arches. Then we
> can be much more flexible about generating ptes with pfn_pte(pfn, pgprot).
> 
> What do you think?

The pte_pgprot() stuff is just nasty IMHO.

Likely it's best to simply convert pte_next_pfn() to something like 
pte_advance_pfns(). Then we could just have

#define pte_next_pfn(pte) pte_advance_pfns(pte, 1)

That should be fairly easy to do on top (based on PFN_PTE_SHIFT). And 
only 3 archs (x86-64, arm64, and powerpc) need slight care to replace a 
hardcoded "1" by an integer we pass in.
Ryan Roberts Jan. 23, 2024, 7:43 p.m. UTC | #3
On 23/01/2024 19:33, David Hildenbrand wrote:
> On 23.01.24 20:15, Ryan Roberts wrote:
>> On 22/01/2024 19:41, David Hildenbrand wrote:
>>> Now that the rmap overhaul[1] is upstream that provides a clean interface
>>> for rmap batching, let's implement PTE batching during fork when processing
>>> PTE-mapped THPs.
>>>
>>> This series is partially based on Ryan's previous work[2] to implement
>>> cont-pte support on arm64, but its a complete rewrite based on [1] to
>>> optimize all architectures independent of any such PTE bits, and to
>>> use the new rmap batching functions that simplify the code and prepare
>>> for further rmap accounting changes.
>>>
>>> We collect consecutive PTEs that map consecutive pages of the same large
>>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>>> the refcount only once per batch, (b) call rmap handling functions only
>>> once per batch and (c) perform batch PTE setting/updates.
>>>
>>> While this series should be beneficial for adding cont-pte support on
>>> ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
>>> for large folios with minimal added overhead and further changes[4] that
>>> build up on top of the total mapcount.
>>
>> I'm currently rebasing my contpte work onto this series, and have hit a problem.
>> I need to expose the "size" of a pte (pte_size()) and skip forward to the start
>> of the next (cont)pte every time through the folio_pte_batch() loop. But
>> pte_next_pfn() only allows advancing by 1 pfn; I need to advance by nr pfns:
>>
>>
>> static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>         pte_t *start_ptep, pte_t pte, int max_nr, bool *any_writable)
>> {
>>     unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>>     const pte_t *end_ptep = start_ptep + max_nr;
>>     pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte));
>> -    pte_t *ptep = start_ptep + 1;
>> +    pte_t *ptep = start_ptep;
>> +    int vfn, nr, i;
>>     bool writable;
>>
>>     if (any_writable)
>>         *any_writable = false;
>>
>>     VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>
>> +    vfn = addr >> PAGE_SIZE;
>> +    nr = pte_size(pte);
>> +    nr = ALIGN_DOWN(vfn + nr, nr) - vfn;
>> +    ptep += nr;
>> +
>>     while (ptep != end_ptep) {
>> +        pte = ptep_get(ptep);
>>         nr = pte_size(pte);
>>         if (any_writable)
>>             writable = !!pte_write(pte);
>>         pte = __pte_batch_clear_ignored(pte);
>>
>>         if (!pte_same(pte, expected_pte))
>>             break;
>>
>>         /*
>>          * Stop immediately once we reached the end of the folio. In
>>          * corner cases the next PFN might fall into a different
>>          * folio.
>>          */
>> -        if (pte_pfn(pte) == folio_end_pfn)
>> +        if (pte_pfn(pte) >= folio_end_pfn)
>>             break;
>>
>>         if (any_writable)
>>             *any_writable |= writable;
>>
>> -        expected_pte = pte_next_pfn(expected_pte);
>> -        ptep++;
>> +        for (i = 0; i < nr; i++)
>> +            expected_pte = pte_next_pfn(expected_pte);
>> +        ptep += nr;
>>     }
>>
>>     return ptep - start_ptep;
>> }
>>
>>
>> So I'm wondering if instead of enabling pte_next_pfn() for all the arches,
>> perhaps its actually better to expose pte_pgprot() for all the arches. Then we
>> can be much more flexible about generating ptes with pfn_pte(pfn, pgprot).
>>
>> What do you think?
> 
> The pte_pgprot() stuff is just nasty IMHO.

I dunno; we have pfn_pte() which takes a pfn and a pgprot. It seems reasonable
that we should be able to do the reverse.

> 
> Likely it's best to simply convert pte_next_pfn() to something like
> pte_advance_pfns(). The we could just have
> 
> #define pte_next_pfn(pte) pte_advance_pfns(pte, 1)
> 
> That should be fairly easy to do on top (based on PFN_PTE_SHIFT). And only 3
> archs (x86-64, arm64, and powerpc) need slight care to replace a hardcoded "1"
> by an integer we pass in.

I thought we agreed powerpc was safe to just define PFN_PTE_SHIFT? But, yeah,
the principle works I guess. I guess I can do this change along with my series.

>
David Hildenbrand Jan. 23, 2024, 8:14 p.m. UTC | #4
On 23.01.24 20:43, Ryan Roberts wrote:
> On 23/01/2024 19:33, David Hildenbrand wrote:
>> On 23.01.24 20:15, Ryan Roberts wrote:
>>> On 22/01/2024 19:41, David Hildenbrand wrote:
>>>> Now that the rmap overhaul[1] is upstream that provides a clean interface
>>>> for rmap batching, let's implement PTE batching during fork when processing
>>>> PTE-mapped THPs.
>>>>
>>>> This series is partially based on Ryan's previous work[2] to implement
>>>> cont-pte support on arm64, but its a complete rewrite based on [1] to
>>>> optimize all architectures independent of any such PTE bits, and to
>>>> use the new rmap batching functions that simplify the code and prepare
>>>> for further rmap accounting changes.
>>>>
>>>> We collect consecutive PTEs that map consecutive pages of the same large
>>>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>>>> the refcount only once per batch, (b) call rmap handling functions only
>>>> once per batch and (c) perform batch PTE setting/updates.
>>>>
>>>> While this series should be beneficial for adding cont-pte support on
>>>> ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
>>>> for large folios with minimal added overhead and further changes[4] that
>>>> build up on top of the total mapcount.
>>>
>>> I'm currently rebasing my contpte work onto this series, and have hit a problem.
>>> I need to expose the "size" of a pte (pte_size()) and skip forward to the start
>>> of the next (cont)pte every time through the folio_pte_batch() loop. But
>>> pte_next_pfn() only allows advancing by 1 pfn; I need to advance by nr pfns:
>>>
>>>
>>> static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>          pte_t *start_ptep, pte_t pte, int max_nr, bool *any_writable)
>>> {
>>>      unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>>>      const pte_t *end_ptep = start_ptep + max_nr;
>>>      pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte));
>>> -    pte_t *ptep = start_ptep + 1;
>>> +    pte_t *ptep = start_ptep;
>>> +    int vfn, nr, i;
>>>      bool writable;
>>>
>>>      if (any_writable)
>>>          *any_writable = false;
>>>
>>>      VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>>
>>> +    vfn = addr >> PAGE_SIZE;
>>> +    nr = pte_size(pte);
>>> +    nr = ALIGN_DOWN(vfn + nr, nr) - vfn;
>>> +    ptep += nr;
>>> +
>>>      while (ptep != end_ptep) {
>>> +        pte = ptep_get(ptep);
>>>          nr = pte_size(pte);
>>>          if (any_writable)
>>>              writable = !!pte_write(pte);
>>>          pte = __pte_batch_clear_ignored(pte);
>>>
>>>          if (!pte_same(pte, expected_pte))
>>>              break;
>>>
>>>          /*
>>>           * Stop immediately once we reached the end of the folio. In
>>>           * corner cases the next PFN might fall into a different
>>>           * folio.
>>>           */
>>> -        if (pte_pfn(pte) == folio_end_pfn)
>>> +        if (pte_pfn(pte) >= folio_end_pfn)
>>>              break;
>>>
>>>          if (any_writable)
>>>              *any_writable |= writable;
>>>
>>> -        expected_pte = pte_next_pfn(expected_pte);
>>> -        ptep++;
>>> +        for (i = 0; i < nr; i++)
>>> +            expected_pte = pte_next_pfn(expected_pte);
>>> +        ptep += nr;
>>>      }
>>>
>>>      return ptep - start_ptep;
>>> }
>>>
>>>
>>> So I'm wondering if instead of enabling pte_next_pfn() for all the arches,
>>> perhaps its actually better to expose pte_pgprot() for all the arches. Then we
>>> can be much more flexible about generating ptes with pfn_pte(pfn, pgprot).
>>>
>>> What do you think?
>>
>> The pte_pgprot() stuff is just nasty IMHO.
> 
> I dunno; we have pfn_pte() which takes a pfn and a pgprot. It seems reasonable
> that we should be able to do the reverse.

But pte_pgprot() is only available on a handful of architectures, no? It 
would be nice to have a completely generic pte_next_pfn() / 
pte_advance_pfns(), though.

Anyhow, this is all "easy" to rework later. Unless I am missing 
something, the low hanging fruit is simply using PFN_PTE_SHIFT for now 
that exists on most archs already.

> 
>>
>> Likely it's best to simply convert pte_next_pfn() to something like
>> pte_advance_pfns(). The we could just have
>>
>> #define pte_next_pfn(pte) pte_advance_pfns(pte, 1)
>>
>> That should be fairly easy to do on top (based on PFN_PTE_SHIFT). And only 3
>> archs (x86-64, arm64, and powerpc) need slight care to replace a hardcoded "1"
>> by an integer we pass in.
> 
> I thought we agreed powerpc was safe to just define PFN_PTE_SHIFT? But, yeah,
> the principle works I guess. I guess I can do this change along with my series.

It is, if nobody insists on that micro-optimization on powerpc.

If there is good reason to invest more time and effort right now on the 
pte_pgprot approach, then please let me know :)
Ryan Roberts Jan. 23, 2024, 8:43 p.m. UTC | #5
On 23/01/2024 20:14, David Hildenbrand wrote:
> On 23.01.24 20:43, Ryan Roberts wrote:
>> On 23/01/2024 19:33, David Hildenbrand wrote:
>>> On 23.01.24 20:15, Ryan Roberts wrote:
>>>> On 22/01/2024 19:41, David Hildenbrand wrote:
>>>>> Now that the rmap overhaul[1] is upstream that provides a clean interface
>>>>> for rmap batching, let's implement PTE batching during fork when processing
>>>>> PTE-mapped THPs.
>>>>>
>>>>> This series is partially based on Ryan's previous work[2] to implement
>>>>> cont-pte support on arm64, but its a complete rewrite based on [1] to
>>>>> optimize all architectures independent of any such PTE bits, and to
>>>>> use the new rmap batching functions that simplify the code and prepare
>>>>> for further rmap accounting changes.
>>>>>
>>>>> We collect consecutive PTEs that map consecutive pages of the same large
>>>>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>>>>> the refcount only once per batch, (b) call rmap handling functions only
>>>>> once per batch and (c) perform batch PTE setting/updates.
>>>>>
>>>>> While this series should be beneficial for adding cont-pte support on
>>>>> ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
>>>>> for large folios with minimal added overhead and further changes[4] that
>>>>> build up on top of the total mapcount.
>>>>
>>>> I'm currently rebasing my contpte work onto this series, and have hit a
>>>> problem.
>>>> I need to expose the "size" of a pte (pte_size()) and skip forward to the start
>>>> of the next (cont)pte every time through the folio_pte_batch() loop. But
>>>> pte_next_pfn() only allows advancing by 1 pfn; I need to advance by nr pfns:
>>>>
>>>>
>>>> static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>>          pte_t *start_ptep, pte_t pte, int max_nr, bool *any_writable)
>>>> {
>>>>      unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>>>>      const pte_t *end_ptep = start_ptep + max_nr;
>>>>      pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte));
>>>> -    pte_t *ptep = start_ptep + 1;
>>>> +    pte_t *ptep = start_ptep;
>>>> +    int vfn, nr, i;
>>>>      bool writable;
>>>>
>>>>      if (any_writable)
>>>>          *any_writable = false;
>>>>
>>>>      VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>>>
>>>> +    vfn = addr >> PAGE_SIZE;
>>>> +    nr = pte_size(pte);
>>>> +    nr = ALIGN_DOWN(vfn + nr, nr) - vfn;
>>>> +    ptep += nr;
>>>> +
>>>>      while (ptep != end_ptep) {
>>>> +        pte = ptep_get(ptep);
>>>>          nr = pte_size(pte);
>>>>          if (any_writable)
>>>>              writable = !!pte_write(pte);
>>>>          pte = __pte_batch_clear_ignored(pte);
>>>>
>>>>          if (!pte_same(pte, expected_pte))
>>>>              break;
>>>>
>>>>          /*
>>>>           * Stop immediately once we reached the end of the folio. In
>>>>           * corner cases the next PFN might fall into a different
>>>>           * folio.
>>>>           */
>>>> -        if (pte_pfn(pte) == folio_end_pfn)
>>>> +        if (pte_pfn(pte) >= folio_end_pfn)
>>>>              break;
>>>>
>>>>          if (any_writable)
>>>>              *any_writable |= writable;
>>>>
>>>> -        expected_pte = pte_next_pfn(expected_pte);
>>>> -        ptep++;
>>>> +        for (i = 0; i < nr; i++)
>>>> +            expected_pte = pte_next_pfn(expected_pte);
>>>> +        ptep += nr;
>>>>      }
>>>>
>>>>      return ptep - start_ptep;
>>>> }
>>>>
>>>>
>>>> So I'm wondering if instead of enabling pte_next_pfn() for all the arches,
>>>> perhaps its actually better to expose pte_pgprot() for all the arches. Then we
>>>> can be much more flexible about generating ptes with pfn_pte(pfn, pgprot).
>>>>
>>>> What do you think?
>>>
>>> The pte_pgprot() stuff is just nasty IMHO.
>>
>> I dunno; we have pfn_pte() which takes a pfn and a pgprot. It seems reasonable
>> that we should be able to do the reverse.
> 
> But pte_pgprot() is only available on a handful of architectures, no? It would
> be nice to have a completely generic pte_next_pfn() / pte_advance_pfns(), though.
> 
> Anyhow, this is all "easy" to rework later. Unless I am missing something, the
> low hanging fruit is simply using PFN_PTE_SHIFT for now that exists on most
> archs already.
> 
>>
>>>
>>> Likely it's best to simply convert pte_next_pfn() to something like
>>> pte_advance_pfns(). The we could just have
>>>
>>> #define pte_next_pfn(pte) pte_advance_pfns(pte, 1)
>>>
>>> That should be fairly easy to do on top (based on PFN_PTE_SHIFT). And only 3
>>> archs (x86-64, arm64, and powerpc) need slight care to replace a hardcoded "1"
>>> by an integer we pass in.
>>
>> I thought we agreed powerpc was safe to just define PFN_PTE_SHIFT? But, yeah,
>> the principle works I guess. I guess I can do this change along with my series.
> 
> It is, if nobody insists on that micro-optimization on powerpc.
> 
> If there is good reason to invest more time and effort right now on the
> pte_pgprot approach, then please let me know :)
> 

No I think you're right. I thought pte_pgprot() was implemented by more arches,
but there are 13 without it, so clearly a lot of effort to plug that gap. I'll
take the approach you suggest with pte_advance_pfns(). It'll just require mods
to x86 and arm64, +/- ppc.