[v3,3/6] mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally

Message ID: 20240503005023.174597-4-21cnbao@gmail.com (mailing list archive)
State: New
Series: large folios swap-in: handle refault cases first

Commit Message

Barry Song May 3, 2024, 12:50 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

There could arise a necessity to obtain the first pte_t from a swap
pte_t located in the middle. For instance, this may occur within the
context of do_swap_page(), where a page fault can potentially occur in
any PTE of a large folio. To address this, the following patch introduces
pte_move_swp_offset(), a function capable of bidirectional movement by
a specified delta argument. Consequently, pte_next_swp_offset()
will directly invoke it with delta = 1.

Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/internal.h | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)
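
For illustration only (not part of the patch): with this helper, a caller
that holds a swap pte from the middle of a large folio can recover the
folio's first swap pte by passing a negative delta. The wrapper below is a
hypothetical sketch; only pte_move_swp_offset() comes from this patch.

/*
 * Hypothetical sketch: idx_in_folio is the faulting entry's index within
 * the large folio; moving the swap offset back by idx_in_folio yields the
 * pte of the folio's first entry.
 */
static inline pte_t first_swap_pte(pte_t pte, unsigned long idx_in_folio)
{
	return pte_move_swp_offset(pte, -(long)idx_in_folio);
}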

Comments

Ryan Roberts May 3, 2024, 9:41 a.m. UTC | #1
On 03/05/2024 01:50, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> There could arise a necessity to obtain the first pte_t from a swap
> pte_t located in the middle. For instance, this may occur within the
> context of do_swap_page(), where a page fault can potentially occur in
> any PTE of a large folio. To address this, the following patch introduces
> pte_move_swp_offset(), a function capable of bidirectional movement by
> a specified delta argument. Consequently, pte_increment_swp_offset()

You mean pte_next_swp_offset()?

> will directly invoke it with delta = 1.
> 
> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/internal.h | 25 +++++++++++++++++++++----
>  1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index c5552d35d995..cfe4aed66a5c 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>  }
>  
>  /**
> - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> + *	 forward or backward by delta
>   * @pte: The initial pte state; is_swap_pte(pte) must be true and
>   *	 non_swap_entry() must be false.
> + * @delta: The direction and the offset we are moving; forward if delta
> + *	 is positive; backward if delta is negative
>   *
> - * Increments the swap offset, while maintaining all other fields, including
> + * Moves the swap offset, while maintaining all other fields, including
>   * swap type, and any swp pte bits. The resulting pte is returned.
>   */
> -static inline pte_t pte_next_swp_offset(pte_t pte)
> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)

We have equivalent functions for pfn:

  pte_next_pfn()
  pte_advance_pfn()

Although the latter takes an unsigned long and only moves forward currently. I
wonder if it makes sense to have their naming and semantics match? i.e. change
pte_advance_pfn() to pte_move_pfn() and let it move backwards too.

I guess we don't have a need for that and it adds more churn.
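
For reference, the generic fallbacks for those pfn helpers look roughly
like this (simplified from include/linux/pgtable.h; architectures may
provide their own versions):

/* Simplified generic fallbacks; PFN_PTE_SHIFT is arch-defined. */
static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
{
	return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
}

#define pte_next_pfn(pte)	pte_advance_pfn(pte, 1)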

Anyway:

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>


>  {
>  	swp_entry_t entry = pte_to_swp_entry(pte);
>  	pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
> -						   (swp_offset(entry) + 1)));
> +						   (swp_offset(entry) + delta)));
>  
>  	if (pte_swp_soft_dirty(pte))
>  		new = pte_swp_mksoft_dirty(new);
> @@ -234,6 +237,20 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
>  	return new;
>  }
>  
> +
> +/**
> + * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> + * @pte: The initial pte state; is_swap_pte(pte) must be true and
> + *	 non_swap_entry() must be false.
> + *
> + * Increments the swap offset, while maintaining all other fields, including
> + * swap type, and any swp pte bits. The resulting pte is returned.
> + */
> +static inline pte_t pte_next_swp_offset(pte_t pte)
> +{
> +	return pte_move_swp_offset(pte, 1);
> +}
> +
>  /**
>   * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>   * @start_ptep: Page table pointer for the first entry.
Chris Li May 3, 2024, 8:51 p.m. UTC | #2
On Thu, May 2, 2024 at 5:51 PM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> There could arise a necessity to obtain the first pte_t from a swap
> pte_t located in the middle. For instance, this may occur within the
> context of do_swap_page(), where a page fault can potentially occur in
> any PTE of a large folio. To address this, the following patch introduces
> pte_move_swp_offset(), a function capable of bidirectional movement by
> a specified delta argument. Consequently, pte_increment_swp_offset()
> will directly invoke it with delta = 1.

BTW, this patch has a conflict with the latest mm-unstable. You might
need to refresh it.

I do not see the delta = -1 usage case here. Maybe merge this patch
with the follow-up patch that uses delta = -1?
Does delta only make sense as 1 and -1?

This patch doesn't seem strictly necessary to me.

Chris

>
> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/internal.h | 25 +++++++++++++++++++++----
>  1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index c5552d35d995..cfe4aed66a5c 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>  }
>
>  /**
> - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> + *      forward or backward by delta
>   * @pte: The initial pte state; is_swap_pte(pte) must be true and
>   *      non_swap_entry() must be false.
> + * @delta: The direction and the offset we are moving; forward if delta
> + *      is positive; backward if delta is negative
>   *
> - * Increments the swap offset, while maintaining all other fields, including
> + * Moves the swap offset, while maintaining all other fields, including
>   * swap type, and any swp pte bits. The resulting pte is returned.
>   */
> -static inline pte_t pte_next_swp_offset(pte_t pte)
> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
>  {
>         swp_entry_t entry = pte_to_swp_entry(pte);
>         pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
> -                                                  (swp_offset(entry) + 1)));
> +                                                  (swp_offset(entry) + delta)));
>
>         if (pte_swp_soft_dirty(pte))
>                 new = pte_swp_mksoft_dirty(new);
> @@ -234,6 +237,20 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
>         return new;
>  }
>
> +
> +/**
> + * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> + * @pte: The initial pte state; is_swap_pte(pte) must be true and
> + *      non_swap_entry() must be false.
> + *
> + * Increments the swap offset, while maintaining all other fields, including
> + * swap type, and any swp pte bits. The resulting pte is returned.
> + */
> +static inline pte_t pte_next_swp_offset(pte_t pte)
> +{
> +       return pte_move_swp_offset(pte, 1);
> +}
> +
>  /**
>   * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>   * @start_ptep: Page table pointer for the first entry.
> --
> 2.34.1
>
>
Barry Song May 3, 2024, 11:07 p.m. UTC | #3
On Sat, May 4, 2024 at 4:51 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, May 2, 2024 at 5:51 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > There could arise a necessity to obtain the first pte_t from a swap
> > pte_t located in the middle. For instance, this may occur within the
> > context of do_swap_page(), where a page fault can potentially occur in
> > any PTE of a large folio. To address this, the following patch introduces
> > pte_move_swp_offset(), a function capable of bidirectional movement by
> > a specified delta argument. Consequently, pte_increment_swp_offset()
> > will directly invoke it with delta = 1.
>
> BTW, this patch has conflict with the latest mm-unstable. You might
> need to refresh it.
>
> I do not see the delta = -1 usage case here. Maybe merge the patch
> with the follow up patch that uses the delta = -1?
> Does delta only make sense as 1 and -1?

No. delta can be any value from -1 to -(nr_pages - 1). This is used to
get the first pte from vmf->pte in patch 6.

It might be better kept as a separate patch since it introduces a new
helper; do_swap_page() might not be the only user in the future.

>
> This patch doesn't seem straightly necessary to me.
>
> Chris
>
>
> Chris
>
> >
> > Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  mm/internal.h | 25 +++++++++++++++++++++----
> >  1 file changed, 21 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index c5552d35d995..cfe4aed66a5c 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> >  }
> >
> >  /**
> > - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> > + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> > + *      forward or backward by delta
> >   * @pte: The initial pte state; is_swap_pte(pte) must be true and
> >   *      non_swap_entry() must be false.
> > + * @delta: The direction and the offset we are moving; forward if delta
> > + *      is positive; backward if delta is negative
> >   *
> > - * Increments the swap offset, while maintaining all other fields, including
> > + * Moves the swap offset, while maintaining all other fields, including
> >   * swap type, and any swp pte bits. The resulting pte is returned.
> >   */
> > -static inline pte_t pte_next_swp_offset(pte_t pte)
> > +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
> >  {
> >         swp_entry_t entry = pte_to_swp_entry(pte);
> >         pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
> > -                                                  (swp_offset(entry) + 1)));
> > +                                                  (swp_offset(entry) + delta)));
> >
> >         if (pte_swp_soft_dirty(pte))
> >                 new = pte_swp_mksoft_dirty(new);
> > @@ -234,6 +237,20 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
> >         return new;
> >  }
> >
> > +
> > +/**
> > + * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> > + * @pte: The initial pte state; is_swap_pte(pte) must be true and
> > + *      non_swap_entry() must be false.
> > + *
> > + * Increments the swap offset, while maintaining all other fields, including
> > + * swap type, and any swp pte bits. The resulting pte is returned.
> > + */
> > +static inline pte_t pte_next_swp_offset(pte_t pte)
> > +{
> > +       return pte_move_swp_offset(pte, 1);
> > +}
> > +
> >  /**
> >   * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
> >   * @start_ptep: Page table pointer for the first entry.
> > --
> > 2.34.1

Thanks
Barry
Barry Song May 3, 2024, 11:40 p.m. UTC | #4
On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 03/05/2024 01:50, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > There could arise a necessity to obtain the first pte_t from a swap
> > pte_t located in the middle. For instance, this may occur within the
> > context of do_swap_page(), where a page fault can potentially occur in
> > any PTE of a large folio. To address this, the following patch introduces
> > pte_move_swp_offset(), a function capable of bidirectional movement by
> > a specified delta argument. Consequently, pte_increment_swp_offset()
>
> You mean pte_next_swp_offset()?

yes.

>
> > will directly invoke it with delta = 1.
> >
> > Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  mm/internal.h | 25 +++++++++++++++++++++----
> >  1 file changed, 21 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index c5552d35d995..cfe4aed66a5c 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> >  }
> >
> >  /**
> > - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> > + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> > + *    forward or backward by delta
> >   * @pte: The initial pte state; is_swap_pte(pte) must be true and
> >   *    non_swap_entry() must be false.
> > + * @delta: The direction and the offset we are moving; forward if delta
> > + *    is positive; backward if delta is negative
> >   *
> > - * Increments the swap offset, while maintaining all other fields, including
> > + * Moves the swap offset, while maintaining all other fields, including
> >   * swap type, and any swp pte bits. The resulting pte is returned.
> >   */
> > -static inline pte_t pte_next_swp_offset(pte_t pte)
> > +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
>
> We have equivalent functions for pfn:
>
>   pte_next_pfn()
>   pte_advance_pfn()
>
> Although the latter takes an unsigned long and only moves forward currently. I
> wonder if it makes sense to have their naming and semantics match? i.e. change
> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
>
> I guess we don't have a need for that and it adds more churn.

We might have a need in the case below:
A forks B, then A and B share large folios. B unmaps/exits, and the
large folios of process A become singly mapped.
Right now, when A writes to its folios, we CoW A's large folios into
many small folios. I believe we could reuse the entire large folio
instead of doing nr_pages CoWs and page faults.
In this case, we might want to get the first PTE from vmf->pte.

Another case might be:
A forks B, and we write either A or B; we might CoW an entire large
folio instead of CoWing nr_pages small folios.

Case 1 seems more useful; I might have a go after some days. Then we
might see pte_move_pfn().

>
> Anyway:
>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

thanks!

>
>
> >  {
> >       swp_entry_t entry = pte_to_swp_entry(pte);
> >       pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
> > -                                                (swp_offset(entry) + 1)));
> > +                                                (swp_offset(entry) + delta)));
> >
> >       if (pte_swp_soft_dirty(pte))
> >               new = pte_swp_mksoft_dirty(new);
> > @@ -234,6 +237,20 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
> >       return new;
> >  }
> >
> > +
> > +/**
> > + * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> > + * @pte: The initial pte state; is_swap_pte(pte) must be true and
> > + *    non_swap_entry() must be false.
> > + *
> > + * Increments the swap offset, while maintaining all other fields, including
> > + * swap type, and any swp pte bits. The resulting pte is returned.
> > + */
> > +static inline pte_t pte_next_swp_offset(pte_t pte)
> > +{
> > +     return pte_move_swp_offset(pte, 1);
> > +}
> > +
> >  /**
> >   * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
> >   * @start_ptep: Page table pointer for the first entry.
>

Barry
David Hildenbrand May 6, 2024, 8:06 a.m. UTC | #5
On 04.05.24 01:40, Barry Song wrote:
> On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 03/05/2024 01:50, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> There could arise a necessity to obtain the first pte_t from a swap
>>> pte_t located in the middle. For instance, this may occur within the
>>> context of do_swap_page(), where a page fault can potentially occur in
>>> any PTE of a large folio. To address this, the following patch introduces
>>> pte_move_swp_offset(), a function capable of bidirectional movement by
>>> a specified delta argument. Consequently, pte_increment_swp_offset()
>>
>> You mean pte_next_swp_offset()?
> 
> yes.
> 
>>
>>> will directly invoke it with delta = 1.
>>>
>>> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>> ---
>>>   mm/internal.h | 25 +++++++++++++++++++++----
>>>   1 file changed, 21 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index c5552d35d995..cfe4aed66a5c 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>   }
>>>
>>>   /**
>>> - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
>>> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
>>> + *    forward or backward by delta
>>>    * @pte: The initial pte state; is_swap_pte(pte) must be true and
>>>    *    non_swap_entry() must be false.
>>> + * @delta: The direction and the offset we are moving; forward if delta
>>> + *    is positive; backward if delta is negative
>>>    *
>>> - * Increments the swap offset, while maintaining all other fields, including
>>> + * Moves the swap offset, while maintaining all other fields, including
>>>    * swap type, and any swp pte bits. The resulting pte is returned.
>>>    */
>>> -static inline pte_t pte_next_swp_offset(pte_t pte)
>>> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
>>
>> We have equivalent functions for pfn:
>>
>>    pte_next_pfn()
>>    pte_advance_pfn()
>>
>> Although the latter takes an unsigned long and only moves forward currently. I
>> wonder if it makes sense to have their naming and semantics match? i.e. change
>> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
>>
>> I guess we don't have a need for that and it adds more churn.
> 
> we might have a need in the below case.
> A forks B, then A and B share large folios. B unmap/exit, then large
> folios of process
> A become single-mapped.
> Right now, while writing A's folios, we are CoWing A's large folios
> into many small
> folios. I believe we can reuse the entire large folios instead of doing nr_pages
> CoW and page faults.
> In this case, we might want to get the first PTE from vmf->pte.

Once we have COW reuse for large folios in place (I think you know that 
I am working on that), it might make sense to "COW-reuse around", 
meaning we check whether some neighboring PTEs map the same large folio 
and map them writable as well. But whether it's really worth it, given 
the increased page fault latency, is to be decided separately.


> 
> Another case, might be
> A forks B, and we write either A or B, we might CoW an entire large
> folios instead
> CoWing nr_pages small folios.
> 
> case 1 seems more useful, I might have a go after some days. then we might
> see pte_move_pfn().
pte_move_pfn() does sound odd to me. It might not be required to 
implement the optimization described above (it's easier to simply read 
another PTE, check whether it maps the same large folio, and batch from 
there).
Barry Song May 6, 2024, 8:20 a.m. UTC | #6
On Mon, May 6, 2024 at 8:06 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 04.05.24 01:40, Barry Song wrote:
> > On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 03/05/2024 01:50, Barry Song wrote:
> >>> From: Barry Song <v-songbaohua@oppo.com>
> >>>
> >>> There could arise a necessity to obtain the first pte_t from a swap
> >>> pte_t located in the middle. For instance, this may occur within the
> >>> context of do_swap_page(), where a page fault can potentially occur in
> >>> any PTE of a large folio. To address this, the following patch introduces
> >>> pte_move_swp_offset(), a function capable of bidirectional movement by
> >>> a specified delta argument. Consequently, pte_increment_swp_offset()
> >>
> >> You mean pte_next_swp_offset()?
> >
> > yes.
> >
> >>
> >>> will directly invoke it with delta = 1.
> >>>
> >>> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>> ---
> >>>   mm/internal.h | 25 +++++++++++++++++++++----
> >>>   1 file changed, 21 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/mm/internal.h b/mm/internal.h
> >>> index c5552d35d995..cfe4aed66a5c 100644
> >>> --- a/mm/internal.h
> >>> +++ b/mm/internal.h
> >>> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> >>>   }
> >>>
> >>>   /**
> >>> - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> >>> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> >>> + *    forward or backward by delta
> >>>    * @pte: The initial pte state; is_swap_pte(pte) must be true and
> >>>    *    non_swap_entry() must be false.
> >>> + * @delta: The direction and the offset we are moving; forward if delta
> >>> + *    is positive; backward if delta is negative
> >>>    *
> >>> - * Increments the swap offset, while maintaining all other fields, including
> >>> + * Moves the swap offset, while maintaining all other fields, including
> >>>    * swap type, and any swp pte bits. The resulting pte is returned.
> >>>    */
> >>> -static inline pte_t pte_next_swp_offset(pte_t pte)
> >>> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
> >>
> >> We have equivalent functions for pfn:
> >>
> >>    pte_next_pfn()
> >>    pte_advance_pfn()
> >>
> >> Although the latter takes an unsigned long and only moves forward currently. I
> >> wonder if it makes sense to have their naming and semantics match? i.e. change
> >> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
> >>
> >> I guess we don't have a need for that and it adds more churn.
> >
> > we might have a need in the below case.
> > A forks B, then A and B share large folios. B unmap/exit, then large
> > folios of process
> > A become single-mapped.
> > Right now, while writing A's folios, we are CoWing A's large folios
> > into many small
> > folios. I believe we can reuse the entire large folios instead of doing nr_pages
> > CoW and page faults.
> > In this case, we might want to get the first PTE from vmf->pte.
>
> Once we have COW reuse for large folios in place (I think you know that
> I am working on that), it might make sense to "COW-reuse around",

TBH, I didn't know you were working on that. Please Cc me next time :-)

> meaning we look if some neighboring PTEs map the same large folio and
> map them writable as well. But if it's really worth it, increasing page
> fault latency, is to be decided separately.

On the other hand, we eliminate latency for the remaining nr_pages - 1
PTEs. Perhaps we can find a more cost-effective way to signal that a
large folio is probably singly mapped, and only attempt "multi-PTE"
reuse while that condition holds during the page fault, so we don't
always increase latency?

>
>
> >
> > Another case, might be
> > A forks B, and we write either A or B, we might CoW an entire large
> > folios instead
> > CoWing nr_pages small folios.
> >
> > case 1 seems more useful, I might have a go after some days. then we might
> > see pte_move_pfn().
> pte_move_pfn() does sound odd to me. It might not be required to
> implement the optimization described above. (it's easier to simply read
> another PTE, check if it maps the same large folio, and to batch from there)
>

It appears that your proposal suggests potential reusability as
follows: if we have a large folio containing 16 PTEs, you might
consider reusing only 4 by examining the PTEs "around", but not
necessarily all 16. Please correct me if my understanding is wrong.

Initially, my idea was to obtain the first PTE using pte_move_pfn() and
then call folio_pte_batch() with that first PTE as an argument to
ensure consistency in nr_pages, thus enabling complete reuse of the
whole folio.

> --
> Cheers,
>
> David / dhildenb

Thanks
Barry
David Hildenbrand May 6, 2024, 8:31 a.m. UTC | #7
On 06.05.24 10:20, Barry Song wrote:
> On Mon, May 6, 2024 at 8:06 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 04.05.24 01:40, Barry Song wrote:
>>> On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 03/05/2024 01:50, Barry Song wrote:
>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>
>>>>> There could arise a necessity to obtain the first pte_t from a swap
>>>>> pte_t located in the middle. For instance, this may occur within the
>>>>> context of do_swap_page(), where a page fault can potentially occur in
>>>>> any PTE of a large folio. To address this, the following patch introduces
>>>>> pte_move_swp_offset(), a function capable of bidirectional movement by
>>>>> a specified delta argument. Consequently, pte_increment_swp_offset()
>>>>
>>>> You mean pte_next_swp_offset()?
>>>
>>> yes.
>>>
>>>>
>>>>> will directly invoke it with delta = 1.
>>>>>
>>>>> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>> ---
>>>>>    mm/internal.h | 25 +++++++++++++++++++++----
>>>>>    1 file changed, 21 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>> index c5552d35d995..cfe4aed66a5c 100644
>>>>> --- a/mm/internal.h
>>>>> +++ b/mm/internal.h
>>>>> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>>>    }
>>>>>
>>>>>    /**
>>>>> - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
>>>>> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
>>>>> + *    forward or backward by delta
>>>>>     * @pte: The initial pte state; is_swap_pte(pte) must be true and
>>>>>     *    non_swap_entry() must be false.
>>>>> + * @delta: The direction and the offset we are moving; forward if delta
>>>>> + *    is positive; backward if delta is negative
>>>>>     *
>>>>> - * Increments the swap offset, while maintaining all other fields, including
>>>>> + * Moves the swap offset, while maintaining all other fields, including
>>>>>     * swap type, and any swp pte bits. The resulting pte is returned.
>>>>>     */
>>>>> -static inline pte_t pte_next_swp_offset(pte_t pte)
>>>>> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
>>>>
>>>> We have equivalent functions for pfn:
>>>>
>>>>     pte_next_pfn()
>>>>     pte_advance_pfn()
>>>>
>>>> Although the latter takes an unsigned long and only moves forward currently. I
>>>> wonder if it makes sense to have their naming and semantics match? i.e. change
>>>> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
>>>>
>>>> I guess we don't have a need for that and it adds more churn.
>>>
>>> we might have a need in the below case.
>>> A forks B, then A and B share large folios. B unmap/exit, then large
>>> folios of process
>>> A become single-mapped.
>>> Right now, while writing A's folios, we are CoWing A's large folios
>>> into many small
>>> folios. I believe we can reuse the entire large folios instead of doing nr_pages
>>> CoW and page faults.
>>> In this case, we might want to get the first PTE from vmf->pte.
>>
>> Once we have COW reuse for large folios in place (I think you know that
>> I am working on that), it might make sense to "COW-reuse around",
> 
> TBH, I don't know if you are working on that. please Cc me next time :-)

I could have sworn I mentioned it to you already :)

See

https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/

I'll follow up on that soonish (now that batching is upstream and the 
large mapcount is on its way upstream).

> 
>> meaning we look if some neighboring PTEs map the same large folio and
>> map them writable as well. But if it's really worth it, increasing page
>> fault latency, is to be decided separately.
> 
> On the other hand, we eliminate latency for the remaining nr_pages - 1 PTEs.
> Perhaps we can discover a more cost-effective method to signify that a large
> folio is probably singly mapped?

Yes, precisely what I am up to!

> and only call "multi-PTEs" reuse while that
> condition is true in PF and avoid increasing latency always?

I'm thinking along those lines:

If we detect that it's exclusive, we can certainly map the current 
PTE writable. Then, we can decide how much (and if) we want to 
fault-around writable as an optimization.

For smallish large folios, it might make sense to try faulting around 
most of the folio.

For large large folios (e.g., PTE-mapped 2MiB THP and bigger), we might 
not want to fault around the whole thing -- especially if there is 
little benefit to be had from contig-pte bits.
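
Purely as an illustration of that trade-off (the helper and thresholds
below are assumptions, not an implementation David has posted):

/*
 * Hypothetical policy sketch: how many PTEs to consider mapping
 * writable around an exclusive large folio. The numbers are made up.
 */
static unsigned long cow_reuse_around_nr(struct folio *folio)
{
	unsigned long nr = folio_nr_pages(folio);

	/* Smallish large folios: try most of the folio. */
	if (nr <= 16)
		return nr;
	/* PTE-mapped THP and bigger: keep the window bounded. */
	return 16;
}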

> 
>>
>>
>>>
>>> Another case, might be
>>> A forks B, and we write either A or B, we might CoW an entire large
>>> folios instead
>>> CoWing nr_pages small folios.
>>>
>>> case 1 seems more useful, I might have a go after some days. then we might
>>> see pte_move_pfn().
>> pte_move_pfn() does sound odd to me. It might not be required to
>> implement the optimization described above. (it's easier to simply read
>> another PTE, check if it maps the same large folio, and to batch from there)
>>
> 
> It appears that your proposal suggests potential reusability as follows: if we
> have a large folio containing 16 PTEs, you might consider reusing only 4 by
> examining PTEs "around" but not necessarily all 16 PTEs. please correct me
> if my understanding is wrong.
> 
> Initially, my idea was to obtain the first PTE using pte_move_pfn() and then
> utilize folio_pte_batch() with the first PTE as arguments to ensure consistency
> in nr_pages, thus enabling complete reuse of the whole folio.

Simply doing a vm_normal_folio(pte - X) == folio check and then trying 
to batch from there might be easier and cleaner.
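
A loose sketch of that suggestion, assuming a do_swap_page()-like context
where idx is the page's index within the folio (everything apart from
vm_normal_folio() and ptep_get() is assumed context):

	pte_t *first_ptep = vmf->pte - idx;
	pte_t first_pte = ptep_get(first_ptep);

	if (pte_present(first_pte) &&
	    vm_normal_folio(vmf->vma, vmf->address - idx * PAGE_SIZE,
			    first_pte) == folio) {
		/* same folio: batch from first_ptep, e.g. folio_pte_batch() */
	}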
Ryan Roberts May 7, 2024, 8:14 a.m. UTC | #8
On 06/05/2024 09:31, David Hildenbrand wrote:
> On 06.05.24 10:20, Barry Song wrote:
>> On Mon, May 6, 2024 at 8:06 PM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 04.05.24 01:40, Barry Song wrote:
>>>> On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> On 03/05/2024 01:50, Barry Song wrote:
>>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>>
>>>>>> There could arise a necessity to obtain the first pte_t from a swap
>>>>>> pte_t located in the middle. For instance, this may occur within the
>>>>>> context of do_swap_page(), where a page fault can potentially occur in
>>>>>> any PTE of a large folio. To address this, the following patch introduces
>>>>>> pte_move_swp_offset(), a function capable of bidirectional movement by
>>>>>> a specified delta argument. Consequently, pte_increment_swp_offset()
>>>>>
>>>>> You mean pte_next_swp_offset()?
>>>>
>>>> yes.
>>>>
>>>>>
>>>>>> will directly invoke it with delta = 1.
>>>>>>
>>>>>> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
>>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>>> ---
>>>>>>    mm/internal.h | 25 +++++++++++++++++++++----
>>>>>>    1 file changed, 21 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>>> index c5552d35d995..cfe4aed66a5c 100644
>>>>>> --- a/mm/internal.h
>>>>>> +++ b/mm/internal.h
>>>>>> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio
>>>>>> *folio, unsigned long addr,
>>>>>>    }
>>>>>>
>>>>>>    /**
>>>>>> - * pte_next_swp_offset - Increment the swap entry offset field of a swap
>>>>>> pte.
>>>>>> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
>>>>>> + *    forward or backward by delta
>>>>>>     * @pte: The initial pte state; is_swap_pte(pte) must be true and
>>>>>>     *    non_swap_entry() must be false.
>>>>>> + * @delta: The direction and the offset we are moving; forward if delta
>>>>>> + *    is positive; backward if delta is negative
>>>>>>     *
>>>>>> - * Increments the swap offset, while maintaining all other fields, including
>>>>>> + * Moves the swap offset, while maintaining all other fields, including
>>>>>>     * swap type, and any swp pte bits. The resulting pte is returned.
>>>>>>     */
>>>>>> -static inline pte_t pte_next_swp_offset(pte_t pte)
>>>>>> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
>>>>>
>>>>> We have equivalent functions for pfn:
>>>>>
>>>>>     pte_next_pfn()
>>>>>     pte_advance_pfn()
>>>>>
>>>>> Although the latter takes an unsigned long and only moves forward currently. I
>>>>> wonder if it makes sense to have their naming and semantics match? i.e. change
>>>>> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
>>>>>
>>>>> I guess we don't have a need for that and it adds more churn.
>>>>
>>>> we might have a need in the below case.
>>>> A forks B, then A and B share large folios. B unmap/exit, then large
>>>> folios of process
>>>> A become single-mapped.
>>>> Right now, while writing A's folios, we are CoWing A's large folios
>>>> into many small
>>>> folios. I believe we can reuse the entire large folios instead of doing
>>>> nr_pages
>>>> CoW and page faults.
>>>> In this case, we might want to get the first PTE from vmf->pte.
>>>
>>> Once we have COW reuse for large folios in place (I think you know that
>>> I am working on that), it might make sense to "COW-reuse around",
>>
>> TBH, I don't know if you are working on that. please Cc me next time :-)
> 
> I could have sworn I mentioned it to you already :)
> 
> See
> 
> https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/
> 
> I'll follow-up on that soonish (now that batching is upstream and the large
> mapcount is on its way upstream).
> 
>>
>>> meaning we look if some neighboring PTEs map the same large folio and
>>> map them writable as well. But if it's really worth it, increasing page
>>> fault latency, is to be decided separately.
>>
>> On the other hand, we eliminate latency for the remaining nr_pages - 1 PTEs.
>> Perhaps we can discover a more cost-effective method to signify that a large
>> folio is probably singly mapped?
> 
> Yes, precisely what I am up to!
> 
>> and only call "multi-PTEs" reuse while that
>> condition is true in PF and avoid increasing latency always?
> 
> I'm thinking along those lines:
> 
> If we detect that it's exclusive, we can certainly mapped the current PTE
> writable. Then, we can decide how much (and if) we want to fault-around writable
> as an optimization.
> 
> For smallish large folios, it might make sense to try faulting around most of
> the folio.
> 
> For large large folios (e.g., PTE-mapped 2MiB THP and bigger), we might not want
> to fault around the whole thing -- especially if there is little benefit to be
> had from contig-pte bits.
> 
>>
>>>
>>>
>>>>
>>>> Another case, might be
>>>> A forks B, and we write either A or B, we might CoW an entire large
>>>> folios instead
>>>> CoWing nr_pages small folios.
>>>>
>>>> case 1 seems more useful, I might have a go after some days. then we might
>>>> see pte_move_pfn().
>>> pte_move_pfn() does sound odd to me. 

Yes, I agree the name is odd. pte_move_swp_offset() sounds similarly odd tbh.
Perhaps just pte_advance_swp_offset() with a negative value is clearer about
what it's doing?

>>> It might not be required to
>>> implement the optimization described above. (it's easier to simply read
>>> another PTE, check if it maps the same large folio, and to batch from there)

Yes agreed.

>>>
>>
>> It appears that your proposal suggests potential reusability as follows: if we
>> have a large folio containing 16 PTEs, you might consider reusing only 4 by
>> examining PTEs "around" but not necessarily all 16 PTEs. please correct me
>> if my understanding is wrong.
>>
>> Initially, my idea was to obtain the first PTE using pte_move_pfn() and then
>> utilize folio_pte_batch() with the first PTE as arguments to ensure consistency
>> in nr_pages, thus enabling complete reuse of the whole folio.
> 
> Simply doing an vm_normal_folio(pte - X) == folio and then trying to batch from
> there might be easier and cleaner.
>
Barry Song May 7, 2024, 8:24 a.m. UTC | #9
On Tue, May 7, 2024 at 8:14 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 06/05/2024 09:31, David Hildenbrand wrote:
> > On 06.05.24 10:20, Barry Song wrote:
> >> On Mon, May 6, 2024 at 8:06 PM David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>> On 04.05.24 01:40, Barry Song wrote:
> >>>> On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>
> >>>>> On 03/05/2024 01:50, Barry Song wrote:
> >>>>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>>>
> >>>>>> There could arise a necessity to obtain the first pte_t from a swap
> >>>>>> pte_t located in the middle. For instance, this may occur within the
> >>>>>> context of do_swap_page(), where a page fault can potentially occur in
> >>>>>> any PTE of a large folio. To address this, the following patch introduces
> >>>>>> pte_move_swp_offset(), a function capable of bidirectional movement by
> >>>>>> a specified delta argument. Consequently, pte_increment_swp_offset()
> >>>>>
> >>>>> You mean pte_next_swp_offset()?
> >>>>
> >>>> yes.
> >>>>
> >>>>>
> >>>>>> will directly invoke it with delta = 1.
> >>>>>>
> >>>>>> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> >>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>>>>> ---
> >>>>>>    mm/internal.h | 25 +++++++++++++++++++++----
> >>>>>>    1 file changed, 21 insertions(+), 4 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/internal.h b/mm/internal.h
> >>>>>> index c5552d35d995..cfe4aed66a5c 100644
> >>>>>> --- a/mm/internal.h
> >>>>>> +++ b/mm/internal.h
> >>>>>> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio
> >>>>>> *folio, unsigned long addr,
> >>>>>>    }
> >>>>>>
> >>>>>>    /**
> >>>>>> - * pte_next_swp_offset - Increment the swap entry offset field of a swap
> >>>>>> pte.
> >>>>>> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> >>>>>> + *    forward or backward by delta
> >>>>>>     * @pte: The initial pte state; is_swap_pte(pte) must be true and
> >>>>>>     *    non_swap_entry() must be false.
> >>>>>> + * @delta: The direction and the offset we are moving; forward if delta
> >>>>>> + *    is positive; backward if delta is negative
> >>>>>>     *
> >>>>>> - * Increments the swap offset, while maintaining all other fields, including
> >>>>>> + * Moves the swap offset, while maintaining all other fields, including
> >>>>>>     * swap type, and any swp pte bits. The resulting pte is returned.
> >>>>>>     */
> >>>>>> -static inline pte_t pte_next_swp_offset(pte_t pte)
> >>>>>> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
> >>>>>
> >>>>> We have equivalent functions for pfn:
> >>>>>
> >>>>>     pte_next_pfn()
> >>>>>     pte_advance_pfn()
> >>>>>
> >>>>> Although the latter takes an unsigned long and only moves forward currently. I
> >>>>> wonder if it makes sense to have their naming and semantics match? i.e. change
> >>>>> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
> >>>>>
> >>>>> I guess we don't have a need for that and it adds more churn.
> >>>>
> >>>> we might have a need in the below case.
> >>>> A forks B, then A and B share large folios. B unmap/exit, then large
> >>>> folios of process
> >>>> A become single-mapped.
> >>>> Right now, while writing A's folios, we are CoWing A's large folios
> >>>> into many small
> >>>> folios. I believe we can reuse the entire large folios instead of doing
> >>>> nr_pages
> >>>> CoW and page faults.
> >>>> In this case, we might want to get the first PTE from vmf->pte.
> >>>
> >>> Once we have COW reuse for large folios in place (I think you know that
> >>> I am working on that), it might make sense to "COW-reuse around",
> >>
> >> TBH, I don't know if you are working on that. please Cc me next time :-)
> >
> > I could have sworn I mentioned it to you already :)
> >
> > See
> >
> > https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/
> >
> > I'll follow-up on that soonish (now that batching is upstream and the large
> > mapcount is on its way upstream).
> >
> >>
> >>> meaning we look if some neighboring PTEs map the same large folio and
> >>> map them writable as well. But if it's really worth it, increasing page
> >>> fault latency, is to be decided separately.
> >>
> >> On the other hand, we eliminate latency for the remaining nr_pages - 1 PTEs.
> >> Perhaps we can discover a more cost-effective method to signify that a large
> >> folio is probably singly mapped?
> >
> > Yes, precisely what I am up to!
> >
> >> and only call "multi-PTEs" reuse while that
> >> condition is true in PF and avoid increasing latency always?
> >
> > I'm thinking along those lines:
> >
> > If we detect that it's exclusive, we can certainly mapped the current PTE
> > writable. Then, we can decide how much (and if) we want to fault-around writable
> > as an optimization.
> >
> > For smallish large folios, it might make sense to try faulting around most of
> > the folio.
> >
> > For large large folios (e.g., PTE-mapped 2MiB THP and bigger), we might not want
> > to fault around the whole thing -- especially if there is little benefit to be
> > had from contig-pte bits.
> >
> >>
> >>>
> >>>
> >>>>
> >>>> Another case, might be
> >>>> A forks B, and we write either A or B, we might CoW an entire large
> >>>> folios instead
> >>>> CoWing nr_pages small folios.
> >>>>
> >>>> case 1 seems more useful, I might have a go after some days. then we might
> >>>> see pte_move_pfn().
> >>> pte_move_pfn() does sound odd to me.
>
> Yes, I agree the name is odd. pte_move_swp_offset() sounds similarly odd tbh.
> Perhaps just pte_advance_swp_offset() with a negative value is clearer about
> what its doing?
>

I am not a native speaker, but the dictionary says:

advance:
move forward in a purposeful way.
a forward movement.

Now we are moving backward or forward :-)

> >>> It might not be required to
> >>> implement the optimization described above. (it's easier to simply read
> >>> another PTE, check if it maps the same large folio, and to batch from there)
>
> Yes agreed.
>
> >>>
> >>
> >> It appears that your proposal suggests potential reusability as follows: if we
> >> have a large folio containing 16 PTEs, you might consider reusing only 4 by
> >> examining PTEs "around" but not necessarily all 16 PTEs. please correct me
> >> if my understanding is wrong.
> >>
> >> Initially, my idea was to obtain the first PTE using pte_move_pfn() and then
> >> utilize folio_pte_batch() with the first PTE as arguments to ensure consistency
> >> in nr_pages, thus enabling complete reuse of the whole folio.
> >
> > Simply doing an vm_normal_folio(pte - X) == folio and then trying to batch from
> > there might be easier and cleaner.
> >
>
Ryan Roberts May 7, 2024, 9:39 a.m. UTC | #10
On 07/05/2024 09:24, Barry Song wrote:
> On Tue, May 7, 2024 at 8:14 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 06/05/2024 09:31, David Hildenbrand wrote:
>>> On 06.05.24 10:20, Barry Song wrote:
>>>> On Mon, May 6, 2024 at 8:06 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 04.05.24 01:40, Barry Song wrote:
>>>>>> On Fri, May 3, 2024 at 5:41 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>
>>>>>>> On 03/05/2024 01:50, Barry Song wrote:
>>>>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>>>>
>>>>>>>> There could arise a necessity to obtain the first pte_t from a swap
>>>>>>>> pte_t located in the middle. For instance, this may occur within the
>>>>>>>> context of do_swap_page(), where a page fault can potentially occur in
>>>>>>>> any PTE of a large folio. To address this, the following patch introduces
>>>>>>>> pte_move_swp_offset(), a function capable of bidirectional movement by
>>>>>>>> a specified delta argument. Consequently, pte_increment_swp_offset()
>>>>>>>
>>>>>>> You mean pte_next_swp_offset()?
>>>>>>
>>>>>> yes.
>>>>>>
>>>>>>>
>>>>>>>> will directly invoke it with delta = 1.
>>>>>>>>
>>>>>>>> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
>>>>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>>>>> ---
>>>>>>>>    mm/internal.h | 25 +++++++++++++++++++++----
>>>>>>>>    1 file changed, 21 insertions(+), 4 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>>>>> index c5552d35d995..cfe4aed66a5c 100644
>>>>>>>> --- a/mm/internal.h
>>>>>>>> +++ b/mm/internal.h
>>>>>>>> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio
>>>>>>>> *folio, unsigned long addr,
>>>>>>>>    }
>>>>>>>>
>>>>>>>>    /**
>>>>>>>> - * pte_next_swp_offset - Increment the swap entry offset field of a swap
>>>>>>>> pte.
>>>>>>>> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
>>>>>>>> + *    forward or backward by delta
>>>>>>>>     * @pte: The initial pte state; is_swap_pte(pte) must be true and
>>>>>>>>     *    non_swap_entry() must be false.
>>>>>>>> + * @delta: The direction and the offset we are moving; forward if delta
>>>>>>>> + *    is positive; backward if delta is negative
>>>>>>>>     *
>>>>>>>> - * Increments the swap offset, while maintaining all other fields, including
>>>>>>>> + * Moves the swap offset, while maintaining all other fields, including
>>>>>>>>     * swap type, and any swp pte bits. The resulting pte is returned.
>>>>>>>>     */
>>>>>>>> -static inline pte_t pte_next_swp_offset(pte_t pte)
>>>>>>>> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
>>>>>>>
>>>>>>> We have equivalent functions for pfn:
>>>>>>>
>>>>>>>     pte_next_pfn()
>>>>>>>     pte_advance_pfn()
>>>>>>>
>>>>>>> Although the latter takes an unsigned long and only moves forward currently. I
>>>>>>> wonder if it makes sense to have their naming and semantics match? i.e. change
>>>>>>> pte_advance_pfn() to pte_move_pfn() and let it move backwards too.
>>>>>>>
>>>>>>> I guess we don't have a need for that and it adds more churn.
>>>>>>
>>>>>> we might have a need in the below case.
>>>>>> A forks B, then A and B share large folios. B unmap/exit, then large
>>>>>> folios of process
>>>>>> A become single-mapped.
>>>>>> Right now, while writing A's folios, we are CoWing A's large folios
>>>>>> into many small
>>>>>> folios. I believe we can reuse the entire large folios instead of doing
>>>>>> nr_pages
>>>>>> CoW and page faults.
>>>>>> In this case, we might want to get the first PTE from vmf->pte.
>>>>>
>>>>> Once we have COW reuse for large folios in place (I think you know that
>>>>> I am working on that), it might make sense to "COW-reuse around",
>>>>
>>>> TBH, I don't know if you are working on that. please Cc me next time :-)
>>>
>>> I could have sworn I mentioned it to you already :)
>>>
>>> See
>>>
>>> https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/
>>>
>>> I'll follow-up on that soonish (now that batching is upstream and the large
>>> mapcount is on its way upstream).
>>>
>>>>
>>>>> meaning we look if some neighboring PTEs map the same large folio and
>>>>> map them writable as well. But if it's really worth it, increasing page
>>>>> fault latency, is to be decided separately.
>>>>
>>>> On the other hand, we eliminate latency for the remaining nr_pages - 1 PTEs.
>>>> Perhaps we can discover a more cost-effective method to signify that a large
>>>> folio is probably singly mapped?
>>>
>>> Yes, precisely what I am up to!
>>>
>>>> and only call "multi-PTEs" reuse while that
>>>> condition is true in PF and avoid increasing latency always?
>>>
>>> I'm thinking along those lines:
>>>
>>> If we detect that it's exclusive, we can certainly mapped the current PTE
>>> writable. Then, we can decide how much (and if) we want to fault-around writable
>>> as an optimization.
>>>
>>> For smallish large folios, it might make sense to try faulting around most of
>>> the folio.
>>>
>>> For large large folios (e.g., PTE-mapped 2MiB THP and bigger), we might not want
>>> to fault around the whole thing -- especially if there is little benefit to be
>>> had from contig-pte bits.
>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Another case, might be
>>>>>> A forks B, and we write either A or B, we might CoW an entire large
>>>>>> folios instead
>>>>>> CoWing nr_pages small folios.
>>>>>>
>>>>>> case 1 seems more useful, I might have a go after some days. then we might
>>>>>> see pte_move_pfn().
>>>>> pte_move_pfn() does sound odd to me.
>>
>> Yes, I agree the name is odd. pte_move_swp_offset() sounds similarly odd tbh.
>> Perhaps just pte_advance_swp_offset() with a negative value is clearer about
>> what its doing?
>>
> 
> I am not a native speaker. but dictionary says
> 
> advance:
> move forward in a purposeful way.
> a forward movement.
> 
> Now we are moving backward or forward :-)

Sure, but if you pass a negative value then you are moving forwards by a
negative amount ;-)

Anyway, forget I said anything - its not important.

> 
>>>>> It might not be required to
>>>>> implement the optimization described above. (it's easier to simply read
>>>>> another PTE, check if it maps the same large folio, and to batch from there)
>>
>> Yes agreed.
>>
>>>>>
>>>>
>>>> It appears that your proposal suggests potential reusability as follows: if we
>>>> have a large folio containing 16 PTEs, you might consider reusing only 4 by
>>>> examining PTEs "around" but not necessarily all 16 PTEs. please correct me
>>>> if my understanding is wrong.
>>>>
>>>> Initially, my idea was to obtain the first PTE using pte_move_pfn() and then
>>>> utilize folio_pte_batch() with the first PTE as arguments to ensure consistency
>>>> in nr_pages, thus enabling complete reuse of the whole folio.
>>>
>>> Simply doing an vm_normal_folio(pte - X) == folio and then trying to batch from
>>> there might be easier and cleaner.
>>>
>>
Huang, Ying May 8, 2024, 8:08 a.m. UTC | #11
Barry Song <21cnbao@gmail.com> writes:

> From: Barry Song <v-songbaohua@oppo.com>
>
> There could arise a necessity to obtain the first pte_t from a swap
> pte_t located in the middle. For instance, this may occur within the
> context of do_swap_page(), where a page fault can potentially occur in
> any PTE of a large folio. To address this, the following patch introduces
> pte_move_swp_offset(), a function capable of bidirectional movement by
> a specified delta argument. Consequently, pte_increment_swp_offset()
> will directly invoke it with delta = 1.
>
> Suggested-by: "Huang, Ying" <ying.huang@intel.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>

LGTM, thanks! Feel free to add

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

in a future version.

> ---
>  mm/internal.h | 25 +++++++++++++++++++++----
>  1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index c5552d35d995..cfe4aed66a5c 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -211,18 +211,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>  }
>  
>  /**
> - * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> + * pte_move_swp_offset - Move the swap entry offset field of a swap pte
> + *	 forward or backward by delta
>   * @pte: The initial pte state; is_swap_pte(pte) must be true and
>   *	 non_swap_entry() must be false.
> + * @delta: The direction and the offset we are moving; forward if delta
> + *	 is positive; backward if delta is negative
>   *
> - * Increments the swap offset, while maintaining all other fields, including
> + * Moves the swap offset, while maintaining all other fields, including
>   * swap type, and any swp pte bits. The resulting pte is returned.
>   */
> -static inline pte_t pte_next_swp_offset(pte_t pte)
> +static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
>  {
>  	swp_entry_t entry = pte_to_swp_entry(pte);
>  	pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
> -						   (swp_offset(entry) + 1)));
> +						   (swp_offset(entry) + delta)));
>  
>  	if (pte_swp_soft_dirty(pte))
>  		new = pte_swp_mksoft_dirty(new);
> @@ -234,6 +237,20 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
>  	return new;
>  }
>  
> +
> +/**
> + * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
> + * @pte: The initial pte state; is_swap_pte(pte) must be true and
> + *	 non_swap_entry() must be false.
> + *
> + * Increments the swap offset, while maintaining all other fields, including
> + * swap type, and any swp pte bits. The resulting pte is returned.
> + */
> +static inline pte_t pte_next_swp_offset(pte_t pte)
> +{
> +	return pte_move_swp_offset(pte, 1);
> +}
> +
>  /**
>   * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>   * @start_ptep: Page table pointer for the first entry.

--
Best Regards,
Huang, Ying

Patch

diff --git a/mm/internal.h b/mm/internal.h
index c5552d35d995..cfe4aed66a5c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -211,18 +211,21 @@  static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 }
 
 /**
- * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
+ * pte_move_swp_offset - Move the swap entry offset field of a swap pte
+ *	 forward or backward by delta
  * @pte: The initial pte state; is_swap_pte(pte) must be true and
  *	 non_swap_entry() must be false.
+ * @delta: The direction and the offset we are moving; forward if delta
+ *	 is positive; backward if delta is negative
  *
- * Increments the swap offset, while maintaining all other fields, including
+ * Moves the swap offset, while maintaining all other fields, including
  * swap type, and any swp pte bits. The resulting pte is returned.
  */
-static inline pte_t pte_next_swp_offset(pte_t pte)
+static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
 {
 	swp_entry_t entry = pte_to_swp_entry(pte);
 	pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
-						   (swp_offset(entry) + 1)));
+						   (swp_offset(entry) + delta)));
 
 	if (pte_swp_soft_dirty(pte))
 		new = pte_swp_mksoft_dirty(new);
@@ -234,6 +237,20 @@  static inline pte_t pte_next_swp_offset(pte_t pte)
 	return new;
 }
 
+
+/**
+ * pte_next_swp_offset - Increment the swap entry offset field of a swap pte.
+ * @pte: The initial pte state; is_swap_pte(pte) must be true and
+ *	 non_swap_entry() must be false.
+ *
+ * Increments the swap offset, while maintaining all other fields, including
+ * swap type, and any swp pte bits. The resulting pte is returned.
+ */
+static inline pte_t pte_next_swp_offset(pte_t pte)
+{
+	return pte_move_swp_offset(pte, 1);
+}
+
 /**
  * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
  * @start_ptep: Page table pointer for the first entry.