
[v2,1/5] mm: swap: introduce swap_free_nr() for batched swap_free()

Message ID 20240409082631.187483-2-21cnbao@gmail.com (mailing list archive)
State New
Series: large folios swap-in: handle refault cases first

Commit Message

Barry Song April 9, 2024, 8:26 a.m. UTC
From: Chuanhua Han <hanchuanhua@oppo.com>

While swapping in a large folio, we need to free swaps related to the whole
folio. To avoid frequently acquiring and releasing swap locks, it is better
to introduce an API for batched free.

Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 include/linux/swap.h |  5 +++++
 mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)
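
For context, the effect on a caller in the swap-in path is to replace a
per-entry loop with a single call. A minimal usage sketch (the caller below
is hypothetical and not part of this patch; it assumes 'entry' is the swap
entry of the folio's first subpage and 'nr_pages' its number of subpages):

	/* 'entry' and 'nr_pages' are provided by the (hypothetical) caller */
	int i;

	/* before: nr_pages separate calls, each taking and dropping the locks */
	for (i = 0; i < nr_pages; i++)
		swap_free(swp_entry(swp_type(entry), swp_offset(entry) + i));

	/* after: one batched call covering the whole large folio */
	swap_free_nr(entry, nr_pages);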

Comments

SeongJae Park April 10, 2024, 11:37 p.m. UTC | #1
Hi Barry,

On Tue,  9 Apr 2024 20:26:27 +1200 Barry Song <21cnbao@gmail.com> wrote:

> From: Chuanhua Han <hanchuanhua@oppo.com>
> 
> While swapping in a large folio, we need to free swaps related to the whole
> folio. To avoid frequently acquiring and releasing swap locks, it is better
> to introduce an API for batched free.
> 
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  include/linux/swap.h |  5 +++++
>  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 56 insertions(+)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 11c53692f65f..b7a107e983b8 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
[...]
> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> +}

I found the latest mm-unstable fails to build when CONFIG_SWAP is not set,
with errors including the one below, and 'git bisect' points to this patch.

    do_mounts.c:(.text+0x6): multiple definition of `swap_free_nr'; init/main.o:main.c:(.text+0x9c): first defined here

I think this should be defined as 'static inline'?  I confirmed adding the two
keywords as below fixes the build failure.

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bf5090de0fd..5fd60d733ba8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -562,7 +562,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }

-void swap_free_nr(swp_entry_t entry, int nr_pages)
+static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
 {
 }
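
(For reference: a function body defined in a header without 'static inline'
is emitted in every compilation unit that includes the header, which is why
the linker reports multiple definitions. A tiny illustration with
hypothetical files, not taken from this thread:)

	/* foo.h */
	void foo(void) {}			/* one definition per .c that includes foo.h */
	static inline void bar(void) {}		/* each unit keeps its own private copy */

	/* If a.c and b.c both include foo.h, linking fails with
	 * "multiple definition of `foo'", while bar() links fine. */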


Thanks,
SJ

[...]
Barry Song April 11, 2024, 1:27 a.m. UTC | #2
On Thu, Apr 11, 2024 at 11:38 AM SeongJae Park <sj@kernel.org> wrote:
>
> Hi Barry,
>
> On Tue,  9 Apr 2024 20:26:27 +1200 Barry Song <21cnbao@gmail.com> wrote:
>
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > While swapping in a large folio, we need to free swaps related to the whole
> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> > to introduce an API for batched free.
> >
> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  include/linux/swap.h |  5 +++++
> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 56 insertions(+)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 11c53692f65f..b7a107e983b8 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> [...]
> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> > +{
> > +}
>
> I found the latest mm-unstable fails build when CONFIG_SWAP is not set with
> errors including below, and 'git bisect' points this patch.
>
>     do_mounts.c:(.text+0x6): multiple definition of `swap_free_nr'; init/main.o:main.c:(.text+0x9c): first defined here
>
> I think this should be defined as 'static inline'?  I confirmed adding the two
> keywords as below fixes the build failure.

Definitely yes.
It's highly likely that it was an oversight; we definitely meant
static inline for !CONFIG_SWAP.
Thanks! Will fix it in v3.


>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 4bf5090de0fd..5fd60d733ba8 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -562,7 +562,7 @@ static inline void swap_free(swp_entry_t swp)
>  {
>  }
>
> -void swap_free_nr(swp_entry_t entry, int nr_pages)
> +static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
>  {
>  }
>
>
> Thanks,
> SJ
>
Thanks
Barry
Ryan Roberts April 11, 2024, 2:30 p.m. UTC | #3
On 09/04/2024 09:26, Barry Song wrote:
> From: Chuanhua Han <hanchuanhua@oppo.com>
> 
> While swapping in a large folio, we need to free swaps related to the whole
> folio. To avoid frequently acquiring and releasing swap locks, it is better
> to introduce an API for batched free.
> 
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>

Couple of nits; feel free to ignore.

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  include/linux/swap.h |  5 +++++
>  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 56 insertions(+)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 11c53692f65f..b7a107e983b8 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>  extern int swap_duplicate(swp_entry_t);
>  extern int swapcache_prepare(swp_entry_t);
>  extern void swap_free(swp_entry_t);
> +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>  int swap_type_of(dev_t device, sector_t offset);
> @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>  {
>  }
>  
> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> +}
> +
>  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>  {
>  }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 28642c188c93..f4c65aeb088d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
>  		__swap_entry_free(p, entry);
>  }
>  
> +/*
> + * Free up the maximum number of swap entries at once to limit the
> + * maximum kernel stack usage.
> + */
> +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> +
> +/*
> + * Called after swapping in a large folio, batched free swap entries
> + * for this large folio, entry should be for the first subpage and
> + * its offset is aligned with nr_pages
> + */
> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> +	int i, j;
> +	struct swap_cluster_info *ci;
> +	struct swap_info_struct *p;
> +	unsigned int type = swp_type(entry);
> +	unsigned long offset = swp_offset(entry);
> +	int batch_nr, remain_nr;
> +	DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> +
> +	/* all swap entries are within a cluster for mTHP */
> +	VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> +
> +	if (nr_pages == 1) {
> +		swap_free(entry);
> +		return;
> +	}
> +
> +	remain_nr = nr_pages;
> +	p = _swap_info_get(entry);
> +	if (p) {

nit: perhaps return early if (!p) ? Then you dedent the for() block.
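
A minimal sketch of that shape (same names as in the patch, batch body
unchanged):

	p = _swap_info_get(entry);
	if (!p)
		return;

	for (i = 0; i < nr_pages; i += batch_nr) {
		/* ... batch body exactly as in the patch ... */
	}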

> +		for (i = 0; i < nr_pages; i += batch_nr) {
> +			batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> +
> +			ci = lock_cluster_or_swap_info(p, offset);
> +			for (j = 0; j < batch_nr; j++) {
> +				if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> +					__bitmap_set(usage, j, 1);
> +			}
> +			unlock_cluster_or_swap_info(p, ci);
> +
> +			for_each_clear_bit(j, usage, batch_nr)
> +				free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> +

nit: perhaps change to for (;;), and do the checks here to avoid clearing the
bitmap on the last run:

			i += batch_nr;
			if (i < nr_pages)
				break;

> +			bitmap_clear(usage, 0, SWAP_BATCH_NR);
> +			remain_nr -= batch_nr;
> +		}
> +	}
> +}
> +
>  /*
>   * Called after dropping swapcache to decrease refcnt to swap entries.
>   */
Chuanhua Han April 12, 2024, 2:07 a.m. UTC | #4
Ryan Roberts <ryan.roberts@arm.com> wrote on Thu, Apr 11, 2024 at 22:30:
>
> On 09/04/2024 09:26, Barry Song wrote:
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > While swapping in a large folio, we need to free swaps related to the whole
> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> > to introduce an API for batched free.
> >
> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>
> Couple of nits; feel free to ignore.
>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>
> > ---
> >  include/linux/swap.h |  5 +++++
> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 56 insertions(+)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 11c53692f65f..b7a107e983b8 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >  extern int swap_duplicate(swp_entry_t);
> >  extern int swapcache_prepare(swp_entry_t);
> >  extern void swap_free(swp_entry_t);
> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >  int swap_type_of(dev_t device, sector_t offset);
> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> >  {
> >  }
> >
> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> > +{
> > +}
> > +
> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >  {
> >  }
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 28642c188c93..f4c65aeb088d 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> >               __swap_entry_free(p, entry);
> >  }
> >
> > +/*
> > + * Free up the maximum number of swap entries at once to limit the
> > + * maximum kernel stack usage.
> > + */
> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> > +
> > +/*
> > + * Called after swapping in a large folio, batched free swap entries
> > + * for this large folio, entry should be for the first subpage and
> > + * its offset is aligned with nr_pages
> > + */
> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> > +{
> > +     int i, j;
> > +     struct swap_cluster_info *ci;
> > +     struct swap_info_struct *p;
> > +     unsigned int type = swp_type(entry);
> > +     unsigned long offset = swp_offset(entry);
> > +     int batch_nr, remain_nr;
> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> > +
> > +     /* all swap entries are within a cluster for mTHP */
> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> > +
> > +     if (nr_pages == 1) {
> > +             swap_free(entry);
> > +             return;
> > +     }
> > +
> > +     remain_nr = nr_pages;
> > +     p = _swap_info_get(entry);
> > +     if (p) {
>
> nit: perhaps return early if (!p) ? Then you dedent the for() block.

Agreed!

>
> > +             for (i = 0; i < nr_pages; i += batch_nr) {
> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> > +
> > +                     ci = lock_cluster_or_swap_info(p, offset);
> > +                     for (j = 0; j < batch_nr; j++) {
> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> > +                                     __bitmap_set(usage, j, 1);
> > +                     }
> > +                     unlock_cluster_or_swap_info(p, ci);
> > +
> > +                     for_each_clear_bit(j, usage, batch_nr)
> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> > +
>
> nit: perhaps change to for (;;), and do the checks here to avoid clearing the
> bitmap on the last run:
>
>                         i += batch_nr;
>                         if (i < nr_pages)
>                                 break;
Great, thank you for your advice!
>
> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> > +                     remain_nr -= batch_nr;
> > +             }
> > +     }
> > +}
> > +
> >  /*
> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >   */
>
>
Ryan Roberts April 12, 2024, 11:28 a.m. UTC | #5
On 12/04/2024 03:07, Chuanhua Han wrote:
> Ryan Roberts <ryan.roberts@arm.com> wrote on Thu, Apr 11, 2024 at 22:30:
>>
>> On 09/04/2024 09:26, Barry Song wrote:
>>> From: Chuanhua Han <hanchuanhua@oppo.com>
>>>
>>> While swapping in a large folio, we need to free swaps related to the whole
>>> folio. To avoid frequently acquiring and releasing swap locks, it is better
>>> to introduce an API for batched free.
>>>
>>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>
>> Couple of nits; feel free to ignore.
>>
>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>
>>> ---
>>>  include/linux/swap.h |  5 +++++
>>>  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>>>  2 files changed, 56 insertions(+)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 11c53692f65f..b7a107e983b8 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>>>  extern int swap_duplicate(swp_entry_t);
>>>  extern int swapcache_prepare(swp_entry_t);
>>>  extern void swap_free(swp_entry_t);
>>> +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>>>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>>>  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>>>  int swap_type_of(dev_t device, sector_t offset);
>>> @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>>>  {
>>>  }
>>>
>>> +void swap_free_nr(swp_entry_t entry, int nr_pages)
>>> +{
>>> +}
>>> +
>>>  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>>>  {
>>>  }
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index 28642c188c93..f4c65aeb088d 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
>>>               __swap_entry_free(p, entry);
>>>  }
>>>
>>> +/*
>>> + * Free up the maximum number of swap entries at once to limit the
>>> + * maximum kernel stack usage.
>>> + */
>>> +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
>>> +
>>> +/*
>>> + * Called after swapping in a large folio, batched free swap entries
>>> + * for this large folio, entry should be for the first subpage and
>>> + * its offset is aligned with nr_pages
>>> + */
>>> +void swap_free_nr(swp_entry_t entry, int nr_pages)
>>> +{
>>> +     int i, j;
>>> +     struct swap_cluster_info *ci;
>>> +     struct swap_info_struct *p;
>>> +     unsigned int type = swp_type(entry);
>>> +     unsigned long offset = swp_offset(entry);
>>> +     int batch_nr, remain_nr;
>>> +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
>>> +
>>> +     /* all swap entries are within a cluster for mTHP */
>>> +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>>> +
>>> +     if (nr_pages == 1) {
>>> +             swap_free(entry);
>>> +             return;
>>> +     }
>>> +
>>> +     remain_nr = nr_pages;
>>> +     p = _swap_info_get(entry);
>>> +     if (p) {
>>
>> nit: perhaps return early if (!p) ? Then you dedent the for() block.
> 
> Agreed!
> 
>>
>>> +             for (i = 0; i < nr_pages; i += batch_nr) {
>>> +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
>>> +
>>> +                     ci = lock_cluster_or_swap_info(p, offset);
>>> +                     for (j = 0; j < batch_nr; j++) {
>>> +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
>>> +                                     __bitmap_set(usage, j, 1);
>>> +                     }
>>> +                     unlock_cluster_or_swap_info(p, ci);
>>> +
>>> +                     for_each_clear_bit(j, usage, batch_nr)
>>> +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
>>> +
>>
>> nit: perhaps change to for (;;), and do the checks here to avoid clearing the
>> bitmap on the last run:
>>
>>                         i += batch_nr;
>>                         if (i < nr_pages)
>>                                 break;
> Great, thank you for your advice!

Or maybe leave the for() as is, but don't explicitly init the bitmap at the
start of the function and instead call:

	bitmap_clear(usage, 0, SWAP_BATCH_NR);

At the start of each loop?

>>
>>> +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
>>> +                     remain_nr -= batch_nr;
>>> +             }
>>> +     }
>>> +}
>>> +
>>>  /*
>>>   * Called after dropping swapcache to decrease refcnt to swap entries.
>>>   */
>>
>>
> 
>
Chuanhua Han April 12, 2024, 11:38 a.m. UTC | #6
Ryan Roberts <ryan.roberts@arm.com> wrote on Fri, Apr 12, 2024 at 19:28:
>
> On 12/04/2024 03:07, Chuanhua Han wrote:
> > Ryan Roberts <ryan.roberts@arm.com> wrote on Thu, Apr 11, 2024 at 22:30:
> >>
> >> On 09/04/2024 09:26, Barry Song wrote:
> >>> From: Chuanhua Han <hanchuanhua@oppo.com>
> >>>
> >>> While swapping in a large folio, we need to free swaps related to the whole
> >>> folio. To avoid frequently acquiring and releasing swap locks, it is better
> >>> to introduce an API for batched free.
> >>>
> >>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>
> >> Couple of nits; feel free to ignore.
> >>
> >> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> >>
> >>> ---
> >>>  include/linux/swap.h |  5 +++++
> >>>  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >>>  2 files changed, 56 insertions(+)
> >>>
> >>> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >>> index 11c53692f65f..b7a107e983b8 100644
> >>> --- a/include/linux/swap.h
> >>> +++ b/include/linux/swap.h
> >>> @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >>>  extern int swap_duplicate(swp_entry_t);
> >>>  extern int swapcache_prepare(swp_entry_t);
> >>>  extern void swap_free(swp_entry_t);
> >>> +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >>>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >>>  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >>>  int swap_type_of(dev_t device, sector_t offset);
> >>> @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> >>>  {
> >>>  }
> >>>
> >>> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >>> +{
> >>> +}
> >>> +
> >>>  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >>>  {
> >>>  }
> >>> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >>> index 28642c188c93..f4c65aeb088d 100644
> >>> --- a/mm/swapfile.c
> >>> +++ b/mm/swapfile.c
> >>> @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> >>>               __swap_entry_free(p, entry);
> >>>  }
> >>>
> >>> +/*
> >>> + * Free up the maximum number of swap entries at once to limit the
> >>> + * maximum kernel stack usage.
> >>> + */
> >>> +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> >>> +
> >>> +/*
> >>> + * Called after swapping in a large folio, batched free swap entries
> >>> + * for this large folio, entry should be for the first subpage and
> >>> + * its offset is aligned with nr_pages
> >>> + */
> >>> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >>> +{
> >>> +     int i, j;
> >>> +     struct swap_cluster_info *ci;
> >>> +     struct swap_info_struct *p;
> >>> +     unsigned int type = swp_type(entry);
> >>> +     unsigned long offset = swp_offset(entry);
> >>> +     int batch_nr, remain_nr;
> >>> +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> >>> +
> >>> +     /* all swap entries are within a cluster for mTHP */
> >>> +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >>> +
> >>> +     if (nr_pages == 1) {
> >>> +             swap_free(entry);
> >>> +             return;
> >>> +     }
> >>> +
> >>> +     remain_nr = nr_pages;
> >>> +     p = _swap_info_get(entry);
> >>> +     if (p) {
> >>
> >> nit: perhaps return early if (!p) ? Then you dedent the for() block.
> >
> > Agreed!
> >
> >>
> >>> +             for (i = 0; i < nr_pages; i += batch_nr) {
> >>> +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> >>> +
> >>> +                     ci = lock_cluster_or_swap_info(p, offset);
> >>> +                     for (j = 0; j < batch_nr; j++) {
> >>> +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> >>> +                                     __bitmap_set(usage, j, 1);
> >>> +                     }
> >>> +                     unlock_cluster_or_swap_info(p, ci);
> >>> +
> >>> +                     for_each_clear_bit(j, usage, batch_nr)
> >>> +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> >>> +
> >>
> >> nit: perhaps change to for (;;), and do the checks here to avoid clearing the
> >> bitmap on the last run:
> >>
> >>                         i += batch_nr;
> >>                         if (i < nr_pages)
> >>                                 break;
Should be:
if (i >= nr_pages)
    break;
> > Great, thank you for your advice!
>
> Or maybe leave the for() as is, but don't explicitly init the bitmap at the
> start of the function and instead call:
>
>         bitmap_clear(usage, 0, SWAP_BATCH_NR);
>
> At the start of each loop?
Yeah, that's OK. Actually these two ways are similar; both reduce the
number of bitmap_clear() calls.
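
For reference, a rough sketch of the two variants being discussed, with the
exit check as corrected above (batch body elided, names as in the patch):

	/* variant 1: for (;;), check after freeing, skip the final clear */
	for (i = 0; ; ) {
		batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
		/* ... lock, __swap_entry_free_locked() each entry, unlock,
		 * then free_swap_slot() the entries that dropped to zero ... */
		i += batch_nr;
		remain_nr -= batch_nr;
		if (i >= nr_pages)
			break;
		bitmap_clear(usage, 0, SWAP_BATCH_NR);
	}

	/* variant 2: keep the for (), drop the up-front bitmap init and
	 * clear at the top of every iteration instead */
	for (i = 0; i < nr_pages; i += batch_nr) {
		bitmap_clear(usage, 0, SWAP_BATCH_NR);
		batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
		/* ... same batch body ... */
		remain_nr -= batch_nr;
	}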
>
> >>
> >>> +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> >>> +                     remain_nr -= batch_nr;
> >>> +             }
> >>> +     }
> >>> +}
> >>> +
> >>>  /*
> >>>   * Called after dropping swapcache to decrease refcnt to swap entries.
> >>>   */
> >>
> >>
> >
> >
>
Huang, Ying April 15, 2024, 6:17 a.m. UTC | #7
Barry Song <21cnbao@gmail.com> writes:

> From: Chuanhua Han <hanchuanhua@oppo.com>
>
> While swapping in a large folio, we need to free swaps related to the whole
> folio. To avoid frequently acquiring and releasing swap locks, it is better
> to introduce an API for batched free.
>
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  include/linux/swap.h |  5 +++++
>  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 56 insertions(+)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 11c53692f65f..b7a107e983b8 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>  extern int swap_duplicate(swp_entry_t);
>  extern int swapcache_prepare(swp_entry_t);
>  extern void swap_free(swp_entry_t);
> +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>  int swap_type_of(dev_t device, sector_t offset);
> @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>  {
>  }
>  
> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> +}
> +
>  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>  {
>  }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 28642c188c93..f4c65aeb088d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
>  		__swap_entry_free(p, entry);
>  }
>  
> +/*
> + * Free up the maximum number of swap entries at once to limit the
> + * maximum kernel stack usage.
> + */
> +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> +
> +/*
> + * Called after swapping in a large folio,

IMHO, it's not good to document the caller in the function definition,
because this will discourage function reuse.

> batched free swap entries
> + * for this large folio, entry should be for the first subpage and
> + * its offset is aligned with nr_pages

Why do we need this?

> + */
> +void swap_free_nr(swp_entry_t entry, int nr_pages)
> +{
> +	int i, j;
> +	struct swap_cluster_info *ci;
> +	struct swap_info_struct *p;
> +	unsigned int type = swp_type(entry);
> +	unsigned long offset = swp_offset(entry);
> +	int batch_nr, remain_nr;
> +	DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> +
> +	/* all swap entries are within a cluster for mTHP */
> +	VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> +
> +	if (nr_pages == 1) {
> +		swap_free(entry);
> +		return;
> +	}

Is it possible to unify swap_free() and swap_free_nr() into one function
with acceptable performance?  IIUC, the general rule in mTHP effort is
to avoid duplicate functions between mTHP and normal small folio.
Right?

> +
> +	remain_nr = nr_pages;
> +	p = _swap_info_get(entry);
> +	if (p) {
> +		for (i = 0; i < nr_pages; i += batch_nr) {
> +			batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> +
> +			ci = lock_cluster_or_swap_info(p, offset);
> +			for (j = 0; j < batch_nr; j++) {
> +				if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> +					__bitmap_set(usage, j, 1);
> +			}
> +			unlock_cluster_or_swap_info(p, ci);
> +
> +			for_each_clear_bit(j, usage, batch_nr)
> +				free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> +
> +			bitmap_clear(usage, 0, SWAP_BATCH_NR);
> +			remain_nr -= batch_nr;
> +		}
> +	}
> +}
> +
>  /*
>   * Called after dropping swapcache to decrease refcnt to swap entries.
>   */

put_swap_folio() implements batching in another way.  Do you think
that it's good to use the batching method of that function here?  It
avoids using bitmap operations and stack space.

--
Best Regards,
Huang, Ying
Barry Song April 15, 2024, 7:04 a.m. UTC | #8
On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > While swapping in a large folio, we need to free swaps related to the whole
> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> > to introduce an API for batched free.
> >
> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  include/linux/swap.h |  5 +++++
> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 56 insertions(+)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 11c53692f65f..b7a107e983b8 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >  extern int swap_duplicate(swp_entry_t);
> >  extern int swapcache_prepare(swp_entry_t);
> >  extern void swap_free(swp_entry_t);
> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >  int swap_type_of(dev_t device, sector_t offset);
> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> >  {
> >  }
> >
> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> > +{
> > +}
> > +
> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >  {
> >  }
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 28642c188c93..f4c65aeb088d 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> >               __swap_entry_free(p, entry);
> >  }
> >
> > +/*
> > + * Free up the maximum number of swap entries at once to limit the
> > + * maximum kernel stack usage.
> > + */
> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> > +
> > +/*
> > + * Called after swapping in a large folio,
>
> IMHO, it's not good to document the caller in the function definition.
> Because this will discourage function reusing.

OK. Right now there is only one user; that is why it was added. But I agree
we can actually remove this.

>
> > batched free swap entries
> > + * for this large folio, entry should be for the first subpage and
> > + * its offset is aligned with nr_pages
>
> Why do we need this?

This is a fundamental requirement of the existing kernel: a folio's
swap offset is naturally aligned from the moment add_to_swap() adds
it to the swap cache's xarray, so this comment is describing an
existing fact. In the future, if we want to support swapping a folio
out to discontiguous and unaligned offsets, we can't pass entry as the
parameter; we should instead pass ptep or another data structure which
can connect multiple discontiguous swap offsets.

I feel like we only need "for this large folio, entry should be for
the first subpage" and can drop "and its offset is aligned with
nr_pages"; the latter is not important to this context at all.
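
In other words (illustrative only, reusing the names from the patch): because
the first subpage's offset is a multiple of nr_pages, the whole range stays
inside one cluster, which is exactly what the VM_BUG_ON asserts:

	unsigned long offset = swp_offset(entry);

	/* holds because the folio's swap offset is naturally aligned, as above */
	VM_BUG_ON(!IS_ALIGNED(offset, nr_pages));
	/* hence all nr_pages entries sit within a single cluster */
	VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);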

>
> > + */
> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> > +{
> > +     int i, j;
> > +     struct swap_cluster_info *ci;
> > +     struct swap_info_struct *p;
> > +     unsigned int type = swp_type(entry);
> > +     unsigned long offset = swp_offset(entry);
> > +     int batch_nr, remain_nr;
> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> > +
> > +     /* all swap entries are within a cluster for mTHP */
> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> > +
> > +     if (nr_pages == 1) {
> > +             swap_free(entry);
> > +             return;
> > +     }
>
> Is it possible to unify swap_free() and swap_free_nr() into one function
> with acceptable performance?  IIUC, the general rule in mTHP effort is
> to avoid duplicate functions between mTHP and normal small folio.
> Right?

I don't see why. but we have lots of places calling swap_free(); we may
have to change them all to call swap_free_nr(entry, 1). The other possible
way is making swap_free() a wrapper of swap_free_nr() that always uses
1 as the argument. In both cases, we are changing the semantics of
swap_free_nr() to cover partially freeing a large folio, and we would have
to drop "entry should be for the first subpage" then.

Right now, the semantics are:
* swap_free_nr() for an entire large folio;
* swap_free() for one entry of either a large folio or a small folio
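
One way to spell the wrapper option (just a sketch of what that change would
look like):

	static inline void swap_free(swp_entry_t entry)
	{
		swap_free_nr(entry, 1);
	}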

>
> > +
> > +     remain_nr = nr_pages;
> > +     p = _swap_info_get(entry);
> > +     if (p) {
> > +             for (i = 0; i < nr_pages; i += batch_nr) {
> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> > +
> > +                     ci = lock_cluster_or_swap_info(p, offset);
> > +                     for (j = 0; j < batch_nr; j++) {
> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> > +                                     __bitmap_set(usage, j, 1);
> > +                     }
> > +                     unlock_cluster_or_swap_info(p, ci);
> > +
> > +                     for_each_clear_bit(j, usage, batch_nr)
> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> > +
> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> > +                     remain_nr -= batch_nr;
> > +             }
> > +     }
> > +}
> > +
> >  /*
> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >   */
>
> put_swap_folio() implements batching in another method.  Do you think
> that it's good to use the batching method in that function here?  It
> avoids to use bitmap operations and stack space.

Chuanhua has strictly limited the maximum stack usage to a few
unsigned longs, so this should be safe. On the other hand, I believe this
implementation is more efficient, as put_swap_folio() might lock/unlock
much more often whenever __swap_entry_free_locked() returns 0.

>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
Barry Song April 15, 2024, 8:06 a.m. UTC | #9
On Mon, Apr 15, 2024 at 7:04 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Barry Song <21cnbao@gmail.com> writes:
> >
> > > From: Chuanhua Han <hanchuanhua@oppo.com>
> > >
> > > While swapping in a large folio, we need to free swaps related to the whole
> > > folio. To avoid frequently acquiring and releasing swap locks, it is better
> > > to introduce an API for batched free.
> > >
> > > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > ---
> > >  include/linux/swap.h |  5 +++++
> > >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 56 insertions(+)
> > >
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index 11c53692f65f..b7a107e983b8 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> > >  extern int swap_duplicate(swp_entry_t);
> > >  extern int swapcache_prepare(swp_entry_t);
> > >  extern void swap_free(swp_entry_t);
> > > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> > >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> > >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> > >  int swap_type_of(dev_t device, sector_t offset);
> > > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> > >  {
> > >  }
> > >
> > > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> > > +{
> > > +}
> > > +
> > >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> > >  {
> > >  }
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 28642c188c93..f4c65aeb088d 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> > >               __swap_entry_free(p, entry);
> > >  }
> > >
> > > +/*
> > > + * Free up the maximum number of swap entries at once to limit the
> > > + * maximum kernel stack usage.
> > > + */
> > > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> > > +
> > > +/*
> > > + * Called after swapping in a large folio,
> >
> > IMHO, it's not good to document the caller in the function definition.
> > Because this will discourage function reusing.
>
> ok. right now there is only one user that is why it is added. but i agree
> we can actually remove this.
>
> >
> > > batched free swap entries
> > > + * for this large folio, entry should be for the first subpage and
> > > + * its offset is aligned with nr_pages
> >
> > Why do we need this?
>
> This is a fundamental requirement for the existing kernel, folio's
> swap offset is naturally aligned from the first moment add_to_swap
> to add swapcache's xa. so this comment is describing the existing
> fact. In the future, if we want to support swap-out folio to discontiguous
> and not-aligned offsets, we can't pass entry as the parameter, we should
> instead pass ptep or another different data struct which can connect
> multiple discontiguous swap offsets.
>
> I feel like we only need "for this large folio, entry should be for
> the first subpage" and drop "and its offset is aligned with nr_pages",
> the latter is not important to this context at all.

Upon further consideration, the comment is inaccurate since we do support
nr_pages == 1, and do_swap_page() does indeed call this with that value.
Therefore, we should completely remove the comment.

>
> >
> > > + */
> > > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> > > +{
> > > +     int i, j;
> > > +     struct swap_cluster_info *ci;
> > > +     struct swap_info_struct *p;
> > > +     unsigned int type = swp_type(entry);
> > > +     unsigned long offset = swp_offset(entry);
> > > +     int batch_nr, remain_nr;
> > > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> > > +
> > > +     /* all swap entries are within a cluster for mTHP */
> > > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> > > +
> > > +     if (nr_pages == 1) {
> > > +             swap_free(entry);
> > > +             return;
> > > +     }
> >
> > Is it possible to unify swap_free() and swap_free_nr() into one function
> > with acceptable performance?  IIUC, the general rule in mTHP effort is
> > to avoid duplicate functions between mTHP and normal small folio.
> > Right?
>
> I don't see why. but we have lots of places calling swap_free(), we may
> have to change them all to call swap_free_nr(entry, 1); the other possible
> way is making swap_free() a wrapper of swap_free_nr() always using
> 1 as the argument. In both cases, we are changing the semantics of
> swap_free_nr() to partially freeing large folio cases and have to drop
> "entry should be for the first subpage" then.
>
> Right now, the semantics is
> * swap_free_nr() for an entire large folio;
> * swap_free() for one entry of either a large folio or a small folio
>
> >
> > > +
> > > +     remain_nr = nr_pages;
> > > +     p = _swap_info_get(entry);
> > > +     if (p) {
> > > +             for (i = 0; i < nr_pages; i += batch_nr) {
> > > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> > > +
> > > +                     ci = lock_cluster_or_swap_info(p, offset);
> > > +                     for (j = 0; j < batch_nr; j++) {
> > > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> > > +                                     __bitmap_set(usage, j, 1);
> > > +                     }
> > > +                     unlock_cluster_or_swap_info(p, ci);
> > > +
> > > +                     for_each_clear_bit(j, usage, batch_nr)
> > > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> > > +
> > > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> > > +                     remain_nr -= batch_nr;
> > > +             }
> > > +     }
> > > +}
> > > +
> > >  /*
> > >   * Called after dropping swapcache to decrease refcnt to swap entries.
> > >   */
> >
> > put_swap_folio() implements batching in another method.  Do you think
> > that it's good to use the batching method in that function here?  It
> > avoids to use bitmap operations and stack space.
>
> Chuanhua has strictly limited the maximum stack usage to several
> unsigned long, so this should be safe. on the other hand, i believe this
> implementation is more efficient, as  put_swap_folio() might lock/
> unlock much more often whenever __swap_entry_free_locked returns
> 0.
>
> >
> > --
> > Best Regards,
> > Huang, Ying
>
> Thanks
> Barry
Huang, Ying April 15, 2024, 8:19 a.m. UTC | #10
Barry Song <21cnbao@gmail.com> writes:

> On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > From: Chuanhua Han <hanchuanhua@oppo.com>
>> >
>> > While swapping in a large folio, we need to free swaps related to the whole
>> > folio. To avoid frequently acquiring and releasing swap locks, it is better
>> > to introduce an API for batched free.
>> >
>> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> > ---
>> >  include/linux/swap.h |  5 +++++
>> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>> >  2 files changed, 56 insertions(+)
>> >
>> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> > index 11c53692f65f..b7a107e983b8 100644
>> > --- a/include/linux/swap.h
>> > +++ b/include/linux/swap.h
>> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>> >  extern int swap_duplicate(swp_entry_t);
>> >  extern int swapcache_prepare(swp_entry_t);
>> >  extern void swap_free(swp_entry_t);
>> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>> >  int swap_type_of(dev_t device, sector_t offset);
>> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>> >  {
>> >  }
>> >
>> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> > +{
>> > +}
>> > +
>> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>> >  {
>> >  }
>> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> > index 28642c188c93..f4c65aeb088d 100644
>> > --- a/mm/swapfile.c
>> > +++ b/mm/swapfile.c
>> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
>> >               __swap_entry_free(p, entry);
>> >  }
>> >
>> > +/*
>> > + * Free up the maximum number of swap entries at once to limit the
>> > + * maximum kernel stack usage.
>> > + */
>> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
>> > +
>> > +/*
>> > + * Called after swapping in a large folio,
>>
>> IMHO, it's not good to document the caller in the function definition.
>> Because this will discourage function reusing.
>
> ok. right now there is only one user that is why it is added. but i agree
> we can actually remove this.
>
>>
>> > batched free swap entries
>> > + * for this large folio, entry should be for the first subpage and
>> > + * its offset is aligned with nr_pages
>>
>> Why do we need this?
>
> This is a fundamental requirement for the existing kernel, folio's
> swap offset is naturally aligned from the first moment add_to_swap
> to add swapcache's xa. so this comment is describing the existing
> fact. In the future, if we want to support swap-out folio to discontiguous
> and not-aligned offsets, we can't pass entry as the parameter, we should
> instead pass ptep or another different data struct which can connect
> multiple discontiguous swap offsets.
>
> I feel like we only need "for this large folio, entry should be for
> the first subpage" and drop "and its offset is aligned with nr_pages",
> the latter is not important to this context at all.

IIUC, all these are requirements of the only caller now, not the
function itself.  If swap_free_nr() is called for only part of the swap
entries of an mTHP, can it still do its work?  If so, why not make
swap_free_nr() as general as possible?

>>
>> > + */
>> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> > +{
>> > +     int i, j;
>> > +     struct swap_cluster_info *ci;
>> > +     struct swap_info_struct *p;
>> > +     unsigned int type = swp_type(entry);
>> > +     unsigned long offset = swp_offset(entry);
>> > +     int batch_nr, remain_nr;
>> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
>> > +
>> > +     /* all swap entries are within a cluster for mTHP */
>> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>> > +
>> > +     if (nr_pages == 1) {
>> > +             swap_free(entry);
>> > +             return;
>> > +     }
>>
>> Is it possible to unify swap_free() and swap_free_nr() into one function
>> with acceptable performance?  IIUC, the general rule in mTHP effort is
>> to avoid duplicate functions between mTHP and normal small folio.
>> Right?
>
> I don't see why.

Because duplicated implementations are hard to maintain in the long term.

> but we have lots of places calling swap_free(), we may
> have to change them all to call swap_free_nr(entry, 1); the other possible
> way is making swap_free() a wrapper of swap_free_nr() always using
> 1 as the argument. In both cases, we are changing the semantics of
> swap_free_nr() to partially freeing large folio cases and have to drop
> "entry should be for the first subpage" then.
>
> Right now, the semantics is
> * swap_free_nr() for an entire large folio;
> * swap_free() for one entry of either a large folio or a small folio

As above, I don't think these semantics are important for the
swap_free_nr() implementation.

>>
>> > +
>> > +     remain_nr = nr_pages;
>> > +     p = _swap_info_get(entry);
>> > +     if (p) {
>> > +             for (i = 0; i < nr_pages; i += batch_nr) {
>> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
>> > +
>> > +                     ci = lock_cluster_or_swap_info(p, offset);
>> > +                     for (j = 0; j < batch_nr; j++) {
>> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
>> > +                                     __bitmap_set(usage, j, 1);
>> > +                     }
>> > +                     unlock_cluster_or_swap_info(p, ci);
>> > +
>> > +                     for_each_clear_bit(j, usage, batch_nr)
>> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
>> > +
>> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
>> > +                     remain_nr -= batch_nr;
>> > +             }
>> > +     }
>> > +}
>> > +
>> >  /*
>> >   * Called after dropping swapcache to decrease refcnt to swap entries.
>> >   */
>>
>> put_swap_folio() implements batching in another method.  Do you think
>> that it's good to use the batching method in that function here?  It
>> avoids to use bitmap operations and stack space.
>
> Chuanhua has strictly limited the maximum stack usage to several
> unsigned long,

512 / 8 = 64 bytes.

So, not trivial.

> so this should be safe. on the other hand, i believe this
> implementation is more efficient, as  put_swap_folio() might lock/
> unlock much more often whenever __swap_entry_free_locked returns
> 0.

There are two most common use cases:

- all swap entries have usage count == 0
- all swap entries have usage count != 0

In both cases, we only need to lock/unlock once.  In fact, I didn't
find possible use cases other than these.

And, we should add batching in __swap_entry_free().  That will help
free_swap_and_cache_nr() too.

--
Best Regards,
Huang, Ying
Barry Song April 15, 2024, 8:34 a.m. UTC | #11
On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >> >
> >> > While swapping in a large folio, we need to free swaps related to the whole
> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> >> > to introduce an API for batched free.
> >> >
> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >> > ---
> >> >  include/linux/swap.h |  5 +++++
> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >> >  2 files changed, 56 insertions(+)
> >> >
> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> > index 11c53692f65f..b7a107e983b8 100644
> >> > --- a/include/linux/swap.h
> >> > +++ b/include/linux/swap.h
> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >> >  extern int swap_duplicate(swp_entry_t);
> >> >  extern int swapcache_prepare(swp_entry_t);
> >> >  extern void swap_free(swp_entry_t);
> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >> >  int swap_type_of(dev_t device, sector_t offset);
> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> >> >  {
> >> >  }
> >> >
> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> > +{
> >> > +}
> >> > +
> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >> >  {
> >> >  }
> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> > index 28642c188c93..f4c65aeb088d 100644
> >> > --- a/mm/swapfile.c
> >> > +++ b/mm/swapfile.c
> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> >> >               __swap_entry_free(p, entry);
> >> >  }
> >> >
> >> > +/*
> >> > + * Free up the maximum number of swap entries at once to limit the
> >> > + * maximum kernel stack usage.
> >> > + */
> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> >> > +
> >> > +/*
> >> > + * Called after swapping in a large folio,
> >>
> >> IMHO, it's not good to document the caller in the function definition.
> >> Because this will discourage function reusing.
> >
> > ok. right now there is only one user that is why it is added. but i agree
> > we can actually remove this.
> >
> >>
> >> > batched free swap entries
> >> > + * for this large folio, entry should be for the first subpage and
> >> > + * its offset is aligned with nr_pages
> >>
> >> Why do we need this?
> >
> > This is a fundamental requirement for the existing kernel, folio's
> > swap offset is naturally aligned from the first moment add_to_swap
> > to add swapcache's xa. so this comment is describing the existing
> > fact. In the future, if we want to support swap-out folio to discontiguous
> > and not-aligned offsets, we can't pass entry as the parameter, we should
> > instead pass ptep or another different data struct which can connect
> > multiple discontiguous swap offsets.
> >
> > I feel like we only need "for this large folio, entry should be for
> > the first subpage" and drop "and its offset is aligned with nr_pages",
> > the latter is not important to this context at all.
>
> IIUC, all these are requirements of the only caller now, not the
> function itself.  If only part of the all swap entries of a mTHP are
> called with swap_free_nr(), can swap_free_nr() still do its work?  If
> so, why not make swap_free_nr() as general as possible?

Right, I believe we can make swap_free_nr() as general as possible.

>
> >>
> >> > + */
> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> > +{
> >> > +     int i, j;
> >> > +     struct swap_cluster_info *ci;
> >> > +     struct swap_info_struct *p;
> >> > +     unsigned int type = swp_type(entry);
> >> > +     unsigned long offset = swp_offset(entry);
> >> > +     int batch_nr, remain_nr;
> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> >> > +
> >> > +     /* all swap entries are within a cluster for mTHP */
> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >> > +
> >> > +     if (nr_pages == 1) {
> >> > +             swap_free(entry);
> >> > +             return;
> >> > +     }
> >>
> >> Is it possible to unify swap_free() and swap_free_nr() into one function
> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
> >> to avoid duplicate functions between mTHP and normal small folio.
> >> Right?
> >
> > I don't see why.
>
> Because duplicated implementation are hard to maintain in the long term.

Sorry, I actually meant "I don't see why not"; for some reason, the "not"
was missed. Obviously I meant "why not"; there was a "but" after it :-)

>
> > but we have lots of places calling swap_free(), we may
> > have to change them all to call swap_free_nr(entry, 1); the other possible
> > way is making swap_free() a wrapper of swap_free_nr() always using
> > 1 as the argument. In both cases, we are changing the semantics of
> > swap_free_nr() to partially freeing large folio cases and have to drop
> > "entry should be for the first subpage" then.
> >
> > Right now, the semantics is
> > * swap_free_nr() for an entire large folio;
> > * swap_free() for one entry of either a large folio or a small folio
>
> As above, I don't think the these semantics are important for
> swap_free_nr() implementation.

Right, I agree. If we are ready to change all those callers, nothing
can stop us from removing swap_free().

>
> >>
> >> > +
> >> > +     remain_nr = nr_pages;
> >> > +     p = _swap_info_get(entry);
> >> > +     if (p) {
> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> >> > +
> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
> >> > +                     for (j = 0; j < batch_nr; j++) {
> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> >> > +                                     __bitmap_set(usage, j, 1);
> >> > +                     }
> >> > +                     unlock_cluster_or_swap_info(p, ci);
> >> > +
> >> > +                     for_each_clear_bit(j, usage, batch_nr)
> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> >> > +
> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> >> > +                     remain_nr -= batch_nr;
> >> > +             }
> >> > +     }
> >> > +}
> >> > +
> >> >  /*
> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >> >   */
> >>
> >> put_swap_folio() implements batching in another method.  Do you think
> >> that it's good to use the batching method in that function here?  It
> >> avoids to use bitmap operations and stack space.
> >
> > Chuanhua has strictly limited the maximum stack usage to several
> > unsigned long,
>
> 512 / 8 = 64 bytes.
>
> So, not trivial.
>
> > so this should be safe. on the other hand, i believe this
> > implementation is more efficient, as  put_swap_folio() might lock/
> > unlock much more often whenever __swap_entry_free_locked returns
> > 0.
>
> There are 2 most common use cases,
>
> - all swap entries have usage count == 0
> - all swap entries have usage count != 0
>
> In both cases, we only need to lock/unlock once.  In fact, I didn't
> find possible use cases other than above.

I guess the point is that free_swap_slot() shouldn't be called within
lock_cluster_or_swap_info()? So when we are freeing nr_pages slots,
we'll have to unlock and lock nr_pages times? And this is the most
common scenario.

void put_swap_folio(struct folio *folio, swp_entry_t entry)
{
        ...

        ci = lock_cluster_or_swap_info(si, offset);
        ...
        for (i = 0; i < size; i++, entry.val++) {
                if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
                        unlock_cluster_or_swap_info(si, ci);
                        free_swap_slot(entry);
                        if (i == size - 1)
                                return;
                        lock_cluster_or_swap_info(si, offset);
                }
        }
        unlock_cluster_or_swap_info(si, ci);
}

>
> And, we should add batching in __swap_entry_free().  That will help
> free_swap_and_cache_nr() too.
>
> --
> Best Regards,
> Huang, Ying
Huang, Ying April 15, 2024, 8:51 a.m. UTC | #12
Barry Song <21cnbao@gmail.com> writes:

> On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Barry Song <21cnbao@gmail.com> writes:
>> >>
>> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
>> >> >
>> >> > While swapping in a large folio, we need to free swaps related to the whole
>> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
>> >> > to introduce an API for batched free.
>> >> >
>> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> >> > ---
>> >> >  include/linux/swap.h |  5 +++++
>> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>> >> >  2 files changed, 56 insertions(+)
>> >> >
>> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >> > index 11c53692f65f..b7a107e983b8 100644
>> >> > --- a/include/linux/swap.h
>> >> > +++ b/include/linux/swap.h
>> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>> >> >  extern int swap_duplicate(swp_entry_t);
>> >> >  extern int swapcache_prepare(swp_entry_t);
>> >> >  extern void swap_free(swp_entry_t);
>> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>> >> >  int swap_type_of(dev_t device, sector_t offset);
>> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>> >> >  {
>> >> >  }
>> >> >
>> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> >> > +{
>> >> > +}
>> >> > +
>> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>> >> >  {
>> >> >  }
>> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> > index 28642c188c93..f4c65aeb088d 100644
>> >> > --- a/mm/swapfile.c
>> >> > +++ b/mm/swapfile.c
>> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
>> >> >               __swap_entry_free(p, entry);
>> >> >  }
>> >> >
>> >> > +/*
>> >> > + * Free up the maximum number of swap entries at once to limit the
>> >> > + * maximum kernel stack usage.
>> >> > + */
>> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
>> >> > +
>> >> > +/*
>> >> > + * Called after swapping in a large folio,
>> >>
>> >> IMHO, it's not good to document the caller in the function definition.
>> >> Because this will discourage function reusing.
>> >
>> > ok. right now there is only one user that is why it is added. but i agree
>> > we can actually remove this.
>> >
>> >>
>> >> > batched free swap entries
>> >> > + * for this large folio, entry should be for the first subpage and
>> >> > + * its offset is aligned with nr_pages
>> >>
>> >> Why do we need this?
>> >
>> > This is a fundamental requirement for the existing kernel, folio's
>> > swap offset is naturally aligned from the first moment add_to_swap
>> > to add swapcache's xa. so this comment is describing the existing
>> > fact. In the future, if we want to support swap-out folio to discontiguous
>> > and not-aligned offsets, we can't pass entry as the parameter, we should
>> > instead pass ptep or another different data struct which can connect
>> > multiple discontiguous swap offsets.
>> >
>> > I feel like we only need "for this large folio, entry should be for
>> > the first subpage" and drop "and its offset is aligned with nr_pages",
>> > the latter is not important to this context at all.
>>
>> IIUC, all these are requirements of the only caller now, not the
>> function itself.  If only part of the all swap entries of a mTHP are
>> called with swap_free_nr(), can swap_free_nr() still do its work?  If
>> so, why not make swap_free_nr() as general as possible?
>
> right , i believe we can make swap_free_nr() as general as possible.
>
>>
>> >>
>> >> > + */
>> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> >> > +{
>> >> > +     int i, j;
>> >> > +     struct swap_cluster_info *ci;
>> >> > +     struct swap_info_struct *p;
>> >> > +     unsigned int type = swp_type(entry);
>> >> > +     unsigned long offset = swp_offset(entry);
>> >> > +     int batch_nr, remain_nr;
>> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
>> >> > +
>> >> > +     /* all swap entries are within a cluster for mTHP */
>> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>> >> > +
>> >> > +     if (nr_pages == 1) {
>> >> > +             swap_free(entry);
>> >> > +             return;
>> >> > +     }
>> >>
>> >> Is it possible to unify swap_free() and swap_free_nr() into one function
>> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
>> >> to avoid duplicate functions between mTHP and normal small folio.
>> >> Right?
>> >
>> > I don't see why.
>>
>> Because duplicated implementation are hard to maintain in the long term.
>
> sorry, i actually meant "I don't see why not",  for some reason, the "not"
> was missed. Obviously I meant "why not", there was a "but" after it :-)
>
>>
>> > but we have lots of places calling swap_free(), we may
>> > have to change them all to call swap_free_nr(entry, 1); the other possible
>> > way is making swap_free() a wrapper of swap_free_nr() always using
>> > 1 as the argument. In both cases, we are changing the semantics of
>> > swap_free_nr() to partially freeing large folio cases and have to drop
>> > "entry should be for the first subpage" then.
>> >
>> > Right now, the semantics is
>> > * swap_free_nr() for an entire large folio;
>> > * swap_free() for one entry of either a large folio or a small folio
>>
>> As above, I don't think the these semantics are important for
>> swap_free_nr() implementation.
>
> right. I agree. If we are ready to change all those callers, nothing
> can stop us from removing swap_free().
>
>>
>> >>
>> >> > +
>> >> > +     remain_nr = nr_pages;
>> >> > +     p = _swap_info_get(entry);
>> >> > +     if (p) {
>> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
>> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
>> >> > +
>> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
>> >> > +                     for (j = 0; j < batch_nr; j++) {
>> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
>> >> > +                                     __bitmap_set(usage, j, 1);
>> >> > +                     }
>> >> > +                     unlock_cluster_or_swap_info(p, ci);
>> >> > +
>> >> > +                     for_each_clear_bit(j, usage, batch_nr)
>> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
>> >> > +
>> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
>> >> > +                     remain_nr -= batch_nr;
>> >> > +             }
>> >> > +     }
>> >> > +}
>> >> > +
>> >> >  /*
>> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
>> >> >   */
>> >>
>> >> put_swap_folio() implements batching in another method.  Do you think
>> >> that it's good to use the batching method in that function here?  It
>> >> avoids to use bitmap operations and stack space.
>> >
>> > Chuanhua has strictly limited the maximum stack usage to several
>> > unsigned long,
>>
>> 512 / 8 = 64 bytes.
>>
>> So, not trivial.
>>
>> > so this should be safe. on the other hand, i believe this
>> > implementation is more efficient, as  put_swap_folio() might lock/
>> > unlock much more often whenever __swap_entry_free_locked returns
>> > 0.
>>
>> There are 2 most common use cases,
>>
>> - all swap entries have usage count == 0
>> - all swap entries have usage count != 0
>>
>> In both cases, we only need to lock/unlock once.  In fact, I didn't
>> find possible use cases other than above.
>
> i guess the point is free_swap_slot() shouldn't be called within
> lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
> we'll have to unlock and lock nr_pages times?  and this is the most
> common scenario.

No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
is, nr_pages) or 0.  These are the most common cases.

> void put_swap_folio(struct folio *folio, swp_entry_t entry)
> {
>         ...
>
>         ci = lock_cluster_or_swap_info(si, offset);
>         ...
>         for (i = 0; i < size; i++, entry.val++) {
>                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
>                         unlock_cluster_or_swap_info(si, ci);
>                         free_swap_slot(entry);
>                         if (i == size - 1)
>                                 return;
>                         lock_cluster_or_swap_info(si, offset);
>                 }
>         }
>         unlock_cluster_or_swap_info(si, ci);
> }
>
>>
>> And, we should add batching in __swap_entry_free().  That will help
>> free_swap_and_cache_nr() too.

Please consider this too.

--
Best Regards,
Huang, Ying
Barry Song April 15, 2024, 9:01 a.m. UTC | #13
On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >> >> >
> >> >> > While swapping in a large folio, we need to free swaps related to the whole
> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> >> >> > to introduce an API for batched free.
> >> >> >
> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >> >> > ---
> >> >> >  include/linux/swap.h |  5 +++++
> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >> >> >  2 files changed, 56 insertions(+)
> >> >> >
> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> >> > index 11c53692f65f..b7a107e983b8 100644
> >> >> > --- a/include/linux/swap.h
> >> >> > +++ b/include/linux/swap.h
> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >> >> >  extern int swap_duplicate(swp_entry_t);
> >> >> >  extern int swapcache_prepare(swp_entry_t);
> >> >> >  extern void swap_free(swp_entry_t);
> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >> >> >  int swap_type_of(dev_t device, sector_t offset);
> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> >> >> >  {
> >> >> >  }
> >> >> >
> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> >> > +{
> >> >> > +}
> >> >> > +
> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >> >> >  {
> >> >> >  }
> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> >> > index 28642c188c93..f4c65aeb088d 100644
> >> >> > --- a/mm/swapfile.c
> >> >> > +++ b/mm/swapfile.c
> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> >> >> >               __swap_entry_free(p, entry);
> >> >> >  }
> >> >> >
> >> >> > +/*
> >> >> > + * Free up the maximum number of swap entries at once to limit the
> >> >> > + * maximum kernel stack usage.
> >> >> > + */
> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> >> >> > +
> >> >> > +/*
> >> >> > + * Called after swapping in a large folio,
> >> >>
> >> >> IMHO, it's not good to document the caller in the function definition.
> >> >> Because this will discourage function reusing.
> >> >
> >> > ok. right now there is only one user that is why it is added. but i agree
> >> > we can actually remove this.
> >> >
> >> >>
> >> >> > batched free swap entries
> >> >> > + * for this large folio, entry should be for the first subpage and
> >> >> > + * its offset is aligned with nr_pages
> >> >>
> >> >> Why do we need this?
> >> >
> >> > This is a fundamental requirement for the existing kernel, folio's
> >> > swap offset is naturally aligned from the first moment add_to_swap
> >> > to add swapcache's xa. so this comment is describing the existing
> >> > fact. In the future, if we want to support swap-out folio to discontiguous
> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
> >> > instead pass ptep or another different data struct which can connect
> >> > multiple discontiguous swap offsets.
> >> >
> >> > I feel like we only need "for this large folio, entry should be for
> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
> >> > the latter is not important to this context at all.
> >>
> >> IIUC, all these are requirements of the only caller now, not the
> >> function itself.  If only part of the all swap entries of a mTHP are
> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
> >> so, why not make swap_free_nr() as general as possible?
> >
> > right , i believe we can make swap_free_nr() as general as possible.
> >
> >>
> >> >>
> >> >> > + */
> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> >> > +{
> >> >> > +     int i, j;
> >> >> > +     struct swap_cluster_info *ci;
> >> >> > +     struct swap_info_struct *p;
> >> >> > +     unsigned int type = swp_type(entry);
> >> >> > +     unsigned long offset = swp_offset(entry);
> >> >> > +     int batch_nr, remain_nr;
> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> >> >> > +
> >> >> > +     /* all swap entries are within a cluster for mTHP */
> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >> >> > +
> >> >> > +     if (nr_pages == 1) {
> >> >> > +             swap_free(entry);
> >> >> > +             return;
> >> >> > +     }
> >> >>
> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
> >> >> to avoid duplicate functions between mTHP and normal small folio.
> >> >> Right?
> >> >
> >> > I don't see why.
> >>
> >> Because duplicated implementation are hard to maintain in the long term.
> >
> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
> > was missed. Obviously I meant "why not", there was a "but" after it :-)
> >
> >>
> >> > but we have lots of places calling swap_free(), we may
> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
> >> > way is making swap_free() a wrapper of swap_free_nr() always using
> >> > 1 as the argument. In both cases, we are changing the semantics of
> >> > swap_free_nr() to partially freeing large folio cases and have to drop
> >> > "entry should be for the first subpage" then.
> >> >
> >> > Right now, the semantics is
> >> > * swap_free_nr() for an entire large folio;
> >> > * swap_free() for one entry of either a large folio or a small folio
> >>
> >> As above, I don't think the these semantics are important for
> >> swap_free_nr() implementation.
> >
> > right. I agree. If we are ready to change all those callers, nothing
> > can stop us from removing swap_free().
> >
> >>
> >> >>
> >> >> > +
> >> >> > +     remain_nr = nr_pages;
> >> >> > +     p = _swap_info_get(entry);
> >> >> > +     if (p) {
> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> >> >> > +
> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
> >> >> > +                     for (j = 0; j < batch_nr; j++) {
> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> >> >> > +                                     __bitmap_set(usage, j, 1);
> >> >> > +                     }
> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
> >> >> > +
> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> >> >> > +
> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> >> >> > +                     remain_nr -= batch_nr;
> >> >> > +             }
> >> >> > +     }
> >> >> > +}
> >> >> > +
> >> >> >  /*
> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >> >> >   */
> >> >>
> >> >> put_swap_folio() implements batching in another method.  Do you think
> >> >> that it's good to use the batching method in that function here?  It
> >> >> avoids to use bitmap operations and stack space.
> >> >
> >> > Chuanhua has strictly limited the maximum stack usage to several
> >> > unsigned long,
> >>
> >> 512 / 8 = 64 bytes.
> >>
> >> So, not trivial.
> >>
> >> > so this should be safe. on the other hand, i believe this
> >> > implementation is more efficient, as  put_swap_folio() might lock/
> >> > unlock much more often whenever __swap_entry_free_locked returns
> >> > 0.
> >>
> >> There are 2 most common use cases,
> >>
> >> - all swap entries have usage count == 0
> >> - all swap entries have usage count != 0
> >>
> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
> >> find possible use cases other than above.
> >
> > i guess the point is free_swap_slot() shouldn't be called within
> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
> > we'll have to unlock and lock nr_pages times?  and this is the most
> > common scenario.
>
> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
> is, nr_pages) or 0.  These are the most common cases.
>

i am actually talking about the below code path,

void put_swap_folio(struct folio *folio, swp_entry_t entry)
{
        ...

        ci = lock_cluster_or_swap_info(si, offset);
        ...
        for (i = 0; i < size; i++, entry.val++) {
                if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
                        unlock_cluster_or_swap_info(si, ci);
                        free_swap_slot(entry);
                        if (i == size - 1)
                                return;
                        lock_cluster_or_swap_info(si, offset);
                }
        }
        unlock_cluster_or_swap_info(si, ci);
}

but i guess you are talking about the below code path:

void put_swap_folio(struct folio *folio, swp_entry_t entry)
{
        ...

        ci = lock_cluster_or_swap_info(si, offset);
        if (size == SWAPFILE_CLUSTER) {
                map = si->swap_map + offset;
                for (i = 0; i < SWAPFILE_CLUSTER; i++) {
                        val = map[i];
                        VM_BUG_ON(!(val & SWAP_HAS_CACHE));
                        if (val == SWAP_HAS_CACHE)
                                free_entries++;
                }
                if (free_entries == SWAPFILE_CLUSTER) {
                        unlock_cluster_or_swap_info(si, ci);
                        spin_lock(&si->lock);
                        mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
                        swap_free_cluster(si, idx);
                        spin_unlock(&si->lock);
                        return;
                }
        }
}

we are mTHP, so we can't assume our size is SWAPFILE_CLUSTER?
or you want to check free_entries == "1 << swap_entry_order(folio_order(folio))"
instead of SWAPFILE_CLUSTER for the "for (i = 0; i < size; i++, entry.val++)"
path?


> >>
> >> And, we should add batching in __swap_entry_free().  That will help
> >> free_swap_and_cache_nr() too.

Chris Li and I actually discussed it before, and I completely agree
this can be batched. But I'd like to defer that as an incremental
patchset later to keep this swapcache-refault series small.

>
> Please consider this too.
>
> --
> Best Regards,
> Huang, Ying
Huang, Ying April 16, 2024, 1:40 a.m. UTC | #14
Barry Song <21cnbao@gmail.com> writes:

> On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Barry Song <21cnbao@gmail.com> writes:
>> >>
>> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Barry Song <21cnbao@gmail.com> writes:
>> >> >>
>> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
>> >> >> >
>> >> >> > While swapping in a large folio, we need to free swaps related to the whole
>> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
>> >> >> > to introduce an API for batched free.
>> >> >> >
>> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> >> >> > ---
>> >> >> >  include/linux/swap.h |  5 +++++
>> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>> >> >> >  2 files changed, 56 insertions(+)
>> >> >> >
>> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >> >> > index 11c53692f65f..b7a107e983b8 100644
>> >> >> > --- a/include/linux/swap.h
>> >> >> > +++ b/include/linux/swap.h
>> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>> >> >> >  extern int swap_duplicate(swp_entry_t);
>> >> >> >  extern int swapcache_prepare(swp_entry_t);
>> >> >> >  extern void swap_free(swp_entry_t);
>> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>> >> >> >  int swap_type_of(dev_t device, sector_t offset);
>> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>> >> >> >  {
>> >> >> >  }
>> >> >> >
>> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> >> >> > +{
>> >> >> > +}
>> >> >> > +
>> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>> >> >> >  {
>> >> >> >  }
>> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> >> > index 28642c188c93..f4c65aeb088d 100644
>> >> >> > --- a/mm/swapfile.c
>> >> >> > +++ b/mm/swapfile.c
>> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
>> >> >> >               __swap_entry_free(p, entry);
>> >> >> >  }
>> >> >> >
>> >> >> > +/*
>> >> >> > + * Free up the maximum number of swap entries at once to limit the
>> >> >> > + * maximum kernel stack usage.
>> >> >> > + */
>> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
>> >> >> > +
>> >> >> > +/*
>> >> >> > + * Called after swapping in a large folio,
>> >> >>
>> >> >> IMHO, it's not good to document the caller in the function definition.
>> >> >> Because this will discourage function reusing.
>> >> >
>> >> > ok. right now there is only one user that is why it is added. but i agree
>> >> > we can actually remove this.
>> >> >
>> >> >>
>> >> >> > batched free swap entries
>> >> >> > + * for this large folio, entry should be for the first subpage and
>> >> >> > + * its offset is aligned with nr_pages
>> >> >>
>> >> >> Why do we need this?
>> >> >
>> >> > This is a fundamental requirement for the existing kernel, folio's
>> >> > swap offset is naturally aligned from the first moment add_to_swap
>> >> > to add swapcache's xa. so this comment is describing the existing
>> >> > fact. In the future, if we want to support swap-out folio to discontiguous
>> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
>> >> > instead pass ptep or another different data struct which can connect
>> >> > multiple discontiguous swap offsets.
>> >> >
>> >> > I feel like we only need "for this large folio, entry should be for
>> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
>> >> > the latter is not important to this context at all.
>> >>
>> >> IIUC, all these are requirements of the only caller now, not the
>> >> function itself.  If only part of the all swap entries of a mTHP are
>> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
>> >> so, why not make swap_free_nr() as general as possible?
>> >
>> > right , i believe we can make swap_free_nr() as general as possible.
>> >
>> >>
>> >> >>
>> >> >> > + */
>> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> >> >> > +{
>> >> >> > +     int i, j;
>> >> >> > +     struct swap_cluster_info *ci;
>> >> >> > +     struct swap_info_struct *p;
>> >> >> > +     unsigned int type = swp_type(entry);
>> >> >> > +     unsigned long offset = swp_offset(entry);
>> >> >> > +     int batch_nr, remain_nr;
>> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
>> >> >> > +
>> >> >> > +     /* all swap entries are within a cluster for mTHP */
>> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>> >> >> > +
>> >> >> > +     if (nr_pages == 1) {
>> >> >> > +             swap_free(entry);
>> >> >> > +             return;
>> >> >> > +     }
>> >> >>
>> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
>> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
>> >> >> to avoid duplicate functions between mTHP and normal small folio.
>> >> >> Right?
>> >> >
>> >> > I don't see why.
>> >>
>> >> Because duplicated implementation are hard to maintain in the long term.
>> >
>> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
>> > was missed. Obviously I meant "why not", there was a "but" after it :-)
>> >
>> >>
>> >> > but we have lots of places calling swap_free(), we may
>> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
>> >> > way is making swap_free() a wrapper of swap_free_nr() always using
>> >> > 1 as the argument. In both cases, we are changing the semantics of
>> >> > swap_free_nr() to partially freeing large folio cases and have to drop
>> >> > "entry should be for the first subpage" then.
>> >> >
>> >> > Right now, the semantics is
>> >> > * swap_free_nr() for an entire large folio;
>> >> > * swap_free() for one entry of either a large folio or a small folio
>> >>
>> >> As above, I don't think the these semantics are important for
>> >> swap_free_nr() implementation.
>> >
>> > right. I agree. If we are ready to change all those callers, nothing
>> > can stop us from removing swap_free().
>> >
>> >>
>> >> >>
>> >> >> > +
>> >> >> > +     remain_nr = nr_pages;
>> >> >> > +     p = _swap_info_get(entry);
>> >> >> > +     if (p) {
>> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
>> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
>> >> >> > +
>> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
>> >> >> > +                     for (j = 0; j < batch_nr; j++) {
>> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
>> >> >> > +                                     __bitmap_set(usage, j, 1);
>> >> >> > +                     }
>> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
>> >> >> > +
>> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
>> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
>> >> >> > +
>> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
>> >> >> > +                     remain_nr -= batch_nr;
>> >> >> > +             }
>> >> >> > +     }
>> >> >> > +}
>> >> >> > +
>> >> >> >  /*
>> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
>> >> >> >   */
>> >> >>
>> >> >> put_swap_folio() implements batching in another method.  Do you think
>> >> >> that it's good to use the batching method in that function here?  It
>> >> >> avoids to use bitmap operations and stack space.
>> >> >
>> >> > Chuanhua has strictly limited the maximum stack usage to several
>> >> > unsigned long,
>> >>
>> >> 512 / 8 = 64 bytes.
>> >>
>> >> So, not trivial.
>> >>
>> >> > so this should be safe. on the other hand, i believe this
>> >> > implementation is more efficient, as  put_swap_folio() might lock/
>> >> > unlock much more often whenever __swap_entry_free_locked returns
>> >> > 0.
>> >>
>> >> There are 2 most common use cases,
>> >>
>> >> - all swap entries have usage count == 0
>> >> - all swap entries have usage count != 0
>> >>
>> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
>> >> find possible use cases other than above.
>> >
>> > i guess the point is free_swap_slot() shouldn't be called within
>> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
>> > we'll have to unlock and lock nr_pages times?  and this is the most
>> > common scenario.
>>
>> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
>> is, nr_pages) or 0.  These are the most common cases.
>>
>
> i am actually talking about the below code path,
>
> void put_swap_folio(struct folio *folio, swp_entry_t entry)
> {
>         ...
>
>         ci = lock_cluster_or_swap_info(si, offset);
>         ...
>         for (i = 0; i < size; i++, entry.val++) {
>                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
>                         unlock_cluster_or_swap_info(si, ci);
>                         free_swap_slot(entry);
>                         if (i == size - 1)
>                                 return;
>                         lock_cluster_or_swap_info(si, offset);
>                 }
>         }
>         unlock_cluster_or_swap_info(si, ci);
> }
>
> but i guess you are talking about the below code path:
>
> void put_swap_folio(struct folio *folio, swp_entry_t entry)
> {
>         ...
>
>         ci = lock_cluster_or_swap_info(si, offset);
>         if (size == SWAPFILE_CLUSTER) {
>                 map = si->swap_map + offset;
>                 for (i = 0; i < SWAPFILE_CLUSTER; i++) {
>                         val = map[i];
>                         VM_BUG_ON(!(val & SWAP_HAS_CACHE));
>                         if (val == SWAP_HAS_CACHE)
>                                 free_entries++;
>                 }
>                 if (free_entries == SWAPFILE_CLUSTER) {
>                         unlock_cluster_or_swap_info(si, ci);
>                         spin_lock(&si->lock);
>                         mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
>                         swap_free_cluster(si, idx);
>                         spin_unlock(&si->lock);
>                         return;
>                 }
>         }
> }

I am talking about both code paths.  In 2 most common cases,
__swap_entry_free_locked() will return 0 or !0 for all entries in range.

> we are mTHP, so we can't assume our size is SWAPFILE_CLUSTER?
> or you want to check free_entries == "1 << swap_entry_order(folio_order(folio))"
> instead of SWAPFILE_CLUSTER for the "for (i = 0; i < size; i++, entry.val++)"
> path?

Just replace SWAPFILE_CLUSTER with "nr_pages" in your code.
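
To spell out the per-entry side of what I mean, here is a rough,
untested sketch only, reusing the helpers from put_swap_folio(); the
usage argument becomes 1 because swap_free() drops the map count rather
than SWAP_HAS_CACHE:

void swap_free_nr(swp_entry_t entry, int nr_pages)
{
        unsigned long offset = swp_offset(entry);
        struct swap_cluster_info *ci;
        struct swap_info_struct *si;
        int i;

        si = _swap_info_get(entry);
        if (!si)
                return;

        ci = lock_cluster_or_swap_info(si, offset);
        for (i = 0; i < nr_pages; i++, entry.val++) {
                /* dropped to zero: hand the slot to the slot cache */
                if (!__swap_entry_free_locked(si, offset + i, 1)) {
                        unlock_cluster_or_swap_info(si, ci);
                        free_swap_slot(entry);
                        if (i == nr_pages - 1)
                                return;
                        lock_cluster_or_swap_info(si, offset);
                }
        }
        unlock_cluster_or_swap_info(si, ci);
}

The whole-range fast path from your first snippet would then sit in
front of this loop, comparing free_entries against nr_pages instead of
SWAPFILE_CLUSTER.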

>
>> >>
>> >> And, we should add batching in __swap_entry_free().  That will help
>> >> free_swap_and_cache_nr() too.
>
> Chris Li and I actually discussed it before, while I completely
> agree this can be batched. but i'd like to defer this as an incremental
> patchset later to keep this swapcache-refault small.

OK.

>>
>> Please consider this too.

--
Best Regards,
Huang, Ying
Barry Song April 16, 2024, 2:08 a.m. UTC | #15
On Tue, Apr 16, 2024 at 1:42 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >>
> >> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >> >> >> >
> >> >> >> > While swapping in a large folio, we need to free swaps related to the whole
> >> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> >> >> >> > to introduce an API for batched free.
> >> >> >> >
> >> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >> >> >> > ---
> >> >> >> >  include/linux/swap.h |  5 +++++
> >> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >> >> >> >  2 files changed, 56 insertions(+)
> >> >> >> >
> >> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> >> >> > index 11c53692f65f..b7a107e983b8 100644
> >> >> >> > --- a/include/linux/swap.h
> >> >> >> > +++ b/include/linux/swap.h
> >> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >> >> >> >  extern int swap_duplicate(swp_entry_t);
> >> >> >> >  extern int swapcache_prepare(swp_entry_t);
> >> >> >> >  extern void swap_free(swp_entry_t);
> >> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >> >> >> >  int swap_type_of(dev_t device, sector_t offset);
> >> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> >> >> >> >  {
> >> >> >> >  }
> >> >> >> >
> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> >> >> > +{
> >> >> >> > +}
> >> >> >> > +
> >> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >> >> >> >  {
> >> >> >> >  }
> >> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> >> >> > index 28642c188c93..f4c65aeb088d 100644
> >> >> >> > --- a/mm/swapfile.c
> >> >> >> > +++ b/mm/swapfile.c
> >> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> >> >> >> >               __swap_entry_free(p, entry);
> >> >> >> >  }
> >> >> >> >
> >> >> >> > +/*
> >> >> >> > + * Free up the maximum number of swap entries at once to limit the
> >> >> >> > + * maximum kernel stack usage.
> >> >> >> > + */
> >> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> >> >> >> > +
> >> >> >> > +/*
> >> >> >> > + * Called after swapping in a large folio,
> >> >> >>
> >> >> >> IMHO, it's not good to document the caller in the function definition.
> >> >> >> Because this will discourage function reusing.
> >> >> >
> >> >> > ok. right now there is only one user that is why it is added. but i agree
> >> >> > we can actually remove this.
> >> >> >
> >> >> >>
> >> >> >> > batched free swap entries
> >> >> >> > + * for this large folio, entry should be for the first subpage and
> >> >> >> > + * its offset is aligned with nr_pages
> >> >> >>
> >> >> >> Why do we need this?
> >> >> >
> >> >> > This is a fundamental requirement for the existing kernel, folio's
> >> >> > swap offset is naturally aligned from the first moment add_to_swap
> >> >> > to add swapcache's xa. so this comment is describing the existing
> >> >> > fact. In the future, if we want to support swap-out folio to discontiguous
> >> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
> >> >> > instead pass ptep or another different data struct which can connect
> >> >> > multiple discontiguous swap offsets.
> >> >> >
> >> >> > I feel like we only need "for this large folio, entry should be for
> >> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
> >> >> > the latter is not important to this context at all.
> >> >>
> >> >> IIUC, all these are requirements of the only caller now, not the
> >> >> function itself.  If only part of the all swap entries of a mTHP are
> >> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
> >> >> so, why not make swap_free_nr() as general as possible?
> >> >
> >> > right , i believe we can make swap_free_nr() as general as possible.
> >> >
> >> >>
> >> >> >>
> >> >> >> > + */
> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> >> >> > +{
> >> >> >> > +     int i, j;
> >> >> >> > +     struct swap_cluster_info *ci;
> >> >> >> > +     struct swap_info_struct *p;
> >> >> >> > +     unsigned int type = swp_type(entry);
> >> >> >> > +     unsigned long offset = swp_offset(entry);
> >> >> >> > +     int batch_nr, remain_nr;
> >> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> >> >> >> > +
> >> >> >> > +     /* all swap entries are within a cluster for mTHP */
> >> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >> >> >> > +
> >> >> >> > +     if (nr_pages == 1) {
> >> >> >> > +             swap_free(entry);
> >> >> >> > +             return;
> >> >> >> > +     }
> >> >> >>
> >> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
> >> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
> >> >> >> to avoid duplicate functions between mTHP and normal small folio.
> >> >> >> Right?
> >> >> >
> >> >> > I don't see why.
> >> >>
> >> >> Because duplicated implementation are hard to maintain in the long term.
> >> >
> >> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
> >> > was missed. Obviously I meant "why not", there was a "but" after it :-)
> >> >
> >> >>
> >> >> > but we have lots of places calling swap_free(), we may
> >> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
> >> >> > way is making swap_free() a wrapper of swap_free_nr() always using
> >> >> > 1 as the argument. In both cases, we are changing the semantics of
> >> >> > swap_free_nr() to partially freeing large folio cases and have to drop
> >> >> > "entry should be for the first subpage" then.
> >> >> >
> >> >> > Right now, the semantics is
> >> >> > * swap_free_nr() for an entire large folio;
> >> >> > * swap_free() for one entry of either a large folio or a small folio
> >> >>
> >> >> As above, I don't think the these semantics are important for
> >> >> swap_free_nr() implementation.
> >> >
> >> > right. I agree. If we are ready to change all those callers, nothing
> >> > can stop us from removing swap_free().
> >> >
> >> >>
> >> >> >>
> >> >> >> > +
> >> >> >> > +     remain_nr = nr_pages;
> >> >> >> > +     p = _swap_info_get(entry);
> >> >> >> > +     if (p) {
> >> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
> >> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> >> >> >> > +
> >> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
> >> >> >> > +                     for (j = 0; j < batch_nr; j++) {
> >> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> >> >> >> > +                                     __bitmap_set(usage, j, 1);
> >> >> >> > +                     }
> >> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
> >> >> >> > +
> >> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
> >> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> >> >> >> > +
> >> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> >> >> >> > +                     remain_nr -= batch_nr;
> >> >> >> > +             }
> >> >> >> > +     }
> >> >> >> > +}
> >> >> >> > +
> >> >> >> >  /*
> >> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >> >> >> >   */
> >> >> >>
> >> >> >> put_swap_folio() implements batching in another method.  Do you think
> >> >> >> that it's good to use the batching method in that function here?  It
> >> >> >> avoids to use bitmap operations and stack space.
> >> >> >
> >> >> > Chuanhua has strictly limited the maximum stack usage to several
> >> >> > unsigned long,
> >> >>
> >> >> 512 / 8 = 64 bytes.
> >> >>
> >> >> So, not trivial.
> >> >>
> >> >> > so this should be safe. on the other hand, i believe this
> >> >> > implementation is more efficient, as  put_swap_folio() might lock/
> >> >> > unlock much more often whenever __swap_entry_free_locked returns
> >> >> > 0.
> >> >>
> >> >> There are 2 most common use cases,
> >> >>
> >> >> - all swap entries have usage count == 0
> >> >> - all swap entries have usage count != 0
> >> >>
> >> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
> >> >> find possible use cases other than above.
> >> >
> >> > i guess the point is free_swap_slot() shouldn't be called within
> >> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
> >> > we'll have to unlock and lock nr_pages times?  and this is the most
> >> > common scenario.
> >>
> >> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
> >> is, nr_pages) or 0.  These are the most common cases.
> >>
> >
> > i am actually talking about the below code path,
> >
> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> > {
> >         ...
> >
> >         ci = lock_cluster_or_swap_info(si, offset);
> >         ...
> >         for (i = 0; i < size; i++, entry.val++) {
> >                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
> >                         unlock_cluster_or_swap_info(si, ci);
> >                         free_swap_slot(entry);
> >                         if (i == size - 1)
> >                                 return;
> >                         lock_cluster_or_swap_info(si, offset);
> >                 }
> >         }
> >         unlock_cluster_or_swap_info(si, ci);
> > }
> >
> > but i guess you are talking about the below code path:
> >
> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> > {
> >         ...
> >
> >         ci = lock_cluster_or_swap_info(si, offset);
> >         if (size == SWAPFILE_CLUSTER) {
> >                 map = si->swap_map + offset;
> >                 for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> >                         val = map[i];
> >                         VM_BUG_ON(!(val & SWAP_HAS_CACHE));
> >                         if (val == SWAP_HAS_CACHE)
> >                                 free_entries++;
> >                 }
> >                 if (free_entries == SWAPFILE_CLUSTER) {
> >                         unlock_cluster_or_swap_info(si, ci);
> >                         spin_lock(&si->lock);
> >                         mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
> >                         swap_free_cluster(si, idx);
> >                         spin_unlock(&si->lock);
> >                         return;
> >                 }
> >         }
> > }
>
> I am talking about both code paths.  In 2 most common cases,
> __swap_entry_free_locked() will return 0 or !0 for all entries in range.

I grasp your point, but if conditions involving 0 or non-0 values fail, we'll
end up repeatedly unlocking and locking. Picture a scenario with a large
folio shared by multiple processes. One process might unmap a portion
while another still holds an entire mapping to it. This could lead to situations
where free_entries doesn't equal 0 and free_entries doesn't equal
nr_pages, resulting in multiple unlock and lock operations.

Chuanhua has invested significant effort in following Ryan's suggestion
for the current approach, which generally handles all cases, especially
partial unmapping. Additionally, the widespread use of swap_free_nr()
as you suggested across various scenarios is noteworthy.

Unless there's evidence indicating performance issues or bugs, I believe
the current approach remains preferable.

>
> > we are mTHP, so we can't assume our size is SWAPFILE_CLUSTER?
> > or you want to check free_entries == "1 << swap_entry_order(folio_order(folio))"
> > instead of SWAPFILE_CLUSTER for the "for (i = 0; i < size; i++, entry.val++)"
> > path?
>
> Just replace SWAPFILE_CLUSTER with "nr_pages" in your code.
>
> >
> >> >>
> >> >> And, we should add batching in __swap_entry_free().  That will help
> >> >> free_swap_and_cache_nr() too.
> >
> > Chris Li and I actually discussed it before, while I completely
> > agree this can be batched. but i'd like to defer this as an incremental
> > patchset later to keep this swapcache-refault small.
>
> OK.
>
> >>
> >> Please consider this too.
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
Huang, Ying April 16, 2024, 3:11 a.m. UTC | #16
Barry Song <21cnbao@gmail.com> writes:

> On Tue, Apr 16, 2024 at 1:42 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Barry Song <21cnbao@gmail.com> writes:
>> >>
>> >> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Barry Song <21cnbao@gmail.com> writes:
>> >> >>
>> >> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Barry Song <21cnbao@gmail.com> writes:
>> >> >> >>
>> >> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
>> >> >> >> >
>> >> >> >> > While swapping in a large folio, we need to free swaps related to the whole
>> >> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
>> >> >> >> > to introduce an API for batched free.
>> >> >> >> >
>> >> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>> >> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>> >> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> >> >> >> > ---
>> >> >> >> >  include/linux/swap.h |  5 +++++
>> >> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>> >> >> >> >  2 files changed, 56 insertions(+)
>> >> >> >> >
>> >> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >> >> >> > index 11c53692f65f..b7a107e983b8 100644
>> >> >> >> > --- a/include/linux/swap.h
>> >> >> >> > +++ b/include/linux/swap.h
>> >> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>> >> >> >> >  extern int swap_duplicate(swp_entry_t);
>> >> >> >> >  extern int swapcache_prepare(swp_entry_t);
>> >> >> >> >  extern void swap_free(swp_entry_t);
>> >> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>> >> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> >> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>> >> >> >> >  int swap_type_of(dev_t device, sector_t offset);
>> >> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>> >> >> >> >  {
>> >> >> >> >  }
>> >> >> >> >
>> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> >> >> >> > +{
>> >> >> >> > +}
>> >> >> >> > +
>> >> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>> >> >> >> >  {
>> >> >> >> >  }
>> >> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> >> >> > index 28642c188c93..f4c65aeb088d 100644
>> >> >> >> > --- a/mm/swapfile.c
>> >> >> >> > +++ b/mm/swapfile.c
>> >> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
>> >> >> >> >               __swap_entry_free(p, entry);
>> >> >> >> >  }
>> >> >> >> >
>> >> >> >> > +/*
>> >> >> >> > + * Free up the maximum number of swap entries at once to limit the
>> >> >> >> > + * maximum kernel stack usage.
>> >> >> >> > + */
>> >> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
>> >> >> >> > +
>> >> >> >> > +/*
>> >> >> >> > + * Called after swapping in a large folio,
>> >> >> >>
>> >> >> >> IMHO, it's not good to document the caller in the function definition.
>> >> >> >> Because this will discourage function reusing.
>> >> >> >
>> >> >> > ok. right now there is only one user that is why it is added. but i agree
>> >> >> > we can actually remove this.
>> >> >> >
>> >> >> >>
>> >> >> >> > batched free swap entries
>> >> >> >> > + * for this large folio, entry should be for the first subpage and
>> >> >> >> > + * its offset is aligned with nr_pages
>> >> >> >>
>> >> >> >> Why do we need this?
>> >> >> >
>> >> >> > This is a fundamental requirement for the existing kernel, folio's
>> >> >> > swap offset is naturally aligned from the first moment add_to_swap
>> >> >> > to add swapcache's xa. so this comment is describing the existing
>> >> >> > fact. In the future, if we want to support swap-out folio to discontiguous
>> >> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
>> >> >> > instead pass ptep or another different data struct which can connect
>> >> >> > multiple discontiguous swap offsets.
>> >> >> >
>> >> >> > I feel like we only need "for this large folio, entry should be for
>> >> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
>> >> >> > the latter is not important to this context at all.
>> >> >>
>> >> >> IIUC, all these are requirements of the only caller now, not the
>> >> >> function itself.  If only part of the all swap entries of a mTHP are
>> >> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
>> >> >> so, why not make swap_free_nr() as general as possible?
>> >> >
>> >> > right , i believe we can make swap_free_nr() as general as possible.
>> >> >
>> >> >>
>> >> >> >>
>> >> >> >> > + */
>> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> >> >> >> > +{
>> >> >> >> > +     int i, j;
>> >> >> >> > +     struct swap_cluster_info *ci;
>> >> >> >> > +     struct swap_info_struct *p;
>> >> >> >> > +     unsigned int type = swp_type(entry);
>> >> >> >> > +     unsigned long offset = swp_offset(entry);
>> >> >> >> > +     int batch_nr, remain_nr;
>> >> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
>> >> >> >> > +
>> >> >> >> > +     /* all swap entries are within a cluster for mTHP */
>> >> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>> >> >> >> > +
>> >> >> >> > +     if (nr_pages == 1) {
>> >> >> >> > +             swap_free(entry);
>> >> >> >> > +             return;
>> >> >> >> > +     }
>> >> >> >>
>> >> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
>> >> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
>> >> >> >> to avoid duplicate functions between mTHP and normal small folio.
>> >> >> >> Right?
>> >> >> >
>> >> >> > I don't see why.
>> >> >>
>> >> >> Because duplicated implementation are hard to maintain in the long term.
>> >> >
>> >> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
>> >> > was missed. Obviously I meant "why not", there was a "but" after it :-)
>> >> >
>> >> >>
>> >> >> > but we have lots of places calling swap_free(), we may
>> >> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
>> >> >> > way is making swap_free() a wrapper of swap_free_nr() always using
>> >> >> > 1 as the argument. In both cases, we are changing the semantics of
>> >> >> > swap_free_nr() to partially freeing large folio cases and have to drop
>> >> >> > "entry should be for the first subpage" then.
>> >> >> >
>> >> >> > Right now, the semantics is
>> >> >> > * swap_free_nr() for an entire large folio;
>> >> >> > * swap_free() for one entry of either a large folio or a small folio
>> >> >>
>> >> >> As above, I don't think the these semantics are important for
>> >> >> swap_free_nr() implementation.
>> >> >
>> >> > right. I agree. If we are ready to change all those callers, nothing
>> >> > can stop us from removing swap_free().
>> >> >
>> >> >>
>> >> >> >>
>> >> >> >> > +
>> >> >> >> > +     remain_nr = nr_pages;
>> >> >> >> > +     p = _swap_info_get(entry);
>> >> >> >> > +     if (p) {
>> >> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
>> >> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
>> >> >> >> > +
>> >> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
>> >> >> >> > +                     for (j = 0; j < batch_nr; j++) {
>> >> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
>> >> >> >> > +                                     __bitmap_set(usage, j, 1);
>> >> >> >> > +                     }
>> >> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
>> >> >> >> > +
>> >> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
>> >> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
>> >> >> >> > +
>> >> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
>> >> >> >> > +                     remain_nr -= batch_nr;
>> >> >> >> > +             }
>> >> >> >> > +     }
>> >> >> >> > +}
>> >> >> >> > +
>> >> >> >> >  /*
>> >> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
>> >> >> >> >   */
>> >> >> >>
>> >> >> >> put_swap_folio() implements batching in another method.  Do you think
>> >> >> >> that it's good to use the batching method in that function here?  It
>> >> >> >> avoids to use bitmap operations and stack space.
>> >> >> >
>> >> >> > Chuanhua has strictly limited the maximum stack usage to several
>> >> >> > unsigned long,
>> >> >>
>> >> >> 512 / 8 = 64 bytes.
>> >> >>
>> >> >> So, not trivial.
>> >> >>
>> >> >> > so this should be safe. on the other hand, i believe this
>> >> >> > implementation is more efficient, as  put_swap_folio() might lock/
>> >> >> > unlock much more often whenever __swap_entry_free_locked returns
>> >> >> > 0.
>> >> >>
>> >> >> There are 2 most common use cases,
>> >> >>
>> >> >> - all swap entries have usage count == 0
>> >> >> - all swap entries have usage count != 0
>> >> >>
>> >> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
>> >> >> find possible use cases other than above.
>> >> >
>> >> > i guess the point is free_swap_slot() shouldn't be called within
>> >> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
>> >> > we'll have to unlock and lock nr_pages times?  and this is the most
>> >> > common scenario.
>> >>
>> >> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
>> >> is, nr_pages) or 0.  These are the most common cases.
>> >>
>> >
>> > i am actually talking about the below code path,
>> >
>> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
>> > {
>> >         ...
>> >
>> >         ci = lock_cluster_or_swap_info(si, offset);
>> >         ...
>> >         for (i = 0; i < size; i++, entry.val++) {
>> >                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
>> >                         unlock_cluster_or_swap_info(si, ci);
>> >                         free_swap_slot(entry);
>> >                         if (i == size - 1)
>> >                                 return;
>> >                         lock_cluster_or_swap_info(si, offset);
>> >                 }
>> >         }
>> >         unlock_cluster_or_swap_info(si, ci);
>> > }
>> >
>> > but i guess you are talking about the below code path:
>> >
>> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
>> > {
>> >         ...
>> >
>> >         ci = lock_cluster_or_swap_info(si, offset);
>> >         if (size == SWAPFILE_CLUSTER) {
>> >                 map = si->swap_map + offset;
>> >                 for (i = 0; i < SWAPFILE_CLUSTER; i++) {
>> >                         val = map[i];
>> >                         VM_BUG_ON(!(val & SWAP_HAS_CACHE));
>> >                         if (val == SWAP_HAS_CACHE)
>> >                                 free_entries++;
>> >                 }
>> >                 if (free_entries == SWAPFILE_CLUSTER) {
>> >                         unlock_cluster_or_swap_info(si, ci);
>> >                         spin_lock(&si->lock);
>> >                         mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
>> >                         swap_free_cluster(si, idx);
>> >                         spin_unlock(&si->lock);
>> >                         return;
>> >                 }
>> >         }
>> > }
>>
>> I am talking about both code paths.  In 2 most common cases,
>> __swap_entry_free_locked() will return 0 or !0 for all entries in range.
>
> I grasp your point, but if conditions involving 0 or non-0 values fail, we'll
> end up repeatedly unlocking and locking. Picture a scenario with a large
> folio shared by multiple processes. One process might unmap a portion
> while another still holds an entire mapping to it. This could lead to situations
> where free_entries doesn't equal 0 and free_entries doesn't equal
> nr_pages, resulting in multiple unlock and lock operations.

This is impossible in the current caller, because the folio is in the
swap cache.  But if we move the change to __swap_entry_free_nr(), we may
run into this situation.

> Chuanhua has invested significant effort in following Ryan's suggestion
> for the current approach, which generally handles all cases, especially
> partial unmapping. Additionally, the widespread use of swap_free_nr()
> as you suggested across various scenarios is noteworthy.
>
> Unless there's evidence indicating performance issues or bugs, I believe
> the current approach remains preferable.

TBH, I don't like the large stack space usage (64 bytes).  How about
using an "unsigned long" as the bitmap?  Then we use much less stack
space, and we can use bitmap == 0 and bitmap == (unsigned long)(-1) to
check the most common use cases.  And we still have enough batching.
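
Roughly the following, as an untested sketch only: BITS_PER_LONG
entries per batch, and a full batch whose entries all still have users
can be skipped without touching the slot cache at all.

void swap_free_nr(swp_entry_t entry, int nr_pages)
{
        unsigned long offset = swp_offset(entry);
        unsigned int type = swp_type(entry);
        struct swap_cluster_info *ci;
        struct swap_info_struct *si;
        int i, j, batch_nr;

        si = _swap_info_get(entry);
        if (!si)
                return;

        for (i = 0; i < nr_pages; i += batch_nr) {
                unsigned long usage = 0;

                batch_nr = min_t(int, BITS_PER_LONG, nr_pages - i);

                ci = lock_cluster_or_swap_info(si, offset + i);
                for (j = 0; j < batch_nr; j++) {
                        /* remember which entries still have other users */
                        if (__swap_entry_free_locked(si, offset + i + j, 1))
                                usage |= 1UL << j;
                }
                unlock_cluster_or_swap_info(si, ci);

                /* full batch, every entry still in use: nothing to free */
                if (usage == ~0UL)
                        continue;

                for (j = 0; j < batch_nr; j++) {
                        if (!(usage & (1UL << j)))
                                free_swap_slot(swp_entry(type, offset + i + j));
                }
        }
}

That keeps the stack usage to a single unsigned long while still
batching the cluster lock; the usage == 0 side only pays off fully once
__swap_entry_free()/free_swap_slot() grow a batched variant, which can
be the incremental work you mentioned.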

>>
>> > we are mTHP, so we can't assume our size is SWAPFILE_CLUSTER?
>> > or you want to check free_entries == "1 << swap_entry_order(folio_order(folio))"
>> > instead of SWAPFILE_CLUSTER for the "for (i = 0; i < size; i++, entry.val++)"
>> > path?
>>
>> Just replace SWAPFILE_CLUSTER with "nr_pages" in your code.
>>
>> >
>> >> >>
>> >> >> And, we should add batching in __swap_entry_free().  That will help
>> >> >> free_swap_and_cache_nr() too.
>> >
>> > Chris Li and I actually discussed it before, while I completely
>> > agree this can be batched. but i'd like to defer this as an incremental
>> > patchset later to keep this swapcache-refault small.
>>
>> OK.
>>
>> >>
>> >> Please consider this too.

--
Best Regards,
Huang, Ying
Barry Song April 16, 2024, 4:32 a.m. UTC | #17
On Tue, Apr 16, 2024 at 3:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Tue, Apr 16, 2024 at 1:42 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >>
> >> >> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >>
> >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >> >>
> >> >> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >> >> >> >> >
> >> >> >> >> > While swapping in a large folio, we need to free swaps related to the whole
> >> >> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> >> >> >> >> > to introduce an API for batched free.
> >> >> >> >> >
> >> >> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >> >> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >> >> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >> >> >> >> > ---
> >> >> >> >> >  include/linux/swap.h |  5 +++++
> >> >> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >> >> >> >> >  2 files changed, 56 insertions(+)
> >> >> >> >> >
> >> >> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> >> >> >> > index 11c53692f65f..b7a107e983b8 100644
> >> >> >> >> > --- a/include/linux/swap.h
> >> >> >> >> > +++ b/include/linux/swap.h
> >> >> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >> >> >> >> >  extern int swap_duplicate(swp_entry_t);
> >> >> >> >> >  extern int swapcache_prepare(swp_entry_t);
> >> >> >> >> >  extern void swap_free(swp_entry_t);
> >> >> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >> >> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >> >> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >> >> >> >> >  int swap_type_of(dev_t device, sector_t offset);
> >> >> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> >> >> >> >> >  {
> >> >> >> >> >  }
> >> >> >> >> >
> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> >> >> >> > +{
> >> >> >> >> > +}
> >> >> >> >> > +
> >> >> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >> >> >> >> >  {
> >> >> >> >> >  }
> >> >> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> >> >> >> > index 28642c188c93..f4c65aeb088d 100644
> >> >> >> >> > --- a/mm/swapfile.c
> >> >> >> >> > +++ b/mm/swapfile.c
> >> >> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> >> >> >> >> >               __swap_entry_free(p, entry);
> >> >> >> >> >  }
> >> >> >> >> >
> >> >> >> >> > +/*
> >> >> >> >> > + * Free up the maximum number of swap entries at once to limit the
> >> >> >> >> > + * maximum kernel stack usage.
> >> >> >> >> > + */
> >> >> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> >> >> >> >> > +
> >> >> >> >> > +/*
> >> >> >> >> > + * Called after swapping in a large folio,
> >> >> >> >>
> >> >> >> >> IMHO, it's not good to document the caller in the function definition.
> >> >> >> >> Because this will discourage function reusing.
> >> >> >> >
> >> >> >> > ok. right now there is only one user that is why it is added. but i agree
> >> >> >> > we can actually remove this.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> > batched free swap entries
> >> >> >> >> > + * for this large folio, entry should be for the first subpage and
> >> >> >> >> > + * its offset is aligned with nr_pages
> >> >> >> >>
> >> >> >> >> Why do we need this?
> >> >> >> >
> >> >> >> > This is a fundamental requirement for the existing kernel, folio's
> >> >> >> > swap offset is naturally aligned from the first moment add_to_swap
> >> >> >> > to add swapcache's xa. so this comment is describing the existing
> >> >> >> > fact. In the future, if we want to support swap-out folio to discontiguous
> >> >> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
> >> >> >> > instead pass ptep or another different data struct which can connect
> >> >> >> > multiple discontiguous swap offsets.
> >> >> >> >
> >> >> >> > I feel like we only need "for this large folio, entry should be for
> >> >> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
> >> >> >> > the latter is not important to this context at all.
> >> >> >>
> >> >> >> IIUC, all these are requirements of the only caller now, not the
> >> >> >> function itself.  If only part of the all swap entries of a mTHP are
> >> >> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
> >> >> >> so, why not make swap_free_nr() as general as possible?
> >> >> >
> >> >> > right , i believe we can make swap_free_nr() as general as possible.
> >> >> >
> >> >> >>
> >> >> >> >>
> >> >> >> >> > + */
> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> >> >> >> > +{
> >> >> >> >> > +     int i, j;
> >> >> >> >> > +     struct swap_cluster_info *ci;
> >> >> >> >> > +     struct swap_info_struct *p;
> >> >> >> >> > +     unsigned int type = swp_type(entry);
> >> >> >> >> > +     unsigned long offset = swp_offset(entry);
> >> >> >> >> > +     int batch_nr, remain_nr;
> >> >> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> >> >> >> >> > +
> >> >> >> >> > +     /* all swap entries are within a cluster for mTHP */
> >> >> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >> >> >> >> > +
> >> >> >> >> > +     if (nr_pages == 1) {
> >> >> >> >> > +             swap_free(entry);
> >> >> >> >> > +             return;
> >> >> >> >> > +     }
> >> >> >> >>
> >> >> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
> >> >> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
> >> >> >> >> to avoid duplicate functions between mTHP and normal small folio.
> >> >> >> >> Right?
> >> >> >> >
> >> >> >> > I don't see why.
> >> >> >>
> >> >> >> Because duplicated implementation are hard to maintain in the long term.
> >> >> >
> >> >> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
> >> >> > was missed. Obviously I meant "why not", there was a "but" after it :-)
> >> >> >
> >> >> >>
> >> >> >> > but we have lots of places calling swap_free(), we may
> >> >> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
> >> >> >> > way is making swap_free() a wrapper of swap_free_nr() always using
> >> >> >> > 1 as the argument. In both cases, we are changing the semantics of
> >> >> >> > swap_free_nr() to partially freeing large folio cases and have to drop
> >> >> >> > "entry should be for the first subpage" then.
> >> >> >> >
> >> >> >> > Right now, the semantics is
> >> >> >> > * swap_free_nr() for an entire large folio;
> >> >> >> > * swap_free() for one entry of either a large folio or a small folio
> >> >> >>
> >> >> >> As above, I don't think the these semantics are important for
> >> >> >> swap_free_nr() implementation.
> >> >> >
> >> >> > right. I agree. If we are ready to change all those callers, nothing
> >> >> > can stop us from removing swap_free().
> >> >> >
> >> >> >>
> >> >> >> >>
> >> >> >> >> > +
> >> >> >> >> > +     remain_nr = nr_pages;
> >> >> >> >> > +     p = _swap_info_get(entry);
> >> >> >> >> > +     if (p) {
> >> >> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
> >> >> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> >> >> >> >> > +
> >> >> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
> >> >> >> >> > +                     for (j = 0; j < batch_nr; j++) {
> >> >> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> >> >> >> >> > +                                     __bitmap_set(usage, j, 1);
> >> >> >> >> > +                     }
> >> >> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
> >> >> >> >> > +
> >> >> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
> >> >> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> >> >> >> >> > +
> >> >> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> >> >> >> >> > +                     remain_nr -= batch_nr;
> >> >> >> >> > +             }
> >> >> >> >> > +     }
> >> >> >> >> > +}
> >> >> >> >> > +
> >> >> >> >> >  /*
> >> >> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >> >> >> >> >   */
> >> >> >> >>
> >> >> >> >> put_swap_folio() implements batching in another method.  Do you think
> >> >> >> >> that it's good to use the batching method in that function here?  It
> >> >> >> >> avoids to use bitmap operations and stack space.
> >> >> >> >
> >> >> >> > Chuanhua has strictly limited the maximum stack usage to several
> >> >> >> > unsigned long,
> >> >> >>
> >> >> >> 512 / 8 = 64 bytes.
> >> >> >>
> >> >> >> So, not trivial.
> >> >> >>
> >> >> >> > so this should be safe. on the other hand, i believe this
> >> >> >> > implementation is more efficient, as  put_swap_folio() might lock/
> >> >> >> > unlock much more often whenever __swap_entry_free_locked returns
> >> >> >> > 0.
> >> >> >>
> >> >> >> There are 2 most common use cases,
> >> >> >>
> >> >> >> - all swap entries have usage count == 0
> >> >> >> - all swap entries have usage count != 0
> >> >> >>
> >> >> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
> >> >> >> find possible use cases other than above.
> >> >> >
> >> >> > i guess the point is free_swap_slot() shouldn't be called within
> >> >> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
> >> >> > we'll have to unlock and lock nr_pages times?  and this is the most
> >> >> > common scenario.
> >> >>
> >> >> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
> >> >> is, nr_pages) or 0.  These are the most common cases.
> >> >>
> >> >
> >> > i am actually talking about the below code path,
> >> >
> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> >> > {
> >> >         ...
> >> >
> >> >         ci = lock_cluster_or_swap_info(si, offset);
> >> >         ...
> >> >         for (i = 0; i < size; i++, entry.val++) {
> >> >                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
> >> >                         unlock_cluster_or_swap_info(si, ci);
> >> >                         free_swap_slot(entry);
> >> >                         if (i == size - 1)
> >> >                                 return;
> >> >                         lock_cluster_or_swap_info(si, offset);
> >> >                 }
> >> >         }
> >> >         unlock_cluster_or_swap_info(si, ci);
> >> > }
> >> >
> >> > but i guess you are talking about the below code path:
> >> >
> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> >> > {
> >> >         ...
> >> >
> >> >         ci = lock_cluster_or_swap_info(si, offset);
> >> >         if (size == SWAPFILE_CLUSTER) {
> >> >                 map = si->swap_map + offset;
> >> >                 for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> >> >                         val = map[i];
> >> >                         VM_BUG_ON(!(val & SWAP_HAS_CACHE));
> >> >                         if (val == SWAP_HAS_CACHE)
> >> >                                 free_entries++;
> >> >                 }
> >> >                 if (free_entries == SWAPFILE_CLUSTER) {
> >> >                         unlock_cluster_or_swap_info(si, ci);
> >> >                         spin_lock(&si->lock);
> >> >                         mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
> >> >                         swap_free_cluster(si, idx);
> >> >                         spin_unlock(&si->lock);
> >> >                         return;
> >> >                 }
> >> >         }
> >> > }
> >>
> >> I am talking about both code paths.  In 2 most common cases,
> >> __swap_entry_free_locked() will return 0 or !0 for all entries in range.
> >
> > I grasp your point, but if conditions involving 0 or non-0 values fail, we'll
> > end up repeatedly unlocking and locking. Picture a scenario with a large
> > folio shared by multiple processes. One process might unmap a portion
> > while another still holds an entire mapping to it. This could lead to situations
> > where free_entries doesn't equal 0 and free_entries doesn't equal
> > nr_pages, resulting in multiple unlock and lock operations.
>
> This is impossible in current caller, because the folio is in the swap
> cache.  But if we move the change to __swap_entry_free_nr(), we may run
> into this situation.

I don't understand why it is impossible. After try_to_unmap_one() has run
in one process, that process can call mprotect and munmap on part of the
large folio's PTEs, which have now become swap entries, so we remove the
PTEs for that part. Another process can still hit the swapcache entirely
and have all the swap entries mapped, and we call swap_free_nr(entry,
nr_pages) in do_swap_page. Within those swap entries, some have
swapcount=1 and others have swapcount > 1. Am I missing something?

>
> > Chuanhua has invested significant effort in following Ryan's suggestion
> > for the current approach, which generally handles all cases, especially
> > partial unmapping. Additionally, the widespread use of swap_free_nr()
> > as you suggested across various scenarios is noteworthy.
> >
> > Unless there's evidence indicating performance issues or bugs, I believe
> > the current approach remains preferable.
>
> TBH, I don't like the large stack space usage (64 bytes).  How about use
> a "unsigned long" as bitmap?  Then, we use much less stack space, use
> bitmap == 0 and bitmap == (unsigned long)(-1) to check the most common
> use cases.  And, we have enough batching.

That is quite a straightforward modification, e.g.

- #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
+ #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 64 ? 64 : SWAPFILE_CLUSTER)

There is no necessity to remove the bitmap API and move to raw unsigned long
operations, as a bitmap is just an array of unsigned long. With a batch of
64, the bitmap is one unsigned long on a 64-bit CPU and two unsigned longs
on a 32-bit CPU.
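
Just to show what that means in terms of the declaration (illustration only,
not part of the patch), DECLARE_BITMAP() is nothing more than an array of
unsigned long:

	/*
	 * DECLARE_BITMAP(name, bits) expands to
	 *	unsigned long name[BITS_TO_LONGS(bits)];
	 * so with SWAP_BATCH_NR == 64 the usage bitmap costs 8 bytes of
	 * stack: one word on a 64-bit CPU, two words on a 32-bit CPU.
	 */
	DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };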

>
> >>
> >> > we are mTHP, so we can't assume our size is SWAPFILE_CLUSTER?
> >> > or you want to check free_entries == "1 << swap_entry_order(folio_order(folio))"
> >> > instead of SWAPFILE_CLUSTER for the "for (i = 0; i < size; i++, entry.val++)"
> >> > path?
> >>
> >> Just replace SWAPFILE_CLUSTER with "nr_pages" in your code.
> >>
> >> >
> >> >> >>
> >> >> >> And, we should add batching in __swap_entry_free().  That will help
> >> >> >> free_swap_and_cache_nr() too.
> >> >
> >> > Chris Li and I actually discussed it before, while I completely
> >> > agree this can be batched. but i'd like to defer this as an incremental
> >> > patchset later to keep this swapcache-refault small.
> >>
> >> OK.
> >>
> >> >>
> >> >> Please consider this too.
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
Huang, Ying April 17, 2024, 12:32 a.m. UTC | #18
Barry Song <21cnbao@gmail.com> writes:

> On Tue, Apr 16, 2024 at 3:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Tue, Apr 16, 2024 at 1:42 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Barry Song <21cnbao@gmail.com> writes:
>> >>
>> >> > On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Barry Song <21cnbao@gmail.com> writes:
>> >> >>
>> >> >> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Barry Song <21cnbao@gmail.com> writes:
>> >> >> >>
>> >> >> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >> >>
>> >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
>> >> >> >> >>
>> >> >> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
>> >> >> >> >> >
>> >> >> >> >> > While swapping in a large folio, we need to free swaps related to the whole
>> >> >> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
>> >> >> >> >> > to introduce an API for batched free.
>> >> >> >> >> >
>> >> >> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>> >> >> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>> >> >> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> >> >> >> >> > ---
>> >> >> >> >> >  include/linux/swap.h |  5 +++++
>> >> >> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>> >> >> >> >> >  2 files changed, 56 insertions(+)
>> >> >> >> >> >
>> >> >> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >> >> >> >> > index 11c53692f65f..b7a107e983b8 100644
>> >> >> >> >> > --- a/include/linux/swap.h
>> >> >> >> >> > +++ b/include/linux/swap.h
>> >> >> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>> >> >> >> >> >  extern int swap_duplicate(swp_entry_t);
>> >> >> >> >> >  extern int swapcache_prepare(swp_entry_t);
>> >> >> >> >> >  extern void swap_free(swp_entry_t);
>> >> >> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>> >> >> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> >> >> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>> >> >> >> >> >  int swap_type_of(dev_t device, sector_t offset);
>> >> >> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>> >> >> >> >> >  {
>> >> >> >> >> >  }
>> >> >> >> >> >
>> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> >> >> >> >> > +{
>> >> >> >> >> > +}
>> >> >> >> >> > +
>> >> >> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>> >> >> >> >> >  {
>> >> >> >> >> >  }
>> >> >> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> >> >> >> > index 28642c188c93..f4c65aeb088d 100644
>> >> >> >> >> > --- a/mm/swapfile.c
>> >> >> >> >> > +++ b/mm/swapfile.c
>> >> >> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
>> >> >> >> >> >               __swap_entry_free(p, entry);
>> >> >> >> >> >  }
>> >> >> >> >> >
>> >> >> >> >> > +/*
>> >> >> >> >> > + * Free up the maximum number of swap entries at once to limit the
>> >> >> >> >> > + * maximum kernel stack usage.
>> >> >> >> >> > + */
>> >> >> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
>> >> >> >> >> > +
>> >> >> >> >> > +/*
>> >> >> >> >> > + * Called after swapping in a large folio,
>> >> >> >> >>
>> >> >> >> >> IMHO, it's not good to document the caller in the function definition.
>> >> >> >> >> Because this will discourage function reusing.
>> >> >> >> >
>> >> >> >> > ok. right now there is only one user that is why it is added. but i agree
>> >> >> >> > we can actually remove this.
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> > batched free swap entries
>> >> >> >> >> > + * for this large folio, entry should be for the first subpage and
>> >> >> >> >> > + * its offset is aligned with nr_pages
>> >> >> >> >>
>> >> >> >> >> Why do we need this?
>> >> >> >> >
>> >> >> >> > This is a fundamental requirement for the existing kernel, folio's
>> >> >> >> > swap offset is naturally aligned from the first moment add_to_swap
>> >> >> >> > to add swapcache's xa. so this comment is describing the existing
>> >> >> >> > fact. In the future, if we want to support swap-out folio to discontiguous
>> >> >> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
>> >> >> >> > instead pass ptep or another different data struct which can connect
>> >> >> >> > multiple discontiguous swap offsets.
>> >> >> >> >
>> >> >> >> > I feel like we only need "for this large folio, entry should be for
>> >> >> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
>> >> >> >> > the latter is not important to this context at all.
>> >> >> >>
>> >> >> >> IIUC, all these are requirements of the only caller now, not the
>> >> >> >> function itself.  If only part of the all swap entries of a mTHP are
>> >> >> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
>> >> >> >> so, why not make swap_free_nr() as general as possible?
>> >> >> >
>> >> >> > right , i believe we can make swap_free_nr() as general as possible.
>> >> >> >
>> >> >> >>
>> >> >> >> >>
>> >> >> >> >> > + */
>> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> >> >> >> >> > +{
>> >> >> >> >> > +     int i, j;
>> >> >> >> >> > +     struct swap_cluster_info *ci;
>> >> >> >> >> > +     struct swap_info_struct *p;
>> >> >> >> >> > +     unsigned int type = swp_type(entry);
>> >> >> >> >> > +     unsigned long offset = swp_offset(entry);
>> >> >> >> >> > +     int batch_nr, remain_nr;
>> >> >> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
>> >> >> >> >> > +
>> >> >> >> >> > +     /* all swap entries are within a cluster for mTHP */
>> >> >> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>> >> >> >> >> > +
>> >> >> >> >> > +     if (nr_pages == 1) {
>> >> >> >> >> > +             swap_free(entry);
>> >> >> >> >> > +             return;
>> >> >> >> >> > +     }
>> >> >> >> >>
>> >> >> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
>> >> >> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
>> >> >> >> >> to avoid duplicate functions between mTHP and normal small folio.
>> >> >> >> >> Right?
>> >> >> >> >
>> >> >> >> > I don't see why.
>> >> >> >>
>> >> >> >> Because duplicated implementation are hard to maintain in the long term.
>> >> >> >
>> >> >> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
>> >> >> > was missed. Obviously I meant "why not", there was a "but" after it :-)
>> >> >> >
>> >> >> >>
>> >> >> >> > but we have lots of places calling swap_free(), we may
>> >> >> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
>> >> >> >> > way is making swap_free() a wrapper of swap_free_nr() always using
>> >> >> >> > 1 as the argument. In both cases, we are changing the semantics of
>> >> >> >> > swap_free_nr() to partially freeing large folio cases and have to drop
>> >> >> >> > "entry should be for the first subpage" then.
>> >> >> >> >
>> >> >> >> > Right now, the semantics is
>> >> >> >> > * swap_free_nr() for an entire large folio;
>> >> >> >> > * swap_free() for one entry of either a large folio or a small folio
>> >> >> >>
>> >> >> >> As above, I don't think the these semantics are important for
>> >> >> >> swap_free_nr() implementation.
>> >> >> >
>> >> >> > right. I agree. If we are ready to change all those callers, nothing
>> >> >> > can stop us from removing swap_free().
>> >> >> >
>> >> >> >>
>> >> >> >> >>
>> >> >> >> >> > +
>> >> >> >> >> > +     remain_nr = nr_pages;
>> >> >> >> >> > +     p = _swap_info_get(entry);
>> >> >> >> >> > +     if (p) {
>> >> >> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
>> >> >> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
>> >> >> >> >> > +
>> >> >> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
>> >> >> >> >> > +                     for (j = 0; j < batch_nr; j++) {
>> >> >> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
>> >> >> >> >> > +                                     __bitmap_set(usage, j, 1);
>> >> >> >> >> > +                     }
>> >> >> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
>> >> >> >> >> > +
>> >> >> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
>> >> >> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
>> >> >> >> >> > +
>> >> >> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
>> >> >> >> >> > +                     remain_nr -= batch_nr;
>> >> >> >> >> > +             }
>> >> >> >> >> > +     }
>> >> >> >> >> > +}
>> >> >> >> >> > +
>> >> >> >> >> >  /*
>> >> >> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
>> >> >> >> >> >   */
>> >> >> >> >>
>> >> >> >> >> put_swap_folio() implements batching in another method.  Do you think
>> >> >> >> >> that it's good to use the batching method in that function here?  It
>> >> >> >> >> avoids to use bitmap operations and stack space.
>> >> >> >> >
>> >> >> >> > Chuanhua has strictly limited the maximum stack usage to several
>> >> >> >> > unsigned long,
>> >> >> >>
>> >> >> >> 512 / 8 = 64 bytes.
>> >> >> >>
>> >> >> >> So, not trivial.
>> >> >> >>
>> >> >> >> > so this should be safe. on the other hand, i believe this
>> >> >> >> > implementation is more efficient, as  put_swap_folio() might lock/
>> >> >> >> > unlock much more often whenever __swap_entry_free_locked returns
>> >> >> >> > 0.
>> >> >> >>
>> >> >> >> There are 2 most common use cases,
>> >> >> >>
>> >> >> >> - all swap entries have usage count == 0
>> >> >> >> - all swap entries have usage count != 0
>> >> >> >>
>> >> >> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
>> >> >> >> find possible use cases other than above.
>> >> >> >
>> >> >> > i guess the point is free_swap_slot() shouldn't be called within
>> >> >> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
>> >> >> > we'll have to unlock and lock nr_pages times?  and this is the most
>> >> >> > common scenario.
>> >> >>
>> >> >> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
>> >> >> is, nr_pages) or 0.  These are the most common cases.
>> >> >>
>> >> >
>> >> > i am actually talking about the below code path,
>> >> >
>> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
>> >> > {
>> >> >         ...
>> >> >
>> >> >         ci = lock_cluster_or_swap_info(si, offset);
>> >> >         ...
>> >> >         for (i = 0; i < size; i++, entry.val++) {
>> >> >                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
>> >> >                         unlock_cluster_or_swap_info(si, ci);
>> >> >                         free_swap_slot(entry);
>> >> >                         if (i == size - 1)
>> >> >                                 return;
>> >> >                         lock_cluster_or_swap_info(si, offset);
>> >> >                 }
>> >> >         }
>> >> >         unlock_cluster_or_swap_info(si, ci);
>> >> > }
>> >> >
>> >> > but i guess you are talking about the below code path:
>> >> >
>> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
>> >> > {
>> >> >         ...
>> >> >
>> >> >         ci = lock_cluster_or_swap_info(si, offset);
>> >> >         if (size == SWAPFILE_CLUSTER) {
>> >> >                 map = si->swap_map + offset;
>> >> >                 for (i = 0; i < SWAPFILE_CLUSTER; i++) {
>> >> >                         val = map[i];
>> >> >                         VM_BUG_ON(!(val & SWAP_HAS_CACHE));
>> >> >                         if (val == SWAP_HAS_CACHE)
>> >> >                                 free_entries++;
>> >> >                 }
>> >> >                 if (free_entries == SWAPFILE_CLUSTER) {
>> >> >                         unlock_cluster_or_swap_info(si, ci);
>> >> >                         spin_lock(&si->lock);
>> >> >                         mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
>> >> >                         swap_free_cluster(si, idx);
>> >> >                         spin_unlock(&si->lock);
>> >> >                         return;
>> >> >                 }
>> >> >         }
>> >> > }
>> >>
>> >> I am talking about both code paths.  In 2 most common cases,
>> >> __swap_entry_free_locked() will return 0 or !0 for all entries in range.
>> >
>> > I grasp your point, but if conditions involving 0 or non-0 values fail, we'll
>> > end up repeatedly unlocking and locking. Picture a scenario with a large
>> > folio shared by multiple processes. One process might unmap a portion
>> > while another still holds an entire mapping to it. This could lead to situations
>> > where free_entries doesn't equal 0 and free_entries doesn't equal
>> > nr_pages, resulting in multiple unlock and lock operations.
>>
>> This is impossible in current caller, because the folio is in the swap
>> cache.  But if we move the change to __swap_entry_free_nr(), we may run
>> into this situation.
>
> I don't understand why it is impossible, after try_to_unmap_one() has done
> on one process, mprotect and munmap called on a part of the large folio
> pte entries which now have been swap entries, we are removing the PTE
> for this part. Another process can entirely hit the swapcache and have
> all swap entries mapped there, and we call swap_free_nr(entry, nr_pages) in
> do_swap_page. Within those swap entries, some have swapcount=1 and others
> have swapcount > 1. Am I missing something?

For swap entries with swapcount=1, their sis->swap_map[] entries will be

1 | SWAP_HAS_CACHE

so __swap_entry_free_locked() will return SWAP_HAS_CACHE instead of 0.

The swap entries will then be freed via

folio_free_swap
  delete_from_swap_cache
    put_swap_folio
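
To spell out the lifetime with illustrative values (this mirrors the
put_swap_folio() code quoted earlier in the thread; the values shown are
only for illustration):

	/*
	 * Folio still in the swap cache and mapped once in this process:
	 *	si->swap_map[offset] == 1 | SWAP_HAS_CACHE
	 *
	 * do_swap_page() -> swap_free_nr():
	 *	__swap_entry_free_locked(si, offset, 1) drops the map count
	 *	and returns SWAP_HAS_CACHE (non-zero), so free_swap_slot()
	 *	is not called here.
	 *
	 * Later, folio_free_swap() -> delete_from_swap_cache()
	 *	-> put_swap_folio():
	 *	__swap_entry_free_locked(si, offset, SWAP_HAS_CACHE) drops
	 *	SWAP_HAS_CACHE and returns 0, and only then is the entry
	 *	really freed.
	 */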

>> > Chuanhua has invested significant effort in following Ryan's suggestion
>> > for the current approach, which generally handles all cases, especially
>> > partial unmapping. Additionally, the widespread use of swap_free_nr()
>> > as you suggested across various scenarios is noteworthy.
>> >
>> > Unless there's evidence indicating performance issues or bugs, I believe
>> > the current approach remains preferable.
>>
>> TBH, I don't like the large stack space usage (64 bytes).  How about use
>> a "unsigned long" as bitmap?  Then, we use much less stack space, use
>> bitmap == 0 and bitmap == (unsigned long)(-1) to check the most common
>> use cases.  And, we have enough batching.
>
> that is quite a straightforward modification like,
>
> - #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> + #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 64 ? 64 : SWAPFILE_CLUSTER)
>
> there is no necessity to remove the bitmap API and move to raw
> unsigned long operations.
> as bitmap is exactly some unsigned long. on 64bit CPU, we are now one
> unsigned long,
> on 32bit CPU, it is now two unsigned long.

Yes.  We can still use most bitmap APIs if we use an "unsigned long" as the
bitmap.  The advantage of "unsigned long" is that it guarantees
bitmap_empty() and bitmap_full() are trivial.  We can use that for
optimization.  For example, we can skip unlock/lock if bitmap_empty().

>>
>> >>
>> >> > we are mTHP, so we can't assume our size is SWAPFILE_CLUSTER?
>> >> > or you want to check free_entries == "1 << swap_entry_order(folio_order(folio))"
>> >> > instead of SWAPFILE_CLUSTER for the "for (i = 0; i < size; i++, entry.val++)"
>> >> > path?
>> >>
>> >> Just replace SWAPFILE_CLUSTER with "nr_pages" in your code.
>> >>
>> >> >
>> >> >> >>
>> >> >> >> And, we should add batching in __swap_entry_free().  That will help
>> >> >> >> free_swap_and_cache_nr() too.
>> >> >
>> >> > Chris Li and I actually discussed it before, while I completely
>> >> > agree this can be batched. but i'd like to defer this as an incremental
>> >> > patchset later to keep this swapcache-refault small.
>> >>
>> >> OK.
>> >>
>> >> >>
>> >> >> Please consider this too.

--
Best Regards,
Huang, Ying
Barry Song April 17, 2024, 1:35 a.m. UTC | #19
On Wed, Apr 17, 2024 at 12:34 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Tue, Apr 16, 2024 at 3:13 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Tue, Apr 16, 2024 at 1:42 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >>
> >> >> >> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >>
> >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >> >>
> >> >> >> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >> >> >> >>
> >> >> >> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >> >> >> >> >> >
> >> >> >> >> >> > While swapping in a large folio, we need to free swaps related to the whole
> >> >> >> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> >> >> >> >> >> > to introduce an API for batched free.
> >> >> >> >> >> >
> >> >> >> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >> >> >> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >> >> >> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >> >> >> >> >> > ---
> >> >> >> >> >> >  include/linux/swap.h |  5 +++++
> >> >> >> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >> >> >> >> >> >  2 files changed, 56 insertions(+)
> >> >> >> >> >> >
> >> >> >> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> >> >> >> >> > index 11c53692f65f..b7a107e983b8 100644
> >> >> >> >> >> > --- a/include/linux/swap.h
> >> >> >> >> >> > +++ b/include/linux/swap.h
> >> >> >> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >> >> >> >> >> >  extern int swap_duplicate(swp_entry_t);
> >> >> >> >> >> >  extern int swapcache_prepare(swp_entry_t);
> >> >> >> >> >> >  extern void swap_free(swp_entry_t);
> >> >> >> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >> >> >> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >> >> >> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >> >> >> >> >> >  int swap_type_of(dev_t device, sector_t offset);
> >> >> >> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> >> >> >> >> >> >  {
> >> >> >> >> >> >  }
> >> >> >> >> >> >
> >> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> >> >> >> >> > +{
> >> >> >> >> >> > +}
> >> >> >> >> >> > +
> >> >> >> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >> >> >> >> >> >  {
> >> >> >> >> >> >  }
> >> >> >> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> >> >> >> >> > index 28642c188c93..f4c65aeb088d 100644
> >> >> >> >> >> > --- a/mm/swapfile.c
> >> >> >> >> >> > +++ b/mm/swapfile.c
> >> >> >> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> >> >> >> >> >> >               __swap_entry_free(p, entry);
> >> >> >> >> >> >  }
> >> >> >> >> >> >
> >> >> >> >> >> > +/*
> >> >> >> >> >> > + * Free up the maximum number of swap entries at once to limit the
> >> >> >> >> >> > + * maximum kernel stack usage.
> >> >> >> >> >> > + */
> >> >> >> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> >> >> >> >> >> > +
> >> >> >> >> >> > +/*
> >> >> >> >> >> > + * Called after swapping in a large folio,
> >> >> >> >> >>
> >> >> >> >> >> IMHO, it's not good to document the caller in the function definition.
> >> >> >> >> >> Because this will discourage function reusing.
> >> >> >> >> >
> >> >> >> >> > ok. right now there is only one user that is why it is added. but i agree
> >> >> >> >> > we can actually remove this.
> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> > batched free swap entries
> >> >> >> >> >> > + * for this large folio, entry should be for the first subpage and
> >> >> >> >> >> > + * its offset is aligned with nr_pages
> >> >> >> >> >>
> >> >> >> >> >> Why do we need this?
> >> >> >> >> >
> >> >> >> >> > This is a fundamental requirement for the existing kernel, folio's
> >> >> >> >> > swap offset is naturally aligned from the first moment add_to_swap
> >> >> >> >> > to add swapcache's xa. so this comment is describing the existing
> >> >> >> >> > fact. In the future, if we want to support swap-out folio to discontiguous
> >> >> >> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
> >> >> >> >> > instead pass ptep or another different data struct which can connect
> >> >> >> >> > multiple discontiguous swap offsets.
> >> >> >> >> >
> >> >> >> >> > I feel like we only need "for this large folio, entry should be for
> >> >> >> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
> >> >> >> >> > the latter is not important to this context at all.
> >> >> >> >>
> >> >> >> >> IIUC, all these are requirements of the only caller now, not the
> >> >> >> >> function itself.  If only part of the all swap entries of a mTHP are
> >> >> >> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
> >> >> >> >> so, why not make swap_free_nr() as general as possible?
> >> >> >> >
> >> >> >> > right , i believe we can make swap_free_nr() as general as possible.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> > + */
> >> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> >> >> >> >> > +{
> >> >> >> >> >> > +     int i, j;
> >> >> >> >> >> > +     struct swap_cluster_info *ci;
> >> >> >> >> >> > +     struct swap_info_struct *p;
> >> >> >> >> >> > +     unsigned int type = swp_type(entry);
> >> >> >> >> >> > +     unsigned long offset = swp_offset(entry);
> >> >> >> >> >> > +     int batch_nr, remain_nr;
> >> >> >> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> >> >> >> >> >> > +
> >> >> >> >> >> > +     /* all swap entries are within a cluster for mTHP */
> >> >> >> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >> >> >> >> >> > +
> >> >> >> >> >> > +     if (nr_pages == 1) {
> >> >> >> >> >> > +             swap_free(entry);
> >> >> >> >> >> > +             return;
> >> >> >> >> >> > +     }
> >> >> >> >> >>
> >> >> >> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
> >> >> >> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
> >> >> >> >> >> to avoid duplicate functions between mTHP and normal small folio.
> >> >> >> >> >> Right?
> >> >> >> >> >
> >> >> >> >> > I don't see why.
> >> >> >> >>
> >> >> >> >> Because duplicated implementation are hard to maintain in the long term.
> >> >> >> >
> >> >> >> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
> >> >> >> > was missed. Obviously I meant "why not", there was a "but" after it :-)
> >> >> >> >
> >> >> >> >>
> >> >> >> >> > but we have lots of places calling swap_free(), we may
> >> >> >> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
> >> >> >> >> > way is making swap_free() a wrapper of swap_free_nr() always using
> >> >> >> >> > 1 as the argument. In both cases, we are changing the semantics of
> >> >> >> >> > swap_free_nr() to partially freeing large folio cases and have to drop
> >> >> >> >> > "entry should be for the first subpage" then.
> >> >> >> >> >
> >> >> >> >> > Right now, the semantics is
> >> >> >> >> > * swap_free_nr() for an entire large folio;
> >> >> >> >> > * swap_free() for one entry of either a large folio or a small folio
> >> >> >> >>
> >> >> >> >> As above, I don't think the these semantics are important for
> >> >> >> >> swap_free_nr() implementation.
> >> >> >> >
> >> >> >> > right. I agree. If we are ready to change all those callers, nothing
> >> >> >> > can stop us from removing swap_free().
> >> >> >> >
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> > +
> >> >> >> >> >> > +     remain_nr = nr_pages;
> >> >> >> >> >> > +     p = _swap_info_get(entry);
> >> >> >> >> >> > +     if (p) {
> >> >> >> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
> >> >> >> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> >> >> >> >> >> > +
> >> >> >> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
> >> >> >> >> >> > +                     for (j = 0; j < batch_nr; j++) {
> >> >> >> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> >> >> >> >> >> > +                                     __bitmap_set(usage, j, 1);
> >> >> >> >> >> > +                     }
> >> >> >> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
> >> >> >> >> >> > +
> >> >> >> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
> >> >> >> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> >> >> >> >> >> > +
> >> >> >> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> >> >> >> >> >> > +                     remain_nr -= batch_nr;
> >> >> >> >> >> > +             }
> >> >> >> >> >> > +     }
> >> >> >> >> >> > +}
> >> >> >> >> >> > +
> >> >> >> >> >> >  /*
> >> >> >> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >> >> >> >> >> >   */
> >> >> >> >> >>
> >> >> >> >> >> put_swap_folio() implements batching in another method.  Do you think
> >> >> >> >> >> that it's good to use the batching method in that function here?  It
> >> >> >> >> >> avoids to use bitmap operations and stack space.
> >> >> >> >> >
> >> >> >> >> > Chuanhua has strictly limited the maximum stack usage to several
> >> >> >> >> > unsigned long,
> >> >> >> >>
> >> >> >> >> 512 / 8 = 64 bytes.
> >> >> >> >>
> >> >> >> >> So, not trivial.
> >> >> >> >>
> >> >> >> >> > so this should be safe. on the other hand, i believe this
> >> >> >> >> > implementation is more efficient, as  put_swap_folio() might lock/
> >> >> >> >> > unlock much more often whenever __swap_entry_free_locked returns
> >> >> >> >> > 0.
> >> >> >> >>
> >> >> >> >> There are 2 most common use cases,
> >> >> >> >>
> >> >> >> >> - all swap entries have usage count == 0
> >> >> >> >> - all swap entries have usage count != 0
> >> >> >> >>
> >> >> >> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
> >> >> >> >> find possible use cases other than above.
> >> >> >> >
> >> >> >> > i guess the point is free_swap_slot() shouldn't be called within
> >> >> >> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
> >> >> >> > we'll have to unlock and lock nr_pages times?  and this is the most
> >> >> >> > common scenario.
> >> >> >>
> >> >> >> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
> >> >> >> is, nr_pages) or 0.  These are the most common cases.
> >> >> >>
> >> >> >
> >> >> > i am actually talking about the below code path,
> >> >> >
> >> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> >> >> > {
> >> >> >         ...
> >> >> >
> >> >> >         ci = lock_cluster_or_swap_info(si, offset);
> >> >> >         ...
> >> >> >         for (i = 0; i < size; i++, entry.val++) {
> >> >> >                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
> >> >> >                         unlock_cluster_or_swap_info(si, ci);
> >> >> >                         free_swap_slot(entry);
> >> >> >                         if (i == size - 1)
> >> >> >                                 return;
> >> >> >                         lock_cluster_or_swap_info(si, offset);
> >> >> >                 }
> >> >> >         }
> >> >> >         unlock_cluster_or_swap_info(si, ci);
> >> >> > }
> >> >> >
> >> >> > but i guess you are talking about the below code path:
> >> >> >
> >> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> >> >> > {
> >> >> >         ...
> >> >> >
> >> >> >         ci = lock_cluster_or_swap_info(si, offset);
> >> >> >         if (size == SWAPFILE_CLUSTER) {
> >> >> >                 map = si->swap_map + offset;
> >> >> >                 for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> >> >> >                         val = map[i];
> >> >> >                         VM_BUG_ON(!(val & SWAP_HAS_CACHE));
> >> >> >                         if (val == SWAP_HAS_CACHE)
> >> >> >                                 free_entries++;
> >> >> >                 }
> >> >> >                 if (free_entries == SWAPFILE_CLUSTER) {
> >> >> >                         unlock_cluster_or_swap_info(si, ci);
> >> >> >                         spin_lock(&si->lock);
> >> >> >                         mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
> >> >> >                         swap_free_cluster(si, idx);
> >> >> >                         spin_unlock(&si->lock);
> >> >> >                         return;
> >> >> >                 }
> >> >> >         }
> >> >> > }
> >> >>
> >> >> I am talking about both code paths.  In 2 most common cases,
> >> >> __swap_entry_free_locked() will return 0 or !0 for all entries in range.
> >> >
> >> > I grasp your point, but if conditions involving 0 or non-0 values fail, we'll
> >> > end up repeatedly unlocking and locking. Picture a scenario with a large
> >> > folio shared by multiple processes. One process might unmap a portion
> >> > while another still holds an entire mapping to it. This could lead to situations
> >> > where free_entries doesn't equal 0 and free_entries doesn't equal
> >> > nr_pages, resulting in multiple unlock and lock operations.
> >>
> >> This is impossible in current caller, because the folio is in the swap
> >> cache.  But if we move the change to __swap_entry_free_nr(), we may run
> >> into this situation.
> >
> > I don't understand why it is impossible, after try_to_unmap_one() has done
> > on one process, mprotect and munmap called on a part of the large folio
> > pte entries which now have been swap entries, we are removing the PTE
> > for this part. Another process can entirely hit the swapcache and have
> > all swap entries mapped there, and we call swap_free_nr(entry, nr_pages) in
> > do_swap_page. Within those swap entries, some have swapcount=1 and others
> > have swapcount > 1. Am I missing something?
>
> For swap entries with swapcount=1, its sis->swap_map[] will be
>
> 1 | SWAP_HAS_CACHE
>
> so, __swap_entry_free_locked() will return SWAP_HAS_CACHE instead of 0.
>
> The swap entries will be free in
>
> folio_free_swap
>   delete_from_swap_cache
>     put_swap_folio
>

Yes. I realized this after replying to you yesterday.

> >> > Chuanhua has invested significant effort in following Ryan's suggestion
> >> > for the current approach, which generally handles all cases, especially
> >> > partial unmapping. Additionally, the widespread use of swap_free_nr()
> >> > as you suggested across various scenarios is noteworthy.
> >> >
> >> > Unless there's evidence indicating performance issues or bugs, I believe
> >> > the current approach remains preferable.
> >>
> >> TBH, I don't like the large stack space usage (64 bytes).  How about use
> >> a "unsigned long" as bitmap?  Then, we use much less stack space, use
> >> bitmap == 0 and bitmap == (unsigned long)(-1) to check the most common
> >> use cases.  And, we have enough batching.
> >
> > that is quite a straightforward modification like,
> >
> > - #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> > + #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 64 ? 64 : SWAPFILE_CLUSTER)
> >
> > there is no necessity to remove the bitmap API and move to raw
> > unsigned long operations.
> > as bitmap is exactly some unsigned long. on 64bit CPU, we are now one
> > unsigned long,
> > on 32bit CPU, it is now two unsigned long.
>
> Yes.  We can still use most bitmap APIs if we use "unsigned long" as
> bitmap.  The advantage of "unsigned long" is to guarantee that
> bitmap_empty() and bitmap_full() is trivial.  We can use that for
> optimization.  For example, we can skip unlock/lock if bitmap_empty().

Anyway, we have already avoided calling lock_cluster_or_swap_info() and
unlock_cluster_or_swap_info() for each individual swap entry.

If bitmap_empty(), we won't call free_swap_slot(), so there is no chance to
take any further lock, right?

The bitmap_full() optimization seems more useful only after we have
void free_swap_slot(swp_entry_t entry, int nr)

in which we can avoid many spin_lock_irq(&cache->free_lock) calls.

On the other hand, it seems we can directly call swapcache_free_entries()
to skip the slot cache if nr_pages >= SWAP_BATCH (64); this might be an
optimization as we now have a bitmap covering exactly 64 entries.
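
To make the free_swap_slot(entry, nr) idea a bit more concrete, a very
rough, untested sketch could look like the below (the swap_slots_cache
fields and the enable/deactivation re-checks of the real free_swap_slot()
are simplified away, so treat it as pseudocode rather than a patch):

/* hypothetical batched variant of free_swap_slot() */
static void free_swap_slot_nr(swp_entry_t entry, int nr)
{
	struct swap_slots_cache *cache;
	int i;

	cache = raw_cpu_ptr(&swp_slots);
	if (likely(cache->slots_ret)) {
		/* take cache->free_lock once for the whole batch */
		spin_lock_irq(&cache->free_lock);
		for (i = 0; i < nr; i++, entry.val++) {
			if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) {
				/* flush the per-cpu return cache to the global pool */
				swapcache_free_entries(cache->slots_ret, cache->n_ret);
				cache->n_ret = 0;
			}
			cache->slots_ret[cache->n_ret++] = entry;
		}
		spin_unlock_irq(&cache->free_lock);
	} else {
		/* slot cache not set up: free each entry directly */
		for (i = 0; i < nr; i++, entry.val++)
			swapcache_free_entries(&entry, 1);
	}
}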

>
> >>
> >> >>
> >> >> > we are mTHP, so we can't assume our size is SWAPFILE_CLUSTER?
> >> >> > or you want to check free_entries == "1 << swap_entry_order(folio_order(folio))"
> >> >> > instead of SWAPFILE_CLUSTER for the "for (i = 0; i < size; i++, entry.val++)"
> >> >> > path?
> >> >>
> >> >> Just replace SWAPFILE_CLUSTER with "nr_pages" in your code.
> >> >>
> >> >> >
> >> >> >> >>
> >> >> >> >> And, we should add batching in __swap_entry_free().  That will help
> >> >> >> >> free_swap_and_cache_nr() too.
> >> >> >
> >> >> > Chris Li and I actually discussed it before, while I completely
> >> >> > agree this can be batched. but i'd like to defer this as an incremental
> >> >> > patchset later to keep this swapcache-refault small.
> >> >>
> >> >> OK.
> >> >>
> >> >> >>
> >> >> >> Please consider this too.
>
> --
> Best Regards,
> Huang, Ying
Barry Song April 18, 2024, 5:27 a.m. UTC | #20
On Wed, Apr 17, 2024 at 1:35 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Apr 17, 2024 at 12:34 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Barry Song <21cnbao@gmail.com> writes:
> >
> > > On Tue, Apr 16, 2024 at 3:13 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >>
> > >> Barry Song <21cnbao@gmail.com> writes:
> > >>
> > >> > On Tue, Apr 16, 2024 at 1:42 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >>
> > >> >> Barry Song <21cnbao@gmail.com> writes:
> > >> >>
> > >> >> > On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >> >>
> > >> >> >> Barry Song <21cnbao@gmail.com> writes:
> > >> >> >>
> > >> >> >> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >> >> >>
> > >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> > >> >> >> >>
> > >> >> >> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >> >> >> >>
> > >> >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> > >> >> >> >> >>
> > >> >> >> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
> > >> >> >> >> >> >
> > >> >> >> >> >> > While swapping in a large folio, we need to free swaps related to the whole
> > >> >> >> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> > >> >> >> >> >> > to introduce an API for batched free.
> > >> >> >> >> >> >
> > >> >> >> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > >> >> >> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > >> >> >> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > >> >> >> >> >> > ---
> > >> >> >> >> >> >  include/linux/swap.h |  5 +++++
> > >> >> >> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> > >> >> >> >> >> >  2 files changed, 56 insertions(+)
> > >> >> >> >> >> >
> > >> >> >> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > >> >> >> >> >> > index 11c53692f65f..b7a107e983b8 100644
> > >> >> >> >> >> > --- a/include/linux/swap.h
> > >> >> >> >> >> > +++ b/include/linux/swap.h
> > >> >> >> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> > >> >> >> >> >> >  extern int swap_duplicate(swp_entry_t);
> > >> >> >> >> >> >  extern int swapcache_prepare(swp_entry_t);
> > >> >> >> >> >> >  extern void swap_free(swp_entry_t);
> > >> >> >> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> > >> >> >> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> > >> >> >> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> > >> >> >> >> >> >  int swap_type_of(dev_t device, sector_t offset);
> > >> >> >> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> > >> >> >> >> >> >  {
> > >> >> >> >> >> >  }
> > >> >> >> >> >> >
> > >> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> > >> >> >> >> >> > +{
> > >> >> >> >> >> > +}
> > >> >> >> >> >> > +
> > >> >> >> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> > >> >> >> >> >> >  {
> > >> >> >> >> >> >  }
> > >> >> >> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > >> >> >> >> >> > index 28642c188c93..f4c65aeb088d 100644
> > >> >> >> >> >> > --- a/mm/swapfile.c
> > >> >> >> >> >> > +++ b/mm/swapfile.c
> > >> >> >> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> > >> >> >> >> >> >               __swap_entry_free(p, entry);
> > >> >> >> >> >> >  }
> > >> >> >> >> >> >
> > >> >> >> >> >> > +/*
> > >> >> >> >> >> > + * Free up the maximum number of swap entries at once to limit the
> > >> >> >> >> >> > + * maximum kernel stack usage.
> > >> >> >> >> >> > + */
> > >> >> >> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> > >> >> >> >> >> > +
> > >> >> >> >> >> > +/*
> > >> >> >> >> >> > + * Called after swapping in a large folio,
> > >> >> >> >> >>
> > >> >> >> >> >> IMHO, it's not good to document the caller in the function definition.
> > >> >> >> >> >> Because this will discourage function reusing.
> > >> >> >> >> >
> > >> >> >> >> > ok. right now there is only one user that is why it is added. but i agree
> > >> >> >> >> > we can actually remove this.
> > >> >> >> >> >
> > >> >> >> >> >>
> > >> >> >> >> >> > batched free swap entries
> > >> >> >> >> >> > + * for this large folio, entry should be for the first subpage and
> > >> >> >> >> >> > + * its offset is aligned with nr_pages
> > >> >> >> >> >>
> > >> >> >> >> >> Why do we need this?
> > >> >> >> >> >
> > >> >> >> >> > This is a fundamental requirement for the existing kernel, folio's
> > >> >> >> >> > swap offset is naturally aligned from the first moment add_to_swap
> > >> >> >> >> > to add swapcache's xa. so this comment is describing the existing
> > >> >> >> >> > fact. In the future, if we want to support swap-out folio to discontiguous
> > >> >> >> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
> > >> >> >> >> > instead pass ptep or another different data struct which can connect
> > >> >> >> >> > multiple discontiguous swap offsets.
> > >> >> >> >> >
> > >> >> >> >> > I feel like we only need "for this large folio, entry should be for
> > >> >> >> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
> > >> >> >> >> > the latter is not important to this context at all.
> > >> >> >> >>
> > >> >> >> >> IIUC, all these are requirements of the only caller now, not the
> > >> >> >> >> function itself.  If only part of the all swap entries of a mTHP are
> > >> >> >> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
> > >> >> >> >> so, why not make swap_free_nr() as general as possible?
> > >> >> >> >
> > >> >> >> > right , i believe we can make swap_free_nr() as general as possible.
> > >> >> >> >
> > >> >> >> >>
> > >> >> >> >> >>
> > >> >> >> >> >> > + */
> > >> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> > >> >> >> >> >> > +{
> > >> >> >> >> >> > +     int i, j;
> > >> >> >> >> >> > +     struct swap_cluster_info *ci;
> > >> >> >> >> >> > +     struct swap_info_struct *p;
> > >> >> >> >> >> > +     unsigned int type = swp_type(entry);
> > >> >> >> >> >> > +     unsigned long offset = swp_offset(entry);
> > >> >> >> >> >> > +     int batch_nr, remain_nr;
> > >> >> >> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> > >> >> >> >> >> > +
> > >> >> >> >> >> > +     /* all swap entries are within a cluster for mTHP */
> > >> >> >> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> > >> >> >> >> >> > +
> > >> >> >> >> >> > +     if (nr_pages == 1) {
> > >> >> >> >> >> > +             swap_free(entry);
> > >> >> >> >> >> > +             return;
> > >> >> >> >> >> > +     }
> > >> >> >> >> >>
> > >> >> >> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
> > >> >> >> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
> > >> >> >> >> >> to avoid duplicate functions between mTHP and normal small folio.
> > >> >> >> >> >> Right?
> > >> >> >> >> >
> > >> >> >> >> > I don't see why.
> > >> >> >> >>
> > >> >> >> >> Because duplicated implementation are hard to maintain in the long term.
> > >> >> >> >
> > >> >> >> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
> > >> >> >> > was missed. Obviously I meant "why not", there was a "but" after it :-)
> > >> >> >> >
> > >> >> >> >>
> > >> >> >> >> > but we have lots of places calling swap_free(), we may
> > >> >> >> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
> > >> >> >> >> > way is making swap_free() a wrapper of swap_free_nr() always using
> > >> >> >> >> > 1 as the argument. In both cases, we are changing the semantics of
> > >> >> >> >> > swap_free_nr() to partially freeing large folio cases and have to drop
> > >> >> >> >> > "entry should be for the first subpage" then.
> > >> >> >> >> >
> > >> >> >> >> > Right now, the semantics is
> > >> >> >> >> > * swap_free_nr() for an entire large folio;
> > >> >> >> >> > * swap_free() for one entry of either a large folio or a small folio
> > >> >> >> >>
> > >> >> >> >> As above, I don't think the these semantics are important for
> > >> >> >> >> swap_free_nr() implementation.
> > >> >> >> >
> > >> >> >> > right. I agree. If we are ready to change all those callers, nothing
> > >> >> >> > can stop us from removing swap_free().
> > >> >> >> >
> > >> >> >> >>
> > >> >> >> >> >>
> > >> >> >> >> >> > +
> > >> >> >> >> >> > +     remain_nr = nr_pages;
> > >> >> >> >> >> > +     p = _swap_info_get(entry);
> > >> >> >> >> >> > +     if (p) {
> > >> >> >> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
> > >> >> >> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> > >> >> >> >> >> > +
> > >> >> >> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
> > >> >> >> >> >> > +                     for (j = 0; j < batch_nr; j++) {
> > >> >> >> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> > >> >> >> >> >> > +                                     __bitmap_set(usage, j, 1);
> > >> >> >> >> >> > +                     }
> > >> >> >> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
> > >> >> >> >> >> > +
> > >> >> >> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
> > >> >> >> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> > >> >> >> >> >> > +
> > >> >> >> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> > >> >> >> >> >> > +                     remain_nr -= batch_nr;
> > >> >> >> >> >> > +             }
> > >> >> >> >> >> > +     }
> > >> >> >> >> >> > +}
> > >> >> >> >> >> > +
> > >> >> >> >> >> >  /*
> > >> >> >> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> > >> >> >> >> >> >   */
> > >> >> >> >> >>
> > >> >> >> >> >> put_swap_folio() implements batching in another method.  Do you think
> > >> >> >> >> >> that it's good to use the batching method in that function here?  It
> > >> >> >> >> >> avoids to use bitmap operations and stack space.
> > >> >> >> >> >
> > >> >> >> >> > Chuanhua has strictly limited the maximum stack usage to several
> > >> >> >> >> > unsigned long,
> > >> >> >> >>
> > >> >> >> >> 512 / 8 = 64 bytes.
> > >> >> >> >>
> > >> >> >> >> So, not trivial.
> > >> >> >> >>
> > >> >> >> >> > so this should be safe. on the other hand, i believe this
> > >> >> >> >> > implementation is more efficient, as  put_swap_folio() might lock/
> > >> >> >> >> > unlock much more often whenever __swap_entry_free_locked returns
> > >> >> >> >> > 0.
> > >> >> >> >>
> > >> >> >> >> There are 2 most common use cases,
> > >> >> >> >>
> > >> >> >> >> - all swap entries have usage count == 0
> > >> >> >> >> - all swap entries have usage count != 0
> > >> >> >> >>
> > >> >> >> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
> > >> >> >> >> find possible use cases other than above.
> > >> >> >> >
> > >> >> >> > i guess the point is free_swap_slot() shouldn't be called within
> > >> >> >> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
> > >> >> >> > we'll have to unlock and lock nr_pages times?  and this is the most
> > >> >> >> > common scenario.
> > >> >> >>
> > >> >> >> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
> > >> >> >> is, nr_pages) or 0.  These are the most common cases.
> > >> >> >>
> > >> >> >
> > >> >> > i am actually talking about the below code path,
> > >> >> >
> > >> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> > >> >> > {
> > >> >> >         ...
> > >> >> >
> > >> >> >         ci = lock_cluster_or_swap_info(si, offset);
> > >> >> >         ...
> > >> >> >         for (i = 0; i < size; i++, entry.val++) {
> > >> >> >                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
> > >> >> >                         unlock_cluster_or_swap_info(si, ci);
> > >> >> >                         free_swap_slot(entry);
> > >> >> >                         if (i == size - 1)
> > >> >> >                                 return;
> > >> >> >                         lock_cluster_or_swap_info(si, offset);
> > >> >> >                 }
> > >> >> >         }
> > >> >> >         unlock_cluster_or_swap_info(si, ci);
> > >> >> > }
> > >> >> >
> > >> >> > but i guess you are talking about the below code path:
> > >> >> >
> > >> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> > >> >> > {
> > >> >> >         ...
> > >> >> >
> > >> >> >         ci = lock_cluster_or_swap_info(si, offset);
> > >> >> >         if (size == SWAPFILE_CLUSTER) {
> > >> >> >                 map = si->swap_map + offset;
> > >> >> >                 for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> > >> >> >                         val = map[i];
> > >> >> >                         VM_BUG_ON(!(val & SWAP_HAS_CACHE));
> > >> >> >                         if (val == SWAP_HAS_CACHE)
> > >> >> >                                 free_entries++;
> > >> >> >                 }
> > >> >> >                 if (free_entries == SWAPFILE_CLUSTER) {
> > >> >> >                         unlock_cluster_or_swap_info(si, ci);
> > >> >> >                         spin_lock(&si->lock);
> > >> >> >                         mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
> > >> >> >                         swap_free_cluster(si, idx);
> > >> >> >                         spin_unlock(&si->lock);
> > >> >> >                         return;
> > >> >> >                 }
> > >> >> >         }
> > >> >> > }
> > >> >>
> > >> >> I am talking about both code paths.  In 2 most common cases,
> > >> >> __swap_entry_free_locked() will return 0 or !0 for all entries in range.
> > >> >
> > >> > I grasp your point, but if conditions involving 0 or non-0 values fail, we'll
> > >> > end up repeatedly unlocking and locking. Picture a scenario with a large
> > >> > folio shared by multiple processes. One process might unmap a portion
> > >> > while another still holds an entire mapping to it. This could lead to situations
> > >> > where free_entries doesn't equal 0 and free_entries doesn't equal
> > >> > nr_pages, resulting in multiple unlock and lock operations.
> > >>
> > >> This is impossible in current caller, because the folio is in the swap
> > >> cache.  But if we move the change to __swap_entry_free_nr(), we may run
> > >> into this situation.
> > >
> > > I don't understand why it is impossible, after try_to_unmap_one() has done
> > > on one process, mprotect and munmap called on a part of the large folio
> > > pte entries which now have been swap entries, we are removing the PTE
> > > for this part. Another process can entirely hit the swapcache and have
> > > all swap entries mapped there, and we call swap_free_nr(entry, nr_pages) in
> > > do_swap_page. Within those swap entries, some have swapcount=1 and others
> > > have swapcount > 1. Am I missing something?
> >
> > For swap entries with swapcount=1, its sis->swap_map[] will be
> >
> > 1 | SWAP_HAS_CACHE
> >
> > so, __swap_entry_free_locked() will return SWAP_HAS_CACHE instead of 0.
> >
> > The swap entries will be free in
> >
> > folio_free_swap
> >   delete_from_swap_cache
> >     put_swap_folio
> >
>
> Yes. I realized this after replying to you yesterday.
>
> > >> > Chuanhua has invested significant effort in following Ryan's suggestion
> > >> > for the current approach, which generally handles all cases, especially
> > >> > partial unmapping. Additionally, the widespread use of swap_free_nr()
> > >> > as you suggested across various scenarios is noteworthy.
> > >> >
> > >> > Unless there's evidence indicating performance issues or bugs, I believe
> > >> > the current approach remains preferable.
> > >>
> > >> TBH, I don't like the large stack space usage (64 bytes).  How about use
> > >> a "unsigned long" as bitmap?  Then, we use much less stack space, use
> > >> bitmap == 0 and bitmap == (unsigned long)(-1) to check the most common
> > >> use cases.  And, we have enough batching.
> > >
> > > that is quite a straightforward modification like,
> > >
> > > - #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> > > + #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 64 ? 64 : SWAPFILE_CLUSTER)
> > >
> > > there is no necessity to remove the bitmap API and move to raw
> > > unsigned long operations.
> > > as bitmap is exactly some unsigned long. on 64bit CPU, we are now one
> > > unsigned long,
> > > on 32bit CPU, it is now two unsigned long.
> >
> > Yes.  We can still use most bitmap APIs if we use "unsigned long" as
> > bitmap.  The advantage of "unsigned long" is to guarantee that
> > bitmap_empty() and bitmap_full() is trivial.  We can use that for
> > optimization.  For example, we can skip unlock/lock if bitmap_empty().
>
> anyway we have avoided lock_cluster_or_swap_info and unlock_cluster_or_swap_info
> for each individual swap entry.
>
> if bitma_empty(), we won't call free_swap_slot() so no chance to
> further take any lock,
> right?
>
> the optimization of bitmap_full() seems to be more useful only after we have
> void free_swap_slot(swp_entry_t entry, int nr)
>
> in which we can avoid many spin_lock_irq(&cache->free_lock);
>
> On the other hand, it seems we can directly call
> swapcache_free_entries() to skip cache if
> nr_pages >= SWAP_BATCH(64) this might be an optimization as we are now
> having a bitmap exactly equals 64.

Hi Ying,
considering the below code, which has changed the bitmap to 64 bits and generally
supports different nr_pages (1, and even crossing clusters),

#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 64 ? 64 : SWAPFILE_CLUSTER)

void swap_free_nr(swp_entry_t entry, int nr_pages)
{
        int i = 0, j;
        struct swap_cluster_info *ci;
        struct swap_info_struct *p;
        unsigned int type = swp_type(entry);
        unsigned long offset = swp_offset(entry);
        int batch_nr, remain_nr;
        DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };

        remain_nr = nr_pages;
        p = _swap_info_get(entry);
        if (!p)
                return;

        for ( ; ; ) {
                batch_nr = min3(SWAP_BATCH_NR, remain_nr,
                                (int)(SWAPFILE_CLUSTER - (offset % SWAPFILE_CLUSTER)));

                ci = lock_cluster_or_swap_info(p, offset);
                for (j = 0; j < batch_nr; j++) {
                        if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
                                __bitmap_set(usage, j, 1);
                }
                unlock_cluster_or_swap_info(p, ci);

                for_each_clear_bit(j, usage, batch_nr)
                        free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));

                i += batch_nr;
                if (i >= nr_pages)
                        break;

                bitmap_clear(usage, 0, SWAP_BATCH_NR);
                remain_nr -= batch_nr;
        }
}

I still don't see the benefits of using bitmap_full and bitmap_empty over simple
for_each_clear_bit() unless we begin to support free_swap_slot_nr(), which,
I believe, needs a separate incremental patchset.
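
(To make concrete what I have in mind for free_swap_slot_nr() -- a purely
hypothetical sketch building on the swapcache_free_entries() idea above; the
name, signature and threshold are illustrative only and not part of this series:)

static void free_swap_slot_nr(swp_entry_t *entries, int nr)
{
	/*
	 * Sketch only: for a big enough batch, bypass the per-cpu slot
	 * cache and hand the entries to swapcache_free_entries() directly;
	 * otherwise fall back to the existing one-by-one path.
	 */
	if (nr >= SWAP_BATCH_NR) {
		swapcache_free_entries(entries, nr);
		return;
	}
	while (nr--)
		free_swap_slot(*entries++);
}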

Using bitmap_empty and bitmap_full: if we want to free all slots, we need
if (bitmap_empty(usage))
{
    for (j = 0; j < batch_nr; j++)
              free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
}
This seems to be just a game of replacing for_each_clear_bit() with
bitmap_empty() plus another for loop.

If we don't want to free any of them, we need
if (bitmap_full(usage))
       do_nothing;

In the for_each_clear_bit() case, the loop simply ends.

What's your proposed code for using bitmap_empty and bitmap_full here?
Am I missing something?

>
> >
> > >>
> > >> >>
> > >> >> > we are mTHP, so we can't assume our size is SWAPFILE_CLUSTER?
> > >> >> > or you want to check free_entries == "1 << swap_entry_order(folio_order(folio))"
> > >> >> > instead of SWAPFILE_CLUSTER for the "for (i = 0; i < size; i++, entry.val++)"
> > >> >> > path?
> > >> >>
> > >> >> Just replace SWAPFILE_CLUSTER with "nr_pages" in your code.
> > >> >>
> > >> >> >
> > >> >> >> >>
> > >> >> >> >> And, we should add batching in __swap_entry_free().  That will help
> > >> >> >> >> free_swap_and_cache_nr() too.
> > >> >> >
> > >> >> > Chris Li and I actually discussed it before, while I completely
> > >> >> > agree this can be batched. but i'd like to defer this as an incremental
> > >> >> > patchset later to keep this swapcache-refault small.
> > >> >>
> > >> >> OK.
> > >> >>
> > >> >> >>
> > >> >> >> Please consider this too.
> >
> > --
> > Best Regards,
> > Huang, Ying

Thanks
Barry
Huang, Ying April 18, 2024, 8:55 a.m. UTC | #21
Barry Song <21cnbao@gmail.com> writes:

> On Wed, Apr 17, 2024 at 1:35 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Wed, Apr 17, 2024 at 12:34 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >
>> > Barry Song <21cnbao@gmail.com> writes:
>> >
>> > > On Tue, Apr 16, 2024 at 3:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > >>
>> > >> Barry Song <21cnbao@gmail.com> writes:
>> > >>
>> > >> > On Tue, Apr 16, 2024 at 1:42 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > >> >>
>> > >> >> Barry Song <21cnbao@gmail.com> writes:
>> > >> >>
>> > >> >> > On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > >> >> >>
>> > >> >> >> Barry Song <21cnbao@gmail.com> writes:
>> > >> >> >>
>> > >> >> >> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > >> >> >> >>
>> > >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
>> > >> >> >> >>
>> > >> >> >> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > >> >> >> >> >>
>> > >> >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
>> > >> >> >> >> >>
>> > >> >> >> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > While swapping in a large folio, we need to free swaps related to the whole
>> > >> >> >> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
>> > >> >> >> >> >> > to introduce an API for batched free.
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>> > >> >> >> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>> > >> >> >> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> > >> >> >> >> >> > ---
>> > >> >> >> >> >> >  include/linux/swap.h |  5 +++++
>> > >> >> >> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
>> > >> >> >> >> >> >  2 files changed, 56 insertions(+)
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> > >> >> >> >> >> > index 11c53692f65f..b7a107e983b8 100644
>> > >> >> >> >> >> > --- a/include/linux/swap.h
>> > >> >> >> >> >> > +++ b/include/linux/swap.h
>> > >> >> >> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>> > >> >> >> >> >> >  extern int swap_duplicate(swp_entry_t);
>> > >> >> >> >> >> >  extern int swapcache_prepare(swp_entry_t);
>> > >> >> >> >> >> >  extern void swap_free(swp_entry_t);
>> > >> >> >> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
>> > >> >> >> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> > >> >> >> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>> > >> >> >> >> >> >  int swap_type_of(dev_t device, sector_t offset);
>> > >> >> >> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
>> > >> >> >> >> >> >  {
>> > >> >> >> >> >> >  }
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> > >> >> >> >> >> > +{
>> > >> >> >> >> >> > +}
>> > >> >> >> >> >> > +
>> > >> >> >> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>> > >> >> >> >> >> >  {
>> > >> >> >> >> >> >  }
>> > >> >> >> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> > >> >> >> >> >> > index 28642c188c93..f4c65aeb088d 100644
>> > >> >> >> >> >> > --- a/mm/swapfile.c
>> > >> >> >> >> >> > +++ b/mm/swapfile.c
>> > >> >> >> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
>> > >> >> >> >> >> >               __swap_entry_free(p, entry);
>> > >> >> >> >> >> >  }
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > +/*
>> > >> >> >> >> >> > + * Free up the maximum number of swap entries at once to limit the
>> > >> >> >> >> >> > + * maximum kernel stack usage.
>> > >> >> >> >> >> > + */
>> > >> >> >> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
>> > >> >> >> >> >> > +
>> > >> >> >> >> >> > +/*
>> > >> >> >> >> >> > + * Called after swapping in a large folio,
>> > >> >> >> >> >>
>> > >> >> >> >> >> IMHO, it's not good to document the caller in the function definition.
>> > >> >> >> >> >> Because this will discourage function reusing.
>> > >> >> >> >> >
>> > >> >> >> >> > ok. right now there is only one user that is why it is added. but i agree
>> > >> >> >> >> > we can actually remove this.
>> > >> >> >> >> >
>> > >> >> >> >> >>
>> > >> >> >> >> >> > batched free swap entries
>> > >> >> >> >> >> > + * for this large folio, entry should be for the first subpage and
>> > >> >> >> >> >> > + * its offset is aligned with nr_pages
>> > >> >> >> >> >>
>> > >> >> >> >> >> Why do we need this?
>> > >> >> >> >> >
>> > >> >> >> >> > This is a fundamental requirement for the existing kernel, folio's
>> > >> >> >> >> > swap offset is naturally aligned from the first moment add_to_swap
>> > >> >> >> >> > to add swapcache's xa. so this comment is describing the existing
>> > >> >> >> >> > fact. In the future, if we want to support swap-out folio to discontiguous
>> > >> >> >> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
>> > >> >> >> >> > instead pass ptep or another different data struct which can connect
>> > >> >> >> >> > multiple discontiguous swap offsets.
>> > >> >> >> >> >
>> > >> >> >> >> > I feel like we only need "for this large folio, entry should be for
>> > >> >> >> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
>> > >> >> >> >> > the latter is not important to this context at all.
>> > >> >> >> >>
>> > >> >> >> >> IIUC, all these are requirements of the only caller now, not the
>> > >> >> >> >> function itself.  If only part of the all swap entries of a mTHP are
>> > >> >> >> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
>> > >> >> >> >> so, why not make swap_free_nr() as general as possible?
>> > >> >> >> >
>> > >> >> >> > right , i believe we can make swap_free_nr() as general as possible.
>> > >> >> >> >
>> > >> >> >> >>
>> > >> >> >> >> >>
>> > >> >> >> >> >> > + */
>> > >> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
>> > >> >> >> >> >> > +{
>> > >> >> >> >> >> > +     int i, j;
>> > >> >> >> >> >> > +     struct swap_cluster_info *ci;
>> > >> >> >> >> >> > +     struct swap_info_struct *p;
>> > >> >> >> >> >> > +     unsigned int type = swp_type(entry);
>> > >> >> >> >> >> > +     unsigned long offset = swp_offset(entry);
>> > >> >> >> >> >> > +     int batch_nr, remain_nr;
>> > >> >> >> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
>> > >> >> >> >> >> > +
>> > >> >> >> >> >> > +     /* all swap entries are within a cluster for mTHP */
>> > >> >> >> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>> > >> >> >> >> >> > +
>> > >> >> >> >> >> > +     if (nr_pages == 1) {
>> > >> >> >> >> >> > +             swap_free(entry);
>> > >> >> >> >> >> > +             return;
>> > >> >> >> >> >> > +     }
>> > >> >> >> >> >>
>> > >> >> >> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
>> > >> >> >> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
>> > >> >> >> >> >> to avoid duplicate functions between mTHP and normal small folio.
>> > >> >> >> >> >> Right?
>> > >> >> >> >> >
>> > >> >> >> >> > I don't see why.
>> > >> >> >> >>
>> > >> >> >> >> Because duplicated implementation are hard to maintain in the long term.
>> > >> >> >> >
>> > >> >> >> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
>> > >> >> >> > was missed. Obviously I meant "why not", there was a "but" after it :-)
>> > >> >> >> >
>> > >> >> >> >>
>> > >> >> >> >> > but we have lots of places calling swap_free(), we may
>> > >> >> >> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
>> > >> >> >> >> > way is making swap_free() a wrapper of swap_free_nr() always using
>> > >> >> >> >> > 1 as the argument. In both cases, we are changing the semantics of
>> > >> >> >> >> > swap_free_nr() to partially freeing large folio cases and have to drop
>> > >> >> >> >> > "entry should be for the first subpage" then.
>> > >> >> >> >> >
>> > >> >> >> >> > Right now, the semantics is
>> > >> >> >> >> > * swap_free_nr() for an entire large folio;
>> > >> >> >> >> > * swap_free() for one entry of either a large folio or a small folio
>> > >> >> >> >>
>> > >> >> >> >> As above, I don't think the these semantics are important for
>> > >> >> >> >> swap_free_nr() implementation.
>> > >> >> >> >
>> > >> >> >> > right. I agree. If we are ready to change all those callers, nothing
>> > >> >> >> > can stop us from removing swap_free().
>> > >> >> >> >
>> > >> >> >> >>
>> > >> >> >> >> >>
>> > >> >> >> >> >> > +
>> > >> >> >> >> >> > +     remain_nr = nr_pages;
>> > >> >> >> >> >> > +     p = _swap_info_get(entry);
>> > >> >> >> >> >> > +     if (p) {
>> > >> >> >> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
>> > >> >> >> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
>> > >> >> >> >> >> > +
>> > >> >> >> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
>> > >> >> >> >> >> > +                     for (j = 0; j < batch_nr; j++) {
>> > >> >> >> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
>> > >> >> >> >> >> > +                                     __bitmap_set(usage, j, 1);
>> > >> >> >> >> >> > +                     }
>> > >> >> >> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
>> > >> >> >> >> >> > +
>> > >> >> >> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
>> > >> >> >> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
>> > >> >> >> >> >> > +
>> > >> >> >> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
>> > >> >> >> >> >> > +                     remain_nr -= batch_nr;
>> > >> >> >> >> >> > +             }
>> > >> >> >> >> >> > +     }
>> > >> >> >> >> >> > +}
>> > >> >> >> >> >> > +
>> > >> >> >> >> >> >  /*
>> > >> >> >> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
>> > >> >> >> >> >> >   */
>> > >> >> >> >> >>
>> > >> >> >> >> >> put_swap_folio() implements batching in another method.  Do you think
>> > >> >> >> >> >> that it's good to use the batching method in that function here?  It
>> > >> >> >> >> >> avoids to use bitmap operations and stack space.
>> > >> >> >> >> >
>> > >> >> >> >> > Chuanhua has strictly limited the maximum stack usage to several
>> > >> >> >> >> > unsigned long,
>> > >> >> >> >>
>> > >> >> >> >> 512 / 8 = 64 bytes.
>> > >> >> >> >>
>> > >> >> >> >> So, not trivial.
>> > >> >> >> >>
>> > >> >> >> >> > so this should be safe. on the other hand, i believe this
>> > >> >> >> >> > implementation is more efficient, as  put_swap_folio() might lock/
>> > >> >> >> >> > unlock much more often whenever __swap_entry_free_locked returns
>> > >> >> >> >> > 0.
>> > >> >> >> >>
>> > >> >> >> >> There are 2 most common use cases,
>> > >> >> >> >>
>> > >> >> >> >> - all swap entries have usage count == 0
>> > >> >> >> >> - all swap entries have usage count != 0
>> > >> >> >> >>
>> > >> >> >> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
>> > >> >> >> >> find possible use cases other than above.
>> > >> >> >> >
>> > >> >> >> > i guess the point is free_swap_slot() shouldn't be called within
>> > >> >> >> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
>> > >> >> >> > we'll have to unlock and lock nr_pages times?  and this is the most
>> > >> >> >> > common scenario.
>> > >> >> >>
>> > >> >> >> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
>> > >> >> >> is, nr_pages) or 0.  These are the most common cases.
>> > >> >> >>
>> > >> >> >
>> > >> >> > i am actually talking about the below code path,
>> > >> >> >
>> > >> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
>> > >> >> > {
>> > >> >> >         ...
>> > >> >> >
>> > >> >> >         ci = lock_cluster_or_swap_info(si, offset);
>> > >> >> >         ...
>> > >> >> >         for (i = 0; i < size; i++, entry.val++) {
>> > >> >> >                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
>> > >> >> >                         unlock_cluster_or_swap_info(si, ci);
>> > >> >> >                         free_swap_slot(entry);
>> > >> >> >                         if (i == size - 1)
>> > >> >> >                                 return;
>> > >> >> >                         lock_cluster_or_swap_info(si, offset);
>> > >> >> >                 }
>> > >> >> >         }
>> > >> >> >         unlock_cluster_or_swap_info(si, ci);
>> > >> >> > }
>> > >> >> >
>> > >> >> > but i guess you are talking about the below code path:
>> > >> >> >
>> > >> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
>> > >> >> > {
>> > >> >> >         ...
>> > >> >> >
>> > >> >> >         ci = lock_cluster_or_swap_info(si, offset);
>> > >> >> >         if (size == SWAPFILE_CLUSTER) {
>> > >> >> >                 map = si->swap_map + offset;
>> > >> >> >                 for (i = 0; i < SWAPFILE_CLUSTER; i++) {
>> > >> >> >                         val = map[i];
>> > >> >> >                         VM_BUG_ON(!(val & SWAP_HAS_CACHE));
>> > >> >> >                         if (val == SWAP_HAS_CACHE)
>> > >> >> >                                 free_entries++;
>> > >> >> >                 }
>> > >> >> >                 if (free_entries == SWAPFILE_CLUSTER) {
>> > >> >> >                         unlock_cluster_or_swap_info(si, ci);
>> > >> >> >                         spin_lock(&si->lock);
>> > >> >> >                         mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
>> > >> >> >                         swap_free_cluster(si, idx);
>> > >> >> >                         spin_unlock(&si->lock);
>> > >> >> >                         return;
>> > >> >> >                 }
>> > >> >> >         }
>> > >> >> > }
>> > >> >>
>> > >> >> I am talking about both code paths.  In 2 most common cases,
>> > >> >> __swap_entry_free_locked() will return 0 or !0 for all entries in range.
>> > >> >
>> > >> > I grasp your point, but if conditions involving 0 or non-0 values fail, we'll
>> > >> > end up repeatedly unlocking and locking. Picture a scenario with a large
>> > >> > folio shared by multiple processes. One process might unmap a portion
>> > >> > while another still holds an entire mapping to it. This could lead to situations
>> > >> > where free_entries doesn't equal 0 and free_entries doesn't equal
>> > >> > nr_pages, resulting in multiple unlock and lock operations.
>> > >>
>> > >> This is impossible in current caller, because the folio is in the swap
>> > >> cache.  But if we move the change to __swap_entry_free_nr(), we may run
>> > >> into this situation.
>> > >
>> > > I don't understand why it is impossible, after try_to_unmap_one() has done
>> > > on one process, mprotect and munmap called on a part of the large folio
>> > > pte entries which now have been swap entries, we are removing the PTE
>> > > for this part. Another process can entirely hit the swapcache and have
>> > > all swap entries mapped there, and we call swap_free_nr(entry, nr_pages) in
>> > > do_swap_page. Within those swap entries, some have swapcount=1 and others
>> > > have swapcount > 1. Am I missing something?
>> >
>> > For swap entries with swapcount=1, its sis->swap_map[] will be
>> >
>> > 1 | SWAP_HAS_CACHE
>> >
>> > so, __swap_entry_free_locked() will return SWAP_HAS_CACHE instead of 0.
>> >
>> > The swap entries will be free in
>> >
>> > folio_free_swap
>> >   delete_from_swap_cache
>> >     put_swap_folio
>> >
>>
>> Yes. I realized this after replying to you yesterday.
>>
>> > >> > Chuanhua has invested significant effort in following Ryan's suggestion
>> > >> > for the current approach, which generally handles all cases, especially
>> > >> > partial unmapping. Additionally, the widespread use of swap_free_nr()
>> > >> > as you suggested across various scenarios is noteworthy.
>> > >> >
>> > >> > Unless there's evidence indicating performance issues or bugs, I believe
>> > >> > the current approach remains preferable.
>> > >>
>> > >> TBH, I don't like the large stack space usage (64 bytes).  How about use
>> > >> a "unsigned long" as bitmap?  Then, we use much less stack space, use
>> > >> bitmap == 0 and bitmap == (unsigned long)(-1) to check the most common
>> > >> use cases.  And, we have enough batching.
>> > >
>> > > that is quite a straightforward modification like,
>> > >
>> > > - #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
>> > > + #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 64 ? 64 : SWAPFILE_CLUSTER)
>> > >
>> > > there is no necessity to remove the bitmap API and move to raw
>> > > unsigned long operations.
>> > > as bitmap is exactly some unsigned long. on 64bit CPU, we are now one
>> > > unsigned long,
>> > > on 32bit CPU, it is now two unsigned long.
>> >
>> > Yes.  We can still use most bitmap APIs if we use "unsigned long" as
>> > bitmap.  The advantage of "unsigned long" is to guarantee that
>> > bitmap_empty() and bitmap_full() is trivial.  We can use that for
>> > optimization.  For example, we can skip unlock/lock if bitmap_empty().
>>
>> anyway we have avoided lock_cluster_or_swap_info and unlock_cluster_or_swap_info
>> for each individual swap entry.
>>
>> if bitma_empty(), we won't call free_swap_slot() so no chance to
>> further take any lock,
>> right?
>>
>> the optimization of bitmap_full() seems to be more useful only after we have
>> void free_swap_slot(swp_entry_t entry, int nr)
>>
>> in which we can avoid many spin_lock_irq(&cache->free_lock);
>>
>> On the other hand, it seems we can directly call
>> swapcache_free_entries() to skip cache if
>> nr_pages >= SWAP_BATCH(64) this might be an optimization as we are now
>> having a bitmap exactly equals 64.
>
> Hi ying,
> considering the below code which has changed bitmap to 64 and generally support
> different nr_pages(1 and ever cross cluster),
>
> #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 64 ? 64 : SWAPFILE_CLUSTER)
>
> void swap_free_nr(swp_entry_t entry, int nr_pages)
> {
>         int i = 0, j;
>         struct swap_cluster_info *ci;
>         struct swap_info_struct *p;
>         unsigned int type = swp_type(entry);
>         unsigned long offset = swp_offset(entry);
>         int batch_nr, remain_nr;
>         DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
>
>         remain_nr = nr_pages;
>         p = _swap_info_get(entry);
>         if (!p)
>                 return;
>
>         for ( ; ; ) {
>                 batch_nr = min3(SWAP_BATCH_NR, remain_nr,
>                                 (int)(SWAPFILE_CLUSTER - (offset %
> SWAPFILE_CLUSTER)));
>
>                 ci = lock_cluster_or_swap_info(p, offset);
>                 for (j = 0; j < batch_nr; j++) {
>                         if (__swap_entry_free_locked(p, offset + i *
> SWAP_BATCH_NR + j, 1))
>                                 __bitmap_set(usage, j, 1);
>                 }
>                 unlock_cluster_or_swap_info(p, ci);
>
>                 for_each_clear_bit(j, usage, batch_nr)
>                         free_swap_slot(swp_entry(type, offset + i *
> SWAP_BATCH_NR + j));
>
>                 i += batch_nr;
>                 if (i >= nr_pages)
>                         break;
>
>                 bitmap_clear(usage, 0, SWAP_BATCH_NR);
>                 remain_nr -= batch_nr;
>         }
> }
>
> I still don't see the benefits of using bitmap_full and bitmap_empty over simple
> for_each_clear_bit() unless we begin to support free_swap_slot_nr(), which,
> I believe, needs a separate incremental patchset.
>
> using bitmap_empty and full, if we want to free all slots, we need
> if (bitmap_empty(usage))
> {
>     for (i=0;i<batch_nr;i++)
>               free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> }
> This seems just a game replacing for_each_clear_bit by
> bitmap_empty()+another for loop.
>
> if we don't want to free any one, we need
> if(bitmap_full(usage)
>        do_nothing.
>
> in the for_each_clear_bit() case, the loop just simply ends.
>
> What's your proposal code to use bitmap_empty and bitmap_full here?
> Am I missing something?

My idea is something like the below.  It's only build-tested.

static void cluster_swap_free_nr(struct swap_info_struct *sis,
				 unsigned long offset, int nr_pages)
{
	struct swap_cluster_info *ci;
	DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
	int i, nr;

	ci = lock_cluster_or_swap_info(sis, offset);
	while (nr_pages) {
		nr = min(BITS_PER_LONG, nr_pages);
		for (i = 0; i < nr; i++) {
			if (!__swap_entry_free_locked(sis, offset + i, 1))
				bitmap_set(to_free, i, 1);
		}
		if (!bitmap_empty(to_free, BITS_PER_LONG)) {
			unlock_cluster_or_swap_info(sis, ci);
			for_each_set_bit(i, to_free, BITS_PER_LONG)
				free_swap_slot(swp_entry(sis->type, offset + i));
			if (nr == nr_pages)
				return;
			bitmap_clear(to_free, 0, BITS_PER_LONG);
			ci = lock_cluster_or_swap_info(sis, offset);
		}
		offset += nr;
		nr_pages -= nr;
	}
	unlock_cluster_or_swap_info(sis, ci);
}

void swap_free_nr(swp_entry_t entry, int nr_pages)
{
	int nr;
	struct swap_info_struct *sis;
	unsigned long offset = swp_offset(entry);

	sis = _swap_info_get(entry);
	if (!sis)
		return;

	while (nr_pages) {
		nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
		cluster_swap_free_nr(sis, offset, nr);
		offset += nr;
		nr_pages -= nr;
	}
}
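
If we go this way, unifying with swap_free() later should be trivial.  A rough
sketch only, untested, and assuming all existing callers are audited first:

/* Sketch: swap_free() becomes a thin wrapper, leaving one implementation. */
static inline void swap_free(swp_entry_t entry)
{
	swap_free_nr(entry, 1);
}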

--
Best Regards,
Huang, Ying
Barry Song April 18, 2024, 9:14 a.m. UTC | #22
On Thu, Apr 18, 2024 at 8:57 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Apr 17, 2024 at 1:35 PM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Wed, Apr 17, 2024 at 12:34 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >
> >> > Barry Song <21cnbao@gmail.com> writes:
> >> >
> >> > > On Tue, Apr 16, 2024 at 3:13 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >>
> >> > >> Barry Song <21cnbao@gmail.com> writes:
> >> > >>
> >> > >> > On Tue, Apr 16, 2024 at 1:42 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >> >>
> >> > >> >> Barry Song <21cnbao@gmail.com> writes:
> >> > >> >>
> >> > >> >> > On Mon, Apr 15, 2024 at 8:53 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >> >> >>
> >> > >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> > >> >> >>
> >> > >> >> >> > On Mon, Apr 15, 2024 at 8:21 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >> >> >> >>
> >> > >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> > >> >> >> >>
> >> > >> >> >> >> > On Mon, Apr 15, 2024 at 6:19 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >> > >> >> >> >> >> >
> >> > >> >> >> >> >> > While swapping in a large folio, we need to free swaps related to the whole
> >> > >> >> >> >> >> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> >> > >> >> >> >> >> > to introduce an API for batched free.
> >> > >> >> >> >> >> >
> >> > >> >> >> >> >> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >> > >> >> >> >> >> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >> > >> >> >> >> >> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >> > >> >> >> >> >> > ---
> >> > >> >> >> >> >> >  include/linux/swap.h |  5 +++++
> >> > >> >> >> >> >> >  mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++++++++++++++
> >> > >> >> >> >> >> >  2 files changed, 56 insertions(+)
> >> > >> >> >> >> >> >
> >> > >> >> >> >> >> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> > >> >> >> >> >> > index 11c53692f65f..b7a107e983b8 100644
> >> > >> >> >> >> >> > --- a/include/linux/swap.h
> >> > >> >> >> >> >> > +++ b/include/linux/swap.h
> >> > >> >> >> >> >> > @@ -483,6 +483,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >> > >> >> >> >> >> >  extern int swap_duplicate(swp_entry_t);
> >> > >> >> >> >> >> >  extern int swapcache_prepare(swp_entry_t);
> >> > >> >> >> >> >> >  extern void swap_free(swp_entry_t);
> >> > >> >> >> >> >> > +extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> >> > >> >> >> >> >> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >> > >> >> >> >> >> >  extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
> >> > >> >> >> >> >> >  int swap_type_of(dev_t device, sector_t offset);
> >> > >> >> >> >> >> > @@ -564,6 +565,10 @@ static inline void swap_free(swp_entry_t swp)
> >> > >> >> >> >> >> >  {
> >> > >> >> >> >> >> >  }
> >> > >> >> >> >> >> >
> >> > >> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> > >> >> >> >> >> > +{
> >> > >> >> >> >> >> > +}
> >> > >> >> >> >> >> > +
> >> > >> >> >> >> >> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >> > >> >> >> >> >> >  {
> >> > >> >> >> >> >> >  }
> >> > >> >> >> >> >> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> > >> >> >> >> >> > index 28642c188c93..f4c65aeb088d 100644
> >> > >> >> >> >> >> > --- a/mm/swapfile.c
> >> > >> >> >> >> >> > +++ b/mm/swapfile.c
> >> > >> >> >> >> >> > @@ -1356,6 +1356,57 @@ void swap_free(swp_entry_t entry)
> >> > >> >> >> >> >> >               __swap_entry_free(p, entry);
> >> > >> >> >> >> >> >  }
> >> > >> >> >> >> >> >
> >> > >> >> >> >> >> > +/*
> >> > >> >> >> >> >> > + * Free up the maximum number of swap entries at once to limit the
> >> > >> >> >> >> >> > + * maximum kernel stack usage.
> >> > >> >> >> >> >> > + */
> >> > >> >> >> >> >> > +#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> >> > >> >> >> >> >> > +
> >> > >> >> >> >> >> > +/*
> >> > >> >> >> >> >> > + * Called after swapping in a large folio,
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> IMHO, it's not good to document the caller in the function definition.
> >> > >> >> >> >> >> Because this will discourage function reusing.
> >> > >> >> >> >> >
> >> > >> >> >> >> > ok. right now there is only one user that is why it is added. but i agree
> >> > >> >> >> >> > we can actually remove this.
> >> > >> >> >> >> >
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> > batched free swap entries
> >> > >> >> >> >> >> > + * for this large folio, entry should be for the first subpage and
> >> > >> >> >> >> >> > + * its offset is aligned with nr_pages
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> Why do we need this?
> >> > >> >> >> >> >
> >> > >> >> >> >> > This is a fundamental requirement for the existing kernel, folio's
> >> > >> >> >> >> > swap offset is naturally aligned from the first moment add_to_swap
> >> > >> >> >> >> > to add swapcache's xa. so this comment is describing the existing
> >> > >> >> >> >> > fact. In the future, if we want to support swap-out folio to discontiguous
> >> > >> >> >> >> > and not-aligned offsets, we can't pass entry as the parameter, we should
> >> > >> >> >> >> > instead pass ptep or another different data struct which can connect
> >> > >> >> >> >> > multiple discontiguous swap offsets.
> >> > >> >> >> >> >
> >> > >> >> >> >> > I feel like we only need "for this large folio, entry should be for
> >> > >> >> >> >> > the first subpage" and drop "and its offset is aligned with nr_pages",
> >> > >> >> >> >> > the latter is not important to this context at all.
> >> > >> >> >> >>
> >> > >> >> >> >> IIUC, all these are requirements of the only caller now, not the
> >> > >> >> >> >> function itself.  If only part of the all swap entries of a mTHP are
> >> > >> >> >> >> called with swap_free_nr(), can swap_free_nr() still do its work?  If
> >> > >> >> >> >> so, why not make swap_free_nr() as general as possible?
> >> > >> >> >> >
> >> > >> >> >> > right , i believe we can make swap_free_nr() as general as possible.
> >> > >> >> >> >
> >> > >> >> >> >>
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> > + */
> >> > >> >> >> >> >> > +void swap_free_nr(swp_entry_t entry, int nr_pages)
> >> > >> >> >> >> >> > +{
> >> > >> >> >> >> >> > +     int i, j;
> >> > >> >> >> >> >> > +     struct swap_cluster_info *ci;
> >> > >> >> >> >> >> > +     struct swap_info_struct *p;
> >> > >> >> >> >> >> > +     unsigned int type = swp_type(entry);
> >> > >> >> >> >> >> > +     unsigned long offset = swp_offset(entry);
> >> > >> >> >> >> >> > +     int batch_nr, remain_nr;
> >> > >> >> >> >> >> > +     DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> >> > >> >> >> >> >> > +
> >> > >> >> >> >> >> > +     /* all swap entries are within a cluster for mTHP */
> >> > >> >> >> >> >> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >> > >> >> >> >> >> > +
> >> > >> >> >> >> >> > +     if (nr_pages == 1) {
> >> > >> >> >> >> >> > +             swap_free(entry);
> >> > >> >> >> >> >> > +             return;
> >> > >> >> >> >> >> > +     }
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> Is it possible to unify swap_free() and swap_free_nr() into one function
> >> > >> >> >> >> >> with acceptable performance?  IIUC, the general rule in mTHP effort is
> >> > >> >> >> >> >> to avoid duplicate functions between mTHP and normal small folio.
> >> > >> >> >> >> >> Right?
> >> > >> >> >> >> >
> >> > >> >> >> >> > I don't see why.
> >> > >> >> >> >>
> >> > >> >> >> >> Because duplicated implementation are hard to maintain in the long term.
> >> > >> >> >> >
> >> > >> >> >> > sorry, i actually meant "I don't see why not",  for some reason, the "not"
> >> > >> >> >> > was missed. Obviously I meant "why not", there was a "but" after it :-)
> >> > >> >> >> >
> >> > >> >> >> >>
> >> > >> >> >> >> > but we have lots of places calling swap_free(), we may
> >> > >> >> >> >> > have to change them all to call swap_free_nr(entry, 1); the other possible
> >> > >> >> >> >> > way is making swap_free() a wrapper of swap_free_nr() always using
> >> > >> >> >> >> > 1 as the argument. In both cases, we are changing the semantics of
> >> > >> >> >> >> > swap_free_nr() to partially freeing large folio cases and have to drop
> >> > >> >> >> >> > "entry should be for the first subpage" then.
> >> > >> >> >> >> >
> >> > >> >> >> >> > Right now, the semantics is
> >> > >> >> >> >> > * swap_free_nr() for an entire large folio;
> >> > >> >> >> >> > * swap_free() for one entry of either a large folio or a small folio
> >> > >> >> >> >>
> >> > >> >> >> >> As above, I don't think the these semantics are important for
> >> > >> >> >> >> swap_free_nr() implementation.
> >> > >> >> >> >
> >> > >> >> >> > right. I agree. If we are ready to change all those callers, nothing
> >> > >> >> >> > can stop us from removing swap_free().
> >> > >> >> >> >
> >> > >> >> >> >>
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> > +
> >> > >> >> >> >> >> > +     remain_nr = nr_pages;
> >> > >> >> >> >> >> > +     p = _swap_info_get(entry);
> >> > >> >> >> >> >> > +     if (p) {
> >> > >> >> >> >> >> > +             for (i = 0; i < nr_pages; i += batch_nr) {
> >> > >> >> >> >> >> > +                     batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
> >> > >> >> >> >> >> > +
> >> > >> >> >> >> >> > +                     ci = lock_cluster_or_swap_info(p, offset);
> >> > >> >> >> >> >> > +                     for (j = 0; j < batch_nr; j++) {
> >> > >> >> >> >> >> > +                             if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
> >> > >> >> >> >> >> > +                                     __bitmap_set(usage, j, 1);
> >> > >> >> >> >> >> > +                     }
> >> > >> >> >> >> >> > +                     unlock_cluster_or_swap_info(p, ci);
> >> > >> >> >> >> >> > +
> >> > >> >> >> >> >> > +                     for_each_clear_bit(j, usage, batch_nr)
> >> > >> >> >> >> >> > +                             free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> >> > >> >> >> >> >> > +
> >> > >> >> >> >> >> > +                     bitmap_clear(usage, 0, SWAP_BATCH_NR);
> >> > >> >> >> >> >> > +                     remain_nr -= batch_nr;
> >> > >> >> >> >> >> > +             }
> >> > >> >> >> >> >> > +     }
> >> > >> >> >> >> >> > +}
> >> > >> >> >> >> >> > +
> >> > >> >> >> >> >> >  /*
> >> > >> >> >> >> >> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >> > >> >> >> >> >> >   */
> >> > >> >> >> >> >>
> >> > >> >> >> >> >> put_swap_folio() implements batching in another method.  Do you think
> >> > >> >> >> >> >> that it's good to use the batching method in that function here?  It
> >> > >> >> >> >> >> avoids to use bitmap operations and stack space.
> >> > >> >> >> >> >
> >> > >> >> >> >> > Chuanhua has strictly limited the maximum stack usage to several
> >> > >> >> >> >> > unsigned long,
> >> > >> >> >> >>
> >> > >> >> >> >> 512 / 8 = 64 bytes.
> >> > >> >> >> >>
> >> > >> >> >> >> So, not trivial.
> >> > >> >> >> >>
> >> > >> >> >> >> > so this should be safe. on the other hand, i believe this
> >> > >> >> >> >> > implementation is more efficient, as  put_swap_folio() might lock/
> >> > >> >> >> >> > unlock much more often whenever __swap_entry_free_locked returns
> >> > >> >> >> >> > 0.
> >> > >> >> >> >>
> >> > >> >> >> >> There are 2 most common use cases,
> >> > >> >> >> >>
> >> > >> >> >> >> - all swap entries have usage count == 0
> >> > >> >> >> >> - all swap entries have usage count != 0
> >> > >> >> >> >>
> >> > >> >> >> >> In both cases, we only need to lock/unlock once.  In fact, I didn't
> >> > >> >> >> >> find possible use cases other than above.
> >> > >> >> >> >
> >> > >> >> >> > i guess the point is free_swap_slot() shouldn't be called within
> >> > >> >> >> > lock_cluster_or_swap_info? so when we are freeing nr_pages slots,
> >> > >> >> >> > we'll have to unlock and lock nr_pages times?  and this is the most
> >> > >> >> >> > common scenario.
> >> > >> >> >>
> >> > >> >> >> No.  In put_swap_folio(), free_entries is either SWAPFILE_CLUSTER (that
> >> > >> >> >> is, nr_pages) or 0.  These are the most common cases.
> >> > >> >> >>
> >> > >> >> >
> >> > >> >> > i am actually talking about the below code path,
> >> > >> >> >
> >> > >> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> >> > >> >> > {
> >> > >> >> >         ...
> >> > >> >> >
> >> > >> >> >         ci = lock_cluster_or_swap_info(si, offset);
> >> > >> >> >         ...
> >> > >> >> >         for (i = 0; i < size; i++, entry.val++) {
> >> > >> >> >                 if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
> >> > >> >> >                         unlock_cluster_or_swap_info(si, ci);
> >> > >> >> >                         free_swap_slot(entry);
> >> > >> >> >                         if (i == size - 1)
> >> > >> >> >                                 return;
> >> > >> >> >                         lock_cluster_or_swap_info(si, offset);
> >> > >> >> >                 }
> >> > >> >> >         }
> >> > >> >> >         unlock_cluster_or_swap_info(si, ci);
> >> > >> >> > }
> >> > >> >> >
> >> > >> >> > but i guess you are talking about the below code path:
> >> > >> >> >
> >> > >> >> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> >> > >> >> > {
> >> > >> >> >         ...
> >> > >> >> >
> >> > >> >> >         ci = lock_cluster_or_swap_info(si, offset);
> >> > >> >> >         if (size == SWAPFILE_CLUSTER) {
> >> > >> >> >                 map = si->swap_map + offset;
> >> > >> >> >                 for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> >> > >> >> >                         val = map[i];
> >> > >> >> >                         VM_BUG_ON(!(val & SWAP_HAS_CACHE));
> >> > >> >> >                         if (val == SWAP_HAS_CACHE)
> >> > >> >> >                                 free_entries++;
> >> > >> >> >                 }
> >> > >> >> >                 if (free_entries == SWAPFILE_CLUSTER) {
> >> > >> >> >                         unlock_cluster_or_swap_info(si, ci);
> >> > >> >> >                         spin_lock(&si->lock);
> >> > >> >> >                         mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
> >> > >> >> >                         swap_free_cluster(si, idx);
> >> > >> >> >                         spin_unlock(&si->lock);
> >> > >> >> >                         return;
> >> > >> >> >                 }
> >> > >> >> >         }
> >> > >> >> > }
> >> > >> >>
> >> > >> >> I am talking about both code paths.  In 2 most common cases,
> >> > >> >> __swap_entry_free_locked() will return 0 or !0 for all entries in range.
> >> > >> >
> >> > >> > I grasp your point, but if conditions involving 0 or non-0 values fail, we'll
> >> > >> > end up repeatedly unlocking and locking. Picture a scenario with a large
> >> > >> > folio shared by multiple processes. One process might unmap a portion
> >> > >> > while another still holds an entire mapping to it. This could lead to situations
> >> > >> > where free_entries doesn't equal 0 and free_entries doesn't equal
> >> > >> > nr_pages, resulting in multiple unlock and lock operations.
> >> > >>
> >> > >> This is impossible in current caller, because the folio is in the swap
> >> > >> cache.  But if we move the change to __swap_entry_free_nr(), we may run
> >> > >> into this situation.
> >> > >
> >> > > I don't understand why it is impossible, after try_to_unmap_one() has done
> >> > > on one process, mprotect and munmap called on a part of the large folio
> >> > > pte entries which now have been swap entries, we are removing the PTE
> >> > > for this part. Another process can entirely hit the swapcache and have
> >> > > all swap entries mapped there, and we call swap_free_nr(entry, nr_pages) in
> >> > > do_swap_page. Within those swap entries, some have swapcount=1 and others
> >> > > have swapcount > 1. Am I missing something?
> >> >
> >> > For swap entries with swapcount=1, its sis->swap_map[] will be
> >> >
> >> > 1 | SWAP_HAS_CACHE
> >> >
> >> > so, __swap_entry_free_locked() will return SWAP_HAS_CACHE instead of 0.
> >> >
> >> > The swap entries will be free in
> >> >
> >> > folio_free_swap
> >> >   delete_from_swap_cache
> >> >     put_swap_folio
> >> >
> >>
> >> Yes. I realized this after replying to you yesterday.
> >>
> >> > >> > Chuanhua has invested significant effort in following Ryan's suggestion
> >> > >> > for the current approach, which generally handles all cases, especially
> >> > >> > partial unmapping. Additionally, the widespread use of swap_free_nr()
> >> > >> > as you suggested across various scenarios is noteworthy.
> >> > >> >
> >> > >> > Unless there's evidence indicating performance issues or bugs, I believe
> >> > >> > the current approach remains preferable.
> >> > >>
> >> > >> TBH, I don't like the large stack space usage (64 bytes).  How about use
> >> > >> a "unsigned long" as bitmap?  Then, we use much less stack space, use
> >> > >> bitmap == 0 and bitmap == (unsigned long)(-1) to check the most common
> >> > >> use cases.  And, we have enough batching.
> >> > >
> >> > > that is quite a straightforward modification like,
> >> > >
> >> > > - #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
> >> > > + #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 64 ? 64 : SWAPFILE_CLUSTER)
> >> > >
> >> > > there is no necessity to remove the bitmap API and move to raw
> >> > > unsigned long operations.
> >> > > as bitmap is exactly some unsigned long. on 64bit CPU, we are now one
> >> > > unsigned long,
> >> > > on 32bit CPU, it is now two unsigned long.
> >> >
> >> > Yes.  We can still use most bitmap APIs if we use "unsigned long" as
> >> > bitmap.  The advantage of "unsigned long" is to guarantee that
> >> > bitmap_empty() and bitmap_full() is trivial.  We can use that for
> >> > optimization.  For example, we can skip unlock/lock if bitmap_empty().
> >>
> >> anyway we have avoided lock_cluster_or_swap_info and unlock_cluster_or_swap_info
> >> for each individual swap entry.
> >>
> >> if bitma_empty(), we won't call free_swap_slot() so no chance to
> >> further take any lock,
> >> right?
> >>
> >> the optimization of bitmap_full() seems to be more useful only after we have
> >> void free_swap_slot(swp_entry_t entry, int nr)
> >>
> >> in which we can avoid many spin_lock_irq(&cache->free_lock);
> >>
> >> On the other hand, it seems we can directly call
> >> swapcache_free_entries() to skip cache if
> >> nr_pages >= SWAP_BATCH(64) this might be an optimization as we are now
> >> having a bitmap exactly equals 64.
> >
> > Hi ying,
> > considering the below code which has changed bitmap to 64 and generally support
> > different nr_pages(1 and ever cross cluster),
> >
> > #define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 64 ? 64 : SWAPFILE_CLUSTER)
> >
> > void swap_free_nr(swp_entry_t entry, int nr_pages)
> > {
> >         int i = 0, j;
> >         struct swap_cluster_info *ci;
> >         struct swap_info_struct *p;
> >         unsigned int type = swp_type(entry);
> >         unsigned long offset = swp_offset(entry);
> >         int batch_nr, remain_nr;
> >         DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
> >
> >         remain_nr = nr_pages;
> >         p = _swap_info_get(entry);
> >         if (!p)
> >                 return;
> >
> >         for ( ; ; ) {
> >                 batch_nr = min3(SWAP_BATCH_NR, remain_nr,
> >                                 (int)(SWAPFILE_CLUSTER - (offset %
> > SWAPFILE_CLUSTER)));
> >
> >                 ci = lock_cluster_or_swap_info(p, offset);
> >                 for (j = 0; j < batch_nr; j++) {
> >                         if (__swap_entry_free_locked(p, offset + i *
> > SWAP_BATCH_NR + j, 1))
> >                                 __bitmap_set(usage, j, 1);
> >                 }
> >                 unlock_cluster_or_swap_info(p, ci);
> >
> >                 for_each_clear_bit(j, usage, batch_nr)
> >                         free_swap_slot(swp_entry(type, offset + i *
> > SWAP_BATCH_NR + j));
> >
> >                 i += batch_nr;
> >                 if (i >= nr_pages)
> >                         break;
> >
> >                 bitmap_clear(usage, 0, SWAP_BATCH_NR);
> >                 remain_nr -= batch_nr;
> >         }
> > }
> >
> > I still don't see the benefits of using bitmap_full and bitmap_empty over simple
> > for_each_clear_bit() unless we begin to support free_swap_slot_nr(), which,
> > I believe, needs a separate incremental patchset.
> >
> > using bitmap_empty and full, if we want to free all slots, we need
> > if (bitmap_empty(usage))
> > {
> >     for (i=0;i<batch_nr;i++)
> >               free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
> > }
> > This seems just a game replacing for_each_clear_bit by
> > bitmap_empty()+another for loop.
> >
> > if we don't want to free any one, we need
> > if(bitmap_full(usage)
> >        do_nothing.
> >
> > in the for_each_clear_bit() case, the loop just simply ends.
> >
> > What's your proposal code to use bitmap_empty and bitmap_full here?
> > Am I missing something?
>
> My idea is something like the below.  It's only build-tested.
>
> static void cluster_swap_free_nr(struct swap_info_struct *sis,
>                                  unsigned long offset, int nr_pages)
> {
>         struct swap_cluster_info *ci;
>         DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
>         int i, nr;
>
>         ci = lock_cluster_or_swap_info(sis, offset);
>         while (nr_pages) {
>                 nr = min(BITS_PER_LONG, nr_pages);
>                 for (i = 0; i < nr; i++) {
>                         if (!__swap_entry_free_locked(sis, offset + i, 1))
>                                 bitmap_set(to_free, i, 1);
>                 }
>                 if (!bitmap_empty(to_free, BITS_PER_LONG)) {
>                         unlock_cluster_or_swap_info(sis, ci);
>                         for_each_set_bit(i, to_free, BITS_PER_LONG)
>                                 free_swap_slot(swp_entry(sis->type, offset + i));
>                         if (nr == nr_pages)
>                                 return;
>                         bitmap_clear(to_free, 0, BITS_PER_LONG);
>                         ci = lock_cluster_or_swap_info(sis, offset);
>                 }
>                 offset += nr;
>                 nr_pages -= nr;
>         }
>         unlock_cluster_or_swap_info(sis, ci);
> }
>
> void swap_free_nr(swp_entry_t entry, int nr_pages)
> {
>         int nr;
>         struct swap_info_struct *sis;
>         unsigned long offset = swp_offset(entry);
>
>         sis = _swap_info_get(entry);
>         if (!sis)
>                 return;
>
>         while (nr_pages > 0) {
>                 nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
>                 cluster_swap_free_nr(sis, offset, nr);
>                 offset += nr;
>                 nr_pages -= nr;
>         }
> }

Thanks!
It seems quite promising. I guess your intention is to take advantage of
small_const_nbits().
Also, you've created a distinct function for entries within one cluster.

I also agree with your change to:
if (!__swap_entry_free_locked(sis, offset + i, 1))

This adjustment ensures that for_each_set_bit() can also benefit from
small_const_nbits().
The original code didn't pass a compile-time constant to for_each_clear_bit().
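
For example (an illustrative sketch only, not part of the patch): with the
compile-time-constant size BITS_PER_LONG, the bitmap helpers collapse to
single-word operations,

	DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };

	bitmap_set(to_free, 3, 1);

	/*
	 * small_const_nbits(BITS_PER_LONG) is true, so bitmap_empty()
	 * reduces to a single-word test (roughly "*to_free == 0") and
	 * for_each_set_bit() takes the inline find_*_bit() fast path.
	 */
	if (!bitmap_empty(to_free, BITS_PER_LONG))
		pr_info("bit 3 is set\n");

whereas passing a non-constant nbits would fall back to the generic
out-of-line bit-search helpers.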

>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
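
For reference, a minimal caller sketch of the API being introduced (a
hypothetical call site for illustration only; the real users arrive in the
later patches of this series, and "folio" below is assumed to be the
just-swapped-in large folio):

	/*
	 * Hypothetical caller: after swapping in a large folio whose swap
	 * entries are contiguous and naturally aligned, drop all of its swap
	 * references with one batched call instead of per-page swap_free().
	 */
	swp_entry_t entry = folio->swap;	/* entry of the first subpage */
	int nr = folio_nr_pages(folio);

	swap_free_nr(entry, nr);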

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 11c53692f65f..b7a107e983b8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -483,6 +483,7 @@  extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
+extern void swap_free_nr(swp_entry_t entry, int nr_pages);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
 int swap_type_of(dev_t device, sector_t offset);
@@ -564,6 +565,10 @@  static inline void swap_free(swp_entry_t swp)
 {
 }
 
+void swap_free_nr(swp_entry_t entry, int nr_pages)
+{
+}
+
 static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
 {
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 28642c188c93..f4c65aeb088d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1356,6 +1356,57 @@  void swap_free(swp_entry_t entry)
 		__swap_entry_free(p, entry);
 }
 
+/*
+ * Free up the maximum number of swap entries at once to limit the
+ * maximum kernel stack usage.
+ */
+#define SWAP_BATCH_NR (SWAPFILE_CLUSTER > 512 ? 512 : SWAPFILE_CLUSTER)
+
+/*
+ * Called after swapping in a large folio; batch-free the swap entries
+ * for the whole folio.  The entry should be for the first subpage, and
+ * its offset should be aligned with nr_pages.
+ */
+void swap_free_nr(swp_entry_t entry, int nr_pages)
+{
+	int i, j;
+	struct swap_cluster_info *ci;
+	struct swap_info_struct *p;
+	unsigned int type = swp_type(entry);
+	unsigned long offset = swp_offset(entry);
+	int batch_nr, remain_nr;
+	DECLARE_BITMAP(usage, SWAP_BATCH_NR) = { 0 };
+
+	/* all swap entries are within a cluster for mTHP */
+	VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
+
+	if (nr_pages == 1) {
+		swap_free(entry);
+		return;
+	}
+
+	remain_nr = nr_pages;
+	p = _swap_info_get(entry);
+	if (p) {
+		for (i = 0; i < nr_pages; i += batch_nr) {
+			batch_nr = min_t(int, SWAP_BATCH_NR, remain_nr);
+
+			ci = lock_cluster_or_swap_info(p, offset);
+			for (j = 0; j < batch_nr; j++) {
+				if (__swap_entry_free_locked(p, offset + i * SWAP_BATCH_NR + j, 1))
+					__bitmap_set(usage, j, 1);
+			}
+			unlock_cluster_or_swap_info(p, ci);
+
+			for_each_clear_bit(j, usage, batch_nr)
+				free_swap_slot(swp_entry(type, offset + i * SWAP_BATCH_NR + j));
+
+			bitmap_clear(usage, 0, SWAP_BATCH_NR);
+			remain_nr -= batch_nr;
+		}
+	}
+}
+
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */