
[RFC,v3,2/5] mm: swap: introduce swap_nr_free() for batched swap_free()

Message ID 20240304081348.197341-3-21cnbao@gmail.com (mailing list archive)
State New, archived
Series mm: support large folios swap-in

Commit Message

Barry Song March 4, 2024, 8:13 a.m. UTC
From: Chuanhua Han <hanchuanhua@oppo.com>

While swapping in a large folio, we need to free swaps related to the whole
folio. To avoid frequently acquiring and releasing swap locks, it is better
to introduce an API for batched free.

Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 include/linux/swap.h |  6 ++++++
 mm/swapfile.c        | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

Comments

Ryan Roberts March 11, 2024, 6:51 p.m. UTC | #1
On 04/03/2024 08:13, Barry Song wrote:
> From: Chuanhua Han <hanchuanhua@oppo.com>
> 
> While swapping in a large folio, we need to free swaps related to the whole
> folio. To avoid frequently acquiring and releasing swap locks, it is better
> to introduce an API for batched free.
> 
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  include/linux/swap.h |  6 ++++++
>  mm/swapfile.c        | 35 +++++++++++++++++++++++++++++++++++
>  2 files changed, 41 insertions(+)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2955f7a78d8d..d6ab27929458 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -481,6 +481,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>  extern int swap_duplicate(swp_entry_t);
>  extern int swapcache_prepare(swp_entry_t);
>  extern void swap_free(swp_entry_t);
> +extern void swap_nr_free(swp_entry_t entry, int nr_pages);

nit: In my swap-out v4 series, I've created a batched version of
free_swap_and_cache() and called it free_swap_and_cache_nr(). Perhaps it is
preferable to align the naming schemes - i.e. call this swap_free_nr(). Your
naming scheme doesn't really work when applied to free_swap_and_cache().

>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>  extern int free_swap_and_cache(swp_entry_t);
>  int swap_type_of(dev_t device, sector_t offset);
> @@ -561,6 +562,11 @@ static inline void swap_free(swp_entry_t swp)
>  {
>  }
>  
> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> +{
> +
> +}
> +
>  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>  {
>  }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 3f594be83b58..244106998a69 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1341,6 +1341,41 @@ void swap_free(swp_entry_t entry)
>  		__swap_entry_free(p, entry);
>  }
>  
> +/*
> + * Called after swapping in a large folio, batched free swap entries
> + * for this large folio, entry should be for the first subpage and
> + * its offset is aligned with nr_pages
> + */
> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> +{
> +	int i;
> +	struct swap_cluster_info *ci;
> +	struct swap_info_struct *p;
> +	unsigned type = swp_type(entry);

nit: checkpatch.pl will complain about bare "unsigned", preferring "unsigned
int"; at least it did for me when I did something similar in my swap-out patch
set.

> +	unsigned long offset = swp_offset(entry);
> +	DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };

I don't love this, as it could blow the stack if SWAPFILE_CLUSTER ever
increases. But the only other way I can think of is to explicitly loop over
fixed size chunks, and that's not much better.
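
For reference, the stack cost comes straight from the bitmap declaration:
DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) expands to roughly the following
(sizes assume a 64-bit arch):

	/* one unsigned long per 64 swap entries */
	unsigned long usage[BITS_TO_LONGS(SWAPFILE_CLUSTER)];

	/*
	 * SWAPFILE_CLUSTER = 512   ->   64 bytes of stack
	 * SWAPFILE_CLUSTER = 8192  -> 1024 bytes of stack
	 */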

> +
> +	/* all swap entries are within a cluster for mTHP */
> +	VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> +
> +	if (nr_pages == 1) {
> +		swap_free(entry);
> +		return;
> +	}
> +
> +	p = _swap_info_get(entry);

You need to handle this returning NULL, like swap_free() does.

> +
> +	ci = lock_cluster(p, offset);

The existing swap_free() calls lock_cluster_or_swap_info(). So if swap is backed
by rotating media, and clusters are not in use, it will lock the whole swap
info. But your new version only calls lock_cluster() which won't lock anything
if clusters are not in use. So I think this is a locking bug.
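
Something along these lines (rough, untested sketch) would cover both points,
mirroring what swap_free() -> __swap_entry_free() already does:

	p = _swap_info_get(entry);
	if (!p)
		return;

	/*
	 * Takes the cluster lock when clusters are in use, otherwise
	 * falls back to the whole swap_info lock.
	 */
	ci = lock_cluster_or_swap_info(p, offset);
	for (i = 0; i < nr_pages; i++) {
		if (__swap_entry_free_locked(p, offset + i, 1))
			__bitmap_set(usage, i, 1);
	}
	unlock_cluster_or_swap_info(p, ci);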

> +	for (i = 0; i < nr_pages; i++) {
> +		if (__swap_entry_free_locked(p, offset + i, 1))
> +			__bitmap_set(usage, i, 1);
> +	}
> +	unlock_cluster(ci);
> +
> +	for_each_clear_bit(i, usage, nr_pages)
> +		free_swap_slot(swp_entry(type, offset + i));
> +}
> +
>  /*
>   * Called after dropping swapcache to decrease refcnt to swap entries.
>   */

Thanks,
Ryan
Chuanhua Han March 14, 2024, 1:12 p.m. UTC | #2
On Tue, Mar 12, 2024 at 02:51, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 04/03/2024 08:13, Barry Song wrote:
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > While swapping in a large folio, we need to free swaps related to the whole
> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> > to introduce an API for batched free.
> >
> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  include/linux/swap.h |  6 ++++++
> >  mm/swapfile.c        | 35 +++++++++++++++++++++++++++++++++++
> >  2 files changed, 41 insertions(+)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 2955f7a78d8d..d6ab27929458 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -481,6 +481,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >  extern int swap_duplicate(swp_entry_t);
> >  extern int swapcache_prepare(swp_entry_t);
> >  extern void swap_free(swp_entry_t);
> > +extern void swap_nr_free(swp_entry_t entry, int nr_pages);
>
> nit: In my swap-out v4 series, I've created a batched version of
> free_swap_and_cache() and called it free_swap_and_cache_nr(). Perhaps it is
> preferable to align the naming schemes - i.e. call this swap_free_nr(). Your
> scheme doesn't really work when applied to free_swap_and_cache().
Thanks for your suggestion; for the next version, we'll see which
naming scheme is more appropriate!
>
> >  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >  extern int free_swap_and_cache(swp_entry_t);
> >  int swap_type_of(dev_t device, sector_t offset);
> > @@ -561,6 +562,11 @@ static inline void swap_free(swp_entry_t swp)
> >  {
> >  }
> >
> > +void swap_nr_free(swp_entry_t entry, int nr_pages)
> > +{
> > +
> > +}
> > +
> >  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >  {
> >  }
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 3f594be83b58..244106998a69 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1341,6 +1341,41 @@ void swap_free(swp_entry_t entry)
> >               __swap_entry_free(p, entry);
> >  }
> >
> > +/*
> > + * Called after swapping in a large folio, batched free swap entries
> > + * for this large folio, entry should be for the first subpage and
> > + * its offset is aligned with nr_pages
> > + */
> > +void swap_nr_free(swp_entry_t entry, int nr_pages)
> > +{
> > +     int i;
> > +     struct swap_cluster_info *ci;
> > +     struct swap_info_struct *p;
> > +     unsigned type = swp_type(entry);
>
> nit: checkpatch.py will complain about bare "unsigned", preferring "unsigned
> int" or at least it did for me when I did something similar in my swap-out patch
> set.
Gee, thanks for pointing that out!
>
> > +     unsigned long offset = swp_offset(entry);
> > +     DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
>
> I don't love this, as it could blow the stack if SWAPFILE_CLUSTER ever
> increases. But the only other way I can think of is to explicitly loop over
> fixed size chunks, and that's not much better.
Is it possible to keep kernel stack usage low by using a bitmap here? If
SWAPFILE_CLUSTER=512, we consume only (512/64)*8 = 64 bytes.
>
> > +
> > +     /* all swap entries are within a cluster for mTHP */
> > +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> > +
> > +     if (nr_pages == 1) {
> > +             swap_free(entry);
> > +             return;
> > +     }
> > +
> > +     p = _swap_info_get(entry);
>
> You need to handle this returning NULL, like swap_free() does.
Yes, you're right! We did forget to check for NULL here.
>
> > +
> > +     ci = lock_cluster(p, offset);
>
> The existing swap_free() calls lock_cluster_or_swap_info(). So if swap is backed
> by rotating media, and clusters are not in use, it will lock the whole swap
> info. But your new version only calls lock_cluster() which won't lock anything
> if clusters are not in use. So I think this is a locking bug.
Again, you're right, it's a bug!
>
> > +     for (i = 0; i < nr_pages; i++) {
> > +             if (__swap_entry_free_locked(p, offset + i, 1))
> > +                     __bitmap_set(usage, i, 1);
> > +     }
> > +     unlock_cluster(ci);
> > +
> > +     for_each_clear_bit(i, usage, nr_pages)
> > +             free_swap_slot(swp_entry(type, offset + i));
> > +}
> > +
> >  /*
> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >   */
>
> Thanks,
> Ryan
>
>
Ryan Roberts March 14, 2024, 1:43 p.m. UTC | #3
On 14/03/2024 13:12, Chuanhua Han wrote:
> On Tue, Mar 12, 2024 at 02:51, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 04/03/2024 08:13, Barry Song wrote:
>>> From: Chuanhua Han <hanchuanhua@oppo.com>
>>>
>>> While swapping in a large folio, we need to free swaps related to the whole
>>> folio. To avoid frequently acquiring and releasing swap locks, it is better
>>> to introduce an API for batched free.
>>>
>>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>> ---
>>>  include/linux/swap.h |  6 ++++++
>>>  mm/swapfile.c        | 35 +++++++++++++++++++++++++++++++++++
>>>  2 files changed, 41 insertions(+)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 2955f7a78d8d..d6ab27929458 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -481,6 +481,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>>>  extern int swap_duplicate(swp_entry_t);
>>>  extern int swapcache_prepare(swp_entry_t);
>>>  extern void swap_free(swp_entry_t);
>>> +extern void swap_nr_free(swp_entry_t entry, int nr_pages);
>>
>> nit: In my swap-out v4 series, I've created a batched version of
>> free_swap_and_cache() and called it free_swap_and_cache_nr(). Perhaps it is
>> preferable to align the naming schemes - i.e. call this swap_free_nr(). Your
>> scheme doesn't really work when applied to free_swap_and_cache().
> Thanks for your suggestions, and for the next version, we'll see which
> package is more appropriate!
>>
>>>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>>>  extern int free_swap_and_cache(swp_entry_t);
>>>  int swap_type_of(dev_t device, sector_t offset);
>>> @@ -561,6 +562,11 @@ static inline void swap_free(swp_entry_t swp)
>>>  {
>>>  }
>>>
>>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
>>> +{
>>> +
>>> +}
>>> +
>>>  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>>>  {
>>>  }
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index 3f594be83b58..244106998a69 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -1341,6 +1341,41 @@ void swap_free(swp_entry_t entry)
>>>               __swap_entry_free(p, entry);
>>>  }
>>>
>>> +/*
>>> + * Called after swapping in a large folio, batched free swap entries
>>> + * for this large folio, entry should be for the first subpage and
>>> + * its offset is aligned with nr_pages
>>> + */
>>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
>>> +{
>>> +     int i;
>>> +     struct swap_cluster_info *ci;
>>> +     struct swap_info_struct *p;
>>> +     unsigned type = swp_type(entry);
>>
>> nit: checkpatch.py will complain about bare "unsigned", preferring "unsigned
>> int" or at least it did for me when I did something similar in my swap-out patch
>> set.
> Gee, thanks for pointing that out!
>>
>>> +     unsigned long offset = swp_offset(entry);
>>> +     DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
>>
>> I don't love this, as it could blow the stack if SWAPFILE_CLUSTER ever
>> increases. But the only other way I can think of is to explicitly loop over
>> fixed size chunks, and that's not much better.
> Is it possible to save kernel stack better by using bit_map here?  If
> SWAPFILE_CLUSTER=512, we consume only (512/64)*8= 64 bytes.

I'm not sure I've understood what you are saying. You're already using
DECLARE_BITMAP(), so it's already consuming 64 bytes if SWAPFILE_CLUSTER=512, no?

I actually did a bad job of trying to express a couple of different points:

- Are there any configurations today where SWAPFILE_CLUSTER > 512? I'm not sure.
Certainly not for arm64, but not sure about other architectures. For example, if
an arch had 64K pages with 8192 entries per THP and supported SWAP_THP, that's 1K
for the bitmap, which is now looking pretty big for the stack.

- Would it be better to decouple stack usage from SWAPFILE_CLUSTER and instead
define a fixed stack size (e.g. 64 bytes -> 512 entries)? Then free the range of
entries in batches no bigger than this size. This approach could also allow
removing the constraint that the range has to be aligned and fit in a single
cluster. Personally I think an approach like this would be much more robust, in
return for a tiny bit more complexity.

>>
>>> +
>>> +     /* all swap entries are within a cluster for mTHP */
>>> +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>>> +
>>> +     if (nr_pages == 1) {
>>> +             swap_free(entry);
>>> +             return;
>>> +     }
>>> +
>>> +     p = _swap_info_get(entry);
>>
>> You need to handle this returning NULL, like swap_free() does.
> Yes, you're right! We did forget to judge NULL here.
>>
>>> +
>>> +     ci = lock_cluster(p, offset);
>>
>> The existing swap_free() calls lock_cluster_or_swap_info(). So if swap is backed
>> by rotating media, and clusters are not in use, it will lock the whole swap
>> info. But your new version only calls lock_cluster() which won't lock anything
>> if clusters are not in use. So I think this is a locking bug.
> Again, you're right, it's bug!
>>
>>> +     for (i = 0; i < nr_pages; i++) {
>>> +             if (__swap_entry_free_locked(p, offset + i, 1))
>>> +                     __bitmap_set(usage, i, 1);
>>> +     }
>>> +     unlock_cluster(ci);
>>> +
>>> +     for_each_clear_bit(i, usage, nr_pages)
>>> +             free_swap_slot(swp_entry(type, offset + i));
>>> +}
>>> +
>>>  /*
>>>   * Called after dropping swapcache to decrease refcnt to swap entries.
>>>   */
>>
>> Thanks,
>> Ryan
>>
>>
> 
>
Chuanhua Han March 15, 2024, 8:34 a.m. UTC | #4
On Thu, Mar 14, 2024 at 21:43, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 14/03/2024 13:12, Chuanhua Han wrote:
> > On Tue, Mar 12, 2024 at 02:51, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 04/03/2024 08:13, Barry Song wrote:
> >>> From: Chuanhua Han <hanchuanhua@oppo.com>
> >>>
> >>> While swapping in a large folio, we need to free swaps related to the whole
> >>> folio. To avoid frequently acquiring and releasing swap locks, it is better
> >>> to introduce an API for batched free.
> >>>
> >>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>> ---
> >>>  include/linux/swap.h |  6 ++++++
> >>>  mm/swapfile.c        | 35 +++++++++++++++++++++++++++++++++++
> >>>  2 files changed, 41 insertions(+)
> >>>
> >>> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >>> index 2955f7a78d8d..d6ab27929458 100644
> >>> --- a/include/linux/swap.h
> >>> +++ b/include/linux/swap.h
> >>> @@ -481,6 +481,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >>>  extern int swap_duplicate(swp_entry_t);
> >>>  extern int swapcache_prepare(swp_entry_t);
> >>>  extern void swap_free(swp_entry_t);
> >>> +extern void swap_nr_free(swp_entry_t entry, int nr_pages);
> >>
> >> nit: In my swap-out v4 series, I've created a batched version of
> >> free_swap_and_cache() and called it free_swap_and_cache_nr(). Perhaps it is
> >> preferable to align the naming schemes - i.e. call this swap_free_nr(). Your
> >> scheme doesn't really work when applied to free_swap_and_cache().
> > Thanks for your suggestions, and for the next version, we'll see which
> > package is more appropriate!
> >>
> >>>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >>>  extern int free_swap_and_cache(swp_entry_t);
> >>>  int swap_type_of(dev_t device, sector_t offset);
> >>> @@ -561,6 +562,11 @@ static inline void swap_free(swp_entry_t swp)
> >>>  {
> >>>  }
> >>>
> >>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> >>> +{
> >>> +
> >>> +}
> >>> +
> >>>  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >>>  {
> >>>  }
> >>> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >>> index 3f594be83b58..244106998a69 100644
> >>> --- a/mm/swapfile.c
> >>> +++ b/mm/swapfile.c
> >>> @@ -1341,6 +1341,41 @@ void swap_free(swp_entry_t entry)
> >>>               __swap_entry_free(p, entry);
> >>>  }
> >>>
> >>> +/*
> >>> + * Called after swapping in a large folio, batched free swap entries
> >>> + * for this large folio, entry should be for the first subpage and
> >>> + * its offset is aligned with nr_pages
> >>> + */
> >>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> >>> +{
> >>> +     int i;
> >>> +     struct swap_cluster_info *ci;
> >>> +     struct swap_info_struct *p;
> >>> +     unsigned type = swp_type(entry);
> >>
> >> nit: checkpatch.py will complain about bare "unsigned", preferring "unsigned
> >> int" or at least it did for me when I did something similar in my swap-out patch
> >> set.
> > Gee, thanks for pointing that out!
> >>
> >>> +     unsigned long offset = swp_offset(entry);
> >>> +     DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
> >>
> >> I don't love this, as it could blow the stack if SWAPFILE_CLUSTER ever
> >> increases. But the only other way I can think of is to explicitly loop over
> >> fixed size chunks, and that's not much better.
> > Is it possible to save kernel stack better by using bit_map here?  If
> > SWAPFILE_CLUSTER=512, we consume only (512/64)*8= 64 bytes.
>
> I'm not sure I've understood what you are saying? You're already using
> DECLARE_BITMAP(), so its already consuming 64 bytes if SWAPFILE_CLUSTER=512, no?
>
> I actually did a bad job of trying to express a couple of different points:
>
> - Are there any configurations today where SWAPFILE_CLUSTER > 512? I'm not sure.
> Certainly not for arm64, but not sure about other architectures. For example if
> an arch had 64K pages with 8192 entries per THP and supports SWAP_THP, that's 1K
> for the bitmap, which is now looking pretty big for the stack.
I agree with you. The current bitmap grows linearly with
SWAPFILE_CLUSTER, which may cause the kernel stack to swell.
I need to think of a way to save more memory.
>
> - Would it be better to decouple stack usage from SWAPFILE_CLUSTER and instead
> define a fixed stack size (e.g. 64 bytes -> 512 entries). Then free the range of
> entries in batches no bigger than this size. This approach could also allow
> removing the constraint that the range has to be aligned and fit in a single
> cluster. Personally I think an approach like this would be much more robust, in
> return for a tiny bit more complexity.
Because we cannot determine how many swap entries a cluster has for a given
architecture or configuration, how do we know how large the variable
needs to be?
>
> >>
> >>> +
> >>> +     /* all swap entries are within a cluster for mTHP */
> >>> +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >>> +
> >>> +     if (nr_pages == 1) {
> >>> +             swap_free(entry);
> >>> +             return;
> >>> +     }
> >>> +
> >>> +     p = _swap_info_get(entry);
> >>
> >> You need to handle this returning NULL, like swap_free() does.
> > Yes, you're right! We did forget to judge NULL here.
> >>
> >>> +
> >>> +     ci = lock_cluster(p, offset);
> >>
> >> The existing swap_free() calls lock_cluster_or_swap_info(). So if swap is backed
> >> by rotating media, and clusters are not in use, it will lock the whole swap
> >> info. But your new version only calls lock_cluster() which won't lock anything
> >> if clusters are not in use. So I think this is a locking bug.
> > Again, you're right, it's bug!
> >>
> >>> +     for (i = 0; i < nr_pages; i++) {
> >>> +             if (__swap_entry_free_locked(p, offset + i, 1))
> >>> +                     __bitmap_set(usage, i, 1);
> >>> +     }
> >>> +     unlock_cluster(ci);
> >>> +
> >>> +     for_each_clear_bit(i, usage, nr_pages)
> >>> +             free_swap_slot(swp_entry(type, offset + i));
> >>> +}
> >>> +
> >>>  /*
> >>>   * Called after dropping swapcache to decrease refcnt to swap entries.
> >>>   */
> >>
> >> Thanks,
> >> Ryan
> >>
> >>
> >
> >
>
Ryan Roberts March 15, 2024, 10:57 a.m. UTC | #5
On 15/03/2024 08:34, Chuanhua Han wrote:
> On Thu, Mar 14, 2024 at 21:43, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 14/03/2024 13:12, Chuanhua Han wrote:
>>> On Tue, Mar 12, 2024 at 02:51, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 04/03/2024 08:13, Barry Song wrote:
>>>>> From: Chuanhua Han <hanchuanhua@oppo.com>
>>>>>
>>>>> While swapping in a large folio, we need to free swaps related to the whole
>>>>> folio. To avoid frequently acquiring and releasing swap locks, it is better
>>>>> to introduce an API for batched free.
>>>>>
>>>>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>>>>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>> ---
>>>>>  include/linux/swap.h |  6 ++++++
>>>>>  mm/swapfile.c        | 35 +++++++++++++++++++++++++++++++++++
>>>>>  2 files changed, 41 insertions(+)
>>>>>
>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>> index 2955f7a78d8d..d6ab27929458 100644
>>>>> --- a/include/linux/swap.h
>>>>> +++ b/include/linux/swap.h
>>>>> @@ -481,6 +481,7 @@ extern void swap_shmem_alloc(swp_entry_t);
>>>>>  extern int swap_duplicate(swp_entry_t);
>>>>>  extern int swapcache_prepare(swp_entry_t);
>>>>>  extern void swap_free(swp_entry_t);
>>>>> +extern void swap_nr_free(swp_entry_t entry, int nr_pages);
>>>>
>>>> nit: In my swap-out v4 series, I've created a batched version of
>>>> free_swap_and_cache() and called it free_swap_and_cache_nr(). Perhaps it is
>>>> preferable to align the naming schemes - i.e. call this swap_free_nr(). Your
>>>> scheme doesn't really work when applied to free_swap_and_cache().
>>> Thanks for your suggestions, and for the next version, we'll see which
>>> package is more appropriate!
>>>>
>>>>>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
>>>>>  extern int free_swap_and_cache(swp_entry_t);
>>>>>  int swap_type_of(dev_t device, sector_t offset);
>>>>> @@ -561,6 +562,11 @@ static inline void swap_free(swp_entry_t swp)
>>>>>  {
>>>>>  }
>>>>>
>>>>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
>>>>> +{
>>>>> +
>>>>> +}
>>>>> +
>>>>>  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
>>>>>  {
>>>>>  }
>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>>> index 3f594be83b58..244106998a69 100644
>>>>> --- a/mm/swapfile.c
>>>>> +++ b/mm/swapfile.c
>>>>> @@ -1341,6 +1341,41 @@ void swap_free(swp_entry_t entry)
>>>>>               __swap_entry_free(p, entry);
>>>>>  }
>>>>>
>>>>> +/*
>>>>> + * Called after swapping in a large folio, batched free swap entries
>>>>> + * for this large folio, entry should be for the first subpage and
>>>>> + * its offset is aligned with nr_pages
>>>>> + */
>>>>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
>>>>> +{
>>>>> +     int i;
>>>>> +     struct swap_cluster_info *ci;
>>>>> +     struct swap_info_struct *p;
>>>>> +     unsigned type = swp_type(entry);
>>>>
>>>> nit: checkpatch.py will complain about bare "unsigned", preferring "unsigned
>>>> int" or at least it did for me when I did something similar in my swap-out patch
>>>> set.
>>> Gee, thanks for pointing that out!
>>>>
>>>>> +     unsigned long offset = swp_offset(entry);
>>>>> +     DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
>>>>
>>>> I don't love this, as it could blow the stack if SWAPFILE_CLUSTER ever
>>>> increases. But the only other way I can think of is to explicitly loop over
>>>> fixed size chunks, and that's not much better.
>>> Is it possible to save kernel stack better by using bit_map here?  If
>>> SWAPFILE_CLUSTER=512, we consume only (512/64)*8= 64 bytes.
>>
>> I'm not sure I've understood what you are saying? You're already using
>> DECLARE_BITMAP(), so its already consuming 64 bytes if SWAPFILE_CLUSTER=512, no?
>>
>> I actually did a bad job of trying to express a couple of different points:
>>
>> - Are there any configurations today where SWAPFILE_CLUSTER > 512? I'm not sure.
>> Certainly not for arm64, but not sure about other architectures. For example if
>> an arch had 64K pages with 8192 entries per THP and supports SWAP_THP, that's 1K
>> for the bitmap, which is now looking pretty big for the stack.
> I agree with you.The current bit_map grows linearly with the
> SWAPFILE_CLUSTER, which may cause the kernel stack to swell.
> I need to think of a way to save more memory .
>>
>> - Would it be better to decouple stack usage from SWAPFILE_CLUSTER and instead
>> define a fixed stack size (e.g. 64 bytes -> 512 entries). Then free the range of
>> entries in batches no bigger than this size. This approach could also allow
>> removing the constraint that the range has to be aligned and fit in a single
>> cluster. Personally I think an approach like this would be much more robust, in
>> return for a tiny bit more complexity.
> Because we cannot determine how many swap entries a cluster has in an
> architecture or a configuration, we do not know how large the variable
> needs to be defined?

My point is that we could define a fixed size, then loop through the passed in
range, operating on batches of that fixed size. You could even take into
consideration the cluster boundaries so that you take the correct lock for every
batch and can drop the "must be naturally aligned, must be no bigger than
cluster size" constraint.
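
To make that concrete, here is a rough and untested sketch of the shape I have
in mind (SWAP_BATCH_NR is a made-up name): stack usage stays at 64 bytes
regardless of SWAPFILE_CLUSTER, each batch is clamped to its cluster so the
right lock is taken, and the "aligned, single cluster" restriction goes away:

#define SWAP_BATCH_NR	512	/* 512 bits -> 64 bytes of bitmap on the stack */

void swap_free_nr(swp_entry_t entry, int nr_pages)
{
	struct swap_info_struct *p = _swap_info_get(entry);
	unsigned long offset = swp_offset(entry);
	unsigned int type = swp_type(entry);
	struct swap_cluster_info *ci;
	DECLARE_BITMAP(usage, SWAP_BATCH_NR);
	int i, nr;

	if (!p)
		return;

	while (nr_pages) {
		/* never cross a cluster boundary or exceed the bitmap */
		nr = min_t(int, nr_pages, SWAP_BATCH_NR);
		nr = min_t(int, nr, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);

		bitmap_zero(usage, SWAP_BATCH_NR);
		ci = lock_cluster_or_swap_info(p, offset);
		for (i = 0; i < nr; i++) {
			if (__swap_entry_free_locked(p, offset + i, 1))
				__bitmap_set(usage, i, 1);
		}
		unlock_cluster_or_swap_info(p, ci);

		for_each_clear_bit(i, usage, nr)
			free_swap_slot(swp_entry(type, offset + i));

		offset += nr;
		nr_pages -= nr;
	}
}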


>>
>>>>
>>>>> +
>>>>> +     /* all swap entries are within a cluster for mTHP */
>>>>> +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>>>>> +
>>>>> +     if (nr_pages == 1) {
>>>>> +             swap_free(entry);
>>>>> +             return;
>>>>> +     }
>>>>> +
>>>>> +     p = _swap_info_get(entry);
>>>>
>>>> You need to handle this returning NULL, like swap_free() does.
>>> Yes, you're right! We did forget to judge NULL here.
>>>>
>>>>> +
>>>>> +     ci = lock_cluster(p, offset);
>>>>
>>>> The existing swap_free() calls lock_cluster_or_swap_info(). So if swap is backed
>>>> by rotating media, and clusters are not in use, it will lock the whole swap
>>>> info. But your new version only calls lock_cluster() which won't lock anything
>>>> if clusters are not in use. So I think this is a locking bug.
>>> Again, you're right, it's bug!
>>>>
>>>>> +     for (i = 0; i < nr_pages; i++) {
>>>>> +             if (__swap_entry_free_locked(p, offset + i, 1))
>>>>> +                     __bitmap_set(usage, i, 1);
>>>>> +     }
>>>>> +     unlock_cluster(ci);
>>>>> +
>>>>> +     for_each_clear_bit(i, usage, nr_pages)
>>>>> +             free_swap_slot(swp_entry(type, offset + i));
>>>>> +}
>>>>> +
>>>>>  /*
>>>>>   * Called after dropping swapcache to decrease refcnt to swap entries.
>>>>>   */
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>
>>>
>>>
>>
> 
>
Chuanhua Han March 18, 2024, 1:28 a.m. UTC | #6
On Fri, Mar 15, 2024 at 18:57, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 15/03/2024 08:34, Chuanhua Han wrote:
> > On Thu, Mar 14, 2024 at 21:43, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 14/03/2024 13:12, Chuanhua Han wrote:
> >>> On Tue, Mar 12, 2024 at 02:51, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> On 04/03/2024 08:13, Barry Song wrote:
> >>>>> From: Chuanhua Han <hanchuanhua@oppo.com>
> >>>>>
> >>>>> While swapping in a large folio, we need to free swaps related to the whole
> >>>>> folio. To avoid frequently acquiring and releasing swap locks, it is better
> >>>>> to introduce an API for batched free.
> >>>>>
> >>>>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >>>>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>>>> ---
> >>>>>  include/linux/swap.h |  6 ++++++
> >>>>>  mm/swapfile.c        | 35 +++++++++++++++++++++++++++++++++++
> >>>>>  2 files changed, 41 insertions(+)
> >>>>>
> >>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >>>>> index 2955f7a78d8d..d6ab27929458 100644
> >>>>> --- a/include/linux/swap.h
> >>>>> +++ b/include/linux/swap.h
> >>>>> @@ -481,6 +481,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >>>>>  extern int swap_duplicate(swp_entry_t);
> >>>>>  extern int swapcache_prepare(swp_entry_t);
> >>>>>  extern void swap_free(swp_entry_t);
> >>>>> +extern void swap_nr_free(swp_entry_t entry, int nr_pages);
> >>>>
> >>>> nit: In my swap-out v4 series, I've created a batched version of
> >>>> free_swap_and_cache() and called it free_swap_and_cache_nr(). Perhaps it is
> >>>> preferable to align the naming schemes - i.e. call this swap_free_nr(). Your
> >>>> scheme doesn't really work when applied to free_swap_and_cache().
> >>> Thanks for your suggestions, and for the next version, we'll see which
> >>> package is more appropriate!
> >>>>
> >>>>>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >>>>>  extern int free_swap_and_cache(swp_entry_t);
> >>>>>  int swap_type_of(dev_t device, sector_t offset);
> >>>>> @@ -561,6 +562,11 @@ static inline void swap_free(swp_entry_t swp)
> >>>>>  {
> >>>>>  }
> >>>>>
> >>>>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> >>>>> +{
> >>>>> +
> >>>>> +}
> >>>>> +
> >>>>>  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >>>>>  {
> >>>>>  }
> >>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >>>>> index 3f594be83b58..244106998a69 100644
> >>>>> --- a/mm/swapfile.c
> >>>>> +++ b/mm/swapfile.c
> >>>>> @@ -1341,6 +1341,41 @@ void swap_free(swp_entry_t entry)
> >>>>>               __swap_entry_free(p, entry);
> >>>>>  }
> >>>>>
> >>>>> +/*
> >>>>> + * Called after swapping in a large folio, batched free swap entries
> >>>>> + * for this large folio, entry should be for the first subpage and
> >>>>> + * its offset is aligned with nr_pages
> >>>>> + */
> >>>>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> >>>>> +{
> >>>>> +     int i;
> >>>>> +     struct swap_cluster_info *ci;
> >>>>> +     struct swap_info_struct *p;
> >>>>> +     unsigned type = swp_type(entry);
> >>>>
> >>>> nit: checkpatch.py will complain about bare "unsigned", preferring "unsigned
> >>>> int" or at least it did for me when I did something similar in my swap-out patch
> >>>> set.
> >>> Gee, thanks for pointing that out!
> >>>>
> >>>>> +     unsigned long offset = swp_offset(entry);
> >>>>> +     DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
> >>>>
> >>>> I don't love this, as it could blow the stack if SWAPFILE_CLUSTER ever
> >>>> increases. But the only other way I can think of is to explicitly loop over
> >>>> fixed size chunks, and that's not much better.
> >>> Is it possible to save kernel stack better by using bit_map here?  If
> >>> SWAPFILE_CLUSTER=512, we consume only (512/64)*8= 64 bytes.
> >>
> >> I'm not sure I've understood what you are saying? You're already using
> >> DECLARE_BITMAP(), so its already consuming 64 bytes if SWAPFILE_CLUSTER=512, no?
> >>
> >> I actually did a bad job of trying to express a couple of different points:
> >>
> >> - Are there any configurations today where SWAPFILE_CLUSTER > 512? I'm not sure.
> >> Certainly not for arm64, but not sure about other architectures. For example if
> >> an arch had 64K pages with 8192 entries per THP and supports SWAP_THP, that's 1K
> >> for the bitmap, which is now looking pretty big for the stack.
> > I agree with you.The current bit_map grows linearly with the
> > SWAPFILE_CLUSTER, which may cause the kernel stack to swell.
> > I need to think of a way to save more memory .
> >>
> >> - Would it be better to decouple stack usage from SWAPFILE_CLUSTER and instead
> >> define a fixed stack size (e.g. 64 bytes -> 512 entries). Then free the range of
> >> entries in batches no bigger than this size. This approach could also allow
> >> removing the constraint that the range has to be aligned and fit in a single
> >> cluster. Personally I think an approach like this would be much more robust, in
> >> return for a tiny bit more complexity.
> > Because we cannot determine how many swap entries a cluster has in an
> > architecture or a configuration, we do not know how large the variable
> > needs to be defined?
>
> My point is that we could define a fixed size, then loop through the passed in
> range, operating on batches of that fixed size. You could even take into
> consideration the cluster boundaries so that you take the correct lock for every
> batch and can drop the "must be naturally aligned, must be no bigger than
> cluster size" constraint.

Thank you. I understand it!

>
>
> >>
> >>>>
> >>>>> +
> >>>>> +     /* all swap entries are within a cluster for mTHP */
> >>>>> +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >>>>> +
> >>>>> +     if (nr_pages == 1) {
> >>>>> +             swap_free(entry);
> >>>>> +             return;
> >>>>> +     }
> >>>>> +
> >>>>> +     p = _swap_info_get(entry);
> >>>>
> >>>> You need to handle this returning NULL, like swap_free() does.
> >>> Yes, you're right! We did forget to judge NULL here.
> >>>>
> >>>>> +
> >>>>> +     ci = lock_cluster(p, offset);
> >>>>
> >>>> The existing swap_free() calls lock_cluster_or_swap_info(). So if swap is backed
> >>>> by rotating media, and clusters are not in use, it will lock the whole swap
> >>>> info. But your new version only calls lock_cluster() which won't lock anything
> >>>> if clusters are not in use. So I think this is a locking bug.
> >>> Again, you're right, it's bug!
> >>>>
> >>>>> +     for (i = 0; i < nr_pages; i++) {
> >>>>> +             if (__swap_entry_free_locked(p, offset + i, 1))
> >>>>> +                     __bitmap_set(usage, i, 1);
> >>>>> +     }
> >>>>> +     unlock_cluster(ci);
> >>>>> +
> >>>>> +     for_each_clear_bit(i, usage, nr_pages)
> >>>>> +             free_swap_slot(swp_entry(type, offset + i));
> >>>>> +}
> >>>>> +
> >>>>>  /*
> >>>>>   * Called after dropping swapcache to decrease refcnt to swap entries.
> >>>>>   */
> >>>>
> >>>> Thanks,
> >>>> Ryan
> >>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
>

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2955f7a78d8d..d6ab27929458 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -481,6 +481,7 @@  extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
+extern void swap_nr_free(swp_entry_t entry, int nr_pages);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern int free_swap_and_cache(swp_entry_t);
 int swap_type_of(dev_t device, sector_t offset);
@@ -561,6 +562,11 @@  static inline void swap_free(swp_entry_t swp)
 {
 }
 
+void swap_nr_free(swp_entry_t entry, int nr_pages)
+{
+
+}
+
 static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
 {
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3f594be83b58..244106998a69 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1341,6 +1341,41 @@  void swap_free(swp_entry_t entry)
 		__swap_entry_free(p, entry);
 }
 
+/*
+ * Called after swapping in a large folio, batched free swap entries
+ * for this large folio, entry should be for the first subpage and
+ * its offset is aligned with nr_pages
+ */
+void swap_nr_free(swp_entry_t entry, int nr_pages)
+{
+	int i;
+	struct swap_cluster_info *ci;
+	struct swap_info_struct *p;
+	unsigned type = swp_type(entry);
+	unsigned long offset = swp_offset(entry);
+	DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
+
+	/* all swap entries are within a cluster for mTHP */
+	VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
+
+	if (nr_pages == 1) {
+		swap_free(entry);
+		return;
+	}
+
+	p = _swap_info_get(entry);
+
+	ci = lock_cluster(p, offset);
+	for (i = 0; i < nr_pages; i++) {
+		if (__swap_entry_free_locked(p, offset + i, 1))
+			__bitmap_set(usage, i, 1);
+	}
+	unlock_cluster(ci);
+
+	for_each_clear_bit(i, usage, nr_pages)
+		free_swap_slot(swp_entry(type, offset + i));
+}
+
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */