diff mbox series

swap: cleanup get/put_swap_device usage

Message ID 20230516052957.175432-1-ying.huang@intel.com (mailing list archive)
State New
Headers show
Series swap: cleanup get/put_swap_device usage | expand

Commit Message

Huang, Ying May 16, 2023, 5:29 a.m. UTC
The general rule to use a swap entry is as follows.

When we get a swap entry, if there isn't some other way to prevent
swapoff, such as page lock for swap cache, page table lock, etc., the
swap entry may become invalid because of swapoff.  Then, we need to
enclose all swap related functions with get_swap_device() and
put_swap_device(), unless the swap functions call
get/put_swap_device() by themselves.

Add the rule as comments of get_swap_device(), and cleanup some
functions which call get/put_swap_device().

1. Enlarge the get/put_swap_device() protection range in
__read_swap_cache_async().  This makes the function a little easier to
be understood because we don't need to consider swapoff.  And this
makes it possible to remove get/put_swap_device() calling in some
function called by __read_swap_cache_async().

2. Remove get/put_swap_device() in __swap_count().  Which is call in
do_swap_page() only, which encloses the call with get/put_swap_device()
already.

3. Remove get/put_swap_device() in __swp_swapcount().  Which is call
in __read_swap_cache_async() only, which encloses the call with
get/put_swap_device() already.

4. Remove get/put_swap_device() in __swap_duplicate(). Which is called
by

- swap_shmem_alloc(): the swap cache is locked.

- copy_nonpresent_pte() -> swap_duplicate() and try_to_unmap_one() ->
swap_duplicate(): the page table lock is held.

- __read_swap_cache_async() -> swapcache_prepare(): enclosed with
get/put_swap_device() already.

Other get/put_swap_device() usages are checked too.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
---
 include/linux/swap.h |  4 ++--
 mm/swap_state.c      | 33 ++++++++++++++++++++-----------
 mm/swapfile.c        | 47 ++++++++++++--------------------------------
 3 files changed, 37 insertions(+), 47 deletions(-)

Comments

David Hildenbrand May 16, 2023, 11:06 a.m. UTC | #1
On 16.05.23 07:29, Huang Ying wrote:
> The general rule to use a swap entry is as follows.
> 
> When we get a swap entry, if there isn't some other way to prevent
> swapoff, such as page lock for swap cache, page table lock, etc., the
> swap entry may become invalid because of swapoff.  Then, we need to
> enclose all swap related functions with get_swap_device() and
> put_swap_device(), unless the swap functions call
> get/put_swap_device() by themselves.
> 
> Add the rule as comments of get_swap_device(), and cleanup some
> functions which call get/put_swap_device().
> 
> 1. Enlarge the get/put_swap_device() protection range in
> __read_swap_cache_async().  This makes the function a little easier to
> be understood because we don't need to consider swapoff.  And this
> makes it possible to remove get/put_swap_device() calling in some
> function called by __read_swap_cache_async().
> 
> 2. Remove get/put_swap_device() in __swap_count().  Which is call in
> do_swap_page() only, which encloses the call with get/put_swap_device()
> already.
> 
> 3. Remove get/put_swap_device() in __swp_swapcount().  Which is call
> in __read_swap_cache_async() only, which encloses the call with
> get/put_swap_device() already.
> 
> 4. Remove get/put_swap_device() in __swap_duplicate(). Which is called
> by
> 
> - swap_shmem_alloc(): the swap cache is locked.
> 
> - copy_nonpresent_pte() -> swap_duplicate() and try_to_unmap_one() ->
> swap_duplicate(): the page table lock is held.
> 
> - __read_swap_cache_async() -> swapcache_prepare(): enclosed with
> get/put_swap_device() already.
> 
> Other get/put_swap_device() usages are checked too.

I suggest splitting this patch up into logical pieces as outlined here 
by you already.
Huang, Ying May 17, 2023, 12:23 a.m. UTC | #2
David Hildenbrand <david@redhat.com> writes:

> On 16.05.23 07:29, Huang Ying wrote:
>> The general rule to use a swap entry is as follows.
>> When we get a swap entry, if there isn't some other way to prevent
>> swapoff, such as page lock for swap cache, page table lock, etc., the
>> swap entry may become invalid because of swapoff.  Then, we need to
>> enclose all swap related functions with get_swap_device() and
>> put_swap_device(), unless the swap functions call
>> get/put_swap_device() by themselves.
>> Add the rule as comments of get_swap_device(), and cleanup some
>> functions which call get/put_swap_device().
>> 1. Enlarge the get/put_swap_device() protection range in
>> __read_swap_cache_async().  This makes the function a little easier to
>> be understood because we don't need to consider swapoff.  And this
>> makes it possible to remove get/put_swap_device() calling in some
>> function called by __read_swap_cache_async().
>> 2. Remove get/put_swap_device() in __swap_count().  Which is call in
>> do_swap_page() only, which encloses the call with get/put_swap_device()
>> already.
>> 3. Remove get/put_swap_device() in __swp_swapcount().  Which is call
>> in __read_swap_cache_async() only, which encloses the call with
>> get/put_swap_device() already.
>> 4. Remove get/put_swap_device() in __swap_duplicate(). Which is
>> called
>> by
>> - swap_shmem_alloc(): the swap cache is locked.
>> - copy_nonpresent_pte() -> swap_duplicate() and try_to_unmap_one()
>> ->
>> swap_duplicate(): the page table lock is held.
>> - __read_swap_cache_async() -> swapcache_prepare(): enclosed with
>> get/put_swap_device() already.
>> Other get/put_swap_device() usages are checked too.
>
> I suggest splitting this patch up into logical pieces as outlined here
> by you already.

OK.  Will do that in the next version.

Best Regards,
Huang, Ying
Chris Li May 20, 2023, 4:40 p.m. UTC | #3
On Wed, May 17, 2023 at 08:23:18AM +0800, Huang, Ying wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
> > On 16.05.23 07:29, Huang Ying wrote:
> >> The general rule to use a swap entry is as follows.
> >> When we get a swap entry, if there isn't some other way to prevent
> >> swapoff, such as page lock for swap cache, page table lock, etc., the
> >> swap entry may become invalid because of swapoff.  Then, we need to
> >> enclose all swap related functions with get_swap_device() and
> >> put_swap_device(), unless the swap functions call
> >> get/put_swap_device() by themselves.
> >> Add the rule as comments of get_swap_device(), and cleanup some
> >> functions which call get/put_swap_device().
> >> 1. Enlarge the get/put_swap_device() protection range in
> >> __read_swap_cache_async().  This makes the function a little easier to
> >> be understood because we don't need to consider swapoff.  And this
> >> makes it possible to remove get/put_swap_device() calling in some
> >> function called by __read_swap_cache_async().
> >> 2. Remove get/put_swap_device() in __swap_count().  Which is call in
> >> do_swap_page() only, which encloses the call with get/put_swap_device()
> >> already.
> >> 3. Remove get/put_swap_device() in __swp_swapcount().  Which is call
> >> in __read_swap_cache_async() only, which encloses the call with
> >> get/put_swap_device() already.
> >> 4. Remove get/put_swap_device() in __swap_duplicate(). Which is
> >> called
> >> by
> >> - swap_shmem_alloc(): the swap cache is locked.
> >> - copy_nonpresent_pte() -> swap_duplicate() and try_to_unmap_one()
> >> ->
> >> swap_duplicate(): the page table lock is held.
> >> - __read_swap_cache_async() -> swapcache_prepare(): enclosed with
> >> get/put_swap_device() already.
> >> Other get/put_swap_device() usages are checked too.
> >
> > I suggest splitting this patch up into logical pieces as outlined here
> > by you already.

Agree with David here.

> 
> OK.  Will do that in the next version.

Your patch make sense to me.

Looking forward to your next version.

BTW, no relat to your patch, but just when I look
at your patch I notice is that we have too many swap
count functions.
The naming scheme is very confusing.

1) swap_count(), just mask out SWAP_HAS_CACHE

2) __swap_count() the name with underscore suggest it
is more internal.  But __swap_count() calls swap_count().
It is basically swap_count() with device lookup.

3) swap_swapcount()
similar to __swap_count() but with cluster level
locking if possible. otherwise fall back to device level locking.

4) __swp_swapcount()
swap_swapcount () with device lookup.  not consider continuing.
Again this function is more external while swap_swapcount()
is more internal.

5) swp_swapcount() similar to __swp_swapcount()
exact count consider continue

We should have a more consistent naming regarding swap count.
Device level, then cluster level, then entry level.

Also I consider the continuing is internal to the current
swap index implementation. If we have alternative swap file
implementation, we might not have count continuing at all.

Chris
Huang, Ying May 22, 2023, 1:26 a.m. UTC | #4
Chris Li <chrisl@kernel.org> writes:

> On Wed, May 17, 2023 at 08:23:18AM +0800, Huang, Ying wrote:
>> David Hildenbrand <david@redhat.com> writes:
>> 
>> > On 16.05.23 07:29, Huang Ying wrote:
>> >> The general rule to use a swap entry is as follows.
>> >> When we get a swap entry, if there isn't some other way to prevent
>> >> swapoff, such as page lock for swap cache, page table lock, etc., the
>> >> swap entry may become invalid because of swapoff.  Then, we need to
>> >> enclose all swap related functions with get_swap_device() and
>> >> put_swap_device(), unless the swap functions call
>> >> get/put_swap_device() by themselves.
>> >> Add the rule as comments of get_swap_device(), and cleanup some
>> >> functions which call get/put_swap_device().
>> >> 1. Enlarge the get/put_swap_device() protection range in
>> >> __read_swap_cache_async().  This makes the function a little easier to
>> >> be understood because we don't need to consider swapoff.  And this
>> >> makes it possible to remove get/put_swap_device() calling in some
>> >> function called by __read_swap_cache_async().
>> >> 2. Remove get/put_swap_device() in __swap_count().  Which is call in
>> >> do_swap_page() only, which encloses the call with get/put_swap_device()
>> >> already.
>> >> 3. Remove get/put_swap_device() in __swp_swapcount().  Which is call
>> >> in __read_swap_cache_async() only, which encloses the call with
>> >> get/put_swap_device() already.
>> >> 4. Remove get/put_swap_device() in __swap_duplicate(). Which is
>> >> called
>> >> by
>> >> - swap_shmem_alloc(): the swap cache is locked.
>> >> - copy_nonpresent_pte() -> swap_duplicate() and try_to_unmap_one()
>> >> ->
>> >> swap_duplicate(): the page table lock is held.
>> >> - __read_swap_cache_async() -> swapcache_prepare(): enclosed with
>> >> get/put_swap_device() already.
>> >> Other get/put_swap_device() usages are checked too.
>> >
>> > I suggest splitting this patch up into logical pieces as outlined here
>> > by you already.
>
> Agree with David here.
>
>> 
>> OK.  Will do that in the next version.
>
> Your patch make sense to me.
>
> Looking forward to your next version.
>
> BTW, no relat to your patch, but just when I look
> at your patch I notice is that we have too many swap
> count functions.
> The naming scheme is very confusing.
>
> 1) swap_count(), just mask out SWAP_HAS_CACHE
>
> 2) __swap_count() the name with underscore suggest it
> is more internal.  But __swap_count() calls swap_count().
> It is basically swap_count() with device lookup.
>
> 3) swap_swapcount()
> similar to __swap_count() but with cluster level
> locking if possible. otherwise fall back to device level locking.
>
> 4) __swp_swapcount()
> swap_swapcount () with device lookup.  not consider continuing.
> Again this function is more external while swap_swapcount()
> is more internal.
>
> 5) swp_swapcount() similar to __swp_swapcount()
> exact count consider continue
>
> We should have a more consistent naming regarding swap count.
> Device level, then cluster level, then entry level.

Yes.  The original naming is confusing.

> Also I consider the continuing is internal to the current
> swap index implementation. If we have alternative swap file
> implementation, we might not have count continuing at all.

There's some difficulties to hide continuation completely.  For example,
we want to call add_swap_count_continuation() in non-atomic context in
copy_pte_range(), while the fast path calls swap_duplicate() in atomic
context (via copy_nonpresent_pte()).

Best Regards,
Huang, Ying
diff mbox series

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3c69cb653cb9..f6bd51aa05ea 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -512,7 +512,7 @@  int find_first_swap(dev_t *device);
 extern unsigned int count_swap_pages(int, int);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int __swap_count(swp_entry_t entry);
-extern int __swp_swapcount(swp_entry_t entry);
+extern int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern struct swap_info_struct *swp_swap_info(swp_entry_t entry);
@@ -590,7 +590,7 @@  static inline int __swap_count(swp_entry_t entry)
 	return 0;
 }
 
-static inline int __swp_swapcount(swp_entry_t entry)
+static inline int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
 {
 	return 0;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b76a65ac28b3..a1028fe7214e 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -417,9 +417,13 @@  struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 {
 	struct swap_info_struct *si;
 	struct folio *folio;
+	struct page *page;
 	void *shadow = NULL;
 
 	*new_page_allocated = false;
+	si = get_swap_device(entry);
+	if (!si)
+		return NULL;
 
 	for (;;) {
 		int err;
@@ -428,14 +432,12 @@  struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * called after swap_cache_get_folio() failed, re-calling
 		 * that would confuse statistics.
 		 */
-		si = get_swap_device(entry);
-		if (!si)
-			return NULL;
 		folio = filemap_get_folio(swap_address_space(entry),
 						swp_offset(entry));
-		put_swap_device(si);
-		if (!IS_ERR(folio))
-			return folio_file_page(folio, swp_offset(entry));
+		if (!IS_ERR(folio)) {
+			page = folio_file_page(folio, swp_offset(entry));
+			goto got_page;
+		}
 
 		/*
 		 * Just skip read ahead for unused swap slot.
@@ -445,8 +447,8 @@  struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * as SWAP_HAS_CACHE.  That's done in later part of code or
 		 * else swap_off will be aborted if we return NULL.
 		 */
-		if (!__swp_swapcount(entry) && swap_slot_cache_enabled)
-			return NULL;
+		if (!swap_swapcount(si, entry) && swap_slot_cache_enabled)
+			goto fail;
 
 		/*
 		 * Get a new page to read into from swap.  Allocate it now,
@@ -455,7 +457,7 @@  struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 */
 		folio = vma_alloc_folio(gfp_mask, 0, vma, addr, false);
 		if (!folio)
-			return NULL;
+                        goto fail;
 
 		/*
 		 * Swap entry may have been freed since our caller observed it.
@@ -466,7 +468,7 @@  struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 
 		folio_put(folio);
 		if (err != -EEXIST)
-			return NULL;
+			goto fail;
 
 		/*
 		 * We might race against __delete_from_swap_cache(), and
@@ -500,12 +502,17 @@  struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	/* Caller will initiate read into locked folio */
 	folio_add_lru(folio);
 	*new_page_allocated = true;
-	return &folio->page;
+	page = &folio->page;
+got_page:
+	put_swap_device(si);
+	return page;
 
 fail_unlock:
 	put_swap_folio(folio, entry);
 	folio_unlock(folio);
 	folio_put(folio);
+fail:
+	put_swap_device(si);
 	return NULL;
 }
 
@@ -514,6 +521,10 @@  struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
  * and reading the disk if it is not already cached.
  * A failure return means that either the page allocation failed or that
  * the swap entry is no longer in use.
+ *
+ * get/put_swap_device() aren't needed to call this function, because
+ * __read_swap_cache_async() call them and swap_readpage() holds the
+ * swap cache folio lock.
  */
 struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 				   struct vm_area_struct *vma,
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 274bbf797480..0c1cb935b2eb 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1219,6 +1219,13 @@  static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
 }
 
 /*
+ * When we get a swap entry, if there isn't some other way to prevent
+ * swapoff, such as page lock for swap cache, page table lock, etc.,
+ * the swap entry may become invalid because of swapoff.  Then, we
+ * need to enclose all swap related functions with get_swap_device()
+ * and put_swap_device(), unless the swap functions call
+ * get/put_swap_device() by themselves.
+ *
  * Check whether swap entry is valid in the swap device.  If so,
  * return pointer to swap_info_struct, and keep the swap entry valid
  * via preventing the swap device from being swapoff, until
@@ -1227,9 +1234,8 @@  static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
  * Notice that swapoff or swapoff+swapon can still happen before the
  * percpu_ref_tryget_live() in get_swap_device() or after the
  * percpu_ref_put() in put_swap_device() if there isn't any other way
- * to prevent swapoff, such as page lock, page table lock, etc.  The
- * caller must be prepared for that.  For example, the following
- * situation is possible.
+ * to prevent swapoff.  The caller must be prepared for that.  For
+ * example, the following situation is possible.
  *
  *   CPU1				CPU2
  *   do_swap_page()
@@ -1432,16 +1438,10 @@  void swapcache_free_entries(swp_entry_t *entries, int n)
 
 int __swap_count(swp_entry_t entry)
 {
-	struct swap_info_struct *si;
+	struct swap_info_struct *si = swp_swap_info(entry);
 	pgoff_t offset = swp_offset(entry);
-	int count = 0;
 
-	si = get_swap_device(entry);
-	if (si) {
-		count = swap_count(si->swap_map[offset]);
-		put_swap_device(si);
-	}
-	return count;
+	return swap_count(si->swap_map[offset]);
 }
 
 /*
@@ -1449,7 +1449,7 @@  int __swap_count(swp_entry_t entry)
  * This does not give an exact answer when swap count is continued,
  * but does include the high COUNT_CONTINUED flag to allow for that.
  */
-static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
+int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
 {
 	pgoff_t offset = swp_offset(entry);
 	struct swap_cluster_info *ci;
@@ -1461,24 +1461,6 @@  static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
 	return count;
 }
 
-/*
- * How many references to @entry are currently swapped out?
- * This does not give an exact answer when swap count is continued,
- * but does include the high COUNT_CONTINUED flag to allow for that.
- */
-int __swp_swapcount(swp_entry_t entry)
-{
-	int count = 0;
-	struct swap_info_struct *si;
-
-	si = get_swap_device(entry);
-	if (si) {
-		count = swap_swapcount(si, entry);
-		put_swap_device(si);
-	}
-	return count;
-}
-
 /*
  * How many references to @entry are currently swapped out?
  * This considers COUNT_CONTINUED so it returns exact answer.
@@ -3288,9 +3270,7 @@  static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 	unsigned char has_cache;
 	int err;
 
-	p = get_swap_device(entry);
-	if (!p)
-		return -EINVAL;
+	p = swp_swap_info(entry);
 
 	offset = swp_offset(entry);
 	ci = lock_cluster_or_swap_info(p, offset);
@@ -3337,7 +3317,6 @@  static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 
 unlock_out:
 	unlock_cluster_or_swap_info(p, ci);
-	put_swap_device(p);
 	return err;
 }