[RFC] mm: add folio in swapcache if swapin from zswap

Message ID	20240322163939.17846-1-chengming.zhou@linux.dev (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: chengming.zhou@linux.dev To: hannes@cmpxchg.org, yosryahmed@google.com, nphamcs@gmail.com, chengming.zhou@linux.dev, akpm@linux-foundation.org Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Zhongkun He <hezhongkun.hzk@bytedance.com> Subject: [RFC PATCH] mm: add folio in swapcache if swapin from zswap Date: Sat, 23 Mar 2024 00:39:39 +0800 Message-Id: <20240322163939.17846-1-chengming.zhou@linux.dev> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	[RFC] mm: add folio in swapcache if swapin from zswap \| expand [RFC] mm: add folio in swapcache if swapin from zswap

Chengming Zhou March 22, 2024, 4:39 p.m. UTC

From: Chengming Zhou <chengming.zhou@linux.dev>

There is a report of data corruption caused by double swapin, which is
only possible in the skip swapcache path on SWP_SYNCHRONOUS_IO backends.

The root cause is that zswap is not like other "normal" swap backends,
it won't keep the copy of data after the first time of swapin. So if
the folio in the first time of swapin can't be installed in the pagetable
successfully and we just free it directly. Then in the second time of
swapin, we can't find anything in zswap and read wrong data from swapfile,
so this data corruption problem happened.

We can fix it by always adding the folio into swapcache if we know the
pinned swap entry can be found in zswap, so it won't get freed even though
it can't be installed successfully in the first time of swapin.

And we have to check if the swap entry is in zswap after entry pinned,
only then we can make sure the check result is stable.

Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
Closes: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com
Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
---
 include/linux/zswap.h |  6 ++++++
 mm/memory.c           | 28 ++++++++++++++++++++++++----
 mm/zswap.c            | 10 ++++++++++
 3 files changed, 40 insertions(+), 4 deletions(-)

Yosry Ahmed March 22, 2024, 7:37 p.m. UTC | #1

On Fri, Mar 22, 2024 at 9:40 AM <chengming.zhou@linux.dev> wrote:
>
> From: Chengming Zhou <chengming.zhou@linux.dev>
>
> There is a report of data corruption caused by double swapin, which is
> only possible in the skip swapcache path on SWP_SYNCHRONOUS_IO backends.
>
> The root cause is that zswap is not like other "normal" swap backends,
> it won't keep the copy of data after the first time of swapin. So if
> the folio in the first time of swapin can't be installed in the pagetable
> successfully and we just free it directly. Then in the second time of
> swapin, we can't find anything in zswap and read wrong data from swapfile,
> so this data corruption problem happened.
>
> We can fix it by always adding the folio into swapcache if we know the
> pinned swap entry can be found in zswap, so it won't get freed even though
> it can't be installed successfully in the first time of swapin.

A concurrent faulting thread could have already checked the swapcache
before we add the folio to it, right? In this case, that thread will
go ahead and call swap_read_folio() anyway.

Also, I suspect the zswap lookup might hurt performance. Would it be
better to add the folio back to zswap upon failure? This should be
detectable by checking if the folio is dirty as I mentioned in the bug
report thread.

Barry Song March 22, 2024, 9:41 p.m. UTC | #2

On Sat, Mar 23, 2024 at 8:38 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Fri, Mar 22, 2024 at 9:40 AM <chengming.zhou@linux.dev> wrote:
> >
> > From: Chengming Zhou <chengming.zhou@linux.dev>
> >
> > There is a report of data corruption caused by double swapin, which is
> > only possible in the skip swapcache path on SWP_SYNCHRONOUS_IO backends.
> >
> > The root cause is that zswap is not like other "normal" swap backends,
> > it won't keep the copy of data after the first time of swapin. So if

I don't quite understand this, so once we load a page from zswap, zswap
will free it even though do_swap_page might not set it to PTE?

shouldn't zswap free the memory after notify_free just like zram?

> > the folio in the first time of swapin can't be installed in the pagetable
> > successfully and we just free it directly. Then in the second time of
> > swapin, we can't find anything in zswap and read wrong data from swapfile,
> > so this data corruption problem happened.
> >
> > We can fix it by always adding the folio into swapcache if we know the
> > pinned swap entry can be found in zswap, so it won't get freed even though
> > it can't be installed successfully in the first time of swapin.
>
> A concurrent faulting thread could have already checked the swapcache
> before we add the folio to it, right? In this case, that thread will
> go ahead and call swap_read_folio() anyway.
>
> Also, I suspect the zswap lookup might hurt performance. Would it be
> better to add the folio back to zswap upon failure? This should be
> detectable by checking if the folio is dirty as I mentioned in the bug
> report thread.

I don't like the idea either as sync-io is the fast path for zram etc.
or, can we use
the way of zram to free compressed data?

>

Yosry Ahmed March 22, 2024, 10:33 p.m. UTC | #3

On Sat, Mar 23, 2024 at 10:41:32AM +1300, Barry Song wrote:
> On Sat, Mar 23, 2024 at 8:38 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Fri, Mar 22, 2024 at 9:40 AM <chengming.zhou@linux.dev> wrote:
> > >
> > > From: Chengming Zhou <chengming.zhou@linux.dev>
> > >
> > > There is a report of data corruption caused by double swapin, which is
> > > only possible in the skip swapcache path on SWP_SYNCHRONOUS_IO backends.
> > >
> > > The root cause is that zswap is not like other "normal" swap backends,
> > > it won't keep the copy of data after the first time of swapin. So if
> 
> I don't quite understand this, so once we load a page from zswap, zswap
> will free it even though do_swap_page might not set it to PTE?
> 
> shouldn't zswap free the memory after notify_free just like zram?

It's an optimization that zswap has, exclusive loads. After a page is
swapped in it can stick around in the swapcache for a while. In this
case, there would be two copies in memory with zram (compressed and
uncompressed). Zswap implements exclusive loads to drop the compressed
copy. The folio is marked as dirty so that any attempts to reclaim it
cause a new write (compression) to zswap. It is also for a lot of
cleanups and straightforward entry lifetime tracking in zswap.

It is mostly fine, the problem here happens because we skip the
swapcache during swapin, so there is a possibility that we load the
folio from zswap then just drop it without stashing it anywhere.

> 
> > > the folio in the first time of swapin can't be installed in the pagetable
> > > successfully and we just free it directly. Then in the second time of
> > > swapin, we can't find anything in zswap and read wrong data from swapfile,
> > > so this data corruption problem happened.
> > >
> > > We can fix it by always adding the folio into swapcache if we know the
> > > pinned swap entry can be found in zswap, so it won't get freed even though
> > > it can't be installed successfully in the first time of swapin.
> >
> > A concurrent faulting thread could have already checked the swapcache
> > before we add the folio to it, right? In this case, that thread will
> > go ahead and call swap_read_folio() anyway.
> >
> > Also, I suspect the zswap lookup might hurt performance. Would it be
> > better to add the folio back to zswap upon failure? This should be
> > detectable by checking if the folio is dirty as I mentioned in the bug
> > report thread.
> 
> I don't like the idea either as sync-io is the fast path for zram etc.
> or, can we use
> the way of zram to free compressed data?

I don't think we want to stop doing exclusive loads in zswap due to this
interaction with zram, which shouldn't be common.

I think we can solve this by just writing the folio back to zswap upon
failure as I mentioned.

Johannes Weiner March 22, 2024, 11:48 p.m. UTC | #4

On Fri, Mar 22, 2024 at 10:33:13PM +0000, Yosry Ahmed wrote:
> On Sat, Mar 23, 2024 at 10:41:32AM +1300, Barry Song wrote:
> > On Sat, Mar 23, 2024 at 8:38 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Fri, Mar 22, 2024 at 9:40 AM <chengming.zhou@linux.dev> wrote:
> > > >
> > > > From: Chengming Zhou <chengming.zhou@linux.dev>
> > > >
> > > > There is a report of data corruption caused by double swapin, which is
> > > > only possible in the skip swapcache path on SWP_SYNCHRONOUS_IO backends.
> > > >
> > > > The root cause is that zswap is not like other "normal" swap backends,
> > > > it won't keep the copy of data after the first time of swapin. So if
> > 
> > I don't quite understand this, so once we load a page from zswap, zswap
> > will free it even though do_swap_page might not set it to PTE?
> > 
> > shouldn't zswap free the memory after notify_free just like zram?
> 
> It's an optimization that zswap has, exclusive loads. After a page is
> swapped in it can stick around in the swapcache for a while. In this
> case, there would be two copies in memory with zram (compressed and
> uncompressed). Zswap implements exclusive loads to drop the compressed
> copy. The folio is marked as dirty so that any attempts to reclaim it
> cause a new write (compression) to zswap. It is also for a lot of
> cleanups and straightforward entry lifetime tracking in zswap.
> 
> It is mostly fine, the problem here happens because we skip the
> swapcache during swapin, so there is a possibility that we load the
> folio from zswap then just drop it without stashing it anywhere.
> 
> > 
> > > > the folio in the first time of swapin can't be installed in the pagetable
> > > > successfully and we just free it directly. Then in the second time of
> > > > swapin, we can't find anything in zswap and read wrong data from swapfile,
> > > > so this data corruption problem happened.
> > > >
> > > > We can fix it by always adding the folio into swapcache if we know the
> > > > pinned swap entry can be found in zswap, so it won't get freed even though
> > > > it can't be installed successfully in the first time of swapin.
> > >
> > > A concurrent faulting thread could have already checked the swapcache
> > > before we add the folio to it, right? In this case, that thread will
> > > go ahead and call swap_read_folio() anyway.
> > >
> > > Also, I suspect the zswap lookup might hurt performance. Would it be
> > > better to add the folio back to zswap upon failure? This should be
> > > detectable by checking if the folio is dirty as I mentioned in the bug
> > > report thread.
> > 
> > I don't like the idea either as sync-io is the fast path for zram etc.
> > or, can we use
> > the way of zram to free compressed data?
> 
> I don't think we want to stop doing exclusive loads in zswap due to this
> interaction with zram, which shouldn't be common.
> 
> I think we can solve this by just writing the folio back to zswap upon
> failure as I mentioned.

Instead of storing again, can we avoid invalidating the entry in the
first place if the load is not "exclusive"?

The reason for exclusive loads is that the ownership is transferred to
the swapcache, so there is no point in keeping our copy. With an
optimistic read that doesn't transfer ownership, this doesn't
apply. And we can easily tell inside zswap_load() if we're dealing
with a swapcache read or not by testing the folio.

The synchronous read already has to pin the swp_entry_t to be safe,
using swapcache_prepare(). That blocks __read_swap_cache_async() which
means no other (exclusive) loads and no invalidates can occur.

The zswap entry is freed during the regular swap_free() path, which
the sync fault calls on success. Otherwise we keep it.

diff --git a/mm/zswap.c b/mm/zswap.c
index 535c907345e0..686364a6dd86 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
 	swp_entry_t swp = folio->swap;
 	pgoff_t offset = swp_offset(swp);
 	struct page *page = &folio->page;
+	bool swapcache = folio_test_swapcache(folio);
 	struct zswap_tree *tree = swap_zswap_tree(swp);
 	struct zswap_entry *entry;
 	u8 *dst;
@@ -1634,7 +1635,8 @@ bool zswap_load(struct folio *folio)
 		spin_unlock(&tree->lock);
 		return false;
 	}
-	zswap_rb_erase(&tree->rbroot, entry);
+	if (swapcache)
+		zswap_rb_erase(&tree->rbroot, entry);
 	spin_unlock(&tree->lock);
 
 	if (entry->length)
@@ -1649,9 +1651,10 @@ bool zswap_load(struct folio *folio)
 	if (entry->objcg)
 		count_objcg_event(entry->objcg, ZSWPIN);
 
-	zswap_entry_free(entry);
-
-	folio_mark_dirty(folio);
+	if (swapcache) {
+		zswap_entry_free(entry);
+		folio_mark_dirty(folio);
+	}
 
 	return true;
 }

Yosry Ahmed March 23, 2024, 12:12 a.m. UTC | #5

On Fri, Mar 22, 2024 at 4:48 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Mar 22, 2024 at 10:33:13PM +0000, Yosry Ahmed wrote:
> > On Sat, Mar 23, 2024 at 10:41:32AM +1300, Barry Song wrote:
> > > On Sat, Mar 23, 2024 at 8:38 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Fri, Mar 22, 2024 at 9:40 AM <chengming.zhou@linux.dev> wrote:
> > > > >
> > > > > From: Chengming Zhou <chengming.zhou@linux.dev>
> > > > >
> > > > > There is a report of data corruption caused by double swapin, which is
> > > > > only possible in the skip swapcache path on SWP_SYNCHRONOUS_IO backends.
> > > > >
> > > > > The root cause is that zswap is not like other "normal" swap backends,
> > > > > it won't keep the copy of data after the first time of swapin. So if
> > >
> > > I don't quite understand this, so once we load a page from zswap, zswap
> > > will free it even though do_swap_page might not set it to PTE?
> > >
> > > shouldn't zswap free the memory after notify_free just like zram?
> >
> > It's an optimization that zswap has, exclusive loads. After a page is
> > swapped in it can stick around in the swapcache for a while. In this
> > case, there would be two copies in memory with zram (compressed and
> > uncompressed). Zswap implements exclusive loads to drop the compressed
> > copy. The folio is marked as dirty so that any attempts to reclaim it
> > cause a new write (compression) to zswap. It is also for a lot of
> > cleanups and straightforward entry lifetime tracking in zswap.
> >
> > It is mostly fine, the problem here happens because we skip the
> > swapcache during swapin, so there is a possibility that we load the
> > folio from zswap then just drop it without stashing it anywhere.
> >
> > >
> > > > > the folio in the first time of swapin can't be installed in the pagetable
> > > > > successfully and we just free it directly. Then in the second time of
> > > > > swapin, we can't find anything in zswap and read wrong data from swapfile,
> > > > > so this data corruption problem happened.
> > > > >
> > > > > We can fix it by always adding the folio into swapcache if we know the
> > > > > pinned swap entry can be found in zswap, so it won't get freed even though
> > > > > it can't be installed successfully in the first time of swapin.
> > > >
> > > > A concurrent faulting thread could have already checked the swapcache
> > > > before we add the folio to it, right? In this case, that thread will
> > > > go ahead and call swap_read_folio() anyway.
> > > >
> > > > Also, I suspect the zswap lookup might hurt performance. Would it be
> > > > better to add the folio back to zswap upon failure? This should be
> > > > detectable by checking if the folio is dirty as I mentioned in the bug
> > > > report thread.
> > >
> > > I don't like the idea either as sync-io is the fast path for zram etc.
> > > or, can we use
> > > the way of zram to free compressed data?
> >
> > I don't think we want to stop doing exclusive loads in zswap due to this
> > interaction with zram, which shouldn't be common.
> >
> > I think we can solve this by just writing the folio back to zswap upon
> > failure as I mentioned.
>
> Instead of storing again, can we avoid invalidating the entry in the
> first place if the load is not "exclusive"?
>
> The reason for exclusive loads is that the ownership is transferred to
> the swapcache, so there is no point in keeping our copy. With an
> optimistic read that doesn't transfer ownership, this doesn't
> apply. And we can easily tell inside zswap_load() if we're dealing
> with a swapcache read or not by testing the folio.
>
> The synchronous read already has to pin the swp_entry_t to be safe,
> using swapcache_prepare(). That blocks __read_swap_cache_async() which
> means no other (exclusive) loads and no invalidates can occur.
>
> The zswap entry is freed during the regular swap_free() path, which
> the sync fault calls on success. Otherwise we keep it.

I thought about this, but I was particularly worried about the need to
bring back the refcount that was removed when we switched to only
supporting exclusive loads:
https://lore.kernel.org/lkml/20240201-b4-zswap-invalidate-entry-v2-6-99d4084260a0@bytedance.com/

It seems to be that we don't need it, because swap_free() will free
the entry as you mentioned before anyone else has the chance to load
it or invalidate it. Writeback used to grab a reference as well, but
it removes the entry from the tree anyway and takes full ownership of
it then frees it, so that should be okay.

It makes me nervous though to be honest. For example, not long ago
swap_free() didn't call zswap_invalidate() directly (used to happen to
swap slots cache draining). Without it, a subsequent load could race
with writeback without refcount protection, right? We would need to
make sure to backport 0827a1fb143f ("mm/zswap: invalidate zswap entry
when swap entry free") with the fix to stable for instance.

I can't find a problem with your diff, but it just makes me nervous to
have non-exclusive loads without a refcount.

>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 535c907345e0..686364a6dd86 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
>         swp_entry_t swp = folio->swap;
>         pgoff_t offset = swp_offset(swp);
>         struct page *page = &folio->page;
> +       bool swapcache = folio_test_swapcache(folio);
>         struct zswap_tree *tree = swap_zswap_tree(swp);
>         struct zswap_entry *entry;
>         u8 *dst;
> @@ -1634,7 +1635,8 @@ bool zswap_load(struct folio *folio)
>                 spin_unlock(&tree->lock);
>                 return false;
>         }
> -       zswap_rb_erase(&tree->rbroot, entry);
> +       if (swapcache)
> +               zswap_rb_erase(&tree->rbroot, entry);
>         spin_unlock(&tree->lock);
>
>         if (entry->length)
> @@ -1649,9 +1651,10 @@ bool zswap_load(struct folio *folio)
>         if (entry->objcg)
>                 count_objcg_event(entry->objcg, ZSWPIN);
>
> -       zswap_entry_free(entry);
> -
> -       folio_mark_dirty(folio);
> +       if (swapcache) {
> +               zswap_entry_free(entry);
> +               folio_mark_dirty(folio);
> +       }
>
>         return true;
>  }

Yosry Ahmed March 23, 2024, 12:14 a.m. UTC | #6

[..]
> > > I don't think we want to stop doing exclusive loads in zswap due to this
> > > interaction with zram, which shouldn't be common.
> > >
> > > I think we can solve this by just writing the folio back to zswap upon
> > > failure as I mentioned.
> >
> > Instead of storing again, can we avoid invalidating the entry in the
> > first place if the load is not "exclusive"?
> >
> > The reason for exclusive loads is that the ownership is transferred to
> > the swapcache, so there is no point in keeping our copy. With an
> > optimistic read that doesn't transfer ownership, this doesn't
> > apply. And we can easily tell inside zswap_load() if we're dealing
> > with a swapcache read or not by testing the folio.
> >
> > The synchronous read already has to pin the swp_entry_t to be safe,
> > using swapcache_prepare(). That blocks __read_swap_cache_async() which
> > means no other (exclusive) loads and no invalidates can occur.
> >
> > The zswap entry is freed during the regular swap_free() path, which
> > the sync fault calls on success. Otherwise we keep it.
>
> I thought about this, but I was particularly worried about the need to
> bring back the refcount that was removed when we switched to only
> supporting exclusive loads:
> https://lore.kernel.org/lkml/20240201-b4-zswap-invalidate-entry-v2-6-99d4084260a0@bytedance.com/
>
> It seems to be that we don't need it, because swap_free() will free
> the entry as you mentioned before anyone else has the chance to load
> it or invalidate it. Writeback used to grab a reference as well, but
> it removes the entry from the tree anyway and takes full ownership of
> it then frees it, so that should be okay.
>
> It makes me nervous though to be honest. For example, not long ago
> swap_free() didn't call zswap_invalidate() directly (used to happen to
> swap slots cache draining). Without it, a subsequent load could race
> with writeback without refcount protection, right? We would need to
> make sure to backport 0827a1fb143f ("mm/zswap: invalidate zswap entry
> when swap entry free") with the fix to stable for instance.
>
> I can't find a problem with your diff, but it just makes me nervous to
> have non-exclusive loads without a refcount.
>
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 535c907345e0..686364a6dd86 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
> >         swp_entry_t swp = folio->swap;
> >         pgoff_t offset = swp_offset(swp);
> >         struct page *page = &folio->page;
> > +       bool swapcache = folio_test_swapcache(folio);
> >         struct zswap_tree *tree = swap_zswap_tree(swp);
> >         struct zswap_entry *entry;
> >         u8 *dst;
> > @@ -1634,7 +1635,8 @@ bool zswap_load(struct folio *folio)
> >                 spin_unlock(&tree->lock);
> >                 return false;
> >         }
> > -       zswap_rb_erase(&tree->rbroot, entry);
> > +       if (swapcache)
> > +               zswap_rb_erase(&tree->rbroot, entry);

On second thought, if we don't remove the entry from the tree here,
writeback could free the entry from under us after we drop the lock
here, right?

> >         spin_unlock(&tree->lock);
> >
> >         if (entry->length)
> > @@ -1649,9 +1651,10 @@ bool zswap_load(struct folio *folio)
> >         if (entry->objcg)
> >                 count_objcg_event(entry->objcg, ZSWPIN);
> >
> > -       zswap_entry_free(entry);
> > -
> > -       folio_mark_dirty(folio);
> > +       if (swapcache) {
> > +               zswap_entry_free(entry);
> > +               folio_mark_dirty(folio);
> > +       }
> >
> >         return true;
> >  }

Johannes Weiner March 23, 2024, 1:55 a.m. UTC | #7

On Fri, Mar 22, 2024 at 05:14:37PM -0700, Yosry Ahmed wrote:
> [..]
> > > > I don't think we want to stop doing exclusive loads in zswap due to this
> > > > interaction with zram, which shouldn't be common.
> > > >
> > > > I think we can solve this by just writing the folio back to zswap upon
> > > > failure as I mentioned.
> > >
> > > Instead of storing again, can we avoid invalidating the entry in the
> > > first place if the load is not "exclusive"?
> > >
> > > The reason for exclusive loads is that the ownership is transferred to
> > > the swapcache, so there is no point in keeping our copy. With an
> > > optimistic read that doesn't transfer ownership, this doesn't
> > > apply. And we can easily tell inside zswap_load() if we're dealing
> > > with a swapcache read or not by testing the folio.
> > >
> > > The synchronous read already has to pin the swp_entry_t to be safe,
> > > using swapcache_prepare(). That blocks __read_swap_cache_async() which
> > > means no other (exclusive) loads and no invalidates can occur.
> > >
> > > The zswap entry is freed during the regular swap_free() path, which
> > > the sync fault calls on success. Otherwise we keep it.
> >
> > I thought about this, but I was particularly worried about the need to
> > bring back the refcount that was removed when we switched to only
> > supporting exclusive loads:
> > https://lore.kernel.org/lkml/20240201-b4-zswap-invalidate-entry-v2-6-99d4084260a0@bytedance.com/
> >
> > It seems to be that we don't need it, because swap_free() will free
> > the entry as you mentioned before anyone else has the chance to load
> > it or invalidate it. Writeback used to grab a reference as well, but
> > it removes the entry from the tree anyway and takes full ownership of
> > it then frees it, so that should be okay.
> >
> > It makes me nervous though to be honest. For example, not long ago
> > swap_free() didn't call zswap_invalidate() directly (used to happen to
> > swap slots cache draining). Without it, a subsequent load could race
> > with writeback without refcount protection, right? We would need to
> > make sure to backport 0827a1fb143f ("mm/zswap: invalidate zswap entry
> > when swap entry free") with the fix to stable for instance.
> >
> > I can't find a problem with your diff, but it just makes me nervous to
> > have non-exclusive loads without a refcount.
> >
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 535c907345e0..686364a6dd86 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
> > >         swp_entry_t swp = folio->swap;
> > >         pgoff_t offset = swp_offset(swp);
> > >         struct page *page = &folio->page;
> > > +       bool swapcache = folio_test_swapcache(folio);
> > >         struct zswap_tree *tree = swap_zswap_tree(swp);
> > >         struct zswap_entry *entry;
> > >         u8 *dst;
> > > @@ -1634,7 +1635,8 @@ bool zswap_load(struct folio *folio)
> > >                 spin_unlock(&tree->lock);
> > >                 return false;
> > >         }
> > > -       zswap_rb_erase(&tree->rbroot, entry);
> > > +       if (swapcache)
> > > +               zswap_rb_erase(&tree->rbroot, entry);
> 
> On second thought, if we don't remove the entry from the tree here,
> writeback could free the entry from under us after we drop the lock
> here, right?

The sync-swapin does swapcache_prepare() and holds SWAP_HAS_CACHE, so
racing writeback would loop on the -EEXIST in __read_swap_cache_async().
(Or, if writeback wins the race, sync-swapin fails on swapcache_prepare()
instead and bails on the fault.)

This isn't coincidental. The sync-swapin needs to, and does, serialize
against the swap entry moving into swapcache or being invalidated for
it to be safe. Which is the same requirement that zswap ops have.

Yosry Ahmed March 23, 2024, 2:02 a.m. UTC | #8

On Fri, Mar 22, 2024 at 6:55 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Mar 22, 2024 at 05:14:37PM -0700, Yosry Ahmed wrote:
> > [..]
> > > > > I don't think we want to stop doing exclusive loads in zswap due to this
> > > > > interaction with zram, which shouldn't be common.
> > > > >
> > > > > I think we can solve this by just writing the folio back to zswap upon
> > > > > failure as I mentioned.
> > > >
> > > > Instead of storing again, can we avoid invalidating the entry in the
> > > > first place if the load is not "exclusive"?
> > > >
> > > > The reason for exclusive loads is that the ownership is transferred to
> > > > the swapcache, so there is no point in keeping our copy. With an
> > > > optimistic read that doesn't transfer ownership, this doesn't
> > > > apply. And we can easily tell inside zswap_load() if we're dealing
> > > > with a swapcache read or not by testing the folio.
> > > >
> > > > The synchronous read already has to pin the swp_entry_t to be safe,
> > > > using swapcache_prepare(). That blocks __read_swap_cache_async() which
> > > > means no other (exclusive) loads and no invalidates can occur.
> > > >
> > > > The zswap entry is freed during the regular swap_free() path, which
> > > > the sync fault calls on success. Otherwise we keep it.
> > >
> > > I thought about this, but I was particularly worried about the need to
> > > bring back the refcount that was removed when we switched to only
> > > supporting exclusive loads:
> > > https://lore.kernel.org/lkml/20240201-b4-zswap-invalidate-entry-v2-6-99d4084260a0@bytedance.com/
> > >
> > > It seems to be that we don't need it, because swap_free() will free
> > > the entry as you mentioned before anyone else has the chance to load
> > > it or invalidate it. Writeback used to grab a reference as well, but
> > > it removes the entry from the tree anyway and takes full ownership of
> > > it then frees it, so that should be okay.
> > >
> > > It makes me nervous though to be honest. For example, not long ago
> > > swap_free() didn't call zswap_invalidate() directly (used to happen to
> > > swap slots cache draining). Without it, a subsequent load could race
> > > with writeback without refcount protection, right? We would need to
> > > make sure to backport 0827a1fb143f ("mm/zswap: invalidate zswap entry
> > > when swap entry free") with the fix to stable for instance.
> > >
> > > I can't find a problem with your diff, but it just makes me nervous to
> > > have non-exclusive loads without a refcount.
> > >
> > > >
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index 535c907345e0..686364a6dd86 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
> > > >         swp_entry_t swp = folio->swap;
> > > >         pgoff_t offset = swp_offset(swp);
> > > >         struct page *page = &folio->page;
> > > > +       bool swapcache = folio_test_swapcache(folio);
> > > >         struct zswap_tree *tree = swap_zswap_tree(swp);
> > > >         struct zswap_entry *entry;
> > > >         u8 *dst;
> > > > @@ -1634,7 +1635,8 @@ bool zswap_load(struct folio *folio)
> > > >                 spin_unlock(&tree->lock);
> > > >                 return false;
> > > >         }
> > > > -       zswap_rb_erase(&tree->rbroot, entry);
> > > > +       if (swapcache)
> > > > +               zswap_rb_erase(&tree->rbroot, entry);
> >
> > On second thought, if we don't remove the entry from the tree here,
> > writeback could free the entry from under us after we drop the lock
> > here, right?
>
> The sync-swapin does swapcache_prepare() and holds SWAP_HAS_CACHE, so
> racing writeback would loop on the -EEXIST in __read_swap_cache_async().
> (Or, if writeback wins the race, sync-swapin fails on swapcache_prepare()
> instead and bails on the fault.)
>
> This isn't coincidental. The sync-swapin needs to, and does, serialize
> against the swap entry moving into swapcache or being invalidated for
> it to be safe. Which is the same requirement that zswap ops have.

You are right. Even if swap_free() isn't called under SWAP_HAS_CACHE's
protection, a subsequent load will also be protected by SWAP_HAS_CACHE
(whether it's swapped in with sync swapin or throught the swapcache)
-- so it would be protected against writeback as well. Now it seems
like we may have been able to drop the refcount even without exclusive
loads..?

Anyway, I think your fix is sound. Zhongkun, do you mind confirming
that the diff Johannes sent fixes the problem for you?

Zhongkun He March 23, 2024, 2:40 a.m. UTC | #9

On Sat, Mar 23, 2024 at 10:03 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Fri, Mar 22, 2024 at 6:55 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Fri, Mar 22, 2024 at 05:14:37PM -0700, Yosry Ahmed wrote:
> > > [..]
> > > > > > I don't think we want to stop doing exclusive loads in zswap due to this
> > > > > > interaction with zram, which shouldn't be common.
> > > > > >
> > > > > > I think we can solve this by just writing the folio back to zswap upon
> > > > > > failure as I mentioned.
> > > > >
> > > > > Instead of storing again, can we avoid invalidating the entry in the
> > > > > first place if the load is not "exclusive"?
> > > > >
> > > > > The reason for exclusive loads is that the ownership is transferred to
> > > > > the swapcache, so there is no point in keeping our copy. With an
> > > > > optimistic read that doesn't transfer ownership, this doesn't
> > > > > apply. And we can easily tell inside zswap_load() if we're dealing
> > > > > with a swapcache read or not by testing the folio.
> > > > >
> > > > > The synchronous read already has to pin the swp_entry_t to be safe,
> > > > > using swapcache_prepare(). That blocks __read_swap_cache_async() which
> > > > > means no other (exclusive) loads and no invalidates can occur.
> > > > >
> > > > > The zswap entry is freed during the regular swap_free() path, which
> > > > > the sync fault calls on success. Otherwise we keep it.
> > > >
> > > > I thought about this, but I was particularly worried about the need to
> > > > bring back the refcount that was removed when we switched to only
> > > > supporting exclusive loads:
> > > > https://lore.kernel.org/lkml/20240201-b4-zswap-invalidate-entry-v2-6-99d4084260a0@bytedance.com/
> > > >
> > > > It seems to be that we don't need it, because swap_free() will free
> > > > the entry as you mentioned before anyone else has the chance to load
> > > > it or invalidate it. Writeback used to grab a reference as well, but
> > > > it removes the entry from the tree anyway and takes full ownership of
> > > > it then frees it, so that should be okay.
> > > >
> > > > It makes me nervous though to be honest. For example, not long ago
> > > > swap_free() didn't call zswap_invalidate() directly (used to happen to
> > > > swap slots cache draining). Without it, a subsequent load could race
> > > > with writeback without refcount protection, right? We would need to
> > > > make sure to backport 0827a1fb143f ("mm/zswap: invalidate zswap entry
> > > > when swap entry free") with the fix to stable for instance.
> > > >
> > > > I can't find a problem with your diff, but it just makes me nervous to
> > > > have non-exclusive loads without a refcount.
> > > >
> > > > >
> > > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > > index 535c907345e0..686364a6dd86 100644
> > > > > --- a/mm/zswap.c
> > > > > +++ b/mm/zswap.c
> > > > > @@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
> > > > >         swp_entry_t swp = folio->swap;
> > > > >         pgoff_t offset = swp_offset(swp);
> > > > >         struct page *page = &folio->page;
> > > > > +       bool swapcache = folio_test_swapcache(folio);
> > > > >         struct zswap_tree *tree = swap_zswap_tree(swp);
> > > > >         struct zswap_entry *entry;
> > > > >         u8 *dst;
> > > > > @@ -1634,7 +1635,8 @@ bool zswap_load(struct folio *folio)
> > > > >                 spin_unlock(&tree->lock);
> > > > >                 return false;
> > > > >         }
> > > > > -       zswap_rb_erase(&tree->rbroot, entry);
> > > > > +       if (swapcache)
> > > > > +               zswap_rb_erase(&tree->rbroot, entry);
> > >
> > > On second thought, if we don't remove the entry from the tree here,
> > > writeback could free the entry from under us after we drop the lock
> > > here, right?
> >
> > The sync-swapin does swapcache_prepare() and holds SWAP_HAS_CACHE, so
> > racing writeback would loop on the -EEXIST in __read_swap_cache_async().
> > (Or, if writeback wins the race, sync-swapin fails on swapcache_prepare()
> > instead and bails on the fault.)
> >
> > This isn't coincidental. The sync-swapin needs to, and does, serialize
> > against the swap entry moving into swapcache or being invalidated for
> > it to be safe. Which is the same requirement that zswap ops have.
>
> You are right. Even if swap_free() isn't called under SWAP_HAS_CACHE's
> protection, a subsequent load will also be protected by SWAP_HAS_CACHE
> (whether it's swapped in with sync swapin or throught the swapcache)
> -- so it would be protected against writeback as well. Now it seems
> like we may have been able to drop the refcount even without exclusive
> loads..?
>
> Anyway, I think your fix is sound. Zhongkun, do you mind confirming
> that the diff Johannes sent fixes the problem for you?

OK， I will try it and come back in a few hours.

Thanks for the solution, it sounds great.

Chengming Zhou March 23, 2024, 2:49 a.m. UTC | #10

On 2024/3/23 03:37, Yosry Ahmed wrote:
> On Fri, Mar 22, 2024 at 9:40 AM <chengming.zhou@linux.dev> wrote:
>>
>> From: Chengming Zhou <chengming.zhou@linux.dev>
>>
>> There is a report of data corruption caused by double swapin, which is
>> only possible in the skip swapcache path on SWP_SYNCHRONOUS_IO backends.
>>
>> The root cause is that zswap is not like other "normal" swap backends,
>> it won't keep the copy of data after the first time of swapin. So if
>> the folio in the first time of swapin can't be installed in the pagetable
>> successfully and we just free it directly. Then in the second time of
>> swapin, we can't find anything in zswap and read wrong data from swapfile,
>> so this data corruption problem happened.
>>
>> We can fix it by always adding the folio into swapcache if we know the
>> pinned swap entry can be found in zswap, so it won't get freed even though
>> it can't be installed successfully in the first time of swapin.
> 
> A concurrent faulting thread could have already checked the swapcache
> before we add the folio to it, right? In this case, that thread will
> go ahead and call swap_read_folio() anyway.

Right, but it has to lock the folio to proceed.

> 
> Also, I suspect the zswap lookup might hurt performance. Would it be
> better to add the folio back to zswap upon failure? This should be
> detectable by checking if the folio is dirty as I mentioned in the bug
> report thread.

Yes, may hurt performance. As for adding back upon failure, the problem
is that adding may fail too... and I don't know how to handle that.

Anyway, I think the fix of Johannes is much better, we should take that way.

Thanks.

Zhongkun He March 23, 2024, 8:38 a.m. UTC | #11

On Sat, Mar 23, 2024 at 7:48 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Mar 22, 2024 at 10:33:13PM +0000, Yosry Ahmed wrote:
> > On Sat, Mar 23, 2024 at 10:41:32AM +1300, Barry Song wrote:
> > > On Sat, Mar 23, 2024 at 8:38 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Fri, Mar 22, 2024 at 9:40 AM <chengming.zhou@linux.dev> wrote:
> > > > >
> > > > > From: Chengming Zhou <chengming.zhou@linux.dev>
> > > > >
> > > > > There is a report of data corruption caused by double swapin, which is
> > > > > only possible in the skip swapcache path on SWP_SYNCHRONOUS_IO backends.
> > > > >
> > > > > The root cause is that zswap is not like other "normal" swap backends,
> > > > > it won't keep the copy of data after the first time of swapin. So if
> > >
> > > I don't quite understand this, so once we load a page from zswap, zswap
> > > will free it even though do_swap_page might not set it to PTE?
> > >
> > > shouldn't zswap free the memory after notify_free just like zram?
> >
> > It's an optimization that zswap has, exclusive loads. After a page is
> > swapped in it can stick around in the swapcache for a while. In this
> > case, there would be two copies in memory with zram (compressed and
> > uncompressed). Zswap implements exclusive loads to drop the compressed
> > copy. The folio is marked as dirty so that any attempts to reclaim it
> > cause a new write (compression) to zswap. It is also for a lot of
> > cleanups and straightforward entry lifetime tracking in zswap.
> >
> > It is mostly fine, the problem here happens because we skip the
> > swapcache during swapin, so there is a possibility that we load the
> > folio from zswap then just drop it without stashing it anywhere.
> >
> > >
> > > > > the folio in the first time of swapin can't be installed in the pagetable
> > > > > successfully and we just free it directly. Then in the second time of
> > > > > swapin, we can't find anything in zswap and read wrong data from swapfile,
> > > > > so this data corruption problem happened.
> > > > >
> > > > > We can fix it by always adding the folio into swapcache if we know the
> > > > > pinned swap entry can be found in zswap, so it won't get freed even though
> > > > > it can't be installed successfully in the first time of swapin.
> > > >
> > > > A concurrent faulting thread could have already checked the swapcache
> > > > before we add the folio to it, right? In this case, that thread will
> > > > go ahead and call swap_read_folio() anyway.
> > > >
> > > > Also, I suspect the zswap lookup might hurt performance. Would it be
> > > > better to add the folio back to zswap upon failure? This should be
> > > > detectable by checking if the folio is dirty as I mentioned in the bug
> > > > report thread.
> > >
> > > I don't like the idea either as sync-io is the fast path for zram etc.
> > > or, can we use
> > > the way of zram to free compressed data?
> >
> > I don't think we want to stop doing exclusive loads in zswap due to this
> > interaction with zram, which shouldn't be common.
> >
> > I think we can solve this by just writing the folio back to zswap upon
> > failure as I mentioned.
>
> Instead of storing again, can we avoid invalidating the entry in the
> first place if the load is not "exclusive"?
>
> The reason for exclusive loads is that the ownership is transferred to
> the swapcache, so there is no point in keeping our copy. With an
> optimistic read that doesn't transfer ownership, this doesn't
> apply. And we can easily tell inside zswap_load() if we're dealing
> with a swapcache read or not by testing the folio.
>
> The synchronous read already has to pin the swp_entry_t to be safe,
> using swapcache_prepare(). That blocks __read_swap_cache_async() which
> means no other (exclusive) loads and no invalidates can occur.
>
> The zswap entry is freed during the regular swap_free() path, which
> the sync fault calls on success. Otherwise we keep it.
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 535c907345e0..686364a6dd86 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
>         swp_entry_t swp = folio->swap;
>         pgoff_t offset = swp_offset(swp);
>         struct page *page = &folio->page;
> +       bool swapcache = folio_test_swapcache(folio);
>         struct zswap_tree *tree = swap_zswap_tree(swp);
>         struct zswap_entry *entry;
>         u8 *dst;
> @@ -1634,7 +1635,8 @@ bool zswap_load(struct folio *folio)
>                 spin_unlock(&tree->lock);
>                 return false;
>         }
> -       zswap_rb_erase(&tree->rbroot, entry);
> +       if (swapcache)
> +               zswap_rb_erase(&tree->rbroot, entry);
>         spin_unlock(&tree->lock);
>
>         if (entry->length)
> @@ -1649,9 +1651,10 @@ bool zswap_load(struct folio *folio)
>         if (entry->objcg)
>                 count_objcg_event(entry->objcg, ZSWPIN);
>
> -       zswap_entry_free(entry);
> -
> -       folio_mark_dirty(folio);
> +       if (swapcache) {
> +               zswap_entry_free(entry);
> +               folio_mark_dirty(folio);
> +       }
>
>         return true;
>  }

Hi Johannes,

After a few hours I haven't seen any problems.

If you don't mind，please add the following tag:
Tested-by:Zhongkun He <hezhongkun.hzk@bytedance.com>
Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>

Thanks.

[RFC] mm: add folio in swapcache if swapin from zswap

Commit Message

Comments

Patch