[0/2] minimize swapping on zswap store failure

Message ID 20231017003519.1426574-1-nphamcs@gmail.com (mailing list archive)

Message

Nhat Pham Oct. 17, 2023, 12:35 a.m. UTC
Currently, when a zswap store attempt fails, the page is immediately
swapped out. This could happen for a variety of reasons. For instance,
the compression algorithm could fail (such as when the data is not
compressible), or the backend allocator might not be able to find a
suitable slot for the compressed page. If these pages are needed
later on, users will incur IOs from swapins.

This issue prevents the adoption of zswap by potential users who
cannot tolerate the latency associated with swapping. In many cases,
these IOs are avoidable if we just keep in memory the pages that zswap
fails to store.

This patch series adds two new features to zswap that alleviate
the risk of swapping:

a) When a store attempt fails, keep the page untouched in memory
instead of swapping it out.

b) If the store attempt fails at the compression step, allow the page
to be stored in its uncompressed form in the zswap pool. This maintains
the LRU ordering of pages, which will be helpful for accurate
memory reclaim (zswap writeback in particular).

These features can be enabled independently via two new zswap module
parameters.
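
For illustration, such knobs are conventionally exposed through moduleparam.h and surface under /sys/module/zswap/parameters/. The parameter names below are placeholders invented for this sketch, not the names the series actually introduces:

```c
/* Sketch only: hypothetical module parameters, runtime-writable
 * under /sys/module/zswap/parameters/. */
static bool zswap_bypass_swap_on_failure;
module_param_named(bypass_swap_on_failure,
		   zswap_bypass_swap_on_failure, bool, 0644);

static bool zswap_store_uncompressed;
module_param_named(store_uncompressed,
		   zswap_store_uncompressed, bool, 0644);
```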

Nhat Pham (2):
  swap: allows swap bypassing on zswap store failure
  zswap: store uncompressed pages when compression algorithm fails

 Documentation/admin-guide/mm/zswap.rst | 16 +++++++
 include/linux/zswap.h                  |  9 ++++
 mm/page_io.c                           |  6 +++
 mm/shmem.c                             |  8 +++-
 mm/zswap.c                             | 64 +++++++++++++++++++++++---
 5 files changed, 95 insertions(+), 8 deletions(-)

Comments

Yosry Ahmed Oct. 17, 2023, 12:57 a.m. UTC | #1
On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Currently, when a zswap store attempt fails, the page is immediately
> swapped out. This could happen for a variety of reasons. For instance,
> the compression algorithm could fail (such as when the data is not
> compressible), or the backend allocator might not be able to find a
> suitable slot for the compressed page. If these pages are needed
> later on, users will incur IOs from swapins.
>
> This issue prevents the adoption of zswap for potential users who
> cannot tolerate the latency associated with swapping. In many cases,
> these IOs are avoidable if we just keep in memory the pages that zswap
> fail to store.
>
> This patch series add two new features for zswap that will alleviate
> the risk of swapping:
>
> a) When a store attempt fail, keep the page untouched in memory
> instead of swapping it out.

What about writeback when the zswap limit is hit? I understand the
problem, but I am wondering if this is the correct way of fixing it.
We really need to make zswap work without a backing swapfile, which I
think is the correct way to fix all these problems. I was working on
that, but unfortunately I had to pivot to something else before I had
something that was working.

At Google, we have "ghost" swapfiles that exist solely to let us use
zswap without a real swapfile. They are sparse files, and we have
internal kernel patches to flag them and never actually write to them.

I am not sure how many bandaids we can afford before doing the right
thing. I understand it's a much larger surgery, perhaps there is a way
to get a short-term fix that is also a step towards the final state we
want to reach instead?

>
> b) If the store attempt fails at the compression step, allow the page
> to be stored in its uncompressed form in the zswap pool. This maintains
> the LRU ordering of pages, which will be helpful for accurate
> memory reclaim (zswap writeback in particular).

This is dangerous. Johannes and I discussed this before. This means
that reclaim can end up allocating more memory instead of freeing.
Allocations made in the reclaim path are made under the assumption
that we will eventually free memory. In this case, we won't. In the
worst case scenario, reclaim can leave the system/memcg in a worse
state than before it started.

Perhaps there is a way we can do this without allocating a zswap entry?

I thought before about having a special list_head that allows us to
use the lower bits of the pointers as markers, similar to the xarray.
The markers can be used to place different objects on the same list.
We can have a list that is a mixture of struct page and struct
zswap_entry. I never pursued this idea, and I am sure someone will
scream at me for suggesting it. Maybe there is a less convoluted way
to keep the LRU ordering intact without allocating memory on the
reclaim path.

>
> These features could be enabled independently via two new zswap module
> parameters.
>
> Nhat Pham (2):
>   swap: allows swap bypassing on zswap store failure
>   zswap: store uncompressed pages when compression algorithm fails
>
>  Documentation/admin-guide/mm/zswap.rst | 16 +++++++
>  include/linux/zswap.h                  |  9 ++++
>  mm/page_io.c                           |  6 +++
>  mm/shmem.c                             |  8 +++-
>  mm/zswap.c                             | 64 +++++++++++++++++++++++---
>  5 files changed, 95 insertions(+), 8 deletions(-)
>
> --
> 2.34.1
Johannes Weiner Oct. 17, 2023, 4:47 a.m. UTC | #2
On Mon, Oct 16, 2023 at 05:57:31PM -0700, Yosry Ahmed wrote:
> On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > Currently, when a zswap store attempt fails, the page is immediately
> > swapped out. This could happen for a variety of reasons. For instance,
> > the compression algorithm could fail (such as when the data is not
> > compressible), or the backend allocator might not be able to find a
> > suitable slot for the compressed page. If these pages are needed
> > later on, users will incur IOs from swapins.
> >
> > This issue prevents the adoption of zswap for potential users who
> > cannot tolerate the latency associated with swapping. In many cases,
> > these IOs are avoidable if we just keep in memory the pages that zswap
> > fail to store.
> >
> > This patch series add two new features for zswap that will alleviate
> > the risk of swapping:
> >
> > a) When a store attempt fail, keep the page untouched in memory
> > instead of swapping it out.
> 
> What about writeback when the zswap limit is hit? I understand the
> problem, but I am wondering if this is the correct way of fixing it.
> We really need to make zswap work without a backing swapfile, which I
> think is the correct way to fix all these problems. I was working on
> that, but unfortunately I had to pivot to something else before I had
> something that was working.
>
> At Google, we have "ghost" swapfiles that we use just to use zswap
> without a swapfile. They are sparse files, and we have internal kernel
> patches to flag them and never try to actually write to them.
> 
> I am not sure how many bandaids we can afford before doing the right
> thing. I understand it's a much larger surgery, perhaps there is a way
> to get a short-term fix that is also a step towards the final state we
> want to reach instead?

I agree it should also stop writeback due to the limit.

Note that a knob like this is still useful even when zswap space is
decoupled from disk swap slots. We still are using disk swap broadly
in the fleet as well, so a static ghost file setup wouldn't be a good
solution for us. Swapoff with common swapfile sizes is often not an
option during runtime, due to how slow it is, and the destabilizing
effect it can have on the system unless that's basically completely
idle. As such, we expect to continue deploying swap files for physical
use, and switch the zswap-is-terminal knob depending on the workload.

The other aspect to this is that workloads that do not want the
swapout/swapin overhead for themselves are usually co-located with
system management software, and/or can share the host with less
latency sensitive workloads, that should continue to use disk swap.

Due to the latter case, I wonder if a global knob is actually
enough. More likely we'd need per-cgroup control over this.

[ It's at this point where the historic coupling of zswap to disk
  space is especially unfortunate. Because of it, zswap usage counts
  toward the memory.swap allowance. If these were separate, we could
  have easily set memory.zswap.max=max, memory.swap.max=0 to achieve
  the desired effect.

  Alas, that ship has sailed. Even after a decoupling down the line it
  would be difficult to change established memory.swap semantics. ]

So I obviously agree that we still need to invest in decoupling zswap
space from physical disk slots. It's insanely wasteful, especially
with larger memory capacities. But while it would be a fantastic
optimization, I don't see how it would be an automatic solution to the
problem that inspired this proposal.

We still need some way to reasonably express desired workload policy
in a setup that supports multiple, simultaneous modes of operation.

> > b) If the store attempt fails at the compression step, allow the page
> > to be stored in its uncompressed form in the zswap pool. This maintains
> > the LRU ordering of pages, which will be helpful for accurate
> > memory reclaim (zswap writeback in particular).
> 
> This is dangerous. Johannes and I discussed this before. This means
> that reclaim can end up allocating more memory instead of freeing.
> Allocations made in the reclaim path are made under the assumption
> that we will eventually free memory. In this case, we won't. In the
> worst case scenario, reclaim can leave the system/memcg in a worse
> state than before it started.

Yeah, this is a concern. It's not such a big deal if it's only a few
pages, and we're still shrinking the footprint on aggregate. But it's
conceivable this can happen systematically with some datasets, in
which case reclaim will drive up the memory consumption and cause
OOMs, or potentially deplete the reserves with PF_MEMALLOC and cause
memory deadlocks.

This isn't something we can reasonably accept as worst-case behavior.

> Perhaps there is a way we can do this without allocating a zswap entry?
>
> I thought before about having a special list_head that allows us to
> use the lower bits of the pointers as markers, similar to the xarray.
> The markers can be used to place different objects on the same list.
> We can have a list that is a mixture of struct page and struct
> zswap_entry. I never pursued this idea, and I am sure someone will
> scream at me for suggesting it. Maybe there is a less convoluted way
> to keep the LRU ordering intact without allocating memory on the
> reclaim path.

That should work. Once zswap has exclusive control over the page, it
is free to muck with its lru linkage. A lower bit tag on the next or
prev pointer should suffice to distinguish between struct page and
struct zswap_entry when pulling stuff from the list.

We'd also have to teach vmscan.c to hand off the page. It currently
expects that it either frees the page back to the allocator, or puts
it back on the LRU. We'd need a compromise where it continues to tear
down the page and remove the mapping, but then leaves it to zswap.

Neither of those sound impossible. But since it's a bigger
complication than this proposal, it probably needs a new cost/benefit
analysis, with potentially more data on the problem of LRU inversions.

Thanks for your insightful feedback, Yosry.
Yosry Ahmed Oct. 17, 2023, 5:33 a.m. UTC | #3
On Mon, Oct 16, 2023 at 9:47 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Oct 16, 2023 at 05:57:31PM -0700, Yosry Ahmed wrote:
> > On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > Currently, when a zswap store attempt fails, the page is immediately
> > > swapped out. This could happen for a variety of reasons. For instance,
> > > the compression algorithm could fail (such as when the data is not
> > > compressible), or the backend allocator might not be able to find a
> > > suitable slot for the compressed page. If these pages are needed
> > > later on, users will incur IOs from swapins.
> > >
> > > This issue prevents the adoption of zswap for potential users who
> > > cannot tolerate the latency associated with swapping. In many cases,
> > > these IOs are avoidable if we just keep in memory the pages that zswap
> > > fail to store.
> > >
> > > This patch series add two new features for zswap that will alleviate
> > > the risk of swapping:
> > >
> > > a) When a store attempt fail, keep the page untouched in memory
> > > instead of swapping it out.
> >
> > What about writeback when the zswap limit is hit? I understand the
> > problem, but I am wondering if this is the correct way of fixing it.
> > We really need to make zswap work without a backing swapfile, which I
> > think is the correct way to fix all these problems. I was working on
> > that, but unfortunately I had to pivot to something else before I had
> > something that was working.
> >
> > At Google, we have "ghost" swapfiles that we use just to use zswap
> > without a swapfile. They are sparse files, and we have internal kernel
> > patches to flag them and never try to actually write to them.
> >
> > I am not sure how many bandaids we can afford before doing the right
> > thing. I understand it's a much larger surgery, perhaps there is a way
> > to get a short-term fix that is also a step towards the final state we
> > want to reach instead?
>
> I agree it should also stop writeback due to the limit.
>
> Note that a knob like this is still useful even when zswap space is
> decoupled from disk swap slots. We still are using disk swap broadly
> in the fleet as well, so a static ghost file setup wouldn't be a good
> solution for us. Swapoff with common swapfile sizes is often not an
> option during runtime, due to how slow it is, and the destabilizing
> effect it can have on the system unless that's basically completely
> idle. As such, we expect to continue deploying swap files for physical
> use, and switch the zswap-is-terminal knob depending on the workload.
>
> The other aspect to this is that workloads that do not want the
> swapout/swapin overhead for themselves are usually co-located with
> system management software, and/or can share the host with less
> latency sensitive workloads, that should continue to use disk swap.
>
> Due to the latter case, I wonder if a global knob is actually
> enough. More likely we'd need per-cgroup control over this.

In conjunction with ghost swapfiles, we have a knob to determine the
type of swapfile to use for each cgroup (normal, ghost, either, or
none). This achieves what you are describing, allowing different
workloads on the same machine to use zswap only or disk swap, although
in practice we only use zswap now.

I am not saying that's necessarily the correct way of doing it. Having
a zswap-is-terminal knob per-cgroup is another way to achieve this. I
will loop in folks maintaining this code internally to see what they
think.

>
> [ It's at this point where the historic coupling of zswap to disk
>   space is especially unfortunate. Because of it, zswap usage counts
>   toward the memory.swap allowance. If these were separate, we could
>   have easily set memory.zswap.max=max, memory.swap.max=0 to achieve
>   the desired effect.
>
>   Alas, that ship has sailed. Even after a decoupling down the line it
>   would be difficult to change established memory.swap semantics. ]

Fully agree here. This is unfortunate.

>
> So I obviously agree that we still need to invest in decoupling zswap
> space from physical disk slots. It's insanely wasteful, especially
> with larger memory capacities. But while it would be a fantastic
> optimization, I don't see how it would be an automatic solution to the
> problem that inspired this proposal.

Well, in my head, I imagine such a world where we have multiple
separate swapping backends with cgroup knob(s) that control what
backends are allowed for each cgroup. A zswap-is-terminal knob is
hacky-ish way of doing that where the backends are only zswap and disk
swap.

>
> We still need some way to reasonably express desired workload policy
> in a setup that supports multiple, simultaneous modes of operation.
>
> > > b) If the store attempt fails at the compression step, allow the page
> > > to be stored in its uncompressed form in the zswap pool. This maintains
> > > the LRU ordering of pages, which will be helpful for accurate
> > > memory reclaim (zswap writeback in particular).
> >
> > This is dangerous. Johannes and I discussed this before. This means
> > that reclaim can end up allocating more memory instead of freeing.
> > Allocations made in the reclaim path are made under the assumption
> > that we will eventually free memory. In this case, we won't. In the
> > worst case scenario, reclaim can leave the system/memcg in a worse
> > state than before it started.
>
> Yeah, this is a concern. It's not such a big deal if it's only a few
> pages, and we're still shrinking the footprint on aggregate. But it's
> conceivable this can happen systematically with some datasets, in
> which case reclaim will drive up the memory consumption and cause
> OOMs, or potentially deplete the reserves with PF_MEMALLOC and cause
> memory deadlocks.
>
> This isn't something we can reasonably accept as worst-case behavior.

Right.

>
> > Perhaps there is a way we can do this without allocating a zswap entry?
> >
> > I thought before about having a special list_head that allows us to
> > use the lower bits of the pointers as markers, similar to the xarray.
> > The markers can be used to place different objects on the same list.
> > We can have a list that is a mixture of struct page and struct
> > zswap_entry. I never pursued this idea, and I am sure someone will
> > scream at me for suggesting it. Maybe there is a less convoluted way
> > to keep the LRU ordering intact without allocating memory on the
> > reclaim path.
>
> That should work. Once zswap has exclusive control over the page, it
> is free to muck with its lru linkage. A lower bit tag on the next or
> prev pointer should suffice to distinguish between struct page and
> struct zswap_entry when pulling stuff from the list.

Right.

We handle incompressible memory internally in a different way: we put
those pages back on the unevictable list with an incompressible page
flag. This achieves a similar effect.

A missing point here is that those pages, if dirtied, may be
compressible again. When we have them on the LRUs, we rely on periodic
scanning (similar to the MGLRU-based periodic scanning we proposed
before) to check the dirty bit and make those pages evictable again.

If we leave them on the zswap LRU, we will incur a fault instead to
pull them back to the LRUs. For anon pages, that's probably fine, as
in both cases by the time we reach zswap the page has been unmapped,
and accessing it again incurs a fault anyway (whether it's in zswap
LRUs or in the reclaim LRUs). For shmem though, we put the
incompressible pages back in the page cache, preventing a page fault
on the next access. This is a drawback of the zswap LRU approach
AFAICT. Not sure how much it matters in practice.

>
> We'd also have to teach vmscan.c to hand off the page. It currently
> expects that it either frees the page back to the allocator, or puts
> it back on the LRU. We'd need a compromise where it continues to tear
> down the page and remove the mapping, but then leaves it to zswap.

Right.

>
> Neither of those sound impossible. But since it's a bigger
> complication than this proposal, it probably needs a new cost/benefit
> analysis, with potentially more data on the problem of LRU inversions.

Makes sense.

>
> Thanks for your insightful feedback, Yosry.
Johannes Weiner Oct. 17, 2023, 2:51 p.m. UTC | #4
On Mon, Oct 16, 2023 at 10:33:23PM -0700, Yosry Ahmed wrote:
> On Mon, Oct 16, 2023 at 9:47 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Mon, Oct 16, 2023 at 05:57:31PM -0700, Yosry Ahmed wrote:
> > > On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > So I obviously agree that we still need to invest in decoupling zswap
> > space from physical disk slots. It's insanely wasteful, especially
> > with larger memory capacities. But while it would be a fantastic
> > optimization, I don't see how it would be an automatic solution to the
> > problem that inspired this proposal.
> 
> Well, in my head, I imagine such a world where we have multiple
> separate swapping backends with cgroup knob(s) that control what
> backends are allowed for each cgroup. A zswap-is-terminal knob is
> hacky-ish way of doing that where the backends are only zswap and disk
> swap.

"I want compression" vs "I want disk offloading" is a more reasonable
question to ask at the cgroup level. We've had historically a variety
of swap configurations across the fleet. E.g. it's a lot easier to add
another swapfile than it is to grow an existing one at runtime. In
other cases, one storage config might have one swapfile, another
machine model might want to spread it out over multiple disks etc.

This doesn't matter much with ghost files. But with conventional
swapfiles this requires an unnecessary awareness of the backend
topology in order to express container policy. That's no bueno.

> > > Perhaps there is a way we can do this without allocating a zswap entry?
> > >
> > > I thought before about having a special list_head that allows us to
> > > use the lower bits of the pointers as markers, similar to the xarray.
> > > The markers can be used to place different objects on the same list.
> > > We can have a list that is a mixture of struct page and struct
> > > zswap_entry. I never pursued this idea, and I am sure someone will
> > > scream at me for suggesting it. Maybe there is a less convoluted way
> > > to keep the LRU ordering intact without allocating memory on the
> > > reclaim path.
> >
> > That should work. Once zswap has exclusive control over the page, it
> > is free to muck with its lru linkage. A lower bit tag on the next or
> > prev pointer should suffice to distinguish between struct page and
> > struct zswap_entry when pulling stuff from the list.
> 
> Right.
> 
> We handle incompressible memory internally in a different way, we put
> them back on the unevictable list with an incompressible page flag.
> This achieves a similar effect.

It doesn't. We want those incompressible pages to continue aging
alongside their compressible peers, and eventually get written back to
disk with them.
Yosry Ahmed Oct. 17, 2023, 3:51 p.m. UTC | #5
On Tue, Oct 17, 2023 at 7:51 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Oct 16, 2023 at 10:33:23PM -0700, Yosry Ahmed wrote:
> > On Mon, Oct 16, 2023 at 9:47 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > On Mon, Oct 16, 2023 at 05:57:31PM -0700, Yosry Ahmed wrote:
> > > > On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > > So I obviously agree that we still need to invest in decoupling zswap
> > > space from physical disk slots. It's insanely wasteful, especially
> > > with larger memory capacities. But while it would be a fantastic
> > > optimization, I don't see how it would be an automatic solution to the
> > > problem that inspired this proposal.
> >
> > Well, in my head, I imagine such a world where we have multiple
> > separate swapping backends with cgroup knob(s) that control what
> > backends are allowed for each cgroup. A zswap-is-terminal knob is
> > hacky-ish way of doing that where the backends are only zswap and disk
> > swap.
>
> "I want compression" vs "I want disk offloading" is a more reasonable
> question to ask at the cgroup level. We've had historically a variety
> of swap configurations across the fleet. E.g. it's a lot easier to add
> another swapfile than it is to grow an existing one at runtime. In
> other cases, one storage config might have one swapfile, another
> machine model might want to spread it out over multiple disks etc.
>
> This doesn't matter much with ghost files. But with conventional
> swapfiles this requires an unnecessary awareness of the backend
> topology in order to express container policy. That's no bueno.

Oh I didn't mean that cgroups would designate specific swapfiles, but
rather swapping backends, which would be "zswap" or "disk" or both in
this case. I just imagined an interface that is more generic and
extensible rather than a specific zswap-is-terminal knob.

>
> > > > Perhaps there is a way we can do this without allocating a zswap entry?
> > > >
> > > > I thought before about having a special list_head that allows us to
> > > > use the lower bits of the pointers as markers, similar to the xarray.
> > > > The markers can be used to place different objects on the same list.
> > > > We can have a list that is a mixture of struct page and struct
> > > > zswap_entry. I never pursued this idea, and I am sure someone will
> > > > scream at me for suggesting it. Maybe there is a less convoluted way
> > > > to keep the LRU ordering intact without allocating memory on the
> > > > reclaim path.
> > >
> > > That should work. Once zswap has exclusive control over the page, it
> > > is free to muck with its lru linkage. A lower bit tag on the next or
> > > prev pointer should suffice to distinguish between struct page and
> > > struct zswap_entry when pulling stuff from the list.
> >
> > Right.
> >
> > We handle incompressible memory internally in a different way, we put
> > them back on the unevictable list with an incompressible page flag.
> > This achieves a similar effect.
>
> It doesn't. We want those incompressible pages to continue aging
> alongside their compressible peers, and eventually get written back to
> disk with them.

Sorry I wasn't clear, I was talking about the case where zswap is
terminal. When zswap is not, in our approach we just skip zswap for
incompressible pages and write them directly to disk. Aging them on
the LRU is probably the better approach here.

For the case where zswap is terminal, making them unevictable incurs
less page faults, at least for shmem.
Nhat Pham Oct. 17, 2023, 7:03 p.m. UTC | #6
On Mon, Oct 16, 2023 at 5:58 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > Currently, when a zswap store attempt fails, the page is immediately
> > swapped out. This could happen for a variety of reasons. For instance,
> > the compression algorithm could fail (such as when the data is not
> > compressible), or the backend allocator might not be able to find a
> > suitable slot for the compressed page. If these pages are needed
> > later on, users will incur IOs from swapins.
> >
> > This issue prevents the adoption of zswap for potential users who
> > cannot tolerate the latency associated with swapping. In many cases,
> > these IOs are avoidable if we just keep in memory the pages that zswap
> > fail to store.
> >
> > This patch series add two new features for zswap that will alleviate
> > the risk of swapping:
> >
> > a) When a store attempt fail, keep the page untouched in memory
> > instead of swapping it out.
>
> What about writeback when the zswap limit is hit? I understand the
> problem, but I am wondering if this is the correct way of fixing it.
> We really need to make zswap work without a backing swapfile, which I
> think is the correct way to fix all these problems. I was working on
> that, but unfortunately I had to pivot to something else before I had
> something that was working.
>
> At Google, we have "ghost" swapfiles that we use just to use zswap
> without a swapfile. They are sparse files, and we have internal kernel
> patches to flag them and never try to actually write to them.
>
> I am not sure how many bandaids we can afford before doing the right
> thing. I understand it's a much larger surgery, perhaps there is a way
> to get a short-term fix that is also a step towards the final state we
> want to reach instead?

Regarding the writeback - I'll make sure to also short-circuit writeback
when the bypass_swap option is enabled in v2 :) I'll probably send out
v2 after

I absolutely agree that we must decouple zswap and swap (and would
be happy to help out in any capacity I could - we have heard similar
concerns/complaints about swap wastage from internal parties as well).

However, as Johannes has pointed out, this feature still has its place,
given our already existing swapfile deployments. I do agree that a
global knob is insufficient tho. I'll add a per-cgroup knob in v2 so that
we can enable/disable this feature on a per-workload basis.

>
> >
> > b) If the store attempt fails at the compression step, allow the page
> > to be stored in its uncompressed form in the zswap pool. This maintains
> > the LRU ordering of pages, which will be helpful for accurate
> > memory reclaim (zswap writeback in particular).
>
> This is dangerous. Johannes and I discussed this before. This means
> that reclaim can end up allocating more memory instead of freeing.
> Allocations made in the reclaim path are made under the assumption
> that we will eventually free memory. In this case, we won't. In the
> worst case scenario, reclaim can leave the system/memcg in a worse
> state than before it started.
>
> Perhaps there is a way we can do this without allocating a zswap entry?
>
> I thought before about having a special list_head that allows us to
> use the lower bits of the pointers as markers, similar to the xarray.
> The markers can be used to place different objects on the same list.
> We can have a list that is a mixture of struct page and struct
> zswap_entry. I never pursued this idea, and I am sure someone will
> scream at me for suggesting it. Maybe there is a less convoluted way
> to keep the LRU ordering intact without allocating memory on the
> reclaim path.

Hmm yeah, you're right about these concerns. That seems a lot more
involved than what I envisioned initially.

Let's put this aside for now. I'll just send the first patch in v2, and we
can discuss and work on storing uncompressed pages later on.

>
> >
> > These features could be enabled independently via two new zswap module
> > parameters.
> >
> > Nhat Pham (2):
> >   swap: allows swap bypassing on zswap store failure
> >   zswap: store uncompressed pages when compression algorithm fails
> >
> >  Documentation/admin-guide/mm/zswap.rst | 16 +++++++
> >  include/linux/zswap.h                  |  9 ++++
> >  mm/page_io.c                           |  6 +++
> >  mm/shmem.c                             |  8 +++-
> >  mm/zswap.c                             | 64 +++++++++++++++++++++++---
> >  5 files changed, 95 insertions(+), 8 deletions(-)
> >
> > --
> > 2.34.1
Nhat Pham Oct. 17, 2023, 7:04 p.m. UTC | #7
On Tue, Oct 17, 2023 at 12:03 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Oct 16, 2023 at 5:58 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > Currently, when a zswap store attempt fails, the page is immediately
> > > swapped out. This could happen for a variety of reasons. For instance,
> > > the compression algorithm could fail (such as when the data is not
> > > compressible), or the backend allocator might not be able to find a
> > > suitable slot for the compressed page. If these pages are needed
> > > later on, users will incur IOs from swapins.
> > >
> > > This issue prevents the adoption of zswap for potential users who
> > > cannot tolerate the latency associated with swapping. In many cases,
> > > these IOs are avoidable if we just keep in memory the pages that zswap
> > > fail to store.
> > >
> > > This patch series add two new features for zswap that will alleviate
> > > the risk of swapping:
> > >
> > > a) When a store attempt fails, keep the page untouched in memory
> > > instead of swapping it out.
> >
> > What about writeback when the zswap limit is hit? I understand the
> > problem, but I am wondering if this is the correct way of fixing it.
> > We really need to make zswap work without a backing swapfile, which I
> > think is the correct way to fix all these problems. I was working on
> > that, but unfortunately I had to pivot to something else before I had
> > something that was working.
> >
> > At Google, we have "ghost" swapfiles that we use just to use zswap
> > without a swapfile. They are sparse files, and we have internal kernel
> > patches to flag them and never try to actually write to them.
> >
> > I am not sure how many bandaids we can afford before doing the right
> > thing. I understand it's a much larger surgery, perhaps there is a way
> > to get a short-term fix that is also a step towards the final state we
> > want to reach instead?
>
> Regarding the writeback - I'll make sure to also short-circuit writeback
> when the bypass_swap option is enabled in v2 :) I'll probably send out
> v2 after

... I gather all this feedback.

> I absolutely agree that we must decouple zswap and swap (and would
> be happy to help out in any capacity I could - we have heard similar
> concerns/complaints about swap wastage from internal parties as well).
>
> However, as Johannes has pointed out, this feature still has its place,
> given our already existing swapfile deployments. I do agree that a
> global knob is insufficient tho. I'll add a per-cgroup knob in v2 so that
> we can enable/disable this feature on a per-workload basis.
>
> >
> > >
> > > b) If the store attempt fails at the compression step, allow the page
> > > to be stored in its uncompressed form in the zswap pool. This maintains
> > > the LRU ordering of pages, which will be helpful for accurate
> > > memory reclaim (zswap writeback in particular).
> >
> > This is dangerous. Johannes and I discussed this before. This means
> > that reclaim can end up allocating more memory instead of freeing.
> > Allocations made in the reclaim path are made under the assumption
> > that we will eventually free memory. In this case, we won't. In the
> > worst case scenario, reclaim can leave the system/memcg in a worse
> > state than before it started.
> >
> > Perhaps there is a way we can do this without allocating a zswap entry?
> >
> > I thought before about having a special list_head that allows us to
> > use the lower bits of the pointers as markers, similar to the xarray.
> > The markers can be used to place different objects on the same list.
> > We can have a list that is a mixture of struct page and struct
> > zswap_entry. I never pursued this idea, and I am sure someone will
> > scream at me for suggesting it. Maybe there is a less convoluted way
> > to keep the LRU ordering intact without allocating memory on the
> > reclaim path.
>
> Hmm yeah you're right about these concerns. That seems a lot more
> involved than what I envisioned initially.
>
> Let's put this aside for now. I'll just send the first patch in v2, and we can
> work on and discuss uncompressed page storing later on.
>
> >
> > >
> > > These features could be enabled independently via two new zswap module
> > > parameters.
> > >
> > > Nhat Pham (2):
> > >   swap: allows swap bypassing on zswap store failure
> > >   zswap: store uncompressed pages when compression algorithm fails
> > >
> > >  Documentation/admin-guide/mm/zswap.rst | 16 +++++++
> > >  include/linux/zswap.h                  |  9 ++++
> > >  mm/page_io.c                           |  6 +++
> > >  mm/shmem.c                             |  8 +++-
> > >  mm/zswap.c                             | 64 +++++++++++++++++++++++---
> > >  5 files changed, 95 insertions(+), 8 deletions(-)
> > >
> > > --
> > > 2.34.1
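To illustrate the global-plus-per-cgroup control discussed above, a rough sketch of how such knobs might be wired up. Every knob name below is hypothetical (only the `/sys/module/zswap/parameters/` directory itself is an existing interface), and the writes are guarded so the snippet is a no-op on systems without the knobs:

```shell
# All knob names are hypothetical illustrations of the proposal above,
# not a merged kernel interface.

# Global module parameter: keep pages in memory on zswap store failure.
knob=/sys/module/zswap/parameters/bypass_swap
if [ -w "$knob" ]; then
    echo Y > "$knob"
fi

# Per-cgroup override, so a latency-sensitive workload can opt out of
# disk swap while co-located, less sensitive workloads keep using it.
cgknob=/sys/fs/cgroup/workload/memory.zswap.bypass_swap
if [ -w "$cgknob" ]; then
    echo 1 > "$cgknob"
fi
```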
Nhat Pham Oct. 17, 2023, 7:24 p.m. UTC | #8
On Mon, Oct 16, 2023 at 9:47 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Oct 16, 2023 at 05:57:31PM -0700, Yosry Ahmed wrote:
> > On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > Currently, when a zswap store attempt fails, the page is immediately
> > > swapped out. This could happen for a variety of reasons. For instance,
> > > the compression algorithm could fail (such as when the data is not
> > > compressible), or the backend allocator might not be able to find a
> > > suitable slot for the compressed page. If these pages are needed
> > > later on, users will incur IOs from swapins.
> > >
> > > This issue prevents the adoption of zswap for potential users who
> > > cannot tolerate the latency associated with swapping. In many cases,
> > > these IOs are avoidable if we just keep in memory the pages that zswap
> > > fails to store.
> > >
> > > This patch series adds two new features for zswap that will alleviate
> > > the risk of swapping:
> > >
> > > a) When a store attempt fails, keep the page untouched in memory
> > > instead of swapping it out.
> >
> > What about writeback when the zswap limit is hit? I understand the
> > problem, but I am wondering if this is the correct way of fixing it.
> > We really need to make zswap work without a backing swapfile, which I
> > think is the correct way to fix all these problems. I was working on
> > that, but unfortunately I had to pivot to something else before I had
> > something that was working.
> >
> > At Google, we have "ghost" swapfiles that we use just to use zswap
> > without a swapfile. They are sparse files, and we have internal kernel
> > patches to flag them and never try to actually write to them.
> >
> > I am not sure how many bandaids we can afford before doing the right
> > thing. I understand it's a much larger surgery, perhaps there is a way
> > to get a short-term fix that is also a step towards the final state we
> > want to reach instead?
>
> I agree it should also stop writeback due to the limit.
>
> Note that a knob like this is still useful even when zswap space is
> decoupled from disk swap slots. We still are using disk swap broadly
> in the fleet as well, so a static ghost file setup wouldn't be a good
> solution for us. Swapoff with common swapfile sizes is often not an
> option during runtime, due to how slow it is, and the destabilizing
> effect it can have on the system unless that's basically completely
> idle. As such, we expect to continue deploying swap files for physical
> use, and switch the zswap-is-terminal knob depending on the workload.
>
> The other aspect to this is that workloads that do not want the
> swapout/swapin overhead for themselves are usually co-located with
> system management software, and/or can share the host with less
> latency sensitive workloads, that should continue to use disk swap.
>
> Due to the latter case, I wonder if a global knob is actually
> enough. More likely we'd need per-cgroup control over this.
>
> [ It's at this point where the historic coupling of zswap to disk
>   space is especially unfortunate. Because of it, zswap usage counts
>   toward the memory.swap allowance. If these were separate, we could
>   have easily set memory.zswap.max=max, memory.swap.max=0 to achieve
>   the desired effect.
>
>   Alas, that ship has sailed. Even after a decoupling down the line it
>   would be difficult to change established memory.swap semantics. ]
>
> So I obviously agree that we still need to invest in decoupling zswap
> space from physical disk slots. It's insanely wasteful, especially
> with larger memory capacities. But while it would be a fantastic
> optimization, I don't see how it would be an automatic solution to the
> problem that inspired this proposal.
>
> We still need some way to reasonably express desired workload policy
> in a setup that supports multiple, simultaneous modes of operation.
>
> > > b) If the store attempt fails at the compression step, allow the page
> > > to be stored in its uncompressed form in the zswap pool. This maintains
> > > the LRU ordering of pages, which will be helpful for accurate
> > > memory reclaim (zswap writeback in particular).
> >
> > This is dangerous. Johannes and I discussed this before. This means
> > that reclaim can end up allocating more memory instead of freeing.
> > Allocations made in the reclaim path are made under the assumption
> > that we will eventually free memory. In this case, we won't. In the
> > worst case scenario, reclaim can leave the system/memcg in a worse
> > state than before it started.
>
> Yeah, this is a concern. It's not such a big deal if it's only a few
> pages, and we're still shrinking the footprint on aggregate. But it's
> conceivable this can happen systematically with some datasets, in
> which case reclaim will drive up the memory consumption and cause
> OOMs, or potentially deplete the reserves with PF_MEMALLOC and cause
> memory deadlocks.
>
> This isn't something we can reasonably accept as worst-case behavior.
>
> > Perhaps there is a way we can do this without allocating a zswap entry?
> >
> > I thought before about having a special list_head that allows us to
> > use the lower bits of the pointers as markers, similar to the xarray.
> > The markers can be used to place different objects on the same list.
> > We can have a list that is a mixture of struct page and struct
> > zswap_entry. I never pursued this idea, and I am sure someone will
> > scream at me for suggesting it. Maybe there is a less convoluted way
> > to keep the LRU ordering intact without allocating memory on the
> > reclaim path.
>
> That should work. Once zswap has exclusive control over the page, it
> is free to muck with its lru linkage. A lower bit tag on the next or
> prev pointer should suffice to distinguish between struct page and
> struct zswap_entry when pulling stuff from the list.

Hmm I'm a bit iffy about pointer bit hacking, but I guess that's
the least involved way to store the uncompressed page in the zswap
LRU without allocating a zswap_entry for it.

Lemme give this a try. If it looks decently clean I'll send it out :)

>
> We'd also have to teach vmscan.c to hand off the page. It currently
> expects that it either frees the page back to the allocator, or puts
> it back on the LRU. We'd need a compromise where it continues to tear
> down the page and remove the mapping, but then leaves it to zswap.
>
> Neither of those sound impossible. But since it's a bigger
> complication than this proposal, it probably needs a new cost/benefit
> analysis, with potentially more data on the problem of LRU inversions.
>
> Thanks for your insightful feedback, Yosry.
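For concreteness, the cgroup setup Johannes refers to with memory.zswap.max=max, memory.swap.max=0 would look like the sketch below. Both files are real cgroup v2 interfaces, but note his caveat: with today's semantics memory.swap.max=0 disables zswap as well, so this combination only yields "zswap without disk swap" after a zswap/swap decoupling. The cgroup name is hypothetical and the writes are guarded so this is a no-op where the cgroup is absent:

```shell
# Illustrative only: the desired-but-currently-impossible configuration
# from the discussion above, for a cgroup named 'workload'.
cg=/sys/fs/cgroup/workload
if [ -d "$cg" ]; then
    echo max > "$cg/memory.zswap.max"   # unlimited compressed pool
    echo 0 > "$cg/memory.swap.max"      # no disk swap (today: kills zswap too)
fi
```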