[RFC,0/2] zswap: fix placement inversion in memory tiering systems

Message ID 20250329110230.2459730-1-nphamcs@gmail.com

Message

Nhat Pham March 29, 2025, 11:02 a.m. UTC
Currently, systems with CXL-based memory tiering can encounter the
following inversion with zswap: the coldest pages demoted to the CXL
tier can return to the high tier when they are zswapped out,
creating memory pressure on the high tier.

This happens because zsmalloc, zswap's backend memory allocator, does
not enforce any memory policy. If the task reclaiming memory follows
the local-first policy, for example, the memory requested for zswap can
be served by the upper tier, leading to the aforementioned inversion.

This RFC fixes this inversion by adding a new memory allocation mode
for zswap (exposed through a zswap sysfs knob), intended for
hosts with CXL, where the memory for the compressed object is requested
preferentially from the same node that the original page resides on.
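
For illustration, the allocation path would look roughly like the
sketch below (simplified; zs_malloc_node() is a placeholder name
standing in for the plumbing that patch 1 adds through zpool/zsmalloc):

    static unsigned long zswap_alloc(struct zs_pool *pool,
                                     struct page *page, size_t dlen)
    {
        /* node the (cold) source page currently resides on */
        int nid = page_to_nid(page);

        /* prefer nid for the compressed copy; fall back if needed */
        return zs_malloc_node(pool, dlen,
                              __GFP_NORETRY | __GFP_NOWARN |
                              __GFP_KSWAPD_RECLAIM, nid);
    }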

With the new zswap allocation mode enabled, we should observe the
following dynamics:

1. When demotion is turned on, under reasonable conditions, zswap will
   prefer CXL memory by default, since top-tier memory being reclaimed
   will typically be demoted instead of swapped.

2. This should prevent reclaim on the lower tier from causing high-tier
   memory pressure due to new allocations.

3. This should avoid the quiet promotion of cold memory (memory being
   zswapped is cold, but it is effectively promoted when put into the
   zswap pool if the memory allocated for the compressed copy comes
   from the high tier).

4. However, this may increase pressure on the CXL tier itself, which
   may in turn result in further demotion (to swap, etc.). This needs
   to be tested.

I'm still testing and collecting more data, but figured I should send
this out as an RFC to spark the discussion:

1. Is this the right policy? Do we need a more complicated policy?
   Should we instead go for the "lowest" node (which would require a
   new memory tiering API)? Or maybe try each node from the current
   node down to the lowest node in the hierarchy?

   Also, I hacked this fix together with CXL in mind, but if there are
   other cases that I should also address, we can explore a more
   general memory allocation strategy or interface.

2. Similarly, is this the right zsmalloc API? For instance, we could
   build a full-fledged mempolicy-based API for zsmalloc, but I haven't
   found a use case for it yet.

3. Assuming this is the right policy, what should the semantics (and
   the naming) be? I'm not very good at naming things, so
   same_node_mode might not be it :)

Nhat Pham (2):
  zsmalloc: let callers select NUMA node to store the compressed objects
  zswap: add sysfs knob for same node mode

 Documentation/admin-guide/mm/zswap.rst |  9 +++++++++
 include/linux/zpool.h                  |  4 ++--
 mm/zpool.c                             |  8 +++++---
 mm/zsmalloc.c                          | 28 +++++++++++++++++++-------
 mm/zswap.c                             | 10 +++++++--
 5 files changed, 45 insertions(+), 14 deletions(-)


base-commit: 4135040c342ba080328891f1b7e523c8f2f04c58

Comments

Yosry Ahmed March 29, 2025, 7:53 p.m. UTC | #1
March 29, 2025 at 1:02 PM, "Nhat Pham" <nphamcs@gmail.com> wrote:

> Currently, systems with CXL-based memory tiering can encounter the
> following inversion with zswap: the coldest pages demoted to the CXL
> tier can return to the high tier when they are zswapped out,
> creating memory pressure on the high tier.
> This happens because zsmalloc, zswap's backend memory allocator, does
> not enforce any memory policy. If the task reclaiming memory follows
> the local-first policy, for example, the memory requested for zswap can
> be served by the upper tier, leading to the aforementioned inversion.
> This RFC fixes this inversion by adding a new memory allocation mode
> for zswap (exposed through a zswap sysfs knob), intended for
> hosts with CXL, where the memory for the compressed object is requested
> preferentially from the same node that the original page resides on.

I didn't look too closely, but why not just prefer the same node by default? Why is a knob needed?

Or maybe if there's a way to tell the "tier" of the node we can prefer to allocate from the same "tier"?
Nhat Pham March 29, 2025, 10:13 p.m. UTC | #2
On Sat, Mar 29, 2025 at 12:53 PM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
>
> March 29, 2025 at 1:02 PM, "Nhat Pham" <nphamcs@gmail.com> wrote:
>
> > Currently, systems with CXL-based memory tiering can encounter the
> > following inversion with zswap: the coldest pages demoted to the CXL
> > tier can return to the high tier when they are zswapped out,
> > creating memory pressure on the high tier.
> > This happens because zsmalloc, zswap's backend memory allocator, does
> > not enforce any memory policy. If the task reclaiming memory follows
> > the local-first policy, for example, the memory requested for zswap can
> > be served by the upper tier, leading to the aforementioned inversion.
> > This RFC fixes this inversion by adding a new memory allocation mode
> > for zswap (exposed through a zswap sysfs knob), intended for
> > hosts with CXL, where the memory for the compressed object is requested
> > preferentially from the same node that the original page resides on.
>
> I didn't look too closely, but why not just prefer the same node by default? Why is a knob needed?

Good question, yeah the knob is to maintain the old behavior :) It
might not be optimal, or even advisable, for all setups.

For hosts with node-based memory tiering, yeah it's a good idea in
general, but I don't quite know how to get that information from the
kernel's perspective.

>
> Or maybe if there's a way to tell the "tier" of the node we can prefer to allocate from the same "tier"?

Is there an abstraction of the "tier" that we can use here?
Nhat Pham March 29, 2025, 10:17 p.m. UTC | #3
On Sat, Mar 29, 2025 at 3:13 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Good question, yeah the knob is to maintain the old behavior :) It
> might not be optimal, or even advisable, for all setups.
>
> For hosts with node-based memory tiering, yeah it's a good idea in
> general, but I don't quite know how to get that information from the
> kernel's perspective.
>
> >
> > Or maybe if there's a way to tell the "tier" of the node we can prefer to allocate from the same "tier"?
>
> Is there an abstraction of the "tier" that we can use here?

Maaaybe "struct memory_tier" (memory-tiers.c)? Lemme take a look at that...
Johannes Weiner March 31, 2025, 4:53 p.m. UTC | #4
On Sat, Mar 29, 2025 at 07:53:23PM +0000, Yosry Ahmed wrote:
> March 29, 2025 at 1:02 PM, "Nhat Pham" <nphamcs@gmail.com> wrote:
> 
> > Currently, systems with CXL-based memory tiering can encounter the
> > following inversion with zswap: the coldest pages demoted to the CXL
> > tier can return to the high tier when they are zswapped out,
> > creating memory pressure on the high tier.
> > This happens because zsmalloc, zswap's backend memory allocator, does
> > not enforce any memory policy. If the task reclaiming memory follows
> > the local-first policy, for example, the memory requested for zswap can
> > be served by the upper tier, leading to the aforementioned inversion.
> > This RFC fixes this inversion by adding a new memory allocation mode
> > for zswap (exposed through a zswap sysfs knob), intended for
> > hosts with CXL, where the memory for the compressed object is requested
> > preferentially from the same node that the original page resides on.
> 
> I didn't look too closely, but why not just prefer the same node by
> default? Why is a knob needed?

+1 It should really be the default.

Even on regular NUMA setups this behavior makes more sense. Consider a
direct reclaimer scanning nodes in order of allocation preference. If
it ventures into remote nodes, the memory it compresses there should
stay there. Trying to shift those contents over to the reclaiming
thread's preferred node further *increases* its local pressure,
provoking more spills. The remote node is also the most likely to
refault this data again. This is just bad for everybody.

> Or maybe if there's a way to tell the "tier" of the node we can
> prefer to allocate from the same "tier"?

Presumably, other nodes in the same tier would come first in the
fallback zonelist of that node, so page_to_nid() should just work.

I wouldn't complicate this until somebody has real systems where it
does the wrong thing.

My vote is to stick with page_to_nid(), but do it unconditionally.
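
Concretely, the difference is just in the gfp flags (a minimal sketch,
flags illustrative):

    /* prefer nid, but fall back along its zonelist (same tier first) */
    page = alloc_pages_node(nid, GFP_NOWAIT | __GFP_NOWARN, 0);

    /* hard-require nid; fail rather than spill to other nodes */
    page = alloc_pages_node(nid, GFP_NOWAIT | __GFP_NOWARN |
                            __GFP_THISNODE, 0);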
Gregory Price March 31, 2025, 5:06 p.m. UTC | #5
On Sat, Mar 29, 2025 at 07:53:23PM +0000, Yosry Ahmed wrote:
> March 29, 2025 at 1:02 PM, "Nhat Pham" <nphamcs@gmail.com> wrote:
> 
> > Currently, systems with CXL-based memory tiering can encounter the
> > following inversion with zswap: the coldest pages demoted to the CXL
> > tier can return to the high tier when they are zswapped out,
> > creating memory pressure on the high tier.
> > This happens because zsmalloc, zswap's backend memory allocator, does
> > not enforce any memory policy. If the task reclaiming memory follows
> > the local-first policy, for example, the memory requested for zswap can
> > be served by the upper tier, leading to the aforementioned inversion.
> > This RFC fixes this inversion by adding a new memory allocation mode
> > for zswap (exposed through a zswap sysfs knob), intended for
> > hosts with CXL, where the memory for the compressed object is requested
> > preferentially from the same node that the original page resides on.
> 
> I didn't look too closely, but why not just prefer the same node by default? Why is a knob needed?
> 

Bit of an open question: does this hurt zswap performance?

And of course the obvious follow-up: does that matter?

Probably the answers are "not really" and "no", but it's nice to have
the knob for testing. I imagine we'd drop it along with the RFC tag.

> Or maybe if there's a way to tell the "tier" of the node we can prefer to allocate from the same "tier"?

In almost every sane system, tier == node, though nodes across
sockets can end up lumped into the same tier - which maybe doesn't
matter for zswap but isn't useful for almost anything else.

But maybe there's an argument for adding new tier-policies.

:think: 

enum memtier_policy {
    MEMTIER_SAME_TIER,     // get a different node from same tier
    MEMTIER_DEMOTE_ONE,    // demote one step
    MEMTIER_DEMOTE_FAR,    // demote one step away from swap
    MEMTIER_PROMOTE_ONE,   // promote one step
    MEMTIER_PROMOTE_LOCAL, // promote to local on topology
};

int memtier_get_node(enum memtier_policy policy, int nid);

Might be worth investigating.  Just spitballing here.
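
Usage would be something like (hypothetical API, to match the sketch
above):

    /* e.g. zswap picking a destination node for the compressed copy */
    int dst = memtier_get_node(MEMTIER_SAME_TIER, page_to_nid(page));

    if (dst == NUMA_NO_NODE)
        dst = page_to_nid(page); // no suitable peer, stay on source node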

The issue is really fallback allocations.  In most cases, we know what
we'd like to do, but when the system is under pressure the question is
what behavior we want from these components.  I'd hesitate to make a
strong claim about whether zswap should or should not fall back to a
higher-tier node under system pressure without strong data.

~Gregory
Nhat Pham March 31, 2025, 5:32 p.m. UTC | #6
On Mon, Mar 31, 2025 at 9:53 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Sat, Mar 29, 2025 at 07:53:23PM +0000, Yosry Ahmed wrote:
> > March 29, 2025 at 1:02 PM, "Nhat Pham" <nphamcs@gmail.com> wrote:
> >
> > > Currently, systems with CXL-based memory tiering can encounter the
> > > following inversion with zswap: the coldest pages demoted to the CXL
> > > tier can return to the high tier when they are zswapped out,
> > > creating memory pressure on the high tier.
> > > This happens because zsmalloc, zswap's backend memory allocator, does
> > > not enforce any memory policy. If the task reclaiming memory follows
> > > the local-first policy, for example, the memory requested for zswap can
> > > be served by the upper tier, leading to the aforementioned inversion.
> > > This RFC fixes this inversion by adding a new memory allocation mode
> > > for zswap (exposed through a zswap sysfs knob), intended for
> > > hosts with CXL, where the memory for the compressed object is requested
> > > preferentially from the same node that the original page resides on.
> >
> > I didn't look too closely, but why not just prefer the same node by
> > default? Why is a knob needed?
>
> +1 It should really be the default.
>
> Even on regular NUMA setups this behavior makes more sense. Consider a
> direct reclaimer scanning nodes in order of allocation preference. If
> it ventures into remote nodes, the memory it compresses there should
> stay there. Trying to shift those contents over to the reclaiming
> thread's preferred node further *increases* its local pressure,
> provoking more spills. The remote node is also the most likely to
> refault this data again. This is just bad for everybody.

Makes a lot of sense. I'll include this in v2 of the patch series,
and rephrase it as a generic NUMA fix (with CXL as one of the
examples/motivations).

Thanks for the comment, Johannes! I'll remove this knob altogether and
make this the default behavior.

>
> > Or maybe if there's a way to tell the "tier" of the node we can
> > prefer to allocate from the same "tier"?
>
> Presumably, other nodes in the same tier would come first in the
> fallback zonelist of that node, so page_to_nid() should just work.
>
> I wouldn't complicate this until somebody has real systems where it
> does the wrong thing.
>
> My vote is to stick with page_to_nid(), but do it unconditionally.

SGTM.
