[4/5] mm: zswap: add basic meminfo and vmstat coverage

Message ID 20220427160016.144237-5-hannes@cmpxchg.org (mailing list archive)
State New
Series zswap: cgroup accounting & control

Commit Message

Johannes Weiner April 27, 2022, 4 p.m. UTC
Currently it requires poking at debugfs to figure out the size and
population of the zswap cache on a host. There are no counters for
reads and writes against the cache. As a result, it's difficult to
understand zswap behavior on production systems.

Print zswap memory consumption and how many pages are zswapped out in
/proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/proc/meminfo.c             |  7 +++++++
 include/linux/swap.h          |  5 +++++
 include/linux/vm_event_item.h |  4 ++++
 mm/vmstat.c                   |  4 ++++
 mm/zswap.c                    | 13 ++++++-------
 5 files changed, 26 insertions(+), 7 deletions(-)

Comments

Andrew Morton April 27, 2022, 6:36 p.m. UTC | #1
On Wed, 27 Apr 2022 12:00:15 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> Currently it requires poking at debugfs to figure out the size and
> population of the zswap cache on a host. There are no counters for
> reads and writes against the cache. As a result, it's difficult to
> understand zswap behavior on production systems.
> 
> Print zswap memory consumption and how many pages are zswapped out in
> /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.

/proc/meminfo is rather prime real estate.  Is this important enough to
be placed in there, or should it instead be in the more lowly
/proc/vmstat?

/proc/meminfo is documented in Documentation/filesystems/proc.rst ;)

That file appears to need a bit of updating for other things.
Johannes Weiner April 27, 2022, 6:53 p.m. UTC | #2
On Wed, Apr 27, 2022 at 11:36:54AM -0700, Andrew Morton wrote:
> On Wed, 27 Apr 2022 12:00:15 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Currently it requires poking at debugfs to figure out the size and
> > population of the zswap cache on a host. There are no counters for
> > reads and writes against the cache. As a result, it's difficult to
> > understand zswap behavior on production systems.
> > 
> > Print zswap memory consumption and how many pages are zswapped out in
> > /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> 
> /proc/meminfo is rather prime real estate.  Is this important enough to
> be placed in there, or should it instead be in the more lowly
> /proc/vmstat?

The zswap pool size is capped to 20% of available RAM, and we usually
have a utilization of tens of gigabytes. I think it's fair to say it's
a first class memory consumer when enabled, and actually a huge hole
in /proc/meminfo coverage right now.

> /proc/meminfo is documented in Documentation/filesystems/proc.rst ;)
> 
> That file appears to need a bit of updating for other things.

"The following is from a 16GB PIII, which has highmem enabled."

lmao.

I'll send a general update for that, and a delta fixlet for 4/5.

Thanks!
Johannes Weiner April 27, 2022, 7:50 p.m. UTC | #3
On Wed, Apr 27, 2022 at 02:53:10PM -0400, Johannes Weiner wrote:
> I'll send a general update for that [...]

From dca20a3a4ae2218f2db7d6e9abb47f6ca9004273 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 27 Apr 2022 15:36:07 -0400
Subject: [PATCH 1/7] Documentation: filesystems: proc: update meminfo section

Add new entries. Minor corrections and cleanups.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/filesystems/proc.rst | 155 ++++++++++++++++++-----------
 1 file changed, 99 insertions(+), 56 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 061744c436d9..736ed384750c 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -942,56 +942,71 @@ can be substantial.  In many cases there are other means to find out
 additional memory using subsystem specific interfaces, for instance
 /proc/net/sockstat for TCP memory allocations.
 
-The following is from a 16GB PIII, which has highmem enabled.
-You may not have all of these fields.
+Example output. You may not have all of these fields.
 
 ::
 
     > cat /proc/meminfo
 
-    MemTotal:     16344972 kB
-    MemFree:      13634064 kB
-    MemAvailable: 14836172 kB
-    Buffers:          3656 kB
-    Cached:        1195708 kB
-    SwapCached:          0 kB
-    Active:         891636 kB
-    Inactive:      1077224 kB
-    HighTotal:    15597528 kB
-    HighFree:     13629632 kB
-    LowTotal:       747444 kB
-    LowFree:          4432 kB
-    SwapTotal:           0 kB
-    SwapFree:            0 kB
-    Dirty:             968 kB
-    Writeback:           0 kB
-    AnonPages:      861800 kB
-    Mapped:         280372 kB
-    Shmem:             644 kB
-    KReclaimable:   168048 kB
-    Slab:           284364 kB
-    SReclaimable:   159856 kB
-    SUnreclaim:     124508 kB
-    PageTables:      24448 kB
-    NFS_Unstable:        0 kB
-    Bounce:              0 kB
-    WritebackTmp:        0 kB
-    CommitLimit:   7669796 kB
-    Committed_AS:   100056 kB
-    VmallocTotal:   112216 kB
-    VmallocUsed:       428 kB
-    VmallocChunk:   111088 kB
-    Percpu:          62080 kB
-    HardwareCorrupted:   0 kB
-    AnonHugePages:   49152 kB
-    ShmemHugePages:      0 kB
-    ShmemPmdMapped:      0 kB
+    MemTotal:       32858820 kB
+    MemFree:        21001236 kB
+    MemAvailable:   27214312 kB
+    Buffers:          581092 kB
+    Cached:          5587612 kB
+    SwapCached:            0 kB
+    Active:          3237152 kB
+    Inactive:        7586256 kB
+    Active(anon):      94064 kB
+    Inactive(anon):  4570616 kB
+    Active(file):    3143088 kB
+    Inactive(file):  3015640 kB
+    Unevictable:           0 kB
+    Mlocked:               0 kB
+    SwapTotal:             0 kB
+    SwapFree:              0 kB
+    Dirty:                12 kB
+    Writeback:             0 kB
+    AnonPages:       4654780 kB
+    Mapped:           266244 kB
+    Shmem:              9976 kB
+    KReclaimable:     517708 kB
+    Slab:             660044 kB
+    SReclaimable:     517708 kB
+    SUnreclaim:       142336 kB
+    KernelStack:       11168 kB
+    PageTables:        20540 kB
+    NFS_Unstable:          0 kB
+    Bounce:                0 kB
+    WritebackTmp:          0 kB
+    CommitLimit:    16429408 kB
+    Committed_AS:    7715148 kB
+    VmallocTotal:   34359738367 kB
+    VmallocUsed:       40444 kB
+    VmallocChunk:          0 kB
+    Percpu:            29312 kB
+    HardwareCorrupted:     0 kB
+    AnonHugePages:   4149248 kB
+    ShmemHugePages:        0 kB
+    ShmemPmdMapped:        0 kB
+    FileHugePages:         0 kB
+    FilePmdMapped:         0 kB
+    CmaTotal:              0 kB
+    CmaFree:               0 kB
+    HugePages_Total:       0
+    HugePages_Free:        0
+    HugePages_Rsvd:        0
+    HugePages_Surp:        0
+    Hugepagesize:       2048 kB
+    Hugetlb:               0 kB
+    DirectMap4k:      401152 kB
+    DirectMap2M:    10008576 kB
+    DirectMap1G:    24117248 kB
 
 MemTotal
               Total usable RAM (i.e. physical RAM minus a few reserved
               bits and the kernel binary code)
 MemFree
-              The sum of LowFree+HighFree
+              Total free RAM. On highmem systems, the sum of LowFree+HighFree
 MemAvailable
               An estimate of how much memory is available for starting new
               applications, without swapping. Calculated from MemFree,
@@ -1005,8 +1020,9 @@ Buffers
               Relatively temporary storage for raw disk blocks
               shouldn't get tremendously large (20MB or so)
 Cached
-              in-memory cache for files read from the disk (the
-              pagecache).  Doesn't include SwapCached
+              In-memory cache for files read from the disk (the
+              pagecache) as well as tmpfs & shmem.
+              Doesn't include SwapCached.
 SwapCached
               Memory that once was swapped out, is swapped back in but
               still also is in the swapfile (if memory is needed it
@@ -1018,6 +1034,11 @@ Active
 Inactive
               Memory which has been less recently used.  It is more
               eligible to be reclaimed for other purposes
+Unevictable
+              Memory that cannot be reclaimed, such as mlocked pages,
+              ramfs backing pages, secret memfd pages etc.
+Mlocked
+              Memory locked with mlock().
 HighTotal, HighFree
               Highmem is all memory above ~860MB of physical memory.
               Highmem areas are for use by userspace programs, or
@@ -1040,20 +1061,10 @@ Writeback
               Memory which is actively being written back to the disk
 AnonPages
               Non-file backed pages mapped into userspace page tables
-HardwareCorrupted
-              The amount of RAM/memory in KB, the kernel identifies as
-	      corrupted.
-AnonHugePages
-              Non-file backed huge pages mapped into userspace page tables
 Mapped
               files which have been mmaped, such as libraries
 Shmem
               Total memory used by shared memory (shmem) and tmpfs
-ShmemHugePages
-              Memory used by shared memory (shmem) and tmpfs allocated
-              with huge pages
-ShmemPmdMapped
-              Shared memory mapped into userspace with huge pages
 KReclaimable
               Kernel allocations that the kernel will attempt to reclaim
               under memory pressure. Includes SReclaimable (below), and other
@@ -1064,9 +1075,10 @@ SReclaimable
               Part of Slab, that might be reclaimed, such as caches
 SUnreclaim
               Part of Slab, that cannot be reclaimed on memory pressure
+KernelStack
+              Memory consumed by the kernel stacks of all tasks
 PageTables
-              amount of memory dedicated to the lowest level of page
-              tables.
+              Memory consumed by userspace page tables
 NFS_Unstable
               Always zero. Previous counted pages which had been written to
               the server, but has not been committed to stable storage.
@@ -1098,7 +1110,7 @@ Committed_AS
               has been allocated by processes, even if it has not been
               "used" by them as of yet. A process which malloc()'s 1G
               of memory, but only touches 300M of it will show up as
-	      using 1G. This 1G is memory which has been "committed" to
+              using 1G. This 1G is memory which has been "committed" to
               by the VM and can be used at any time by the allocating
               application. With strict overcommit enabled on the system
               (mode 2 in 'vm.overcommit_memory'), allocations which would
@@ -1107,7 +1119,7 @@ Committed_AS
               not fail due to lack of memory once that memory has been
               successfully allocated.
 VmallocTotal
-              total size of vmalloc memory area
+              total size of vmalloc virtual address space
 VmallocUsed
               amount of vmalloc area which is used
 VmallocChunk
@@ -1115,6 +1127,37 @@ VmallocChunk
 Percpu
               Memory allocated to the percpu allocator used to back percpu
               allocations. This stat excludes the cost of metadata.
+HardwareCorrupted
+              The amount of RAM/memory in KB, the kernel identifies as
+              corrupted.
+AnonHugePages
+              Non-file backed huge pages mapped into userspace page tables
+ShmemHugePages
+              Memory used by shared memory (shmem) and tmpfs allocated
+              with huge pages
+ShmemPmdMapped
+              Shared memory mapped into userspace with huge pages
+FileHugePages
+              Memory used for filesystem data (page cache) allocated
+              with huge pages
+FilePmdMapped
+              Page cache mapped into userspace with huge pages
+CmaTotal
+              Memory reserved for the Contiguous Memory Allocator (CMA)
+CmaFree
+              Free remaining memory in the CMA reserves
+HugePages_Total
+HugePages_Free
+HugePages_Rsvd
+HugePages_Surp
+Hugepagesize
+Hugetlb
+              See Documentation/admin-guide/mm/hugetlbpage.rst.
+DirectMap4k
+DirectMap2M
+DirectMap1G
+              Breakdown of page table sizes used in the kernel's
+              identity mapping of RAM
 
 vmallocinfo
 ~~~~~~~~~~~
Johannes Weiner April 27, 2022, 7:51 p.m. UTC | #4
On Wed, Apr 27, 2022 at 02:53:10PM -0400, Johannes Weiner wrote:
> [...] and a delta fixlet for 4/5.

From 35851ad3ddbf30122d755bdf8abea6dc188492a2 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 27 Apr 2022 15:44:23 -0400
Subject: [PATCH 6/7] mm: zswap: add basic meminfo and vmstat coverage fix

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/filesystems/proc.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 736ed384750c..8b5a94cfa722 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -964,6 +964,8 @@ Example output. You may not have all of these fields.
     Mlocked:               0 kB
     SwapTotal:             0 kB
     SwapFree:              0 kB
+    Zswap:              1904 kB
+    Zswapped:           7792 kB
     Dirty:                12 kB
     Writeback:             0 kB
     AnonPages:       4654780 kB
@@ -1055,6 +1057,10 @@ SwapTotal
 SwapFree
               Memory which has been evicted from RAM, and is temporarily
               on the disk
+Zswap
+              Memory consumed by the zswap backend (compressed size)
+Zswapped
+              Amount of anonymous memory stored in zswap (original size)
 Dirty
               Memory which is waiting to get written back to the disk
 Writeback
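
The two new fields together give the effective compression ratio of the
pool. A minimal userspace sketch, assuming the field names from this
fixlet ("Zswap" and "Zswapped"); the helper names meminfo_field_kb() and
zswap_ratio_x100() are hypothetical and not part of the patch:

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up "<key>:" at the start of a line in meminfo-style text and
 * store its value (in kB) in *kb. Returns 0 on success, -1 if the
 * key is absent or has no number after it.
 */
int meminfo_field_kb(const char *text, const char *key, unsigned long *kb)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		/* Require the colon so "Zswap" does not match "Zswapped". */
		if (!strncmp(line, key, klen) && line[klen] == ':')
			return sscanf(line + klen + 1, "%lu", kb) == 1 ? 0 : -1;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Original size per compressed size, scaled by 100 (409 means 4.09x). */
unsigned long zswap_ratio_x100(unsigned long zswapped_kb,
			       unsigned long zswap_kb)
{
	return zswap_kb ? zswapped_kb * 100 / zswap_kb : 0;
}
```

With the example values above (Zswap 1904 kB, Zswapped 7792 kB), the
ratio works out to roughly 4.09x.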
Minchan Kim April 27, 2022, 8:29 p.m. UTC | #5
Hi Johannes,

On Wed, Apr 27, 2022 at 12:00:15PM -0400, Johannes Weiner wrote:
> Currently it requires poking at debugfs to figure out the size and
> population of the zswap cache on a host. There are no counters for
> reads and writes against the cache. As a result, it's difficult to
> understand zswap behavior on production systems.
> 
> Print zswap memory consumption and how many pages are zswapped out in
> /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  fs/proc/meminfo.c             |  7 +++++++
>  include/linux/swap.h          |  5 +++++
>  include/linux/vm_event_item.h |  4 ++++
>  mm/vmstat.c                   |  4 ++++
>  mm/zswap.c                    | 13 ++++++-------
>  5 files changed, 26 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 6fa761c9cc78..6e89f0e2fd20 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>  
>  	show_val_kb(m, "SwapTotal:      ", i.totalswap);
>  	show_val_kb(m, "SwapFree:       ", i.freeswap);
> +#ifdef CONFIG_ZSWAP
> +	seq_printf(m,  "Zswap:          %8lu kB\n",
> +		   (unsigned long)(zswap_pool_total_size >> 10));
> +	seq_printf(m,  "Zswapped:       %8lu kB\n",
> +		   (unsigned long)atomic_read(&zswap_stored_pages) <<
> +		   (PAGE_SHIFT - 10));
> +#endif

I agree it would be very handy to have the memory consumption in meminfo

https://lore.kernel.org/all/YYwZXrL3Fu8%2FvLZw@google.com/

If we really go with this zswap-only metric instead of a general term
like "Compressed", I'd like to post maybe "Zram:" for the same reason
in this patchset. Do you think that's a better idea than introducing
a general term like "Compressed:" or something else?
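
For reference, the two seq_printf() calls in the quoted hunk use
different unit conversions: the pool size is a byte count shifted down
to kB, while the stored-page count is shifted up from pages to kB. A
minimal standalone sketch of that arithmetic, assuming 4 KiB pages and
a 64-bit byte counter; EXAMPLE_PAGE_SHIFT and both helper names are
hypothetical stand-ins, not kernel symbols:

```c
#include <stdint.h>

#define EXAMPLE_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Bytes to kB, as done for the pool size: divide by 1024. */
unsigned long bytes_to_kb(uint64_t bytes)
{
	return (unsigned long)(bytes >> 10);
}

/* Pages to kB, as done for the stored-page count: multiply by 4. */
unsigned long pages_to_kb(unsigned long pages)
{
	return pages << (EXAMPLE_PAGE_SHIFT - 10);
}
```

Under these assumptions, 1948 stored pages would be reported as 7792 kB
and a pool of 1949696 bytes as 1904 kB.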
Johannes Weiner April 27, 2022, 9:20 p.m. UTC | #6
On Wed, Apr 27, 2022 at 01:29:34PM -0700, Minchan Kim wrote:
> Hi Johannes,
> 
> On Wed, Apr 27, 2022 at 12:00:15PM -0400, Johannes Weiner wrote:
> > Currently it requires poking at debugfs to figure out the size and
> > population of the zswap cache on a host. There are no counters for
> > reads and writes against the cache. As a result, it's difficult to
> > understand zswap behavior on production systems.
> > 
> > Print zswap memory consumption and how many pages are zswapped out in
> > /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  fs/proc/meminfo.c             |  7 +++++++
> >  include/linux/swap.h          |  5 +++++
> >  include/linux/vm_event_item.h |  4 ++++
> >  mm/vmstat.c                   |  4 ++++
> >  mm/zswap.c                    | 13 ++++++-------
> >  5 files changed, 26 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > index 6fa761c9cc78..6e89f0e2fd20 100644
> > --- a/fs/proc/meminfo.c
> > +++ b/fs/proc/meminfo.c
> > @@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> >  
> >  	show_val_kb(m, "SwapTotal:      ", i.totalswap);
> >  	show_val_kb(m, "SwapFree:       ", i.freeswap);
> > +#ifdef CONFIG_ZSWAP
> > +	seq_printf(m,  "Zswap:          %8lu kB\n",
> > +		   (unsigned long)(zswap_pool_total_size >> 10));
> > +	seq_printf(m,  "Zswapped:       %8lu kB\n",
> > +		   (unsigned long)atomic_read(&zswap_stored_pages) <<
> > +		   (PAGE_SHIFT - 10));
> > +#endif
> 
> I agree it would be very handy to have the memory consumption in meminfo
> 
> https://lore.kernel.org/all/YYwZXrL3Fu8%2FvLZw@google.com/
> 
> If we really go this Zswap only metric instead of general term
> "Compressed", I'd like to post maybe "Zram:" with same reason
> in this patchset. Do you think that's better idea instead of
> introducing general term like "Compressed:" or something else?

I'm fine with changing it to Compressed. If somebody cares about a
more detailed breakdown, we can add Zswap, Zram subsets as needed.

From 8e9e2d6490b7082c41743fbdb9ffd2db4e3ce962 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 27 Apr 2022 17:15:15 -0400
Subject: [PATCH] mm: zswap: add basic meminfo and vmstat coverage fix fix

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/filesystems/proc.rst | 7 ++++---
 fs/proc/meminfo.c                  | 2 +-
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 8b5a94cfa722..93edcf233464 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -964,7 +964,7 @@ Example output. You may not have all of these fields.
     Mlocked:               0 kB
     SwapTotal:             0 kB
     SwapFree:              0 kB
-    Zswap:              1904 kB
+    Compressed:         1904 kB
     Zswapped:           7792 kB
     Dirty:                12 kB
     Writeback:             0 kB
@@ -1057,8 +1057,9 @@ SwapTotal
 SwapFree
               Memory which has been evicted from RAM, and is temporarily
               on the disk
-Zswap
-              Memory consumed by the zswap backend (compressed size)
+Compressed
+              Memory consumed by compression backends, such as zswap
+              (compressed size)
 Zswapped
               Amount of anonymous memory stored in zswap (original size)
 Dirty
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 6e89f0e2fd20..554d6f230e67 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -87,7 +87,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	show_val_kb(m, "SwapTotal:      ", i.totalswap);
 	show_val_kb(m, "SwapFree:       ", i.freeswap);
 #ifdef CONFIG_ZSWAP
-	seq_printf(m,  "Zswap:          %8lu kB\n",
+	seq_printf(m,  "Compressed:     %8lu kB\n",
 		   (unsigned long)(zswap_pool_total_size >> 10));
 	seq_printf(m,  "Zswapped:       %8lu kB\n",
 		   (unsigned long)atomic_read(&zswap_stored_pages) <<
Johannes Weiner April 27, 2022, 9:36 p.m. UTC | #7
On Wed, Apr 27, 2022 at 05:20:31PM -0400, Johannes Weiner wrote:
> On Wed, Apr 27, 2022 at 01:29:34PM -0700, Minchan Kim wrote:
> > Hi Johannes,
> > 
> > On Wed, Apr 27, 2022 at 12:00:15PM -0400, Johannes Weiner wrote:
> > > Currently it requires poking at debugfs to figure out the size and
> > > population of the zswap cache on a host. There are no counters for
> > > reads and writes against the cache. As a result, it's difficult to
> > > understand zswap behavior on production systems.
> > > 
> > > Print zswap memory consumption and how many pages are zswapped out in
> > > /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> > > 
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > ---
> > >  fs/proc/meminfo.c             |  7 +++++++
> > >  include/linux/swap.h          |  5 +++++
> > >  include/linux/vm_event_item.h |  4 ++++
> > >  mm/vmstat.c                   |  4 ++++
> > >  mm/zswap.c                    | 13 ++++++-------
> > >  5 files changed, 26 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > > index 6fa761c9cc78..6e89f0e2fd20 100644
> > > --- a/fs/proc/meminfo.c
> > > +++ b/fs/proc/meminfo.c
> > > @@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > >  
> > >  	show_val_kb(m, "SwapTotal:      ", i.totalswap);
> > >  	show_val_kb(m, "SwapFree:       ", i.freeswap);
> > > +#ifdef CONFIG_ZSWAP
> > > +	seq_printf(m,  "Zswap:          %8lu kB\n",
> > > +		   (unsigned long)(zswap_pool_total_size >> 10));
> > > +	seq_printf(m,  "Zswapped:       %8lu kB\n",
> > > +		   (unsigned long)atomic_read(&zswap_stored_pages) <<
> > > +		   (PAGE_SHIFT - 10));
> > > +#endif
> > 
> > I agree it would be very handy to have the memory consumption in meminfo
> > 
> > https://lore.kernel.org/all/YYwZXrL3Fu8%2FvLZw@google.com/
> > 
> > If we really go this Zswap only metric instead of general term
> > "Compressed", I'd like to post maybe "Zram:" with same reason
> > in this patchset. Do you think that's better idea instead of
> > introducing general term like "Compressed:" or something else?
> 
> I'm fine with changing it to Compressed. If somebody cares about a
> more detailed breakdown, we can add Zswap, Zram subsets as needed.

It does raise the question of what to do about cgroup, though. Should
the control files (memory.zswap.current & memory.zswap.max) apply to
zram in the future? If so, we should rename them, too.

I'm not too familiar with zram, maybe you can provide some
background. AFAIU, Google uses zram quite widely; all the more
confusing why there is no container support for it yet.

Could you shed some light?

Thanks
Minchan Kim April 27, 2022, 10:12 p.m. UTC | #8
On Wed, Apr 27, 2022 at 05:36:26PM -0400, Johannes Weiner wrote:
> On Wed, Apr 27, 2022 at 05:20:31PM -0400, Johannes Weiner wrote:
> > On Wed, Apr 27, 2022 at 01:29:34PM -0700, Minchan Kim wrote:
> > > Hi Johannes,
> > > 
> > > On Wed, Apr 27, 2022 at 12:00:15PM -0400, Johannes Weiner wrote:
> > > > Currently it requires poking at debugfs to figure out the size and
> > > > population of the zswap cache on a host. There are no counters for
> > > > reads and writes against the cache. As a result, it's difficult to
> > > > understand zswap behavior on production systems.
> > > > 
> > > > Print zswap memory consumption and how many pages are zswapped out in
> > > > /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> > > > 
> > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > ---
> > > >  fs/proc/meminfo.c             |  7 +++++++
> > > >  include/linux/swap.h          |  5 +++++
> > > >  include/linux/vm_event_item.h |  4 ++++
> > > >  mm/vmstat.c                   |  4 ++++
> > > >  mm/zswap.c                    | 13 ++++++-------
> > > >  5 files changed, 26 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > > > index 6fa761c9cc78..6e89f0e2fd20 100644
> > > > --- a/fs/proc/meminfo.c
> > > > +++ b/fs/proc/meminfo.c
> > > > @@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > > >  
> > > >  	show_val_kb(m, "SwapTotal:      ", i.totalswap);
> > > >  	show_val_kb(m, "SwapFree:       ", i.freeswap);
> > > > +#ifdef CONFIG_ZSWAP
> > > > +	seq_printf(m,  "Zswap:          %8lu kB\n",
> > > > +		   (unsigned long)(zswap_pool_total_size >> 10));
> > > > +	seq_printf(m,  "Zswapped:       %8lu kB\n",
> > > > +		   (unsigned long)atomic_read(&zswap_stored_pages) <<
> > > > +		   (PAGE_SHIFT - 10));
> > > > +#endif
> > > 
> > > I agree it would be very handy to have the memory consumption in meminfo
> > > 
> > > https://lore.kernel.org/all/YYwZXrL3Fu8%2FvLZw@google.com/
> > > 
> > > If we really go this Zswap only metric instead of general term
> > > "Compressed", I'd like to post maybe "Zram:" with same reason
> > > in this patchset. Do you think that's better idea instead of
> > > introducing general term like "Compressed:" or something else?
> > 
> > I'm fine with changing it to Compressed. If somebody cares about a
> > more detailed breakdown, we can add Zswap, Zram subsets as needed.
> 
> It does raise the question what to do about cgroup, though. Should the
> control files (memory.zswap.current & memory.zswap.max) apply to zram
> in the future? If so, we should rename them, too.
> 
> I'm not too familiar with zram, maybe you can provide some
> background. AFAIU, Google uses zram quite widely; all the more
> confusing why there is no container support for it yet.

My use case for zram is Android, which doesn't use memcg.
Minchan Kim April 27, 2022, 10:16 p.m. UTC | #9
On Wed, Apr 27, 2022 at 05:20:29PM -0400, Johannes Weiner wrote:
> On Wed, Apr 27, 2022 at 01:29:34PM -0700, Minchan Kim wrote:
> > Hi Johannes,
> > 
> > On Wed, Apr 27, 2022 at 12:00:15PM -0400, Johannes Weiner wrote:
> > > Currently it requires poking at debugfs to figure out the size and
> > > population of the zswap cache on a host. There are no counters for
> > > reads and writes against the cache. As a result, it's difficult to
> > > understand zswap behavior on production systems.
> > > 
> > > Print zswap memory consumption and how many pages are zswapped out in
> > > /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> > > 
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > ---
> > >  fs/proc/meminfo.c             |  7 +++++++
> > >  include/linux/swap.h          |  5 +++++
> > >  include/linux/vm_event_item.h |  4 ++++
> > >  mm/vmstat.c                   |  4 ++++
> > >  mm/zswap.c                    | 13 ++++++-------
> > >  5 files changed, 26 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > > index 6fa761c9cc78..6e89f0e2fd20 100644
> > > --- a/fs/proc/meminfo.c
> > > +++ b/fs/proc/meminfo.c
> > > @@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > >  
> > >  	show_val_kb(m, "SwapTotal:      ", i.totalswap);
> > >  	show_val_kb(m, "SwapFree:       ", i.freeswap);
> > > +#ifdef CONFIG_ZSWAP
> > > +	seq_printf(m,  "Zswap:          %8lu kB\n",
> > > +		   (unsigned long)(zswap_pool_total_size >> 10));
> > > +	seq_printf(m,  "Zswapped:       %8lu kB\n",
> > > +		   (unsigned long)atomic_read(&zswap_stored_pages) <<
> > > +		   (PAGE_SHIFT - 10));
> > > +#endif
> > 
> > I agree it would be very handy to have the memory consumption in meminfo
> > 
> > https://lore.kernel.org/all/YYwZXrL3Fu8%2FvLZw@google.com/
> > 
> > If we really go this Zswap only metric instead of general term
> > "Compressed", I'd like to post maybe "Zram:" with same reason
> > in this patchset. Do you think that's better idea instead of
> > introducing general term like "Compressed:" or something else?
> 
> I'm fine with changing it to Compressed. If somebody cares about a
> more detailed breakdown, we can add Zswap, Zram subsets as needed.

Thanks! Please consider renaming ZSWPIN to a more general term, too.

> 
> From 8e9e2d6490b7082c41743fbdb9ffd2db4e3ce962 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Wed, 27 Apr 2022 17:15:15 -0400
> Subject: [PATCH] mm: zswap: add basic meminfo and vmstat coverage fix fix
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  Documentation/filesystems/proc.rst | 7 ++++---
>  fs/proc/meminfo.c                  | 2 +-
>  2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 8b5a94cfa722..93edcf233464 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -964,7 +964,7 @@ Example output. You may not have all of these fields.
>      Mlocked:               0 kB
>      SwapTotal:             0 kB
>      SwapFree:              0 kB
> -    Zswap:              1904 kB
> +    Compressed:         1904 kB
>      Zswapped:           7792 kB
>      Dirty:                12 kB
>      Writeback:             0 kB
> @@ -1057,8 +1057,9 @@ SwapTotal
>  SwapFree
>                Memory which has been evicted from RAM, and is temporarily
>                on the disk
> -Zswap
> -              Memory consumed by the zswap backend (compressed size)
> +Compressed
> +              Memory consumed by compression backends, such as zswap
> +              (compressed size)
>  Zswapped
>                Amount of anonymous memory stored in zswap (original size)
>  Dirty
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 6e89f0e2fd20..554d6f230e67 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -87,7 +87,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>  	show_val_kb(m, "SwapTotal:      ", i.totalswap);
>  	show_val_kb(m, "SwapFree:       ", i.freeswap);
>  #ifdef CONFIG_ZSWAP
> -	seq_printf(m,  "Zswap:          %8lu kB\n",
> +	seq_printf(m,  "Compressed:     %8lu kB\n",
>  		   (unsigned long)(zswap_pool_total_size >> 10));
>  	seq_printf(m,  "Zswapped:       %8lu kB\n",
>  		   (unsigned long)atomic_read(&zswap_stored_pages) <<
> -- 
> 2.35.3
>
Shakeel Butt April 27, 2022, 11:36 p.m. UTC | #10
On Wed, Apr 27, 2022 at 3:32 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Apr 27, 2022 at 05:20:31PM -0400, Johannes Weiner wrote:
> > On Wed, Apr 27, 2022 at 01:29:34PM -0700, Minchan Kim wrote:
> > > Hi Johannes,
> > >
> > > On Wed, Apr 27, 2022 at 12:00:15PM -0400, Johannes Weiner wrote:
> > > > Currently it requires poking at debugfs to figure out the size and
> > > > population of the zswap cache on a host. There are no counters for
> > > > reads and writes against the cache. As a result, it's difficult to
> > > > understand zswap behavior on production systems.
> > > >
> > > > Print zswap memory consumption and how many pages are zswapped out in
> > > > /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> > > >
> > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > ---
> > > >  fs/proc/meminfo.c             |  7 +++++++
> > > >  include/linux/swap.h          |  5 +++++
> > > >  include/linux/vm_event_item.h |  4 ++++
> > > >  mm/vmstat.c                   |  4 ++++
> > > >  mm/zswap.c                    | 13 ++++++-------
> > > >  5 files changed, 26 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > > > index 6fa761c9cc78..6e89f0e2fd20 100644
> > > > --- a/fs/proc/meminfo.c
> > > > +++ b/fs/proc/meminfo.c
> > > > @@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > > >
> > > >   show_val_kb(m, "SwapTotal:      ", i.totalswap);
> > > >   show_val_kb(m, "SwapFree:       ", i.freeswap);
> > > > +#ifdef CONFIG_ZSWAP
> > > > + seq_printf(m,  "Zswap:          %8lu kB\n",
> > > > +            (unsigned long)(zswap_pool_total_size >> 10));
> > > > + seq_printf(m,  "Zswapped:       %8lu kB\n",
> > > > +            (unsigned long)atomic_read(&zswap_stored_pages) <<
> > > > +            (PAGE_SHIFT - 10));
> > > > +#endif
> > >
> > > I agree it would be very handy to have the memory consumption in meminfo
> > >
> > > https://lore.kernel.org/all/YYwZXrL3Fu8%2FvLZw@google.com/
> > >
> > > If we really go this Zswap only metric instead of general term
> > > "Compressed", I'd like to post maybe "Zram:" with same reason
> > > in this patchset. Do you think that's better idea instead of
> > > introducing general term like "Compressed:" or something else?
> >
> > I'm fine with changing it to Compressed. If somebody cares about a
> > more detailed breakdown, we can add Zswap, Zram subsets as needed.
>
> It does raise the question what to do about cgroup, though. Should the
> control files (memory.zswap.current & memory.zswap.max) apply to zram
> in the future? If so, we should rename them, too.
>
> I'm not too familiar with zram, maybe you can provide some
> background. AFAIU, Google uses zram quite widely; all the more
> confusing why there is no container support for it yet.
>
> Could you shed some light?
>

I can shed light on the datacenter workloads. We use cgroup (still on
v1) and zswap. For the workloads/applications, the swap (or zswap) is
transparent in the sense that they are charged exactly the same
irrespective of how much of their memory is zswapped out. Basically the
applications see the same usage which is actually v1's
memsw.usage_in_bytes. We dynamically increase the swap size if it is
low, so we are not really worried about one job hogging the swap
space.

Regarding stats, we do have them internally, representing the
compressed size and the number of pages in zswap. The compressed size is
actually used for OOM victim selection. The memsw or v2's swap usage
in the presence of compression-based swap does not tell how much
memory can potentially be released by evicting a job. For example,
take two jobs 'A' and 'B'. Both have 100 pages compressed, but A's
100 pages compress to, say, 10 pages while B's 100 pages compress to
70 pages. It is preferable to kill B, as that will release 70 pages.
(This is a very simplified explanation of what we actually do.)
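
To make the A/B example concrete, here is a minimal userspace sketch of
choosing a victim by compressed footprint. It is not the policy actually
in use (which is described above as more involved); struct job_stat, its
fields and pick_victim() are hypothetical names made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-job record; the field names are illustrative only. */
struct job_stat {
	const char *name;
	unsigned long zswapped_pages;   /* uncompressed pages held in zswap */
	unsigned long compressed_pages; /* memory those pages occupy once compressed */
};

/*
 * Return the index of the job whose eviction would release the most
 * compressed memory. Ties go to the earlier entry.
 */
static size_t pick_victim(const struct job_stat *jobs, size_t n)
{
	size_t best = 0;

	assert(jobs && n > 0);
	for (size_t i = 1; i < n; i++) {
		if (jobs[i].compressed_pages > jobs[best].compressed_pages)
			best = i;
	}
	return best;
}
```

With A = { "A", 100, 10 } and B = { "B", 100, 70 }, pick_victim()
returns index 1, i.e. B, even though both jobs report the same 100
zswapped pages.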
Johannes Weiner April 28, 2022, 2:05 p.m. UTC | #11
On Wed, Apr 27, 2022 at 03:12:17PM -0700, Minchan Kim wrote:
> On Wed, Apr 27, 2022 at 05:36:26PM -0400, Johannes Weiner wrote:
> > On Wed, Apr 27, 2022 at 05:20:31PM -0400, Johannes Weiner wrote:
> > > On Wed, Apr 27, 2022 at 01:29:34PM -0700, Minchan Kim wrote:
> > > > Hi Johannes,
> > > > 
> > > > On Wed, Apr 27, 2022 at 12:00:15PM -0400, Johannes Weiner wrote:
> > > > > Currently it requires poking at debugfs to figure out the size and
> > > > > population of the zswap cache on a host. There are no counters for
> > > > > reads and writes against the cache. As a result, it's difficult to
> > > > > understand zswap behavior on production systems.
> > > > > 
> > > > > Print zswap memory consumption and how many pages are zswapped out in
> > > > > /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> > > > > 
> > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > ---
> > > > >  fs/proc/meminfo.c             |  7 +++++++
> > > > >  include/linux/swap.h          |  5 +++++
> > > > >  include/linux/vm_event_item.h |  4 ++++
> > > > >  mm/vmstat.c                   |  4 ++++
> > > > >  mm/zswap.c                    | 13 ++++++-------
> > > > >  5 files changed, 26 insertions(+), 7 deletions(-)
> > > > > 
> > > > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > > > > index 6fa761c9cc78..6e89f0e2fd20 100644
> > > > > --- a/fs/proc/meminfo.c
> > > > > +++ b/fs/proc/meminfo.c
> > > > > @@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > > > >  
> > > > >  	show_val_kb(m, "SwapTotal:      ", i.totalswap);
> > > > >  	show_val_kb(m, "SwapFree:       ", i.freeswap);
> > > > > +#ifdef CONFIG_ZSWAP
> > > > > +	seq_printf(m,  "Zswap:          %8lu kB\n",
> > > > > +		   (unsigned long)(zswap_pool_total_size >> 10));
> > > > > +	seq_printf(m,  "Zswapped:       %8lu kB\n",
> > > > > +		   (unsigned long)atomic_read(&zswap_stored_pages) <<
> > > > > +		   (PAGE_SHIFT - 10));
> > > > > +#endif
> > > > 
> > > > I agree it would be very handy to have the memory consumption in meminfo
> > > > 
> > > > https://lore.kernel.org/all/YYwZXrL3Fu8%2FvLZw@google.com/
> > > > 
> > > > If we really go this Zswap only metric instead of general term
> > > > "Compressed", I'd like to post maybe "Zram:" with same reason
> > > > in this patchset. Do you think that's better idea instead of
> > > > introducing general term like "Compressed:" or something else?
> > > 
> > > I'm fine with changing it to Compressed. If somebody cares about a
> > > more detailed breakdown, we can add Zswap, Zram subsets as needed.
> > 
> > It does raise the question what to do about cgroup, though. Should the
> > control files (memory.zswap.current & memory.zswap.max) apply to zram
> > in the future? If so, we should rename them, too.
> > 
> > I'm not too familiar with zram, maybe you can provide some
> > background. AFAIU, Google uses zram quite widely; all the more
> > confusing why there is no container support for it yet.
> 
> My usecase with zram is Android which doesn't use memcg.

Ok.

After more thought, my take is that in the future it could make sense
to track zram pages in a cgroup's memory.current. But it should NOT be
included in the dedicated memory.zswap.* files. Zswap is an in-kernel
writeback cache, and those files allow userspace to tune writeback
thresholds depending on the composition of the workload's
workingset. This doesn't translate to zram: the writeback facility that
it has is triggered by hand, based on criteria such as idle pages and
compression rate. It's not based on size. From a cgroup POV, it's a
memory consumer that should be subject to memory.max, nothing more.

This distinction applies to meminfo as well, though. While I think it
makes sense to have a combined "Compressed" counter for zram and
zswap, it's still important to understand zswap behavior on its own to
tune the system-wide writeback threshold in max_pool_percent. (And
again, while zram can also be limited, it's not a writeback threshold,
it's just a red line for returning -ENOMEM).
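
As a rough illustration of how the two items could be consumed when
tuning, here is a userspace sketch that pulls "Zswap" and "Zswapped" out
of a meminfo-style buffer and derives a compression ratio. The helper
names meminfo_kb() and zswap_ratio_x100() are made up for this example,
and the "Key: value kB" layout is assumed from the seq_printf() calls in
the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up "Key:   <value> kB" in a meminfo-style buffer.
 * Returns 0 and stores the value on success, -1 if the key is absent.
 */
static int meminfo_kb(const char *buf, const char *key, unsigned long *kb)
{
	size_t klen;
	const char *line = buf;

	assert(buf && key && kb);
	klen = strlen(key);
	while (line && *line) {
		/* Require the ':' so that "Zswap" does not match "Zswapped". */
		if (strncmp(line, key, klen) == 0 && line[klen] == ':')
			return sscanf(line + klen + 1, "%lu", kb) == 1 ? 0 : -1;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Zswapped / Zswap, scaled by 100 to stay in integer math; 0 for an empty pool. */
static unsigned long zswap_ratio_x100(unsigned long zswapped_kb,
				      unsigned long zswap_kb)
{
	return zswap_kb ? zswapped_kb * 100 / zswap_kb : 0;
}
```

For a buffer containing "Zswap: 400 kB" and "Zswapped: 1200 kB", the
ratio comes out as 300, i.e. 3.00x.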

So I'm going to keep the Zswap and Zswapped items and retract the
delta patch for renaming it to Compressed.

But I'd ack a patch that adds a combined "Compressed" counter for zram
+ zswap if you send it, Minchan.
Johannes Weiner April 28, 2022, 2:25 p.m. UTC | #12
On Wed, Apr 27, 2022 at 03:16:48PM -0700, Minchan Kim wrote:
> On Wed, Apr 27, 2022 at 05:20:29PM -0400, Johannes Weiner wrote:
> > On Wed, Apr 27, 2022 at 01:29:34PM -0700, Minchan Kim wrote:
> > > Hi Johannes,
> > > 
> > > On Wed, Apr 27, 2022 at 12:00:15PM -0400, Johannes Weiner wrote:
> > > > Currently it requires poking at debugfs to figure out the size and
> > > > population of the zswap cache on a host. There are no counters for
> > > > reads and writes against the cache. As a result, it's difficult to
> > > > understand zswap behavior on production systems.
> > > > 
> > > > Print zswap memory consumption and how many pages are zswapped out in
> > > > /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> > > > 
> > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > ---
> > > >  fs/proc/meminfo.c             |  7 +++++++
> > > >  include/linux/swap.h          |  5 +++++
> > > >  include/linux/vm_event_item.h |  4 ++++
> > > >  mm/vmstat.c                   |  4 ++++
> > > >  mm/zswap.c                    | 13 ++++++-------
> > > >  5 files changed, 26 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > > > index 6fa761c9cc78..6e89f0e2fd20 100644
> > > > --- a/fs/proc/meminfo.c
> > > > +++ b/fs/proc/meminfo.c
> > > > @@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > > >  
> > > >  	show_val_kb(m, "SwapTotal:      ", i.totalswap);
> > > >  	show_val_kb(m, "SwapFree:       ", i.freeswap);
> > > > +#ifdef CONFIG_ZSWAP
> > > > +	seq_printf(m,  "Zswap:          %8lu kB\n",
> > > > +		   (unsigned long)(zswap_pool_total_size >> 10));
> > > > +	seq_printf(m,  "Zswapped:       %8lu kB\n",
> > > > +		   (unsigned long)atomic_read(&zswap_stored_pages) <<
> > > > +		   (PAGE_SHIFT - 10));
> > > > +#endif
> > > 
> > > I agree it would be very handy to have the memory consumption in meminfo
> > > 
> > > https://lore.kernel.org/all/YYwZXrL3Fu8%2FvLZw@google.com/
> > > 
> > > If we really go this Zswap only metric instead of general term
> > > "Compressed", I'd like to post maybe "Zram:" with same reason
> > > in this patchset. Do you think that's better idea instead of
> > > introducing general term like "Compressed:" or something else?
> > 
> > I'm fine with changing it to Compressed. If somebody cares about a
> > more detailed breakdown, we can add Zswap, Zram subsets as needed.
> 
> Thanks! Please consider ZSWPIN to rename more general term, too.

That doesn't make sense to me.

Zram is a swap backend, its traffic is accounted in PSWPIN/OUT. Zswap
is a writeback cache on top of the swap backend. It has pages
entering, refaulting, and being written back to the swap backend
(PSWPOUT). A zswpout and a zramout are different things.
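
A toy model of that accounting, to show why the counters are not
interchangeable. This is only a sketch of the distinction drawn above,
not kernel code; the enum, the struct and count_event() are
hypothetical.

```c
#include <assert.h>

/* Toy stand-ins for the vmstat event counters under discussion. */
struct swap_counters {
	unsigned long pswpin, pswpout, zswpin, zswpout;
};

enum swap_event {
	EV_ZSWAP_STORE,     /* page compressed into the zswap pool */
	EV_ZSWAP_LOAD,      /* page refaulted from the zswap pool */
	EV_ZSWAP_WRITEBACK, /* zswap writes a page back to the swap backend */
	EV_BACKEND_WRITE,   /* page written straight to the backend (disk or zram) */
	EV_BACKEND_READ,    /* page read from the backend */
};

static void count_event(struct swap_counters *c, enum swap_event ev)
{
	assert(c);
	switch (ev) {
	case EV_ZSWAP_STORE:
		c->zswpout++;
		break;
	case EV_ZSWAP_LOAD:
		c->zswpin++;
		break;
	case EV_ZSWAP_WRITEBACK: /* reaches the backend, so it is backend traffic */
	case EV_BACKEND_WRITE:
		c->pswpout++;
		break;
	case EV_BACKEND_READ:
		c->pswpin++;
		break;
	}
}
```

In this model a page that is stored in zswap and later written back
bumps zswpout once and pswpout once, whereas a page going directly to
zram only ever shows up in pswpout.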

> > From 8e9e2d6490b7082c41743fbdb9ffd2db4e3ce962 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Wed, 27 Apr 2022 17:15:15 -0400
> > Subject: [PATCH] mm: zswap: add basic meminfo and vmstat coverage fix fix
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Just for completeness,

Nacked-by: Johannes Weiner <hannes@cmpxchg.org>

> > @@ -87,7 +87,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> >  	show_val_kb(m, "SwapTotal:      ", i.totalswap);
> >  	show_val_kb(m, "SwapFree:       ", i.freeswap);
> >  #ifdef CONFIG_ZSWAP
> > -	seq_printf(m,  "Zswap:          %8lu kB\n",
> > +	seq_printf(m,  "Compressed:     %8lu kB\n",
> >  		   (unsigned long)(zswap_pool_total_size >> 10));
> >  	seq_printf(m,  "Zswapped:       %8lu kB\n",
> >  		   (unsigned long)atomic_read(&zswap_stored_pages) <<
> > -- 
> > 2.35.3
> >
Johannes Weiner April 28, 2022, 2:36 p.m. UTC | #13
On Wed, Apr 27, 2022 at 04:36:22PM -0700, Shakeel Butt wrote:
> On Wed, Apr 27, 2022 at 3:32 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Apr 27, 2022 at 05:20:31PM -0400, Johannes Weiner wrote:
> > > On Wed, Apr 27, 2022 at 01:29:34PM -0700, Minchan Kim wrote:
> > > > Hi Johannes,
> > > >
> > > > On Wed, Apr 27, 2022 at 12:00:15PM -0400, Johannes Weiner wrote:
> > > > > Currently it requires poking at debugfs to figure out the size and
> > > > > population of the zswap cache on a host. There are no counters for
> > > > > reads and writes against the cache. As a result, it's difficult to
> > > > > understand zswap behavior on production systems.
> > > > >
> > > > > Print zswap memory consumption and how many pages are zswapped out in
> > > > > /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> > > > >
> > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > ---
> > > > >  fs/proc/meminfo.c             |  7 +++++++
> > > > >  include/linux/swap.h          |  5 +++++
> > > > >  include/linux/vm_event_item.h |  4 ++++
> > > > >  mm/vmstat.c                   |  4 ++++
> > > > >  mm/zswap.c                    | 13 ++++++-------
> > > > >  5 files changed, 26 insertions(+), 7 deletions(-)
> > > > >
> > > > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > > > > index 6fa761c9cc78..6e89f0e2fd20 100644
> > > > > --- a/fs/proc/meminfo.c
> > > > > +++ b/fs/proc/meminfo.c
> > > > > @@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > > > >
> > > > >   show_val_kb(m, "SwapTotal:      ", i.totalswap);
> > > > >   show_val_kb(m, "SwapFree:       ", i.freeswap);
> > > > > +#ifdef CONFIG_ZSWAP
> > > > > + seq_printf(m,  "Zswap:          %8lu kB\n",
> > > > > +            (unsigned long)(zswap_pool_total_size >> 10));
> > > > > + seq_printf(m,  "Zswapped:       %8lu kB\n",
> > > > > +            (unsigned long)atomic_read(&zswap_stored_pages) <<
> > > > > +            (PAGE_SHIFT - 10));
> > > > > +#endif
> > > >
> > > > I agree it would be very handy to have the memory consumption in meminfo
> > > >
> > > > https://lore.kernel.org/all/YYwZXrL3Fu8%2FvLZw@google.com/
> > > >
> > > > If we really go this Zswap only metric instead of general term
> > > > "Compressed", I'd like to post maybe "Zram:" with same reason
> > > > in this patchset. Do you think that's better idea instead of
> > > > introducing general term like "Compressed:" or something else?
> > >
> > > I'm fine with changing it to Compressed. If somebody cares about a
> > > more detailed breakdown, we can add Zswap, Zram subsets as needed.
> >
> > It does raise the question what to do about cgroup, though. Should the
> > control files (memory.zswap.current & memory.zswap.max) apply to zram
> > in the future? If so, we should rename them, too.
> >
> > I'm not too familiar with zram, maybe you can provide some
> > background. AFAIU, Google uses zram quite widely; all the more
> > confusing why there is no container support for it yet.
> >
> > Could you shed some light?
> >
> 
> I can shed light on the datacenter workloads. We use cgroup (still on
> v1) and zswap. For the workloads/applications, the swap (or zswap) is
> transparent in the sense that they are charged exactly the same
> irrespective of how much their memory is zswapped-out. Basically the
> applications see the same usage which is actually v1's
> memsw.usage_in_bytes. We dynamically increase the swap size if it is
> low, so we are not really worried about one job hogging the swap
> space.
> 
> Regarding stats we actually do have them internally representing
> compressed size and number of pages in zswap. The compressed size is
> actually used for OOM victim selection. The memsw or v2's swap usage
> in the presence of compression based swap does not actually tell how
> much memory can potentially be released by evicting a job. For example
> if there are two jobs 'A' and 'B'. Both of them have 100 pages
> compressed but A's 100 pages are compressed to let's say 10 pages
> while B's 100 pages are compressed to 70 pages. It is preferable to
> kill B as that will release 70 pages. (This is a very simplified
> explanation of what we actually do).

Ah, so zram is really only used by the mobile stuff after all.

In the DC, I guess you don't use disk swap in conjunction with zswap,
so those writeback cache controls are less interesting to you?

But it sounds like you would benefit from the zswap(ped) counters in
memory.stat at least.

Thanks, that is enlightening!
Shakeel Butt April 28, 2022, 2:49 p.m. UTC | #14
On Thu, Apr 28, 2022 at 7:36 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Apr 27, 2022 at 04:36:22PM -0700, Shakeel Butt wrote:
> > On Wed, Apr 27, 2022 at 3:32 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, Apr 27, 2022 at 05:20:31PM -0400, Johannes Weiner wrote:
> > > > On Wed, Apr 27, 2022 at 01:29:34PM -0700, Minchan Kim wrote:
> > > > > Hi Johannes,
> > > > >
> > > > > On Wed, Apr 27, 2022 at 12:00:15PM -0400, Johannes Weiner wrote:
> > > > > > Currently it requires poking at debugfs to figure out the size and
> > > > > > population of the zswap cache on a host. There are no counters for
> > > > > > reads and writes against the cache. As a result, it's difficult to
> > > > > > understand zswap behavior on production systems.
> > > > > >
> > > > > > Print zswap memory consumption and how many pages are zswapped out in
> > > > > > /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> > > > > >
> > > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > ---
> > > > > >  fs/proc/meminfo.c             |  7 +++++++
> > > > > >  include/linux/swap.h          |  5 +++++
> > > > > >  include/linux/vm_event_item.h |  4 ++++
> > > > > >  mm/vmstat.c                   |  4 ++++
> > > > > >  mm/zswap.c                    | 13 ++++++-------
> > > > > >  5 files changed, 26 insertions(+), 7 deletions(-)
> > > > > >
> > > > > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > > > > > index 6fa761c9cc78..6e89f0e2fd20 100644
> > > > > > --- a/fs/proc/meminfo.c
> > > > > > +++ b/fs/proc/meminfo.c
> > > > > > @@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > > > > >
> > > > > >   show_val_kb(m, "SwapTotal:      ", i.totalswap);
> > > > > >   show_val_kb(m, "SwapFree:       ", i.freeswap);
> > > > > > +#ifdef CONFIG_ZSWAP
> > > > > > + seq_printf(m,  "Zswap:          %8lu kB\n",
> > > > > > +            (unsigned long)(zswap_pool_total_size >> 10));
> > > > > > + seq_printf(m,  "Zswapped:       %8lu kB\n",
> > > > > > +            (unsigned long)atomic_read(&zswap_stored_pages) <<
> > > > > > +            (PAGE_SHIFT - 10));
> > > > > > +#endif
> > > > >
> > > > > I agree it would be very handy to have the memory consumption in meminfo
> > > > >
> > > > > https://lore.kernel.org/all/YYwZXrL3Fu8%2FvLZw@google.com/
> > > > >
> > > > > If we really go this Zswap only metric instead of general term
> > > > > "Compressed", I'd like to post maybe "Zram:" with same reason
> > > > > in this patchset. Do you think that's better idea instead of
> > > > > introducing general term like "Compressed:" or something else?
> > > >
> > > > I'm fine with changing it to Compressed. If somebody cares about a
> > > > more detailed breakdown, we can add Zswap, Zram subsets as needed.
> > >
> > > It does raise the question what to do about cgroup, though. Should the
> > > control files (memory.zswap.current & memory.zswap.max) apply to zram
> > > in the future? If so, we should rename them, too.
> > >
> > > I'm not too familiar with zram, maybe you can provide some
> > > background. AFAIU, Google uses zram quite widely; all the more
> > > confusing why there is no container support for it yet.
> > >
> > > Could you shed some light?
> > >
> >
> > I can shed light on the datacenter workloads. We use cgroup (still on
> > v1) and zswap. For the workloads/applications, the swap (or zswap) is
> > transparent in the sense that they are charged exactly the same
> > irrespective of how much their memory is zswapped-out. Basically the
> > applications see the same usage which is actually v1's
> > memsw.usage_in_bytes. We dynamically increase the swap size if it is
> > low, so we are not really worried about one job hogging the swap
> > space.
> >
> > Regarding stats we actually do have them internally representing
> > compressed size and number of pages in zswap. The compressed size is
> > actually used for OOM victim selection. The memsw or v2's swap usage
> > in the presence of compression based swap does not actually tell how
> > much memory can potentially be released by evicting a job. For example
> > if there are two jobs 'A' and 'B'. Both of them have 100 pages
> > compressed but A's 100 pages are compressed to let's say 10 pages
> > while B's 100 pages are compressed to 70 pages. It is preferable to
> > kill B as that will release 70 pages. (This is a very simplified
> > explanation of what we actually do).
>
> Ah, so zram is really only used by the mobile stuff after all.
>
> In the DC, I guess you don't use disk swap in conjunction with zswap,
> so those writeback cache controls are less interesting to you?

Yes, we have some modifications to zswap to make it work without any
real backing swap. There is a plan to move to zram eventually,
though.

>
> But it sounds like you would benefit from the zswap(ped) counters in
> memory.stat at least.

Yes, and I think if we need zram-specific counters/stats in the future,
those can be added then.

>
> Thanks, that is enlightening!
Johannes Weiner April 28, 2022, 3:16 p.m. UTC | #15
On Thu, Apr 28, 2022 at 07:49:33AM -0700, Shakeel Butt wrote:
> On Thu, Apr 28, 2022 at 7:36 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Apr 27, 2022 at 04:36:22PM -0700, Shakeel Butt wrote:
> > > On Wed, Apr 27, 2022 at 3:32 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > On Wed, Apr 27, 2022 at 05:20:31PM -0400, Johannes Weiner wrote:
> > > > > On Wed, Apr 27, 2022 at 01:29:34PM -0700, Minchan Kim wrote:
> > > > > > Hi Johannes,
> > > > > >
> > > > > > On Wed, Apr 27, 2022 at 12:00:15PM -0400, Johannes Weiner wrote:
> > > > > > > Currently it requires poking at debugfs to figure out the size and
> > > > > > > population of the zswap cache on a host. There are no counters for
> > > > > > > reads and writes against the cache. As a result, it's difficult to
> > > > > > > understand zswap behavior on production systems.
> > > > > > >
> > > > > > > Print zswap memory consumption and how many pages are zswapped out in
> > > > > > > /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> > > > > > >
> > > > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > > ---
> > > > > > >  fs/proc/meminfo.c             |  7 +++++++
> > > > > > >  include/linux/swap.h          |  5 +++++
> > > > > > >  include/linux/vm_event_item.h |  4 ++++
> > > > > > >  mm/vmstat.c                   |  4 ++++
> > > > > > >  mm/zswap.c                    | 13 ++++++-------
> > > > > > >  5 files changed, 26 insertions(+), 7 deletions(-)
> > > > > > >
> > > > > > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > > > > > > index 6fa761c9cc78..6e89f0e2fd20 100644
> > > > > > > --- a/fs/proc/meminfo.c
> > > > > > > +++ b/fs/proc/meminfo.c
> > > > > > > @@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > > > > > >
> > > > > > >   show_val_kb(m, "SwapTotal:      ", i.totalswap);
> > > > > > >   show_val_kb(m, "SwapFree:       ", i.freeswap);
> > > > > > > +#ifdef CONFIG_ZSWAP
> > > > > > > + seq_printf(m,  "Zswap:          %8lu kB\n",
> > > > > > > +            (unsigned long)(zswap_pool_total_size >> 10));
> > > > > > > + seq_printf(m,  "Zswapped:       %8lu kB\n",
> > > > > > > +            (unsigned long)atomic_read(&zswap_stored_pages) <<
> > > > > > > +            (PAGE_SHIFT - 10));
> > > > > > > +#endif
> > > > > >
> > > > > > I agree it would be very handy to have the memory consumption in meminfo
> > > > > >
> > > > > > https://lore.kernel.org/all/YYwZXrL3Fu8%2FvLZw@google.com/
> > > > > >
> > > > > > If we really go this Zswap only metric instead of general term
> > > > > > "Compressed", I'd like to post maybe "Zram:" with same reason
> > > > > > in this patchset. Do you think that's better idea instead of
> > > > > > introducing general term like "Compressed:" or something else?
> > > > >
> > > > > I'm fine with changing it to Compressed. If somebody cares about a
> > > > > more detailed breakdown, we can add Zswap, Zram subsets as needed.
> > > >
> > > > It does raise the question what to do about cgroup, though. Should the
> > > > control files (memory.zswap.current & memory.zswap.max) apply to zram
> > > > in the future? If so, we should rename them, too.
> > > >
> > > > I'm not too familiar with zram, maybe you can provide some
> > > > background. AFAIU, Google uses zram quite widely; all the more
> > > > confusing why there is no container support for it yet.
> > > >
> > > > Could you shed some light?
> > > >
> > >
> > > I can shed light on the datacenter workloads. We use cgroup (still on
> > > v1) and zswap. For the workloads/applications, the swap (or zswap) is
> > > transparent in the sense that they are charged exactly the same
> > > irrespective of how much their memory is zswapped-out. Basically the
> > > applications see the same usage which is actually v1's
> > > memsw.usage_in_bytes. We dynamically increase the swap size if it is
> > > low, so we are not really worried about one job hogging the swap
> > > space.
> > >
> > > Regarding stats we actually do have them internally representing
> > > compressed size and number of pages in zswap. The compressed size is
> > > actually used for OOM victim selection. The memsw or v2's swap usage
> > > in the presence of compression based swap does not actually tell how
> > > much memory can potentially be released by evicting a job. For example
> > > if there are two jobs 'A' and 'B'. Both of them have 100 pages
> > > compressed but A's 100 pages are compressed to let's say 10 pages
> > > while B's 100 pages are compressed to 70 pages. It is preferable to
> > > kill B as that will release 70 pages. (This is a very simplified
> > > explanation of what we actually do).
> >
> > Ah, so zram is really only used by the mobile stuff after all.
> >
> > In the DC, I guess you don't use disk swap in conjunction with zswap,
> > so those writeback cache controls are less interesting to you?
> 
> Yes, we have some modifications to zswap to make it work without any
> backing real swap.

Not sure if you can share them, but I would be interested in those
changes. We have real backing swap, but because of the way swap
entries are allocated, pages stored in zswap will consume physical
disk slots. So on top of regular swap, you need to provision disk
space for zswap as well, which is unfortunate.

What could be useful is a separate swap entry address space that maps
zswap slots and disk slots alike. This would fix the above problem. It
would have the added benefit of making swapoff much simpler and faster
too, as it doesn't need to chase down page tables to free disk slots.
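
A very rough sketch of the indirection, purely to illustrate the idea.
Nothing like this exists in the tree today; struct vslot and the
helpers below are hypothetical, and a real design would need locking,
allocation and reference counting.

```c
#include <assert.h>
#include <stddef.h>

/* Where the data behind a (hypothetical) virtual swap slot currently lives. */
enum backing { BACK_NONE, BACK_ZSWAP, BACK_DISK };

struct vslot {
	enum backing where;
	unsigned long loc; /* zswap handle or disk slot, depending on 'where' */
};

/* Storing into zswap records a handle and consumes no disk slot. */
static void vslot_store_zswap(struct vslot *s, unsigned long handle)
{
	assert(s);
	s->where = BACK_ZSWAP;
	s->loc = handle;
}

/* Writeback retargets the slot; a disk slot is only consumed at this point. */
static void vslot_writeback(struct vslot *s, unsigned long disk_slot)
{
	assert(s && s->where == BACK_ZSWAP);
	s->where = BACK_DISK;
	s->loc = disk_slot;
}

/* Count how many virtual slots are actually backed by disk. */
static size_t disk_slots_used(const struct vslot *tbl, size_t n)
{
	size_t used = 0;

	assert(tbl || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].where == BACK_DISK)
			used++;
	}
	return used;
}
```

Page tables would only ever hold the virtual slot index, so moving data
between zswap and disk (or off a device at swapoff) would mean updating
the table entry rather than chasing down every mapping.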

> > But it sounds like you would benefit from the zswap(ped) counters in
> > memory.stat at least.
> 
> Yes and I think if we need zram specific counters/stats in future,
> those can be added then.

I agree.
Yang Shi April 28, 2022, 4:54 p.m. UTC | #16
On Thu, Apr 28, 2022 at 7:49 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Apr 28, 2022 at 7:36 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Apr 27, 2022 at 04:36:22PM -0700, Shakeel Butt wrote:
> > > On Wed, Apr 27, 2022 at 3:32 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > On Wed, Apr 27, 2022 at 05:20:31PM -0400, Johannes Weiner wrote:
> > > > > On Wed, Apr 27, 2022 at 01:29:34PM -0700, Minchan Kim wrote:
> > > > > > Hi Johannes,
> > > > > >
> > > > > > On Wed, Apr 27, 2022 at 12:00:15PM -0400, Johannes Weiner wrote:
> > > > > > > Currently it requires poking at debugfs to figure out the size and
> > > > > > > population of the zswap cache on a host. There are no counters for
> > > > > > > reads and writes against the cache. As a result, it's difficult to
> > > > > > > understand zswap behavior on production systems.
> > > > > > >
> > > > > > > Print zswap memory consumption and how many pages are zswapped out in
> > > > > > > /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> > > > > > >
> > > > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > > ---
> > > > > > >  fs/proc/meminfo.c             |  7 +++++++
> > > > > > >  include/linux/swap.h          |  5 +++++
> > > > > > >  include/linux/vm_event_item.h |  4 ++++
> > > > > > >  mm/vmstat.c                   |  4 ++++
> > > > > > >  mm/zswap.c                    | 13 ++++++-------
> > > > > > >  5 files changed, 26 insertions(+), 7 deletions(-)
> > > > > > >
> > > > > > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > > > > > > index 6fa761c9cc78..6e89f0e2fd20 100644
> > > > > > > --- a/fs/proc/meminfo.c
> > > > > > > +++ b/fs/proc/meminfo.c
> > > > > > > @@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > > > > > >
> > > > > > >   show_val_kb(m, "SwapTotal:      ", i.totalswap);
> > > > > > >   show_val_kb(m, "SwapFree:       ", i.freeswap);
> > > > > > > +#ifdef CONFIG_ZSWAP
> > > > > > > + seq_printf(m,  "Zswap:          %8lu kB\n",
> > > > > > > +            (unsigned long)(zswap_pool_total_size >> 10));
> > > > > > > + seq_printf(m,  "Zswapped:       %8lu kB\n",
> > > > > > > +            (unsigned long)atomic_read(&zswap_stored_pages) <<
> > > > > > > +            (PAGE_SHIFT - 10));
> > > > > > > +#endif
> > > > > >
> > > > > > I agree it would be very handy to have the memory consumption in meminfo
> > > > > >
> > > > > > https://lore.kernel.org/all/YYwZXrL3Fu8%2FvLZw@google.com/
> > > > > >
> > > > > > If we really go this Zswap only metric instead of general term
> > > > > > "Compressed", I'd like to post maybe "Zram:" with same reason
> > > > > > in this patchset. Do you think that's better idea instead of
> > > > > > introducing general term like "Compressed:" or something else?
> > > > >
> > > > > I'm fine with changing it to Compressed. If somebody cares about a
> > > > > more detailed breakdown, we can add Zswap, Zram subsets as needed.
> > > >
> > > > It does raise the question what to do about cgroup, though. Should the
> > > > control files (memory.zswap.current & memory.zswap.max) apply to zram
> > > > in the future? If so, we should rename them, too.
> > > >
> > > > I'm not too familiar with zram, maybe you can provide some
> > > > background. AFAIU, Google uses zram quite widely; all the more
> > > > confusing why there is no container support for it yet.
> > > >
> > > > Could you shed some light?
> > > >
> > >
> > > I can shed light on the datacenter workloads. We use cgroup (still on
> > > v1) and zswap. For the workloads/applications, the swap (or zswap) is
> > > transparent in the sense that they are charged exactly the same
> > > irrespective of how much their memory is zswapped-out. Basically the
> > > applications see the same usage which is actually v1's
> > > memsw.usage_in_bytes. We dynamically increase the swap size if it is
> > > low, so we are not really worried about one job hogging the swap
> > > space.
> > >
> > > Regarding stats we actually do have them internally representing
> > > compressed size and number of pages in zswap. The compressed size is
> > > actually used for OOM victim selection. The memsw or v2's swap usage
> > > in the presence of compression based swap does not actually tell how
> > > much memory can potentially be released by evicting a job. For example
> > > if there are two jobs 'A' and 'B'. Both of them have 100 pages
> > > compressed but A's 100 pages are compressed to let's say 10 pages
> > > while B's 100 pages are compressed to 70 pages. It is preferable to
> > > kill B as that will release 70 pages. (This is a very simplified
> > > explanation of what we actually do).
> >
> > Ah, so zram is really only used by the mobile stuff after all.
> >
> > In the DC, I guess you don't use disk swap in conjunction with zswap,
> > so those writeback cache controls are less interesting to you?
>
> Yes, we have some modifications to zswap to make it work without any
> backing real swap. Though there is a future plan to move to zram
> eventually.

Interesting. If so, why not simply use zram?

>
> >
> > But it sounds like you would benefit from the zswap(ped) counters in
> > memory.stat at least.
>
> Yes and I think if we need zram specific counters/stats in future,
> those can be added then.
>
> >
> > Thanks, that is enlightening!
>
Yang Shi April 28, 2022, 4:59 p.m. UTC | #17
On Thu, Apr 28, 2022 at 8:17 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Apr 28, 2022 at 07:49:33AM -0700, Shakeel Butt wrote:
> > On Thu, Apr 28, 2022 at 7:36 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, Apr 27, 2022 at 04:36:22PM -0700, Shakeel Butt wrote:
> > > > On Wed, Apr 27, 2022 at 3:32 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > >
> > > > > On Wed, Apr 27, 2022 at 05:20:31PM -0400, Johannes Weiner wrote:
> > > > > > On Wed, Apr 27, 2022 at 01:29:34PM -0700, Minchan Kim wrote:
> > > > > > > Hi Johannes,
> > > > > > >
> > > > > > > On Wed, Apr 27, 2022 at 12:00:15PM -0400, Johannes Weiner wrote:
> > > > > > > > Currently it requires poking at debugfs to figure out the size and
> > > > > > > > population of the zswap cache on a host. There are no counters for
> > > > > > > > reads and writes against the cache. As a result, it's difficult to
> > > > > > > > understand zswap behavior on production systems.
> > > > > > > >
> > > > > > > > Print zswap memory consumption and how many pages are zswapped out in
> > > > > > > > /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> > > > > > > >
> > > > > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > > > ---
> > > > > > > >  fs/proc/meminfo.c             |  7 +++++++
> > > > > > > >  include/linux/swap.h          |  5 +++++
> > > > > > > >  include/linux/vm_event_item.h |  4 ++++
> > > > > > > >  mm/vmstat.c                   |  4 ++++
> > > > > > > >  mm/zswap.c                    | 13 ++++++-------
> > > > > > > >  5 files changed, 26 insertions(+), 7 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > > > > > > > index 6fa761c9cc78..6e89f0e2fd20 100644
> > > > > > > > --- a/fs/proc/meminfo.c
> > > > > > > > +++ b/fs/proc/meminfo.c
> > > > > > > > @@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > > > > > > >
> > > > > > > >   show_val_kb(m, "SwapTotal:      ", i.totalswap);
> > > > > > > >   show_val_kb(m, "SwapFree:       ", i.freeswap);
> > > > > > > > +#ifdef CONFIG_ZSWAP
> > > > > > > > + seq_printf(m,  "Zswap:          %8lu kB\n",
> > > > > > > > +            (unsigned long)(zswap_pool_total_size >> 10));
> > > > > > > > + seq_printf(m,  "Zswapped:       %8lu kB\n",
> > > > > > > > +            (unsigned long)atomic_read(&zswap_stored_pages) <<
> > > > > > > > +            (PAGE_SHIFT - 10));
> > > > > > > > +#endif
> > > > > > >
> > > > > > > I agree it would be very handy to have the memory consumption in meminfo
> > > > > > >
> > > > > > > https://lore.kernel.org/all/YYwZXrL3Fu8%2FvLZw@google.com/
> > > > > > >
> > > > > > > If we really go this Zswap only metric instead of general term
> > > > > > > "Compressed", I'd like to post maybe "Zram:" with same reason
> > > > > > > in this patchset. Do you think that's better idea instead of
> > > > > > > introducing general term like "Compressed:" or something else?
> > > > > >
> > > > > > I'm fine with changing it to Compressed. If somebody cares about a
> > > > > > more detailed breakdown, we can add Zswap, Zram subsets as needed.
> > > > >
> > > > > It does raise the question what to do about cgroup, though. Should the
> > > > > control files (memory.zswap.current & memory.zswap.max) apply to zram
> > > > > in the future? If so, we should rename them, too.
> > > > >
> > > > > I'm not too familiar with zram, maybe you can provide some
> > > > > background. AFAIU, Google uses zram quite widely; all the more
> > > > > confusing why there is no container support for it yet.
> > > > >
> > > > > Could you shed some light?
> > > > >
> > > >
> > > > I can shed light on the datacenter workloads. We use cgroup (still on
> > > > v1) and zswap. For the workloads/applications, the swap (or zswap) is
> > > > transparent in the sense that they are charged exactly the same
> > > > irrespective of how much their memory is zswapped-out. Basically the
> > > > applications see the same usage which is actually v1's
> > > > memsw.usage_in_bytes. We dynamically increase the swap size if it is
> > > > low, so we are not really worried about one job hogging the swap
> > > > space.
> > > >
> > > > Regarding stats we actually do have them internally representing
> > > > compressed size and number of pages in zswap. The compressed size is
> > > > actually used for OOM victim selection. The memsw or v2's swap usage
> > > > in the presence of compression based swap does not actually tell how
> > > > much memory can potentially be released by evicting a job. For example
> > > > if there are two jobs 'A' and 'B'. Both of them have 100 pages
> > > > compressed but A's 100 pages are compressed to let's say 10 pages
> > > > while B's 100 pages are compressed to 70 pages. It is preferable to
> > > > kill B as that will release 70 pages. (This is a very simplified
> > > > explanation of what we actually do).
> > >
> > > Ah, so zram is really only used by the mobile stuff after all.
> > >
> > > In the DC, I guess you don't use disk swap in conjunction with zswap,
> > > so those writeback cache controls are less interesting to you?
> >
> > Yes, we have some modifications to zswap to make it work without any
> > backing real swap.
>
> Not sure if you can share them, but I would be interested in those
> changes. We have real backing swap, but because of the way swap
> entries are allocated, pages stored in zswap will consume physical
> disk slots. So on top of regular swap, you need to provision disk
> space for zswap as well, which is unfortunate.

Yes, exactly. For our use case I noticed the swap backend is used up,
but there is no writeback from zswap to the swap backend at all. The
bright side is that this may mean the compression ratio is high for our
workload, but the disk space is actually wasted.
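
For illustration only: with the Zswap and Zswapped lines this patch adds
to /proc/meminfo, the compression ratio could be estimated from
userspace roughly as below. This is a minimal sketch, not part of the
patch. The helper names meminfo_kb() and zswap_ratio() are hypothetical,
and the caller is assumed to have already read /proc/meminfo into a
NUL-terminated buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: look up "<key>: <value> kB" in a
 * /proc/meminfo-style buffer. Returns 0 and stores the value in *kb
 * on success, -1 if the key is absent or malformed.
 */
static int meminfo_kb(const char *buf, const char *key, unsigned long *kb)
{
	const char *p = buf;
	size_t klen;

	assert(buf && key && kb);
	klen = strlen(key);

	while (p && *p) {
		/* Match the key only at the start of a line, followed by ':'. */
		if (strncmp(p, key, klen) == 0 && p[klen] == ':')
			return sscanf(p + klen + 1, "%lu", kb) == 1 ? 0 : -1;
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return -1;
}

/*
 * Compression ratio = Zswapped (uncompressed size of stored pages)
 * divided by Zswap (memory consumed by the pool). Returns 0.0 when
 * the pool is empty or either field is missing.
 */
static double zswap_ratio(const char *meminfo)
{
	unsigned long pool_kb, stored_kb;

	if (meminfo_kb(meminfo, "Zswap", &pool_kb) ||
	    meminfo_kb(meminfo, "Zswapped", &stored_kb) || pool_kb == 0)
		return 0.0;
	return (double)stored_kb / (double)pool_kb;
}
```

A ratio well above 1 would be consistent with the "high compression
ratio, no writeback" situation described above, though it says nothing
about how many disk slots are reserved behind those pages.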

>
> What could be useful is a separate swap entry address space that maps
> zswap slots and disk slots alike. This would fix the above problem. It
> would have the added benefit of making swapoff much simpler and faster
> too, as it doesn't need to chase down page tables to free disk slots.

I was thinking about this too, but it doesn't seem easy: the swap slot
on the swap backend is allocated when the page is added to swap, but no
entry is allocated on zswap, since zswap is just a cache and invisible
to vmscan. If we had separate entries for zswap and the swap backend, it
would be complicated to convert zswap entries to swap backend entries,
since we may have to traverse the rmap to find all the PTEs mapped to a
zswap entry in order to convert them to swap backend entries.

>
> > > But it sounds like you would benefit from the zswap(ped) counters in
> > > memory.stat at least.
> >
> > Yes and I think if we need zram specific counters/stats in future,
> > those can be added then.
>
> I agree.
>
Minchan Kim April 28, 2022, 4:59 p.m. UTC | #18
On Thu, Apr 28, 2022 at 10:25:59AM -0400, Johannes Weiner wrote:
> On Wed, Apr 27, 2022 at 03:16:48PM -0700, Minchan Kim wrote:
> > On Wed, Apr 27, 2022 at 05:20:29PM -0400, Johannes Weiner wrote:
> > > On Wed, Apr 27, 2022 at 01:29:34PM -0700, Minchan Kim wrote:
> > > > Hi Johannes,
> > > > 
> > > > On Wed, Apr 27, 2022 at 12:00:15PM -0400, Johannes Weiner wrote:
> > > > > Currently it requires poking at debugfs to figure out the size and
> > > > > population of the zswap cache on a host. There are no counters for
> > > > > reads and writes against the cache. As a result, it's difficult to
> > > > > understand zswap behavior on production systems.
> > > > > 
> > > > > Print zswap memory consumption and how many pages are zswapped out in
> > > > > /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> > > > > 
> > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > ---
> > > > >  fs/proc/meminfo.c             |  7 +++++++
> > > > >  include/linux/swap.h          |  5 +++++
> > > > >  include/linux/vm_event_item.h |  4 ++++
> > > > >  mm/vmstat.c                   |  4 ++++
> > > > >  mm/zswap.c                    | 13 ++++++-------
> > > > >  5 files changed, 26 insertions(+), 7 deletions(-)
> > > > > 
> > > > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > > > > index 6fa761c9cc78..6e89f0e2fd20 100644
> > > > > --- a/fs/proc/meminfo.c
> > > > > +++ b/fs/proc/meminfo.c
> > > > > @@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > > > >  
> > > > >  	show_val_kb(m, "SwapTotal:      ", i.totalswap);
> > > > >  	show_val_kb(m, "SwapFree:       ", i.freeswap);
> > > > > +#ifdef CONFIG_ZSWAP
> > > > > +	seq_printf(m,  "Zswap:          %8lu kB\n",
> > > > > +		   (unsigned long)(zswap_pool_total_size >> 10));
> > > > > +	seq_printf(m,  "Zswapped:       %8lu kB\n",
> > > > > +		   (unsigned long)atomic_read(&zswap_stored_pages) <<
> > > > > +		   (PAGE_SHIFT - 10));
> > > > > +#endif
> > > > 
> > > > I agree it would be very handy to have the memory consumption in meminfo
> > > > 
> > > > https://lore.kernel.org/all/YYwZXrL3Fu8%2FvLZw@google.com/
> > > > 
> > > > If we really go this Zswap only metric instead of general term
> > > > "Compressed", I'd like to post maybe "Zram:" with same reason
> > > > in this patchset. Do you think that's better idea instead of
> > > > introducing general term like "Compressed:" or something else?
> > > 
> > > I'm fine with changing it to Compressed. If somebody cares about a
> > > more detailed breakdown, we can add Zswap, Zram subsets as needed.
> > 
> > Thanks! Please consider ZSWPIN to rename more general term, too.
> 
> That doesn't make sense to me.
> 
> Zram is a swap backend, its traffic is accounted in PSWPIN/OUT. Zswap
> is a writeback cache on top of the swap backend. It has pages
> entering, refaulting, and being written back to the swap backend
> (PSWPOUT). A zswpout and a zramout are different things.

Think about a system that has two swap devices (storage + zram).
I think it's useful to know how much swap IO comes from zram
and how much comes from storage.
Minchan Kim April 28, 2022, 5:02 p.m. UTC | #19
On Thu, Apr 28, 2022 at 10:05:13AM -0400, Johannes Weiner wrote:
> On Wed, Apr 27, 2022 at 03:12:17PM -0700, Minchan Kim wrote:
> > On Wed, Apr 27, 2022 at 05:36:26PM -0400, Johannes Weiner wrote:
> > > On Wed, Apr 27, 2022 at 05:20:31PM -0400, Johannes Weiner wrote:
> > > > On Wed, Apr 27, 2022 at 01:29:34PM -0700, Minchan Kim wrote:
> > > > > Hi Johannes,
> > > > > 
> > > > > On Wed, Apr 27, 2022 at 12:00:15PM -0400, Johannes Weiner wrote:
> > > > > > Currently it requires poking at debugfs to figure out the size and
> > > > > > population of the zswap cache on a host. There are no counters for
> > > > > > reads and writes against the cache. As a result, it's difficult to
> > > > > > understand zswap behavior on production systems.
> > > > > > 
> > > > > > Print zswap memory consumption and how many pages are zswapped out in
> > > > > > /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> > > > > > 
> > > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > ---
> > > > > >  fs/proc/meminfo.c             |  7 +++++++
> > > > > >  include/linux/swap.h          |  5 +++++
> > > > > >  include/linux/vm_event_item.h |  4 ++++
> > > > > >  mm/vmstat.c                   |  4 ++++
> > > > > >  mm/zswap.c                    | 13 ++++++-------
> > > > > >  5 files changed, 26 insertions(+), 7 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > > > > > index 6fa761c9cc78..6e89f0e2fd20 100644
> > > > > > --- a/fs/proc/meminfo.c
> > > > > > +++ b/fs/proc/meminfo.c
> > > > > > @@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > > > > >  
> > > > > >  	show_val_kb(m, "SwapTotal:      ", i.totalswap);
> > > > > >  	show_val_kb(m, "SwapFree:       ", i.freeswap);
> > > > > > +#ifdef CONFIG_ZSWAP
> > > > > > +	seq_printf(m,  "Zswap:          %8lu kB\n",
> > > > > > +		   (unsigned long)(zswap_pool_total_size >> 10));
> > > > > > +	seq_printf(m,  "Zswapped:       %8lu kB\n",
> > > > > > +		   (unsigned long)atomic_read(&zswap_stored_pages) <<
> > > > > > +		   (PAGE_SHIFT - 10));
> > > > > > +#endif
> > > > > 
> > > > > I agree it would be very handy to have the memory consumption in meminfo
> > > > > 
> > > > > https://lore.kernel.org/all/YYwZXrL3Fu8%2FvLZw@google.com/
> > > > > 
> > > > > If we really go this Zswap only metric instead of general term
> > > > > "Compressed", I'd like to post maybe "Zram:" with same reason
> > > > > in this patchset. Do you think that's better idea instead of
> > > > > introducing general term like "Compressed:" or something else?
> > > > 
> > > > I'm fine with changing it to Compressed. If somebody cares about a
> > > > more detailed breakdown, we can add Zswap, Zram subsets as needed.
> > > 
> > > It does raise the question what to do about cgroup, though. Should the
> > > control files (memory.zswap.current & memory.zswap.max) apply to zram
> > > in the future? If so, we should rename them, too.
> > > 
> > > I'm not too familiar with zram, maybe you can provide some
> > > background. AFAIU, Google uses zram quite widely; all the more
> > > confusing why there is no container support for it yet.
> > 
> > My usecase with zram is Android which doesn't use memcg.
> 
> Ok.
> 
> After more thought, my take is that in the future it could make sense
> to track zram pages in a cgroup's memory.current. But it should NOT be
> included in the dedicated memory.zswap.* files. Zswap is an in-kernel
> writeback cache, and those files allow userspace to tune writeback
> thresholds depending on the composition of the workload's
> workingset. This doesn't translate to zram: the wb facility that it
> has is triggered by hand, based on criteria such as idle pages and
> compression rate. It's not based on size. From a cgroup POV, it's a
> memory consumer that should be subject to memory.max, nothing more.
> 
> This distinction applies to meminfo as well, though. While I think it
> makes sense to have a combined "Compressed" counter for zram and
> zswap, it's still important to understand zswap behavior on its own to
> tune the system-wide writeback threshold in max_pool_percent. (And
> again, while zram can also be limited, it's not a writeback threshold,
> it's just a red line for returning -ENOMEM).
> 
> So I'm going to keep the Zswap and Zswapped items and retract the
> delta patch for renaming it to Compressed.
> 
> But I'd ack a patch that adds a combined "Compressed" counter for zram
> + zswap if you send it, Minchan.

If we really want separate stats for zswap and zram, it would
be better to use the direct name "Zram:" instead of "Compressed".
Johannes Weiner April 28, 2022, 5:23 p.m. UTC | #20
On Thu, Apr 28, 2022 at 09:59:53AM -0700, Minchan Kim wrote:
> On Thu, Apr 28, 2022 at 10:25:59AM -0400, Johannes Weiner wrote:
> > On Wed, Apr 27, 2022 at 03:16:48PM -0700, Minchan Kim wrote:
> > > On Wed, Apr 27, 2022 at 05:20:29PM -0400, Johannes Weiner wrote:
> > > > On Wed, Apr 27, 2022 at 01:29:34PM -0700, Minchan Kim wrote:
> > > > > Hi Johannes,
> > > > > 
> > > > > On Wed, Apr 27, 2022 at 12:00:15PM -0400, Johannes Weiner wrote:
> > > > > > Currently it requires poking at debugfs to figure out the size and
> > > > > > population of the zswap cache on a host. There are no counters for
> > > > > > reads and writes against the cache. As a result, it's difficult to
> > > > > > understand zswap behavior on production systems.
> > > > > > 
> > > > > > Print zswap memory consumption and how many pages are zswapped out in
> > > > > > /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> > > > > > 
> > > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > ---
> > > > > >  fs/proc/meminfo.c             |  7 +++++++
> > > > > >  include/linux/swap.h          |  5 +++++
> > > > > >  include/linux/vm_event_item.h |  4 ++++
> > > > > >  mm/vmstat.c                   |  4 ++++
> > > > > >  mm/zswap.c                    | 13 ++++++-------
> > > > > >  5 files changed, 26 insertions(+), 7 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > > > > > index 6fa761c9cc78..6e89f0e2fd20 100644
> > > > > > --- a/fs/proc/meminfo.c
> > > > > > +++ b/fs/proc/meminfo.c
> > > > > > @@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > > > > >  
> > > > > >  	show_val_kb(m, "SwapTotal:      ", i.totalswap);
> > > > > >  	show_val_kb(m, "SwapFree:       ", i.freeswap);
> > > > > > +#ifdef CONFIG_ZSWAP
> > > > > > +	seq_printf(m,  "Zswap:          %8lu kB\n",
> > > > > > +		   (unsigned long)(zswap_pool_total_size >> 10));
> > > > > > +	seq_printf(m,  "Zswapped:       %8lu kB\n",
> > > > > > +		   (unsigned long)atomic_read(&zswap_stored_pages) <<
> > > > > > +		   (PAGE_SHIFT - 10));
> > > > > > +#endif
> > > > > 
> > > > > I agree it would be very handy to have the memory consumption in meminfo
> > > > > 
> > > > > https://lore.kernel.org/all/YYwZXrL3Fu8%2FvLZw@google.com/
> > > > > 
> > > > > If we really go this Zswap only metric instead of general term
> > > > > "Compressed", I'd like to post maybe "Zram:" with same reason
> > > > > in this patchset. Do you think that's better idea instead of
> > > > > introducing general term like "Compressed:" or something else?
> > > > 
> > > > I'm fine with changing it to Compressed. If somebody cares about a
> > > > more detailed breakdown, we can add Zswap, Zram subsets as needed.
> > > 
> > > Thanks! Please consider ZSWPIN to rename more general term, too.
> > 
> > That doesn't make sense to me.
> > 
> > Zram is a swap backend, its traffic is accounted in PSWPIN/OUT. Zswap
> > is a writeback cache on top of the swap backend. It has pages
> > entering, refaulting, and being written back to the swap backend
> > (PSWPOUT). A zswpout and a zramout are different things.
> 
> Think about a system that has two swap devices (storage + zram).
> I think it's useful to know how much swap IO comes from zram
> and how much comes from storage.

Hm, isn't this comparable to having one swap on flash and one swap on
a rotating disk? /sys/block/*/stat should be able to tell you how
traffic is distributed, no?
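
As a rough sketch of what I mean (illustration only, not part of the
patch): each /sys/block/<dev>/stat file is a single line whose third
and seventh fields are sectors read and sectors written, in 512-byte
units. Comparing those per swap device shows the split. The struct and
function names below are hypothetical, and the caller is assumed to
have read the stat line into a string already.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for the two fields of interest. */
struct blk_traffic {
	unsigned long long read_sectors;   /* field 3, 512-byte units */
	unsigned long long write_sectors;  /* field 7, 512-byte units */
};

/*
 * Parse one line from /sys/block/<dev>/stat. The leading fields are:
 * read I/Os, read merges, read sectors, read ticks,
 * write I/Os, write merges, write sectors, ...
 * Returns 0 on success, -1 if the line has fewer than 7 numeric fields.
 */
static int parse_block_stat(const char *line, struct blk_traffic *t)
{
	unsigned long long f[7];

	assert(line && t);
	if (sscanf(line, "%llu %llu %llu %llu %llu %llu %llu",
		   &f[0], &f[1], &f[2], &f[3], &f[4], &f[5], &f[6]) != 7)
		return -1;

	t->read_sectors = f[2];
	t->write_sectors = f[6];
	return 0;
}
```

Running this over the stat files of, say, a zram device and a disk
swap device would give the per-backend swap traffic without any new
vmstat counters, assuming the devices are used for swap only.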

What I'm more worried about is the fact that in theory you can stack
zswap on top of zram. Consider a fast compression cache on top of a
higher compression backend. Is somebody doing this now? I doubt
it. But as people look into memory tiering more and more, this doesn't
sound entirely implausible. If the stacked layers then share the same
in/out events, it would be quite confusing.

If you think PSWPIN/OUT and per-device stats aren't enough, I'm not
opposed to adding zramin/out to /proc/vmstat as well. I think we're
less worried there than with /proc/meminfo. I'd just prefer to keep
them separate from the zswap events.

Does that sound reasonable?
Johannes Weiner April 28, 2022, 5:27 p.m. UTC | #21
On Thu, Apr 28, 2022 at 10:02:46AM -0700, Minchan Kim wrote:
> On Thu, Apr 28, 2022 at 10:05:13AM -0400, Johannes Weiner wrote:
> > But I'd ack a patch that adds a combined "Compressed" counter for zram
> > + zswap if you send it, Minchan.
> 
> If we really want separate stats for zswap and zram, it would
> be better to use the direct name "Zram:" instead of "Compressed".

That works for me as well.
Minchan Kim April 28, 2022, 5:31 p.m. UTC | #22
On Thu, Apr 28, 2022 at 01:23:21PM -0400, Johannes Weiner wrote:
> On Thu, Apr 28, 2022 at 09:59:53AM -0700, Minchan Kim wrote:
> > On Thu, Apr 28, 2022 at 10:25:59AM -0400, Johannes Weiner wrote:
> > > On Wed, Apr 27, 2022 at 03:16:48PM -0700, Minchan Kim wrote:
> > > > On Wed, Apr 27, 2022 at 05:20:29PM -0400, Johannes Weiner wrote:
> > > > > On Wed, Apr 27, 2022 at 01:29:34PM -0700, Minchan Kim wrote:
> > > > > > Hi Johannes,
> > > > > > 
> > > > > > On Wed, Apr 27, 2022 at 12:00:15PM -0400, Johannes Weiner wrote:
> > > > > > > Currently it requires poking at debugfs to figure out the size and
> > > > > > > population of the zswap cache on a host. There are no counters for
> > > > > > > reads and writes against the cache. As a result, it's difficult to
> > > > > > > understand zswap behavior on production systems.
> > > > > > > 
> > > > > > > Print zswap memory consumption and how many pages are zswapped out in
> > > > > > > /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> > > > > > > 
> > > > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > > ---
> > > > > > >  fs/proc/meminfo.c             |  7 +++++++
> > > > > > >  include/linux/swap.h          |  5 +++++
> > > > > > >  include/linux/vm_event_item.h |  4 ++++
> > > > > > >  mm/vmstat.c                   |  4 ++++
> > > > > > >  mm/zswap.c                    | 13 ++++++-------
> > > > > > >  5 files changed, 26 insertions(+), 7 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > > > > > > index 6fa761c9cc78..6e89f0e2fd20 100644
> > > > > > > --- a/fs/proc/meminfo.c
> > > > > > > +++ b/fs/proc/meminfo.c
> > > > > > > @@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > > > > > >  
> > > > > > >  	show_val_kb(m, "SwapTotal:      ", i.totalswap);
> > > > > > >  	show_val_kb(m, "SwapFree:       ", i.freeswap);
> > > > > > > +#ifdef CONFIG_ZSWAP
> > > > > > > +	seq_printf(m,  "Zswap:          %8lu kB\n",
> > > > > > > +		   (unsigned long)(zswap_pool_total_size >> 10));
> > > > > > > +	seq_printf(m,  "Zswapped:       %8lu kB\n",
> > > > > > > +		   (unsigned long)atomic_read(&zswap_stored_pages) <<
> > > > > > > +		   (PAGE_SHIFT - 10));
> > > > > > > +#endif
> > > > > > 
> > > > > > I agree it would be very handy to have the memory consumption in meminfo
> > > > > > 
> > > > > > https://lore.kernel.org/all/YYwZXrL3Fu8%2FvLZw@google.com/
> > > > > > 
> > > > > > If we really go this Zswap only metric instead of general term
> > > > > > "Compressed", I'd like to post maybe "Zram:" with same reason
> > > > > > in this patchset. Do you think that's better idea instead of
> > > > > > introducing general term like "Compressed:" or something else?
> > > > > 
> > > > > I'm fine with changing it to Compressed. If somebody cares about a
> > > > > more detailed breakdown, we can add Zswap, Zram subsets as needed.
> > > > 
> > > > Thanks! Please consider ZSWPIN to rename more general term, too.
> > > 
> > > That doesn't make sense to me.
> > > 
> > > Zram is a swap backend, its traffic is accounted in PSWPIN/OUT. Zswap
> > > is a writeback cache on top of the swap backend. It has pages
> > > entering, refaulting, and being written back to the swap backend
> > > (PSWPOUT). A zswpout and a zramout are different things.
> > 
> > Think about a system that has two swap devices (storage + zram).
> > I think it's useful to know how much swap IO comes from zram
> > and how much comes from storage.
> 
> Hm, isn't this comparable to having one swap on flash and one swap on
> a rotating disk? /sys/block/*/stat should be able to tell you how
> traffic is distributed, no?

That raises the same question for me. Couldn't you also look at the
zswap stats instead of adding them to vmstat? (If zswap doesn't have
the counter, couldn't we simply add a new stat in sysfs?)

I thought the patch aimed to expose statistics so that they are easier
to grab via the popular meminfo and vmstat, and I wanted to leverage
that for zram, too.

> 
> What I'm more worried about is the fact that in theory you can stack
> zswap on top of zram. Consider a fast compression cache on top of a
> higher compression backend. Is somebody doing this now? I doubt
> it. But as people look into memory tiering more and more, this doesn't
> sound entirely implausible. If the stacked layers then share the same
> in/out events, it would be quite confusing.
> 
> If you think PSWPIN/OUT and per-device stats aren't enough, I'm not
> opposed to adding zramin/out to /proc/vmstat as well. I think we're
> less worried there than with /proc/meminfo. I'd just prefer to keep
> them separate from the zswap events.
> 
> Does that sound reasonable?
>
Johannes Weiner April 28, 2022, 6:34 p.m. UTC | #23
On Thu, Apr 28, 2022 at 10:31:45AM -0700, Minchan Kim wrote:
> On Thu, Apr 28, 2022 at 01:23:21PM -0400, Johannes Weiner wrote:
> > On Thu, Apr 28, 2022 at 09:59:53AM -0700, Minchan Kim wrote:
> > > On Thu, Apr 28, 2022 at 10:25:59AM -0400, Johannes Weiner wrote:
> > > > On Wed, Apr 27, 2022 at 03:16:48PM -0700, Minchan Kim wrote:
> > > > > On Wed, Apr 27, 2022 at 05:20:29PM -0400, Johannes Weiner wrote:
> > > > > > On Wed, Apr 27, 2022 at 01:29:34PM -0700, Minchan Kim wrote:
> > > > > > > Hi Johannes,
> > > > > > > 
> > > > > > > On Wed, Apr 27, 2022 at 12:00:15PM -0400, Johannes Weiner wrote:
> > > > > > > > Currently it requires poking at debugfs to figure out the size and
> > > > > > > > population of the zswap cache on a host. There are no counters for
> > > > > > > > reads and writes against the cache. As a result, it's difficult to
> > > > > > > > understand zswap behavior on production systems.
> > > > > > > > 
> > > > > > > > Print zswap memory consumption and how many pages are zswapped out in
> > > > > > > > /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> > > > > > > > 
> > > > > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > > > ---
> > > > > > > >  fs/proc/meminfo.c             |  7 +++++++
> > > > > > > >  include/linux/swap.h          |  5 +++++
> > > > > > > >  include/linux/vm_event_item.h |  4 ++++
> > > > > > > >  mm/vmstat.c                   |  4 ++++
> > > > > > > >  mm/zswap.c                    | 13 ++++++-------
> > > > > > > >  5 files changed, 26 insertions(+), 7 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > > > > > > > index 6fa761c9cc78..6e89f0e2fd20 100644
> > > > > > > > --- a/fs/proc/meminfo.c
> > > > > > > > +++ b/fs/proc/meminfo.c
> > > > > > > > @@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > > > > > > >  
> > > > > > > >  	show_val_kb(m, "SwapTotal:      ", i.totalswap);
> > > > > > > >  	show_val_kb(m, "SwapFree:       ", i.freeswap);
> > > > > > > > +#ifdef CONFIG_ZSWAP
> > > > > > > > +	seq_printf(m,  "Zswap:          %8lu kB\n",
> > > > > > > > +		   (unsigned long)(zswap_pool_total_size >> 10));
> > > > > > > > +	seq_printf(m,  "Zswapped:       %8lu kB\n",
> > > > > > > > +		   (unsigned long)atomic_read(&zswap_stored_pages) <<
> > > > > > > > +		   (PAGE_SHIFT - 10));
> > > > > > > > +#endif
> > > > > > > 
> > > > > > > I agree it would be very handy to have the memory consumption in meminfo
> > > > > > > 
> > > > > > > https://lore.kernel.org/all/YYwZXrL3Fu8%2FvLZw@google.com/
> > > > > > > 
> > > > > > > If we really go this Zswap only metric instead of general term
> > > > > > > "Compressed", I'd like to post maybe "Zram:" with same reason
> > > > > > > in this patchset. Do you think that's better idea instead of
> > > > > > > introducing general term like "Compressed:" or something else?
> > > > > > 
> > > > > > I'm fine with changing it to Compressed. If somebody cares about a
> > > > > > more detailed breakdown, we can add Zswap, Zram subsets as needed.
> > > > > 
> > > > > Thanks! Please consider ZSWPIN to rename more general term, too.
> > > > 
> > > > That doesn't make sense to me.
> > > > 
> > > > Zram is a swap backend, its traffic is accounted in PSWPIN/OUT. Zswap
> > > > is a writeback cache on top of the swap backend. It has pages
> > > > entering, refaulting, and being written back to the swap backend
> > > > (PSWPOUT). A zswpout and a zramout are different things.
> > > 
> > > Think about a system that has two swap devices (storage + zram).
> > > I think it's useful to know how much swap IO comes from zram
> > > and how much comes from storage.
> > 
> > Hm, isn't this comparable to having one swap on flash and one swap on
> > a rotating disk? /sys/block/*/stat should be able to tell you how
> > traffic is distributed, no?
> 
> That raises the same question for me. Couldn't you also look at the
> zswap stats instead of adding them to vmstat? (If zswap doesn't have
> the counter, couldn't we simply add a new stat in sysfs?)

My point is that for regular swap backends there is already
PSWP*. Distinguishing traffic between two swap backends is legitimate
of course, but zram is not really special compared to other backends
from that POV. It's only special in its memory consumption.

zswap *is* special, though. Even though some people use it *like* a
swap backend, it's also a cache on top of swap. zswap loads and stores
do not show up in PSWP*. And they shouldn't, because in a cache
configuration, you still need the separate PSWP* stats to understand
cache eviction behavior and cache miss ratio. memory -> zswap is
ZSWPOUT; zswap -> disk is PSWPOUT; PSWPIN is a cache miss etc.
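
To make the cache view concrete, here is a minimal sketch (illustration
only, not part of the patch) that derives a miss ratio from
/proc/vmstat-style text. It assumes the new events appear as "zswpin"
and "zswpout" next to the existing "pswpin" and "pswpout", and that
every swapin is either a zswap hit or a read from the backing device.
The helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: find "<name> <value>" at the start of a line in
 * a /proc/vmstat-style buffer. Returns 0 on success, -1 if not found.
 */
static int vmstat_counter(const char *buf, const char *name,
			  unsigned long long *val)
{
	const char *p = buf;
	size_t nlen;

	assert(buf && name && val);
	nlen = strlen(name);

	while (p && *p) {
		/* Require a space after the name so "pswpin" != "pswpinfoo". */
		if (strncmp(p, name, nlen) == 0 && p[nlen] == ' ')
			return sscanf(p + nlen, "%llu", val) == 1 ? 0 : -1;
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return -1;
}

/*
 * Miss ratio = pswpin / (pswpin + zswpin): the share of swapins that
 * had to go to the backing device instead of being served from zswap.
 * Returns -1.0 if a counter is missing or no swapins happened yet.
 */
static double zswap_miss_ratio(const char *vmstat)
{
	unsigned long long pswpin, zswpin;

	if (vmstat_counter(vmstat, "pswpin", &pswpin) ||
	    vmstat_counter(vmstat, "zswpin", &zswpin) ||
	    pswpin + zswpin == 0)
		return -1.0;
	return (double)pswpin / (double)(pswpin + zswpin);
}
```

The counters are cumulative since boot, so in practice one would
sample twice and use the deltas. Folding zram traffic into the same
events would make this ratio meaningless.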

> I thought the patch aimed to expose statistics so that they are easier
> to grab via the popular meminfo and vmstat, and I wanted to leverage
> that for zram, too.

Right. zram and zswap overlap in their functionality and have similar
deficits in their stats. Both should be fixed, I'm not opposing
that. But IMO we should be careful about conflating
them. Fundamentally, one is a block device, the other is an MM-native
cache layer that sits on top of block devices. Drawing false
equivalencies between them will come back to haunt us.
Minchan Kim April 28, 2022, 7:58 p.m. UTC | #24
On Thu, Apr 28, 2022 at 02:34:28PM -0400, Johannes Weiner wrote:
> On Thu, Apr 28, 2022 at 10:31:45AM -0700, Minchan Kim wrote:
> > On Thu, Apr 28, 2022 at 01:23:21PM -0400, Johannes Weiner wrote:
> > > On Thu, Apr 28, 2022 at 09:59:53AM -0700, Minchan Kim wrote:
> > > > On Thu, Apr 28, 2022 at 10:25:59AM -0400, Johannes Weiner wrote:
> > > > > On Wed, Apr 27, 2022 at 03:16:48PM -0700, Minchan Kim wrote:
> > > > > > On Wed, Apr 27, 2022 at 05:20:29PM -0400, Johannes Weiner wrote:
> > > > > > > On Wed, Apr 27, 2022 at 01:29:34PM -0700, Minchan Kim wrote:
> > > > > > > > Hi Johannes,
> > > > > > > > 
> > > > > > > > On Wed, Apr 27, 2022 at 12:00:15PM -0400, Johannes Weiner wrote:
> > > > > > > > > Currently it requires poking at debugfs to figure out the size and
> > > > > > > > > population of the zswap cache on a host. There are no counters for
> > > > > > > > > reads and writes against the cache. As a result, it's difficult to
> > > > > > > > > understand zswap behavior on production systems.
> > > > > > > > > 
> > > > > > > > > Print zswap memory consumption and how many pages are zswapped out in
> > > > > > > > > /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > > > > > > ---
> > > > > > > > >  fs/proc/meminfo.c             |  7 +++++++
> > > > > > > > >  include/linux/swap.h          |  5 +++++
> > > > > > > > >  include/linux/vm_event_item.h |  4 ++++
> > > > > > > > >  mm/vmstat.c                   |  4 ++++
> > > > > > > > >  mm/zswap.c                    | 13 ++++++-------
> > > > > > > > >  5 files changed, 26 insertions(+), 7 deletions(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > > > > > > > > index 6fa761c9cc78..6e89f0e2fd20 100644
> > > > > > > > > --- a/fs/proc/meminfo.c
> > > > > > > > > +++ b/fs/proc/meminfo.c
> > > > > > > > > @@ -86,6 +86,13 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > > > > > > > >  
> > > > > > > > >  	show_val_kb(m, "SwapTotal:      ", i.totalswap);
> > > > > > > > >  	show_val_kb(m, "SwapFree:       ", i.freeswap);
> > > > > > > > > +#ifdef CONFIG_ZSWAP
> > > > > > > > > +	seq_printf(m,  "Zswap:          %8lu kB\n",
> > > > > > > > > +		   (unsigned long)(zswap_pool_total_size >> 10));
> > > > > > > > > +	seq_printf(m,  "Zswapped:       %8lu kB\n",
> > > > > > > > > +		   (unsigned long)atomic_read(&zswap_stored_pages) <<
> > > > > > > > > +		   (PAGE_SHIFT - 10));
> > > > > > > > > +#endif
> > > > > > > > 
> > > > > > > > I agree it would be very handy to have the memory consumption in meminfo
> > > > > > > > 
> > > > > > > > https://lore.kernel.org/all/YYwZXrL3Fu8%2FvLZw@google.com/
> > > > > > > > 
> > > > > > > > If we really go this Zswap only metric instead of general term
> > > > > > > > "Compressed", I'd like to post maybe "Zram:" with same reason
> > > > > > > > in this patchset. Do you think that's better idea instead of
> > > > > > > > introducing general term like "Compressed:" or something else?
> > > > > > > 
> > > > > > > I'm fine with changing it to Compressed. If somebody cares about a
> > > > > > > more detailed breakdown, we can add Zswap, Zram subsets as needed.
> > > > > > 
> > > > > > Thanks! Please consider ZSWPIN to rename more general term, too.
> > > > > 
> > > > > That doesn't make sense to me.
> > > > > 
> > > > > Zram is a swap backend, its traffic is accounted in PSWPIN/OUT. Zswap
> > > > > is a writeback cache on top of the swap backend. It has pages
> > > > > entering, refaulting, and being written back to the swap backend
> > > > > (PSWPOUT). A zswpout and a zramout are different things.
> > > > 
> > > > Think about a system that has two swap devices (storage + zram).
> > > > I think it's useful to know how much swap IO comes from zram
> > > > and how much comes from storage.
> > > 
> > > Hm, isn't this comparable to having one swap on flash and one swap on
> > > a rotating disk? /sys/block/*/stat should be able to tell you how
> > > traffic is distributed, no?
> > 
> > > That raises the same question for me. Couldn't you also look at the
> > > zswap stat instead of adding it to vmstat? (If zswap doesn't have
> > > the counter, couldn't we simply add a new stat in sysfs?)
> 
> My point is that for regular swap backends there is already
> PSWP*. Distinguishing traffic between two swap backends is legitimate
> of course, but zram is not really special compared to other backends
> from that POV. It's only special in its memory consumption.
> 
> zswap *is* special, though. Even though some people use it *like* a
> swap backend, it's also a cache on top of swap. zswap loads and stores
> do not show up in PSWP*. And they shouldn't, because in a cache
> configuration, you still need the separate PSWP* stats to understand
> cache eviction behavior and cache miss ratio. memory -> zswap is
> ZSWPOUT; zswap -> disk is PSWPOUT; PSWPIN is a cache miss etc.
> 
> > I thought the patch aims to expose statistics so they are easier
> > to grab from the popular meminfo and vmstat, and I wanted to
> > leverage it for zram, too.
> 
> Right. zram and zswap overlap in their functionality and have similar
> deficits in their stats. Both should be fixed, I'm not opposing
> that. But IMO we should be careful about conflating
> them. Fundamentally, one is a block device, the other is an MM-native
> cache layer that sits on top of block devices. Drawing false
> equivalencies between them will come back to haunt us.

Makes sense to me.
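
For illustration, the counter relationships described above (memory ->
zswap is ZSWPOUT, zswap -> disk is PSWPOUT, PSWPIN is a cache miss) could
be consumed from userspace roughly as follows. This is only a sketch: the
helper names vmstat_lookup() and zswap_miss_pct() are made up for the
example, and the miss ratio assumes a cache configuration in which every
swap-in is either a zswap hit (zswpin) or a miss served from the backing
device (pswpin).

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in text formatted like /proc/vmstat
 * ("name value\n" per line). Returns 0 on success, -1 if the
 * counter is not present or has no parsable value.
 * (Hypothetical helper, not a kernel or libc API.)
 */
int vmstat_lookup(const char *text, const char *key,
		  unsigned long long *val)
{
	size_t klen = strlen(key);
	const char *p = text;

	while (p && *p) {
		/* Match the full name, followed by the separating space */
		if (!strncmp(p, key, klen) && p[klen] == ' ')
			return sscanf(p + klen, "%llu", val) == 1 ? 0 : -1;
		/* Advance to the start of the next line, if any */
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return -1;
}

/*
 * Share of swap-ins that had to go to the backing device, in percent.
 * Assumption: swap-ins are either zswap hits (zswpin) or misses that
 * were read from disk (pswpin). Returns 0 when there were no swap-ins.
 */
unsigned int zswap_miss_pct(unsigned long long zswpin,
			    unsigned long long pswpin)
{
	unsigned long long total = zswpin + pswpin;

	return total ? (unsigned int)(pswpin * 100 / total) : 0;
}
```

A caller would read /proc/vmstat into a buffer, look up "zswpin" and
"pswpin", and feed the two values to zswap_miss_pct(). Keeping PSWP* and
ZSWP* separate is what makes this ratio computable in the first place.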
Shakeel Butt May 5, 2022, 7:30 p.m. UTC | #25
+Yosry & Yuanchu

On Thu, Apr 28, 2022 at 8:17 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
[...]
> >
> > Yes, we have some modifications to zswap to make it work without any
> > backing real swap.
>
> Not sure if you can share them, but I would be interested in those
> changes. We have real backing swap, but because of the way swap
> entries are allocated, pages stored in zswap will consume physical
> disk slots. So on top of regular swap, you need to provision disk
> space for zswap as well, which is unfortunate.
>
> What could be useful is a separate swap entry address space that maps
> zswap slots and disk slots alike. This would fix the above problem. It
> would have the added benefit of making swapoff much simpler and faster
> too, as it doesn't need to chase down page tables to free disk slots.
>

I think we can share the code. Adding Yosry & Yuanchu who are
currently maintaining that piece of code.

Though that code might not be in an upstreamable state. At a high
level, it introduces a new type of swap (SWP_GHOST) whose underlying
storage is a truncated file, so no real disk space is needed. zswap
always accepts the page, so the kernel never tries to go to the
underlying swapfile (reality is a bit more complicated due to the
presence of incompressible memory and no real disk on the system).
Shakeel Butt May 5, 2022, 7:33 p.m. UTC | #26
On Thu, Apr 28, 2022 at 9:54 AM Yang Shi <shy828301@gmail.com> wrote:
>
[...]
> > Yes, we have some modifications to zswap to make it work without any
> > backing real swap. Though there is a future plan to move to zram
> > eventually.
>
> Interesting, if so why not just simply use zram?
>

Historical reasons. When we started trying out the zswap, I think zram
was still in staging or not stable enough (Suleiman can give a better
answer).
Suleiman Souhlal May 5, 2022, 10:24 p.m. UTC | #27
On Fri, May 6, 2022 at 4:33 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Apr 28, 2022 at 9:54 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> [...]
> > > Yes, we have some modifications to zswap to make it work without any
> > > backing real swap. Though there is a future plan to move to zram
> > > eventually.
> >
> > Interesting, if so why not just simply use zram?
> >
>
> Historical reasons. When we started trying out the zswap, I think zram
> was still in staging or not stable enough (Suleiman can give a better
> answer).

One of the reasons we chose zswap instead of zram is that zswap can
reject pages.
Also, we wanted to have per-memcg pools, which zswap made much easier to do.

-- Suleiman
Yu Zhao May 5, 2022, 11:54 p.m. UTC | #28
On Thu, May 5, 2022 at 3:25 PM Suleiman Souhlal <suleiman@google.com> wrote:
>
> On Fri, May 6, 2022 at 4:33 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Thu, Apr 28, 2022 at 9:54 AM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > [...]
> > > > Yes, we have some modifications to zswap to make it work without any
> > > > backing real swap. Though there is a future plan to move to zram
> > > > eventually.
> > >
> > > Interesting, if so why not just simply use zram?
> > >
> >
> > Historical reasons. When we started trying out the zswap, I think zram
> > was still in staging or not stable enough (Suleiman can give a better
> > answer).
>
> One of the reasons we chose zswap instead of zram is that zswap can
> reject pages.
> Also, we wanted to have per-memcg pools, which zswap made much easier to do.

Yes, it was a design choice. zswap was cache-like (tiering) and zram
was storage-like (endpoint). Though nowadays the distinction is
blurry.

It had nothing to do with zram being in staging -- when we took zswap,
it was out of the tree.

Patch
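
The unit conversions behind the two new /proc/meminfo lines in the diff
below can be checked in userspace with a small sketch. The function names
and EXAMPLE_PAGE_SHIFT are assumptions made for this example (the kernel
uses the architecture's PAGE_SHIFT), and the ratio helper is not part of
the patch; it only shows what the two values allow a reader to derive.

```c
#include <stdint.h>

/* Assumption for illustration: 4 KiB pages (PAGE_SHIFT == 12) */
#define EXAMPLE_PAGE_SHIFT 12

/* "Zswap:" line: compressed pool size in bytes, reported in kB */
unsigned long zswap_pool_kb(uint64_t pool_bytes)
{
	return (unsigned long)(pool_bytes >> 10);
}

/* "Zswapped:" line: stored pages, reported as kB of uncompressed data */
unsigned long zswapped_kb(unsigned long stored_pages)
{
	return stored_pages << (EXAMPLE_PAGE_SHIFT - 10);
}

/*
 * Uncompressed-to-compressed ratio in percent, derived from the two
 * meminfo values. Returns 0 while the pool is empty.
 */
unsigned long zswap_ratio_pct(unsigned long zswapped, unsigned long zswap)
{
	return zswap ? zswapped * 100 / zswap : 0;
}
```

With 768 stored pages and a 1 MiB pool, for example, the two lines would
read 3072 kB and 1024 kB, i.e. a ratio of 300%.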

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 6fa761c9cc78..6e89f0e2fd20 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -86,6 +86,13 @@  static int meminfo_proc_show(struct seq_file *m, void *v)
 
 	show_val_kb(m, "SwapTotal:      ", i.totalswap);
 	show_val_kb(m, "SwapFree:       ", i.freeswap);
+#ifdef CONFIG_ZSWAP
+	seq_printf(m,  "Zswap:          %8lu kB\n",
+		   (unsigned long)(zswap_pool_total_size >> 10));
+	seq_printf(m,  "Zswapped:       %8lu kB\n",
+		   (unsigned long)atomic_read(&zswap_stored_pages) <<
+		   (PAGE_SHIFT - 10));
+#endif
 	show_val_kb(m, "Dirty:          ",
 		    global_node_page_state(NR_FILE_DIRTY));
 	show_val_kb(m, "Writeback:      ",
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b82c196d8867..07074afa79a7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -632,6 +632,11 @@  static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
 }
 #endif
 
+#ifdef CONFIG_ZSWAP
+extern u64 zswap_pool_total_size;
+extern atomic_t zswap_stored_pages;
+#endif
+
 #if defined(CONFIG_SWAP) && defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
 extern void __cgroup_throttle_swaprate(struct page *page, gfp_t gfp_mask);
 static inline  void cgroup_throttle_swaprate(struct page *page, gfp_t gfp_mask)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 5e80138ce624..1ce8fadb2b1c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -132,6 +132,10 @@  enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_KSM
 		COW_KSM,
 #endif
+#ifdef CONFIG_ZSWAP
+		ZSWPIN,
+		ZSWPOUT,
+#endif
 #ifdef CONFIG_X86
 		DIRECT_MAP_LEVEL2_SPLIT,
 		DIRECT_MAP_LEVEL3_SPLIT,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4a2aa2fa88db..da7e389cf33c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1392,6 +1392,10 @@  const char * const vmstat_text[] = {
 #ifdef CONFIG_KSM
 	"cow_ksm",
 #endif
+#ifdef CONFIG_ZSWAP
+	"zswpin",
+	"zswpout",
+#endif
 #ifdef CONFIG_X86
 	"direct_map_level2_splits",
 	"direct_map_level3_splits",
diff --git a/mm/zswap.c b/mm/zswap.c
index 2c5db4cbedea..e3c16a70f533 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -42,9 +42,9 @@ 
 * statistics
 **********************************/
 /* Total bytes used by the compressed storage */
-static u64 zswap_pool_total_size;
+u64 zswap_pool_total_size;
 /* The number of compressed pages currently stored in zswap */
-static atomic_t zswap_stored_pages = ATOMIC_INIT(0);
+atomic_t zswap_stored_pages = ATOMIC_INIT(0);
 /* The number of same-value filled pages currently stored in zswap */
 static atomic_t zswap_same_filled_pages = ATOMIC_INIT(0);
 
@@ -1243,6 +1243,7 @@  static int zswap_frontswap_store(unsigned type, pgoff_t offset,
 	/* update stats */
 	atomic_inc(&zswap_stored_pages);
 	zswap_update_total_size();
+	count_vm_event(ZSWPOUT);
 
 	return 0;
 
@@ -1285,11 +1286,10 @@  static int zswap_frontswap_load(unsigned type, pgoff_t offset,
 		zswap_fill_page(dst, entry->value);
 		kunmap_atomic(dst);
 		ret = 0;
-		goto freeentry;
+		goto stats;
 	}
 
 	if (!zpool_can_sleep_mapped(entry->pool->zpool)) {
-
 		tmp = kmalloc(entry->length, GFP_ATOMIC);
 		if (!tmp) {
 			ret = -ENOMEM;
@@ -1304,10 +1304,8 @@  static int zswap_frontswap_load(unsigned type, pgoff_t offset,
 		src += sizeof(struct zswap_header);
 
 	if (!zpool_can_sleep_mapped(entry->pool->zpool)) {
-
 		memcpy(tmp, src, entry->length);
 		src = tmp;
-
 		zpool_unmap_handle(entry->pool->zpool, entry->handle);
 	}
 
@@ -1326,7 +1324,8 @@  static int zswap_frontswap_load(unsigned type, pgoff_t offset,
 		kfree(tmp);
 
 	BUG_ON(ret);
-
+stats:
+	count_vm_event(ZSWPIN);
 freeentry:
 	spin_lock(&tree->lock);
 	zswap_entry_put(tree, entry);