[v2,00/16] Multigenerational LRU Framework

Message-Id: <20210413065633.2782273-1-yuzhao@google.com>
Date: Tue, 13 Apr 2021 00:56:17 -0600
Subject: [PATCH v2 00/16] Multigenerational LRU Framework
From: Yu Zhao <yuzhao@google.com>
To: linux-mm@kvack.org
Cc: Alex Shi <alexs@kernel.org>, Andi Kleen <ak@linux.intel.com>,
    Andrew Morton <akpm@linux-foundation.org>,
    Benjamin Manes <ben.manes@gmail.com>,
    Dave Chinner <david@fromorbit.com>,
    Dave Hansen <dave.hansen@linux.intel.com>,
    Hillf Danton <hdanton@sina.com>, Jens Axboe <axboe@kernel.dk>,
    Johannes Weiner <hannes@cmpxchg.org>,
    Jonathan Corbet <corbet@lwn.net>,
    Joonsoo Kim <iamjoonsoo.kim@lge.com>,
    Matthew Wilcox <willy@infradead.org>, Mel Gorman <mgorman@suse.de>,
    Miaohe Lin <linmiaohe@huawei.com>,
    Michael Larabel <michael@michaellarabel.com>,
    Michal Hocko <mhocko@suse.com>,
    Michel Lespinasse <michel@lespinasse.org>,
    Rik van Riel <riel@surriel.com>, Roman Gushchin <guro@fb.com>,
    Rong Chen <rong.a.chen@intel.com>, SeongJae Park <sjpark@amazon.de>,
    Tim Chen <tim.c.chen@linux.intel.com>,
    Vlastimil Babka <vbabka@suse.cz>, Yang Shi <shy828301@gmail.com>,
    Ying Huang <ying.huang@intel.com>, Zi Yan <ziy@nvidia.com>,
    linux-kernel@vger.kernel.org, lkp@lists.01.org,
    page-reclaim@google.com, Yu Zhao <yuzhao@google.com>

Yu Zhao April 13, 2021, 6:56 a.m. UTC

What's new in v2
================
Special thanks to Jens Axboe for reporting a regression in buffered
I/O and helping test the fix.

This version includes the support of tiers, which represent levels of
usage from file descriptors only. Pages accessed N times via file
descriptors belong to tier order_base_2(N). Each generation contains
at most MAX_NR_TIERS tiers, and they require additional MAX_NR_TIERS-2
bits in page->flags. In contrast to moving across generations which
requires the lru lock, moving across tiers only involves an atomic
operation on page->flags and therefore has a negligible cost. A
feedback loop modeled after the well-known PID controller monitors the
refault rates across all tiers and decides when to activate pages from
which tiers, on the reclaim path.

This feedback model has a few advantages over the current feedforward
model:
1) It has a negligible overhead in the buffered I/O access path
   because activations are done in the reclaim path.
2) It takes mapped pages into account and avoids overprotecting pages
   accessed multiple times via file descriptors.
3) More tiers offer better protection to pages accessed more than
   twice when buffered-I/O-intensive workloads are under memory
   pressure.

The fio/io_uring benchmark shows a 14% improvement in IOPS when
randomly accessing a Samsung PM981a in buffered I/O mode.

Highlights from the discussions on v1
=====================================
Thanks to Ying Huang and Dave Hansen for the comments and suggestions
on page table scanning.

A simple worst-case scenario test did not find that page table
scanning underperforms the rmap, thanks to the following optimizations:
1) It will not scan page tables from processes that have been sleeping
   since the last scan.
2) It will not scan PTE tables under non-leaf PMD entries that do not
   have the accessed bit set, when
   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
3) It will not zigzag between the PGD table and the same PMD or PTE
   table spanning multiple VMAs. In other words, it finishes all the
   VMAs within the range of the same PMD or PTE table before it returns
   to the PGD table. This optimizes workloads that have large numbers
   of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.
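
The first two optimizations above can be illustrated with a toy model.
This is only a sketch, not the code in the series: the Process and
PmdEntry classes and the scan() function are hypothetical names, and
the model simply counts how many PTEs a scan would have to inspect.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PmdEntry:
    # Hypothetical stand-in for a non-leaf PMD entry.
    young: bool            # accessed bit on the PMD entry itself
    pte_young: List[bool]  # accessed bits of the PTEs underneath it


@dataclass
class Process:
    # Hypothetical stand-in for an mm_struct on the list being walked.
    scheduled_since_last_scan: bool
    pmds: List[PmdEntry] = field(default_factory=list)


def scan(processes):
    """Return (referenced PTEs found, PTEs actually inspected)."""
    found = inspected = 0
    for proc in processes:
        if not proc.scheduled_since_last_scan:
            continue  # optimization 1: skip processes that were asleep
        for pmd in proc.pmds:
            if not pmd.young:
                continue  # optimization 2: skip the whole PTE table
            inspected += len(pmd.pte_young)
            found += sum(pmd.pte_young)
    return found, inspected
```

In this model a sleeping process and a PMD entry without the accessed
bit both cost nothing beyond the check itself, which is the intuition
behind the scan cost tracking the number of referenced pages.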

TLDR
====
The current page reclaim is too expensive in terms of CPU usage and
often makes poor choices about what to evict. We would like to offer
an alternative framework that is performant, versatile and
straightforward.

Repo
====
git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/73/1173/1

Gerrit https://linux-mm-review.googlesource.com/c/page-reclaim/+/1173

Background
==========
DRAM is a major factor in total cost of ownership, and improving
memory overcommit brings a high return on investment. Over the past
decade of research and experimentation in memory overcommit, we
observed a distinct trend across millions of servers and clients: the
size of page cache has been decreasing because of the growing
popularity of cloud storage. Nowadays anon pages account for more than
90% of our memory consumption and page cache contains mostly
executable pages.

Problems
========
Notion of active/inactive
-------------------------
For servers equipped with hundreds of gigabytes of memory, the
granularity of the active/inactive is too coarse to be useful for job
scheduling. False active/inactive rates are relatively high, and thus
the assumed savings may not materialize.

For phones and laptops, executable pages are frequently evicted
despite the fact that there are many less recently used anon pages.
Major faults on executable pages cause "janks" (slow UI renderings)
and negatively impact user experience.

For lruvecs from different memcgs or nodes, comparisons are impossible
due to the lack of a common frame of reference.

Incremental scans via rmap
--------------------------
Each incremental scan picks up where the last scan left off and
stops after it has found a handful of unreferenced pages. For
workloads using a large amount of anon memory, incremental scans lose
the advantage under sustained memory pressure due to high ratios of
the number of scanned pages to the number of reclaimed pages. In our
case, the average ratio of pgscan to pgsteal is above 7.
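
For readers who want to check this ratio on their own systems, here is
a minimal sketch that derives it from the text of /proc/vmstat. It
assumes that "pgscan" and "pgsteal" above correspond to the sums of the
pgscan_kswapd/pgscan_direct and pgsteal_kswapd/pgsteal_direct counters;
the function name is made up for this example.

```python
def scan_to_steal_ratio(vmstat_text):
    """Return scanned/reclaimed pages, or None if nothing was reclaimed."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    # Assumption: kswapd plus direct reclaim covers all scans of interest.
    scanned = counters.get("pgscan_kswapd", 0) + counters.get("pgscan_direct", 0)
    stolen = counters.get("pgsteal_kswapd", 0) + counters.get("pgsteal_direct", 0)
    return scanned / stolen if stolen else None
```

The counters are cumulative since boot, so the ratio over an interval
is better computed from the difference between two samples, for example
two reads of open("/proc/vmstat").read() taken a minute apart.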

On top of that, the rmap has poor memory locality due to its complex
data structures. The combined effects typically result in a high
amount of CPU usage in the reclaim path. For example, with zram, a
typical kswapd profile on v5.11 looks like:
  31.03%  page_vma_mapped_walk
  25.59%  lzo1x_1_do_compress
   4.63%  do_raw_spin_lock
   3.89%  vma_interval_tree_iter_next
   3.33%  vma_interval_tree_subtree_search

And with real swap, it looks like:
  45.16%  page_vma_mapped_walk
   7.61%  do_raw_spin_lock
   5.69%  vma_interval_tree_iter_next
   4.91%  vma_interval_tree_subtree_search
   3.71%  page_referenced_one

Solutions
=========
Notion of generation numbers
----------------------------
The notion of generation numbers introduces a quantitative approach to
memory overcommit. A larger number of pages can be spread out across
a configurable number of generations, and each generation includes all
pages that have been referenced since the last generation. This
improved granularity yields relatively low false active/inactive
rates.

Given an lruvec, scans of anon and file types and selections between
them are all based on direct comparisons of generation numbers, which
are simple and yet effective. For different lruvecs, comparisons are
still possible based on birth times of generations.

Differential scans via page tables
----------------------------------
Each differential scan discovers all pages that have been referenced
since the last scan. Specifically, it walks the mm_struct list
associated with an lruvec to scan page tables of processes that have
been scheduled since the last scan. The cost of each differential scan
is roughly proportional to the number of referenced pages it
discovers. Unless address spaces are extremely sparse, page tables
usually have better memory locality than the rmap. The end result is
generally a significant reduction in CPU usage, for workloads using a
large amount of anon memory.

Our real-world benchmark that browses popular websites in multiple
Chrome tabs demonstrates 51% less CPU usage from kswapd and 52% (full)
less PSI on v5.11. With this patchset, kswapd profile looks like:
  49.36%  lzo1x_1_do_compress
   4.54%  page_vma_mapped_walk
   4.45%  memset_erms
   3.47%  walk_pte_range
   2.88%  zram_bvec_rw

In addition, direct reclaim latency is reduced by 22% at 99th
percentile and the number of refaults is reduced by 7%. Both metrics
are important to phones and laptops as they are correlated to user
experience.

Framework
=========
For each lruvec, evictable pages are divided into multiple
generations. The youngest generation number is stored in
lruvec->evictable.max_seq for both anon and file types as they are
aged on an equal footing. The oldest generation numbers are stored in
lruvec->evictable.min_seq[2] separately for anon and file types as
clean file pages can be evicted regardless of may_swap or
may_writepage. Generation numbers are truncated into
order_base_2(MAX_NR_GENS+1) bits in order to fit into page->flags. The
sliding window technique is used to prevent truncated generation
numbers from overlapping. Each truncated generation number is an index
to lruvec->evictable.lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES].
Evictable pages are added to the per-zone lists indexed by max_seq or
min_seq[2] (modulo MAX_NR_GENS), depending on whether they are being
faulted in.
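
The arithmetic in the paragraph above can be sketched as follows. The
value of MAX_NR_GENS here is an assumption chosen for illustration, and
the helper names other than order_base_2 are hypothetical; the sketch
only shows why a bounded window of sequence numbers maps to distinct
list indices.

```python
MAX_NR_GENS = 4  # assumed value, for illustration only


def order_base_2(n):
    """Smallest k such that 2**k >= n, with 0 and 1 mapping to 0."""
    return (n - 1).bit_length() if n > 1 else 0


def gen_bits(max_nr_gens=MAX_NR_GENS):
    """Bits needed in page->flags for a truncated generation number."""
    return order_base_2(max_nr_gens + 1)


def list_index(seq, max_nr_gens=MAX_NR_GENS):
    """Index into the per-generation lists for a full sequence number."""
    return seq % max_nr_gens


def window_is_unambiguous(min_seq, max_seq, max_nr_gens=MAX_NR_GENS):
    """True if every seq in [min_seq, max_seq] has its own list index."""
    return 0 <= max_seq - min_seq < max_nr_gens
```

With MAX_NR_GENS assumed to be 4, sequence numbers 7 through 10 map to
indices 3, 0, 1 and 2, so the window can slide forward indefinitely as
long as it never spans more than MAX_NR_GENS generations.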

Each generation is then divided into multiple tiers. Tiers represent
levels of usage from file descriptors only. Pages accessed N times via
file descriptors belong to tier order_base_2(N). In contrast to moving
across generations which requires the lru lock, moving across tiers
only involves an atomic operation on page->flags and therefore has a
lower cost. A feedback loop modeled after the well-known PID
controller monitors the refault rates across all tiers and decides
when to activate pages from which tiers on the reclaim path.
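
As a quick illustration of the tier rule, the sketch below maps access
counts to tiers. The tier_of() name is hypothetical, and the sketch
ignores the MAX_NR_TIERS cap because how the cap is applied is not
described here.

```python
def order_base_2(n):
    """Smallest k such that 2**k >= n, with 0 and 1 mapping to 0."""
    return (n - 1).bit_length() if n > 1 else 0


def tier_of(accesses_via_fd):
    """Tier of a page accessed this many times via file descriptors."""
    return order_base_2(accesses_via_fd)
```

For 1 through 8 accesses this gives tiers 0, 1, 2, 2, 3, 3, 3, 3, so
each additional tier requires roughly twice as many accesses as the
previous one.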

The framework comprises two conceptually independent components: the
aging and the eviction, which can be invoked separately from user
space.

Aging
-----
The aging produces young generations. Given an lruvec, the aging scans
page tables for referenced pages of this lruvec. Upon finding one, the
aging updates its generation number to max_seq. After each round of
scan, the aging increments max_seq.

The aging maintains either a system-wide mm_struct list or per-memcg
mm_struct lists and tracks whether an mm_struct is being used or has
been used since the last scan. Multiple threads can concurrently work
on the same mm_struct list, and each of them will be given a different
mm_struct belonging to a process that has been scheduled since the
last scan.

The aging is due when both values of min_seq[2] reach max_seq-1, assuming
both anon and file types are reclaimable.
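
A toy model of one aging round and of the "aging is due" check, as
described in this section. The dictionary-based page tracking and both
function names are assumptions made for the sketch; the real
implementation works on page tables and page->flags.

```python
def age(page_gens, referenced, max_seq):
    """Run one aging round and return the new max_seq.

    page_gens maps a page to its generation number; referenced is the
    set of pages found with the accessed bit set during the scan.
    """
    for page in referenced:
        if page in page_gens:
            page_gens[page] = max_seq  # move to the youngest generation
    return max_seq + 1  # a new youngest generation starts after the scan


def aging_due(min_seq, max_seq):
    """True when both types have caught up to max_seq-1.

    Assumes both anon and file types are reclaimable.
    """
    return all(seq >= max_seq - 1 for seq in min_seq)
```

For example, with generations {"a": 3, "b": 3, "c": 4} and max_seq 5,
finding only "a" referenced moves "a" to generation 5 and returns 6 as
the new max_seq, leaving "b" and "c" to grow older.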

Eviction
--------
The eviction consumes old generations. Given an lruvec, the eviction
scans the pages on the per-zone lists indexed by either of min_seq[2].
It first tries to select a type based on the values of min_seq[2].
When anon and file types are both available from the same generation,
it selects the one that has a lower refault rate.

During a scan, the eviction sorts pages according to their generation
numbers, if the aging has found them referenced. It also moves pages
from the tiers that have higher refault rates than tier 0 to the next
generation.

When it finds all the per-zone lists of a selected type are empty, the
eviction increments min_seq[2] indexed by this selected type.
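
The decisions described in this section can be sketched as below. Two
points are assumptions of the sketch and not statements about the
patches: that the type with the older (smaller) min_seq is preferred
when the two differ, and that ties in refault rate go to file. All
function names are hypothetical.

```python
ANON, FILE = 0, 1


def select_type(min_seq, refault_rate):
    """Pick the type to evict from."""
    if min_seq[ANON] != min_seq[FILE]:
        # Assumption: prefer the type whose oldest generation is older.
        return ANON if min_seq[ANON] < min_seq[FILE] else FILE
    # Same generation: pick the type with the lower refault rate.
    return ANON if refault_rate[ANON] < refault_rate[FILE] else FILE


def tiers_to_promote(tier_refault_rates):
    """Tiers whose pages move to the next generation during a scan."""
    base = tier_refault_rates[0]
    return [t for t, rate in enumerate(tier_refault_rates)
            if t > 0 and rate > base]


def maybe_advance(min_seq, selected, lists_empty):
    """Increment min_seq for the selected type once its lists are empty."""
    if lists_empty:
        min_seq[selected] += 1
    return min_seq
```

With tier refault rates of [0.1, 0.3, 0.05, 0.2], only tiers 1 and 3
refault more than tier 0, so only their pages would be protected by
moving to the next generation.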

Use cases
=========
On Android, our most advanced simulation that generates memory
pressure from realistic user behavior shows 18% fewer low-memory
kills, which in turn reduces cold starts by 16%.

On Borg, a similar approach enables us to identify jobs that
underutilize their memory and downsize them considerably without
compromising any of our service level indicators.

On Chrome OS, our field telemetry reports 96% fewer low-memory tab
discards and 59% fewer OOM kills from fully-utilized devices and no
regressions in monitored user experience from underutilized devices.

Working set estimation
----------------------
User space can invoke the aging by writing "+ memcg_id node_id gen
[swappiness]" to /sys/kernel/debug/lru_gen. This debugfs interface
also provides the birth time and the size of each generation.

Proactive reclaim
-----------------
User space can invoke the eviction by writing "- memcg_id node_id gen
[swappiness] [nr_to_reclaim]" to /sys/kernel/debug/lru_gen. Multiple
command lines are supported, as is concatenation with delimiters.
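
A minimal sketch of composing the aging and eviction commands from user
space, following the two formats quoted above. The helper names are
hypothetical, and submit() assumes debugfs is mounted at
/sys/kernel/debug and that the caller has permission to write there.

```python
LRU_GEN_PATH = "/sys/kernel/debug/lru_gen"


def aging_cmd(memcg_id, node_id, gen, swappiness=None):
    """Build "+ memcg_id node_id gen [swappiness]"."""
    fields = ["+", memcg_id, node_id, gen]
    if swappiness is not None:
        fields.append(swappiness)
    return " ".join(str(f) for f in fields)


def eviction_cmd(memcg_id, node_id, gen, swappiness=None, nr_to_reclaim=None):
    """Build "- memcg_id node_id gen [swappiness] [nr_to_reclaim]"."""
    if nr_to_reclaim is not None and swappiness is None:
        # The fields are positional, so nr_to_reclaim needs swappiness.
        raise ValueError("nr_to_reclaim requires swappiness")
    fields = ["-", memcg_id, node_id, gen]
    fields += [v for v in (swappiness, nr_to_reclaim) if v is not None]
    return " ".join(str(f) for f in fields)


def submit(commands, path=LRU_GEN_PATH):
    """Write one command per line to the debugfs file."""
    with open(path, "w") as f:
        f.write("\n".join(commands) + "\n")
```

For instance, submit([aging_cmd(1, 0, 7), eviction_cmd(1, 0, 4, 60,
1024)]) would send one aging and one eviction request in a single
write; the memcg, node and generation numbers here are made up.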

Intensive buffered I/O
----------------------
Tiers are specifically designed to improve the performance of
intensive buffered I/O under memory pressure. The fio/io_uring
benchmark shows a 14% improvement in IOPS when randomly accessing a
Samsung PM981a in buffered I/O mode.

For far memory tiering and NUMA-aware job scheduling, please refer to
the reference section.

FAQ
===
Why not try to improve the existing code?
-----------------------------------------
We have tried but concluded the aforementioned problems are
fundamental, and therefore changes made on top of them will not result
in substantial gains.

What particular workloads does it help?
---------------------------------------
This framework is designed to improve the performance of page
reclaim under any type of workload.

How would it benefit the community?
-----------------------------------
Google is committed to promoting sustainable development of the
community. We hope successful adoptions of this framework will
steadily climb over time. To that end, we would be happy to learn about
your workloads and work with you case by case, and we will do our best to
keep the repo fully maintained. For those whose workloads rely on the
existing code, we will make sure you will not be affected in any way.

References
==========
1. Long-term SLOs for reclaimed cloud computing resources
   https://research.google/pubs/pub43017/
2. Profiling a warehouse-scale computer
   https://research.google/pubs/pub44271/
3. Evaluation of NUMA-Aware Scheduling in Warehouse-Scale Clusters
   https://research.google/pubs/pub48329/
4. Software-defined far memory in warehouse-scale computers
   https://research.google/pubs/pub48551/
5. Borg: the Next Generation
   https://research.google/pubs/pub49065/

Yu Zhao (16):
  include/linux/memcontrol.h: do not warn in page_memcg_rcu() if
    !CONFIG_MEMCG
  include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA
  include/linux/huge_mm.h: define is_huge_zero_pmd() if
    !CONFIG_TRANSPARENT_HUGEPAGE
  include/linux/cgroup.h: export cgroup_mutex
  mm/swap.c: export activate_page()
  mm, x86: support the access bit on non-leaf PMD entries
  mm/vmscan.c: refactor shrink_node()
  mm: multigenerational lru: groundwork
  mm: multigenerational lru: activation
  mm: multigenerational lru: mm_struct list
  mm: multigenerational lru: aging
  mm: multigenerational lru: eviction
  mm: multigenerational lru: page reclaim
  mm: multigenerational lru: user interface
  mm: multigenerational lru: Kconfig
  mm: multigenerational lru: documentation

 Documentation/vm/index.rst        |    1 +
 Documentation/vm/multigen_lru.rst |  192 +++
 arch/Kconfig                      |    9 +
 arch/x86/Kconfig                  |    1 +
 arch/x86/include/asm/pgtable.h    |    2 +-
 arch/x86/mm/pgtable.c             |    5 +-
 fs/exec.c                         |    2 +
 fs/fuse/dev.c                     |    3 +-
 fs/proc/task_mmu.c                |    3 +-
 include/linux/cgroup.h            |   15 +-
 include/linux/huge_mm.h           |    5 +
 include/linux/memcontrol.h        |    7 +-
 include/linux/mm.h                |    2 +
 include/linux/mm_inline.h         |  294 ++++
 include/linux/mm_types.h          |  117 ++
 include/linux/mmzone.h            |  118 +-
 include/linux/nodemask.h          |    1 +
 include/linux/page-flags-layout.h |   20 +-
 include/linux/page-flags.h        |    4 +-
 include/linux/pgtable.h           |    4 +-
 include/linux/swap.h              |    5 +-
 kernel/bounds.c                   |    6 +
 kernel/events/uprobes.c           |    2 +-
 kernel/exit.c                     |    1 +
 kernel/fork.c                     |   10 +
 kernel/kthread.c                  |    1 +
 kernel/sched/core.c               |    2 +
 mm/Kconfig                        |   55 +
 mm/huge_memory.c                  |    5 +-
 mm/khugepaged.c                   |    2 +-
 mm/memcontrol.c                   |   28 +
 mm/memory.c                       |   14 +-
 mm/migrate.c                      |    2 +-
 mm/mm_init.c                      |   16 +-
 mm/mmzone.c                       |    2 +
 mm/rmap.c                         |    6 +
 mm/swap.c                         |   54 +-
 mm/swapfile.c                     |    6 +-
 mm/userfaultfd.c                  |    2 +-
 mm/vmscan.c                       | 2580 ++++++++++++++++++++++++++++-
 mm/workingset.c                   |  179 +-
 41 files changed, 3603 insertions(+), 180 deletions(-)
 create mode 100644 Documentation/vm/multigen_lru.rst

SeongJae Park April 13, 2021, 7:51 a.m. UTC | #1

From: SeongJae Park <sjpark@amazon.de>

Hello,


Very interesting work, thank you for sharing this :)

On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao <yuzhao@google.com> wrote:

> What's new in v2
> ================
> Special thanks to Jens Axboe for reporting a regression in buffered
> I/O and helping test the fix.

Is the discussion open?  If so, could you please give me a link?

> 
> This version includes the support of tiers, which represent levels of
> usage from file descriptors only. Pages accessed N times via file
> descriptors belong to tier order_base_2(N). Each generation contains
> at most MAX_NR_TIERS tiers, and they require additional MAX_NR_TIERS-2
> bits in page->flags. In contrast to moving across generations which
> requires the lru lock, moving across tiers only involves an atomic
> operation on page->flags and therefore has a negligible cost. A
> feedback loop modeled after the well-known PID controller monitors the
> refault rates across all tiers and decides when to activate pages from
> which tiers, on the reclaim path.
> 
> This feedback model has a few advantages over the current feedforward
> model:
> 1) It has a negligible overhead in the buffered I/O access path
>    because activations are done in the reclaim path.
> 2) It takes mapped pages into account and avoids overprotecting pages
>    accessed multiple times via file descriptors.
> 3) More tiers offer better protection to pages accessed more than
>    twice when buffered-I/O-intensive workloads are under memory
>    pressure.
> 
> The fio/io_uring benchmark shows 14% improvement in IOPS when randomly
> accessing Samsung PM981a in the buffered I/O mode.

Improvement under memory pressure, right?  How much pressure?

[...]
> 
> Differential scans via page tables
> ----------------------------------
> Each differential scan discovers all pages that have been referenced
> since the last scan. Specifically, it walks the mm_struct list
> associated with an lruvec to scan page tables of processes that have
> been scheduled since the last scan.

Does this mean it scans only virtual address spaces of processes and therefore
pages in the page cache that are not mmap()-ed will not be scanned?

> The cost of each differential scan
> is roughly proportional to the number of referenced pages it
> discovers. Unless address spaces are extremely sparse, page tables
> usually have better memory locality than the rmap. The end result is
> generally a significant reduction in CPU usage, for workloads using a
> large amount of anon memory.

When and how frequently does it scan?


Thanks,
SeongJae Park

[...]

Jens Axboe April 13, 2021, 4:13 p.m. UTC | #2

On 4/13/21 1:51 AM, SeongJae Park wrote:
> From: SeongJae Park <sjpark@amazon.de>
> 
> Hello,
> 
> 
> Very interesting work, thank you for sharing this :)
> 
> On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao <yuzhao@google.com> wrote:
> 
>> What's new in v2
>> ================
>> Special thanks to Jens Axboe for reporting a regression in buffered
>> I/O and helping test the fix.
> 
> Is the discussion open?  If so, could you please give me a link?

I wasn't on the initial post (or any of the lists it was posted to), but
it's on the google page reclaim list. Not sure if that is public or not.

tldr is that I was pretty excited about this work, as buffered IO tends
to suck (a lot) for high throughput applications. My test case was
pretty simple:

Randomly read a fast device, using 4k buffered IO, and watch what
happens when the page cache gets filled up. For this particular test,
we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
with kswapd using a lot of CPU trying to keep up. That's mainline
behavior.

The initial posting of this patchset did no better, in fact it did a bit
worse. Performance dropped to the same levels and kswapd was using as
much CPU as before, but on top of that we also got excessive swapping.
Not at a high rate, but 5-10MB/sec continually.

I had some back and forths with Yu Zhao and tested a few new revisions,
and the current series does much better in this regard. Performance
still dips a bit when page cache fills, but not nearly as much, and
kswapd is using less CPU than before.

Hope that helps,

SeongJae Park April 13, 2021, 4:42 p.m. UTC | #3

From: SeongJae Park <sjpark@amazon.de>

On Tue, 13 Apr 2021 10:13:24 -0600 Jens Axboe <axboe@kernel.dk> wrote:

> On 4/13/21 1:51 AM, SeongJae Park wrote:
> > From: SeongJae Park <sjpark@amazon.de>
> > 
> > Hello,
> > 
> > 
> > Very interesting work, thank you for sharing this :)
> > 
> > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao <yuzhao@google.com> wrote:
> > 
> >> What's new in v2
> >> ================
> >> Special thanks to Jens Axboe for reporting a regression in buffered
> >> I/O and helping test the fix.
> > 
> > Is the discussion open?  If so, could you please give me a link?
> 
> I wasn't on the initial post (or any of the lists it was posted to), but
> it's on the google page reclaim list. Not sure if that is public or not.
> 
> tldr is that I was pretty excited about this work, as buffered IO tends
> to suck (a lot) for high throughput applications. My test case was
> pretty simple:
> 
> Randomly read a fast device, using 4k buffered IO, and watch what
> happens when the page cache gets filled up. For this particular test,
> we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> with kswapd using a lot of CPU trying to keep up. That's mainline
> behavior.
> 
> The initial posting of this patchset did no better, in fact it did a bit
> worse. Performance dropped to the same levels and kswapd was using as
> much CPU as before, but on top of that we also got excessive swapping.
> Not at a high rate, but 5-10MB/sec continually.
> 
> I had some back and forths with Yu Zhao and tested a few new revisions,
> and the current series does much better in this regard. Performance
> still dips a bit when page cache fills, but not nearly as much, and
> kswapd is using less CPU than before.
> 
> Hope that helps,

Appreciate this kind and detailed explanation, Jens!

So, my understanding is that v2 of this patchset improved the performance by
using frequency (tier) in addition to recency (generation number) for buffered
I/O pages.  That makes sense to me.  If I'm misunderstanding, please let me
know.


Thanks,
SeongJae Park

> -- 
> Jens Axboe
>

Dave Chinner April 13, 2021, 11:14 p.m. UTC | #4

On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> On 4/13/21 1:51 AM, SeongJae Park wrote:
> > From: SeongJae Park <sjpark@amazon.de>
> > 
> > Hello,
> > 
> > 
> > Very interesting work, thank you for sharing this :)
> > 
> > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao <yuzhao@google.com> wrote:
> > 
> >> What's new in v2
> >> ================
> >> Special thanks to Jens Axboe for reporting a regression in buffered
> >> I/O and helping test the fix.
> > 
> > Is the discussion open?  If so, could you please give me a link?
> 
> I wasn't on the initial post (or any of the lists it was posted to), but
> it's on the google page reclaim list. Not sure if that is public or not.
> 
> tldr is that I was pretty excited about this work, as buffered IO tends
> to suck (a lot) for high throughput applications. My test case was
> pretty simple:
> 
> Randomly read a fast device, using 4k buffered IO, and watch what
> happens when the page cache gets filled up. For this particular test,
> we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> with kswapd using a lot of CPU trying to keep up. That's mainline
> behavior.

I see this exact same behaviour here, too, but I RCA'd it to
contention between the inode and memory reclaim for the mapping
structure that indexes the page cache. Basically the mapping tree
lock is the contention point here - you can either be adding pages
to the mapping during IO, or memory reclaim can be removing pages
from the mapping, but we can't do both at once.

So we end up with kswapd spinning on the mapping tree lock like so
when doing 1.6GB/s in 4kB buffered IO:

-   20.06%     0.00%  [kernel]               [k] kswapd
   - 20.06% kswapd
      - 20.05% balance_pgdat
         - 20.03% shrink_node
            - 19.92% shrink_lruvec
               - 19.91% shrink_inactive_list
                  - 19.22% shrink_page_list
                     - 17.51% __remove_mapping
                        - 14.16% _raw_spin_lock_irqsave
                           - 14.14% do_raw_spin_lock
                                __pv_queued_spin_lock_slowpath
                        - 1.56% __delete_from_page_cache
                             0.63% xas_store
                        - 0.78% _raw_spin_unlock_irqrestore
                           - 0.69% do_raw_spin_unlock
                                __raw_callee_save___pv_queued_spin_unlock
                     - 0.82% free_unref_page_list
                        - 0.72% free_unref_page_commit
                             0.57% free_pcppages_bulk

And these are the processes consuming CPU:

   5171 root      20   0 1442496   5696   1284 R  99.7   0.0   1:07.78 fio
   1150 root      20   0       0      0      0 S  47.4   0.0   0:22.70 kswapd1
   1146 root      20   0       0      0      0 S  44.0   0.0   0:21.85 kswapd0
   1152 root      20   0       0      0      0 S  39.7   0.0   0:18.28 kswapd3
   1151 root      20   0       0      0      0 S  15.2   0.0   0:12.14 kswapd2

i.e. when memory reclaim kicks in, the read process has 20% less
time with exclusive access to the mapping tree to insert new pages.
Hence buffered read performance goes down quite substantially when
memory reclaim kicks in, and this really has nothing to do with the
memory reclaim LRU scanning algorithm.

I can actually get this machine to pin those 5 processes to 100% CPU
under certain conditions. Each process is spinning all that extra
time on the mapping tree lock, and performance degrades further.
Changing the LRU reclaim algorithm won't fix this - the workload is
solidly bound by the exclusive nature of the mapping tree lock and
the number of tasks trying to obtain it exclusively...

> The initial posting of this patchset did no better, in fact it did a bit
> worse. Performance dropped to the same levels and kswapd was using as
> much CPU as before, but on top of that we also got excessive swapping.
> Not at a high rate, but 5-10MB/sec continually.
>
> I had some back and forths with Yu Zhao and tested a few new revisions,
> and the current series does much better in this regard. Performance
> still dips a bit when page cache fills, but not nearly as much, and
> kswapd is using less CPU than before.

Profiles would be interesting, because it sounds to me like reclaim
*might* be batching page cache removal better (e.g. fewer, larger
batches) and so spending less time contending on the mapping tree
lock...

IOWs, I suspect this result might actually be a result of less lock
contention due to a change in batch processing characteristics of
the new algorithm rather than it being a "better" algorithm...

Cheers,

Dave.

Rik van Riel April 14, 2021, 2:29 a.m. UTC | #5

On Wed, 2021-04-14 at 09:14 +1000, Dave Chinner wrote:
> On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> 
> > The initial posting of this patchset did no better, in fact it did a bit
> > worse. Performance dropped to the same levels and kswapd was using as
> > much CPU as before, but on top of that we also got excessive swapping.
> > Not at a high rate, but 5-10MB/sec continually.
> > 
> > I had some back and forths with Yu Zhao and tested a few new revisions,
> > and the current series does much better in this regard. Performance
> > still dips a bit when page cache fills, but not nearly as much, and
> > kswapd is using less CPU than before.
> 
> Profiles would be interesting, because it sounds to me like reclaim
> *might* be batching page cache removal better (e.g. fewer, larger
> batches) and so spending less time contending on the mapping tree
> lock...
> 
> IOWs, I suspect this result might actually be a result of less lock
> contention due to a change in batch processing characteristics of
> the new algorithm rather than it being a "better" algorithm...

That seems quite likely to me, given the issues we have
had with virtual scan reclaim algorithms in the past.

SeongJae, what is this algorithm supposed to do when faced
with situations like this:
1) Running on a system with 8 NUMA nodes, and memory
   pressure in one of those nodes.
2) Running PostgreSQL or Oracle, with hundreds of
   processes mapping the same (very large) shared
   memory segment.

How do you keep your algorithm from falling into the worst
case virtual scanning scenarios that were crippling the
2.4 kernel 15+ years ago on systems with just a few GB of
memory?

Yu Zhao April 14, 2021, 3:40 a.m. UTC | #6

On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> > On 4/13/21 1:51 AM, SeongJae Park wrote:
> > > From: SeongJae Park <sjpark@amazon.de>
> > >
> > > Hello,
> > >
> > >
> > > Very interesting work, thank you for sharing this :)
> > >
> > > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao <yuzhao@google.com> wrote:
> > >
> > >> What's new in v2
> > >> ================
> > >> Special thanks to Jens Axboe for reporting a regression in buffered
> > >> I/O and helping test the fix.
> > >
> > > Is the discussion open?  If so, could you please give me a link?
> >
> > I wasn't on the initial post (or any of the lists it was posted to), but
> > it's on the google page reclaim list. Not sure if that is public or not.
> >
> > tldr is that I was pretty excited about this work, as buffered IO tends
> > to suck (a lot) for high throughput applications. My test case was
> > pretty simple:
> >
> > Randomly read a fast device, using 4k buffered IO, and watch what
> > happens when the page cache gets filled up. For this particular test,
> > we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> > with kswapd using a lot of CPU trying to keep up. That's mainline
> > behavior.
>
> I see this exact same behaviour here, too, but I RCA'd it to
> contention between the inode and memory reclaim for the mapping
> structure that indexes the page cache. Basically the mapping tree
> lock is the contention point here - you can either be adding pages
> to the mapping during IO, or memory reclaim can be removing pages
> from the mapping, but we can't do both at once.
>
> So we end up with kswapd spinning on the mapping tree lock like so
> when doing 1.6GB/s in 4kB buffered IO:
>
> -   20.06%     0.00%  [kernel]               [k] kswapd
>    - 20.06% kswapd
>       - 20.05% balance_pgdat
>          - 20.03% shrink_node
>             - 19.92% shrink_lruvec
>                - 19.91% shrink_inactive_list
>                   - 19.22% shrink_page_list
>                      - 17.51% __remove_mapping
>                         - 14.16% _raw_spin_lock_irqsave
>                            - 14.14% do_raw_spin_lock
>                                 __pv_queued_spin_lock_slowpath
>                         - 1.56% __delete_from_page_cache
>                              0.63% xas_store
>                         - 0.78% _raw_spin_unlock_irqrestore
>                            - 0.69% do_raw_spin_unlock
>                                 __raw_callee_save___pv_queued_spin_unlock
>                      - 0.82% free_unref_page_list
>                         - 0.72% free_unref_page_commit
>                              0.57% free_pcppages_bulk
>
> And these are the processes consuming CPU:
>
>    5171 root      20   0 1442496   5696   1284 R  99.7   0.0   1:07.78 fio
>    1150 root      20   0       0      0      0 S  47.4   0.0   0:22.70 kswapd1
>    1146 root      20   0       0      0      0 S  44.0   0.0   0:21.85 kswapd0
>    1152 root      20   0       0      0      0 S  39.7   0.0   0:18.28 kswapd3
>    1151 root      20   0       0      0      0 S  15.2   0.0   0:12.14 kswapd2
>
> i.e. when memory reclaim kicks in, the read process has 20% less
> time with exclusive access to the mapping tree to insert new pages.
> Hence buffered read performance goes down quite substantially when
> memory reclaim kicks in, and this really has nothing to do with the
> memory reclaim LRU scanning algorithm.
>
> I can actually get this machine to pin those 5 processes to 100% CPU
> under certain conditions. Each process is spinning all that extra
> time on the mapping tree lock, and performance degrades further.
> Changing the LRU reclaim algorithm won't fix this - the workload is
> solidly bound by the exclusive nature of the mapping tree lock and
> the number of tasks trying to obtain it exclusively...
>
> > The initial posting of this patchset did no better, in fact it did a bit
> > worse. Performance dropped to the same levels and kswapd was using as
> > much CPU as before, but on top of that we also got excessive swapping.
> > Not at a high rate, but 5-10MB/sec continually.
> >
> > I had some back and forths with Yu Zhao and tested a few new revisions,
> > and the current series does much better in this regard. Performance
> > still dips a bit when page cache fills, but not nearly as much, and
> > kswapd is using less CPU than before.
>
> Profiles would be interesting, because it sounds to me like reclaim
> *might* be batching page cache removal better (e.g. fewer, larger
> batches) and so spending less time contending on the mapping tree
> lock...
>
> IOWs, I suspect this result might actually be a result of less lock
> contention due to a change in batch processing characteristics of
> the new algorithm rather than it being a "better" algorithm...

I appreciate the profile. But there is no batching in
__remove_mapping() -- it locks the mapping for each page, and
therefore the lock contention penalizes the mainline and this patchset
equally. It looks worse on your system because the four kswapd threads
from different nodes were working on the same file.

And kswapd is only one of two paths that could affect the performance.
The kernel context of the test process is where the improvement mainly
comes from.

I also suspect you were testing a file much larger than your memory
size. If so, sorry to tell you that a file only a few times larger,
e.g. twice, would be worse.

Here is my take:

Claim
-----
This patchset is a "better" algorithm. (Technically it's not an
algorithm, it's a feedback loop.)

Theoretical basis
-----------------
An open-loop control (the mainline) can only be better if the margin
of error in its prediction of the future events is less than that from
the trial-and-error of a closed-loop control (this patchset). For
simple machines, it surely can. For page reclaim, AFAIK, it can't.

A typical example: when randomly accessing a (not infinitely) large
file via buffered io long enough, we're bound to hit the same blocks
multiple times. Should we activate the pages containing those blocks,
i.e., to move them to the active lru list?  No.

RCA
---
For the fio/io_uring benchmark, the "No" is the key.

The mainline activates pages accessed multiple times. This is done in
the buffered io access path by mark_page_accessed(), and it takes the
lru lock, which is contended under memory pressure. This contention
slows down both the access path and kswapd. But kswapd is not the
problem here because we are measuring the io_uring process, not kswapd.

For this patchset, there are no activations since the refault rates of
pages accessed multiple times are similar to those accessed only once
-- activations will only be done to pages from tiers with higher
refault rates.
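
To make the tier comparison concrete, here is a minimal userspace sketch
of the decision described above: protect (activate) pages from a tier
only if that tier refaults at a higher rate than pages accessed once.
This is not code from the patchset; the struct, the counters and the
function name are hypothetical and only illustrate the feedback idea.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-tier counters; names are illustrative only. */
struct tier_stats {
	uint64_t evicted;   /* pages evicted from this tier */
	uint64_t refaulted; /* evicted pages that were later read back in */
};

/*
 * Return true if pages in 'tier' refault at a strictly higher rate than
 * pages in 'base' (the tier of pages accessed only once).
 *
 * The comparison
 *     tier.refaulted / tier.evicted > base.refaulted / base.evicted
 * is cross-multiplied so that no division is needed. The counters are
 * assumed to be small enough that the products do not overflow.
 */
bool tier_deserves_protection(struct tier_stats base, struct tier_stats tier)
{
	if (tier.evicted == 0)
		return false; /* no evidence yet, so do not protect */
	return tier.refaulted * base.evicted > base.refaulted * tier.evicted;
}
```

In the random-read case, pages accessed multiple times refault at about
the same rate as pages accessed once, so a check of this kind returns
false and nothing is activated.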

If you wish to debunk
---------------------
git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/73/1173/1

CONFIG_LRU_GEN=y
CONFIG_LRU_GEN_ENABLED=y

Run your benchmarks

Profiles (200G mem + 400G file)
-------------------------------
A quick test from Jens' fio/io_uring:

-rc7
    13.30%  io_uring  xas_load
    13.22%  io_uring  _copy_to_iter
    12.30%  io_uring  __add_to_page_cache_locked
     7.43%  io_uring  clear_page_erms
     4.18%  io_uring  filemap_get_read_batch
     3.54%  io_uring  get_page_from_freelist
     2.98%  io_uring  ***native_queued_spin_lock_slowpath***
     1.61%  io_uring  page_cache_ra_unbounded
     1.16%  io_uring  xas_start
     1.08%  io_uring  filemap_read
     1.07%  io_uring  ***__activate_page***

lru lock: 2.98% (lru addition + activation)
activation: 1.07%

-rc7 + this patchset
    14.44%  io_uring  xas_load
    14.14%  io_uring  _copy_to_iter
    11.15%  io_uring  __add_to_page_cache_locked
     6.56%  io_uring  clear_page_erms
     4.44%  io_uring  filemap_get_read_batch
     2.14%  io_uring  get_page_from_freelist
     1.32%  io_uring  page_cache_ra_unbounded
     1.20%  io_uring  psi_group_change
     1.18%  io_uring  filemap_read
     1.09%  io_uring  ****native_queued_spin_lock_slowpath****
     1.08%  io_uring  do_mpage_readpage

lru lock: 1.09% (lru addition only)

And I plan to reach out to other communities, e.g., PostgreSQL, to
benchmark the patchset. I heard they have been complaining about the
buffered io performance under memory pressure. Any other benchmarks
you'd suggest?

BTW, you might find another surprise in how much less frequently slab
shrinkers are called under memory pressure, because this patchset is a
lot better at finding pages to reclaim and therefore doesn't overkill
slabs.

Thanks.

Yu Zhao April 14, 2021, 4:13 a.m. UTC | #7

On Tue, Apr 13, 2021 at 8:30 PM Rik van Riel <riel@surriel.com> wrote:
>
> On Wed, 2021-04-14 at 09:14 +1000, Dave Chinner wrote:
> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> >
> > > The initial posting of this patchset did no better, in fact it did
> > > a bit
> > > worse. Performance dropped to the same levels and kswapd was using
> > > as
> > > much CPU as before, but on top of that we also got excessive
> > > swapping.
> > > Not at a high rate, but 5-10MB/sec continually.
> > >
> > > I had some back and forths with Yu Zhao and tested a few new
> > > revisions,
> > > and the current series does much better in this regard. Performance
> > > still dips a bit when page cache fills, but not nearly as much, and
> > > kswapd is using less CPU than before.
> >
> > Profiles would be interesting, because it sounds to me like reclaim
> > *might* be batching page cache removal better (e.g. fewer, larger
> > batches) and so spending less time contending on the mapping tree
> > lock...
> >
> > IOWs, I suspect this result might actually be a result of less lock
> > contention due to a change in batch processing characteristics of
> > the new algorithm rather than it being a "better" algorithm...
>
> That seems quite likely to me, given the issues we have
> had with virtual scan reclaim algorithms in the past.

Hi Rik,

Let me paste the code so we can move beyond the "batching" hypothesis:

static int __remove_mapping(struct address_space *mapping, struct page *page,
                            bool reclaimed, struct mem_cgroup *target_memcg)
{
        unsigned long flags;
        int refcount;
        void *shadow = NULL;

        BUG_ON(!PageLocked(page));
        BUG_ON(mapping != page_mapping(page));

        xa_lock_irqsave(&mapping->i_pages, flags);

> SeongJae, what is this algorithm supposed to do when faced
> with situations like this:

I'll assume the questions were directed at me, not SeongJae.

> 1) Running on a system with 8 NUMA nodes, and memory
>    pressure in one of those nodes.
> 2) Running PostgreSQL or Oracle, with hundreds of
>    processes mapping the same (very large) shared
>    memory segment.
>
> How do you keep your algorithm from falling into the worst
> case virtual scanning scenarios that were crippling the
> 2.4 kernel 15+ years ago on systems with just a few GB of
> memory?

There is a fundamental shift: back then we were scanning for cold pages,
and nowadays we are scanning for hot pages.

I'd be surprised if scanning for cold pages didn't fall apart, because it'd
find most of the entries accessed, if they are present at all.

Scanning for hot pages, on the other hand, is way better. Let me just
reiterate:
1) It will not scan page tables from processes that have been sleeping
   since the last scan.
2) It will not scan PTE tables under non-leaf PMD entries that do not
   have the accessed bit set, when
   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
3) It will not zigzag between the PGD table and the same PMD or PTE
   table spanning multiple VMAs. In other words, it finishes all the
   VMAs within the range of the same PMD or PTE table before it returns
   to the PGD table. This optimizes workloads that have large numbers
   of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.
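
As an illustration of point 2, here is a toy userspace model of skipping
PTE tables whose parent PMD entry does not have the accessed bit set.
It is a sketch only, not the patchset's walker: the types, the function
name and the fixed 512-entry table size (x86-64 with 4KB pages) are
assumptions made for this example.

```c
#include <stdbool.h>
#include <stddef.h>

#define PTES_PER_TABLE 512 /* assumption: x86-64, 4KB pages */

/* Toy stand-ins for page table structures; not the kernel's types. */
struct toy_pte_table {
	bool accessed[PTES_PER_TABLE];
};

struct toy_pmd_entry {
	bool young;                  /* accessed bit on the non-leaf PMD entry */
	struct toy_pte_table *table; /* NULL if no PTE table is present */
};

struct scan_result {
	size_t ptes_scanned; /* PTEs actually looked at */
	size_t hot_pages;    /* PTEs found with the accessed bit set */
};

/* Walk 'nr' PMD entries, descending only into tables marked young. */
struct scan_result toy_scan_pmd(const struct toy_pmd_entry *pmd, size_t nr)
{
	struct scan_result res = { 0, 0 };

	for (size_t i = 0; i < nr; i++) {
		/* Skip the whole PTE table if nothing under it was accessed. */
		if (!pmd[i].table || !pmd[i].young)
			continue;

		for (size_t j = 0; j < PTES_PER_TABLE; j++) {
			res.ptes_scanned++;
			if (pmd[i].table->accessed[j])
				res.hot_pages++;
		}
	}
	return res;
}
```

In this model the number of PTEs scanned depends on how many PTE tables
were touched since the last scan, not on the size of the address space.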

So the cost is roughly proportional to the number of referenced pages it
discovers. If there is no memory pressure, no scanning at all. For a system
under heavy memory pressure, most of the pages are referenced (otherwise
why would it be under memory pressure?), and if we use the rmap, we need to
scan a lot of pages anyway. Why not just scan them all? This way you save a
lot because of batching (now it's time to talk about batching). Besides,
page tables have far better memory locality than the rmap. For the shared
memory example you gave, the rmap needs to lock *each* page it scans. How
many 4KB pages does your large file have? I'll leave the math to you.
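
For a rough sense of scale, the math can be sketched as below. The
helper names are made up for this example, and the 512 PTEs per table
figure assumes x86-64 with 4KB pages. The 400G size is borrowed from the
fio test file mentioned elsewhere in this thread (and read as GiB),
purely as an illustrative size; it is not the size of any particular
shared memory segment.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096ULL /* 4KB pages */
#define PTES_PER_TABLE  512ULL  /* assumption: x86-64, 8-byte PTEs */

/* Number of 4KB pages needed to cover 'bytes' (rounded up). */
uint64_t nr_4k_pages(uint64_t bytes)
{
	return (bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
}

/* Number of PTE tables needed to map 'pages' pages in one process. */
uint64_t nr_pte_tables(uint64_t pages)
{
	return (pages + PTES_PER_TABLE - 1) / PTES_PER_TABLE;
}
```

With these helpers, nr_4k_pages(400ULL << 30) is 104,857,600 pages, i.e.
that many individual page locks for an rmap-based pass over all of them,
while nr_pte_tables(104857600) is 204,800 PTE tables, each a contiguous
4KB block that can be read sequentially.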

Here are some profiles:

zram with the rmap (mainline)
  31.03%  page_vma_mapped_walk
  25.59%  lzo1x_1_do_compress
   4.63%  do_raw_spin_lock
   3.89%  vma_interval_tree_iter_next
   3.33%  vma_interval_tree_subtree_search

zram with page table scanning (this patchset)
  49.36%  lzo1x_1_do_compress
   4.54%  page_vma_mapped_walk
   4.45%  memset_erms
   3.47%  walk_pte_range
   2.88%  zram_bvec_rw

Note that these are not just what I saw from some local benchmarks. We have
observed *millions* of machines in our fleet.

I encourage you to try it and see for yourself. It's as simple as:

git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/73/1173/1

CONFIG_LRU_GEN=y
CONFIG_LRU_GEN_ENABLED=y

and build and run your favorite benchmarks.

Dave Chinner April 14, 2021, 4:50 a.m. UTC | #8

On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote:
> On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> > > On 4/13/21 1:51 AM, SeongJae Park wrote:
> > > > From: SeongJae Park <sjpark@amazon.de>
> > > >
> > > > Hello,
> > > >
> > > >
> > > > Very interesting work, thank you for sharing this :)
> > > >
> > > > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > >> What's new in v2
> > > >> ================
> > > >> Special thanks to Jens Axboe for reporting a regression in buffered
> > > >> I/O and helping test the fix.
> > > >
> > > > Is the discussion open?  If so, could you please give me a link?
> > >
> > > I wasn't on the initial post (or any of the lists it was posted to), but
> > > it's on the google page reclaim list. Not sure if that is public or not.
> > >
> > > tldr is that I was pretty excited about this work, as buffered IO tends
> > > to suck (a lot) for high throughput applications. My test case was
> > > pretty simple:
> > >
> > > Randomly read a fast device, using 4k buffered IO, and watch what
> > > happens when the page cache gets filled up. For this particular test,
> > > we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> > > with kswapd using a lot of CPU trying to keep up. That's mainline
> > > behavior.
> >
> > I see this exact same behaviour here, too, but I RCA'd it to
> > contention between the inode and memory reclaim for the mapping
> > structure that indexes the page cache. Basically the mapping tree
> > lock is the contention point here - you can either be adding pages
> > to the mapping during IO, or memory reclaim can be removing pages
> > from the mapping, but we can't do both at once.
> >
> > So we end up with kswapd spinning on the mapping tree lock like so
> > when doing 1.6GB/s in 4kB buffered IO:
> >
> > -   20.06%     0.00%  [kernel]               [k] kswapd
> >    - 20.06% kswapd
> >       - 20.05% balance_pgdat
> >          - 20.03% shrink_node
> >             - 19.92% shrink_lruvec
> >                - 19.91% shrink_inactive_list
> >                   - 19.22% shrink_page_list
> >                      - 17.51% __remove_mapping
> >                         - 14.16% _raw_spin_lock_irqsave
> >                            - 14.14% do_raw_spin_lock
> >                                 __pv_queued_spin_lock_slowpath
> >                         - 1.56% __delete_from_page_cache
> >                              0.63% xas_store
> >                         - 0.78% _raw_spin_unlock_irqrestore
> >                            - 0.69% do_raw_spin_unlock
> >                                 __raw_callee_save___pv_queued_spin_unlock
> >                      - 0.82% free_unref_page_list
> >                         - 0.72% free_unref_page_commit
> >                              0.57% free_pcppages_bulk
> >
> > And these are the processes consuming CPU:
> >
> >    5171 root      20   0 1442496   5696   1284 R  99.7   0.0   1:07.78 fio
> >    1150 root      20   0       0      0      0 S  47.4   0.0   0:22.70 kswapd1
> >    1146 root      20   0       0      0      0 S  44.0   0.0   0:21.85 kswapd0
> >    1152 root      20   0       0      0      0 S  39.7   0.0   0:18.28 kswapd3
> >    1151 root      20   0       0      0      0 S  15.2   0.0   0:12.14 kswapd2
> >
> > i.e. when memory reclaim kicks in, the read process has 20% less
> > time with exclusive access to the mapping tree to insert new pages.
> > Hence buffered read performance goes down quite substantially when
> > memory reclaim kicks in, and this really has nothing to do with the
> > memory reclaim LRU scanning algorithm.
> >
> > I can actually get this machine to pin those 5 processes to 100% CPU
> > under certain conditions. Each process is spinning all that extra
> > time on the mapping tree lock, and performance degrades further.
> > Changing the LRU reclaim algorithm won't fix this - the workload is
> > solidly bound by the exclusive nature of the mapping tree lock and
> > the number of tasks trying to obtain it exclusively...
> >
> > > The initial posting of this patchset did no better, in fact it did a bit
> > > worse. Performance dropped to the same levels and kswapd was using as
> > > much CPU as before, but on top of that we also got excessive swapping.
> > > Not at a high rate, but 5-10MB/sec continually.
> > >
> > > I had some back and forths with Yu Zhao and tested a few new revisions,
> > > and the current series does much better in this regard. Performance
> > > still dips a bit when page cache fills, but not nearly as much, and
> > > kswapd is using less CPU than before.
> >
> > Profiles would be interesting, because it sounds to me like reclaim
> > *might* be batching page cache removal better (e.g. fewer, larger
> > batches) and so spending less time contending on the mapping tree
> > lock...
> >
> > IOWs, I suspect this result might actually be a result of less lock
> > contention due to a change in batch processing characteristics of
> > the new algorithm rather than it being a "better" algorithm...
> 
> I appreciate the profile. But there is no batching in
> __remove_mapping() -- it locks the mapping for each page, and
> therefore the lock contention penalizes the mainline and this patchset
> equally. It looks worse on your system because the four kswapd threads
> from different nodes were working on the same file.

I think you misunderstand exactly what I mean by "batching" here.
I'm not talking about doing multiple pieces of work under a single
lock. What I mean is that the overall amount of work done in a
single reclaim scan (i.e. a "reclaim batch") is packaged differently.

We already batch up page reclaim via building a page list and then
passing it to shrink_page_list() to process the batch of pages in a
single pass. Each page in this page list batch then calls
remove_mapping() to pull the page from the LRU, so we have a run of
contention between the foreground read() thread and the background
kswapd.

If the size or nature of the pages in the batch passed to
shrink_page_list() changes, then the amount of time a reclaim batch
is going to put pressure on the mapping tree lock will also change.
That's the "change in batching behaviour" I'm referring to here. I
haven't read through the patchset to determine if you change the
shrink_page_list() algorithm, but it likely changes what is passed
to be reclaimed and that in turn changes the locking patterns that
fall out of shrink_page_list...

> And kswapd is only one of two paths that could affect the performance.
> The kernel context of the test process is where the improvement mainly
> comes from.
> 
> I also suspect you were testing a file much larger than your memory
> size. If so, sorry to tell you that a file only a few times larger,
> e.g. twice, would be worse.
> 
> Here is my take:
> 
> Claim
> -----
> This patchset is a "better" algorithm. (Technically it's not an
> algorithm, it's a feedback loop.)
> 
> Theoretical basis
> -----------------
> An open-loop control (the mainline) can only be better if the margin
> of error in its prediction of the future events is less than that from
> the trial-and-error of a closed-loop control (this patchset). For
> simple machines, it surely can. For page reclaim, AFAIK, it can't.
> 
> A typical example: when randomly accessing a (not infinitely) large
> file via buffered io long enough, we're bound to hit the same blocks
> multiple times. Should we activate the pages containing those blocks,
> i.e., to move them to the active lru list?  No.
> 
> RCA
> ---
> For the fio/io_uring benchmark, the "No" is the key.
> 
> The mainline activates pages accessed multiple times. This is done in
> the buffered io access path by mark_page_accessed(), and it takes the
> lru lock, which is contended under memory pressure. This contention
> slows down both the access path and kswapd. But kswapd is not the
> problem here because we are measuring the io_uring process, not kswapd.
> 
> For this patchset, there are no activations since the refault rates of
> pages accessed multiple times are similar to those accessed only once
> -- activations will only be done to pages from tiers with higher
> refault rates.
> 
> If you wish to debunk
> ---------------------

Nope, it's your job to convince us that it works, not the other way
around. It's up to you to prove that your assertions are correct,
not for us to prove they are false.

> git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/73/1173/1
> 
> CONFIG_LRU_GEN=y
> CONFIG_LRU_GEN_ENABLED=y
> 
> Run your benchmarks
> 
> Profiles (200G mem + 400G file)
> -------------------------------
> A quick test from Jens' fio/io_uring:
> 
> -rc7
>     13.30%  io_uring  xas_load
>     13.22%  io_uring  _copy_to_iter
>     12.30%  io_uring  __add_to_page_cache_locked
>      7.43%  io_uring  clear_page_erms
>      4.18%  io_uring  filemap_get_read_batch
>      3.54%  io_uring  get_page_from_freelist
>      2.98%  io_uring  ***native_queued_spin_lock_slowpath***
>      1.61%  io_uring  page_cache_ra_unbounded
>      1.16%  io_uring  xas_start
>      1.08%  io_uring  filemap_read
>      1.07%  io_uring  ***__activate_page***
> 
> lru lock: 2.98% (lru addition + activation)
> activation: 1.07%
> 
> -rc7 + this patchset
>     14.44%  io_uring  xas_load
>     14.14%  io_uring  _copy_to_iter
>     11.15%  io_uring  __add_to_page_cache_locked
>      6.56%  io_uring  clear_page_erms
>      4.44%  io_uring  filemap_get_read_batch
>      2.14%  io_uring  get_page_from_freelist
>      1.32%  io_uring  page_cache_ra_unbounded
>      1.20%  io_uring  psi_group_change
>      1.18%  io_uring  filemap_read
>      1.09%  io_uring  ****native_queued_spin_lock_slowpath****
>      1.08%  io_uring  do_mpage_readpage
> 
> lru lock: 1.09% (lru addition only)

All this tells us is that there was *less contention on the mapping
tree lock*. It does not tell us why there was less contention.

You've handily omitted the kswapd profile, which is really the one
of interest to the discussion here - how did the memory reclaim CPU
usage profile also change at the same time?

> And I plan to reach out to other communities, e.g., PostgreSQL, to
> benchmark the patchset. I heard they have been complaining about the
> buffered io performance under memory pressure. Any other benchmarks
> you'd suggest?
> 
> BTW, you might find another surprise in how much less frequently slab
> shrinkers are called under memory pressure, because this patchset is a
> lot better at finding pages to reclaim and therefore doesn't overkill
> slabs.

That's actually very likely to be a Bad Thing and cause unexpected
performance and OOM based regressions. When the machine finally runs
out of page cache it can easily reclaim, it's going to get stuck
with long tail latencies reclaiming huge slab caches as they've had
no substantial ongoing pressure put on them to keep them in balance
with the overall memory pressure the system is under...

Cheers,

Dave.

Huang, Ying April 14, 2021, 6:15 a.m. UTC | #9

Yu Zhao <yuzhao@google.com> writes:

> On Tue, Apr 13, 2021 at 8:30 PM Rik van Riel <riel@surriel.com> wrote:
>>
>> On Wed, 2021-04-14 at 09:14 +1000, Dave Chinner wrote:
>> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
>> >
>> > > The initial posting of this patchset did no better, in fact it did
>> > > a bit
>> > > worse. Performance dropped to the same levels and kswapd was using
>> > > as
>> > > much CPU as before, but on top of that we also got excessive
>> > > swapping.
>> > > Not at a high rate, but 5-10MB/sec continually.
>> > >
>> > > I had some back and forths with Yu Zhao and tested a few new
>> > > revisions,
>> > > and the current series does much better in this regard. Performance
>> > > still dips a bit when page cache fills, but not nearly as much, and
>> > > kswapd is using less CPU than before.
>> >
>> > Profiles would be interesting, because it sounds to me like reclaim
>> > *might* be batching page cache removal better (e.g. fewer, larger
>> > batches) and so spending less time contending on the mapping tree
>> > lock...
>> >
>> > IOWs, I suspect this result might actually be a result of less lock
>> > contention due to a change in batch processing characteristics of
>> > the new algorithm rather than it being a "better" algorithm...
>>
>> That seems quite likely to me, given the issues we have
>> had with virtual scan reclaim algorithms in the past.
>
> Hi Rik,
>
> Let me paste the code so we can move beyond the "batching" hypothesis:
>
> static int __remove_mapping(struct address_space *mapping, struct page *page,
>                             bool reclaimed, struct mem_cgroup *target_memcg)
> {
>         unsigned long flags;
>         int refcount;
>         void *shadow = NULL;
>
>         BUG_ON(!PageLocked(page));
>         BUG_ON(mapping != page_mapping(page));
>
>         xa_lock_irqsave(&mapping->i_pages, flags);
>
>> SeongJae, what is this algorithm supposed to do when faced
>> with situations like this:
>
> I'll assume the questions were directed at me, not SeongJae.
>
>> 1) Running on a system with 8 NUMA nodes, and memory
>>    pressure in one of those nodes.
>> 2) Running PostgreSQL or Oracle, with hundreds of
>>    processes mapping the same (very large) shared
>>    memory segment.
>>
>> How do you keep your algorithm from falling into the worst
>> case virtual scanning scenarios that were crippling the
>> 2.4 kernel 15+ years ago on systems with just a few GB of
>> memory?
>
> There is a fundamental shift: back then we were scanning for cold pages,
> and nowadays we are scanning for hot pages.
>
> I'd be surprised if scanning for cold pages didn't fall apart, because it'd
> find most of the entries accessed, if they are present at all.
>
> Scanning for hot pages, on the other hand, is way better. Let me just
> reiterate:
> 1) It will not scan page tables from processes that have been sleeping
>    since the last scan.
> 2) It will not scan PTE tables under non-leaf PMD entries that do not
>    have the accessed bit set, when
>    CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
> 3) It will not zigzag between the PGD table and the same PMD or PTE
>    table spanning multiple VMAs. In other words, it finishes all the
>    VMAs with the range of the same PMD or PTE table before it returns
>    to the PGD table. This optimizes workloads that have large numbers
>    of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.
>
> So the cost is roughly proportional to the number of referenced pages it
> discovers. If there is no memory pressure, no scanning at all. For a system
> under heavy memory pressure, most of the pages are referenced (otherwise
> why would it be under memory pressure?), and if we use the rmap, we need to
> scan a lot of pages anyway. Why not just scan them all?

This may not be the case.  For rmap scanning, it's possible to scan only
a small portion of memory.  But with page table scanning, you need
to scan almost all of it (I understand you have some optimizations as above).
As Rik showed in the test case above, there may be memory pressure on
only one of 8 NUMA nodes (because of NUMA binding?).  Then rmap scanning
only needs to scan pages in this node, while page table scanning may
need to scan pages in other nodes too.

Best Regards,
Huang, Ying

> This way you save a
> lot because of batching (now it's time to talk about batching). Besides,
> page tables have far better memory locality than the rmap. For the shared
> memory example you gave, the rmap needs to lock *each* page it scans. How
> many 4KB pages does your large file have? I'll leave the math to you.
>
> Here are some profiles:
>
> zram with the rmap (mainline)
>   31.03%  page_vma_mapped_walk
>   25.59%  lzo1x_1_do_compress
>    4.63%  do_raw_spin_lock
>    3.89%  vma_interval_tree_iter_next
>    3.33%  vma_interval_tree_subtree_search
>
> zram with page table scanning (this patchset)
>   49.36%  lzo1x_1_do_compress
>    4.54%  page_vma_mapped_walk
>    4.45%  memset_erms
>    3.47%  walk_pte_range
>    2.88%  zram_bvec_rw
>
> Note that these are not just what I saw from some local benchmarks. We have
> observed *millions* of machines in our fleet.
>
> I encourage you to try it and see for yourself. It's as simple as:
>
> git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/73/1173/1
>
> CONFIG_LRU_GEN=y
> CONFIG_LRU_GEN_ENABLED=y
>
> and build and run your favorite benchmarks.

Yu Zhao April 14, 2021, 7:16 a.m. UTC | #10

On Tue, Apr 13, 2021 at 10:50 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote:
> > On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner <david@fromorbit.com> wrote:
> > > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> > > > On 4/13/21 1:51 AM, SeongJae Park wrote:
> > > > > From: SeongJae Park <sjpark@amazon.de>
> > > > >
> > > > > Hello,
> > > > >
> > > > >
> > > > > Very interesting work, thank you for sharing this :)
> > > > >
> > > > > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao <yuzhao@google.com> wrote:
> > > > >
> > > > >> What's new in v2
> > > > >> ================
> > > > >> Special thanks to Jens Axboe for reporting a regression in buffered
> > > > >> I/O and helping test the fix.
> > > > >
> > > > > Is the discussion open?  If so, could you please give me a link?
> > > >
> > > > I wasn't on the initial post (or any of the lists it was posted to), but
> > > > it's on the google page reclaim list. Not sure if that is public or not.
> > > >
> > > > tldr is that I was pretty excited about this work, as buffered IO tends
> > > > to suck (a lot) for high throughput applications. My test case was
> > > > pretty simple:
> > > >
> > > > Randomly read a fast device, using 4k buffered IO, and watch what
> > > > happens when the page cache gets filled up. For this particular test,
> > > > we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> > > > with kswapd using a lot of CPU trying to keep up. That's mainline
> > > > behavior.
> > >
> > > I see this exact same behaviour here, too, but I RCA'd it to
> > > contention between the inode and memory reclaim for the mapping
> > > structure that indexes the page cache. Basically the mapping tree
> > > lock is the contention point here - you can either be adding pages
> > > to the mapping during IO, or memory reclaim can be removing pages
> > > from the mapping, but we can't do both at once.
> > >
> > > So we end up with kswapd spinning on the mapping tree lock like so
> > > when doing 1.6GB/s in 4kB buffered IO:
> > >
> > > -   20.06%     0.00%  [kernel]               [k] kswapd
> > >    - 20.06% kswapd
> > >       - 20.05% balance_pgdat
> > >          - 20.03% shrink_node
> > >             - 19.92% shrink_lruvec
> > >                - 19.91% shrink_inactive_list
> > >                   - 19.22% shrink_page_list
> > >                      - 17.51% __remove_mapping
> > >                         - 14.16% _raw_spin_lock_irqsave
> > >                            - 14.14% do_raw_spin_lock
> > >                                 __pv_queued_spin_lock_slowpath
> > >                         - 1.56% __delete_from_page_cache
> > >                              0.63% xas_store
> > >                         - 0.78% _raw_spin_unlock_irqrestore
> > >                            - 0.69% do_raw_spin_unlock
> > >                                 __raw_callee_save___pv_queued_spin_unlock
> > >                      - 0.82% free_unref_page_list
> > >                         - 0.72% free_unref_page_commit
> > >                              0.57% free_pcppages_bulk
> > >
> > > And these are the processes consuming CPU:
> > >
> > >    5171 root      20   0 1442496   5696   1284 R  99.7   0.0   1:07.78 fio
> > >    1150 root      20   0       0      0      0 S  47.4   0.0   0:22.70 kswapd1
> > >    1146 root      20   0       0      0      0 S  44.0   0.0   0:21.85 kswapd0
> > >    1152 root      20   0       0      0      0 S  39.7   0.0   0:18.28 kswapd3
> > >    1151 root      20   0       0      0      0 S  15.2   0.0   0:12.14 kswapd2
> > >
> > > i.e. when memory reclaim kicks in, the read process has 20% less
> > > time with exclusive access to the mapping tree to insert new pages.
> > > Hence buffered read performance goes down quite substantially when
> > > memory reclaim kicks in, and this really has nothing to do with the
> > > memory reclaim LRU scanning algorithm.
> > >
> > > I can actually get this machine to pin those 5 processes to 100% CPU
> > > under certain conditions. Each process is spinning all that extra
> > > time on the mapping tree lock, and performance degrades further.
> > > Changing the LRU reclaim algorithm won't fix this - the workload is
> > > solidly bound by the exclusive nature of the mapping tree lock and
> > > the number of tasks trying to obtain it exclusively...
> > >
> > > > The initial posting of this patchset did no better, in fact it did a bit
> > > > worse. Performance dropped to the same levels and kswapd was using as
> > > > much CPU as before, but on top of that we also got excessive swapping.
> > > > Not at a high rate, but 5-10MB/sec continually.
> > > >
> > > > I had some back and forths with Yu Zhao and tested a few new revisions,
> > > > and the current series does much better in this regard. Performance
> > > > still dips a bit when page cache fills, but not nearly as much, and
> > > > kswapd is using less CPU than before.
> > >
> > > Profiles would be interesting, because it sounds to me like reclaim
> > > *might* be batching page cache removal better (e.g. fewer, larger
> > > batches) and so spending less time contending on the mapping tree
> > > lock...
> > >
> > > IOWs, I suspect this result might actually be a result of less lock
> > > contention due to a change in batch processing characteristics of
> > > the new algorithm rather than it being a "better" algorithm...
> >
> > I appreciate the profile. But there is no batching in
> > __remove_mapping() -- it locks the mapping for each page, and
> > therefore the lock contention penalizes the mainline and this patchset
> > equally. It looks worse on your system because the four kswapd threads
> > from different nodes were working on the same file.
>
> I think you misunderstand exactly what I mean by "batching" here.
> I'm not talking about doing multiple pieces of work under a single
> lock. What I mean is that the overall amount of work done in a
> single reclaim scan (i.e a "reclaim batch") is packaged differently.
>
> We already batch up page reclaim via building a page list and then
> passing it to shrink_page_list() to process the batch of pages in a
> single pass. Each page in this page list batch then calls
> remove_mapping() to pull the page from the LRU, so we have a run of
> contention between the foreground read() thread and the background
> kswapd.
>
> If the size or nature of the pages in the batch passed to
> shrink_page_list() changes, then the amount of time a reclaim batch
> is going to put pressure on the mapping tree lock will also change.
> That's the "change in batching behaviour" I'm referring to here. I
> haven't read through the patchset to determine if you change the
> shrink_page_list() algorithm, but it likely changes what is passed
> to be reclaimed and that in turn changes the locking patterns that
> fall out of shrink_page_list...

Ok, if we are talking about the size of the batch passed to
shrink_page_list(), both the mainline and this patchset cap it at
SWAP_CLUSTER_MAX, which is 32. There are corner cases, but when
running fio/io_uring, it's safe to say both use 32.
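
To make the batch-size point concrete, here is a minimal userspace sketch
(not kernel code). It assumes, as stated above, that the mapping lock is
taken once per page and that batches are capped at SWAP_CLUSTER_MAX (32).
The names model_reclaim() and struct reclaim_cost are hypothetical and
exist only for this illustration.

```c
#include <assert.h>

/* Mirrors the cap cited above; the kernel defines its own constant. */
#define SWAP_CLUSTER_MAX 32UL

/* Hypothetical bookkeeping for the model, not a kernel structure. */
struct reclaim_cost {
	unsigned long batches;       /* calls to a shrink_page_list()-like helper */
	unsigned long lock_acquires; /* mapping lock round trips */
};

/*
 * Model reclaiming nr_pages in batches of at most 'batch' pages,
 * assuming the mapping lock is taken and released once per page.
 */
struct reclaim_cost model_reclaim(unsigned long nr_pages, unsigned long batch)
{
	struct reclaim_cost cost = { 0, 0 };

	assert(batch > 0);

	while (nr_pages > 0) {
		unsigned long n = nr_pages < batch ? nr_pages : batch;

		cost.batches++;
		cost.lock_acquires += n; /* one lock/unlock pair per page */
		nr_pages -= n;
	}
	return cost;
}
```

Under these assumptions, model_reclaim(1000, SWAP_CLUSTER_MAX) yields 32
batches and 1000 lock acquisitions, and a single batch of 1000 pages still
takes the lock 1000 times. In this model the batch size changes how the
acquisitions are grouped in time, not how many there are.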

> > And kswapd is only one of two paths that could affect the performance.
> > The kernel context of the test process is where the improvement mainly
> > comes from.
> >
> > I also suspect you were testing a file much larger than your memory
> > size. If so, sorry to tell you that a file only a few times larger,
> > e.g. twice, would be worse.
> >
> > Here is my take:
> >
> > Claim
> > -----
> > This patchset is a "better" algorithm. (Technically it's not an
> > algorithm, it's a feedback loop.)
> >
> > Theoretical basis
> > -----------------
> > An open-loop control (the mainline) can only be better if the margin
> > of error in its prediction of the future events is less than that from
> > the trial-and-error of a closed-loop control (this patchset). For
> > simple machines, it surely can. For page reclaim, AFAIK, it can't.
> >
> > A typical example: when randomly accessing a (not infinitely) large
> > file via buffered io long enough, we're bound to hit the same blocks
> > multiple times. Should we activate the pages containing those blocks,
> > i.e., to move them to the active lru list?  No.
> >
> > RCA
> > ---
> > For the fio/io_uring benchmark, the "No" is the key.
> >
> > The mainline activates pages accessed multiple times. This is done in
> > the buffered io access path by mark_page_accessed(), and it takes the
> > lru lock, which is contended under memory pressure. This contention
> > slows down both the access path and kswapd. But kswapd is not the
> > problem here because we are measuring the io_uring process, not kswapd.
> >
> > For this patchset, there are no activations since the refault rates of
> > pages accessed multiple times are similar to those accessed only once
> > -- activations will only be done to pages from tiers with higher
> > refault rates.
> >
> > If you wish to debunk
> > ---------------------
>
> Nope, it's your job to convince us that it works, not the other way
> around. It's up to you to prove that your assertions are correct,
> not for us to prove they are false.

Just trying to keep people motivated, my homework is my own.

> > git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/73/1173/1
> >
> > CONFIG_LRU_GEN=y
> > CONFIG_LRU_GEN_ENABLED=y
> >
> > Run your benchmarks
> >
> > Profiles (200G mem + 400G file)
> > -------------------------------
> > A quick test from Jens' fio/io_uring:
> >
> > -rc7
> >     13.30%  io_uring  xas_load
> >     13.22%  io_uring  _copy_to_iter
> >     12.30%  io_uring  __add_to_page_cache_locked
> >      7.43%  io_uring  clear_page_erms
> >      4.18%  io_uring  filemap_get_read_batch
> >      3.54%  io_uring  get_page_from_freelist
> >      2.98%  io_uring  ***native_queued_spin_lock_slowpath***
> >      1.61%  io_uring  page_cache_ra_unbounded
> >      1.16%  io_uring  xas_start
> >      1.08%  io_uring  filemap_read
> >      1.07%  io_uring  ***__activate_page***
> >
> > lru lock: 2.98% (lru addition + activation)
> > activation: 1.07%
> >
> > -rc7 + this patchset
> >     14.44%  io_uring  xas_load
> >     14.14%  io_uring  _copy_to_iter
> >     11.15%  io_uring  __add_to_page_cache_locked
> >      6.56%  io_uring  clear_page_erms
> >      4.44%  io_uring  filemap_get_read_batch
> >      2.14%  io_uring  get_page_from_freelist
> >      1.32%  io_uring  page_cache_ra_unbounded
> >      1.20%  io_uring  psi_group_change
> >      1.18%  io_uring  filemap_read
> >      1.09%  io_uring  ***native_queued_spin_lock_slowpath***
> >      1.08%  io_uring  do_mpage_readpage
> >
> > lru lock: 1.09% (lru addition only)
>
> All this tells us is that there was *less contention on the mapping
> tree lock*. It does not tell us why there was less contention.
>
> You've handily omitted the kswapd profile, which is really the one
> of interest to the discussion here - how did the memory reclaim CPU
> usage profile also change at the same time?

Well, let me attach them. Suffix -1 is the mainline, -2 is the patchset.

  mainline
     57.65%  kswapd0  __remove_mapping
  this patchset
     61.61%  kswapd0  __remove_mapping

As I said, the mapping lock contention penalizes both heavily. Its
percentage is even higher with the patchset, because it has less
overhead. I'm trying to explain "the less overhead" part: it's the
activations that make the mainline worse.

  mainline
    6.53%  kswapd0  shrink_active_list
  this patchset
    0

From the io_uring context:
  mainline
     2.53%  io_uring  mark_page_accessed
  this patchset
     0.52%  io_uring  mark_page_accessed

mark_page_accessed() moves pages accessed multiple times to the active
lru list. Then shrink_active_list() moves them back to the inactive
list. All for nothing.
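
The round trip can be sketched with a toy userspace model. It is a
deliberate simplification: the second access to an inactive page activates
it, and a later pass over the active list deactivates it again. The types
and the toy_* helpers below are hypothetical; they only borrow the names of
the kernel functions they imitate and ignore locking, pagevecs and memcg.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for page flags and counters. */
struct toy_page {
	bool referenced;
	bool active;
};

struct toy_stats {
	unsigned long activations;   /* inactive -> active moves */
	unsigned long deactivations; /* active -> inactive moves */
};

/* Simplified two-touch rule: first access marks, second access activates. */
void toy_mark_page_accessed(struct toy_page *p, struct toy_stats *s)
{
	assert(p && s);

	if (p->active || !p->referenced) {
		p->referenced = true;
		return;
	}
	p->active = true;
	p->referenced = false;
	s->activations++;
}

/* Simplified aging pass: every active page goes back to the inactive list. */
void toy_shrink_active_list(struct toy_page *pages, unsigned long n,
			    struct toy_stats *s)
{
	unsigned long i;

	assert(pages && s);

	for (i = 0; i < n; i++) {
		if (!pages[i].active)
			continue;
		pages[i].active = false;
		pages[i].referenced = false;
		s->deactivations++;
	}
}
```

In this model a page touched twice and then aged once ends up exactly where
it started, at the cost of one activation and one deactivation.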

I don't want to paste everything here -- it'd clutter the thread. Please see
all the detailed profiles in the attachment. Let me know if their
formats are not to your liking. I still have the raw perf.data.

> > And I plan to reach out to other communities, e.g., PostgreSQL, to
> > benchmark the patchset. I heard they have been complaining about the
> > buffered io performance under memory pressure. Any other benchmarks
> > you'd suggest?
> >
> > BTW, you might find another surprise in how less frequently slab
> > shrinkers are called under memory pressure, because this patchset is a
> > lot better at finding pages to reclaim and therefore doesn't overkill
> > slabs.
>
> That's actually very likely to be a Bad Thing and cause unexpected
> performance and OOM based regressions. When the machine finally runs
> out of page cache it can easily reclaim, it's going to get stuck
> with long tail latencies reclaiming huge slab caches as they've had
> no substantial ongoing pressure put on them to keep them in balance
> with the overall memory pressure the system is under...

Well, it does use the existing equation. That is, if it scans X% of
pages, then it scans X% of slab objects. But 1) it often finds pages
to reclaim at a lower X%, and 2) the pages it reclaims are less likely
to refault. So the side effect is that the overall number of slab objects
it scans is also reduced. I do see your point but don't see any options
at the moment.
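
As a rough illustration of that proportionality, here is a minimal
userspace sketch. It only encodes "scan X% of pages, then scan X% of slab
objects"; the real shrinker code also factors in priority and per-shrinker
costs, which are ignored here. The function name toy_slab_scan_target() is
hypothetical.

```c
#include <assert.h>

/*
 * Number of slab objects to scan, given that pages_scanned out of
 * total_pages LRU pages were scanned. The multiplication is widened
 * to 64 bits to avoid overflow.
 */
unsigned long toy_slab_scan_target(unsigned long pages_scanned,
				   unsigned long total_pages,
				   unsigned long slab_objects)
{
	assert(total_pages > 0);
	assert(pages_scanned <= total_pages);

	return (unsigned long)((unsigned long long)slab_objects *
			       pages_scanned / total_pages);
}
```

With 1000 LRU pages and 20000 slab objects, scanning 50 pages (5%) gives a
target of 1000 objects, while scanning only 10 pages (1%) gives 200. Finding
reclaimable pages at a lower X% therefore lowers slab pressure in the same
proportion.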
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 954K of event 'cycles'
# Event count (approx.): 856006306336
#
# Children      Self  Command       Shared Object      Symbol                                      
# ........  ........  ............  .................  ............................................
#
    99.90%     0.00%  io_uring      [unknown]          [k] 0x0000000000000005
    99.90%     0.00%  io_uring      [unknown]          [k] 0x0000564cf9afc450
    99.50%     0.02%  io_uring      libc-2.32.so       [.] syscall
    99.19%     0.01%  io_uring      [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
    99.16%     0.00%  io_uring      [kernel.kallsyms]  [k] do_syscall_64
    94.78%     0.18%  io_uring      [kernel.kallsyms]  [k] __io_queue_sqe
    94.41%     0.25%  io_uring      [kernel.kallsyms]  [k] io_issue_sqe
    93.60%     0.48%  io_uring      [kernel.kallsyms]  [k] io_read
    89.35%     0.96%  io_uring      [kernel.kallsyms]  [k] blkdev_read_iter
    88.44%     0.12%  io_uring      [kernel.kallsyms]  [k] io_iter_do_read
    88.25%     0.16%  io_uring      [kernel.kallsyms]  [k] generic_file_read_iter
    88.00%     1.20%  io_uring      [kernel.kallsyms]  [k] filemap_read
    84.01%     0.01%  io_uring      [kernel.kallsyms]  [k] __x64_sys_io_uring_enter
    83.91%     0.01%  io_uring      [kernel.kallsyms]  [k] __do_sys_io_uring_enter
    83.74%     0.37%  io_uring      [kernel.kallsyms]  [k] io_submit_sqes
    81.28%     0.07%  io_uring      [kernel.kallsyms]  [k] io_queue_sqe
    74.65%     0.96%  io_uring      [kernel.kallsyms]  [k] filemap_get_pages
    55.92%     0.35%  io_uring      [kernel.kallsyms]  [k] ondemand_readahead
    54.57%     1.34%  io_uring      [kernel.kallsyms]  [k] page_cache_ra_unbounded
    51.57%     0.12%  io_uring      [kernel.kallsyms]  [k] page_cache_sync_ra
    24.14%     0.51%  io_uring      [kernel.kallsyms]  [k] add_to_page_cache_lru
    19.04%    11.51%  io_uring      [kernel.kallsyms]  [k] __add_to_page_cache_locked
    18.48%     0.13%  io_uring      [kernel.kallsyms]  [k] read_pages
    18.42%     0.18%  io_uring      [kernel.kallsyms]  [k] blkdev_readahead
    18.20%     0.55%  io_uring      [kernel.kallsyms]  [k] mpage_readahead
    16.81%     2.31%  io_uring      [kernel.kallsyms]  [k] filemap_get_read_batch
    16.37%    14.83%  io_uring      [kernel.kallsyms]  [k] xas_load
    15.40%     0.02%  io_uring      [kernel.kallsyms]  [k] task_work_run
    15.38%     0.03%  io_uring      [kernel.kallsyms]  [k] exit_to_user_mode_prepare
    15.31%     0.05%  io_uring      [kernel.kallsyms]  [k] tctx_task_work
    15.14%     0.15%  io_uring      [kernel.kallsyms]  [k] syscall_exit_to_user_mode
    14.05%     0.04%  io_uring      [kernel.kallsyms]  [k] io_req_task_submit
    13.86%     0.05%  io_uring      [kernel.kallsyms]  [k] __io_req_task_submit
    12.92%     0.12%  io_uring      [kernel.kallsyms]  [k] submit_bio
    11.40%     0.13%  io_uring      [kernel.kallsyms]  [k] copy_page_to_iter
    10.65%     9.61%  io_uring      [kernel.kallsyms]  [k] _copy_to_iter
     9.45%     0.03%  io_uring      [kernel.kallsyms]  [k] __page_cache_alloc
     9.42%     0.16%  io_uring      [kernel.kallsyms]  [k] submit_bio_noacct
     9.40%     0.11%  io_uring      [kernel.kallsyms]  [k] alloc_pages_current
     9.11%     0.30%  io_uring      [kernel.kallsyms]  [k] __alloc_pages_nodemask
     8.53%     1.81%  io_uring      [kernel.kallsyms]  [k] get_page_from_freelist
     8.38%     0.10%  io_uring      [kernel.kallsyms]  [k] asm_common_interrupt
     8.26%     0.06%  io_uring      [kernel.kallsyms]  [k] common_interrupt
     7.75%     0.05%  io_uring      [kernel.kallsyms]  [k] __common_interrupt
     7.62%     0.44%  io_uring      [kernel.kallsyms]  [k] blk_mq_submit_bio
     7.56%     0.20%  io_uring      [kernel.kallsyms]  [k] handle_edge_irq
     6.45%     5.90%  io_uring      [kernel.kallsyms]  [k] clear_page_erms
     5.25%     0.10%  io_uring      [kernel.kallsyms]  [k] handle_irq_event
     4.88%     0.19%  io_uring      [kernel.kallsyms]  [k] nvme_irq
     4.83%     0.07%  io_uring      [kernel.kallsyms]  [k] __handle_irq_event_percpu
     4.73%     0.52%  io_uring      [kernel.kallsyms]  [k] nvme_process_cq
     4.52%     0.01%  io_uring      [kernel.kallsyms]  [k] page_cache_async_ra
     4.00%     0.04%  io_uring      [kernel.kallsyms]  [k] nvme_pci_complete_rq
     3.82%     0.04%  io_uring      [kernel.kallsyms]  [k] nvme_complete_rq
     3.76%     1.11%  io_uring      [kernel.kallsyms]  [k] do_mpage_readpage
     3.74%     0.06%  io_uring      [kernel.kallsyms]  [k] blk_mq_end_request
     3.03%     0.01%  io_uring      [kernel.kallsyms]  [k] blk_flush_plug_list
     3.02%     0.06%  io_uring      [kernel.kallsyms]  [k] blk_mq_flush_plug_list
     2.96%     0.01%  io_uring      [kernel.kallsyms]  [k] blk_mq_sched_insert_requests
     2.94%     0.10%  io_uring      [kernel.kallsyms]  [k] blk_mq_try_issue_list_directly
     2.89%     0.00%  io_uring      [kernel.kallsyms]  [k] __irqentry_text_start
     2.71%     0.41%  io_uring      [kernel.kallsyms]  [k] psi_task_change
     2.67%     0.21%  io_uring      [kernel.kallsyms]  [k] lru_cache_add
     2.65%     0.14%  io_uring      [kernel.kallsyms]  [k] __blk_mq_try_issue_directly
     2.53%     0.54%  io_uring      [kernel.kallsyms]  [k] nvme_queue_rq
     2.43%     0.17%  io_uring      [kernel.kallsyms]  [k] blk_update_request
     2.42%     0.58%  io_uring      [kernel.kallsyms]  [k] __pagevec_lru_add
     2.29%     1.42%  io_uring      [kernel.kallsyms]  [k] psi_group_change
     2.22%     0.85%  io_uring      [kernel.kallsyms]  [k] blk_attempt_plug_merge
     2.14%     0.04%  io_uring      [kernel.kallsyms]  [k] bio_endio
     2.13%     0.11%  io_uring      [kernel.kallsyms]  [k] rw_verify_area
     2.08%     0.18%  io_uring      [kernel.kallsyms]  [k] mpage_end_io
     2.01%     0.08%  io_uring      [kernel.kallsyms]  [k] mpage_alloc
     1.71%     0.56%  io_uring      [kernel.kallsyms]  [k] _raw_spin_lock_irq
     1.65%     0.98%  io_uring      [kernel.kallsyms]  [k] workingset_refault
     1.65%     0.09%  io_uring      [kernel.kallsyms]  [k] psi_memstall_leave
     1.64%     0.20%  io_uring      [kernel.kallsyms]  [k] security_file_permission
     1.61%     1.59%  io_uring      [kernel.kallsyms]  [k] _raw_spin_lock
     1.58%     0.08%  io_uring      [kernel.kallsyms]  [k] psi_memstall_enter
     1.44%     0.37%  io_uring      [kernel.kallsyms]  [k] xa_get_order
     1.43%     0.13%  io_uring      [kernel.kallsyms]  [k] __blk_mq_alloc_request
     1.37%     0.26%  io_uring      [kernel.kallsyms]  [k] io_submit_flush_completions
     1.36%     0.14%  io_uring      [kernel.kallsyms]  [k] xa_load
     1.34%     0.31%  io_uring      [kernel.kallsyms]  [k] bio_alloc_bioset
     1.31%     0.26%  io_uring      [kernel.kallsyms]  [k] submit_bio_checks
     1.29%     0.04%  io_uring      [kernel.kallsyms]  [k] blk_finish_plug
     1.28%     1.27%  io_uring      [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
     1.24%     0.19%  io_uring      [kernel.kallsyms]  [k] page_endio
     1.13%     0.10%  io_uring      [kernel.kallsyms]  [k] unlock_page
     1.09%     0.99%  io_uring      [kernel.kallsyms]  [k] read_tsc
     1.07%     0.74%  io_uring      [kernel.kallsyms]  [k] lru_gen_addition
     1.03%     0.09%  io_uring      [kernel.kallsyms]  [k] mempool_alloc
     1.02%     0.13%  io_uring      [kernel.kallsyms]  [k] wake_up_page_bit
     1.02%     0.92%  io_uring      [kernel.kallsyms]  [k] xas_start
     1.01%     0.78%  io_uring      [kernel.kallsyms]  [k] apparmor_file_permission
     0.99%     0.99%  io_uring      [kernel.kallsyms]  [k] native_irq_return_iret
     0.94%     0.16%  io_uring      [kernel.kallsyms]  [k] __mod_lruvec_state
     0.93%     0.55%  io_uring      [kernel.kallsyms]  [k] blk_rq_merge_ok
     0.93%     0.16%  io_uring      [kernel.kallsyms]  [k] irq_chip_ack_parent
     0.91%     0.22%  io_uring      [kernel.kallsyms]  [k] mem_cgroup_charge
     0.88%     0.24%  io_uring      [kernel.kallsyms]  [k] record_times
     0.86%     0.17%  io_uring      [kernel.kallsyms]  [k] page_cache_prev_miss
     0.84%     0.15%  io_uring      [kernel.kallsyms]  [k] PageHuge
     0.84%     0.09%  io_uring      [kernel.kallsyms]  [k] io_req_free_batch
     0.82%     0.20%  io_uring      [kernel.kallsyms]  [k] xas_store
     0.82%     0.06%  io_uring      [kernel.kallsyms]  [k] mempool_alloc_slab
     0.78%     0.11%  io_uring      [kernel.kallsyms]  [k] io_setup_async_rw
     0.77%     0.76%  io_uring      [kernel.kallsyms]  [k] workingset_update_node
     0.73%     0.72%  io_uring      [kernel.kallsyms]  [k] native_apic_msr_eoi_write
     0.73%     0.35%  io_uring      [kernel.kallsyms]  [k] __fsnotify_parent
     0.73%     0.10%  io_uring      [kernel.kallsyms]  [k] io_dismantle_req
     0.71%     0.01%  io_uring      [kernel.kallsyms]  [k] __wake_up_locked_key_bookmark
     0.69%     0.17%  io_uring      [kernel.kallsyms]  [k] blk_mq_get_tag
     0.69%     0.04%  io_uring      [kernel.kallsyms]  [k] __xas_prev
     0.67%     0.09%  io_uring      [kernel.kallsyms]  [k] blk_mq_start_request
     0.65%     0.34%  io_uring      [kernel.kallsyms]  [k] __mod_memcg_lruvec_state
     0.64%     0.09%  io_uring      [kernel.kallsyms]  [k] sched_clock_cpu
     0.63%     0.31%  io_uring      [kernel.kallsyms]  [k] io_async_buf_func
     0.62%     0.23%  io_uring      [kernel.kallsyms]  [k] kfree
     0.60%     0.21%  io_uring      [kernel.kallsyms]  [k] blk_mq_rq_ctx_init
     0.59%     0.08%  io_uring      [kernel.kallsyms]  [k] bio_associate_blkg
     0.59%     0.42%  io_uring      [kernel.kallsyms]  [k] __x86_retpoline_rax
     0.58%     0.13%  io_uring      [kernel.kallsyms]  [k] __check_object_size
     0.58%     0.37%  io_uring      [kernel.kallsyms]  [k] kmem_cache_alloc
     0.57%     0.20%  io_uring      [kernel.kallsyms]  [k] __mod_lruvec_page_state
     0.57%     0.06%  io_uring      [kernel.kallsyms]  [k] __wake_up_common
     0.57%     0.01%  io_uring      [kernel.kallsyms]  [k] bio_put
     0.57%     0.35%  io_uring      [kernel.kallsyms]  [k] __lock_page_async
     0.55%     0.14%  io_uring      [kernel.kallsyms]  [k] blk_mq_free_request
     0.55%     0.27%  io_uring      [kernel.kallsyms]  [k] blk_throtl_bio
     0.54%     0.02%  io_uring      [kernel.kallsyms]  [k] bio_free
     0.54%     0.47%  io_uring      [kernel.kallsyms]  [k] io_file_supports_async
     0.53%     0.40%  io_uring      [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     0.52%     0.26%  io_uring      [kernel.kallsyms]  [k] __kmalloc
     0.52%     0.04%  io_uring      [kernel.kallsyms]  [k] __blk_mq_get_tag
     0.52%     0.46%  io_uring      [kernel.kallsyms]  [k] mark_page_accessed
     0.50%     0.49%  io_uring      [kernel.kallsyms]  [k] native_sched_clock
     0.49%     0.06%  io_uring      [kernel.kallsyms]  [k] __sbitmap_queue_get
     0.47%     0.42%  io_uring      [kernel.kallsyms]  [k] memset_erms
     0.45%     0.01%  io_uring      [kernel.kallsyms]  [k] irqentry_exit
     0.45%     0.08%  io_uring      [kernel.kallsyms]  [k] mempool_free_slab
     0.45%     0.40%  io_uring      [kernel.kallsyms]  [k] bio_associate_blkg_from_css
     0.45%     0.03%  io_uring      [kernel.kallsyms]  [k] mempool_free
     0.44%     0.41%  io_uring      [kernel.kallsyms]  [k] xas_find_conflict
     0.44%     0.05%  io_uring      [kernel.kallsyms]  [k] irqentry_exit_to_user_mode
     0.43%     0.38%  io_uring      [kernel.kallsyms]  [k] __virt_addr_valid
     0.42%     0.39%  io_uring      [kernel.kallsyms]  [k] nvme_setup_cmd
     0.41%     0.25%  io_uring      [kernel.kallsyms]  [k] rcu_read_unlock_strict
     0.41%     0.15%  io_uring      [kernel.kallsyms]  [k] sbitmap_get
     0.39%     0.37%  io_uring      [kernel.kallsyms]  [k] __mod_node_page_state
     0.39%     0.31%  io_uring      [kernel.kallsyms]  [k] kmem_cache_free
     0.37%     0.34%  io_uring      [kernel.kallsyms]  [k] ktime_get
     0.37%     0.33%  io_uring      [kernel.kallsyms]  [k] io_import_iovec
     0.36%     0.32%  io_uring      [kernel.kallsyms]  [k] io_prep_rw
     0.36%     0.01%  io_uring      io_uring           [.] submitter_fn
     0.36%     0.32%  io_uring      [kernel.kallsyms]  [k] blk_attempt_bio_merge.part.0
     0.34%     0.30%  io_uring      [kernel.kallsyms]  [k] fsnotify
     0.32%     0.32%  io_uring      [kernel.kallsyms]  [k] __mod_memcg_state.part.0
     0.32%     0.18%  io_uring      [kernel.kallsyms]  [k] try_charge
     0.31%     0.01%  io_uring      [kernel.kallsyms]  [k] io_req_task_queue
     0.31%     0.27%  io_uring      [kernel.kallsyms]  [k] percpu_counter_add_batch
     0.31%     0.26%  io_uring      [kernel.kallsyms]  [k] blk_integrity_merge_bio
     0.31%     0.27%  io_uring      [kernel.kallsyms]  [k] wbt_wait
     0.30%     0.26%  io_uring      [kernel.kallsyms]  [k] io_file_get
     0.30%     0.26%  io_uring      [kernel.kallsyms]  [k] blkdev_get_block
     0.29%     0.18%  io_uring      [kernel.kallsyms]  [k] __cond_resched
     0.28%     0.25%  io_uring      [kernel.kallsyms]  [k] wbt_track
     0.28%     0.07%  io_uring      [kernel.kallsyms]  [k] io_req_task_work_add
     0.28%     0.01%  io_uring      [kernel.kallsyms]  [k] asm_sysvec_reschedule_ipi
     0.28%     0.18%  io_uring      [kernel.kallsyms]  [k] apic_ack_irq
     0.27%     0.24%  io_uring      [kernel.kallsyms]  [k] mutex_unlock
     0.26%     0.24%  io_uring      [kernel.kallsyms]  [k] __blk_mq_sched_bio_merge
     0.26%     0.06%  io_uring      [kernel.kallsyms]  [k] __lock_text_start
     0.25%     0.22%  io_uring      [kernel.kallsyms]  [k] __blk_queue_split
     0.25%     0.22%  io_uring      [kernel.kallsyms]  [k] __slab_free
     0.25%     0.05%  io_uring      [kernel.kallsyms]  [k] __blk_mq_free_request
     0.25%     0.10%  io_uring      [kernel.kallsyms]  [k] bio_add_page
     0.25%     0.16%  io_uring      [kernel.kallsyms]  [k] __sbitmap_get_word
     0.23%     0.16%  io_uring      [kernel.kallsyms]  [k] mutex_lock
     0.22%     0.20%  io_uring      [kernel.kallsyms]  [k] __next_zones_zonelist
     0.21%     0.15%  io_uring      [kernel.kallsyms]  [k] __x86_indirect_thunk_rax
     0.21%     0.05%  io_uring      [kernel.kallsyms]  [k] __rq_qos_throttle
     0.20%     0.09%  io_uring      [kernel.kallsyms]  [k] kiocb_done
     0.19%     0.16%  io_uring      [kernel.kallsyms]  [k] bio_crypt_rq_ctx_compatible
     0.19%     0.14%  io_uring      [kernel.kallsyms]  [k] slab_pre_alloc_hook.constprop.0
     0.19%     0.18%  io_uring      [kernel.kallsyms]  [k] slab_free_freelist_hook
     0.19%     0.16%  io_uring      [kernel.kallsyms]  [k] error_entry
     0.19%     0.02%  io_uring      [kernel.kallsyms]  [k] lock_page_lruvec_irqsave
     0.19%     0.16%  io_uring      [kernel.kallsyms]  [k] release_pages
     0.18%     0.02%  io_uring      [kernel.kallsyms]  [k] blk_mq_put_tag
     0.18%     0.15%  io_uring      [kernel.kallsyms]  [k] get_mem_cgroup_from_mm
     0.17%     0.16%  io_uring      [kernel.kallsyms]  [k] sbitmap_queue_clear
     0.16%     0.04%  io_uring      [kernel.kallsyms]  [k] blk_account_io_start
     0.16%     0.14%  io_uring      [kernel.kallsyms]  [k] io_put_req
     0.16%     0.12%  io_uring      [kernel.kallsyms]  [k] blk_account_io_done
     0.16%     0.15%  io_uring      [kernel.kallsyms]  [k] update_io_ticks
     0.16%     0.14%  io_uring      [kernel.kallsyms]  [k] blk_stat_add
     0.15%     0.12%  io_uring      [kernel.kallsyms]  [k] aa_file_perm
     0.15%     0.11%  io_uring      [kernel.kallsyms]  [k] add_interrupt_randomness
     0.15%     0.14%  io_uring      [kernel.kallsyms]  [k] page_mapping
     0.14%     0.00%  io_uring      [kernel.kallsyms]  [k] sysvec_reschedule_ipi
     0.13%     0.11%  io_uring      [kernel.kallsyms]  [k] rcu_all_qs
     0.13%     0.01%  io_uring      [kernel.kallsyms]  [k] mempool_kmalloc
     0.13%     0.12%  io_uring      [kernel.kallsyms]  [k] wbt_issue
     0.13%     0.05%  io_uring      [kernel.kallsyms]  [k] __rq_qos_track
     0.13%     0.12%  io_uring      [kernel.kallsyms]  [k] wbt_done
     0.12%     0.11%  io_uring      [kernel.kallsyms]  [k] blk_add_timer
     0.12%     0.12%  io_uring      [kernel.kallsyms]  [k] __io_cqring_fill_event
     0.12%     0.10%  io_uring      [kernel.kallsyms]  [k] page_counter_try_charge
     0.12%     0.10%  io_uring      [kernel.kallsyms]  [k] nvme_submit_cmd
     0.11%     0.09%  io_uring      [kernel.kallsyms]  [k] memcg_slab_post_alloc_hook
     0.11%     0.05%  io_uring      [kernel.kallsyms]  [k] mem_cgroup_charge_statistics.constprop.0
     0.11%     0.09%  io_uring      [kernel.kallsyms]  [k] blk_queue_enter
     0.11%     0.06%  io_uring      [kernel.kallsyms]  [k] kernel_init_free_pages
     0.11%     0.10%  io_uring      [kernel.kallsyms]  [k] __blk_rq_map_sg
     0.11%     0.10%  io_uring      [kernel.kallsyms]  [k] nvme_setup_rw
     0.11%     0.09%  io_uring      [kernel.kallsyms]  [k] __bio_try_merge_page
     0.11%     0.10%  io_uring      [kernel.kallsyms]  [k] bio_integrity_prep
     0.10%     0.08%  io_uring      [kernel.kallsyms]  [k] __xas_next
     0.10%     0.09%  io_uring      [kernel.kallsyms]  [k] __bio_add_page
     0.10%     0.08%  io_uring      [kernel.kallsyms]  [k] __io_complete_rw.constprop.0
     0.10%     0.09%  io_uring      [kernel.kallsyms]  [k] blk_mq_complete_request_remote
     0.10%     0.10%  io_uring      [kernel.kallsyms]  [k] __count_memcg_events.part.0
     0.09%     0.07%  io_uring      [kernel.kallsyms]  [k] kthread_blkcg
     0.09%     0.08%  io_uring      [kernel.kallsyms]  [k] blk_queue_bounce
     0.09%     0.00%  io_uring      [unknown]          [k] 0xbaff630d0ccd3500
     0.09%     0.09%  io_uring      [kernel.kallsyms]  [k] blk_mq_tag_to_rq
     0.09%     0.08%  io_uring      [kernel.kallsyms]  [k] dma_map_page_attrs
     0.09%     0.08%  io_uring      [kernel.kallsyms]  [k] wbt_data_dir
     0.09%     0.03%  io_uring      [kernel.kallsyms]  [k] __rq_qos_done
     0.09%     0.08%  io_uring      [kernel.kallsyms]  [k] blk_cgroup_bio_start
     0.08%     0.06%  io_uring      [kernel.kallsyms]  [k] sched_clock
     0.08%     0.06%  io_uring      [kernel.kallsyms]  [k] __entry_text_start
     0.08%     0.08%  io_uring      [kernel.kallsyms]  [k] xas_create
     0.08%     0.04%  io_uring      [kernel.kallsyms]  [k] __rq_qos_issue
     0.08%     0.07%  io_uring      [kernel.kallsyms]  [k] blk_mq_get_driver_tag
     0.08%     0.06%  io_uring      [kernel.kallsyms]  [k] policy_nodemask
     0.07%     0.01%  io_uring      [kernel.kallsyms]  [k] nvme_unmap_data.part.0
     0.07%     0.05%  io_uring      [kernel.kallsyms]  [k] __inc_numa_state
     0.07%     0.06%  io_uring      [kernel.kallsyms]  [k] bio_uninit
     0.07%     0.05%  io_uring      [kernel.kallsyms]  [k] blk_add_rq_to_plug
     0.06%     0.06%  io_uring      [kernel.kallsyms]  [k] mem_cgroup_update_lru_size
     0.06%     0.00%  io_uring      libc-2.32.so       [.] __nrand48_r
     0.06%     0.06%  io_uring      [kernel.kallsyms]  [k] __mod_zone_page_state
     0.06%     0.06%  io_uring      [kernel.kallsyms]  [k] bdev_read_page
     0.06%     0.05%  io_uring      [kernel.kallsyms]  [k] _find_next_bit.constprop.0
     0.06%     0.06%  io_uring      [kernel.kallsyms]  [k] syscall_return_via_sysret
     0.06%     0.06%  io_uring      [kernel.kallsyms]  [k] guard_bio_eod
     0.06%     0.01%  io_uring      [kernel.kallsyms]  [k] __slab_alloc
     0.06%     0.05%  io_uring      [kernel.kallsyms]  [k] dma_unmap_page_attrs
     0.05%     0.03%  io_uring      [kernel.kallsyms]  [k] I_BDEV
     0.05%     0.04%  io_uring      [kernel.kallsyms]  [k] bio_advance
     0.05%     0.00%  io_uring      libc-2.32.so       [.] __drand48_iterate
     0.04%     0.04%  io_uring      [kernel.kallsyms]  [k] kmalloc_slab
     0.04%     0.01%  io_uring      [kernel.kallsyms]  [k] wake_up_process
     0.04%     0.04%  io_uring      [kernel.kallsyms]  [k] ___slab_alloc
     0.04%     0.03%  io_uring      [kernel.kallsyms]  [k] dput
     0.04%     0.04%  io_uring      [kernel.kallsyms]  [k] io_cqring_ev_posted
     0.04%     0.03%  io_uring      [kernel.kallsyms]  [k] irq_exit_rcu
     0.04%     0.03%  io_uring      [kernel.kallsyms]  [k] xas_nomem
     0.04%     0.04%  io_uring      [kernel.kallsyms]  [k] note_interrupt
     0.04%     0.02%  io_uring      [kernel.kallsyms]  [k] memcg_check_events
     0.04%     0.04%  io_uring      [kernel.kallsyms]  [k] psi_flags_change
     0.04%     0.03%  io_uring      [kernel.kallsyms]  [k] try_to_wake_up
     0.04%     0.03%  io_uring      [kernel.kallsyms]  [k] should_failslab
     0.04%     0.03%  io_uring      [kernel.kallsyms]  [k] policy_node
     0.03%     0.03%  io_uring      [kernel.kallsyms]  [k] check_stack_object
     0.03%     0.01%  io_uring      [kernel.kallsyms]  [k] mempool_kfree
     0.03%     0.02%  io_uring      [kernel.kallsyms]  [k] hctx_unlock
     0.03%     0.02%  io_uring      [kernel.kallsyms]  [k] page_cache_next_miss
     0.03%     0.00%  io_uring      libc-2.32.so       [.] lrand48
     0.03%     0.02%  io_uring      [kernel.kallsyms]  [k] irq_enter_rcu
     0.03%     0.02%  io_uring      [kernel.kallsyms]  [k] find_next_zero_bit
     0.03%     0.02%  io_uring      [kernel.kallsyms]  [k] __rq_qos_done_bio
     0.03%     0.01%  io_uring      [kernel.kallsyms]  [k] memset
     0.02%     0.00%  io_uring      [kernel.kallsyms]  [k] __fget_light
     0.02%     0.01%  io_uring      [kernel.kallsyms]  [k] dma_pool_alloc
     0.02%     0.00%  io_uring      [kernel.kallsyms]  [k] asm_sysvec_apic_timer_interrupt
     0.02%     0.00%  io_uring      [kernel.kallsyms]  [k] sysvec_apic_timer_interrupt
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] iov_iter_bvec
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] cpuset_nodemask_valid_mems_allowed
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] blk_start_plug
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] task_work_add
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] irqentry_enter
     0.02%     0.00%  io_uring      [kernel.kallsyms]  [k] __fdget
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] restore_regs_and_return_to_kernel
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] _raw_spin_trylock
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] __fget_files
     0.02%     0.01%  io_uring      [kernel.kallsyms]  [k] nvme_error_status
     0.02%     0.00%  io_uring      [kernel.kallsyms]  [k] __mix_pool_bytes
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] should_fail_alloc_page
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] sync_regs
     0.02%     0.01%  io_uring      [kernel.kallsyms]  [k] io_commit_cqring
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] _mix_pool_bytes
     0.02%     0.00%  io_uring      [kernel.kallsyms]  [k] hrtimer_interrupt
     0.02%     0.00%  io_uring      [kernel.kallsyms]  [k] __sysvec_apic_timer_interrupt
     0.02%     0.01%  io_uring      [kernel.kallsyms]  [k] bvec_free
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] __irqentry_text_end
     0.02%     0.00%  io_uring      [kernel.kallsyms]  [k] syscall_enter_from_user_mode
     0.02%     0.01%  io_uring      [kernel.kallsyms]  [k] iov_iter_revert
     0.02%     0.01%  io_uring      [kernel.kallsyms]  [k] blk_status_to_errno
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] idle_cpu
     0.02%     0.01%  io_uring      [kernel.kallsyms]  [k] blk_queue_exit
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] dma_map_sg_attrs
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] dma_unmap_sg_attrs
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] __hrtimer_run_queues
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] dma_direct_unmap_sg
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] tick_sched_timer
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] bvec_alloc
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] arch_do_signal_or_restart
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] should_fail_bio
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] __mem_cgroup_threshold
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] dma_direct_map_sg
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] propagate_protected_usage
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] __sbq_wake_up
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] dma_pool_free
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] entry_SYSCALL_64_safe_stack
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] nvme_cleanup_cmd
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] free_unref_page_list
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] blk_mq_sched_restart
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] update_process_times
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] error_return
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] bvec_split_segs
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] tick_sched_handle
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] sg_init_table
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] scheduler_tick
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] put_cpu_partial
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] refill_stock
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] __softirqentry_text_start
     0.01%     0.00%  io_uring      io_uring           [.] 0x0000564cf7f80320
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] native_iret
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] mem_cgroup_uncharge_list
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] inode_congested
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] sg_next
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] fpregs_assert_state_consistent
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] fput
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] task_tick_fair
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] tick_do_update_jiffies64
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] wake_up_state
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] allocate_slab
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_run_task_work_sig
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_wall_time
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] timekeeping_advance
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] run_rebalance_domains
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __schedule
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] perf_event_task_tick
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] tick_program_event
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] credit_entropy_bits.constprop.0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __alloc_pages_slowpath.constprop.0
     0.00%     0.00%  io_uring      io_uring           [.] 0x0000564cf7f802d0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_uring_add_task_file
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rebalance_domains
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_curr
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] native_write_msr
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] schedule
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] timekeeping_update
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rcu_core
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] schedule_timeout
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] clockevents_program_event
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] nvme_free_prps
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rcu_core_si
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_load_avg
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] load_balance
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] ttwu_do_activate
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] run_timer_softirq
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] enqueue_task
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rcu_report_qs_rnp
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __queue_work
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] find_busiest_group
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_blocked_averages
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] pvclock_gtod_notify
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __run_timers.part.0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] call_timer_fn
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] ktime_get_update_offsets_now
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] account_system_time
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] vma_migratable
     0.00%     0.00%  io_uring      libc-2.32.so       [.] __GI___libc_write
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] get_random_u32
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __wake_up_common_lock
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __wake_up
     0.00%     0.00%  iou-mgr-3119  [unknown]          [k] 0000000000000000
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] ret_from_fork
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] io_wq_manager
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __x64_sys_write
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] ksys_write
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] vfs_write
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] intel_tfa_pmu_enable_all
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] x86_pmu_enable
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_cfs_group
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] hrtimer_active
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] lapic_next_deadline
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] delayed_work_timer_fn
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] arch_scale_freq_tick
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rcu_sched_clock_irq
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] account_process_tick
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] wakeup_kswapd
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] _warn_unseeded_randomness
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] pick_next_task_fair
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] new_sync_write
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] tty_write
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] file_tty_write.constprop.0
     0.00%     0.00%  iou-mgr-3116  [unknown]          [k] 0000000000000000
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] io_wq_manager
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] ret_from_fork
     0.00%     0.00%  io_uring      libc-2.32.so       [.] clock_nanosleep@@GLIBC_2.17
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] schedule_timeout
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] account_system_index_time
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __update_load_avg_se
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] put_prev_entity
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] trigger_load_balance
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rcu_gp_kthread_wake
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] swake_up_one
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __x64_sys_clock_nanosleep
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] schedule
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] __schedule
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] common_nsleep
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] io_wqe_worker
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] ret_from_fork
     0.00%     0.00%  iou-wrk-3119  [unknown]          [k] 0000000000000000
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] schedule
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] schedule_timeout
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] hrtimer_nanosleep
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] reweight_entity
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] enqueue_hrtimer
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __cgroup_account_cputime_field
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] enqueue_entity
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] enqueue_task_fair
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] kick_process
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] autoremove_wake_function
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] default_wake_function
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] wake_all_kswapds
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __zone_watermark_ok
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] n_tty_write
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] insert_work
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] finish_task_switch.isra.0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] do_nanosleep
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] __schedule
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] dequeue_task
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_rt_rq_load_avg
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] pty_write
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __calc_delta
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] dequeue_task_fair
     0.00%     0.00%  io_uring      [unknown]          [k] 0000000000000000
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] pick_next_task_fair
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_min_vruntime
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_vsyscall
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rb_next
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rb_insert_color
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] x86_pmu_disable
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __update_load_avg_cfs_rq
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __radix_tree_lookup
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] idr_find
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] radix_tree_lookup
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_queue_async_work
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_rw_reissue
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_wq_enqueue
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_wqe_activate_free_worker.isra.0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_wqe_enqueue
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_wqe_wake_worker
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] task_numa_work
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __note_gp_changes
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] note_gp_changes
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] psi_task_switch
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] get_partial_node.part.0
     0.00%     0.00%  io_uring      io_uring           [.] 0x0000564cf7f80324
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] newidle_balance
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] schedule_timeout
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] __schedule
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] schedule
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] do_syscall_64
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_rq_clock
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] io_wq_check_workers
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] io_worker_handle_work
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] queue_work_on
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] tty_flip_buffer_push
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] intel_pmu_disable_all
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] newidle_balance
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] pick_next_task_fair
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rb_erase
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] timerqueue_add
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] acct_account_cputime
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] group_balance_cpu
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_fast_timekeeper
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_dl_rq_load_avg
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] blkcg_maybe_throttle_current
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] cpuacct_charge
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] put_prev_task_fair
     0.00%     0.00%  io_uring      io_uring           [.] 0x0000564cf7f802d4
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] cgroup_rstat_updated
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] calc_global_load
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] send_call_function_single_ipi
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] generic_exec_single
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] kick_ilb
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] smp_call_function_single_async
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] native_read_msr
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] run_posix_cpu_timers
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] cpuacct_account_field
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] irq_work_tick
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] mod_node_page_state
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] zone_watermark_ok_safe
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rcu_qs
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] ttwu_queue_wakelist
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] dequeue_entity
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] find_busiest_group
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] load_balance
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] newidle_balance
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] dequeue_task
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] _nohz_idle_balance
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __handle_mm_fault
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] handle_mm_fault
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] io_wq_submit_work
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] dequeue_task
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] find_busiest_group
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] io_wq_check_workers
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] load_balance
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] dequeue_entity
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] dequeue_task_fair
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] switch_fpu_return
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] prepare_to_wait_exclusive
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] io_issue_sqe
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] io_read
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] pick_next_task_fair
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] vm_mmap_pgoff
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] blkdev_read_iter
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] generic_file_read_iter
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] io_iter_do_read
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] perf_event_alloc
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] copy_process
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] inherit_event.constprop.0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] inherit_task_group.isra.0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] perf_event_init_task
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] asm_exc_page_fault
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] do_user_addr_fault
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] exc_page_fault
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __x64_sys_execve
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] bprm_execve
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] do_execveat_common
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] load_elf_binary
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] dequeue_task
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] copy_process
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] create_io_thread
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] create_io_worker.isra.0
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] inherit_event.constprop.0
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] inherit_task_group.isra.0
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] perf_event_init_task
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] update_curr
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] update_rq_clock
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] create_io_thread
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_uring_alloc_task_context
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_wq_create
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_wq_fork_manager
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] tty_insert_flip_string_fixed_flag
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] filemap_read
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c383030
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] newidle_balance
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c343234
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __io_uring_register
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __x64_sys_io_uring_register
     0.00%     0.00%  io_uring      [unknown]          [k] 0x000000000000cc81
     0.00%     0.00%  io_uring      [unknown]          [k] 0x00007f5c1654bc00
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __x64_sys_mmap
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] do_mmap
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] ksys_mmap_pgoff
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] alloc_pages_vma
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] psi_group_change
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] psi_task_change
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __x64_sys_io_uring_setup
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_uring_setup
     0.00%     0.00%  io_uring      [unknown]          [k] 0x6966206465646441
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] _nohz_idle_balance
     0.00%     0.00%  numactl       [unknown]          [k] 0x00007f3fff40bb7b
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] finish_wait
     0.00%     0.00%  numactl       [unknown]          [.] 0000000000000000
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] _find_next_bit.constprop.0
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] __update_idle_core
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] _raw_spin_lock_irq
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] __perf_event_task_sched_in
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] _find_next_bit.constprop.0
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] update_curr
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] _nohz_idle_balance
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] dequeue_entity
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] dequeue_task_fair
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] finish_task_switch.isra.0
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] pick_next_task_idle
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] sched_clock_cpu
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] vm_mmap_pgoff
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] copy_page_to_iter
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] perf_iterate_ctx
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] perf_iterate_sb
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] find_busiest_group
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] load_balance
     0.00%     0.00%  io_uring      libc-2.32.so       [.] __clone
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] __io_free_req
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] io_free_work
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] native_irq_return_iret
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] copy_user_generic_unrolled
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __tty_buffer_request_room
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] ktime_get_real_seconds
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] do_output_char
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c303237
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c323933
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c333030
     0.00%     0.00%  io_uring      [unknown]          [.] 0x534f49202c343636
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c353937
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c373834
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c383237
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] tty_paranoia_check
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c303032
     0.00%     0.00%  io_uring      [unknown]          [.] 0x534f49202c303637
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c363534
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c363731
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] get_unmapped_area
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __do_sys_clone
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __x64_sys_clone
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] kernel_clone
     0.00%     0.00%  io_uring      libc-2.32.so       [.] __mmap
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __get_user_pages
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] kvfree
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] pin_user_pages
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] finish_task_switch.isra.0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] copy_user_generic_unrolled
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] get_timespec64
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c353333
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c363136
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] kmem_cache_alloc_trace
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __alloc_percpu_gfp
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __alloc_file
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __memcg_kmem_charge
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __x64_sys_openat
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] alloc_empty_file
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] do_filp_open
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] do_sys_openat2
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] obj_cgroup_charge
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] path_openat
     0.00%     0.00%  io_uring      libc-2.32.so       [.] __GI___libc_open
     0.00%     0.00%  io_uring      [unknown]          [k] 0x4c003270316e3065
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] kthread_blkcg
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __ext4_get_inode_loc
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __ext4_iget
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __x64_sys_openat
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] bio_associate_blkg
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] do_filp_open
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] do_sys_openat2
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] ext4_lookup
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] ext4_read_bh_lock
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] ext4_read_bh_nowait
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] ext4_sb_breadahead_unmovable
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] path_openat
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] submit_bh
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] submit_bh_wbc
     0.00%     0.00%  numactl       ld-2.32.so         [.] __GI___open64_nocancel
     0.00%     0.00%  numactl       ld-2.32.so         [.] _dl_map_object
     0.00%     0.00%  io_uring      libc-2.32.so       [.] _int_malloc
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] security_task_getsecid
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __x64_sys_execve
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] bprm_execve
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] do_execveat_common
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] elf_map
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] ima_file_mmap
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] load_elf_binary
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] security_mmap_file
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] vm_mmap
     0.00%     0.00%  io_uring      libc-2.32.so       [.] __GI___execve
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] add_mm_counter_fast
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] do_wp_page
     0.00%     0.00%  io_uring      ld-2.32.so         [.] _dl_relocate_object
     0.00%     0.00%  io_uring      ld-2.32.so         [.] _dl_sysdep_start
     0.00%     0.00%  io_uring      ld-2.32.so         [.] dl_main
     0.00%     0.00%  io_uring      libc-2.32.so       [.] sysmalloc
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] kthread_is_per_cpu
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c303030
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c323332
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c323931
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c343833
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] mmap_region
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] perf_event_mmap
     0.00%     0.00%  io_uring      ld-2.32.so         [.] mmap64
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] aa_get_task_label
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __x64_sys_mmap
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] apparmor_task_getsecid
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] ima_file_mmap
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] ksys_mmap_pgoff
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] security_mmap_file
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] security_task_getsecid
     0.00%     0.00%  numactl       ld-2.32.so         [.] mmap64
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] flush_tlb_batched_pending
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] begin_new_exec
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] exit_mmap
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] mmput
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] unmap_page_range
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] unmap_single_vma
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] unmap_vmas
     0.00%     0.00%  numactl       libc-2.32.so       [.] __GI___execve
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __count_memcg_events.part.0
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __count_memcg_events
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] asm_exc_page_fault
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] do_user_addr_fault
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] exc_page_fault
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] handle_mm_fault
     0.00%     0.00%  numactl       libc-2.32.so       [.] sysmalloc
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __fsnotify_parent
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] ____fput
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __fput
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] exit_to_user_mode_prepare
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] syscall_exit_to_user_mode
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] task_work_run
     0.00%     0.00%  numactl       libc-2.32.so       [.] __close_nocancel
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] cpumask_next_and
     0.00%     0.00%  numactl       ld-2.32.so         [.] init_tls
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __entry_text_start
     0.00%     0.00%  numactl       ld-2.32.so         [.] _dl_allocate_tls_storage
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] put_dec_trunc8
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __x64_sys_read
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] dev_attr_show
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] kernfs_fop_read_iter
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] kernfs_seq_show
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] ksys_read
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] new_sync_read
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] node_read_meminfo
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] number
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] seq_read_iter
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] sysfs_emit_at
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] sysfs_kf_seq_show
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] vfs_read
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] vscnprintf
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] vsnprintf
     0.00%     0.00%  numactl       libc-2.32.so       [.] read
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] tty_hung_up_p
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __perf_event_task_sched_out
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] remove_wait_queue
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c323139
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c323732
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c343439
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] deactivate_slab
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] ___slab_alloc
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] __slab_alloc
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] allocate_fake_cpuc
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] intel_cpuc_prepare
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] kmem_cache_alloc_node_trace
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] perf_event_alloc
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] perf_try_init_event
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] x86_pmu_event_init
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] _raw_spin_lock
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] del_timer_sync
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] native_write_msr
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] put_prev_task_fair
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] update_load_avg
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] x86_pmu_disable
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] cpumask_next
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] intel_tfa_pmu_enable_all
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] rcu_read_unlock_strict
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] update_blocked_averages
     0.00%     0.00%  iou-mgr-3116  [kernel.kallsyms]  [k] x86_pmu_enable
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] native_sched_clock
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] _nohz_idle_balance
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] update_rq_clock
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] __mutex_init
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] native_sched_clock
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] check_stack_object
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] cpuacct_charge
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] hrtick_update
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] io_wq_worker_running
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] record_times
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] update_curr
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] cpuacct_charge
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] io_wq_worker_sleeping
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] psi_group_change
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] update_min_vruntime
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] x86_pmu_disable
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] del_timer_sync
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] psi_task_change
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] sched_clock_cpu
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] kmem_cache_alloc_node_trace
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] allocate_fake_cpuc
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] intel_cpuc_prepare
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] perf_try_init_event
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] x86_pmu_event_init
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] pgd_free
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] update_blocked_averages
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] __mmdrop
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] __update_load_avg_cfs_rq
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] update_load_avg
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] update_cfs_group
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] dequeue_task_fair
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] perf_iterate_ctx
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] do_mmap
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] elf_map
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] mmap_region
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] perf_event_mmap
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] perf_iterate_sb
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] vm_mmap
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] __update_idle_core
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] block_read_full_page
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] kfree
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] refill_stock
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] update_blocked_averages
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] drain_obj_stock
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] filemap_get_pages
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] filemap_read_page
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] io_dismantle_req
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] kmem_cache_free
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] obj_cgroup_uncharge
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] pick_next_task_idle
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] rcu_read_unlock_strict
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] refill_obj_stock
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] ret_from_fork
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] tsk_fork_get_node
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __set_task_comm
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_wq_manager
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] perf_event_comm
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __perf_event_task_sched_in
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] perf_iterate_ctx
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] __set_task_comm
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] io_wqe_worker
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] perf_event_comm
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] perf_iterate_sb
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] __x64_sys_execve
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] begin_new_exec
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] bprm_execve
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] do_execveat_common
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] do_syscall_64
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] load_elf_binary
     0.00%     0.00%  perf          [unknown]          [k] 0x00007f3fff40bb7b
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] apparmor_file_permission
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] rw_verify_area
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] security_file_permission
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] _raw_spin_lock_irq
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] native_write_msr
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] __perf_event_task_sched_in
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] intel_tfa_pmu_enable_all
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] x86_pmu_enable
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] perf_iterate_sb
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] __set_task_comm
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] perf_event_comm
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] schedule_tail
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] schedule_tail
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] native_flush_tlb_one_user
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] native_flush_tlb_one_user
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] end_repeat_nmi
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] native_write_msr
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] __perf_event_task_sched_in
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] finish_task_switch.isra.0
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] intel_tfa_pmu_enable_all
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] x86_pmu_enable
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] native_set_fixmap
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] native_write_msr
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] ctx_resched
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] intel_tfa_pmu_enable_all
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] perf_event_exec
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] x86_pmu_enable
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] native_flush_tlb_one_user
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] native_flush_tlb_one_user
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] memcpy_fromio
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] nmi_cpu_backtrace
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] acpi_os_read_memory
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] acpi_os_read_memory
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] nmi_cpu_backtrace
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] nmi_cpu_backtrace
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] native_sched_clock
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] native_apic_msr_write
     0.00%     0.00%  iou-mgr-3119  [kernel.kallsyms]  [k] native_apic_msr_write
     0.00%     0.00%  iou-wrk-3119  [kernel.kallsyms]  [k] native_apic_msr_write
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] native_apic_msr_write
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __intel_pmu_enable_all.constprop.0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] intel_pmu_handle_irq


#
# (Tip: To count events in every 1000 msec: perf stat -I 1000)
#
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 841K of event 'cycles'
# Event count (approx.): 753744677630
#
# Children      Self  Command       Shared Object      Symbol                                      
# ........  ........  ............  .................  ............................................
#
    99.91%     0.00%  io_uring      [unknown]          [k] 0x0000000000000005
    99.91%     0.00%  io_uring      [unknown]          [k] 0x000055c42c2c2450
    99.45%     0.02%  io_uring      libc-2.32.so       [.] syscall
    99.24%     0.01%  io_uring      [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
    99.22%     0.00%  io_uring      [kernel.kallsyms]  [k] do_syscall_64
    94.48%     0.22%  io_uring      [kernel.kallsyms]  [k] __io_queue_sqe
    94.09%     0.31%  io_uring      [kernel.kallsyms]  [k] io_issue_sqe
    93.33%     0.62%  io_uring      [kernel.kallsyms]  [k] io_read
    88.57%     0.93%  io_uring      [kernel.kallsyms]  [k] blkdev_read_iter
    87.80%     0.15%  io_uring      [kernel.kallsyms]  [k] io_iter_do_read
    87.57%     0.16%  io_uring      [kernel.kallsyms]  [k] generic_file_read_iter
    87.35%     1.08%  io_uring      [kernel.kallsyms]  [k] filemap_read
    82.44%     0.01%  io_uring      [kernel.kallsyms]  [k] __x64_sys_io_uring_enter
    82.34%     0.01%  io_uring      [kernel.kallsyms]  [k] __do_sys_io_uring_enter
    82.09%     0.47%  io_uring      [kernel.kallsyms]  [k] io_submit_sqes
    79.50%     0.08%  io_uring      [kernel.kallsyms]  [k] io_queue_sqe
    71.47%     1.08%  io_uring      [kernel.kallsyms]  [k] filemap_get_pages
    53.08%     0.35%  io_uring      [kernel.kallsyms]  [k] ondemand_readahead
    51.74%     1.49%  io_uring      [kernel.kallsyms]  [k] page_cache_ra_unbounded
    49.92%     0.13%  io_uring      [kernel.kallsyms]  [k] page_cache_sync_ra
    21.26%     0.63%  io_uring      [kernel.kallsyms]  [k] add_to_page_cache_lru
    17.08%     0.02%  io_uring      [kernel.kallsyms]  [k] task_work_run
    16.98%     0.03%  io_uring      [kernel.kallsyms]  [k] exit_to_user_mode_prepare
    16.94%    10.52%  io_uring      [kernel.kallsyms]  [k] __add_to_page_cache_locked
    16.94%     0.06%  io_uring      [kernel.kallsyms]  [k] tctx_task_work
    16.79%     0.19%  io_uring      [kernel.kallsyms]  [k] syscall_exit_to_user_mode
    16.41%     2.61%  io_uring      [kernel.kallsyms]  [k] filemap_get_read_batch
    16.08%    15.72%  io_uring      [kernel.kallsyms]  [k] xas_load
    15.99%     0.15%  io_uring      [kernel.kallsyms]  [k] read_pages
    15.87%     0.13%  io_uring      [kernel.kallsyms]  [k] blkdev_readahead
    15.72%     0.56%  io_uring      [kernel.kallsyms]  [k] mpage_readahead
    15.58%     0.09%  io_uring      [kernel.kallsyms]  [k] io_req_task_submit
    15.37%     0.07%  io_uring      [kernel.kallsyms]  [k] __io_req_task_submit
    12.14%     0.15%  io_uring      [kernel.kallsyms]  [k] copy_page_to_iter
    12.03%     0.03%  io_uring      [kernel.kallsyms]  [k] __page_cache_alloc
    11.98%     0.14%  io_uring      [kernel.kallsyms]  [k] alloc_pages_current
    11.66%     0.30%  io_uring      [kernel.kallsyms]  [k] __alloc_pages_nodemask
    11.28%    11.05%  io_uring      [kernel.kallsyms]  [k] _copy_to_iter
    11.06%     3.16%  io_uring      [kernel.kallsyms]  [k] get_page_from_freelist
    10.53%     0.10%  io_uring      [kernel.kallsyms]  [k] submit_bio
    10.27%     0.21%  io_uring      [kernel.kallsyms]  [k] submit_bio_noacct
     8.30%     0.51%  io_uring      [kernel.kallsyms]  [k] blk_mq_submit_bio
     7.73%     7.62%  io_uring      [kernel.kallsyms]  [k] clear_page_erms
     3.68%     1.02%  io_uring      [kernel.kallsyms]  [k] do_mpage_readpage
     3.33%     0.01%  io_uring      [kernel.kallsyms]  [k] page_cache_async_ra
     3.25%     0.01%  io_uring      [kernel.kallsyms]  [k] blk_flush_plug_list
     3.24%     0.06%  io_uring      [kernel.kallsyms]  [k] blk_mq_flush_plug_list
     3.18%     0.01%  io_uring      [kernel.kallsyms]  [k] blk_mq_sched_insert_requests
     3.15%     0.13%  io_uring      [kernel.kallsyms]  [k] blk_mq_try_issue_list_directly
     2.84%     0.15%  io_uring      [kernel.kallsyms]  [k] __blk_mq_try_issue_directly
     2.69%     0.58%  io_uring      [kernel.kallsyms]  [k] nvme_queue_rq
     2.54%     1.05%  io_uring      [kernel.kallsyms]  [k] blk_attempt_plug_merge
     2.53%     0.44%  io_uring      [kernel.kallsyms]  [k] mark_page_accessed
     2.36%     0.12%  io_uring      [kernel.kallsyms]  [k] rw_verify_area
     2.16%     0.09%  io_uring      [kernel.kallsyms]  [k] mpage_alloc
     2.10%     0.27%  io_uring      [kernel.kallsyms]  [k] lru_cache_add
     1.81%     0.27%  io_uring      [kernel.kallsyms]  [k] security_file_permission
     1.75%     0.87%  io_uring      [kernel.kallsyms]  [k] __pagevec_lru_add
     1.70%     0.63%  io_uring      [kernel.kallsyms]  [k] _raw_spin_lock_irq
     1.53%     0.40%  io_uring      [kernel.kallsyms]  [k] xa_get_order
     1.52%     0.15%  io_uring      [kernel.kallsyms]  [k] __blk_mq_alloc_request
     1.50%     0.65%  io_uring      [kernel.kallsyms]  [k] workingset_refault
     1.50%     0.07%  io_uring      [kernel.kallsyms]  [k] activate_page
     1.48%     0.29%  io_uring      [kernel.kallsyms]  [k] io_submit_flush_completions
     1.46%     0.32%  io_uring      [kernel.kallsyms]  [k] bio_alloc_bioset
     1.42%     0.15%  io_uring      [kernel.kallsyms]  [k] xa_load
     1.40%     0.16%  io_uring      [kernel.kallsyms]  [k] pagevec_lru_move_fn
     1.39%     0.06%  io_uring      [kernel.kallsyms]  [k] blk_finish_plug
     1.35%     0.28%  io_uring      [kernel.kallsyms]  [k] submit_bio_checks
     1.26%     0.02%  io_uring      [kernel.kallsyms]  [k] asm_common_interrupt
     1.25%     0.01%  io_uring      [kernel.kallsyms]  [k] common_interrupt
     1.21%     1.21%  io_uring      [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
     1.18%     0.10%  io_uring      [kernel.kallsyms]  [k] mempool_alloc
     1.17%     0.00%  io_uring      [kernel.kallsyms]  [k] __common_interrupt
     1.14%     0.03%  io_uring      [kernel.kallsyms]  [k] handle_edge_irq
     1.08%     0.87%  io_uring      [kernel.kallsyms]  [k] apparmor_file_permission
     1.03%     0.19%  io_uring      [kernel.kallsyms]  [k] __mod_lruvec_state
     1.02%     0.65%  io_uring      [kernel.kallsyms]  [k] blk_rq_merge_ok
     1.01%     0.94%  io_uring      [kernel.kallsyms]  [k] xas_start
     0.95%     0.11%  io_uring      [kernel.kallsyms]  [k] mempool_alloc_slab
     0.93%     0.89%  io_uring      [kernel.kallsyms]  [k] read_tsc
     0.93%     0.24%  io_uring      [kernel.kallsyms]  [k] mem_cgroup_charge
     0.92%     0.10%  io_uring      [kernel.kallsyms]  [k] io_req_free_batch
     0.92%     0.13%  io_uring      [kernel.kallsyms]  [k] io_setup_async_rw
     0.91%     0.24%  io_uring      [kernel.kallsyms]  [k] xas_store
     0.89%     0.19%  io_uring      [kernel.kallsyms]  [k] page_cache_prev_miss
     0.88%     0.63%  io_uring      [kernel.kallsyms]  [k] __activate_page
     0.88%     0.02%  io_uring      [kernel.kallsyms]  [k] asm_sysvec_reschedule_ipi
     0.84%     0.46%  io_uring      [kernel.kallsyms]  [k] __fsnotify_parent
     0.82%     0.11%  io_uring      [kernel.kallsyms]  [k] io_dismantle_req
     0.80%     0.02%  io_uring      [kernel.kallsyms]  [k] handle_irq_event
     0.78%     0.77%  io_uring      [kernel.kallsyms]  [k] workingset_update_node
     0.77%     0.14%  io_uring      [kernel.kallsyms]  [k] PageHuge
     0.74%     0.03%  io_uring      [kernel.kallsyms]  [k] nvme_irq
     0.73%     0.01%  io_uring      [kernel.kallsyms]  [k] __handle_irq_event_percpu
     0.73%     0.10%  io_uring      [kernel.kallsyms]  [k] blk_mq_start_request
     0.72%     0.07%  io_uring      [kernel.kallsyms]  [k] nvme_process_cq
     0.71%     0.04%  io_uring      [kernel.kallsyms]  [k] __xas_prev
     0.71%     0.35%  io_uring      [kernel.kallsyms]  [k] __mod_memcg_lruvec_state
     0.70%     0.15%  io_uring      [kernel.kallsyms]  [k] __check_object_size
     0.69%     0.19%  io_uring      [kernel.kallsyms]  [k] blk_mq_get_tag
     0.69%     0.29%  io_uring      [kernel.kallsyms]  [k] kfree
     0.66%     0.26%  io_uring      [kernel.kallsyms]  [k] blk_mq_rq_ctx_init
     0.65%     0.35%  io_uring      [kernel.kallsyms]  [k] __kmalloc
     0.64%     0.43%  io_uring      [kernel.kallsyms]  [k] kmem_cache_alloc
     0.64%     0.21%  io_uring      [kernel.kallsyms]  [k] __mod_lruvec_page_state
     0.61%     0.01%  io_uring      [kernel.kallsyms]  [k] nvme_pci_complete_rq
     0.60%     0.39%  io_uring      [kernel.kallsyms]  [k] __lock_page_async
     0.58%     0.00%  io_uring      [kernel.kallsyms]  [k] nvme_complete_rq
     0.58%     0.08%  io_uring      [kernel.kallsyms]  [k] bio_associate_blkg
     0.57%     0.31%  io_uring      [kernel.kallsyms]  [k] blk_throtl_bio
     0.57%     0.01%  io_uring      [kernel.kallsyms]  [k] blk_mq_end_request
     0.57%     0.14%  io_uring      [kernel.kallsyms]  [k] workingset_activation
     0.55%     0.53%  io_uring      [kernel.kallsyms]  [k] __virt_addr_valid
     0.55%     0.52%  io_uring      [kernel.kallsyms]  [k] memset_erms
     0.52%     0.39%  io_uring      [kernel.kallsyms]  [k] __x86_retpoline_rax
     0.51%     0.04%  io_uring      [kernel.kallsyms]  [k] __blk_mq_get_tag
     0.49%     0.49%  io_uring      [kernel.kallsyms]  [k] workingset_age_nonresident
     0.48%     0.06%  io_uring      [kernel.kallsyms]  [k] __sbitmap_queue_get
     0.48%     0.09%  io_uring      [kernel.kallsyms]  [k] irqentry_exit_to_user_mode
     0.48%     0.00%  io_uring      [kernel.kallsyms]  [k] irqentry_exit
     0.48%     0.46%  io_uring      [kernel.kallsyms]  [k] bio_associate_blkg_from_css
     0.46%     0.02%  io_uring      [kernel.kallsyms]  [k] sysvec_reschedule_ipi
     0.43%     0.00%  io_uring      [kernel.kallsyms]  [k] __irqentry_text_start
     0.43%     0.41%  io_uring      [kernel.kallsyms]  [k] xas_find_conflict
     0.43%     0.41%  io_uring      [kernel.kallsyms]  [k] io_file_supports_async
     0.42%     0.26%  io_uring      [kernel.kallsyms]  [k] rcu_read_unlock_strict
     0.42%     0.40%  io_uring      [kernel.kallsyms]  [k] blk_attempt_bio_merge.part.0
     0.42%     0.01%  io_uring      io_uring           [.] submitter_fn
     0.41%     0.14%  io_uring      [kernel.kallsyms]  [k] sbitmap_get
     0.39%     0.37%  io_uring      [kernel.kallsyms]  [k] io_import_iovec
     0.37%     0.36%  io_uring      [kernel.kallsyms]  [k] nvme_setup_cmd
     0.37%     0.35%  io_uring      [kernel.kallsyms]  [k] fsnotify
     0.37%     0.36%  io_uring      [kernel.kallsyms]  [k] __mod_memcg_state.part.0
     0.37%     0.35%  io_uring      [kernel.kallsyms]  [k] __mod_node_page_state
     0.36%     0.02%  io_uring      [kernel.kallsyms]  [k] mem_cgroup_from_id
     0.36%     0.35%  io_uring      [kernel.kallsyms]  [k] _raw_spin_lock
     0.36%     0.35%  io_uring      [kernel.kallsyms]  [k] io_prep_rw
     0.36%     0.03%  io_uring      [kernel.kallsyms]  [k] idr_find
     0.36%     0.02%  io_uring      [kernel.kallsyms]  [k] blk_update_request
     0.34%     0.23%  io_uring      [kernel.kallsyms]  [k] __cond_resched
     0.34%     0.32%  io_uring      [kernel.kallsyms]  [k] ktime_get
     0.34%     0.31%  io_uring      [kernel.kallsyms]  [k] blk_integrity_merge_bio
     0.33%     0.30%  io_uring      [kernel.kallsyms]  [k] mem_cgroup_update_lru_size
     0.32%     0.01%  io_uring      [kernel.kallsyms]  [k] bio_endio
     0.32%     0.30%  io_uring      [kernel.kallsyms]  [k] percpu_counter_add_batch
     0.32%     0.31%  io_uring      [kernel.kallsyms]  [k] blkdev_get_block
     0.32%     0.02%  io_uring      [kernel.kallsyms]  [k] radix_tree_lookup
     0.31%     0.30%  io_uring      [kernel.kallsyms]  [k] wbt_wait
     0.31%     0.03%  io_uring      [kernel.kallsyms]  [k] mpage_end_io
     0.31%     0.29%  io_uring      [kernel.kallsyms]  [k] io_file_get
     0.30%     0.30%  io_uring      [kernel.kallsyms]  [k] __radix_tree_lookup
     0.30%     0.30%  io_uring      [kernel.kallsyms]  [k] __blk_queue_split
     0.30%     0.18%  io_uring      [kernel.kallsyms]  [k] try_charge
     0.30%     0.30%  io_uring      [kernel.kallsyms]  [k] __blk_mq_sched_bio_merge
     0.28%     0.26%  io_uring      [kernel.kallsyms]  [k] wbt_track
     0.27%     0.19%  io_uring      [kernel.kallsyms]  [k] __sbitmap_get_word
     0.26%     0.25%  io_uring      [kernel.kallsyms]  [k] __slab_free
     0.25%     0.20%  io_uring      [kernel.kallsyms]  [k] mutex_lock
     0.25%     0.24%  io_uring      [kernel.kallsyms]  [k] mutex_unlock
     0.24%     0.24%  io_uring      [kernel.kallsyms]  [k] native_irq_return_iret
     0.24%     0.22%  io_uring      [kernel.kallsyms]  [k] release_pages
     0.24%     0.10%  io_uring      [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     0.23%     0.09%  io_uring      [kernel.kallsyms]  [k] bio_add_page
     0.23%     0.20%  io_uring      [kernel.kallsyms]  [k] bio_crypt_rq_ctx_compatible
     0.23%     0.22%  io_uring      [kernel.kallsyms]  [k] __next_zones_zonelist
     0.22%     0.17%  io_uring      [kernel.kallsyms]  [k] slab_pre_alloc_hook.constprop.0
     0.20%     0.20%  io_uring      [kernel.kallsyms]  [k] native_apic_msr_eoi_write
     0.20%     0.10%  io_uring      [kernel.kallsyms]  [k] kiocb_done
     0.19%     0.01%  io_uring      [kernel.kallsyms]  [k] mempool_kmalloc
     0.19%     0.17%  io_uring      [kernel.kallsyms]  [k] get_mem_cgroup_from_mm
     0.19%     0.03%  io_uring      [kernel.kallsyms]  [k] lock_page_lruvec_irqsave
     0.19%     0.12%  io_uring      [kernel.kallsyms]  [k] __x86_indirect_thunk_rax
     0.19%     0.08%  io_uring      [kernel.kallsyms]  [k] __rq_qos_throttle
     0.18%     0.03%  io_uring      [kernel.kallsyms]  [k] page_endio
     0.18%     0.15%  io_uring      [kernel.kallsyms]  [k] aa_file_perm
     0.17%     0.05%  io_uring      [kernel.kallsyms]  [k] blk_account_io_start
     0.16%     0.01%  io_uring      [kernel.kallsyms]  [k] unlock_page
     0.16%     0.15%  io_uring      [kernel.kallsyms]  [k] io_put_req
     0.16%     0.15%  io_uring      [kernel.kallsyms]  [k] wbt_issue
     0.16%     0.13%  io_uring      [kernel.kallsyms]  [k] rcu_all_qs
     0.16%     0.06%  io_uring      [kernel.kallsyms]  [k] __rq_qos_track
     0.15%     0.08%  io_uring      [kernel.kallsyms]  [k] kernel_init_free_pages
     0.15%     0.15%  io_uring      [kernel.kallsyms]  [k] __io_cqring_fill_event
     0.15%     0.02%  io_uring      [kernel.kallsyms]  [k] wake_up_page_bit
     0.15%     0.14%  io_uring      [kernel.kallsyms]  [k] __mod_zone_page_state
     0.15%     0.14%  io_uring      [kernel.kallsyms]  [k] page_mapping
     0.15%     0.14%  io_uring      [kernel.kallsyms]  [k] __blk_rq_map_sg
     0.14%     0.14%  io_uring      [kernel.kallsyms]  [k] slab_free_freelist_hook
     0.14%     0.14%  io_uring      [kernel.kallsyms]  [k] blk_cgroup_bio_start
     0.14%     0.13%  io_uring      [kernel.kallsyms]  [k] nvme_submit_cmd
     0.14%     0.02%  io_uring      [kernel.kallsyms]  [k] irq_chip_ack_parent
     0.14%     0.03%  io_uring      [kernel.kallsyms]  [k] psi_task_change
     0.13%     0.06%  io_uring      [kernel.kallsyms]  [k] mem_cgroup_charge_statistics.constprop.0
     0.13%     0.12%  io_uring      [kernel.kallsyms]  [k] blk_queue_enter
     0.13%     0.12%  io_uring      [kernel.kallsyms]  [k] nvme_setup_rw
     0.13%     0.13%  io_uring      [kernel.kallsyms]  [k] __count_memcg_events.part.0
     0.13%     0.12%  io_uring      [kernel.kallsyms]  [k] update_io_ticks
     0.12%     0.11%  io_uring      [kernel.kallsyms]  [k] memcg_slab_post_alloc_hook
     0.11%     0.11%  io_uring      [kernel.kallsyms]  [k] blk_mq_get_driver_tag
     0.11%     0.08%  io_uring      [kernel.kallsyms]  [k] psi_group_change
     0.11%     0.11%  io_uring      [kernel.kallsyms]  [k] blk_queue_bounce
     0.11%     0.10%  io_uring      [kernel.kallsyms]  [k] page_counter_try_charge
     0.10%     0.09%  io_uring      [kernel.kallsyms]  [k] kthread_blkcg
     0.10%     0.00%  io_uring      [kernel.kallsyms]  [k] __wake_up_locked_key_bookmark
     0.10%     0.10%  io_uring      [kernel.kallsyms]  [k] __bio_add_page
     0.10%     0.09%  io_uring      [kernel.kallsyms]  [k] __io_complete_rw.constprop.0
     0.10%     0.09%  io_uring      [kernel.kallsyms]  [k] bio_integrity_prep
     0.09%     0.08%  io_uring      [kernel.kallsyms]  [k] __xas_next
     0.09%     0.08%  io_uring      [kernel.kallsyms]  [k] __entry_text_start
     0.09%     0.09%  io_uring      [kernel.kallsyms]  [k] dma_map_page_attrs
     0.09%     0.04%  io_uring      [kernel.kallsyms]  [k] io_async_buf_func
     0.09%     0.09%  io_uring      [kernel.kallsyms]  [k] xas_create
     0.09%     0.02%  io_uring      [kernel.kallsyms]  [k] __slab_alloc
     0.09%     0.00%  io_uring      [kernel.kallsyms]  [k] asm_sysvec_apic_timer_interrupt
     0.09%     0.00%  io_uring      [kernel.kallsyms]  [k] bio_put
     0.09%     0.01%  io_uring      [kernel.kallsyms]  [k] __wake_up_common
     0.09%     0.00%  io_uring      [kernel.kallsyms]  [k] sysvec_apic_timer_interrupt
     0.09%     0.08%  io_uring      [kernel.kallsyms]  [k] policy_nodemask
     0.09%     0.01%  io_uring      [kernel.kallsyms]  [k] bio_free
     0.09%     0.08%  io_uring      [kernel.kallsyms]  [k] blk_add_timer
     0.08%     0.04%  io_uring      [kernel.kallsyms]  [k] __rq_qos_issue
     0.08%     0.00%  io_uring      [kernel.kallsyms]  [k] psi_memstall_enter
     0.08%     0.00%  io_uring      [unknown]          [k] 0x43a3da3afeb43b00
     0.08%     0.02%  io_uring      [kernel.kallsyms]  [k] blk_mq_free_request
     0.08%     0.07%  io_uring      [kernel.kallsyms]  [k] __bio_try_merge_page
     0.08%     0.07%  io_uring      [kernel.kallsyms]  [k] __inc_numa_state
     0.08%     0.00%  io_uring      [kernel.kallsyms]  [k] __sysvec_apic_timer_interrupt
     0.07%     0.00%  io_uring      [kernel.kallsyms]  [k] hrtimer_interrupt
     0.07%     0.03%  io_uring      [kernel.kallsyms]  [k] __lock_text_start
     0.07%     0.01%  io_uring      [kernel.kallsyms]  [k] mempool_free_slab
     0.07%     0.06%  io_uring      [kernel.kallsyms]  [k] ___slab_alloc
     0.07%     0.07%  io_uring      [kernel.kallsyms]  [k] syscall_return_via_sysret
     0.07%     0.00%  io_uring      [kernel.kallsyms]  [k] mempool_free
     0.07%     0.00%  io_uring      [kernel.kallsyms]  [k] psi_memstall_leave
     0.06%     0.00%  io_uring      [kernel.kallsyms]  [k] __hrtimer_run_queues
     0.06%     0.00%  io_uring      libc-2.32.so       [.] __nrand48_r
     0.06%     0.05%  io_uring      [kernel.kallsyms]  [k] error_entry
     0.06%     0.00%  io_uring      [kernel.kallsyms]  [k] tick_sched_timer
     0.06%     0.06%  io_uring      [kernel.kallsyms]  [k] blk_add_rq_to_plug
     0.06%     0.05%  io_uring      [kernel.kallsyms]  [k] kmem_cache_free
     0.06%     0.05%  io_uring      [kernel.kallsyms]  [k] _find_next_bit.constprop.0
     0.06%     0.00%  io_uring      [kernel.kallsyms]  [k] __alloc_pages_slowpath.constprop.0
     0.05%     0.05%  io_uring      [kernel.kallsyms]  [k] kmalloc_slab
     0.05%     0.01%  io_uring      [kernel.kallsyms]  [k] lru_note_cost_page
     0.05%     0.05%  io_uring      [kernel.kallsyms]  [k] bdev_read_page
     0.05%     0.03%  io_uring      [kernel.kallsyms]  [k] I_BDEV
     0.05%     0.03%  io_uring      [kernel.kallsyms]  [k] memcg_check_events
     0.05%     0.00%  io_uring      [kernel.kallsyms]  [k] io_req_task_queue
     0.05%     0.00%  io_uring      [kernel.kallsyms]  [k] tick_sched_handle
     0.05%     0.01%  io_uring      [kernel.kallsyms]  [k] lru_note_cost
     0.05%     0.00%  io_uring      [kernel.kallsyms]  [k] update_process_times
     0.05%     0.04%  io_uring      [kernel.kallsyms]  [k] should_failslab
     0.04%     0.00%  io_uring      libc-2.32.so       [.] __drand48_iterate
     0.04%     0.03%  io_uring      [kernel.kallsyms]  [k] apic_ack_irq
     0.04%     0.04%  io_uring      [kernel.kallsyms]  [k] io_cqring_ev_posted
     0.04%     0.01%  io_uring      [kernel.kallsyms]  [k] io_req_task_work_add
     0.04%     0.03%  io_uring      [kernel.kallsyms]  [k] xas_nomem
     0.04%     0.00%  io_uring      [kernel.kallsyms]  [k] scheduler_tick
     0.04%     0.00%  io_uring      [kernel.kallsyms]  [k] dma_map_sg_attrs
     0.04%     0.00%  io_uring      libc-2.32.so       [.] lrand48
     0.04%     0.02%  io_uring      [kernel.kallsyms]  [k] hctx_unlock
     0.04%     0.03%  io_uring      [kernel.kallsyms]  [k] policy_node
     0.04%     0.03%  io_uring      [kernel.kallsyms]  [k] check_stack_object
     0.04%     0.01%  io_uring      [kernel.kallsyms]  [k] __blk_mq_free_request
     0.03%     0.03%  io_uring      [kernel.kallsyms]  [k] dput
     0.03%     0.01%  io_uring      [kernel.kallsyms]  [k] record_times
     0.03%     0.03%  io_uring      [kernel.kallsyms]  [k] blk_stat_add
     0.03%     0.02%  io_uring      [kernel.kallsyms]  [k] dma_pool_alloc
     0.03%     0.03%  io_uring      [kernel.kallsyms]  [k] find_next_zero_bit
     0.03%     0.03%  io_uring      [kernel.kallsyms]  [k] dma_direct_map_sg
     0.03%     0.01%  io_uring      [kernel.kallsyms]  [k] __fget_light
     0.03%     0.02%  io_uring      [kernel.kallsyms]  [k] blk_account_io_done
     0.03%     0.03%  io_uring      [kernel.kallsyms]  [k] guard_bio_eod
     0.03%     0.02%  io_uring      [kernel.kallsyms]  [k] add_interrupt_randomness
     0.03%     0.02%  io_uring      [kernel.kallsyms]  [k] memset
     0.03%     0.02%  io_uring      [kernel.kallsyms]  [k] iov_iter_bvec
     0.03%     0.00%  io_uring      [kernel.kallsyms]  [k] blk_mq_put_tag
     0.03%     0.02%  io_uring      [kernel.kallsyms]  [k] cpuset_nodemask_valid_mems_allowed
     0.03%     0.00%  io_uring      [kernel.kallsyms]  [k] sched_clock_cpu
     0.03%     0.00%  io_uring      [kernel.kallsyms]  [k] __count_memcg_events
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] mem_cgroup_get_nr_swap_pages
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] blk_start_plug
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] sbitmap_queue_clear
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] io_commit_cqring
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] should_fail_bio
     0.02%     0.01%  io_uring      [kernel.kallsyms]  [k] page_cache_next_miss
     0.02%     0.00%  io_uring      [kernel.kallsyms]  [k] task_tick_fair
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] should_fail_alloc_page
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] native_sched_clock
     0.02%     0.00%  io_uring      [kernel.kallsyms]  [k] __fdget
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] wbt_data_dir
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] sync_regs
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] wbt_done
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] __irqentry_text_end
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] __fget_files
     0.02%     0.00%  io_uring      [kernel.kallsyms]  [k] do_try_to_free_pages
     0.02%     0.00%  io_uring      [kernel.kallsyms]  [k] shrink_node
     0.02%     0.00%  io_uring      [kernel.kallsyms]  [k] try_to_free_pages
     0.02%     0.00%  io_uring      [kernel.kallsyms]  [k] shrink_lruvec
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] sg_init_table
     0.02%     0.02%  io_uring      [kernel.kallsyms]  [k] blk_mq_complete_request_remote
     0.02%     0.00%  io_uring      [kernel.kallsyms]  [k] shrink_inactive_list
     0.02%     0.01%  io_uring      [kernel.kallsyms]  [k] sg_next
     0.02%     0.00%  io_uring      [kernel.kallsyms]  [k] __rq_qos_done
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] irq_exit_rcu
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] free_unref_page_list
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] __mem_cgroup_threshold
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] nvme_unmap_data.part.0
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] bvec_alloc
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] shrink_page_list
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] tick_do_update_jiffies64
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] propagate_protected_usage
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] update_wall_time
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] timekeeping_advance
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] iov_iter_revert
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] dma_unmap_page_attrs
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] entry_SYSCALL_64_safe_stack
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] arch_do_signal_or_restart
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] __remove_mapping
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] blk_mq_tag_to_rq
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] try_to_wake_up
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] __softirqentry_text_start
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] perf_event_task_tick
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] wake_up_process
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] bio_uninit
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] note_interrupt
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] irqentry_enter
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] fput
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] update_load_avg
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] __x86_retpoline_r13
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] allocate_slab
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] refill_stock
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] syscall_enter_from_user_mode
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] tick_program_event
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] mempool_kfree
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] update_curr
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] fpregs_assert_state_consistent
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] restore_regs_and_return_to_kernel
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] bio_advance
     0.01%     0.00%  io_uring      [kernel.kallsyms]  [k] mem_cgroup_uncharge_list
     0.01%     0.00%  io_uring      io_uring           [.] 0x000055c42b500320
     0.01%     0.01%  io_uring      [kernel.kallsyms]  [k] native_write_msr
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] inode_congested
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] timekeeping_update
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] run_rebalance_domains
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] task_work_add
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] _raw_spin_trylock
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] irq_enter_rcu
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __x86_indirect_thunk_r13
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] dma_unmap_sg_attrs
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] clockevents_program_event
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __rq_qos_done_bio
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __delete_from_page_cache
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] wake_all_kswapds
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rebalance_domains
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] trigger_load_balance
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] page_cache_delete
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] blk_queue_exit
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] error_return
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] idle_cpu
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] put_cpu_partial
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] x86_pmu_enable
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] dma_direct_unmap_sg
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] intel_tfa_pmu_enable_all
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __mix_pool_bytes
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] sched_clock
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] _mix_pool_bytes
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] bvec_free
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __update_load_avg_cfs_rq
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __queue_work
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __sbq_wake_up
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __run_timers.part.0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] call_timer_fn
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] run_timer_softirq
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] hrtimer_active
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __schedule
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] blk_status_to_errno
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] arch_scale_freq_tick
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_fast_timekeeper
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] native_iret
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_vsyscall
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_cfs_group
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] delayed_work_timer_fn
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] load_balance
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] ttwu_do_activate
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] insert_work
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] dma_pool_free
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] pvclock_gtod_notify
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] account_process_tick
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __zone_watermark_ok
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] lapic_next_deadline
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __update_load_avg_se
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] account_system_index_time
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] account_system_time
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] wakeup_kswapd
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_run_task_work_sig
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] isolate_lru_pages
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] psi_flags_change
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] find_busiest_group
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] enqueue_task
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_uring_add_task_file
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] free_unref_page_commit
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] nvme_error_status
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] nvme_cleanup_cmd
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __wake_up
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] enqueue_task_fair
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] schedule
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] free_pcppages_bulk
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] _warn_unseeded_randomness
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __wake_up_common_lock
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] autoremove_wake_function
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] default_wake_function
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_blocked_averages
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] kick_ilb
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] calc_global_load_tick
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_rq_clock
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] wake_up_state
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] native_read_msr
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rcu_segcblist_ready_cbs
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] x86_pmu_disable
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rcu_sched_clock_irq
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __hrtimer_next_event_base
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_min_vruntime
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] dequeue_task
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] schedule_timeout
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] shrink_active_list
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] ktime_get_update_offsets_now
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] enqueue_entity
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] io_wq_manager
     0.00%     0.00%  iou-mgr-2493  [unknown]          [k] 0000000000000000
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] ret_from_fork
     0.00%     0.00%  io_uring      libc-2.32.so       [.] __GI___libc_write
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] unaccount_page_cache_page
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] smp_call_function_single_async
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __cgroup_account_cputime_field
     0.00%     0.00%  io_uring      io_uring           [.] 0x000055c42b500324
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] run_posix_cpu_timers
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] schedule_timeout
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __x64_sys_write
     0.00%     0.00%  io_uring      libc-2.32.so       [.] clock_nanosleep@@GLIBC_2.17
     0.00%     0.00%  iou-mgr-2496  [unknown]          [k] 0000000000000000
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] ret_from_fork
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] queue_work_on
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] io_wq_manager
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] ksys_write
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] credit_entropy_bits.constprop.0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] blk_mq_sched_restart
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] send_call_function_single_ipi
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] generic_exec_single
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] raw_notifier_call_chain
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __x64_sys_clock_nanosleep
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] profile_tick
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] bvec_split_segs
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] pick_next_task_fair
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] dequeue_task_fair
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] __schedule
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] schedule
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] vfs_write
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] common_nsleep
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] do_nanosleep
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] hrtimer_nanosleep
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] schedule
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] schedule_timeout
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] kthread_is_per_cpu
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] file_tty_write.constprop.0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] new_sync_write
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] tty_write
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __free_one_page
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] __schedule
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] workingset_eviction
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] enqueue_hrtimer
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rb_next
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] blk_rq_timed_out_timer
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __calc_delta
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] cpumask_next_and
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] vma_migratable
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] timerqueue_del
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] nohz_balance_exit_idle
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] get_partial_node.part.0
     0.00%     0.00%  iou-wrk-2496  [unknown]          [k] 0000000000000000
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] io_wqe_worker
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] ret_from_fork
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] newidle_balance
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] pick_next_task_fair
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] update_rt_rq_load_avg
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] dequeue_entity
     0.00%     0.00%  io_uring      [unknown]          [k] 0000000000000000
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] uncharge_batch
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] newidle_balance
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] pick_next_task_fair
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] timerqueue_add
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] move_pages_to_lru
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] hrtimer_run_queues
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] watchdog_timer_fn
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] nvme_free_prps
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] set_next_entity
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] task_numa_work
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] select_task_rq_fair
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rcu_core_si
     0.00%     0.00%  io_uring      io_uring           [.] 0x0000000000001320
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] reweight_entity
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] get_random_u32
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] mod_node_page_state
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] newidle_balance
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] switch_fpu_return
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] io_worker_handle_work
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] n_tty_write
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] _nohz_idle_balance
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] _nohz_idle_balance
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] pty_write
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] cpuacct_charge
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] update_rq_clock
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __handle_mm_fault
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] asm_exc_page_fault
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] do_user_addr_fault
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] exc_page_fault
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] handle_mm_fault
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] io_read
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] io_issue_sqe
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] io_wq_submit_work
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] finish_task_switch.isra.0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] tty_flip_buffer_push
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __perf_event_task_sched_in
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] dequeue_task
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] cpumask_next
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] do_shrink_slab
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] shrink_slab
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] completion_done
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] _nohz_idle_balance
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] find_busiest_group
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] load_balance
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] xas_clear_mark
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] sched_clock_tick
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] lru_add_drain
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] lru_add_drain_cpu
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] group_balance_cpu
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] kick_process
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __perf_event_task_sched_out
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] cpu_stop_queue_work
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] stop_one_cpu_nowait
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] wake_up_q
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] find_next_and_bit
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rcu_nmi_enter
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __intel_pmu_enable_all.constprop.0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] account_user_time
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] psi_task_switch
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] pick_next_entity
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rb_erase
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] irq_work_tick
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] raise_softirq
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] prep_compound_page
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rcu_core
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rcu_gp_kthread_wake
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rcu_report_qs_rnp
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] swake_up_one
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] setup_object.isra.0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] acct_account_cputime
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_rw_should_reissue
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_complete_rw
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] vma_policy_mof
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] llist_add_batch
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] pgdat_balanced
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] mempolicy_slab_node
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] zone_watermark_ok_safe
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] list_lru_del
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] irqentry_enter_from_user_mode
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] check_preempt_curr
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] ttwu_do_wakeup
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] finish_wait
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] dequeue_task_fair
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] find_busiest_group
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] load_balance
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] copy_process
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] inherit_event.constprop.0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] inherit_task_group.isra.0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] perf_event_init_task
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] do_syscall_64
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] dequeue_task
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __handle_mm_fault
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] handle_mm_fault
     0.00%     0.00%  io_uring      libc-2.32.so       [.] _int_malloc
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] kmem_cache_free
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] __io_free_req
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] io_free_work
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] schedule_timeout
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] inherit_event.constprop.0
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] copy_process
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] create_io_thread
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] create_io_worker.isra.0
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] inherit_task_group.isra.0
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] io_wq_check_workers
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] perf_event_init_task
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] dequeue_entity
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] filemap_map_pages
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c383436
     0.00%     0.00%  io_uring      [unknown]          [.] 0x534f49202c343039
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] create_io_thread
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_uring_alloc_task_context
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_wq_create
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_wq_fork_manager
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] hrtimer_start_range_ns
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c323939
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c363733
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] kmem_cache_alloc_trace
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] perf_event_alloc
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __get_user_pages
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __io_uring_register
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __x64_sys_io_uring_register
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] pin_user_pages
     0.00%     0.00%  io_uring      [unknown]          [k] 0x000000000000cc81
     0.00%     0.00%  io_uring      [unknown]          [k] 0x00007f130eedcc00
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] update_rq_clock
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] update_cfs_group
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] dequeue_task_fair
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] asm_exc_page_fault
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] perf_event_alloc
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] blkdev_read_iter
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] kiocb_done
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] __schedule
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] pick_next_task_fair
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] schedule
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __x64_sys_execve
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] bprm_execve
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] do_execveat_common
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] load_elf_binary
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] update_cfs_group
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] __update_idle_core
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __drain_all_pages
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] flush_work
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __add_to_page_cache_locked
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __ext4_get_inode_loc
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __ext4_iget
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __getblk_gfp
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __getblk_slow
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __x64_sys_openat
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] add_to_page_cache_lru
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] do_filp_open
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] do_sys_openat2
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] ext4_lookup
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] ext4_sb_breadahead_unmovable
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] pagecache_get_page
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] path_openat
     0.00%     0.00%  numactl       ld-2.32.so         [.] __GI___open64_nocancel
     0.00%     0.00%  numactl       ld-2.32.so         [.] _dl_map_object
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] native_set_pte
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] do_set_pte
     0.00%     0.00%  numactl       ld-2.32.so         [.] _dl_check_map_versions
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] page_remove_rmap
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] do_wp_page
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] wp_page_copy
     0.00%     0.00%  numactl       ld-2.32.so         [.] _dl_start
     0.00%     0.00%  numactl       ld-2.32.so         [.] _dl_start_user
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] clear_page_erms
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] __alloc_pages_nodemask
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] alloc_pages_vma
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] get_page_from_freelist
     0.00%     0.00%  numactl       libc-2.32.so       [.] _int_malloc
     0.00%     0.00%  numactl       [unknown]          [k] 0000000000000000
     0.00%     0.00%  io_uring      libc-2.32.so       [.] __clone
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] process_echoes
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c383830
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] next_uptodate_page
     0.00%     0.00%  numactl       libc-2.32.so       [.] _getopt_internal_r
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] vma_interval_tree_remove
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __split_vma
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __vma_adjust
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __x64_sys_mprotect
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] do_mprotect_pkey
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] mprotect_fixup
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] split_vma
     0.00%     0.00%  io_uring      ld-2.32.so         [.] _dl_sysdep_start
     0.00%     0.00%  io_uring      ld-2.32.so         [.] dl_main
     0.00%     0.00%  io_uring      ld-2.32.so         [.] mprotect
     0.00%     0.00%  io_uring      ld-2.32.so         [.] mmap64
     0.00%     0.00%  io_uring      ld-2.32.so         [.] _dl_new_object
     0.00%     0.00%  io_uring      [unknown]          [.] 0x00007f130ef48fb0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __wait_for_common
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] wait_for_completion
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] do_close_on_exec
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] begin_new_exec
     0.00%     0.00%  numactl       libc-2.32.so       [.] __GI___execve
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] rb_insert_color
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c303034
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c303639
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] ldsem_up_read
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c303237
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c323338
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c343232
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c343632
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c363337
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c373337
     0.00%     0.00%  io_uring      [unknown]          [k] 0x534f49202c383239
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __do_sys_clone
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __x64_sys_clone
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] kernel_clone
     0.00%     0.00%  io_uring      [unknown]          [k] 0x7830203d20727470
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] lru_cache_add_page_vma
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] __lock_text_start
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] __update_idle_core
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] idle_cpu
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] psi_task_change
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] dequeue_entity
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] pick_next_task_idle
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __x64_sys_io_uring_setup
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_uring_setup
     0.00%     0.00%  io_uring      [unknown]          [k] 0x6966206465646441
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] p4d_offset
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] do_user_addr_fault
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] exc_page_fault
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] clear_page_erms
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] ___slab_alloc
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] __alloc_pages_nodemask
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] __slab_alloc
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] alloc_pages_current
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] allocate_slab
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] get_page_from_freelist
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] kmem_cache_alloc_trace
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] __mod_timer
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] filemap_read
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] io_cqring_ev_posted
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] io_wq_worker_running
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] newidle_balance
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] page_counter_uncharge
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] __calc_delta
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] __io_complete_rw.constprop.0
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] drain_obj_stock
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] generic_file_read_iter
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] io_iter_do_read
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] io_req_complete_post
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] obj_cgroup_uncharge
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] refill_obj_stock
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] update_curr
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] memset_erms
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] allocate_fake_cpuc
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] perf_try_init_event
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] x86_pmu_event_init
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] prepare_to_wait_exclusive
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] __update_load_avg_se
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] find_next_and_bit
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] idle_cpu
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] io_wq_worker_running
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] lock_timer_base
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] psi_task_change
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] reweight_entity
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] _find_next_bit.constprop.0
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] calc_wheel_index
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] cpumask_next_and
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] pick_next_task_idle
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] update_load_avg
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] kmem_cache_alloc_bulk
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] strlen
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] do_mmap
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] elf_map
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] mmap_region
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] vm_mmap
     0.00%     0.00%  numactl       [kernel.kallsyms]  [k] vm_mmap_pgoff
     0.00%     0.00%  numactl       [unknown]          [k] 0x00007fccebb22b7b
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] perf_iterate_ctx
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] __set_task_comm
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] perf_event_comm
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] perf_iterate_sb
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __mutex_init
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] ret_from_fork
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] memcpy_erms
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] __perf_event__output_id_sample
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] perf_event_comm_output
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] perf_iterate_ctx
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] __set_task_comm
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] perf_event_comm
     0.00%     0.00%  iou-wrk-2496  [kernel.kallsyms]  [k] perf_iterate_sb
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] io_wqe_worker
     0.00%     0.00%  io_uring      libc-2.32.so       [.] __GI___gettid
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] start_flush_work.constprop.0
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] perf_iterate_ctx
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] __set_task_comm
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] io_wq_manager
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] perf_event_comm
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] perf_iterate_sb
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] __x64_sys_execve
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] begin_new_exec
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] bprm_execve
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] do_execveat_common
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] do_syscall_64
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] load_elf_binary
     0.00%     0.00%  perf          [unknown]          [k] 0x00007fccebb22b7b
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] schedule_tail
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] get_nohz_timer_target
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] perf_iterate_sb
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] __set_task_comm
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] perf_event_comm
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] schedule_tail
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] native_write_msr
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] __perf_event_task_sched_in
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] finish_task_switch.isra.0
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] intel_tfa_pmu_enable_all
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] x86_pmu_enable
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] nmi_restore
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] nmi_restore
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] recalc_sigpending
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] end_repeat_nmi
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] native_write_msr
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] __perf_event_task_sched_in
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] finish_task_switch.isra.0
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] intel_tfa_pmu_enable_all
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] x86_pmu_enable
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] native_write_msr
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] ctx_resched
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] intel_tfa_pmu_enable_all
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] perf_event_exec
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] x86_pmu_enable
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] native_flush_tlb_one_user
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] native_flush_tlb_one_user
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] native_set_fixmap
     0.00%     0.00%  iou-mgr-2496  [kernel.kallsyms]  [k] ghes_notify_nmi
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] nmi_cpu_backtrace
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] nmi_cpu_backtrace_handler
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] sched_clock
     0.00%     0.00%  io_uring      [kernel.kallsyms]  [k] native_apic_msr_write
     0.00%     0.00%  iou-mgr-2493  [kernel.kallsyms]  [k] native_apic_msr_write
     0.00%     0.00%  perf          [kernel.kallsyms]  [k] native_apic_msr_write


#
# (Tip: System-wide collection from all CPUs: perf record -a)
#
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 773K of event 'cycles'
# Event count (approx.): 456054447930
#
# Children      Self  Command  Shared Object      Symbol                                     
# ........  ........  .......  .................  ...........................................
#
   100.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kthread
   100.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ret_from_fork
   100.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kswapd
    99.90%     0.01%  kswapd0  [kernel.kallsyms]  [k] balance_pgdat
    99.86%     0.05%  kswapd0  [kernel.kallsyms]  [k] shrink_node
    98.00%     0.19%  kswapd0  [kernel.kallsyms]  [k] shrink_lruvec
    91.13%     0.10%  kswapd0  [kernel.kallsyms]  [k] shrink_inactive_list
    75.67%     5.97%  kswapd0  [kernel.kallsyms]  [k] shrink_page_list
    57.65%     2.59%  kswapd0  [kernel.kallsyms]  [k] __remove_mapping
    45.99%     0.27%  kswapd0  [kernel.kallsyms]  [k] __delete_from_page_cache
    42.88%     0.88%  kswapd0  [kernel.kallsyms]  [k] page_cache_delete
    39.05%     1.04%  kswapd0  [kernel.kallsyms]  [k] xas_store
    37.71%    37.67%  kswapd0  [kernel.kallsyms]  [k] xas_create
    12.62%    11.79%  kswapd0  [kernel.kallsyms]  [k] isolate_lru_pages
     8.52%     0.98%  kswapd0  [kernel.kallsyms]  [k] free_unref_page_list
     7.38%     0.61%  kswapd0  [kernel.kallsyms]  [k] free_unref_page_commit
     6.68%     1.78%  kswapd0  [kernel.kallsyms]  [k] free_pcppages_bulk
     6.53%     0.83%  kswapd0  [kernel.kallsyms]  [k] shrink_active_list
     6.45%     3.21%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     4.82%     4.60%  kswapd0  [kernel.kallsyms]  [k] __free_one_page
     4.58%     4.21%  kswapd0  [kernel.kallsyms]  [k] unlock_page
     3.62%     3.55%  kswapd0  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
     2.49%     0.71%  kswapd0  [kernel.kallsyms]  [k] unaccount_page_cache_page
     2.46%     0.81%  kswapd0  [kernel.kallsyms]  [k] workingset_eviction
     2.14%     0.33%  kswapd0  [kernel.kallsyms]  [k] __mod_lruvec_state
     1.97%     1.88%  kswapd0  [kernel.kallsyms]  [k] xas_clear_mark
     1.73%     0.26%  kswapd0  [kernel.kallsyms]  [k] __mod_lruvec_page_state
     1.71%     1.06%  kswapd0  [kernel.kallsyms]  [k] move_pages_to_lru
     1.66%     1.62%  kswapd0  [kernel.kallsyms]  [k] workingset_age_nonresident
     1.60%     0.85%  kswapd0  [kernel.kallsyms]  [k] __mod_memcg_lruvec_state
     1.58%     0.02%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
     1.49%     0.13%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_uncharge_list
     1.45%     0.06%  kswapd0  [kernel.kallsyms]  [k] do_shrink_slab
     1.37%     1.32%  kswapd0  [kernel.kallsyms]  [k] page_mapping
     1.06%     0.76%  kswapd0  [kernel.kallsyms]  [k] count_shadow_nodes
     0.89%     0.85%  kswapd0  [kernel.kallsyms]  [k] xas_init_marks
     0.84%     0.58%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock_irq
     0.82%     0.49%  kswapd0  [kernel.kallsyms]  [k] page_referenced
     0.81%     0.81%  kswapd0  [kernel.kallsyms]  [k] __mod_memcg_state.part.0
     0.76%     0.08%  kswapd0  [kernel.kallsyms]  [k] uncharge_batch
     0.73%     0.65%  kswapd0  [kernel.kallsyms]  [k] uncharge_page
     0.50%     0.45%  kswapd0  [kernel.kallsyms]  [k] __mod_zone_page_state
     0.50%     0.43%  kswapd0  [kernel.kallsyms]  [k] page_counter_uncharge
     0.48%     0.43%  kswapd0  [kernel.kallsyms]  [k] __isolate_lru_page_prepare
     0.47%     0.47%  kswapd0  [kernel.kallsyms]  [k] __count_memcg_events.part.0
     0.46%     0.11%  kswapd0  [kernel.kallsyms]  [k] lru_note_cost
     0.45%     0.41%  kswapd0  [kernel.kallsyms]  [k] workingset_update_node
     0.42%     0.36%  kswapd0  [kernel.kallsyms]  [k] __mod_node_page_state
     0.37%     0.29%  kswapd0  [kernel.kallsyms]  [k] __lock_text_start
     0.37%     0.06%  kswapd0  [kernel.kallsyms]  [k] wake_up_page_bit
     0.33%     0.12%  kswapd0  [kernel.kallsyms]  [k] cpumask_next
     0.32%     0.22%  kswapd0  [kernel.kallsyms]  [k] __cond_resched
     0.32%     0.28%  kswapd0  [kernel.kallsyms]  [k] free_pcp_prepare
     0.31%     0.01%  kswapd0  [kernel.kallsyms]  [k] __count_memcg_events
     0.30%     0.00%  kswapd0  [kernel.kallsyms]  [k] rmap_walk
     0.29%     0.25%  kswapd0  [kernel.kallsyms]  [k] PageHuge
     0.25%     0.00%  kswapd0  [kernel.kallsyms]  [k] rmap_walk_file
     0.23%     0.06%  kswapd0  [kernel.kallsyms]  [k] super_cache_count
     0.23%     0.17%  kswapd0  [kernel.kallsyms]  [k] rcu_read_unlock_strict
     0.21%     0.01%  kswapd0  [kernel.kallsyms]  [k] page_referenced_one
     0.21%     0.16%  kswapd0  [kernel.kallsyms]  [k] page_mapped
     0.20%     0.20%  kswapd0  [kernel.kallsyms]  [k] list_lru_count_one
     0.20%     0.20%  kswapd0  [kernel.kallsyms]  [k] page_vma_mapped_walk
     0.17%     0.12%  kswapd0  [kernel.kallsyms]  [k] rcu_all_qs
     0.16%     0.16%  kswapd0  [kernel.kallsyms]  [k] _find_next_bit.constprop.0
     0.13%     0.13%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_iter
     0.13%     0.09%  kswapd0  [kernel.kallsyms]  [k] __x86_retpoline_rax
     0.13%     0.13%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_update_lru_size
     0.13%     0.08%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock
     0.12%     0.08%  kswapd0  [kernel.kallsyms]  [k] find_next_bit
     0.12%     0.11%  kswapd0  [kernel.kallsyms]  [k] total_mapcount
     0.09%     0.05%  kswapd0  [kernel.kallsyms]  [k] __x86_indirect_thunk_rax
     0.09%     0.00%  kswapd0  [kernel.kallsyms]  [k] asm_sysvec_apic_timer_interrupt
     0.09%     0.01%  kswapd0  [kernel.kallsyms]  [k] __schedule
     0.08%     0.00%  kswapd0  [kernel.kallsyms]  [k] sysvec_apic_timer_interrupt
     0.08%     0.06%  kswapd0  [kernel.kallsyms]  [k] propagate_protected_usage
     0.07%     0.00%  kswapd0  [kernel.kallsyms]  [k] __sysvec_apic_timer_interrupt
     0.07%     0.00%  kswapd0  [kernel.kallsyms]  [k] hrtimer_interrupt
     0.06%     0.00%  kswapd0  [kernel.kallsyms]  [k] schedule
     0.06%     0.01%  kswapd0  [kernel.kallsyms]  [k] __wake_up_locked_key_bookmark
     0.06%     0.00%  kswapd0  [kernel.kallsyms]  [k] schedule_timeout
     0.06%     0.03%  kswapd0  [kernel.kallsyms]  [k] memcg_check_events
     0.05%     0.00%  kswapd0  [kernel.kallsyms]  [k] __hrtimer_run_queues
     0.05%     0.00%  kswapd0  [kernel.kallsyms]  [k] pick_next_task_fair
     0.05%     0.01%  kswapd0  [kernel.kallsyms]  [k] lru_add_drain
     0.05%     0.00%  kswapd0  [kernel.kallsyms]  [k] tick_sched_timer
     0.05%     0.05%  kswapd0  [kernel.kallsyms]  [k] lru_add_drain_cpu
     0.05%     0.00%  kswapd0  [kernel.kallsyms]  [k] newidle_balance
     0.04%     0.00%  kswapd0  [kernel.kallsyms]  [k] tick_sched_handle
     0.04%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_process_times
     0.04%     0.04%  kswapd0  [kernel.kallsyms]  [k] find_first_bit
     0.04%     0.04%  kswapd0  [kernel.kallsyms]  [k] __wake_up_common
     0.04%     0.02%  kswapd0  [kernel.kallsyms]  [k] vma_interval_tree_iter_next
     0.04%     0.00%  kswapd0  [kernel.kallsyms]  [k] load_balance
     0.04%     0.00%  kswapd0  [kernel.kallsyms]  [k] find_busiest_group
     0.04%     0.03%  kswapd0  [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0
     0.03%     0.00%  kswapd0  [kernel.kallsyms]  [k] scheduler_tick
     0.03%     0.03%  kswapd0  [kernel.kallsyms]  [k] __mem_cgroup_threshold
     0.03%     0.00%  kswapd0  [kernel.kallsyms]  [k] pageout
     0.03%     0.03%  kswapd0  [kernel.kallsyms]  [k] deferred_split_count
     0.03%     0.00%  kswapd0  [kernel.kallsyms]  [k] vmpressure
     0.03%     0.03%  kswapd0  [kernel.kallsyms]  [k] vma_interval_tree_subtree_search
     0.03%     0.00%  kswapd0  [kernel.kallsyms]  [k] swap_writepage
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] idr_find
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] queue_work_on
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] __swap_writepage
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] try_to_unmap
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] set_pgdat_percpu_threshold
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] __queue_work
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] finish_task_switch.isra.0
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] radix_tree_lookup
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] rmap_walk_anon
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] __perf_event_task_sched_in
     0.02%     0.02%  kswapd0  [kernel.kallsyms]  [k] __radix_tree_lookup
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] insert_work
     0.02%     0.02%  kswapd0  [kernel.kallsyms]  [k] kfree_rcu_shrink_count
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] wake_up_process
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] try_to_wake_up
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] perf_event_sched_in
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] ctx_sched_in
     0.02%     0.01%  kswapd0  [kernel.kallsyms]  [k] try_to_unmap_one
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_finish_plug
     0.02%     0.02%  kswapd0  [kernel.kallsyms]  [k] io_async_buf_func
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] ttwu_do_activate
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_flush_plug_list
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] task_tick_fair
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] shmem_unused_huge_count
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] super_cache_scan
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] submit_bio
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] up_read
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] submit_bio_noacct
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] enqueue_task
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] visit_groups_merge.constprop.0.isra.0
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] down_read_trylock
     0.01%     0.01%  kswapd0  [unknown]          [.] 0000000000000000
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] prune_dcache_sb
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] merge_sched_in
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_mq_sched_insert_requests
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_mq_flush_plug_list
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] shrink_dentry_list
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] vma_interval_tree_iter_first
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_mq_try_issue_list_directly
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_mq_submit_bio
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_load_avg
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] __blk_mq_try_issue_directly
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] enqueue_task_fair
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvme_queue_rq
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] perf_event_task_tick
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] add_to_swap
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] perf_swevent_add
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] try_to_unmap_flush
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] __dentry_kill
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] irq_exit_rcu
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] native_write_msr
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_calculate_protection
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] psi_task_change
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] anon_vma_interval_tree_iter_first
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] __softirqentry_text_start
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] enqueue_entity
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] ktime_get_update_offsets_now
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] intel_tfa_pmu_enable_all
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] x86_pmu_enable
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] css_next_descendant_pre
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_rq_clock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_blocked_averages
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] idle_cpu
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_curr
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] _nohz_idle_balance
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __update_load_avg_se
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] trigger_load_balance
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] perf_event_groups_first
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] add_to_swap_cache
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] tick_program_event
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] inactive_is_low
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] backend_shrink_memory_count
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] iput
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dentry_unlink_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] psi_group_change
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] iput.part.0
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] evict
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] cpumask_next_and
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_swapout
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bus_for_each_dev
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ttm_pool_shrinker_count
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ktime_get
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] run_rebalance_domains
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ttwu_do_wakeup
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] page_lock_anon_vma_read
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] clockevents_program_event
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] check_preempt_curr
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] psi_memstall_leave
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] event_sched_in
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __delete_from_swap_cache
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] cpumask_any_but
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mempool_alloc
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] read_tsc
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] try_to_release_page
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] perf_event_update_userpage
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_trylock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] prepare_kswapd_sleep
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __blk_mq_alloc_request
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __update_load_avg_cfs_rq
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bdev_try_to_free_page
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] native_irq_return_iret
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] offset_to_swap_extent
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] page_remove_rmap
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blkdev_releasepage
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] put_prev_entity
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] arch_scale_freq_tick
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __perf_event_task_sched_out
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __frontswap_store
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] prune_icache_sb
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ptep_clear_flush
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rb_next
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] account_process_tick
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] swap_page_sector
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] jbd2_journal_try_to_free_buffers
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mmu_shrink_count
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_mq_get_tag
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] cgroup_e_css
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] down_read
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] sched_clock_cpu
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] submit_bio_checks
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] hrtimer_active
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] set_page_dirty
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] shrink_lock_dentry.part.0
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_core_si
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kmem_cache_free
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_alloc_bioset
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] check_preempt_wakeup
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_start_plug
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kick_ilb
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] pgdat_balanced
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_associate_blkg
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] account_system_time
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] x86_pmu_disable
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ___d_drop
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] native_read_msr
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_core
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] native_apic_msr_eoi_write
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] swap_duplicate
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] buffer_check_dirty_writeback
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] swap_cgroup_record
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dequeue_task
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rebalance_domains
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __mod_timer
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] psi_memstall_tick
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] native_sched_clock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] shrink_huge_zero_page_count
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_sched_clock_irq
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] account_system_index_time
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mempool_kmalloc
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] check_pte
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] cpuacct_charge
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvme_setup_rw
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] zone_watermark_ok_safe
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_attempt_plug_merge
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] lapic_next_deadline
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mempool_alloc_slab
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] PageHeadHuge
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_associate_blkg_from_css
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] klist_next
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] memcg_to_vmpressure
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dqcache_shrink_count
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] psi_task_switch
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __rq_qos_throttle
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_get_nr_swap_pages
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] xas_create_range
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __blk_rq_map_sg
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __swap_duplicate
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kthread_is_per_cpu
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] wbt_wait
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] flush_tlb_mm_range
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] smp_call_function_single_async
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_dl_rq_load_avg
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] get_swap_page
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kmem_cache_alloc
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_soft_limit_reclaim
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nohz_balance_exit_idle
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] try_to_free_buffers
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] run_posix_cpu_timers
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] zswap_frontswap_store
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ptep_clear_flush_young
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_throtl_bio
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __blk_mq_sched_bio_merge
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] call_rcu
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] put_swap_page
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] vmpressure_calc_level
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] _swap_info_get
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __sbitmap_queue_get
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __list_lru_walk_one
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_attempt_back_merge
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_attempt_bio_merge.part.0
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] set_task_reclaim_state
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] find_next_and_bit
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] perf_pmu_nop_txn
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] shmem_writepage
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __run_timers.part.0
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] run_timer_softirq
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] record_times
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] swap_set_page_dirty
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] psi_memstall_enter
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] set_next_entity
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] list_lru_walk_one
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dma_pool_alloc
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] call_timer_fn
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] try_to_free_swap
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] set_next_buddy
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __kmalloc
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] pick_next_entity
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mb_cache_count
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __slab_free
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_mq_start_request
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __anon_vma_interval_tree_subtree_search
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] prepare_to_wait
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_cgroup_bio_start
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kthread_should_stop
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] reset_isolation_suitable
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] generic_exec_single
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] destroy_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] enqueue_timer
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __test_set_page_writeback
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] memset_erms
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dentry_free
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bdev_write_page
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dequeue_task_fair
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] psi_flags_change
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __es_shrink
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ext4_es_scan
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __blk_mq_get_tag
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ext4_es_count
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __d_free
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_segcblist_ready_cbs
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] send_call_function_single_ipi
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] drop_buffers
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] truncate_inode_pages_final
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] inc_node_page_state
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_add_page
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] sum_zone_node_page_state
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] free_buffer_head
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] wbt_track
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ll_back_merge_fn
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] delayed_work_timer_fn
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] i_callback
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] enqueue_hrtimer
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __set_page_dirty_no_writeback
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] sbitmap_get
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] page_swap_info
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rb_erase
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] clear_page_dirty_for_io
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] proc_evict_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] del_timer_sync
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] pagevec_lru_move_fn
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvme_submit_cmd
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __update_idle_core
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] page_not_mapped
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] lock_timer_base
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kernfs_evict_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dma_map_page_attrs
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __blk_queue_split
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] list_lru_add
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dentry_lru_isolate
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __sbitmap_get_word
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_min_vruntime
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rb_insert_color
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_mq_rq_ctx_init
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] get_swap_pages
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] fsnotify_grab_connector
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __destroy_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __fsnotify_inode_delete
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] fsnotify_destroy_marks
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rq_qos_wait
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] clear_buddies
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] timerqueue_del
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_try_charge_swap
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] klist_iter_exit
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __hrtimer_next_event_base
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_segcblist_enqueue
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_qs
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_note_context_switch
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] pick_next_task_idle
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] put_prev_task_fair
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] calculate_pressure_threshold
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dequeue_entity
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_queue_bounce
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] resched_curr
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] scan_swap_map_slots
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] llist_add_batch
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] error_entry
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] intel_pmu_disable_all
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] unlock_page_memcg
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] irq_work_tick
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] inode_wait_for_writeback
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] try_to_unmap_flush_dirty
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __common_interrupt
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __handle_irq_event_percpu
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] asm_common_interrupt
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] common_interrupt
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] handle_edge_irq
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] handle_irq_event
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvme_irq
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvme_pci_complete_rq
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvme_process_cq
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] cpuacct_account_field
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] es_reclaim_extents
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_stat_timer_fn
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] sched_clock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_cblist_dequeue
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __rq_qos_track
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] scan_shadow_nodes
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_get_last_bvec
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dma_map_sg_attrs
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] irq_enter_rcu
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_mq_get_driver_tag
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] release_pages
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __remove_inode_hash
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] pde_put
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_rt_rq_load_avg
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bdev_read_only
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] get_swap_device
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] jbd2_journal_grab_journal_head
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __intel_pmu_enable_all.constprop.0
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] _atomic_dec_and_lock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_from_obj
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] hrtimer_forward
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __inode_wait_for_writeback
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] wbt_issue
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] calc_global_load_tick
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_account_io_merge_bio
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kernfs_put
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mutex_unlock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] percpu_counter_add_batch
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kthread_blkcg
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ext4_drop_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] proc_free_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] refill_obj_stock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] obj_cgroup_uncharge
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __srcu_read_lock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ext4_evict_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] calc_wheel_index
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] generic_delete_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] finish_wait
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __bio_add_page
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] wake_up_bit
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] irqentry_exit
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __srcu_read_unlock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] lock_page_memcg
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] swap_shmem_alloc
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] hrtick_update
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __bio_try_merge_page
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] acct_account_cputime
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __es_tree_search.isra.0
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] es_do_reclaim_extents
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ext4_es_free_extent
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_queue_enter
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ___slab_alloc
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __slab_alloc
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] memcg_slab_post_alloc_hook
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_accelerate_cbs_unlocked
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_gp_kthread_wake
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] swake_up_one
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] irqentry_enter
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] timerqueue_add
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] housekeeping_cpumask
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __irqentry_text_start
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __x86_indirect_thunk_r12
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __x86_retpoline_r12
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] xas_start
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] scan_swap_map_try_ssd_cluster
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __rq_qos_issue
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] slab_pre_alloc_hook.constprop.0
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] asm_sysvec_reschedule_ipi
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __swap_entry_free_locked
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvme_cleanup_cmd
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] error_return
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kmalloc_slab
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] note_gp_changes
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] jiffies_to_usecs
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ext4_releasepage
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_id_get_online
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_cfs_group
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bit_waitqueue
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_endio
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_free
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_put
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_mq_end_request
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_update_request
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] end_swap_bio_write
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mempool_free
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mempool_free_slab
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvme_complete_rq
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvme_unmap_data.part.0
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] list_lru_del
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __dput_to_list
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] d_lru_del
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dma_direct_map_sg
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] list_lru_walk_one_irq
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] shadow_lru_isolate
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] xa_delete_node
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] xas_free_nodes
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] wbt_inflight_cb
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] _raw_write_trylock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] arch_perf_update_userpage
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] calc_timer_values
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvme_setup_cmd
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] clear_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] perf_pmu_nop_int
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] node_page_state
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_group_capacity
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_add_timer
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_accelerate_cbs
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] native_flush_tlb_one_user
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] acpi_os_read_memory
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ghes_notify_nmi
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] set_pte_vaddr_p4d
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nmi_cpu_backtrace
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] end_repeat_nmi
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] intel_pmu_handle_irq
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] intel_bts_enable_local


#
# (Tip: To browse sample contexts use perf report --sample 10 and select in context menu)
#
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 723K of event 'cycles'
# Event count (approx.): 440094111395
#
# Children      Self  Command  Shared Object      Symbol                                      
# ........  ........  .......  .................  ............................................
#
   100.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kthread
   100.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ret_from_fork
   100.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kswapd
    99.92%     0.11%  kswapd0  [kernel.kallsyms]  [k] balance_pgdat
    99.32%     0.03%  kswapd0  [kernel.kallsyms]  [k] shrink_node
    97.25%     0.32%  kswapd0  [kernel.kallsyms]  [k] shrink_lruvec
    96.80%     0.09%  kswapd0  [kernel.kallsyms]  [k] evict_lru_gen_pages
    77.82%     6.28%  kswapd0  [kernel.kallsyms]  [k] shrink_page_list
    61.61%     2.76%  kswapd0  [kernel.kallsyms]  [k] __remove_mapping
    50.28%     0.33%  kswapd0  [kernel.kallsyms]  [k] __delete_from_page_cache
    46.63%     1.08%  kswapd0  [kernel.kallsyms]  [k] page_cache_delete
    42.20%     1.16%  kswapd0  [kernel.kallsyms]  [k] xas_store
    40.71%    40.67%  kswapd0  [kernel.kallsyms]  [k] xas_create
    12.54%     7.76%  kswapd0  [kernel.kallsyms]  [k] isolate_lru_gen_pages
     6.42%     3.19%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     6.15%     0.91%  kswapd0  [kernel.kallsyms]  [k] free_unref_page_list
     5.62%     5.45%  kswapd0  [kernel.kallsyms]  [k] unlock_page
     5.05%     0.59%  kswapd0  [kernel.kallsyms]  [k] free_unref_page_commit
     4.35%     2.04%  kswapd0  [kernel.kallsyms]  [k] lru_gen_update_size
     4.31%     1.41%  kswapd0  [kernel.kallsyms]  [k] free_pcppages_bulk
     3.43%     3.36%  kswapd0  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
     3.38%     0.59%  kswapd0  [kernel.kallsyms]  [k] __mod_lruvec_state
     2.97%     0.78%  kswapd0  [kernel.kallsyms]  [k] unaccount_page_cache_page
     2.82%     2.52%  kswapd0  [kernel.kallsyms]  [k] __free_one_page
     2.33%     1.18%  kswapd0  [kernel.kallsyms]  [k] __mod_memcg_lruvec_state
     2.28%     2.17%  kswapd0  [kernel.kallsyms]  [k] xas_clear_mark
     2.13%     0.30%  kswapd0  [kernel.kallsyms]  [k] __mod_lruvec_page_state
     1.88%     0.04%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
     1.82%     1.78%  kswapd0  [kernel.kallsyms]  [k] workingset_eviction
     1.74%     0.06%  kswapd0  [kernel.kallsyms]  [k] do_shrink_slab
     1.70%     0.15%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_uncharge_list
     1.39%     1.01%  kswapd0  [kernel.kallsyms]  [k] count_shadow_nodes
     1.22%     1.18%  kswapd0  [kernel.kallsyms]  [k] __mod_memcg_state.part.0
     1.16%     1.11%  kswapd0  [kernel.kallsyms]  [k] page_mapping
     1.02%     0.98%  kswapd0  [kernel.kallsyms]  [k] xas_init_marks
     0.93%     0.08%  kswapd0  [kernel.kallsyms]  [k] uncharge_batch
     0.84%     0.79%  kswapd0  [kernel.kallsyms]  [k] __mod_node_page_state
     0.74%     0.67%  kswapd0  [kernel.kallsyms]  [k] __mod_zone_page_state
     0.71%     0.61%  kswapd0  [kernel.kallsyms]  [k] uncharge_page
     0.64%     0.56%  kswapd0  [kernel.kallsyms]  [k] page_counter_uncharge
     0.63%     0.45%  kswapd0  [kernel.kallsyms]  [k] page_referenced
     0.48%     0.43%  kswapd0  [kernel.kallsyms]  [k] workingset_update_node
     0.41%     0.31%  kswapd0  [kernel.kallsyms]  [k] __lock_text_start
     0.39%     0.12%  kswapd0  [kernel.kallsyms]  [k] cpumask_next
     0.37%     0.33%  kswapd0  [kernel.kallsyms]  [k] free_pcp_prepare
     0.35%     0.35%  kswapd0  [kernel.kallsyms]  [k] __count_memcg_events.part.0
     0.34%     0.30%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_update_lru_size
     0.33%     0.13%  kswapd0  [kernel.kallsyms]  [k] try_walk_mm_list
     0.32%     0.28%  kswapd0  [kernel.kallsyms]  [k] PageHuge
     0.27%     0.20%  kswapd0  [kernel.kallsyms]  [k] __cond_resched
     0.25%     0.16%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock_irq
     0.25%     0.20%  kswapd0  [kernel.kallsyms]  [k] page_mapped
     0.22%     0.00%  kswapd0  [kernel.kallsyms]  [k] rmap_walk
     0.21%     0.13%  kswapd0  [kernel.kallsyms]  [k] rcu_read_unlock_strict
     0.20%     0.20%  kswapd0  [kernel.kallsyms]  [k] _find_next_bit.constprop.0
     0.20%     0.20%  kswapd0  [kernel.kallsyms]  [k] get_tier_to_isolate
     0.19%     0.18%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_iter
     0.19%     0.05%  kswapd0  [kernel.kallsyms]  [k] super_cache_count
     0.17%     0.01%  kswapd0  [kernel.kallsyms]  [k] __count_memcg_events
     0.17%     0.02%  kswapd0  [kernel.kallsyms]  [k] wake_up_page_bit
     0.17%     0.16%  kswapd0  [kernel.kallsyms]  [k] list_lru_count_one
     0.16%     0.10%  kswapd0  [kernel.kallsyms]  [k] find_next_bit
     0.15%     0.10%  kswapd0  [kernel.kallsyms]  [k] rcu_all_qs
     0.14%     0.14%  kswapd0  [kernel.kallsyms]  [k] get_swappiness
     0.14%     0.01%  kswapd0  [kernel.kallsyms]  [k] rmap_walk_file
     0.14%     0.09%  kswapd0  [kernel.kallsyms]  [k] __x86_retpoline_rax
     0.13%     0.11%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock
     0.09%     0.01%  kswapd0  [kernel.kallsyms]  [k] page_referenced_one
     0.09%     0.04%  kswapd0  [kernel.kallsyms]  [k] __x86_indirect_thunk_rax
     0.09%     0.00%  kswapd0  [kernel.kallsyms]  [k] try_to_unmap
     0.09%     0.00%  kswapd0  [kernel.kallsyms]  [k] asm_sysvec_apic_timer_interrupt
     0.09%     0.08%  kswapd0  [kernel.kallsyms]  [k] page_vma_mapped_walk
     0.09%     0.08%  kswapd0  [kernel.kallsyms]  [k] propagate_protected_usage
     0.08%     0.00%  kswapd0  [kernel.kallsyms]  [k] sysvec_apic_timer_interrupt
     0.08%     0.00%  kswapd0  [kernel.kallsyms]  [k] __schedule
     0.08%     0.00%  kswapd0  [kernel.kallsyms]  [k] walk_mm_list
     0.08%     0.00%  kswapd0  [kernel.kallsyms]  [k] walk_page_range
     0.08%     0.00%  kswapd0  [kernel.kallsyms]  [k] __walk_page_range
     0.07%     0.05%  kswapd0  [kernel.kallsyms]  [k] walk_pud_range
     0.07%     0.02%  kswapd0  [kernel.kallsyms]  [k] try_to_unmap_one
     0.07%     0.00%  kswapd0  [kernel.kallsyms]  [k] pageout
     0.07%     0.00%  kswapd0  [kernel.kallsyms]  [k] __sysvec_apic_timer_interrupt
     0.07%     0.00%  kswapd0  [kernel.kallsyms]  [k] hrtimer_interrupt
     0.07%     0.00%  kswapd0  [kernel.kallsyms]  [k] schedule_timeout
     0.07%     0.00%  kswapd0  [kernel.kallsyms]  [k] swap_writepage
     0.06%     0.03%  kswapd0  [kernel.kallsyms]  [k] memcg_check_events
     0.06%     0.00%  kswapd0  [kernel.kallsyms]  [k] schedule
     0.06%     0.00%  kswapd0  [kernel.kallsyms]  [k] __swap_writepage
     0.05%     0.00%  kswapd0  [kernel.kallsyms]  [k] pick_next_task_fair
     0.05%     0.00%  kswapd0  [kernel.kallsyms]  [k] __hrtimer_run_queues
     0.05%     0.00%  kswapd0  [kernel.kallsyms]  [k] rmap_walk_anon
     0.05%     0.05%  kswapd0  [kernel.kallsyms]  [k] try_inc_min_seq
     0.05%     0.00%  kswapd0  [kernel.kallsyms]  [k] newidle_balance
     0.05%     0.00%  kswapd0  [kernel.kallsyms]  [k] tick_sched_timer
     0.05%     0.01%  kswapd0  [kernel.kallsyms]  [k] lru_add_drain
     0.04%     0.00%  kswapd0  [kernel.kallsyms]  [k] load_balance
     0.04%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_process_times
     0.04%     0.00%  kswapd0  [kernel.kallsyms]  [k] tick_sched_handle
     0.04%     0.00%  kswapd0  [kernel.kallsyms]  [k] find_busiest_group
     0.04%     0.04%  kswapd0  [kernel.kallsyms]  [k] lru_add_drain_cpu
     0.04%     0.03%  kswapd0  [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0
     0.04%     0.03%  kswapd0  [kernel.kallsyms]  [k] __mem_cgroup_threshold
     0.03%     0.00%  kswapd0  [kernel.kallsyms]  [k] scheduler_tick
     0.03%     0.00%  kswapd0  [kernel.kallsyms]  [k] vmpressure
     0.03%     0.03%  kswapd0  [kernel.kallsyms]  [k] find_first_bit
     0.03%     0.03%  kswapd0  [kernel.kallsyms]  [k] deferred_split_count
     0.03%     0.00%  kswapd0  [kernel.kallsyms]  [k] submit_bio
     0.03%     0.00%  kswapd0  [kernel.kallsyms]  [k] submit_bio_noacct
     0.03%     0.00%  kswapd0  [kernel.kallsyms]  [k] queue_work_on
     0.03%     0.00%  kswapd0  [kernel.kallsyms]  [k] __queue_work
     0.03%     0.00%  kswapd0  [kernel.kallsyms]  [k] __wake_up_locked_key_bookmark
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] insert_work
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_finish_plug
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] wake_up_process
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] try_to_wake_up
     0.02%     0.02%  kswapd0  [kernel.kallsyms]  [k] kfree_rcu_shrink_count
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_flush_plug_list
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] idr_find
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] __delete_from_swap_cache
     0.02%     0.02%  kswapd0  [kernel.kallsyms]  [k] move_pages_to_lru
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] add_to_swap
     0.02%     0.01%  kswapd0  [kernel.kallsyms]  [k] get_next_interval
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] radix_tree_lookup
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_mq_submit_bio
     0.02%     0.02%  kswapd0  [kernel.kallsyms]  [k] __radix_tree_lookup
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] finish_task_switch.isra.0
     0.02%     0.02%  kswapd0  [kernel.kallsyms]  [k] __wake_up_common
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] __perf_event_task_sched_in
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] super_cache_scan
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] ttwu_do_activate
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_mq_flush_plug_list
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_swapout
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_mq_sched_insert_requests
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_mq_try_issue_list_directly
     0.02%     0.00%  kswapd0  [kernel.kallsyms]  [k] __blk_mq_try_issue_directly
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_calculate_protection
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvme_queue_rq
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] ctx_sched_in
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] perf_event_sched_in
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] up_read
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] vma_interval_tree_iter_next
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] prune_dcache_sb
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] put_swap_page
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] irq_exit_rcu
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_get_nr_swap_pages
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] shrink_dentry_list
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] task_tick_fair
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] __softirqentry_text_start
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] add_to_swap_cache
     0.01%     0.01%  kswapd0  [unknown]          [.] 0000000000000000
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] enqueue_task
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] down_read_trylock
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] total_mapcount
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] should_skip_vma
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] vma_interval_tree_subtree_search
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] shmem_unused_huge_count
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] __dentry_kill
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] try_to_unmap_flush
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] mempool_alloc
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] free_swap_slot
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] update_blocked_averages
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] swap_cgroup_record
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] page_remove_rmap
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] dentry_unlink_inode
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_alloc_bioset
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] enqueue_task_fair
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] css_next_descendant_pre
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] swapcache_free_entries
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] io_async_buf_func
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] iput
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_attempt_plug_merge
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] iput.part.0
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_load_avg
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] ptep_clear_flush
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] trigger_load_balance
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] perf_event_task_tick
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] submit_bio_checks
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] lru_gen_scan_around
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] evict
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] anon_vma_interval_tree_iter_first
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] idle_cpu
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] psi_task_change
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] page_lock_anon_vma_read
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] visit_groups_merge.constprop.0.isra.0
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] perf_event_groups_first
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] swap_duplicate
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_core_si
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] native_write_msr
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] kmem_cache_free
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] merge_sched_in
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] tick_program_event
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_attempt_bio_merge.part.0
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_core
     0.01%     0.01%  kswapd0  [kernel.kallsyms]  [k] ktime_get_update_offsets_now
     0.01%     0.00%  kswapd0  [kernel.kallsyms]  [k] mempool_alloc_slab
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] get_swap_page
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] _swap_info_get
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_attempt_back_merge
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] psi_group_change
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] flush_tlb_mm_range
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_curr
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] enqueue_entity
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] perf_swevent_add
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ktime_get
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __common_interrupt
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] asm_common_interrupt
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] common_interrupt
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ptep_test_and_clear_young
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] arch_scale_freq_tick
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] read_tsc
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] intel_tfa_pmu_enable_all
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] x86_pmu_enable
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] handle_edge_irq
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] handle_irq_event
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] set_page_dirty
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] page_update_gen
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] vma_interval_tree_iter_first
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __test_set_page_writeback
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] swap_range_free
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __handle_irq_event_percpu
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] run_rebalance_domains
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] clockevents_program_event
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ttwu_do_wakeup
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] try_to_release_page
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] cpumask_any_but
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_associate_blkg
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] lru_gen_addition
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __blk_rq_map_sg
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __frontswap_store
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __update_load_avg_cfs_rq
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] xas_create_range
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] jbd2_journal_try_to_free_buffers
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kmem_cache_alloc
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] swap_page_sector
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] check_preempt_curr
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] backend_shrink_memory_count
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvme_pci_complete_rq
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvme_irq
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvme_process_cq
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ll_back_merge_fn
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_associate_blkg_from_css
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bus_for_each_dev
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __blk_mq_alloc_request
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] offset_to_swap_extent
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] clear_shadow_from_swap_cache
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_uncharge_swap
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] prepare_kswapd_sleep
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_add_page
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] page_not_mapped
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] cpumask_next_and
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mempool_kmalloc
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] memset_erms
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] pgdat_balanced
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __rq_qos_throttle
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_rq_clock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_mq_end_request
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] get_swap_pages
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvme_complete_rq
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] native_irq_return_iret
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_trylock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_batch_size
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __list_lru_walk_one
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mmu_shrink_count
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __update_load_avg_se
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dequeue_task
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] shmem_writepage
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] prune_icache_sb
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] event_sched_in
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_throtl_bio
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] account_process_tick
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] perf_event_update_userpage
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_mq_get_tag
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] scan_swap_map_slots
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_update_request
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __swap_duplicate
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __perf_event_task_sched_out
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] try_to_free_swap
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dqcache_shrink_count
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] hrtimer_active
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] wbt_wait
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __mod_timer
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ext4_releasepage
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] account_system_time
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rebalance_domains
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __run_timers.part.0
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] run_timer_softirq
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] account_system_index_time
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] try_to_free_buffers
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nohz_balance_exit_idle
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] native_read_msr
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kick_ilb
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] sched_clock_cpu
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] swap_set_page_dirty
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __blk_mq_sched_bio_merge
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] list_lru_walk_one
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] cpuacct_charge
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] cgroup_e_css
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rb_next
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] native_apic_msr_eoi_write
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] check_preempt_wakeup
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] i_callback
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __anon_vma_interval_tree_subtree_search
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] free_buffer_head
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] zswap_frontswap_store
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_endio
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] end_swap_bio_write
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ttm_pool_shrinker_count
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] shrink_lock_dentry.part.0
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] down_read
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_soft_limit_reclaim
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] put_prev_entity
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blkdev_releasepage
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] list_lru_add
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] clear_page_dirty_for_io
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] check_pte
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] smp_call_function_single_async
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] klist_next
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] call_timer_fn
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __kmalloc
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] pagevec_lru_move_fn
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] native_sched_clock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_start_plug
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bdev_try_to_free_page
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] PageHeadHuge
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] shrink_huge_zero_page_count
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_rt_rq_load_avg
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_sched_clock_irq
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] psi_memstall_tick
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kthread_blkcg
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_mq_start_request
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] truncate_inode_pages_final
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] xas_find
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __blk_mq_get_tag
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] destroy_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] psi_memstall_enter
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dequeue_task_fair
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] set_task_reclaim_state
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] memcg_to_vmpressure
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __slab_free
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] end_page_writeback
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] pagevec_move_tail_fn
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ___slab_alloc
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __d_free
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] try_to_unmap_flush_dirty
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] fsnotify_destroy_marks
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __destroy_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] wakeup_flusher_threads
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] record_times
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] pick_next_entity
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] psi_task_switch
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] call_rcu
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] zone_watermark_ok_safe
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __slab_alloc
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_stat_timer_fn
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] pick_next_task_idle
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __sbitmap_queue_get
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] _nohz_idle_balance
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] lapic_next_deadline
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] slab_free_freelist_hook
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvme_submit_cmd
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] page_swapcount
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bdev_write_page
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] percpu_counter_add_batch
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kernfs_evict_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] prepare_to_wait
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] sbitmap_get
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] run_posix_cpu_timers
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] refill_obj_stock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_cblist_dequeue
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_dl_rq_load_avg
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ___d_drop
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] fsnotify_grab_connector
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __fsnotify_inode_delete
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_account_io_merge_bio
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ext4_es_scan
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] jbd2_journal_grab_journal_head
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] xas_load
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] error_entry
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_get_last_bvec
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_charge_statistics.constprop.0
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rotate_reclaimable_page
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] obj_cgroup_uncharge
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] lru_gen_addition
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mempool_free
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __es_shrink
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dma_pool_alloc
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mutex_lock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] proc_free_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] wakeup_kcompactd
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_queue_enter
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rq_qos_wait
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dentry_free
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] lock_timer_base
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] klist_iter_exit
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_segcblist_enqueue
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mb_cache_count
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] send_call_function_single_ipi
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] generic_exec_single
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] page_swap_info
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_rq_merge_ok
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __blk_queue_split
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] get_swap_device
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __update_idle_core
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvme_setup_rw
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] enqueue_timer
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __set_page_dirty_no_writeback
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] psi_flags_change
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ptep_clear_flush_young
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] wbt_inflight_cb
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] psi_memstall_leave
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_mq_rq_ctx_init
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] shadow_lru_isolate
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] list_lru_walk_one_irq
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] scan_shadow_nodes
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_from_obj
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] set_next_buddy
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dequeue_entity
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mutex_unlock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_segcblist_ready_cbs
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __bio_add_page
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] xas_start
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] lock_page_memcg
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] clear_buddies
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] unlock_page_memcg
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] inc_node_page_state
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kmalloc_slab
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_queue_bounce
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] cpuacct_account_field
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __remove_inode_hash
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] reset_isolation_suitable
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] set_next_entity
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] x86_pmu_disable
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kthread_is_per_cpu
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] memcg_slab_post_alloc_hook
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __frontswap_invalidate_page
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] default_wake_function
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __wake_up
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __wake_up_common_lock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] autoremove_wake_function
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dma_map_sg_attrs
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ext4_evict_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] find_next_and_bit
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] wbt_track
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] es_reclaim_extents
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] arch_tlbbatch_flush
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] native_flush_tlb_others
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] should_failslab
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dentry_lru_isolate
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] find_vma
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] restore_regs_and_return_to_kernel
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_cgroup_bio_start
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] llist_add_batch
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] xas_alloc
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] slab_pre_alloc_hook.constprop.0
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] _atomic_dec_and_lock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_free
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_put
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mempool_free_slab
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ext4_es_count
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] smp_call_function_many_cond
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] jiffies_to_usecs
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_integrity_prep
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_min_vruntime
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_id_get_online
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __queue_delayed_work
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mod_delayed_work_on
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] delayed_work_timer_fn
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bdev_read_only
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvkm_mc_intr
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvkm_pci_intr
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __zone_watermark_ok
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __sbitmap_get_word
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rb_insert_color
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] walk_page_test
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] on_each_cpu_cond_mask
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kfree
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mempool_kfree
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvme_unmap_data.part.0
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] timerqueue_del
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_accelerate_cbs
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ext4_es_free_extent
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] es_do_reclaim_extents
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] hrtimer_run_queues
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __next_timer_interrupt
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] inc_zone_page_state
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] scan_swap_map_try_ssd_cluster
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __srcu_read_unlock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] page_inc_gen
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kthread_should_stop
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __irqentry_text_start
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] swap_range_alloc
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] wbt_issue
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_integrity_merge_bio
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] calc_global_load_tick
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] smp_call_function_single
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] finish_wait
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] native_set_pte
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] proc_evict_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ext4_clear_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] truncate_inode_pages_range
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] test_clear_page_writeback
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] allocate_slab
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rb_erase
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] resched_curr
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] xa_delete_node
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] compaction_suitable
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_cfs_group
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] drop_buffers
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] calc_wheel_index
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ioread32
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] alarm_timer_callback
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nv04_timer_intr
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvkm_subdev_intr
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvkm_timer_alarm
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvkm_timer_alarm_trigger
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvkm_timer_intr
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] inode_lru_isolate
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_io_ticks
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_account_io_start
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_crypt_rq_ctx_compatible
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_add_timer
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] wake_up_bit
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] kernfs_put
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] vmpressure_calc_level
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_id_put_many
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __hrtimer_next_event_base
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] alloc_pages_current
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] enqueue_hrtimer
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] clear_page_erms
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] intel_pmu_disable_all
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __mod_lruvec_kmem_state
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] calc_timer_values
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_segcblist_accelerate
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __note_gp_changes
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] note_gp_changes
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] get_nohz_timer_target
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] xas_free_nodes
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] add_interrupt_randomness
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] put_prev_task_fair
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] select_task_rq_fair
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] zone_watermark_ok
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ext4_drop_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] jbd2_journal_put_journal_head
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mem_cgroup_try_charge_swap
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] mm_trace_rss_stat
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] lru_gen_update_size
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] page_rmapping
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_mq_free_request
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] update_group_capacity
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __x86_retpoline_rdx
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] vma_is_shmem
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] tick_do_update_jiffies64
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] drain_obj_stock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] release_pages
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] native_iret
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] radix_tree_node_rcu_free
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] cyc2ns_read_end
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] arch_perf_update_userpage
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] timerqueue_add
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] swake_up_one
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_accelerate_cbs_unlocked
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rcu_gp_kthread_wake
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bio_crypt_ctx_mergeable
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __xas_next
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] anon_vma_interval_tree_iter_next
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __rq_qos_track
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] klist_iter_init_node
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __rq_qos_done_bio
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] blk_mq_get_driver_tag
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __pagevec_lru_add
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __es_tree_search.isra.0
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __rq_qos_issue
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] get_gate_vma
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nvme_setup_cmd
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __alloc_pages_nodemask
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] get_page_from_freelist
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] rq_wait_inc_below
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dma_direct_map_sg
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] wbt_data_dir
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] perf_pmu_nop_int
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] dma_map_page_attrs
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __intel_pmu_enable_all.constprop.0
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] inode_add_lru
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] generic_delete_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] bit_waitqueue
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __inode_wait_for_writeback
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] inode_wait_for_writeback
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] _raw_write_trylock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] _raw_write_lock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] put_pid.part.0
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] proc_pid_evict_inode
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] put_pid
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __rq_qos_merge
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] prandom_u32
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] detach_if_pending
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] del_timer_sync
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] wb_timer_fn
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] credit_entropy_bits.constprop.0
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] hrtimer_forward
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] irq_work_tick
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __x86_retpoline_r12
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] d_shrink_del
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] error_return
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] invoke_rcu_core
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] perf_event_set_state.part.0
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] fragmentation_index
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __bio_try_merge_page
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] radix_tree_node_ctor
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] sched_clock
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] should_fail_bio
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] pmdp_test_and_clear_young
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] set_pgdat_percpu_threshold
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __activate_page
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] acpi_os_read_memory
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] native_flush_tlb_one_user
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nmi_restore
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] ghes_notify_nmi
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] __ghes_peek_estatus.isra.0
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] nmi_cpu_backtrace_handler
     0.00%     0.00%  kswapd0  [kernel.kallsyms]  [k] intel_bts_enable_local


#
# (Tip: To record callchains for each sample: perf record -g)
#

Yu Zhao April 14, 2021, 7:58 a.m. UTC | #11

On Wed, Apr 14, 2021 at 12:15 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yu Zhao <yuzhao@google.com> writes:
>
> > On Tue, Apr 13, 2021 at 8:30 PM Rik van Riel <riel@surriel.com> wrote:
> >>
> >> On Wed, 2021-04-14 at 09:14 +1000, Dave Chinner wrote:
> >> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> >> >
> >> > > The initial posting of this patchset did no better, in fact it did
> >> > > a bit
> >> > > worse. Performance dropped to the same levels and kswapd was using
> >> > > as
> >> > > much CPU as before, but on top of that we also got excessive
> >> > > swapping.
> >> > > Not at a high rate, but 5-10MB/sec continually.
> >> > >
> >> > > I had some back and forths with Yu Zhao and tested a few new
> >> > > revisions,
> >> > > and the current series does much better in this regard. Performance
> >> > > still dips a bit when page cache fills, but not nearly as much, and
> >> > > kswapd is using less CPU than before.
> >> >
> >> > Profiles would be interesting, because it sounds to me like reclaim
> >> > *might* be batching page cache removal better (e.g. fewer, larger
> >> > batches) and so spending less time contending on the mapping tree
> >> > lock...
> >> >
> >> > IOWs, I suspect this result might actually be a result of less lock
> >> > contention due to a change in batch processing characteristics of
> >> > the new algorithm rather than it being a "better" algorithm...
> >>
> >> That seems quite likely to me, given the issues we have
> >> had with virtual scan reclaim algorithms in the past.
> >
> > Hi Rik,
> >
> > Let me paste the code so we can move beyond the "batching" hypothesis:
> >
> > static int __remove_mapping(struct address_space *mapping, struct page
> > *page,
> >                             bool reclaimed, struct mem_cgroup *target_memcg)
> > {
> >         unsigned long flags;
> >         int refcount;
> >         void *shadow = NULL;
> >
> >         BUG_ON(!PageLocked(page));
> >         BUG_ON(mapping != page_mapping(page));
> >
> >         xa_lock_irqsave(&mapping->i_pages, flags);
> >
> >> SeongJae, what is this algorithm supposed to do when faced
> >> with situations like this:
> >
> > I'll assume the questions were directed at me, not SeongJae.
> >
> >> 1) Running on a system with 8 NUMA nodes, and
> >> memory
> >>    pressure in one of those nodes.
> >> 2) Running PostgreSQL or Oracle, with hundreds of
> >>    processes mapping the same (very large) shared
> >>    memory segment.
> >>
> >> How do you keep your algorithm from falling into the worst
> >> case virtual scanning scenarios that were crippling the
> >> 2.4 kernel 15+ years ago on systems with just a few GB of
> >> memory?
> >
> > There is a fundamental shift: back then we were scanning for cold pages,
> > and nowadays we are scanning for hot pages.
> >
> > I'd be surprised if scanning for cold pages didn't fall apart, because it'd
> > find most of the entries accessed, if they are present at all.
> >
> > Scanning for hot pages, on the other hand, is way better. Let me just
> > reiterate:
> > 1) It will not scan page tables from processes that have been sleeping
> >    since the last scan.
> > 2) It will not scan PTE tables under non-leaf PMD entries that do not
> >    have the accessed bit set, when
> >    CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
> > 3) It will not zigzag between the PGD table and the same PMD or PTE
> >    table spanning multiple VMAs. In other words, it finishes all the
> >    VMAs with the range of the same PMD or PTE table before it returns
> >    to the PGD table. This optimizes workloads that have large numbers
> >    of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.
> >
> > So the cost is roughly proportional to the number of referenced pages it
> > discovers. If there is no memory pressure, no scanning at all. For a system
> > under heavy memory pressure, most of the pages are referenced (otherwise
> > why would it be under memory pressure?), and if we use the rmap, we need to
> > scan a lot of pages anyway. Why not just scan them all?
>
> This may not be the case.  For rmap scanning, it's possible to scan only
> a small portion of memory.  But with the page table scanning, you need
> to scan almost all (I understand you have some optimization as above).

Hi Ying,

Let's take a step back.

For the sake of discussion, when does the scanning have to happen? Can
we agree that the simplest answer is when we have evicted all inactive
pages?

If so, my next question is who's filled in the memory space previously
occupied by those inactive pages? Newly faulted in pages, right? They
have the accessed bit set, and we can't evict them without scanning
them first, would you agree?

And there are also existing active pages, and they were protected from
eviction. But now we need to deactivate some of them. Do you think
they'd have been used since the last scan? (Remember they were
active.)

You mentioned "a small portion" and "almost all". How do you interpret
them in terms of these steps?

Intuitively, "a small portion" and "almost all" seem right. But our
observations from *millions* of machines say the ratio of
pgscan_kswapd to pgsteal_kswapd is well over 7 when the anon percentage
is above 90%. Unlike streaming files, it doesn't make sense to "stream"
anon memory.
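
For anyone who wants to check this ratio on their own machine, here is a
minimal sketch in C. It assumes the counters appear in /proc/vmstat under
the names pgscan_kswapd and pgsteal_kswapd; the helper names
vmstat_counter() and scan_steal_ratio() are made up for illustration and
are not part of any kernel or libc API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up one "name value" line in a /proc/vmstat-style buffer.
 * Returns 0 and stores the value on success, -1 if the name is absent.
 * Hypothetical helper, for illustration only.
 */
int vmstat_counter(const char *text, const char *name,
		   unsigned long long *val)
{
	size_t len;
	const char *line = text;

	assert(text && name && val);
	len = strlen(name);

	while (line && *line) {
		/* match the whole field name, followed by a space */
		if (!strncmp(line, name, len) && line[len] == ' ') {
			*val = strtoull(line + len + 1, NULL, 10);
			return 0;
		}
		/* advance to the start of the next line, if any */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Pages scanned by kswapd per page it reclaimed.
 * Returns -1.0 if either counter is missing or nothing was reclaimed.
 */
double scan_steal_ratio(const char *text)
{
	unsigned long long scan, steal;

	if (vmstat_counter(text, "pgscan_kswapd", &scan) ||
	    vmstat_counter(text, "pgsteal_kswapd", &steal) || !steal)
		return -1.0;

	return (double)scan / (double)steal;
}
```

To use it, read /proc/vmstat into a NUL-terminated buffer and pass the
buffer to scan_steal_ratio(). The counters are cumulative since boot, so
sampling twice and working on the deltas gives the ratio for a given
interval.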

> As Rik showed in the test case above, there may be memory pressure on
> only one of 8 NUMA nodes (because of NUMA binding?).  Then rmap scanning
> only needs to scan pages in this node, while the page table scanning may
> need to scan pages in other nodes too.

Yes, and this is on my to-do list in the patchset:

To-do List
==========
KVM Optimization
----------------
Support shadow page table scanning.

NUMA Optimization
-----------------
Support NUMA policies and per-node RSS counters.

We can only move forward one step at a time. Fair?
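
Purely as an illustration of where the NUMA item could lead: if each mm
kept a per-node RSS counter, the aging pass for a node under pressure
could skip any mm that has no pages on that node. The sketch below is an
assumption about one possible shape, not code from the patchset; struct
mm_sketch, node_rss, should_walk_mm() and count_mms_to_walk() are all
hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_NODES 8	/* matches the 8-node example in this thread */

/* Hypothetical stand-in for an mm that tracks resident pages per node. */
struct mm_sketch {
	unsigned long node_rss[SKETCH_MAX_NODES];
};

/* Walk this mm's page tables only if it has pages on the node being aged. */
bool should_walk_mm(const struct mm_sketch *mm, int nid)
{
	assert(mm && nid >= 0 && nid < SKETCH_MAX_NODES);
	return mm->node_rss[nid] != 0;
}

/* Count how many of the given mms would be walked for one node. */
size_t count_mms_to_walk(const struct mm_sketch *mms, size_t nr, int nid)
{
	size_t i, walked = 0;

	for (i = 0; i < nr; i++)
		if (should_walk_mm(&mms[i], nid))
			walked++;
	return walked;
}
```

A filter like this only helps when processes are spread unevenly across
nodes. It does nothing for a large shared segment that every process
maps on the node under pressure.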

Huang, Ying April 14, 2021, 8:27 a.m. UTC | #12

Yu Zhao <yuzhao@google.com> writes:

> On Wed, Apr 14, 2021 at 12:15 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yu Zhao <yuzhao@google.com> writes:
>>
>> > On Tue, Apr 13, 2021 at 8:30 PM Rik van Riel <riel@surriel.com> wrote:
>> >>
>> >> On Wed, 2021-04-14 at 09:14 +1000, Dave Chinner wrote:
>> >> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
>> >> >
>> >> > > The initial posting of this patchset did no better, in fact it did
>> >> > > a bit
>> >> > > worse. Performance dropped to the same levels and kswapd was using
>> >> > > as
>> >> > > much CPU as before, but on top of that we also got excessive
>> >> > > swapping.
>> >> > > Not at a high rate, but 5-10MB/sec continually.
>> >> > >
>> >> > > I had some back and forths with Yu Zhao and tested a few new
>> >> > > revisions,
>> >> > > and the current series does much better in this regard. Performance
>> >> > > still dips a bit when page cache fills, but not nearly as much, and
>> >> > > kswapd is using less CPU than before.
>> >> >
>> >> > Profiles would be interesting, because it sounds to me like reclaim
>> >> > *might* be batching page cache removal better (e.g. fewer, larger
>> >> > batches) and so spending less time contending on the mapping tree
>> >> > lock...
>> >> >
>> >> > IOWs, I suspect this result might actually be a result of less lock
>> >> > contention due to a change in batch processing characteristics of
>> >> > the new algorithm rather than it being a "better" algorithm...
>> >>
>> >> That seems quite likely to me, given the issues we have
>> >> had with virtual scan reclaim algorithms in the past.
>> >
>> > Hi Rik,
>> >
>> > Let me paste the code so we can move beyond the "batching" hypothesis:
>> >
>> > static int __remove_mapping(struct address_space *mapping, struct page
>> > *page,
>> >                             bool reclaimed, struct mem_cgroup *target_memcg)
>> > {
>> >         unsigned long flags;
>> >         int refcount;
>> >         void *shadow = NULL;
>> >
>> >         BUG_ON(!PageLocked(page));
>> >         BUG_ON(mapping != page_mapping(page));
>> >
>> >         xa_lock_irqsave(&mapping->i_pages, flags);
>> >
>> >> SeongJae, what is this algorithm supposed to do when faced
>> >> with situations like this:
>> >
>> > I'll assume the questions were directed at me, not SeongJae.
>> >
>> >> 1) Running on a system with 8 NUMA nodes, and
>> >> memory
>> >>    pressure in one of those nodes.
>> >> 2) Running PostgreSQL or Oracle, with hundreds of
>> >>    processes mapping the same (very large) shared
>> >>    memory segment.
>> >>
>> >> How do you keep your algorithm from falling into the worst
>> >> case virtual scanning scenarios that were crippling the
>> >> 2.4 kernel 15+ years ago on systems with just a few GB of
>> >> memory?
>> >
>> > There is a fundamental shift: back then we were scanning for cold pages,
>> > and nowadays we are scanning for hot pages.
>> >
>> > I'd be surprised if scanning for cold pages didn't fall apart, because it'd
>> > find most of the entries accessed, if they are present at all.
>> >
>> > Scanning for hot pages, on the other hand, is way better. Let me just
>> > reiterate:
>> > 1) It will not scan page tables from processes that have been sleeping
>> >    since the last scan.
>> > 2) It will not scan PTE tables under non-leaf PMD entries that do not
>> >    have the accessed bit set, when
>> >    CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
>> > 3) It will not zigzag between the PGD table and the same PMD or PTE
>> >    table spanning multiple VMAs. In other words, it finishes all the
>> >    VMAs with the range of the same PMD or PTE table before it returns
>> >    to the PGD table. This optimizes workloads that have large numbers
>> >    of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.
>> >
>> > So the cost is roughly proportional to the number of referenced pages it
>> > discovers. If there is no memory pressure, no scanning at all. For a system
>> > under heavy memory pressure, most of the pages are referenced (otherwise
>> > why would it be under memory pressure?), and if we use the rmap, we need to
>> > scan a lot of pages anyway. Why not just scan them all?
>>
>> This may not be the case.  For rmap scanning, it's possible to scan only
>> a small portion of memory.  But with the page table scanning, you need
>> to scan almost all (I understand you have some optimization as above).
>
> Hi Ying,
>
> Let's take a step back.
>
> For the sake of discussion, when does the scanning have to happen? Can
> we agree that the simplest answer is when we have evicted all inactive
> pages?
>
> If so, my next question is who's filled in the memory space previously
> occupied by those inactive pages? Newly faulted in pages, right? They
> have the accessed bit set, and we can't evict them without scanning
> them first, would you agree?
>
> And there are also existing active pages, and they were protected from
> eviction. But now we need to deactivate some of them. Do you think
> they'd have been used since the last scan? (Remember they were
> active.)
>
> You mentioned "a small portion" and "almost all". How do you interpret
> them in terms of these steps?
>
> Intuitively, "a small portion" and "almost all" seem right. But our
> observations from *millions* of machines say the ratio of
> pgscan_kswapd to pgsteal_kswapd is well over 7 when the anon percentage
> is above 90%. Unlike streaming files, it doesn't make sense to "stream"
> anon memory.

What I said is that it is "POSSIBLE" to scan only a small portion of
memory.  Whether and in which cases to do that depends on the policy we
choose.  I didn't say we have chosen to do that for all cases.

>> As Rik showed in the test case above, there may be memory pressure on
>> only one of 8 NUMA nodes (because of NUMA binding?).  Then rmap scanning
>> only needs to scan pages in this node, while the page table scanning may
>> need to scan pages in other nodes too.
>
> Yes, and this is on my to-do list in the patchset:
>
> To-do List
> ==========
> KVM Optimization
> ----------------
> Support shadow page table scanning.
>
> NUMA Optimization
> -----------------
> Support NUMA policies and per-node RSS counters.
>
> We can only move forward one step at a time. Fair?

You definitely don't need to implement that now.  But we can discuss
possible solutions now.

Note that it's possible that only some processes are bound to some NUMA
nodes, while other processes aren't bound.

Best Regards,
Huang, Ying

Yu Zhao April 14, 2021, 10 a.m. UTC | #13

On Wed, Apr 14, 2021 at 01:16:52AM -0600, Yu Zhao wrote:
> On Tue, Apr 13, 2021 at 10:50 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote:
> > > On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> > > > > On 4/13/21 1:51 AM, SeongJae Park wrote:
> > > > > > From: SeongJae Park <sjpark@amazon.de>
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > >
> > > > > > Very interesting work, thank you for sharing this :)
> > > > > >
> > > > > > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao <yuzhao@google.com> wrote:
> > > > > >
> > > > > >> What's new in v2
> > > > > >> ================
> > > > > >> Special thanks to Jens Axboe for reporting a regression in buffered
> > > > > >> I/O and helping test the fix.
> > > > > >
> > > > > > Is the discussion open?  If so, could you please give me a link?
> > > > >
> > > > > I wasn't on the initial post (or any of the lists it was posted to), but
> > > > > it's on the google page reclaim list. Not sure if that is public or not.
> > > > >
> > > > > tldr is that I was pretty excited about this work, as buffered IO tends
> > > > > to suck (a lot) for high throughput applications. My test case was
> > > > > pretty simple:
> > > > >
> > > > > Randomly read a fast device, using 4k buffered IO, and watch what
> > > > > happens when the page cache gets filled up. For this particular test,
> > > > > we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> > > > > with kswapd using a lot of CPU trying to keep up. That's mainline
> > > > > behavior.
> > > >
> > > > I see this exact same behaviour here, too, but I RCA'd it to
> > > > contention between the inode and memory reclaim for the mapping
> > > > structure that indexes the page cache. Basically the mapping tree
> > > > lock is the contention point here - you can either be adding pages
> > > > to the mapping during IO, or memory reclaim can be removing pages
> > > > from the mapping, but we can't do both at once.
> > > >
> > > > So we end up with kswapd spinning on the mapping tree lock like so
> > > > when doing 1.6GB/s in 4kB buffered IO:
> > > >
> > > > -   20.06%     0.00%  [kernel]               [k] kswapd
> > > >    - 20.06% kswapd
> > > >       - 20.05% balance_pgdat
> > > >          - 20.03% shrink_node
> > > >             - 19.92% shrink_lruvec
> > > >                - 19.91% shrink_inactive_list
> > > >                   - 19.22% shrink_page_list
> > > >                      - 17.51% __remove_mapping
> > > >                         - 14.16% _raw_spin_lock_irqsave
> > > >                            - 14.14% do_raw_spin_lock
> > > >                                 __pv_queued_spin_lock_slowpath
> > > >                         - 1.56% __delete_from_page_cache
> > > >                              0.63% xas_store
> > > >                         - 0.78% _raw_spin_unlock_irqrestore
> > > >                            - 0.69% do_raw_spin_unlock
> > > >                                 __raw_callee_save___pv_queued_spin_unlock
> > > >                      - 0.82% free_unref_page_list
> > > >                         - 0.72% free_unref_page_commit
> > > >                              0.57% free_pcppages_bulk
> > > >
> > > > And these are the processes consuming CPU:
> > > >
> > > >    5171 root      20   0 1442496   5696   1284 R  99.7   0.0   1:07.78 fio
> > > >    1150 root      20   0       0      0      0 S  47.4   0.0   0:22.70 kswapd1
> > > >    1146 root      20   0       0      0      0 S  44.0   0.0   0:21.85 kswapd0
> > > >    1152 root      20   0       0      0      0 S  39.7   0.0   0:18.28 kswapd3
> > > >    1151 root      20   0       0      0      0 S  15.2   0.0   0:12.14 kswapd2
> > > >
> > > > i.e. when memory reclaim kicks in, the read process has 20% less
> > > > time with exclusive access to the mapping tree to insert new pages.
> > > > Hence buffered read performance goes down quite substantially when
> > > > memory reclaim kicks in, and this really has nothing to do with the
> > > > memory reclaim LRU scanning algorithm.
> > > >
> > > > I can actually get this machine to pin those 5 processes to 100% CPU
> > > > under certain conditions. Each process is spinning all that extra
> > > > time on the mapping tree lock, and performance degrades further.
> > > > Changing the LRU reclaim algorithm won't fix this - the workload is
> > > > solidly bound by the exclusive nature of the mapping tree lock and
> > > > the number of tasks trying to obtain it exclusively...
> > > >
> > > > > The initial posting of this patchset did no better, in fact it did a bit
> > > > > worse. Performance dropped to the same levels and kswapd was using as
> > > > > much CPU as before, but on top of that we also got excessive swapping.
> > > > > Not at a high rate, but 5-10MB/sec continually.
> > > > >
> > > > > I had some back and forths with Yu Zhao and tested a few new revisions,
> > > > > and the current series does much better in this regard. Performance
> > > > > still dips a bit when page cache fills, but not nearly as much, and
> > > > > kswapd is using less CPU than before.
> > > >
> > > > Profiles would be interesting, because it sounds to me like reclaim
> > > > *might* be batching page cache removal better (e.g. fewer, larger
> > > > batches) and so spending less time contending on the mapping tree
> > > > lock...
> > > >
> > > > IOWs, I suspect this result might actually be a result of less lock
> > > > contention due to a change in batch processing characteristics of
> > > > the new algorithm rather than it being a "better" algorithm...
> > >
> > > I appreciate the profile. But there is no batching in
> > > __remove_mapping() -- it locks the mapping for each page, and
> > > therefore the lock contention penalizes the mainline and this patchset
> > > equally. It looks worse on your system because the four kswapd threads
> > > from different nodes were working on the same file.
> >
> > I think you misunderstand exactly what I mean by "batching" here.
> > I'm not talking about doing multiple pieces of work under a single
> > lock. What I mean is that the overall amount of work done in a
> > > single reclaim scan (i.e. a "reclaim batch") is packaged differently.
> >
> > We already batch up page reclaim via building a page list and then
> > passing it to shrink_page_list() to process the batch of pages in a
> > single pass. Each page in this page list batch then calls
> > > remove_mapping() to pull the page from the LRU, so we have a run of
> > contention between the foreground read() thread and the background
> > kswapd.
> >
> > If the size or nature of the pages in the batch passed to
> > shrink_page_list() changes, then the amount of time a reclaim batch
> > is going to put pressure on the mapping tree lock will also change.
> > That's the "change in batching behaviour" I'm referring to here. I
> > haven't read through the patchset to determine if you change the
> > shrink_page_list() algorithm, but it likely changes what is passed
> > to be reclaimed and that in turn changes the locking patterns that
> > fall out of shrink_page_list...
> 
> Ok, if we are talking about the size of the batch passed to
> shrink_page_list(), both the mainline and this patchset cap it at
> SWAP_CLUSTER_MAX, which is 32. There are corner cases, but when
> running fio/io_uring, it's safe to say both use 32.
> 
> > > And kswapd is only one of two paths that could affect the performance.
> > > The kernel context of the test process is where the improvement mainly
> > > comes from.
> > >
> > > I also suspect you were testing a file much larger than your memory
> > > size. If so, sorry to tell you that a file only a few times larger,
> > > e.g. twice, would be worse.
> > >
> > > Here is my take:
> > >
> > > Claim
> > > -----
> > > This patchset is a "better" algorithm. (Technically it's not an
> > > algorithm, it's a feedback loop.)
> > >
> > > Theoretical basis
> > > -----------------
> > > An open-loop control (the mainline) can only be better if the margin
> > > of error in its prediction of the future events is less than that from
> > > the trial-and-error of a closed-loop control (this patchset). For
> > > simple machines, it surely can. For page reclaim, AFAIK, it can't.
> > >
> > > A typical example: when randomly accessing a (not infinitely) large
> > > file via buffered io long enough, we're bound to hit the same blocks
> > > multiple times. Should we activate the pages containing those blocks,
> > > i.e., to move them to the active lru list?  No.
> > >
> > > RCA
> > > ---
> > > For the fio/io_uring benchmark, the "No" is the key.
> > >
> > > The mainline activates pages accessed multiple times. This is done in
> > > the buffered io access path by mark_page_accessed(), and it takes the
> > > lru lock, which is contended under memory pressure. This contention
> > > slows down both the access path and kswapd. But kswapd is not the
> > > problem here because we are measuring the io_uring process, not kswapd.
> > >
> > > For this patchset, there are no activations since the refault rates of
> > > pages accessed multiple times are similar to those accessed only once
> > > -- activations will only be done to pages from tiers with higher
> > > refault rates.
> > >
> > > If you wish to debunk
> > > ---------------------
> >
> > Nope, it's your job to convince us that it works, not the other way
> > around. It's up to you to prove that your assertions are correct,
> > not for us to prove they are false.
> 
> Just trying to keep people motivated, my homework is my own.
> 
> > > git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/73/1173/1
> > >
> > > CONFIG_LRU_GEN=y
> > > CONFIG_LRU_GEN_ENABLED=y
> > >
> > > Run your benchmarks
> > >
> > > Profiles (200G mem + 400G file)
> > > -------------------------------
> > > A quick test from Jens' fio/io_uring:
> > >
> > > -rc7
> > >     13.30%  io_uring  xas_load
> > >     13.22%  io_uring  _copy_to_iter
> > >     12.30%  io_uring  __add_to_page_cache_locked
> > >      7.43%  io_uring  clear_page_erms
> > >      4.18%  io_uring  filemap_get_read_batch
> > >      3.54%  io_uring  get_page_from_freelist
> > >      2.98%  io_uring  ***native_queued_spin_lock_slowpath***
> > >      1.61%  io_uring  page_cache_ra_unbounded
> > >      1.16%  io_uring  xas_start
> > >      1.08%  io_uring  filemap_read
> > >      1.07%  io_uring  ***__activate_page***
> > >
> > > lru lock: 2.98% (lru addition + activation)
> > > activation: 1.07%
> > >
> > > -rc7 + this patchset
> > >     14.44%  io_uring  xas_load
> > >     14.14%  io_uring  _copy_to_iter
> > >     11.15%  io_uring  __add_to_page_cache_locked
> > >      6.56%  io_uring  clear_page_erms
> > >      4.44%  io_uring  filemap_get_read_batch
> > >      2.14%  io_uring  get_page_from_freelist
> > >      1.32%  io_uring  page_cache_ra_unbounded
> > >      1.20%  io_uring  psi_group_change
> > >      1.18%  io_uring  filemap_read
> > >      1.09%  io_uring  ***native_queued_spin_lock_slowpath***
> > >      1.08%  io_uring  do_mpage_readpage
> > >
> > > lru lock: 1.09% (lru addition only)
> >
> > All this tells us is that there was *less contention on the mapping
> > tree lock*. It does not tell us why there was less contention.
> >
> > You've handily omitted the kswapd profile, which is really the one
> > of interest to the discussion here - how did the memory reclaim CPU
> > usage profile also change at the same time?
> 
> Well, let me attach them. Suffix -1 is the mainline, -2 is the patchset.
> 
>   mainline
>      57.65%  kswapd0  __remove_mapping
>   this patchset
>      61.61%  kswapd0  __remove_mapping
> 
> As I said, the mapping lock contention penalizes both heavily. Its
> percentage is even higher with the patchset, because it has less
> overhead. I'm trying to explain "the less overhead" part: it's the
> activations that make the mainline worse.
> 
>   mainline
>     6.53%  kswapd0  shrink_active_list
>   this patchset
>     0
> 
> From the io_uring context:
>   mainline
>      2.53%  io_uring  mark_page_accessed
>   this patchset
>      0.52%  io_uring  mark_page_accessed
> 
> mark_page_accessed() moves pages accessed multiple times to the active
> lru list. Then shrink_active_list() moves them back to the inactive
> list. All for nothing.
> 
> I don't want to paste everything here -- they'd clutter. Please see
> all the detailed profiles in the attachment. Let me know if their
> formats are not to your liking. I still have the raw perf.data.
> 
> > > And I plan to reach out to other communities, e.g., PostgreSQL, to
> > > benchmark the patchset. I heard they have been complaining about the
> > > buffered io performance under memory pressure. Any other benchmarks
> > > you'd suggest?
> > >
> > > BTW, you might find another surprise in how much less frequently slab
> > > shrinkers are called under memory pressure, because this patchset is a
> > > lot better at finding pages to reclaim and therefore doesn't overkill
> > > slabs.
> >
> > That's actually very likely to be a Bad Thing and cause unexpected
> > performance and OOM based regressions. When the machine finally runs
> > out of page cache it can easily reclaim, it's going to get stuck
> > with long tail latencies reclaiming huge slab caches as they've had
> > no substantial ongoing pressure put on them to keep them in balance
> > with the overall memory pressure the system is under...
> 
> Well, it does use the existing equation. That is, if it scans X% of
> pages, then it scans X% of slab objects. But 1) it often finds pages
> to reclaim at a lower X%, and 2) the pages it reclaims are less likely
> to refault. So the side effect is that the overall number of slab
> objects it scans is also reduced. I do see your point but don't see
> any options at the moment.
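
To make the "existing equation" above concrete, here is a minimal
sketch of a plain proportional rule. The function name and its
arguments are made up for illustration; the real shrinker math in the
kernel also folds in the reclaim priority and per-shrinker seek costs,
which are left out here.

```python
def slab_objects_to_scan(pages_scanned, pages_total, slab_objects):
    """Scan the same fraction of slab objects as of LRU pages.

    Hypothetical helper: if X% of the pages were scanned, ask the
    shrinkers to scan X% of their objects.
    """
    if pages_total == 0:
        return 0
    return slab_objects * pages_scanned // pages_total
```

With this rule, finding enough reclaimable pages after scanning 2% of
the LRU instead of 10% also cuts the slab scan target by the same
factor, which is the side effect described above.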

I apologize for the spam. Apparently the attachment in my previous email
didn't reach everybody. I hope this works:

git clone https://linux-mm.googlesource.com/benchmarks

Repo contains profiles collected when running fio/io_uring,
  mainline:
    kswapd-1.txt
    kswapd-1.svg
    io_uring-1.txt
    io_uring-1.svg
  
  patched:
    kswapd-2.txt
    kswapd-2.svg
    io_uring-2.txt
    io_uring-2.svg

Thanks.

Rik van Riel April 14, 2021, 1:51 p.m. UTC | #14

On Wed, 2021-04-14 at 16:27 +0800, Huang, Ying wrote:
> Yu Zhao <yuzhao@google.com> writes:
> 
> > On Wed, Apr 14, 2021 at 12:15 AM Huang, Ying <ying.huang@intel.com>
> > wrote:
> > > 
> > NUMA Optimization
> > -----------------
> > Support NUMA policies and per-node RSS counters.
> > 
> > We only can move forward one step at a time. Fair?
> 
> You don't need to implement that now definitely.  But we can discuss
> the possible solution now.

That was my intention, too. I want to make sure we don't
end up "painting ourselves into a corner" by moving in some
direction we have no way to get out of.

The patch set looks promising, but we need some plan to
avoid the worst case behaviors that forced us into rmap
based scanning initially.

> Note that it's possible that only some processes are bound to some
> NUMA nodes, while other processes aren't bound.

For workloads like PostgreSQL or Oracle, it is common
to have maybe 70% of memory in a large shared memory
segment, spread between all the NUMA nodes, and mapped
into hundreds, if not thousands, of processes in the
system.

Now imagine we have an 8 node system, and memory
pressure in the DMA32 zone of node 0.

How will the current VM behave?

What will the virtual scanning need to do?

If we can come up with a solution to make virtual
scanning scale for that kind of workload, great.

If not ... if it turns out most of the benefits of
the multigenerational LRU framework come from sorting
the pages into multiple LRUs, and from being able
to easily reclaim unmapped pages before having to
scan mapped ones, could it be an idea to implement
that first, independently from virtual scanning?

I am all for improving our page reclaim system, I just
want to make sure we don't revisit the old traps that
forced us where we are today :)

Jens Axboe April 14, 2021, 2:43 p.m. UTC | #15

On 4/13/21 5:14 PM, Dave Chinner wrote:
> On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
>> On 4/13/21 1:51 AM, SeongJae Park wrote:
>>> From: SeongJae Park <sjpark@amazon.de>
>>>
>>> Hello,
>>>
>>>
>>> Very interesting work, thank you for sharing this :)
>>>
>>> On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao <yuzhao@google.com> wrote:
>>>
>>>> What's new in v2
>>>> ================
>>>> Special thanks to Jens Axboe for reporting a regression in buffered
>>>> I/O and helping test the fix.
>>>
>>> Is the discussion open?  If so, could you please give me a link?
>>
>> I wasn't on the initial post (or any of the lists it was posted to), but
>> it's on the google page reclaim list. Not sure if that is public or not.
>>
>> tldr is that I was pretty excited about this work, as buffered IO tends
>> to suck (a lot) for high throughput applications. My test case was
>> pretty simple:
>>
>> Randomly read a fast device, using 4k buffered IO, and watch what
>> happens when the page cache gets filled up. For this particular test,
>> we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
>> with kswapd using a lot of CPU trying to keep up. That's mainline
>> behavior.
> 
> I see this exact same behaviour here, too, but I RCA'd it to
> contention between the inode and memory reclaim for the mapping
> structure that indexes the page cache. Basically the mapping tree
> lock is the contention point here - you can either be adding pages
> to the mapping during IO, or memory reclaim can be removing pages
> from the mapping, but we can't do both at once.
> 
> So we end up with kswapd spinning on the mapping tree lock like so
> when doing 1.6GB/s in 4kB buffered IO:
> 
> -   20.06%     0.00%  [kernel]               [k] kswapd
>    - 20.06% kswapd
>       - 20.05% balance_pgdat
>          - 20.03% shrink_node
>             - 19.92% shrink_lruvec
>                - 19.91% shrink_inactive_list
>                   - 19.22% shrink_page_list
>                      - 17.51% __remove_mapping
>                         - 14.16% _raw_spin_lock_irqsave
>                            - 14.14% do_raw_spin_lock
>                                 __pv_queued_spin_lock_slowpath
>                         - 1.56% __delete_from_page_cache
>                              0.63% xas_store
>                         - 0.78% _raw_spin_unlock_irqrestore
>                            - 0.69% do_raw_spin_unlock
>                                 __raw_callee_save___pv_queued_spin_unlock
>                      - 0.82% free_unref_page_list
>                         - 0.72% free_unref_page_commit
>                              0.57% free_pcppages_bulk
>
> And these are the processes consuming CPU:
> 
>    5171 root      20   0 1442496   5696   1284 R  99.7   0.0   1:07.78 fio
>    1150 root      20   0       0      0      0 S  47.4   0.0   0:22.70 kswapd1
>    1146 root      20   0       0      0      0 S  44.0   0.0   0:21.85 kswapd0
>    1152 root      20   0       0      0      0 S  39.7   0.0   0:18.28 kswapd3
>    1151 root      20   0       0      0      0 S  15.2   0.0   0:12.14 kswapd2

Here's my profile when memory reclaim is active for the above mentioned
test case. This is a single node system, so just kswapd. It's using around
40-45% CPU:

    43.69%  kswapd0  [kernel.vmlinux]  [k] xas_create
            |
            ---ret_from_fork
               kthread
               kswapd
               balance_pgdat
               shrink_node
               shrink_lruvec
               shrink_inactive_list
               shrink_page_list
               __delete_from_page_cache
               xas_store
               xas_create

    16.88%  kswapd0  [kernel.vmlinux]  [k] queued_spin_lock_slowpath
            |
            ---ret_from_fork
               kthread
               kswapd
               balance_pgdat
               shrink_node
               shrink_lruvec
               |          
                --16.82%--shrink_inactive_list
                          |          
                           --16.55%--shrink_page_list
                                     |          
                                      --16.26%--_raw_spin_lock_irqsave
                                                queued_spin_lock_slowpath

     9.89%  kswapd0  [kernel.vmlinux]  [k] shrink_page_list
            |
            ---ret_from_fork
               kthread
               kswapd
               balance_pgdat
               shrink_node
               shrink_lruvec
               shrink_inactive_list
               shrink_page_list

     5.46%  kswapd0  [kernel.vmlinux]  [k] xas_init_marks
            |
            ---ret_from_fork
               kthread
               kswapd
               balance_pgdat
               shrink_node
               shrink_lruvec
               shrink_inactive_list
               shrink_page_list
               |          
                --5.41%--__delete_from_page_cache
                          xas_init_marks

     4.42%  kswapd0  [kernel.vmlinux]  [k] __delete_from_page_cache
            |
            ---ret_from_fork
               kthread
               kswapd
               balance_pgdat
               shrink_node
               shrink_lruvec
               shrink_inactive_list
               |          
                --4.40%--shrink_page_list
                          __delete_from_page_cache

     2.82%  kswapd0  [kernel.vmlinux]  [k] isolate_lru_pages
            |
            ---ret_from_fork
               kthread
               kswapd
               balance_pgdat
               shrink_node
               shrink_lruvec
               |          
               |--1.43%--shrink_active_list
               |          isolate_lru_pages
               |          
                --1.39%--shrink_inactive_list
                          isolate_lru_pages

     1.99%  kswapd0  [kernel.vmlinux]  [k] free_pcppages_bulk
            |
            ---ret_from_fork
               kthread
               kswapd
               balance_pgdat
               shrink_node
               shrink_lruvec
               shrink_inactive_list
               shrink_page_list
               free_unref_page_list
               free_unref_page_commit
               free_pcppages_bulk

     1.79%  kswapd0  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
            |
            ---ret_from_fork
               kthread
               kswapd
               balance_pgdat
               |          
                --1.76%--shrink_node
                          shrink_lruvec
                          shrink_inactive_list
                          |          
                           --1.72%--shrink_page_list
                                     _raw_spin_lock_irqsave

     1.02%  kswapd0  [kernel.vmlinux]  [k] workingset_eviction
            |
            ---ret_from_fork
               kthread
               kswapd
               balance_pgdat
               shrink_node
               shrink_lruvec
               shrink_inactive_list
               |          
                --1.00%--shrink_page_list
                          workingset_eviction

> i.e. when memory reclaim kicks in, the read process has 20% less
> time with exclusive access to the mapping tree to insert new pages.
> Hence buffered read performance goes down quite substantially when
> memory reclaim kicks in, and this really has nothing to do with the
> memory reclaim LRU scanning algorithm.
> 
> I can actually get this machine to pin those 5 processes to 100% CPU
> under certain conditions. Each process is spinning all that extra
> time on the mapping tree lock, and performance degrades further.
> Changing the LRU reclaim algorithm won't fix this - the workload is
> solidly bound by the exclusive nature of the mapping tree lock and
> the number of tasks trying to obtain it exclusively...

I've seen way worse than the above as well, it's just my go-to easy test
case for "man I wish buffered IO didn't suck so much".

>> The initial posting of this patchset did no better, in fact it did a bit
>> worse. Performance dropped to the same levels and kswapd was using as
>> much CPU as before, but on top of that we also got excessive swapping.
>> Not at a high rate, but 5-10MB/sec continually.
>>
>> I had some back and forths with Yu Zhao and tested a few new revisions,
>> and the current series does much better in this regard. Performance
>> still dips a bit when page cache fills, but not nearly as much, and
>> kswapd is using less CPU than before.
> 
> Profiles would be interesting, because it sounds to me like reclaim
> *might* be batching page cache removal better (e.g. fewer, larger
> batches) and so spending less time contending on the mapping tree
> lock...
> 
> IOWs, I suspect this result might actually be a result of less lock
> contention due to a change in batch processing characteristics of
> the new algorithm rather than it being a "better" algorithm...

See above - let me know if you want to see more specific profiling as
well.

Andi Kleen April 14, 2021, 3:51 p.m. UTC | #16

>    2) It will not scan PTE tables under non-leaf PMD entries that do not
>       have the accessed bit set, when
>       CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.

This assumes that workloads have reasonable locality. Could there
be a worst case where only one or two pages in each PTE table are
used, so this PTE skipping trick doesn't work?
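
A back-of-the-envelope sketch of that worst case, assuming x86-64 with
4 KiB pages (512 entries per PTE table) and assuming that touching any
page sets the accessed bit on its parent PMD entry. The helper name is
made up for illustration:

```python
PTRS_PER_PTE = 512  # entries per PTE table on x86-64 with 4 KiB pages


def pte_tables_to_scan(accessed_pages):
    """Count PTE tables that cannot be skipped.

    accessed_pages: iterable of virtual page numbers touched since the
    last scan. A table is scanned if at least one of its pages was
    touched, because its parent PMD entry then has the accessed bit set.
    """
    return len({vpn // PTRS_PER_PTE for vpn in accessed_pages})


# Good locality: 1024 touched pages packed into two PTE tables.
dense = range(1024)
# Worst case: 1024 touched pages, exactly one per PTE table.
sparse = range(0, 1024 * PTRS_PER_PTE, PTRS_PER_PTE)
```

Here the dense pattern leaves 2 tables to scan while the sparse one
leaves 1024, each scanned for the sake of a single page.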

-Andi

Andi Kleen April 14, 2021, 3:56 p.m. UTC | #17

> Now imagine we have an 8 node system, and memory
> pressure in the DMA32 zone of node 0.

The question is how much do we still care about DMA32.
If there are problems they can probably just turn on the IOMMU for
these IO mappings.

-Andi

Rik van Riel April 14, 2021, 3:58 p.m. UTC | #18

On Wed, 2021-04-14 at 08:51 -0700, Andi Kleen wrote:
> >    2) It will not scan PTE tables under non-leaf PMD entries that
> >       do not have the accessed bit set, when
> >       CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
> 
> This assumes  that workloads have reasonable locality. Could there
> be a worst case where only one or two pages in each PTE are used,
> so this PTE skipping trick doesn't work?

Databases with large shared memory segments shared between
many processes come to mind as a real-world example of a
worst case scenario.

Shakeel Butt April 14, 2021, 3:58 p.m. UTC | #19

On Wed, Apr 14, 2021 at 6:52 AM Rik van Riel <riel@surriel.com> wrote:
>
> On Wed, 2021-04-14 at 16:27 +0800, Huang, Ying wrote:
> > Yu Zhao <yuzhao@google.com> writes:
> >
> > > On Wed, Apr 14, 2021 at 12:15 AM Huang, Ying <ying.huang@intel.com>
> > > wrote:
> > > >
> > > NUMA Optimization
> > > -----------------
> > > Support NUMA policies and per-node RSS counters.
> > >
> > > We only can move forward one step at a time. Fair?
> >
> > You don't need to implement that now definitely.  But we can discuss
> > the possible solution now.
>
> That was my intention, too. I want to make sure we don't
> end up "painting ourselves into a corner" by moving in some
> direction we have no way to get out of.
>
> The patch set looks promising, but we need some plan to
> avoid the worst case behaviors that forced us into rmap
> based scanning initially.
>
> > Note that it's possible that only some processes are bound to some
> > NUMA nodes, while other processes aren't bound.
>
> For workloads like PostgreSQL or Oracle, it is common
> to have maybe 70% of memory in a large shared memory
> segment, spread between all the NUMA nodes, and mapped
> into hundreds, if not thousands, of processes in the
> system.
>
> Now imagine we have an 8 node system, and memory
> pressure in the DMA32 zone of node 0.
>
> How will the current VM behave?
>
> What will the virtual scanning need to do?
>
> If we can come up with a solution to make virtual
> scanning scale for that kind of workload, great.
>
> If not ... if it turns out most of the benefits of
> the multigenerational LRU framework come from sorting
> the pages into multiple LRUs, and from being able
> to easily reclaim unmapped pages before having to
> scan mapped ones, could it be an idea to implement
> that first, independently from virtual scanning?
>
> I am all for improving our page reclaim system, I just
> want to make sure we don't revisit the old traps that
> forced us where we are today :)
>

One potential idea is to take a hybrid approach that combines rmap and
virtual scanning. If the number of pages that are targeted to be
scanned is below some threshold, use rmap; otherwise use virtual
scanning. I think we can experimentally find a good value for that
threshold.
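
Roughly, the decision would look like the sketch below. Both the
function name and the threshold are placeholders; the actual value
would have to come out of the experiments mentioned above.

```python
def choose_scan_method(nr_to_scan, threshold):
    """Pick a scanning strategy for one reclaim pass.

    nr_to_scan: number of pages targeted by this pass.
    threshold: hypothetical tunable, to be determined experimentally.
    """
    # Few target pages: walking the rmap per page is cheap enough.
    # Many target pages: one page table walk amortizes better.
    return "rmap" if nr_to_scan < threshold else "virtual"
```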

Johannes Weiner April 14, 2021, 5:43 p.m. UTC | #20

Hello Yu,

On Tue, Apr 13, 2021 at 12:56:17AM -0600, Yu Zhao wrote:
> What's new in v2
> ================
> Special thanks to Jens Axboe for reporting a regression in buffered
> I/O and helping test the fix.
> 
> This version includes the support of tiers, which represent levels of
> usage from file descriptors only. Pages accessed N times via file
> descriptors belong to tier order_base_2(N). Each generation contains
> at most MAX_NR_TIERS tiers, and they require additional MAX_NR_TIERS-2
> bits in page->flags. In contrast to moving across generations which
> requires the lru lock, moving across tiers only involves an atomic
> operation on page->flags and therefore has a negligible cost. A
> feedback loop modeled after the well-known PID controller monitors the
> refault rates across all tiers and decides when to activate pages from
> which tiers, on the reclaim path.
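
For reference, the tier mapping quoted above can be sketched as
follows. order_base_2() is the kernel's rounded-up log2; tier_of() and
tiers_to_protect() are names made up here, and the second one is only
a crude stand-in for the feedback loop (it compares each tier's
refault rate against tier 0 and does not model a PID controller or the
MAX_NR_TIERS cap).

```python
def order_base_2(n):
    """Rounded-up base-2 log, 0 for n <= 1 (same as the kernel macro)."""
    return (n - 1).bit_length() if n > 1 else 0


def tier_of(nr_fd_accesses):
    """Tier of a page accessed N times via file descriptors."""
    return order_base_2(nr_fd_accesses)


def tiers_to_protect(refault_rate_by_tier):
    """Toy feedback rule: tiers that refault more often than tier 0."""
    base = refault_rate_by_tier[0]
    return [tier for tier, rate in enumerate(refault_rate_by_tier)
            if tier > 0 and rate > base]
```

So pages accessed 1, 2, 3-4 and 5-8 times land in tiers 0, 1, 2 and 3
respectively, and under this toy rule only tiers whose refault rate
exceeds that of tier 0 would be activated.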

Could you elaborate a bit more on the difference between generations
and tiers?

A refault, a page table reference, or a buffered read through a file
descriptor ultimately all boil down to a memory access. The value of
having that memory resident and the cost of bringing it in from
backing storage should be the same regardless of how it's accessed by
userspace; and whether it's an in-memory reference or a non-resident
reference should have the same relative impact on the page's age.

With that context, I don't understand why file descriptor refs and
refaults get such special treatment. Could you shed some light here?

> This feedback model has a few advantages over the current feedforward
> model:
> 1) It has a negligible overhead in the buffered I/O access path
>    because activations are done in the reclaim path.

This is useful if the workload isn't reclaim bound, but it can be
hazardous to defer work to reclaim, too.

If you go through the git history, there have been several patches to
soften access recognition inside reclaim because it can come with
large latencies when page reclaim kicks in after a longer period with
no memory pressure and doesn't have up-to-date reference information -
to the point where eating a few extra IOs tends to add less latency to
the workload than waiting for reclaim to refresh its aging data.

Could you elaborate a bit more on the tradeoff here?

> Highlights from the discussions on v1
> =====================================
> Thanks to Ying Huang and Dave Hansen for the comments and suggestions
> on page table scanning.
> 
> A simple worst-case scenario test did not find page table scanning
> underperforms the rmap because of the following optimizations:
> 1) It will not scan page tables from processes that have been sleeping
>    since the last scan.
> 2) It will not scan PTE tables under non-leaf PMD entries that do not
>    have the accessed bit set, when
>    CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
> 3) It will not zigzag between the PGD table and the same PMD or PTE
>    table spanning multiple VMAs. In other words, it finishes all the
>    VMAs with the range of the same PMD or PTE table before it returns
>    to the PGD table. This optimizes workloads that have large numbers
>    of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.
> 
> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and
> often makes poor choices about what to evict. We would like to offer
> an alternative framework that is performant, versatile and
> straightforward.
> 
> Repo
> ====
> git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/73/1173/1
> 
> Gerrit https://linux-mm-review.googlesource.com/c/page-reclaim/+/1173
> 
> Background
> ==========
> DRAM is a major factor in total cost of ownership, and improving
> memory overcommit brings a high return on investment.

RAM cost on one hand.

On the other, paging backends have seen a revolutionary explosion in
iop/s capacity from solid state devices and CPUs that allow in-memory
compression at scale, so a higher rate of paging (semi-random IO) and
thus larger levels of overcommit are possible than ever before.

There is a lot of new opportunity here.

> Over the past decade of research and experimentation in memory
> overcommit, we observed a distinct trend across millions of servers
> and clients: the size of page cache has been decreasing because of
> the growing popularity of cloud storage. Nowadays anon pages account
> for more than 90% of our memory consumption and page cache contains
> mostly executable pages.

This gives the impression that because the number of setups heavily
using the page cache has reduced somewhat, its significance is waning
as well. I don't think that's true. I think we'll continue to have
mainstream workloads for which the page cache is significant.

Yes, the importance of paging anon memory more efficiently (or paging
it at all again, for that matter), has increased dramatically. But IMO
not because it's more prevalent, but rather because of the increase in
paging capacity from the hardware side. It's not like we've been
heavily paging filesystem data beyond cold starts either when it was
more prevalent - workloads quickly fall apart when you do that on
rotating drives.

So that increase in paging capacity also applies to filesystem data,
and makes local filesystems an option again where they might have been
replaced by anonymous blobs managed by a userspace network filesystem.

Take disaggregated storage for example. It's an attractive measure for
reducing per-host CAPEX when the alternative is a local spindle, whose
seekiness doesn't make the network distance look so bad, and prevents
significant memory overcommit anyway. You have to spec the same RAM in
either case.

The equation is different for flash. You can *significantly* reduce
RAM needs of even latency-sensitive, interactive workloads with cheap,
consumer-grade local SSD drives. Disaggregating those drives and
adding the network to the paging path would directly eat into the much
higher RAM savings. It's a much less attractive proposition now. And
that's bringing larger data sets back to local filesystems.

And of course, even in cloud and disaggregated environments, there ARE
those systems that deal with things like source code trees -
development machines, build hosts etc. For those, filesystem data
continues to be the primary workload.

So while I agree with what you say about anon pages, I don't expect
non-trivial (local) filesystem loads to go away anytime soon. The
kernel needs to continue treating it as a first-class citizen.

> Problems
> ========
> Notion of active/inactive
> -------------------------
> For servers equipped with hundreds of gigabytes of memory, the
> granularity of the active/inactive is too coarse to be useful for job
> scheduling. False active/inactive rates are relatively high, and thus
> the assumed savings may not materialize.

The inactive/active naming is certainly confusing for users of the
system. The kernel uses it to preselect reclaim candidates; it's not
meant to indicate how much memory capacity is idle and available.

But a confusion around naming doesn't necessarily indicate it's bad at
what it is actually designed to do.

Fundamentally, LRU ordering is susceptible to a flood of recent pages
with no reuse pushing out the established frequent pages. The split
into inactive and active is simply there to address this shortcoming,
and protect frequent pages from recent ones - where pages that are
only accessed once get reclaimed before pages used twice or more.
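
A toy model of that protection, for illustration only. It is a
userspace sketch, not the kernel's implementation: promotion happens
on the second access, reclaim always takes the oldest inactive page
first, and the balancing that moves active pages back to the inactive
list is left out.

```python
from collections import OrderedDict


class TwoListLRU:
    """Toy inactive/active LRU: a second access promotes a page."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.inactive = OrderedDict()  # oldest entry first
        self.active = OrderedDict()    # oldest entry first

    def access(self, page):
        if page in self.active:
            self.active.move_to_end(page)
        elif page in self.inactive:
            # Used twice: move to the active list.
            del self.inactive[page]
            self.active[page] = True
        else:
            # First use: start on the inactive list.
            self.inactive[page] = True
            self._reclaim()

    def _reclaim(self):
        while len(self.inactive) + len(self.active) > self.capacity:
            victims = self.inactive if self.inactive else self.active
            victims.popitem(last=False)

    def resident(self, page):
        return page in self.inactive or page in self.active


lru = TwoListLRU(capacity=4)
lru.access("hot")
lru.access("hot")            # second access promotes "hot"
for i in range(100):         # flood of pages that are never reused
    lru.access(f"once-{i}")
```

After the flood, "hot" is still resident while the early one-off pages
are gone; a single-list LRU of the same size would have evicted it.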

Obviously, 'twice or more' is a coarse category, and it's not hard to
imagine that it might go wrong. But please, don't leave it up to the
imagination ;-) It's been in use for two decades or so, it needs a bit
more in-depth analysis of its limitations to justify replacing it.

> For phones and laptops, executable pages are frequently evicted
> despite the fact that there are many less recently used anon pages.
> Major faults on executable pages cause "janks" (slow UI renderings)
> and negatively impact user experience.

This is not because of the inactive/active scheme but rather because
of the anon/file split, which has evolved over the years to just not
swap onto iop-anemic rotational drives.

We ran into the same issue at FB too, where even with painfully
obvious anon candidates and a fast paging backend the kernel would
happily thrash on the page cache instead.

There has been significant work in this area recently to address this
(see commit 5df741963d52506a985b14c4bcd9a25beb9d1981). We've added
extensive testing and production time onto these patches since and
have not found the kernel to be thrashing executables or be reluctant
to go after anonymous pages anymore.

I wonder if your observation takes these recent changes into account?

> For lruvecs from different memcgs or nodes, comparisons are impossible
> due to the lack of a common frame of reference.

My first thought is that this is expected. Workloads running under
different memory constraints, IO priority levels etc. will not have
comparable workingsets: an access frequency that is considered high in
one domain could be considered quite cold in another.

Could you elaborate a bit on the situations where you would want to
compare, and how this is possible by having more generations?

> Solutions
> =========
> Notion of generation numbers
> ----------------------------
> The notion of generation numbers introduces a quantitative approach to
> memory overcommit. A larger number of pages can be spread out across
> a configurable number of generations, and each generation includes all
> pages that have been referenced since the last generation. This
> improved granularity yields relatively low false active/inactive
> rates.
> 
> Given an lruvec, scans of anon and file types and selections between
> them are all based on direct comparisons of generation numbers, which
> are simple and yet effective. For different lruvecs, comparisons are
> still possible based on birth times of generations.

This describes *what* it's doing, but could you elaborate more on how
to think about generations in relation to workload behavior and what
you can predict based on how your workload gets bucketed into these?

If we accept that the current two generations are not enough, how many
should there be instead? Four? Ten?

What determines this? Is it the workload's access pattern? Or the
memory size?

How do I know whether the number of generations I have chosen is right
for my setup? How do I detect when the underlying factors changed and
it no longer is?

How does it manifest if I have too few generations? What about too
many?

What about systems that host a variety of workloads that come and go?
Is there a generation number that will be good for any combination of
workloads on the system as jobs come and go?

For a general purpose OS like Linux, it's nice to be *able* to tune to
your specific requirements, but it's always bad to *have* to. Whatever
we end up doing, there needs to be some reasonable default behavior
that works acceptably for a broad range of workloads out of the box.

> Differential scans via page tables
> ----------------------------------
> Each differential scan discovers all pages that have been referenced
> since the last scan. Specifically, it walks the mm_struct list
> associated with an lruvec to scan page tables of processes that have
> been scheduled since the last scan. The cost of each differential scan
> is roughly proportional to the number of referenced pages it
> discovers. Unless address spaces are extremely sparse, page tables
> usually have better memory locality than the rmap. The end result is
> generally a significant reduction in CPU usage, for workloads using a
> large amount of anon memory.
> 
> Our real-world benchmark that browses popular websites in multiple
> Chrome tabs demonstrates 51% less CPU usage from kswapd and 52% (full)
> less PSI on v5.11. With this patchset, kswapd profile looks like:
>   49.36%  lzo1x_1_do_compress
>    4.54%  page_vma_mapped_walk
>    4.45%  memset_erms
>    3.47%  walk_pte_range
>    2.88%  zram_bvec_rw
> 
> In addition, direct reclaim latency is reduced by 22% at 99th
> percentile and the number of refaults is reduced by 7%. Both metrics
> are important to phones and laptops as they are correlated to user
> experience.

This looks very exciting!

However, this seems to be an improvement completely in its own right:
getting the mapped page access information in a more efficient way.

Is there anything that ties it to the multi-generation LRU that I may
be missing here? Or could it simply be a drop-in replacement for rmap
that gives us the CPU savings right away?

> Framework
> =========
> For each lruvec, evictable pages are divided into multiple
> generations. The youngest generation number is stored in
> lruvec->evictable.max_seq for both anon and file types as they are
> aged on an equal footing. The oldest generation numbers are stored in
> lruvec->evictable.min_seq[2] separately for anon and file types as
> clean file pages can be evicted regardless of may_swap or
> may_writepage. Generation numbers are truncated into
> order_base_2(MAX_NR_GENS+1) bits in order to fit into page->flags. The
> sliding window technique is used to prevent truncated generation
> numbers from overlapping. Each truncated generation number is an index
> to lruvec->evictable.lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES].
> Evictable pages are added to the per-zone lists indexed by max_seq or
> min_seq[2] (modulo MAX_NR_GENS), depending on whether they are being
> faulted in.
> 
> Each generation is then divided into multiple tiers. Tiers represent
> levels of usage from file descriptors only. Pages accessed N times via
> file descriptors belong to tier order_base_2(N). In contrast to moving
> across generations which requires the lru lock, moving across tiers
> only involves an atomic operation on page->flags and therefore has a
> lower cost. A feedback loop modeled after the well-known PID
> controller monitors the refault rates across all tiers and decides
> when to activate pages from which tiers on the reclaim path.
> 
> The framework comprises two conceptually independent components: the
> aging and the eviction, which can be invoked separately from user
> space.

Why from userspace?
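
For readers following along, the generation and tier arithmetic quoted
above can be sketched in plain userspace C. This is only an
illustration under stated assumptions: the value chosen for
SKETCH_MAX_NR_GENS and every sketch_* name are hypothetical, and
sketch_order_base_2() is a stand-in that rounds log2 up, which is my
understanding of what the kernel's order_base_2() computes.

```c
/* Hypothetical value for illustration; the patchset defines its own. */
#define SKETCH_MAX_NR_GENS 4U

/*
 * Stand-in for order_base_2(): smallest order such that (1 << order) >= n.
 * Returns 0 for n == 0 and n == 1. Only meaningful for n up to 2^62.
 */
unsigned int sketch_order_base_2(unsigned long long n)
{
    unsigned int order = 0;

    while (order < 62 && (1ULL << order) < n)
        order++;
    return order;
}

/* Bits needed in page->flags for a truncated generation number. */
unsigned int sketch_gen_flag_bits(void)
{
    return sketch_order_base_2(SKETCH_MAX_NR_GENS + 1);
}

/* A generation number indexes the lists modulo MAX_NR_GENS. */
unsigned int sketch_gen_index(unsigned long long seq)
{
    return (unsigned int)(seq % SKETCH_MAX_NR_GENS);
}

/* Pages accessed N times via file descriptors belong to this tier. */
unsigned int sketch_tier(unsigned long long nr_fd_accesses)
{
    return sketch_order_base_2(nr_fd_accesses);
}
```

With four generations, sequence numbers 4 and 8 share a list index,
which is why the cover letter needs the sliding window to keep live
generations from overlapping.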

> Aging
> -----
> The aging produces young generations. Given an lruvec, the aging scans
> page tables for referenced pages of this lruvec. Upon finding one, the
> aging updates its generation number to max_seq. After each round of
> scan, the aging increments max_seq.
> 
> The aging maintains either a system-wide mm_struct list or per-memcg
> mm_struct lists and tracks whether an mm_struct is being used or has
> been used since the last scan. Multiple threads can concurrently work
> on the same mm_struct list, and each of them will be given a different
> mm_struct belonging to a process that has been scheduled since the
> last scan.
> 
> The aging is due when both of min_seq[2] reaches max_seq-1, assuming
> both anon and file types are reclaimable.

As per above, this is centered around mapped pages, but it really
needs to include a detailed answer for unmapped pages, such as page
cache and shmem/tmpfs data, as well as how sampled page table
references behave wrt realtime syscall references.

> Eviction
> --------
> The eviction consumes old generations. Given an lruvec, the eviction
> scans the pages on the per-zone lists indexed by either of min_seq[2].
> It first tries to select a type based on the values of min_seq[2].
> When anon and file types are both available from the same generation,
> it selects the one that has a lower refault rate.
> 
> During a scan, the eviction sorts pages according to their generation
> numbers, if the aging has found them referenced. It also moves pages
> from the tiers that have higher refault rates than tier 0 to the next
> generation.
> 
> When it finds all the per-zone lists of a selected type are empty, the
> eviction increments min_seq[2] indexed by this selected type.
> 
> Use cases
> =========
> On Android, our most advanced simulation that generates memory
> pressure from realistic user behavior shows 18% fewer low-memory
> kills, which in turn reduces cold starts by 16%.

I assume you refer to pressure-induced lmkd kills rather than
conventional kernel OOM kills?

I.e. multi-gen LRU does a better job of identifying the workingset,
rather than giving up too early.

Again, I would be interested if the baseline here includes the recent
anon/file balancing rework or not.

> On Borg, a similar approach enables us to identify jobs that
> underutilize their memory and downsize them considerably without
> compromising any of our service level indicators.

This is doable with the current reclaim implementation as well. At FB
we drive proactive reclaim through cgroup control, in a feedback loop
with psi metrics.

Obviously, this would benefit from better workingset identification in
the kernel, as more memory could be offloaded under the same pressure
tolerances from the workload, but it's more of an optimization than
enabling a uniquely new usecase.

> On Chrome OS, our field telemetry reports 96% fewer low-memory tab
> discards and 59% fewer OOM kills from fully-utilized devices and no
> regressions in monitored user experience from underutilized devices.

Again, lmkd rather than kernel OOM kills, right? And with or without
the anon/file rework?

> Working set estimation
> ----------------------
> User space can invoke the aging by writing "+ memcg_id node_id gen
> [swappiness]" to /sys/kernel/debug/lru_gen. This debugfs interface
> also provides the birth time and the size of each generation.
> 
> Proactive reclaim
> -----------------
> User space can invoke the eviction by writing "- memcg_id node_id gen
> [swappiness] [nr_to_reclaim]" to /sys/kernel/debug/lru_gen. Multiple
> command lines are supported, so does concatenation with delimiters.

Can you explain a bit more how these two are supposed to be used?
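
Going only by the syntax quoted above, the two command lines could be
composed as in the sketch below. The helper names and the convention
of passing a negative value to omit an optional field are hypothetical;
nothing beyond the quoted "+ ..." and "- ..." formats is confirmed by
the cover letter.

```c
#include <stdio.h>

/*
 * Compose the aging command "+ memcg_id node_id gen [swappiness]".
 * A negative swappiness omits the optional field. Returns what
 * snprintf() returns: the string length, excluding the trailing NUL.
 */
int lru_gen_format_aging(char *buf, size_t len, unsigned int memcg_id,
                         unsigned int node_id, unsigned long gen,
                         int swappiness)
{
    if (swappiness < 0)
        return snprintf(buf, len, "+ %u %u %lu", memcg_id, node_id, gen);
    return snprintf(buf, len, "+ %u %u %lu %d",
                    memcg_id, node_id, gen, swappiness);
}

/*
 * Compose the eviction command
 * "- memcg_id node_id gen [swappiness] [nr_to_reclaim]".
 * The optional fields are positional, so nr_to_reclaim can only be
 * given together with swappiness; negative values omit them.
 */
int lru_gen_format_eviction(char *buf, size_t len, unsigned int memcg_id,
                            unsigned int node_id, unsigned long gen,
                            int swappiness, long nr_to_reclaim)
{
    if (swappiness < 0)
        return snprintf(buf, len, "- %u %u %lu", memcg_id, node_id, gen);
    if (nr_to_reclaim < 0)
        return snprintf(buf, len, "- %u %u %lu %d",
                        memcg_id, node_id, gen, swappiness);
    return snprintf(buf, len, "- %u %u %lu %d %ld",
                    memcg_id, node_id, gen, swappiness, nr_to_reclaim);
}
```

The resulting string would then be written to
/sys/kernel/debug/lru_gen, which presumably requires debugfs to be
mounted and sufficient privileges.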

The memcg id is self-explanatory: Age or evict pages from this
particular workload.

The node is a bit less intuitive. In most setups, the distance to a
remote NUMA node is much smaller than the distance to the storage
backend, and users would prefer finding and evicting the coldest
memory between multiple nodes, not within an individual node.

Swappiness raises a similar question. Why would the user prefer one
type of data to be reclaimed over the other? Shouldn't it want to
reclaim the pages that are least likely to be used again soon?

> FAQ
> ===
> Why not try to improve the existing code?
> -----------------------------------------
> We have tried but concluded the aforementioned problems are
> fundamental, and therefore changes made on top of them will not result
> in substantial gains.

Realistically, I think incremental changes are unavoidable to get this
merged upstream.

Not just in the sense that they need to be smaller changes, but also
in the sense that they need to replace old code. It would be
impossible to maintain both, focus development and testing resources,
and provide a reasonably stable experience with both systems tugging
at a complicated shared code base.

On the other hand, the existing code also has billions of hours of
production testing and tuning. We can't throw this all out overnight -
it needs to be surgical and the broader consequences of each step need
to be well understood.

We also have millions of servers relying on being able to do upgrades
for drivers and fixes in other subsystems that we can't put on hold
until we have stabilized a new reclaim implementation from scratch.

The good thing is that swap really hasn't been used much
recently. There definitely is room to maneuver without being too
disruptive. There *are* swap configurations today, but for the most
part, users don't expect the kernel to swap until the machine is under
heavy pressure. Few have expectations of it doing a nuanced and
efficient memory offloading job under nominal loads. So the anon side
could well be a testbed for the multigen LRU that has a more
reasonable blast radius than doing everything at once.

And if the rmap replacement for mapped pages could be split out as a
CPU optimization for getting MMU info, without changing how those are
interpreted in the same step, I think we'd get into a more manageable
territory with this proposal.

Thanks!
Johannes

Yu Zhao April 14, 2021, 6:45 p.m. UTC | #21

On Wed, Apr 14, 2021 at 7:52 AM Rik van Riel <riel@surriel.com> wrote:
>
> On Wed, 2021-04-14 at 16:27 +0800, Huang, Ying wrote:
> > Yu Zhao <yuzhao@google.com> writes:
> >
> > > On Wed, Apr 14, 2021 at 12:15 AM Huang, Ying <ying.huang@intel.com>
> > > wrote:
> > > >
> > > NUMA Optimization
> > > -----------------
> > > Support NUMA policies and per-node RSS counters.
> > >
> > > We only can move forward one step at a time. Fair?
> >
> > You don't need to implement that now definitely.  But we can discuss
> > the
> > possible solution now.
>
> That was my intention, too. I want to make sure we don't
> end up "painting ourselves into a corner" by moving in some
> direction we have no way to get out of.
>
> The patch set looks promising, but we need some plan to
> avoid the worst case behaviors that forced us into rmap
> based scanning initially.

Hi Rik,

By design, we voluntarily fall back to the rmap when page tables of a
process are too sparse. At the moment, we have

bool should_skip_mm()
{
    ...
    /* leave the legwork to the rmap if mapped pages are too sparse */
    if (RSS < mm_pgtables_bytes(mm) / PAGE_SIZE)
        return true;
    ....
}

So yes, I agree we have more work to do in this direction, the
fallback should be per VMA and NUMA aware. Note that once the fallback
happens, it shares the same path with the existing implementation.
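
To make the sparsity check above concrete, here is a minimal userspace
sketch of the same comparison. It assumes 4 KiB pages and takes the
resident set size and page table footprint as plain numbers; the
function and macro names are hypothetical, and the real
should_skip_mm() contains other checks that are elided above.

```c
#include <stdbool.h>

/* Assumption for illustration: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Mirrors "RSS < mm_pgtables_bytes(mm) / PAGE_SIZE": if a process has
 * fewer resident pages than pages of page tables, walking its page
 * tables would visit more table pages than it could find mapped pages,
 * so the legwork is left to the rmap.
 */
bool sketch_too_sparse(unsigned long rss_pages, unsigned long pgtables_bytes)
{
    return rss_pages < pgtables_bytes / SKETCH_PAGE_SIZE;
}
```

For example, a process with 1000 resident pages spread over 1000 PTE
tables plus a few upper-level tables (say 1003 table pages in total)
would be skipped, while one with 512000 resident pages behind the same
tables would be scanned via page tables.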

Probably I should have clarified that this patchset does not replace
the rmap with page table scanning. It conditionally uses page table
scanning when it thinks most of the pages on a system could have been
referenced, i.e., when it thinks walking the rmap would be less
efficient, based on generations.

It *unconditionally* walks the rmap to scan each of the pages it
eventually tries to evict, because scanning page tables for a small
batch of pages it wants to evict is too costly.

One of the simple ways to look at how the mixture of page table
scanning and the rmap works is:
  1) it scans page tables (but might fall back to the rmap) to
     deactivate pages from the active list to the inactive list, when
     the inactive list becomes empty
  2) it walks the rmap (not page table scanning) when it evicts
     individual pages from the inactive list.
Does it make sense?

I fully agree "the mixture" is currently statistically decided, and it
must be made worst-case scenario proof.

> > Note that it's possible that only some processes are bound to some
> > NUMA
> > nodes, while other processes aren't bound.
>
> For workloads like PostgresQL or Oracle, it is common
> to have maybe 70% of memory in a large shared memory
> segment, spread between all the NUMA nodes, and mapped
> into hundreds, if not thousands, of processes in the
> system.

I do plan to reach out to the PostgreSQL community and ask for help to
benchmark this patchset. Will keep everybody posted.

> Now imagine we have an 8 node system, and memory
> pressure in the DMA32 zone of node 0.
>
> How will the current VM behave?

At the moment, we don't plan to make the DMA32 zone reclaim a
priority. Rather, I'd suggest
  1) stay with the existing implementation
  2) boost the watermark for DMA32

> What will the virtual scanning need to do?

The high priority items are:

To-do List
==========
KVM Optimization
----------------
Support shadow page table scanning.

NUMA Optimization
-----------------
Support NUMA policies and per-node RSS counters.

We are just trying to focus our resources on the trending use cases. Reasonable?

> If we can come up with a solution to make virtual
> scanning scale for that kind of workload, great.

It won't be easy, but IMO nothing worth doing is easy :)

> If not ... if it turns out most of the benefits of
> the multigenerational LRU framework come from sorting
> the pages into multiple LRUs, and from being able
> to easily reclaim unmapped pages before having to
> scan mapped ones, could it be an idea to implement
> that first, independently from virtual scanning?

This option is on the table considering the possibilities
  1) there are unforeseeable problems we couldn't solve
  2) sorting pages alone has demonstrated its standalone value

I guess 2) alone will help people heavily using page cache. Google
isn't one of them though. Personally I'm neutral (at least trying to
be), and my goal is to accommodate everybody as best as I can.

> I am all for improving our page reclaim system, I just
> want to make sure we don't revisit the old traps that
> forced us where we are today :)

Yeah, I do see your concerns and we need more data. Any suggestions on
benchmarks you'd be interested in?

Thanks.

Yu Zhao April 14, 2021, 7:04 p.m. UTC | #22

On Wed, Apr 14, 2021 at 9:51 AM Andi Kleen <ak@linux.intel.com> wrote:
>
> >    2) It will not scan PTE tables under non-leaf PMD entries that do not
> >       have the accessed bit set, when
> >       CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
>
> This assumes  that workloads have reasonable locality. Could there
> be a worst case where only one or two pages in each PTE are used,
> so this PTE skipping trick doesn't work?

Hi Andi,

Yes, it does make that assumption. And yes, there could. AFAIK, only
x86 supports this.

I wrote a crude test to verify this, and it maps exactly one page
within each PTE table. And I found page table scanning didn't
underperform the rmap:

https://lore.kernel.org/linux-mm/YHFuL%2FDdtiml4biw@google.com/#t

The reason (sorry for repeating this) is page table scanning is conditional:

bool should_skip_mm()
{
    ...
    /* leave the legwork to the rmap if mapped pages are too sparse */
    if (RSS < mm_pgtables_bytes(mm) / PAGE_SIZE)
        return true;
    ....
}

We fall back to the rmap when scanning page tables is obviously not
the smart choice. There is still a lot of room for improvement in this
function though, i.e., it should be per VMA and NUMA aware.

Note that page table scanning doesn't replace the existing rmap scan.
It's complementary, and it happens when there is a good chance that
most of the pages on a system under pressure have been referenced.
IOW, scanning them one by one with the rmap would cost more than
scanning them all at once via page tables.

Sounds reasonable?

Thanks.

Yu Zhao April 14, 2021, 7:14 p.m. UTC | #23

On Wed, Apr 14, 2021 at 9:59 AM Rik van Riel <riel@surriel.com> wrote:
>
> On Wed, 2021-04-14 at 08:51 -0700, Andi Kleen wrote:
> > >    2) It will not scan PTE tables under non-leaf PMD entries that
> > > do not
> > >       have the accessed bit set, when
> > >       CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
> >
> > This assumes  that workloads have reasonable locality. Could there
> > be a worst case where only one or two pages in each PTE are used,
> > so this PTE skipping trick doesn't work?
>
> Databases with large shared memory segments shared between
> many processes come to mind as a real-world example of a
> worst case scenario.

Well, I don't think you two are talking about the same thing. Andi was
focusing on sparsity. Your example seems to be about sharing, i.e.,
high mapcount. Of course both can happen at the same time, as I tested
here:
https://lore.kernel.org/linux-mm/YHFuL%2FDdtiml4biw@google.com/#t

I'm skeptical that shared memory used by databases is that sparse,
i.e., one page per PTE table, because the extremely low locality would
heavily penalize their performance. But my knowledge in databases is
close to zero. So feel free to enlighten me or just ignore what I
said.

Rik van Riel April 14, 2021, 7:41 p.m. UTC | #24

On Wed, 2021-04-14 at 13:14 -0600, Yu Zhao wrote:
> On Wed, Apr 14, 2021 at 9:59 AM Rik van Riel <riel@surriel.com>
> wrote:
> > On Wed, 2021-04-14 at 08:51 -0700, Andi Kleen wrote:
> > > >    2) It will not scan PTE tables under non-leaf PMD entries
> > > > that
> > > > do not
> > > >       have the accessed bit set, when
> > > >       CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
> > > 
> > > This assumes  that workloads have reasonable locality. Could
> > > there
> > > be a worst case where only one or two pages in each PTE are used,
> > > so this PTE skipping trick doesn't work?
> > 
> > Databases with large shared memory segments shared between
> > many processes come to mind as a real-world example of a
> > worst case scenario.
> 
> Well, I don't think you two are talking about the same thing. Andi
> was
> focusing on sparsity. Your example seems to be about sharing, i.e.,
> high mapcount. Of course both can happen at the same time, as I
> tested
> here:
> https://lore.kernel.org/linux-mm/YHFuL%2FDdtiml4biw@google.com/#t
> 
> I'm skeptical that shared memory used by databases is that sparse,
> i.e., one page per PTE table, because the extremely low locality
> would
> heavily penalize their performance. But my knowledge in databases is
> close to zero. So feel free to enlighten me or just ignore what I
> said.

A database may have a 200GB shared memory segment,
and a worker task that gets spun up to handle a
query might access only 1MB of memory to answer
that query.

That memory could be from anywhere inside the
shared memory segment. Maybe some of the accesses
are more dense, and others more sparse, who knows?

A lot of the locality will depend on how memory space
inside the shared memory segment is reclaimed and
recycled inside the database.
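
To put rough numbers on that, here is a back-of-the-envelope sketch.
It assumes 4 KiB pages and 512 entries per PTE table (as on x86-64),
counts only leaf PTEs, and assumes a PTE table is scanned whenever at
least one page under it was accessed. All names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_PAGE_BYTES     4096ULL  /* assumed page size */
#define SKETCH_PTES_PER_TABLE 512ULL   /* assumed entries per PTE table */

/* Number of pages needed to hold the accessed bytes. */
uint64_t sketch_pages_touched(uint64_t accessed_bytes)
{
    return (accessed_bytes + SKETCH_PAGE_BYTES - 1) / SKETCH_PAGE_BYTES;
}

/* Best case: the accessed pages are packed into as few tables as possible. */
uint64_t sketch_ptes_scanned_best(uint64_t accessed_bytes)
{
    uint64_t pages = sketch_pages_touched(accessed_bytes);
    uint64_t tables = (pages + SKETCH_PTES_PER_TABLE - 1) / SKETCH_PTES_PER_TABLE;

    return tables * SKETCH_PTES_PER_TABLE;
}

/* Worst case: every accessed page sits under a different PTE table. */
uint64_t sketch_ptes_scanned_worst(uint64_t accessed_bytes, uint64_t segment_bytes)
{
    uint64_t table_span = SKETCH_PAGE_BYTES * SKETCH_PTES_PER_TABLE;
    uint64_t tables_in_segment = (segment_bytes + table_span - 1) / table_span;
    uint64_t pages = sketch_pages_touched(accessed_bytes);
    uint64_t tables = pages < tables_in_segment ? pages : tables_in_segment;

    return tables * SKETCH_PTES_PER_TABLE;
}
```

Under these assumptions, 1MB of accesses is 256 pages. Fully dense,
that is one PTE table, or 512 entries to look at. Fully sparse within
a 200GB segment, it is 256 tables, or 131072 entries scanned to find
256 referenced pages.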

Yu Zhao April 14, 2021, 7:42 p.m. UTC | #25

On Wed, Apr 14, 2021 at 8:43 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 4/13/21 5:14 PM, Dave Chinner wrote:
> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> >> On 4/13/21 1:51 AM, SeongJae Park wrote:
> >>> From: SeongJae Park <sjpark@amazon.de>
> >>>
> >>> Hello,
> >>>
> >>>
> >>> Very interesting work, thank you for sharing this :)
> >>>
> >>> On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao <yuzhao@google.com> wrote:
> >>>
> >>>> What's new in v2
> >>>> ================
> >>>> Special thanks to Jens Axboe for reporting a regression in buffered
> >>>> I/O and helping test the fix.
> >>>
> >>> Is the discussion open?  If so, could you please give me a link?
> >>
> >> I wasn't on the initial post (or any of the lists it was posted to), but
> >> it's on the google page reclaim list. Not sure if that is public or not.
> >>
> >> tldr is that I was pretty excited about this work, as buffered IO tends
> >> to suck (a lot) for high throughput applications. My test case was
> >> pretty simple:
> >>
> >> Randomly read a fast device, using 4k buffered IO, and watch what
> >> happens when the page cache gets filled up. For this particular test,
> >> we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> >> with kswapd using a lot of CPU trying to keep up. That's mainline
> >> behavior.
> >
> > I see this exact same behaviour here, too, but I RCA'd it to
> > contention between the inode and memory reclaim for the mapping
> > structure that indexes the page cache. Basically the mapping tree
> > lock is the contention point here - you can either be adding pages
> > to the mapping during IO, or memory reclaim can be removing pages
> > from the mapping, but we can't do both at once.
> >
> > So we end up with kswapd spinning on the mapping tree lock like so
> > when doing 1.6GB/s in 4kB buffered IO:
> >
> > -   20.06%     0.00%  [kernel]               [k] kswapd
> >    - 20.06% kswapd
> >       - 20.05% balance_pgdat
> >          - 20.03% shrink_node
> >             - 19.92% shrink_lruvec
> >                - 19.91% shrink_inactive_list
> >                   - 19.22% shrink_page_list
> >                      - 17.51% __remove_mapping
> >                         - 14.16% _raw_spin_lock_irqsave
> >                            - 14.14% do_raw_spin_lock
> >                                 __pv_queued_spin_lock_slowpath
> >                         - 1.56% __delete_from_page_cache
> >                              0.63% xas_store
> >                         - 0.78% _raw_spin_unlock_irqrestore
> >                            - 0.69% do_raw_spin_unlock
> >                                 __raw_callee_save___pv_queued_spin_unlock
> >                      - 0.82% free_unref_page_list
> >                         - 0.72% free_unref_page_commit
> >                              0.57% free_pcppages_bulk
> >
> > And these are the processes consuming CPU:
> >
> >    5171 root      20   0 1442496   5696   1284 R  99.7   0.0   1:07.78 fio
> >    1150 root      20   0       0      0      0 S  47.4   0.0   0:22.70 kswapd1
> >    1146 root      20   0       0      0      0 S  44.0   0.0   0:21.85 kswapd0
> >    1152 root      20   0       0      0      0 S  39.7   0.0   0:18.28 kswapd3
> >    1151 root      20   0       0      0      0 S  15.2   0.0   0:12.14 kswapd2
>
> Here's my profile when memory reclaim is active for the above mentioned
> test case. This is a single node system, so just kswapd. It's using around
> 40-45% CPU:
>
>     43.69%  kswapd0  [kernel.vmlinux]  [k] xas_create
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                shrink_node
>                shrink_lruvec
>                shrink_inactive_list
>                shrink_page_list
>                __delete_from_page_cache
>                xas_store
>                xas_create
>
>     16.88%  kswapd0  [kernel.vmlinux]  [k] queued_spin_lock_slowpath
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                shrink_node
>                shrink_lruvec
>                |
>                 --16.82%--shrink_inactive_list
>                           |
>                            --16.55%--shrink_page_list
>                                      |
>                                       --16.26%--_raw_spin_lock_irqsave
>                                                 queued_spin_lock_slowpath
>
>      9.89%  kswapd0  [kernel.vmlinux]  [k] shrink_page_list
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                shrink_node
>                shrink_lruvec
>                shrink_inactive_list
>                shrink_page_list
>
>      5.46%  kswapd0  [kernel.vmlinux]  [k] xas_init_marks
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                shrink_node
>                shrink_lruvec
>                shrink_inactive_list
>                shrink_page_list
>                |
>                 --5.41%--__delete_from_page_cache
>                           xas_init_marks
>
>      4.42%  kswapd0  [kernel.vmlinux]  [k] __delete_from_page_cache
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                shrink_node
>                shrink_lruvec
>                shrink_inactive_list
>                |
>                 --4.40%--shrink_page_list
>                           __delete_from_page_cache
>
>      2.82%  kswapd0  [kernel.vmlinux]  [k] isolate_lru_pages
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                shrink_node
>                shrink_lruvec
>                |
>                |--1.43%--shrink_active_list
>                |          isolate_lru_pages
>                |
>                 --1.39%--shrink_inactive_list
>                           isolate_lru_pages
>
>      1.99%  kswapd0  [kernel.vmlinux]  [k] free_pcppages_bulk
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                shrink_node
>                shrink_lruvec
>                shrink_inactive_list
>                shrink_page_list
>                free_unref_page_list
>                free_unref_page_commit
>                free_pcppages_bulk
>
>      1.79%  kswapd0  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                |
>                 --1.76%--shrink_node
>                           shrink_lruvec
>                           shrink_inactive_list
>                           |
>                            --1.72%--shrink_page_list
>                                      _raw_spin_lock_irqsave
>
>      1.02%  kswapd0  [kernel.vmlinux]  [k] workingset_eviction
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                shrink_node
>                shrink_lruvec
>                shrink_inactive_list
>                |
>                 --1.00%--shrink_page_list
>                           workingset_eviction
>
> > i.e. when memory reclaim kicks in, the read process has 20% less
> > time with exclusive access to the mapping tree to insert new pages.
> > Hence buffered read performance goes down quite substantially when
> > memory reclaim kicks in, and this really has nothing to do with the
> > memory reclaim LRU scanning algorithm.
> >
> > I can actually get this machine to pin those 5 processes to 100% CPU
> > under certain conditions. Each process is spinning all that extra
> > time on the mapping tree lock, and performance degrades further.
> > Changing the LRU reclaim algorithm won't fix this - the workload is
> > solidly bound by the exclusive nature of the mapping tree lock and
> > the number of tasks trying to obtain it exclusively...
>
> I've seen way worse than the above as well, it's just my go-to easy test
> case for "man I wish buffered IO didn't suck so much".
>
> >> The initial posting of this patchset did no better, in fact it did a bit
> >> worse. Performance dropped to the same levels and kswapd was using as
> >> much CPU as before, but on top of that we also got excessive swapping.
> >> Not at a high rate, but 5-10MB/sec continually.
> >>
> >> I had some back and forths with Yu Zhao and tested a few new revisions,
> >> and the current series does much better in this regard. Performance
> >> still dips a bit when page cache fills, but not nearly as much, and
> >> kswapd is using less CPU than before.
> >
> > Profiles would be interesting, because it sounds to me like reclaim
> > *might* be batching page cache removal better (e.g. fewer, larger
> > batches) and so spending less time contending on the mapping tree
> > lock...
> >
> > IOWs, I suspect this result might actually be a result of less lock
> > contention due to a change in batch processing characteristics of
> > the new algorithm rather than it being a "better" algorithm...
>
> See above - let me know if you want to see more specific profiling as
> well.

Hi Jens,

Thanks for the profiles.

Does the code path I've demonstrated seem clear to you?

Recap:

When randomly accessing a (not infinitely) large file long enough,
some blocks are bound to be accessed multiple times. In the buffered
io access path, mark_page_accessed() activates them, i.e., moving them
to the active list. Once memory is filled and kswapd starts
reclaiming, shrink_active_list() deactivates them, i.e., moving them
back to the inactive list. Both take the lru lock to add/remove pages
to/from the active/inactive lists.

IOW, pages accessed multiple times bounce between the active and the
inactive lists when random accesses put a system under memory
pressure. For random accesses, pages accessed multiple times are not
different from those accessed once, in terms of page reclaim.
(Statistically speaking, they would be less unlikely to be used
again.)
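
The claim that repeat hits carry no signal under random access can be checked with a toy simulation. This is a sketch under stated assumptions, not a model of the kernel: accesses are uniformly random, and `reuse_rates`, its block count, and its phase lengths are all made up for illustration. It compares how often blocks touched once and blocks touched several times during a warm-up phase get touched again later.

```python
import random

def reuse_rates(n_blocks=20_000, warmup=20_000, probe=20_000, seed=0):
    """Return (rate_once, rate_multi): the fraction of blocks that are hit
    again in the probe phase, split by how often they were hit in warm-up."""
    rng = random.Random(seed)

    # Warm-up phase: count uniformly random accesses per block.
    counts = [0] * n_blocks
    for _ in range(warmup):
        counts[rng.randrange(n_blocks)] += 1

    # Probe phase: remember which blocks are accessed at least once.
    probed = {rng.randrange(n_blocks) for _ in range(probe)}

    once = [b for b in range(n_blocks) if counts[b] == 1]
    multi = [b for b in range(n_blocks) if counts[b] >= 2]

    def rate(group):
        return sum(b in probed for b in group) / len(group)

    return rate(once), rate(multi)

rate_once, rate_multi = reuse_rates()
print(f"hit again after 1 access:   {rate_once:.3f}")
print(f"hit again after 2+ accesses: {rate_multi:.3f}")
```

With uniform access the two rates should come out close to each other, which is why activating the multiply-accessed pages buys nothing here and only adds lru lock traffic.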

I'd be happy to give it another try if there is anything unclear.

Thanks.

Yu Zhao April 14, 2021, 8:08 p.m. UTC | #26

On Wed, Apr 14, 2021 at 1:42 PM Rik van Riel <riel@surriel.com> wrote:
>
> On Wed, 2021-04-14 at 13:14 -0600, Yu Zhao wrote:
> > On Wed, Apr 14, 2021 at 9:59 AM Rik van Riel <riel@surriel.com>
> > wrote:
> > > On Wed, 2021-04-14 at 08:51 -0700, Andi Kleen wrote:
> > > > >    2) It will not scan PTE tables under non-leaf PMD entries
> > > > >       that do not have the accessed bit set, when
> > > > >       CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
> > > >
> > > > This assumes that workloads have reasonable locality. Could there
> > > > be a worst case where only one or two pages in each PTE are used,
> > > > so this PTE skipping trick doesn't work?
> > >
> > > Databases with large shared memory segments shared between
> > > many processes come to mind as a real-world example of a
> > > worst case scenario.
> >
> > Well, I don't think you two are talking about the same thing. Andi
> > was focusing on sparsity. Your example seems to be about sharing,
> > i.e., high mapcount. Of course both can happen at the same time, as
> > I tested here:
> > https://lore.kernel.org/linux-mm/YHFuL%2FDdtiml4biw@google.com/#t
> >
> > I'm skeptical that shared memory used by databases is that sparse,
> > i.e., one page per PTE table, because the extremely low locality
> > would heavily penalize their performance. But my knowledge in
> > databases is close to zero. So feel free to enlighten me or just
> > ignore what I said.
>
> A database may have a 200GB shared memory segment,
> and a worker task that gets spun up to handle a
> query might access only 1MB of memory to answer
> that query.
>
> That memory could be from anywhere inside the
> shared memory segment. Maybe some of the accesses
> are more dense, and others more sparse, who knows?
>
> A lot of the locality will depend on how memory space inside the
> shared memory segment is reclaimed and recycled inside the database.

Thanks. Yeah, I guess we'll just need to see more benchmarks from the
database realm. Stay tuned :)
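
For a rough sense of scale, the sparse case in Rik's example can be worked through numerically. The sketch below assumes 4 KiB pages and 512 entries per PTE table (the usual x86-64 layout), and assumes the 1MB touched by a worker is scattered uniformly at random across the 200GB segment. None of these assumptions comes from a measured database workload.

```python
PAGE_SIZE = 4 * 1024                     # assumption: 4 KiB base pages
PTES_PER_TABLE = 512                     # assumption: x86-64 PTE table
TABLE_SPAN = PAGE_SIZE * PTES_PER_TABLE  # 2 MiB covered by one PTE table

segment_bytes = 200 * 1024**3            # 200GB shared memory segment
touched_bytes = 1 * 1024**2              # 1MB accessed by one query

pte_tables = segment_bytes // TABLE_SPAN
touched_pages = touched_bytes // PAGE_SIZE

# Expected number of distinct PTE tables hit if the touched pages are
# scattered uniformly at random over the whole segment.
expected_tables = pte_tables * (1 - (1 - 1 / pte_tables) ** touched_pages)
pages_per_table = touched_pages / expected_tables

print(pte_tables, touched_pages, round(expected_tables, 2))
```

Under these assumptions almost every touched page lands in its own PTE table, which is the one-page-per-table worst case. If the same 1MB were contiguous it would fit in one or two PTE tables instead, so the answer depends entirely on how the database lays out and recycles its memory.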

Dave Chinner April 15, 2021, 1:21 a.m. UTC | #27

On Wed, Apr 14, 2021 at 08:43:36AM -0600, Jens Axboe wrote:
> On 4/13/21 5:14 PM, Dave Chinner wrote:
> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> >> On 4/13/21 1:51 AM, SeongJae Park wrote:
> >>> From: SeongJae Park <sjpark@amazon.de>
> >>>
> >>> Hello,
> >>>
> >>>
> >>> Very interesting work, thank you for sharing this :)
> >>>
> >>> On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao <yuzhao@google.com> wrote:
> >>>
> >>>> What's new in v2
> >>>> ================
> >>>> Special thanks to Jens Axboe for reporting a regression in buffered
> >>>> I/O and helping test the fix.
> >>>
> >>> Is the discussion open?  If so, could you please give me a link?
> >>
> >> I wasn't on the initial post (or any of the lists it was posted to), but
> >> it's on the google page reclaim list. Not sure if that is public or not.
> >>
> >> tldr is that I was pretty excited about this work, as buffered IO tends
> >> to suck (a lot) for high throughput applications. My test case was
> >> pretty simple:
> >>
> >> Randomly read a fast device, using 4k buffered IO, and watch what
> >> happens when the page cache gets filled up. For this particular test,
> >> we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> >> with kswapd using a lot of CPU trying to keep up. That's mainline
> >> behavior.
> > 
> > I see this exact same behaviour here, too, but I RCA'd it to
> > contention between the inode and memory reclaim for the mapping
> > structure that indexes the page cache. Basically the mapping tree
> > lock is the contention point here - you can either be adding pages
> > to the mapping during IO, or memory reclaim can be removing pages
> > from the mapping, but we can't do both at once.
> > 
> > So we end up with kswapd spinning on the mapping tree lock like so
> > when doing 1.6GB/s in 4kB buffered IO:
> > 
> > -   20.06%     0.00%  [kernel]               [k] kswapd
> >    - 20.06% kswapd
> >       - 20.05% balance_pgdat
> >          - 20.03% shrink_node
> >             - 19.92% shrink_lruvec
> >                - 19.91% shrink_inactive_list
> >                   - 19.22% shrink_page_list
> >                      - 17.51% __remove_mapping
> >                         - 14.16% _raw_spin_lock_irqsave
> >                            - 14.14% do_raw_spin_lock
> >                                 __pv_queued_spin_lock_slowpath
> >                         - 1.56% __delete_from_page_cache
> >                              0.63% xas_store
> >                         - 0.78% _raw_spin_unlock_irqrestore
> >                            - 0.69% do_raw_spin_unlock
> >                                 __raw_callee_save___pv_queued_spin_unlock
> >                      - 0.82% free_unref_page_list
> >                         - 0.72% free_unref_page_commit
> >                              0.57% free_pcppages_bulk
> > 
> > And these are the processes consuming CPU:
> > 
> >    5171 root      20   0 1442496   5696   1284 R  99.7   0.0   1:07.78 fio
> >    1150 root      20   0       0      0      0 S  47.4   0.0   0:22.70 kswapd1
> >    1146 root      20   0       0      0      0 S  44.0   0.0   0:21.85 kswapd0
> >    1152 root      20   0       0      0      0 S  39.7   0.0   0:18.28 kswapd3
> >    1151 root      20   0       0      0      0 S  15.2   0.0   0:12.14 kswapd2
> 
> Here's my profile when memory reclaim is active for the above mentioned
> test case. This is a single node system, so just kswapd. It's using around
> 40-45% CPU:
> 
>     43.69%  kswapd0  [kernel.vmlinux]  [k] xas_create
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                shrink_node
>                shrink_lruvec
>                shrink_inactive_list
>                shrink_page_list
>                __delete_from_page_cache
>                xas_store
>                xas_create
> 
>     16.88%  kswapd0  [kernel.vmlinux]  [k] queued_spin_lock_slowpath
>             |
>             ---ret_from_fork
>                kthread
>                kswapd
>                balance_pgdat
>                shrink_node
>                shrink_lruvec
>                |          
>                 --16.82%--shrink_inactive_list
>                           |          
>                            --16.55%--shrink_page_list
>                                      |          
>                                       --16.26%--_raw_spin_lock_irqsave
>                                                 queued_spin_lock_slowpath

Yeah, so it largely ends up in the same place, with the spinlock
contention dominating the CPU usage and efficiency of memory
reclaim.

> > i.e. when memory reclaim kicks in, the read process has 20% less
> > time with exclusive access to the mapping tree to insert new pages.
> > Hence buffered read performance goes down quite substantially when
> > memory reclaim kicks in, and this really has nothing to do with the
> > memory reclaim LRU scanning algorithm.
> > 
> > I can actually get this machine to pin those 5 processes to 100% CPU
> > under certain conditions. Each process is spinning all that extra
> > time on the mapping tree lock, and performance degrades further.
> > Changing the LRU reclaim algorithm won't fix this - the workload is
> > solidly bound by the exclusive nature of the mapping tree lock and
> > the number of tasks trying to obtain it exclusively...
> 
> I've seen way worse than the above as well, it's just my go-to easy test
> case for "man I wish buffered IO didn't suck so much".

*nod*

> >> The initial posting of this patchset did no better, in fact it did a bit
> >> worse. Performance dropped to the same levels and kswapd was using as
> >> much CPU as before, but on top of that we also got excessive swapping.
> >> Not at a high rate, but 5-10MB/sec continually.
> >>
> >> I had some back and forths with Yu Zhao and tested a few new revisions,
> >> and the current series does much better in this regard. Performance
> >> still dips a bit when page cache fills, but not nearly as much, and
> >> kswapd is using less CPU than before.
> > 
> > Profiles would be interesting, because it sounds to me like reclaim
> > *might* be batching page cache removal better (e.g. fewer, larger
> > batches) and so spending less time contending on the mapping tree
> > lock...
> > 
> > IOWs, I suspect this result might actually be a result of less lock
> > contention due to a change in batch processing characteristics of
> > the new algorithm rather than it being a "better" algorithm...
> 
> See above - let me know if you want to see more specific profiling as
> well.

I don't think that profiles are going to give us the level of detail
required to determine how this algorithm is improving performance.
That would require careful instrumentation of the memory reclaim
algorithms to demonstrate any significant change in behaviour, and
then proof that it's a predictable, consistent improvement across
all types of machines rather than just being a freak of interactions
with a specific workload on specific hardware.

When it comes to lock contention like this, you can't infer anything
about external algorithm changes because better algorithms often
make contention worse because the locks are hit harder and so
performance goes the wrong way. Similarly, if the external algorithm
change takes more time to do something because it is less efficient,
then locks are hit less hard, so they contend less, and performance
goes up.

In my experience, an external change that causes a small reduction in
lock contention and an increase in throughput through a heavily
contended path is often a sign that something is slower or behaving
worse, not better. The only way to determine if the external change is
any good is to first fix the lock contention problem, then do back to
back testing of the change.

Hence I'd be very hesitant to use this test in any way as a measure
of whether the multi-gen LRU is any better for this workload or
not...

Cheers,

Dave.

Dave Chinner April 15, 2021, 1:36 a.m. UTC | #28

On Wed, Apr 14, 2021 at 01:16:52AM -0600, Yu Zhao wrote:
> On Tue, Apr 13, 2021 at 10:50 PM Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote:
> > > On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > Profiles would be interesting, because it sounds to me like reclaim
> > > > *might* be batching page cache removal better (e.g. fewer, larger
> > > > batches) and so spending less time contending on the mapping tree
> > > > lock...
> > > >
> > > > IOWs, I suspect this result might actually be a result of less lock
> > > > contention due to a change in batch processing characteristics of
> > > > the new algorithm rather than it being a "better" algorithm...
> > >
> > > I appreciate the profile. But there is no batching in
> > > __remove_mapping() -- it locks the mapping for each page, and
> > > therefore the lock contention penalizes the mainline and this patchset
> > > equally. It looks worse on your system because the four kswapd threads
> > > from different nodes were working on the same file.
> >
> > I think you misunderstand exactly what I mean by "batching" here.
> > I'm not talking about doing multiple pieces of work under a single
> > lock. What I mean is that the overall amount of work done in a
> > single reclaim scan (i.e a "reclaim batch") is packaged differently.
> >
> > We already batch up page reclaim via building a page list and then
> > passing it to shrink_page_list() to process the batch of pages in a
> > single pass. Each page in this page list batch then calls
> > remove_mapping() to pull the page from the LRU, so we have a run of
> > contention between the foreground read() thread and the background
> > kswapd.
> >
> > If the size or nature of the pages in the batch passed to
> > shrink_page_list() changes, then the amount of time a reclaim batch
> > is going to put pressure on the mapping tree lock will also change.
> > That's the "change in batching behaviour" I'm referring to here. I
> > haven't read through the patchset to determine if you change the
> > shrink_page_list() algorithm, but it likely changes what is passed
> > to be reclaimed and that in turn changes the locking patterns that
> > fall out of shrink_page_list...
> 
> Ok, if we are talking about the size of the batch passed to
> shrink_page_list(), both the mainline and this patchset cap it at
> SWAP_CLUSTER_MAX, which is 32. There are corner cases, but when
> running fio/io_uring, it's safe to say both use 32.

You're still looking at micro-scale behaviour, not the larger-scale
batching effects. Are we passing SWAP_CLUSTER_MAX groups of pages to
shrink_page_list() at a different rate?

When I say "batch of work" when talking about the page cache cycling
*500 thousand pages a second* through the cache, I'm not talking
about batches of 32 pages. I'm talking about the entire batch of
work kswapd does in an invocation cycle.

Is it scanning 100k pages 10 times a second? Or 10k pages a hundred
times a second? How long does a batch take to run? How long does it
sleep between processing batches? Is there any change in these
metrics as a result of the multi-gen LRU patches?
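
To make the distinction concrete, here is a small worked comparison of the two hypothetical shapes mentioned above. It is only arithmetic on the numbers quoted in this thread. `reclaim_shape` is an illustrative helper, and the assumption that every cycle is split into full SWAP_CLUSTER_MAX-sized chunks is a simplification.

```python
SWAP_CLUSTER_MAX = 32  # per-call cap quoted earlier in the thread

def reclaim_shape(pages_per_cycle, cycles_per_sec):
    """Describe one hypothetical kswapd invocation pattern."""
    # Number of shrink_page_list()-sized chunks per cycle, rounded up.
    calls_per_cycle = -(-pages_per_cycle // SWAP_CLUSTER_MAX)
    return {
        "pages_per_sec": pages_per_cycle * cycles_per_sec,
        "calls_per_cycle": calls_per_cycle,
        "calls_per_sec": calls_per_cycle * cycles_per_sec,
    }

few_big = reclaim_shape(pages_per_cycle=100_000, cycles_per_sec=10)
many_small = reclaim_shape(pages_per_cycle=10_000, cycles_per_sec=100)

print(few_big)
print(many_small)
```

Both shapes reclaim the same number of pages per second in the same 32-page chunks, yet one hammers the mapping tree lock in ten long bursts and the other in a hundred short ones. The per-call batch size alone cannot tell these apart.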

Basically, we're looking at how access to the mapping lock is
changing the contention profile, and whether that is significant or
not. I suspect it is, because when you have highly contended locks
and you do something external that reduces unrelated lock
contention, it's because that external thing is taking more time to
do and so there's less time to spend hitting locks hard...

As such, I don't think this test is a good measure of the multi-gen
LRU patches at all - performance is dominated by the severity of
lock contention external to the LRU scanning algorithm, and it's
hard to infer anything through such lock contention....

> I don't want to paste everything here -- they'd clutter. Please see
> all the detailed profiles in the attachment. Let me know if their
> formats are not to your liking. I still have the raw perf.data.

Which makes the discussion thread just about impossible to follow or
comment on. Please just post the relevant excerpt of the stack
profile that you are commenting on.

> > > And I plan to reach out to other communities, e.g., PostgreSQL, to
> > > benchmark the patchset. I heard they have been complaining about the
> > > buffered io performance under memory pressure. Any other benchmarks
> > > you'd suggest?
> > >
> > > BTW, you might find another surprise in how less frequently slab
> > > shrinkers are called under memory pressure, because this patchset is a
> > > lot better at finding pages to reclaim and therefore doesn't overkill
> > > slabs.
> >
> > That's actually very likely to be a Bad Thing and cause unexpected
> > performance and OOM based regressions. When the machine finally runs
> > out of page cache it can easily reclaim, it's going to get stuck
> > with long tail latencies reclaiming huge slab caches as they've had
> > no substantial ongoing pressure put on them to keep them in balance
> > with the overall memory pressure the system is under...
> 
> Well. It does use the existing equation. That is if it scans X% of
> pages, then it scans X% of slab objects. But 1) it often finds pages
> to reclaim at a lower X% 2) the pages it reclaims are less likely to
> refault. So the side effect is the overall slab objects it scans also
> reduce. I do see your point but don't see any options, at the moment.

You'll have to rebalance the memory reclaim algorithms to either:

a) make the shrinkers more aggressive so they do more reclaim when
called less often, or

b) lower the threshold at which shrinkers are called.

Keeping the slab caches in balance with page cache memory pressure
is fairly important for the performance of workloads that generate
inode and dentry cache load, especially those that don't actually
generate page cache pressure. This is the hardest part about making
fundamental changes to memory reclaim behaviour: ensuring that the
system remains balanced over a wide range of differing workloads and
reacts sanely to sudden step changes in workload behaviour...
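
The proportional rule quoted above ("if it scans X% of pages, then it scans X% of slab objects") and option a) can be sketched as follows. This is a deliberately simplified illustration: `slab_scan_target` and its `boost` parameter are hypothetical names, the page and object counts are invented, and the real shrinker code weighs more inputs than this.

```python
def slab_scan_target(pages_scanned, lru_pages, slab_objects, boost=1):
    """Scan the same fraction of slab objects as of LRU pages.

    boost is a hypothetical integer multiplier standing in for
    "make the shrinkers more aggressive" (option a).
    """
    target = slab_objects * pages_scanned * boost // lru_pages
    return min(slab_objects, target)

LRU_PAGES = 1_000_000     # illustrative numbers only
SLAB_OBJECTS = 500_000

# A reclaimer that has to scan 10% of the LRU to find enough pages ...
baseline = slab_scan_target(100_000, LRU_PAGES, SLAB_OBJECTS)
# ... versus one that finds them after scanning only 2%.
efficient = slab_scan_target(20_000, LRU_PAGES, SLAB_OBJECTS)
# Option a): compensate by scanning slab harder per call.
rebalanced = slab_scan_target(20_000, LRU_PAGES, SLAB_OBJECTS, boost=5)

print(baseline, efficient, rebalanced)
```

In this toy setup the more efficient page scan cuts slab pressure by the same factor it cuts page scanning, which is the imbalance being warned about. The multiplier needed to undo it would have to be derived from real workloads.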

Cheers,

Dave.

Andi Kleen April 15, 2021, 3 a.m. UTC | #29

> We fall back to the rmap when it's obviously not smart to do so. There
> is still a lot of room for improvement in this function though, i.e.,
> it should be per VMA and NUMA aware.

Okay so it's more a question to tune the cross over heuristic. That
sounds much easier than replacing everything.

Of course long term it might be a problem to maintain too many 
different ways to do things, but I suppose short term it's a reasonable
strategy.

-Andi

Yu Zhao April 15, 2021, 7:13 a.m. UTC | #30

On Wed, Apr 14, 2021 at 9:00 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> > We fall back to the rmap when it's obviously not smart to do so. There
> > is still a lot of room for improvement in this function though, i.e.,
> > it should be per VMA and NUMA aware.
>
> Okay so it's more a question to tune the cross over heuristic. That
> sounds much easier than replacing everything.
>
> Of course long term it might be a problem to maintain too many
> different ways to do things, but I suppose short term it's a reasonable
> strategy.

Hi Rik, Ying,

Sorry for being persistent. I want to make sure we are on the same page:

Page table scanning doesn't replace the existing rmap walk. It is
complementary and only happens when it is likely that most of the
pages on a system under pressure have been referenced, i.e., out of
*inactive* pages, by definition of the existing implementation. Under
such a condition, scanning *active* pages one by one with the rmap is
likely to cost more than scanning them all at once via page tables.
When we evict *inactive* pages, we still use the rmap and share a
common path with the existing code.

Page table scanning falls back to the rmap walk if the page tables of
a process are apparently sparse, i.e., rss < size of the page tables.
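
As a minimal sketch of that fallback test (an illustration, not the code in the patchset): the helper name `prefer_page_table_walk` is hypothetical, and both arguments are assumed to be counted in the same unit, here pages.

```python
def prefer_page_table_walk(rss_pages: int, page_table_pages: int) -> bool:
    """Walk page tables only if the process has at least as many resident
    pages as it has pages of page tables; otherwise fall back to the rmap."""
    return rss_pages >= page_table_pages

# 1000 resident pages, each in its own PTE table, plus a few upper-level
# tables: walking the tables would touch more table pages than data pages.
sparse = prefer_page_table_walk(rss_pages=1000, page_table_pages=1003)

# The same 1000 resident pages packed into a handful of PTE tables.
dense = prefer_page_table_walk(rss_pages=1000, page_table_pages=5)

print(sparse, dense)
```

The comparison only trips when there is less than about one resident page per page of page tables, so it filters out extremely sparse processes and nothing more.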

I should have clarified this at the very beginning of the discussion.
But it has become so natural to me and I assumed we'd all see it this
way.

Your concern regarding the NUMA optimization is still valid, and it's
a high priority.

Thanks.

Huang, Ying April 15, 2021, 8:19 a.m. UTC | #31

Yu Zhao <yuzhao@google.com> writes:

> On Wed, Apr 14, 2021 at 9:00 PM Andi Kleen <ak@linux.intel.com> wrote:
>>
>> > We fall back to the rmap when it's obviously not smart to do so. There
>> > is still a lot of room for improvement in this function though, i.e.,
>> > it should be per VMA and NUMA aware.
>>
>> Okay so it's more a question to tune the cross over heuristic. That
>> sounds much easier than replacing everything.
>>
>> Of course long term it might be a problem to maintain too many
>> different ways to do things, but I suppose short term it's a reasonable
>> strategy.
>
> Hi Rik, Ying,
>
> Sorry for being persistent. I want to make sure we are on the same page:
>
> Page table scanning doesn't replace the existing rmap walk. It is
> complementary and only happens when it is likely that most of the
> pages on a system under pressure have been referenced, i.e., out of
> *inactive* pages, by definition of the existing implementation. Under
> such a condition, scanning *active* pages one by one with the rmap is
> likely to cost more than scanning them all at once via page tables.
> When we evict *inactive* pages, we still use the rmap and share a
> common path with the existing code.
>
> Page table scanning falls back to the rmap walk if the page tables of
> a process are apparently sparse, i.e., rss < size of the page tables.
>
> I should have clarified this at the very beginning of the discussion.
> But it has become so natural to me and I assumed we'd all see it this
> way.
>
> Your concern regarding the NUMA optimization is still valid, and it's
> a high priority.

Hi, Yu,

In general, I think it's a good idea to combine the page table scanning
and rmap scanning in the page reclaiming.  For example, if the
working-set is transitioned, we can take advantage of the fast page
table scanning to identify the new working-set quickly, while we can
fall back to the rmap scanning if the page table scanning doesn't help.

Best Regards,
Huang, Ying

Michel Lespinasse April 15, 2021, 9:57 a.m. UTC | #32

On Thu, Apr 15, 2021 at 01:13:13AM -0600, Yu Zhao wrote:
> Page table scanning doesn't replace the existing rmap walk. It is
> complementary and only happens when it is likely that most of the
> pages on a system under pressure have been referenced, i.e., out of
> *inactive* pages, by definition of the existing implementation. Under
> such a condition, scanning *active* pages one by one with the rmap is
> likely to cost more than scanning them all at once via page tables.
> When we evict *inactive* pages, we still use the rmap and share a
> common path with the existing code.
> 
> Page table scanning falls back to the rmap walk if the page tables of
> a process are apparently sparse, i.e., rss < size of the page tables.

Could you expand a bit more as to how page table scanning and rmap
scanning coexist ? Say, there is some memory pressure and you want to
identify good candidate pages to reclaim. You could scan processes with
the page table scanning method, or you could scan the lru list through
the rmap method. How do you mix the two - when you use the lru/rmap
method, won't you encounter both pages that are mapped in "dense"
processes where scanning page tables would have been better, and pages
that are mapped in "sparse" processes where you are happy to be using
rmap, and even pages that are mapped into both types of processes at
once ?  Or, can you change the lru/rmap scan so that it will efficiently
skip over all dense processes when you use it ?

Thanks,

--
Michel "walken" Lespinasse

Yu Zhao April 24, 2021, 2:33 a.m. UTC | #33

On Sun, Apr 18, 2021 at 12:48 AM Michel Lespinasse
<michel@lespinasse.org> wrote:
> On Thu, Apr 15, 2021 at 01:13:13AM -0600, Yu Zhao wrote:
> > Page table scanning doesn't replace the existing rmap walk. It is
> > complementary and only happens when it is likely that most of the
> > pages on a system under pressure have been referenced, i.e., out of
> > *inactive* pages, by definition of the existing implementation. Under
> > such a condition, scanning *active* pages one by one with the rmap is
> > likely to cost more than scanning them all at once via page tables.
> > When we evict *inactive* pages, we still use the rmap and share a
> > common path with the existing code.
> >
> > Page table scanning falls back to the rmap walk if the page tables of
> > a process are apparently sparse, i.e., rss < size of the page tables.
>
> Could you expand a bit more as to how page table scanning and rmap
> scanning coexist ? Say, there is some memory pressure and you want to
> identify good candidate pages to reclaim. You could scan processes with
> the page table scanning method, or you could scan the lru list through
> the rmap method. How do you mix the two - when you use the lru/rmap
> method, won't you encounter both pages that are mapped in "dense"
> processes where scanning page tables would have been better, and pages
> that are mapped in "sparse" processes where you are happy to be using
> rmap, and even pages that are mapped into both types of processes at
> once ?  Or, can you change the lru/rmap scan so that it will efficiently
> skip over all dense processes when you use it ?

Hi Michel,

Sorry for the late reply. I was out of town and am still catching up on emails.

That's a great question. Currently the page table scanning isn't smart
enough to know where dense regions are. My plan was to improve it
gradually but it seems it couldn't wait because people have major
concerns over this.

At the moment, the page table scanning decides if a process is worthy
by checking its RSS against the size of its page tables. This can only
avoid extremely sparse regions, meaning the page table scanning will
scan regions that ideally should be covered by the rmap, for some
worst case scenarios. My next step is to add a bloom filter so it can
quickly determine dense regions and target them only.

Given what I just said, the rmap is unlikely to encounter dense
regions, and that's why the perf profile shows its cpu usage drops
from ~30% to ~5%.

Now the question is how we build the bloom filter. A simple answer is
to let the rmap do the legwork, i.e., when it encounters dense
regions, add them to the filter. Of course this means we'll have to
use the rmap more than we do now, which is not ideal for some
workloads but necessary to avoid worst case scenarios.

Does it make sense?

Andi Kleen April 24, 2021, 3:30 a.m. UTC | #34

> Now the question is how we build the bloom filter. A simple answer is
> to let the rmap do the legwork, i.e., when it encounters dense
> regions, add them to the filter. Of course this means we'll have to
> use the rmap more than we do now, which is not ideal for some
> workloads but necessary to avoid worst case scenarios.

How would you maintain the bloom filter over time? Assume a process
that always creates new mappings and unmaps old mappings. How 
do the stale old mappings get removed and avoid polluting it over time?

Or are you thinking of one of the fancier bloom filter variants
that support deletion? As I understand they're significantly less
space efficient and more complicated.

-Andi

Yu Zhao April 24, 2021, 4:16 a.m. UTC | #35

On Fri, Apr 23, 2021 at 9:30 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> > Now the question is how we build the bloom filter. A simple answer is
> > to let the rmap do the legwork, i.e., when it encounters dense
> > regions, add them to the filter. Of course this means we'll have to
> > use the rmap more than we do now, which is not ideal for some
> > workloads but necessary to avoid worst case scenarios.
>
> How would you maintain the bloom filter over time? Assume a process
> that always creates new mappings and unmaps old mappings. How
> do the stale old mappings get removed and avoid polluting it over time?
>
> Or are you thinking of one of the fancier bloom filter variants
> that support deletion? As I understand they're significantly less
> space efficient and more complicated.

Hi Andi,

That's where the double buffering technique comes in :)

Recap: the creation of each new generation starts with scanning page
tables to clear the accessed bit of pages referenced since the last
scan.

We scan page tables according to the current bloom filter, and at the
same time, we build a new one and write it to the second buffer.
During this step, we eliminate regions that have become invalid, e.g.,
too sparse or completely unmapped. Note that the scan *will* miss
newly mapped regions, i.e., dense regions that the rmap hasn't
discovered. Once this step is done, we flip to the second buffer. And
from now on, all the new dense regions discovered by the rmap will be
recorded into this buffer.

Each element in the bloom filter is a hash value from an address of a
page table and a node id, indicating this page table has a worthwhile
number of pages from this node.

A single counting bloom filter works too but it doesn't seem to offer
any advantage over double buffering. And we need to handle overflow
too.
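
The double buffering described above can be sketched in a few lines. This is a user-space illustration under stated assumptions, not the planned kernel code: the class names, the filter size, the choice of `hashlib.blake2b` as the hash, and the `still_dense` callback are all hypothetical. Keys are (page table address, node id) pairs as described.

```python
import hashlib

class BloomFilter:
    def __init__(self, nbits=1 << 15, nhashes=2):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, key):
        # Derive nhashes bit positions from one 16-byte digest.
        digest = hashlib.blake2b(repr(key).encode(), digest_size=16).digest()
        for i in range(self.nhashes):
            chunk = digest[i * 8:(i + 1) * 8]
            yield int.from_bytes(chunk, "little") % self.nbits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

class DoubleBufferedFilter:
    def __init__(self):
        self.current = BloomFilter()

    def record_dense(self, pt_addr, node):
        """Called from the rmap side when it runs into a dense page table."""
        self.current.add((pt_addr, node))

    def scan(self, page_tables, still_dense):
        """One aging pass: visit the page tables the current filter admits,
        rebuild a fresh filter from those that are still dense, then flip."""
        fresh = BloomFilter()
        visited = []
        for key in page_tables:
            if key in self.current:
                visited.append(key)
                if still_dense(key):
                    fresh.add(key)
        self.current = fresh  # stale entries disappear with the old buffer
        return visited

f = DoubleBufferedFilter()
f.record_dense(0x1000, 0)
f.record_dense(0x2000, 0)
tables = [(0x1000, 0), (0x2000, 0), (0x3000, 0)]
# Pretend the table at 0x2000 has been unmapped since the last pass.
visited = f.scan(tables, still_dense=lambda key: key != (0x2000, 0))
```

After the flip only (0x1000, 0) is still admitted, so no deletion support is needed in the filter itself. The cost is that anything the rmap has not reported yet is skipped by the scan, as noted above.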

Yu Zhao April 24, 2021, 9:21 p.m. UTC | #36

On Wed, Apr 14, 2021 at 7:36 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Apr 14, 2021 at 01:16:52AM -0600, Yu Zhao wrote:
> > On Tue, Apr 13, 2021 at 10:50 PM Dave Chinner <david@fromorbit.com> wrote:
> > > On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote:
> > > > On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > > Profiles would be interesting, because it sounds to me like reclaim
> > > > > *might* be batching page cache removal better (e.g. fewer, larger
> > > > > batches) and so spending less time contending on the mapping tree
> > > > > lock...
> > > > >
> > > > > IOWs, I suspect this result might actually be a result of less lock
> > > > > contention due to a change in batch processing characteristics of
> > > > > the new algorithm rather than it being a "better" algorithm...
> > > >
> > > > I appreciate the profile. But there is no batching in
> > > > __remove_mapping() -- it locks the mapping for each page, and
> > > > therefore the lock contention penalizes the mainline and this patchset
> > > > equally. It looks worse on your system because the four kswapd threads
> > > > from different nodes were working on the same file.
> > >
> > > I think you misunderstand exactly what I mean by "batching" here.
> > > I'm not talking about doing multiple pieces of work under a single
> > > lock. What I mean is that the overall amount of work done in a
> > > single reclaim scan (i.e a "reclaim batch") is packaged differently.
> > >
> > > We already batch up page reclaim via building a page list and then
> > > passing it to shrink_page_list() to process the batch of pages in a
> > > single pass. Each page in this page list batch then calls
> > > remove_mapping() to pull the page from the LRU, so we have a run of
> > > contention between the foreground read() thread and the background
> > > kswapd.
> > >
> > > If the size or nature of the pages in the batch passed to
> > > shrink_page_list() changes, then the amount of time a reclaim batch
> > > is going to put pressure on the mapping tree lock will also change.
> > > That's the "change in batching behaviour" I'm referring to here. I
> > > haven't read through the patchset to determine if you change the
> > > shrink_page_list() algorithm, but it likely changes what is passed
> > > to be reclaimed and that in turn changes the locking patterns that
> > > fall out of shrink_page_list...
> >
> > Ok, if we are talking about the size of the batch passed to
> > shrink_page_list(), both the mainline and this patchset cap it at
> > SWAP_CLUSTER_MAX, which is 32. There are corner cases, but when
> > running fio/io_uring, it's safe to say both use 32.
>
> You're still looking at micro-scale behaviour, not the larger-scale
> batching effects. Are we passing SWAP_CLUSTER_MAX groups of pages to
> shrink_page_list() at a different rate?
>
> When I say "batch of work" when talking about the page cache cycling
> *500 thousand pages a second* through the cache, I'm not talking
> about batches of 32 pages. I'm talking about the entire batch of
> work kswapd does in an invocation cycle.
>
> Is it scanning 100k pages 10 times a second? Or 10k pages a hundred
> times a second? How long does a batch take to run? How long does it
> sleep between processing batches? Is there any change in these
> metrics as a result of the multi-gen LRU patches?

Hi Dave,

Sorry for the late reply. Still catching up on emails.

Well, it doesn't really work that way. Yes, I agree that batching
theoretically can have effects on the performance but the patchset
doesn't change anything in this respect. The number of pages to
reclaim is determined by a common code path shared between the
existing implementation and this patchset. Specifically, kswapd sets
"sc->nr_to_reclaim" based on the high watermark, and passes "sc" down
to both code paths:

 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
<snipped>
+ if (lru_gen_enabled()) {
+ shrink_lru_gens(lruvec, sc);
+ return;
+ }
<snipped>

And there isn't really any new algorithm. It's just the old plain LRU.
The improvement is purely from a feedback loop that helps avoid
unnecessary activations and deactivations. By activations, I mean the
following work done in the buffered io read path:
  generic_file_read_iter()
    filemap_read()
      mark_page_accessed()
        activate_page()
Simply put, on the second access, the current implementation
unconditionally moves a page from the inactive lru to the active lru,
if it's not already there.

And by deactivations, I mean the following work done in kswapd:
  kswapd_shrink_node()
    shrink_node()
      shrink_node_memcgs()
        shrink_lruvec()
          shrink_active_list()
i.e., kswapd moves activated pages back to the inactive list before it
can evict them.

For random accesses, every page is equal and none is active. But how
can we tell? The refault rates, i.e., how often pages are evicted and
then accessed again, are the same for all pages, no matter how many
times they have been accessed. This is exactly how the feedback loop
works.

> Basically, we're looking at how access to the mapping lock is
> changing the contention profile, and whether that is significant or
> not. I suspect it is, because when you have highly contended locks
> and you do something external that reduces unrelated lock
> contention, it's because that external thing is taking more time to
> do and so there's less time to spend hitting locks hard...
>
> As such, I don't think this test is a good measure of the multi-gen
> LRU patches at all - performance is dominated by the severity of
> lock contention external to the LRU scanning algorithm, and it's
> hard to infer anything through such lock contention....

The lock contention you saw earlier is because of the four-node
system you used -- each node has a kswapd thread but there is only one
file. The four kswapd threads keep banging on the same mapping lock.
This can be easily fixed if we just run the same test with a single
node *or* with multiple files.

Here is what I got (it's basically what I attached to the previous email).

Before the patchset:

kswapd
# Children      Self  Symbol
# ........  ........  .......................................
   100.00%     0.00%  kswapd
    99.90%     0.01%  balance_pgdat
    99.86%     0.05%  shrink_node
    98.00%     0.19%  shrink_lruvec
    91.13%     0.10%  shrink_inactive_list
    75.67%     5.97%  shrink_page_list
    57.65%     2.59%  __remove_mapping
    45.99%     0.27%  __delete_from_page_cache
    42.88%     0.88%  page_cache_delete
    39.05%     1.04%  xas_store
    37.71%    37.67%  xas_create
    12.62%    11.79%  isolate_lru_pages
     8.52%     0.98%  free_unref_page_list
     7.38%     0.61%  free_unref_page_commit
     6.68%     1.78%  free_pcppages_bulk
     6.53%     0.83%  ***shrink_active_list***
     6.45%     3.21%  _raw_spin_lock_irqsave
     4.82%     4.60%  __free_one_page
     4.58%     4.21%  unlock_page
     3.62%     3.55%  native_queued_spin_lock_slowpath
     2.49%     0.71%  unaccount_page_cache_page
     2.46%     0.81%  workingset_eviction
     2.14%     0.33%  __mod_lruvec_state
     1.97%     1.88%  xas_clear_mark
     1.73%     0.26%  __mod_lruvec_page_state
     1.71%     1.06%  move_pages_to_lru
     1.66%     1.62%  workingset_age_nonresident
     1.60%     0.85%  __mod_memcg_lruvec_state
     1.58%     0.02%  shrink_slab
     1.49%     0.13%  mem_cgroup_uncharge_list
     1.45%     0.06%  do_shrink_slab
     1.37%     1.32%  page_mapping
     1.06%     0.76%  count_shadow_nodes

io_uring
# Children      Self  Symbol
# ........  ........  ........................................
    99.22%     0.00%  do_syscall_64
    94.48%     0.22%  __io_queue_sqe
    94.09%     0.31%  io_issue_sqe
    93.33%     0.62%  io_read
    88.57%     0.93%  blkdev_read_iter
    87.80%     0.15%  io_iter_do_read
    87.57%     0.16%  generic_file_read_iter
    87.35%     1.08%  filemap_read
    82.44%     0.01%  __x64_sys_io_uring_enter
    82.34%     0.01%  __do_sys_io_uring_enter
    82.09%     0.47%  io_submit_sqes
    79.50%     0.08%  io_queue_sqe
    71.47%     1.08%  filemap_get_pages
    53.08%     0.35%  ondemand_readahead
    51.74%     1.49%  page_cache_ra_unbounded
    49.92%     0.13%  page_cache_sync_ra
    21.26%     0.63%  add_to_page_cache_lru
    17.08%     0.02%  task_work_run
    16.98%     0.03%  exit_to_user_mode_prepare
    16.94%    10.52%  __add_to_page_cache_locked
    16.94%     0.06%  tctx_task_work
    16.79%     0.19%  syscall_exit_to_user_mode
    16.41%     2.61%  filemap_get_read_batch
    16.08%    15.72%  xas_load
    15.99%     0.15%  read_pages
    15.87%     0.13%  blkdev_readahead
    15.72%     0.56%  mpage_readahead
    15.58%     0.09%  io_req_task_submit
    15.37%     0.07%  __io_req_task_submit
    12.14%     0.15%  copy_page_to_iter
    12.03%     0.03%  ***__page_cache_alloc***
    11.98%     0.14%  alloc_pages_current
    11.66%     0.30%  __alloc_pages_nodemask
    11.28%    11.05%  _copy_to_iter
    11.06%     3.16%  get_page_from_freelist
    10.53%     0.10%  submit_bio
    10.27%     0.21%  submit_bio_noacct
     8.30%     0.51%  blk_mq_submit_bio
     7.73%     7.62%  clear_page_erms
     3.68%     1.02%  do_mpage_readpage
     3.33%     0.01%  page_cache_async_ra
     3.25%     0.01%  blk_flush_plug_list
     3.24%     0.06%  blk_mq_flush_plug_list
     3.18%     0.01%  blk_mq_sched_insert_requests
     3.15%     0.13%  blk_mq_try_issue_list_directly
     2.84%     0.15%  __blk_mq_try_issue_directly
     2.69%     0.58%  nvme_queue_rq
     2.54%     1.05%  blk_attempt_plug_merge
     2.53%     0.44%  ***mark_page_accessed***
     2.36%     0.12%  rw_verify_area
     2.16%     0.09%  mpage_alloc
     2.10%     0.27%  lru_cache_add
     1.81%     0.27%  security_file_permission
     1.75%     0.87%  __pagevec_lru_add
     1.70%     0.63%  _raw_spin_lock_irq
     1.53%     0.40%  xa_get_order
     1.52%     0.15%  __blk_mq_alloc_request
     1.50%     0.65%  workingset_refault
     1.50%     0.07%  activate_page
     1.48%     0.29%  io_submit_flush_completions
     1.46%     0.32%  bio_alloc_bioset
     1.42%     0.15%  xa_load
     1.40%     0.16%  pagevec_lru_move_fn
     1.39%     0.06%  blk_finish_plug
     1.35%     0.28%  submit_bio_checks
     1.26%     0.02%  asm_common_interrupt
     1.25%     0.01%  common_interrupt
     1.21%     1.21%  native_queued_spin_lock_slowpath
     1.18%     0.10%  mempool_alloc
     1.17%     0.00%  __common_interrupt
     1.14%     0.03%  handle_edge_irq
     1.08%     0.87%  apparmor_file_permission
     1.03%     0.19%  __mod_lruvec_state
     1.02%     0.65%  blk_rq_merge_ok
     1.01%     0.94%  xas_start

After the patchset:

kswapd
# Children      Self  Symbol
# ........  ........  ........................................
   100.00%     0.00%  kswapd
    99.92%     0.11%  balance_pgdat
    99.32%     0.03%  shrink_node
    97.25%     0.32%  shrink_lruvec
    96.80%     0.09%  evict_lru_gen_pages
    77.82%     6.28%  shrink_page_list
    61.61%     2.76%  __remove_mapping
    50.28%     0.33%  __delete_from_page_cache
    46.63%     1.08%  page_cache_delete
    42.20%     1.16%  xas_store
    40.71%    40.67%  xas_create
    12.54%     7.76%  isolate_lru_gen_pages
     6.42%     3.19%  _raw_spin_lock_irqsave
     6.15%     0.91%  free_unref_page_list
     5.62%     5.45%  unlock_page
     5.05%     0.59%  free_unref_page_commit
     4.35%     2.04%  lru_gen_update_size
     4.31%     1.41%  free_pcppages_bulk
     3.43%     3.36%  native_queued_spin_lock_slowpath
     3.38%     0.59%  __mod_lruvec_state
     2.97%     0.78%  unaccount_page_cache_page
     2.82%     2.52%  __free_one_page
     2.33%     1.18%  __mod_memcg_lruvec_state
     2.28%     2.17%  xas_clear_mark
     2.13%     0.30%  __mod_lruvec_page_state
     1.88%     0.04%  shrink_slab
     1.82%     1.78%  workingset_eviction
     1.74%     0.06%  do_shrink_slab
     1.70%     0.15%  mem_cgroup_uncharge_list
     1.39%     1.01%  count_shadow_nodes
     1.22%     1.18%  __mod_memcg_state.part.0
     1.16%     1.11%  page_mapping
     1.02%     0.98%  xas_init_marks

io_uring
# Children      Self  Symbol
# ........  ........  ........................................
    99.19%     0.01%  entry_SYSCALL_64_after_hwframe
    99.16%     0.00%  do_syscall_64
    94.78%     0.18%  __io_queue_sqe
    94.41%     0.25%  io_issue_sqe
    93.60%     0.48%  io_read
    89.35%     0.96%  blkdev_read_iter
    88.44%     0.12%  io_iter_do_read
    88.25%     0.16%  generic_file_read_iter
    88.00%     1.20%  filemap_read
    84.01%     0.01%  __x64_sys_io_uring_enter
    83.91%     0.01%  __do_sys_io_uring_enter
    83.74%     0.37%  io_submit_sqes
    81.28%     0.07%  io_queue_sqe
    74.65%     0.96%  filemap_get_pages
    55.92%     0.35%  ondemand_readahead
    54.57%     1.34%  page_cache_ra_unbounded
    51.57%     0.12%  page_cache_sync_ra
    24.14%     0.51%  add_to_page_cache_lru
    19.04%    11.51%  __add_to_page_cache_locked
    18.48%     0.13%  read_pages
    18.42%     0.18%  blkdev_readahead
    18.20%     0.55%  mpage_readahead
    16.81%     2.31%  filemap_get_read_batch
    16.37%    14.83%  xas_load
    15.40%     0.02%  task_work_run
    15.38%     0.03%  exit_to_user_mode_prepare
    15.31%     0.05%  tctx_task_work
    15.14%     0.15%  syscall_exit_to_user_mode
    14.05%     0.04%  io_req_task_submit
    13.86%     0.05%  __io_req_task_submit
    12.92%     0.12%  submit_bio
    11.40%     0.13%  copy_page_to_iter
    10.65%     9.61%  _copy_to_iter
     9.45%     0.03%  ***__page_cache_alloc***
     9.42%     0.16%  submit_bio_noacct
     9.40%     0.11%  alloc_pages_current
     9.11%     0.30%  __alloc_pages_nodemask
     8.53%     1.81%  get_page_from_freelist
     8.38%     0.10%  asm_common_interrupt
     8.26%     0.06%  common_interrupt
     7.75%     0.05%  __common_interrupt
     7.62%     0.44%  blk_mq_submit_bio
     7.56%     0.20%  handle_edge_irq
     6.45%     5.90%  clear_page_erms
     5.25%     0.10%  handle_irq_event
     4.88%     0.19%  nvme_irq
     4.83%     0.07%  __handle_irq_event_percpu
     4.73%     0.52%  nvme_process_cq
     4.52%     0.01%  page_cache_async_ra
     4.00%     0.04%  nvme_pci_complete_rq
     3.82%     0.04%  nvme_complete_rq
     3.76%     1.11%  do_mpage_readpage
     3.74%     0.06%  blk_mq_end_request
     3.03%     0.01%  blk_flush_plug_list
     3.02%     0.06%  blk_mq_flush_plug_list
     2.96%     0.01%  blk_mq_sched_insert_requests
     2.94%     0.10%  blk_mq_try_issue_list_directly
     2.89%     0.00%  __irqentry_text_start
     2.71%     0.41%  psi_task_change
     2.67%     0.21%  lru_cache_add
     2.65%     0.14%  __blk_mq_try_issue_directly
     2.53%     0.54%  nvme_queue_rq
     2.43%     0.17%  blk_update_request
     2.42%     0.58%  __pagevec_lru_add
     2.29%     1.42%  psi_group_change
     2.22%     0.85%  blk_attempt_plug_merge
     2.14%     0.04%  bio_endio
     2.13%     0.11%  rw_verify_area
     2.08%     0.18%  mpage_end_io
     2.01%     0.08%  mpage_alloc
     1.71%     0.56%  _raw_spin_lock_irq
     1.65%     0.98%  workingset_refault
     1.65%     0.09%  psi_memstall_leave
     1.64%     0.20%  security_file_permission
     1.61%     1.59%  _raw_spin_lock
     1.58%     0.08%  psi_memstall_enter
     1.44%     0.37%  xa_get_order
     1.43%     0.13%  __blk_mq_alloc_request
     1.37%     0.26%  io_submit_flush_completions
     1.36%     0.14%  xa_load
     1.34%     0.31%  bio_alloc_bioset
     1.31%     0.26%  submit_bio_checks
     1.29%     0.04%  blk_finish_plug
     1.28%     1.27%  native_queued_spin_lock_slowpath
     1.24%     0.19%  page_endio
     1.13%     0.10%  unlock_page
     1.09%     0.99%  read_tsc
     1.07%     0.74%  lru_gen_addition
     1.03%     0.09%  mempool_alloc
     1.02%     0.13%  wake_up_page_bit
     1.02%     0.92%  xas_start
     1.01%     0.78%  apparmor_file_permission

By comparing the two sets, we can clearly see what's changed:

Before the patchset:
     6.53%     0.83%  ***shrink_active_list***
    12.03%     0.03%  ***__page_cache_alloc***
     2.53%     0.44%  ***mark_page_accessed***

After the patchset:
     9.45%     0.03%  ***__page_cache_alloc***
(There is no shrink_active_list() or mark_page_accessed() since we
don't activate or deactivate pages anymore, for this test case.)

Hopefully this is clear enough. But I do see where your skepticism
comes from and I don't want to dismiss it out of hand. So if you have
any other benchmarks, I'd be happy to try them. What do you think?

Thanks.

Yu Zhao April 27, 2021, 10:35 a.m. UTC | #37

On Wed, Apr 14, 2021 at 11:43 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Hello Yu,

Hi Johannes,

I appreciate the detailed review. Hopefully I have addressed all your
comments below.

> On Tue, Apr 13, 2021 at 12:56:17AM -0600, Yu Zhao wrote:
> > What's new in v2
> > ================
> > Special thanks to Jens Axboe for reporting a regression in buffered
> > I/O and helping test the fix.
> >
> > This version includes the support of tiers, which represent levels of
> > usage from file descriptors only. Pages accessed N times via file
> > descriptors belong to tier order_base_2(N). Each generation contains
> > at most MAX_NR_TIERS tiers, and they require additional MAX_NR_TIERS-2
> > bits in page->flags. In contrast to moving across generations which
> > requires the lru lock, moving across tiers only involves an atomic
> > operation on page->flags and therefore has a negligible cost. A
> > feedback loop modeled after the well-known PID controller monitors the
> > refault rates across all tiers and decides when to activate pages from
> > which tiers, on the reclaim path.
>
> Could you elaborate a bit more on the difference between generations
> and tiers?
>
> A refault, a page table reference, or a buffered read through a file
> descriptor ultimately all boil down to a memory access. The value of
> having that memory resident and the cost of bringing it in from
> backing storage should be the same regardless of how it's accessed by
> userspace; and whether it's an in-memory reference or a non-resident
> reference should have the same relative impact on the page's age.
>
> With that context, I don't understand why file descriptor refs and
> refaults get such special treatment. Could you shed some light here?
>
> > This feedback model has a few advantages over the current feedforward
> > model:
> > 1) It has a negligible overhead in the buffered I/O access path
> >    because activations are done in the reclaim path.
>
> This is useful if the workload isn't reclaim bound, but it can be
> hazardous to defer work to reclaim, too.
>
> If you go through the git history, there have been several patches to
> soften access recognition inside reclaim because it can come with
> large latencies when page reclaim kicks in after a longer period with
> no memory pressure and doesn't have uptodate reference information -
> to the point where eating a few extra IOs tends to add less latency to
> the workload than waiting for reclaim to refresh its aging data.
>
> Could you elaborate a bit more on the tradeoff here?

=== Tiers ===

I agree with all you said. Let me summarize.

Remark 1: a refault, *a page fault* or a buffered read is exactly one
memory reference. A page table reference as we count it, i.e., the
accessed bit being set, could be one or a thousand memory references. So
the accessed bit for a mapped page and PageReferenced() for an
unmapped page may carry different weights.

Remark 2: the cost of bringing a page back, regardless of how it is
referenced, is the same.

Remark 3: not using extra aging information may be preferable, if
obtaining or maintaining such information would cost more.

Starting with remark 3.

For pages referenced multiple times via file descriptors, we currently
activate them in mark_page_accessed(), regardless of memory pressure.
If we defer their activations, we may be penalized for it. But, based
on remark 3, it is still a win if activating them on the spot has a
higher overall cost.

The proposal here is we do not move them to the active lru list upon
the second reference. Instead, we simply increment a counter in
page->flags, just like SetPageReferenced() without activate_page() in
mark_page_accessed(). For the sake of discussion, let us assume each
possible value of the counter is a tier. Pages read ahead are in tier
0; pages referenced once are in tier 1; pages referenced twice are in
tier 2, etc. Note that we are talking about references via file
descriptors.
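
As a rough userspace illustration of "increment a counter in
page->flags", here is a sketch using C11 atomics. It is not the patchset
code: struct fake_page, the 2-bit width and the bit position are
hypothetical, and it follows the simplification above where each counter
value is its own tier.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical layout: a 2-bit reference counter inside a flags word. */
#define REFS_SHIFT	8
#define REFS_WIDTH	2
#define REFS_MAX	((1UL << REFS_WIDTH) - 1)
#define REFS_MASK	(REFS_MAX << REFS_SHIFT)

/* Stand-in for struct page; only the flags word matters here. */
struct fake_page {
	atomic_ulong flags;
};

/* Read the counter, i.e., the tier in this simplified model. */
static unsigned long page_refs(struct fake_page *page)
{
	return (atomic_load(&page->flags) & REFS_MASK) >> REFS_SHIFT;
}

/*
 * Record one reference via a file descriptor. No list manipulation and
 * no lock: just a compare-and-exchange loop that saturates at REFS_MAX
 * and leaves all other flag bits untouched.
 */
static void page_inc_refs(struct fake_page *page)
{
	unsigned long old = atomic_load(&page->flags);
	unsigned long val;

	do {
		unsigned long refs = (old & REFS_MASK) >> REFS_SHIFT;

		if (refs == REFS_MAX)
			return;	/* already in the highest tier */
		val = (old & ~REFS_MASK) | ((refs + 1) << REFS_SHIFT);
	} while (!atomic_compare_exchange_weak(&page->flags, &old, val));
}
```

A page starts with the counter at 0 (read ahead, not yet referenced), and
each call to page_inc_refs() moves it up one tier until the counter
saturates.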

Then we record the refaults for each tier, and we compare the refault
rates, i.e., refaulted/evicted across all tiers, in the reclaim path.
For example, if we see tier 2 has a higher refault rate, we activate
pages from this tier. Otherwise, we keep evicting pages from this
tier. This allows us to shift the cost of activations from the
buffered read path to the reclaim path. This is likely to be a win,
and I will explain why at the end of this section.
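
The comparison of refault rates could look roughly like the sketch
below. This is a simplification under stated assumptions: the names
(tier_stats, decide_activation()) and NR_TIERS are made up, tier 0 is
assumed to be the baseline every other tier is compared against, and a
plain comparison stands in for the PID-style controller mentioned in the
cover letter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_TIERS 4	/* hypothetical number of tiers */

/* Per-tier counters maintained on the eviction and refault paths. */
struct tier_stats {
	uint64_t evicted;
	uint64_t refaulted;
};

/*
 * Is refaulted/evicted of tier t higher than that of the baseline?
 * Cross-multiplying avoids a division: a/b > c/d <=> a*d > c*b.
 * A tier with nothing evicted yet is treated as "not higher".
 */
static bool refault_rate_higher(const struct tier_stats *t,
				const struct tier_stats *base)
{
	return t->refaulted * base->evicted > base->refaulted * t->evicted;
}

/*
 * For each tier, decide on the reclaim path whether to activate its
 * pages (true) or to keep evicting them (false).
 */
static void decide_activation(const struct tier_stats stats[NR_TIERS],
			      bool activate[NR_TIERS])
{
	activate[0] = false;	/* the baseline tier is always evictable */
	for (int i = 1; i < NR_TIERS; i++)
		activate[i] = refault_rate_higher(&stats[i], &stats[0]);
}
```

With, say, tier 0 at 100 refaults per 1000 evictions and tier 2 at 80
per 200, only tier 2 would be activated in this model; a tier with the
same rate as the baseline keeps being evicted.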

Next let us look at remark 1, and how tiers can help us with the
different weight from the accessed bit.

For pages referenced via page tables only, we can assign them a tier,
say tier 0. Then we are able to compare their refault rate with those
referenced multiple times via file descriptors. Even though the
accessed bit carries a different weight, a refault has exactly the
same weight, because of remark 2.

For example, if pages referenced via page tables have a higher refault
rate than pages referenced twice via file descriptors, we will not
activate the latter and therefore would provide better protection to
the former by not flooding the active list. The current implementation
will activate the latter on the spot, which is suboptimal for this
example.

Another example: if we find pages referenced four times via file
descriptors have a higher refault rate than the rest, we only activate
them. The current implementation activates pages accessed twice and
three times too, and if there are many of them, they will flood the
active lru list and weaken the protection of pages accessed four
times.

Now, an additional remark.

Remark 4: tracking references of mapped pages by clearing the accessed
bit is more expensive than tracking references of unmapped pages by
mark_page_accessed().

The creation of a generation begins with scanning page tables (if they
are not too sparse) of each active process to find all referenced
pages since the last scan. So it is expensive.

If we moved a page to the next generation upon the second reference
via file descriptor, old generations would run out of pages sooner and
we would have to create new generations at a faster pace to keep up,
which increases the cost. In addition, moving pages across generations
is also expensive, because, on the data structure level, it is the same
as moving pages between the active and the inactive lists, which
requires the lru lock. On the other hand, tiers are lightweight.
Changing tiers within a generation is only an atomic operation on
page->flags.

With the current implementation, randomly reading (buffered io) a
large file, e.g., twice as large as memory size, from fast storage
for long enough will demonstrate both problems. In kswapd,
shrink_active_list() costs >6% of CPU. In the buffered read path,
mark_page_accessed() costs >2%. Statistically speaking, pages accessed
multiple times are not more active than pages accessed once, in this
case. Therefore, the work done by both functions is wasted.

Finally, the tradeoff part.

Fundamentally, the idea of tiers is based on a feedback loop, which is
essentially trial and error. So it will perform worse than the current
open loop control, i.e., activating upon the second reference, if we
know for sure that pages referenced twice need to be protected. IOW,
knowing what is going to happen can avoid the error part from the
feedback loop. But in the realm of page reclaim, I bet we cannot
predict the future, for any workloads. Does it make sense?

> > Highlights from the discussions on v1
> > =====================================
> > Thanks to Ying Huang and Dave Hansen for the comments and suggestions
> > on page table scanning.
> >
> > A simple worst-case scenario test did not find page table scanning
> > underperforms the rmap because of the following optimizations:
> > 1) It will not scan page tables from processes that have been sleeping
> >    since the last scan.
> > 2) It will not scan PTE tables under non-leaf PMD entries that do not
> >    have the accessed bit set, when
> >    CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
> > 3) It will not zigzag between the PGD table and the same PMD or PTE
> >    table spanning multiple VMAs. In other words, it finishes all the
> >    VMAs with the range of the same PMD or PTE table before it returns
> >    to the PGD table. This optimizes workloads that have large numbers
> >    of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.
> >
> > TLDR
> > ====
> > The current page reclaim is too expensive in terms of CPU usage and
> > often makes poor choices about what to evict. We would like to offer
> > an alternative framework that is performant, versatile and
> > straightforward.
> >
> > Repo
> > ====
> > git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/73/1173/1
> >
> > Gerrit https://linux-mm-review.googlesource.com/c/page-reclaim/+/1173
> >
> > Background
> > ==========
> > DRAM is a major factor in total cost of ownership, and improving
> > memory overcommit brings a high return on investment.
>
> RAM cost on one hand.
>
> On the other, paging backends have seen a revolutionary explosion in
> iop/s capacity from solid state devices and CPUs that allow in-memory
> compression at scale, so a higher rate of paging (semi-random IO) and
> thus larger levels of overcommit are possible than ever before.
>
> There is a lot of new opportunity here.
>
> > Over the past decade of research and experimentation in memory
> > overcommit, we observed a distinct trend across millions of servers
> > and clients: the size of page cache has been decreasing because of
> > the growing popularity of cloud storage. Nowadays anon pages account
> > for more than 90% of our memory consumption and page cache contains
> > mostly executable pages.
>
> This gives the impression that because the number of setups heavily
> using the page cache has reduced somewhat, its significance is waning
> as well. I don't think that's true. I think we'll continue to have
> mainstream workloads for which the page cache is significant.
>
> Yes, the importance of paging anon memory more efficiently (or paging
> it at all again, for that matter), has increased dramatically. But IMO
> not because it's more prevalent, but rather because of the increase in
> paging capacity from the hardware side. It's not like we've been
> heavily paging filesystem data beyond cold starts either when it was
> more prevalent - workloads quickly fall apart when you do that on
> rotating drives.
>
> So that increase in paging capacity also applies to filesystem data,
> and makes local filesystems an option again where they might have been
> replaced by anonymous blobs managed by a userspace network filesystem.
>
> Take disaggregated storage for example. It's an attractive measure for
> reducing per-host CAPEX when the alternative is a local spindle, whose
> seekiness doesn't make the network distance look so bad, and prevents
> significant memory overcommit anyway. You have to spec the same RAM in
> either case.
>
> The equation is different for flash. You can *significantly* reduce
> RAM needs of even latency-sensitive, interactive workloads with cheap,
> consumer-grade local SSD drives. Disaggregating those drives and
> adding the network to the paging path would directly eat into the much
> higher RAM savings. It's a much less attractive proposition now. And
> that's bringing larger data sets back to local filesystems.
>
> And of course, even in cloud and disaggregated environments, there ARE
> those systems that deal with things like source code trees -
> development machines, build hosts etc. For those, filesystem data
> continues to be the primary workload.
>
> So while I agree with what you say about anon pages, I don't expect
> non-trivial (local) filesystem loads to go away anytime soon. The
> kernel needs to continue treating it as a first-class citizen.
>
> > Problems
> > ========
> > Notion of active/inactive
> > -------------------------
> > For servers equipped with hundreds of gigabytes of memory, the
> > granularity of the active/inactive is too coarse to be useful for job
> > scheduling. False active/inactive rates are relatively high, and thus
> > the assumed savings may not materialize.
>
> The inactive/active naming is certainly confusing for users of the
> system. The kernel uses it to preselect reclaim candidates, it's not
> meant to indicate how much memory capacity is idle and available.
>
> But a confusion around naming doesn't necessarily indicate it's bad at
> what it is actually designed to do.
>
> Fundamentally, LRU ordering is susceptible to a flood of recent pages
> with no reuse pushing out the established frequent pages. The split
> into inactive and active is simply there to address this shortcoming,
> and protect frequent pages from recent ones - where pages that are
> only accessed once get reclaimed before pages used twice or more.
>
> Obviously, 'twice or more' is a coarse category, and it's not hard to
> imagine that it might go wrong. But please, don't leave it up to the
> imagination ;-) It's been in use for two decades or so, it needs a bit
> more in-depth analysis of its limitations to justify replacing it.
>
> > For phones and laptops, executable pages are frequently evicted
> > despite the fact that there are many less recently used anon pages.
> > Major faults on executable pages cause "janks" (slow UI renderings)
> > and negatively impact user experience.
>
> This is not because of the inactive/active scheme but rather because
> of the anon/file split, which has evolved over the years to just not
> swap onto iop-anemic rotational drives.
>
> We ran into the same issue at FB too, where even with painfully
> obvious anon candidates and a fast paging backend the kernel would
> happily thrash on the page cache instead.
>
> There has been significant work in this area recently to address this
> (see commit 5df741963d52506a985b14c4bcd9a25beb9d1981). We've added
> extensive testing and production time onto these patches since and
> have not found the kernel to be thrashing executables or be reluctant
> to go after anonymous pages anymore.
>
> I wonder if your observation takes these recent changes into account?

Again, I agree with all you said above. And I can confirm your series
has generally fixed the problem for the following test case.

When our most common 4GB Chromebook model is zram-ing under memory
pressure, the size of the file lru is
  ~80MB without that series
  ~120MB with that series
  ~140MB with this series

User experience is acceptable as long as the size is above 100MB. For
optimal user experience, the size is 200MB. But we do not expect the
optimal user experience under memory pressure.

> > For lruvecs from different memcgs or nodes, comparisons are impossible
> > due to the lack of a common frame of reference.
>
> My first thought is that this is expected. Workloads running under
> different memory constraints, IO priority levels etc. will not have
> comparable workingsets: an access frequency that is considered high in
> one domain could be considered quite cold in another.
>
> Could you elaborate a bit on the situations where you would want to
> compare, and how this is possible by having more generations?

Will cover this in the discussion of generations.

> > Solutions
> > =========
> > Notion of generation numbers
> > ----------------------------
> > The notion of generation numbers introduces a quantitative approach to
> > memory overcommit. A larger number of pages can be spread out across
> > a configurable number of generations, and each generation includes all
> > pages that have been referenced since the last generation. This
> > improved granularity yields relatively low false active/inactive
> > rates.
> >
> > Given an lruvec, scans of anon and file types and selections between
> > them are all based on direct comparisons of generation numbers, which
> > are simple and yet effective. For different lruvecs, comparisons are
> > still possible based on birth times of generations.
>
> This describes *what* it's doing, but could you elaborate more on how
> to think about generations in relation to workload behavior and what
> you can predict based on how your workload gets bucketed into these?
>
> If we accept that the current two generations are not enough, how many
> should there be instead? Four? Ten?
>
> What determines this? Is it the workload's access pattern? Or the
> memory size?
>
> How do I know whether the number of generations I have chosen is right
> for my setup? How do I detect when the underlying factors changed and
> it no longer is?
>
> How does it manifest if I have too few generations? What about too
> many?
>
> What about systems that host a variety of workloads that come and go?
> Is there a generation number that will be good for any combination of
> workloads on the system as jobs come and go?
>
> For a general purpose OS like Linux, it's nice to be *able* to tune to
> your specific requirements, but it's always bad to *have* to. Whatever
> we end up doing, there needs to be some reasonable default behavior
> that works acceptably for a broad range of workloads out of the box.

=== generations ===

All good questions. Let me start abstractly and give concrete examples
afterward.

Remark 1: the number of generations only naturally grows to three,
unless users artificially create more for the purpose of working set
estimation.

Why three? We add pages mapped upon page faults to the youngest
generation, since we need to age them before we can evict them. After
we scan them once and clear the accessed bit set during the initial
faults, they become the second youngest generation. And we still
cannot evict them because we have not ascertained whether they are
inactive. We can only be sure after the second scan. Thereafter they
become the third youngest generation, if the accessed bit is not set.
The third youngest generation is also the oldest, in this case.

I suppose this is not surprising, as it simply follows the current
implementation. This is also why only the youngest and second youngest
generation are considered active, in order to be compatible with the
active/inactive notion. As long as we have something to evict, we do
not need to create more generations. IOW, we only create a new
generation when we are down to the minimum number of generations,
i.e., two, which is equivalent to being out of inactive pages, when
compared with the current implementation.
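
The walk-through above can be replayed with a toy model. This is a
minimal Python sketch, not code from the patchset: ToyLruvec and its
methods are hypothetical names, and a single min_seq stands in for the
per-type min_seq[2]. It assumes a scan stamps referenced pages with
max_seq, clears their accessed bits and then increments max_seq, and
that aging is due once only two generations are left.

```python
MIN_NR_GENS = 2  # "the minimum number of generations, i.e., two"


class ToyLruvec:
    """Toy model that tracks generation numbers only (hypothetical names)."""

    def __init__(self):
        self.max_seq = 1       # youngest generation
        self.min_seq = 0       # oldest generation (one type only, for brevity)
        self.gen = {}          # page -> generation number
        self.accessed = set()  # pages whose accessed bit is currently set

    def fault_in(self, page):
        # Pages mapped upon page faults join the youngest generation.
        self.gen[page] = self.max_seq
        self.accessed.add(page)

    def touch(self, page):
        # A later access sets the accessed bit again.
        self.accessed.add(page)

    def scan(self):
        # Stamp referenced pages with max_seq, clear the accessed bits,
        # then open a new youngest generation.
        for page in self.accessed:
            self.gen[page] = self.max_seq
        self.accessed.clear()
        self.max_seq += 1

    def rank(self, page):
        # 0 = youngest, 1 = second youngest, 2 = third youngest, ...
        return self.max_seq - self.gen[page]

    def aging_due(self):
        # A new generation is only needed when we are down to two.
        return self.max_seq - self.min_seq + 1 <= MIN_NR_GENS


lru = ToyLruvec()
lru.fault_in("hot")
lru.fault_in("cold")
assert lru.aging_due()    # two generations: nothing is evictable yet

lru.scan()                # first scan clears the fault-time accessed bits
lru.touch("hot")          # only "hot" is referenced again
lru.scan()                # second scan

print(lru.rank("hot"), lru.rank("cold"))  # prints "1 2"
```

In this model "cold" ends up in the third youngest generation after
two scans, which is the first point at which it can be treated as
inactive, while "hot" stays in the second youngest.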

And why do we need generations in this case? It is because they help
answer the question of when we need to scan active pages. We could
reuse inactive_is_low(). But the number of generations seems to be
more deterministic than the magic numbers in inactive_is_low().

But do users need to configure the number of generations? The answer
is no. Everything works out of the box, unless they are interested in the
following.

Remark 2: generations provide a temporal dimension; each generation is
a dot on the timeline.

This is designed for large scale deployments, i.e., data centers that
want to monitor their memory utilization for resource planning;
fleetwide working set estimation for optimal job scheduling, basically
for users who need a set of stats that they can aggregate.

Aggregating the active/inactive across a fleet of machines yields
nothing interesting. But generations are associated with timestamps,
and if they are artificially created at a steady pace, say every two
minutes, then their aggregation tells a lot. I will cover this more in
the use case section.
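
To make the aggregation concrete, here is a rough Python sketch. The
input format (a list of (age, size) pairs per machine, with age
measured from each generation's birth time), the function names and
the numbers are all made up for illustration; only the idea of
summing generations by birth time comes from the text above.

```python
from collections import defaultdict


def working_set_bytes(generations, window_s):
    """Sum the sizes of generations born within the last `window_s` seconds.

    `generations` is a list of (age_s, size_bytes) pairs, where age_s is the
    time since the generation's birth. Pages in such a generation have been
    referenced since that birth, hence within the window.
    """
    return sum(size for age_s, size in generations if age_s <= window_s)


def fleet_histogram(machines, windows_s):
    """Aggregate per-machine estimates into fleetwide totals per window."""
    totals = defaultdict(int)
    for generations in machines.values():
        for window_s in windows_s:
            totals[window_s] += working_set_bytes(generations, window_s)
    return dict(totals)


# Hypothetical stats: generations created every two minutes on two machines.
GB = 1 << 30
machines = {
    "m1": [(120, 4 * GB), (240, 2 * GB), (360, 10 * GB)],
    "m2": [(120, 1 * GB), (240, 1 * GB), (360, 20 * GB)],
}
hist = fleet_histogram(machines, windows_s=[120, 240, 360])
for window_s, total in sorted(hist.items()):
    print(f"referenced within {window_s}s: {total // GB} GB")
```

Because every machine cuts its generations at the same pace, the
per-window totals are comparable across machines, which is not the
case for raw active/inactive counts.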

This principle also applies to memcgs or nodes, from the same machine
or different ones.

The same type of job can run concurrently on different machines and
each machine has a memcg for this job. To gain some insight into this
type of job, users collect a set of stats from those memcgs, and based
on this set, they want to predict how much memory this type of job
typically requires. In our case, it is called Autopilot. Users would
not be able to achieve this if there is not a metric system or a
common frame of reference for the stats in this set.

Similarly, if users want to select an optimal node for a job, they
need to compare all nodes, in order to determine which one has the
least amount of active pages.

Remark 3: architecturally, generations glue everything together.

When we scan page tables, we only update the generation number counter
in page->flags, without isolating the page. This is different from
what we have been doing, e.g., activate_page().
Tiers also rely on generations, because they need a temporal dimension
to sort out refaults from different generations. Needless to say,
refaults from younger generations are worse than those from older
generations, i.e., the former have shorter refault distances than the
latter. (Refault distance is a metric we use internally to measure
page selection quality.)
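
As an illustration of the generation counter in page->flags, here is
a small Python sketch of a bit field of order_base_2(MAX_NR_GENS+1)
bits, the width given in the cover letter. The values of MAX_NR_GENS
and GEN_SHIFT and the helper names are assumptions. Storing the
truncated number plus one, so that zero can mean "no generation", is
also an assumption; it is one plausible reading of the "+1" in the
width. The tier() helper follows the cover letter's rule that pages
accessed N times via file descriptors belong to tier order_base_2(N).

```python
MAX_NR_GENS = 4  # assumed value, for illustration only
GEN_SHIFT = 8    # hypothetical bit offset of the field within the flags word


def order_base_2(n):
    """ceil(log2(n)) for n >= 1, mirroring the kernel macro of the same name."""
    return (n - 1).bit_length()


GEN_WIDTH = order_base_2(MAX_NR_GENS + 1)
GEN_MASK = ((1 << GEN_WIDTH) - 1) << GEN_SHIFT


def set_gen(flags, seq):
    """Return flags with the truncated generation number of `seq` stored."""
    truncated = seq % MAX_NR_GENS
    # Assumption: store truncated + 1 so that 0 can mean "no generation".
    return (flags & ~GEN_MASK) | ((truncated + 1) << GEN_SHIFT)


def get_gen(flags):
    """Return the truncated generation number, or None if none is stored."""
    field = (flags & GEN_MASK) >> GEN_SHIFT
    return field - 1 if field else None


def tier(nr_accesses):
    """Tier of a page accessed N times via file descriptors."""
    return order_base_2(nr_accesses)


flags = 0b1010_0101            # unrelated low flag bits must be preserved
flags = set_gen(flags, seq=6)  # 6 % 4 == 2
print(GEN_WIDTH, get_gen(flags), flags & 0xFF == 0b1010_0101)  # 3 2 True
print([tier(n) for n in (1, 2, 3, 4, 5)])  # [0, 1, 2, 2, 3]
```

Updating such a field touches only the flags word, which is why a
page table scan can record a new generation number without isolating
the page.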

So generally it would only be more difficult, if we split things up
while trying to retain the same amount of benefits.

> > Differential scans via page tables
> > ----------------------------------
> > Each differential scan discovers all pages that have been referenced
> > since the last scan. Specifically, it walks the mm_struct list
> > associated with an lruvec to scan page tables of processes that have
> > been scheduled since the last scan. The cost of each differential scan
> > is roughly proportional to the number of referenced pages it
> > discovers. Unless address spaces are extremely sparse, page tables
> > usually have better memory locality than the rmap. The end result is
> > generally a significant reduction in CPU usage, for workloads using a
> > large amount of anon memory.
> >
> > Our real-world benchmark that browses popular websites in multiple
> > Chrome tabs demonstrates 51% less CPU usage from kswapd and 52% (full)
> > less PSI on v5.11. With this patchset, kswapd profile looks like:
> >   49.36%  lzo1x_1_do_compress
> >    4.54%  page_vma_mapped_walk
> >    4.45%  memset_erms
> >    3.47%  walk_pte_range
> >    2.88%  zram_bvec_rw
> >
> > In addition, direct reclaim latency is reduced by 22% at 99th
> > percentile and the number of refaults is reduced by 7%. Both metrics
> > are important to phones and laptops as they are correlated to user
> > experience.
>
> This looks very exciting!
>
> However, this seems to be an improvement completely in its own right:
> getting the mapped page access information in a more efficient way.
>
> Is there anything that ties it to the multi-generation LRU that I may
> be missing here? Or could it simply be a drop-in replacement for rmap
> that gives us the CPU savings right away?

Covered in the discussion of generations.

> > Framework
> > =========
> > For each lruvec, evictable pages are divided into multiple
> > generations. The youngest generation number is stored in
> > lruvec->evictable.max_seq for both anon and file types as they are
> > aged on an equal footing. The oldest generation numbers are stored in
> > lruvec->evictable.min_seq[2] separately for anon and file types as
> > clean file pages can be evicted regardless of may_swap or
> > may_writepage. Generation numbers are truncated into
> > order_base_2(MAX_NR_GENS+1) bits in order to fit into page->flags. The
> > sliding window technique is used to prevent truncated generation
> > numbers from overlapping. Each truncated generation number is an index
> > to lruvec->evictable.lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES].
> > Evictable pages are added to the per-zone lists indexed by max_seq or
> > min_seq[2] (modulo MAX_NR_GENS), depending on whether they are being
> > faulted in.
> >
> > Each generation is then divided into multiple tiers. Tiers represent
> > levels of usage from file descriptors only. Pages accessed N times via
> > file descriptors belong to tier order_base_2(N). In contrast to moving
> > across generations which requires the lru lock, moving across tiers
> > only involves an atomic operation on page->flags and therefore has a
> > lower cost. A feedback loop modeled after the well-known PID
> > controller monitors the refault rates across all tiers and decides
> > when to activate pages from which tiers on the reclaim path.
> >
> > The framework comprises two conceptually independent components: the
> > aging and the eviction, which can be invoked separately from user
> > space.
>
> Why from userspace?

Will cover this in the discussion of use cases.

> > Aging
> > -----
> > The aging produces young generations. Given an lruvec, the aging scans
> > page tables for referenced pages of this lruvec. Upon finding one, the
> > aging updates its generation number to max_seq. After each round of
> > scan, the aging increments max_seq.
> >
> > The aging maintains either a system-wide mm_struct list or per-memcg
> > mm_struct lists and tracks whether an mm_struct is being used or has
> > been used since the last scan. Multiple threads can concurrently work
> > on the same mm_struct list, and each of them will be given a different
> > mm_struct belonging to a process that has been scheduled since the
> > last scan.
> >
> > The aging is due when both of min_seq[2] reach max_seq-1, assuming
> > both anon and file types are reclaimable.
>
> As per above, this is centered around mapped pages, but it really
> needs to include a detailed answer for unmapped pages, such as page
> cache and shmem/tmpfs data, as well as how sampled page table
> references behave wrt realtime syscall references.

Covered in the discussion of tiers.

> > Eviction
> > --------
> > The eviction consumes old generations. Given an lruvec, the eviction
> > scans the pages on the per-zone lists indexed by either of min_seq[2].
> > It first tries to select a type based on the values of min_seq[2].
> > When anon and file types are both available from the same generation,
> > it selects the one that has a lower refault rate.
> >
> > During a scan, the eviction sorts pages according to their generation
> > numbers, if the aging has found them referenced. It also moves pages
> > from the tiers that have higher refault rates than tier 0 to the next
> > generation.
> >
> > When it finds all the per-zone lists of a selected type are empty, the
> > eviction increments min_seq[2] indexed by this selected type.
> >
> > Use cases
> > =========
> > On Android, our most advanced simulation that generates memory
> > pressure from realistic user behavior shows 18% fewer low-memory
> > kills, which in turn reduces cold starts by 16%.
>
> I assume you refer to pressure-induced lmkd kills rather than
> conventional kernel OOM kills?
>
> I.e. multi-gen LRU does a better job of identifying the workingset,
> rather than giving up too early.
>
> Again, I would be interested if the baseline here includes the recent
> anon/file balancing rework or not.

Yes, lmkd, which is based on PSI.

No, the baseline did not include the rework. I will rerun the
simulation once we have enough devices running 5.10.

BTW, does the rework also improve PSI? If so, the Android team might
be interested in backporting it.

> > On Borg, a similar approach enables us to identify jobs that
> > underutilize their memory and downsize them considerably without
> > compromising any of our service level indicators.
>
> This is doable with the current reclaim implementation as well. At FB
> we drive proactive reclaim through cgroup control, in a feedback loop
> with psi metrics.
>
> Obviously, this would benefit from better workingset identification in
> the kernel, as more memory could be offloaded under the same pressure
> tolerances from the workload, but it's more of an optimization than
> enabling a uniquely new usecase.

=== use case ===

Thanks for sharing this information. Fleetwide efficiency is my
favorite topic! And I like your model -- it is very straightforward.

However, there are a few constraints that prohibit us from adopting it.

Remark 1: for systems with almost all of the pages mapped, proactive
reclaim using the current interface is unaffordable because of the
overhead from the rmap.

For systems with a fair number of unmapped pages, proactive reclaim
can drop some of them at a low cost. But for systems with almost all
of the pages mapped, proactive reclaim needs to walk the rmap to clear
the accessed bit. The following profile demonstrates such an overhead
when we proactively zram pages that have not been used for more than
two minutes from a system that has 99% of the pages mapped (~500GB,
moderate pressure):

 41.23%  page_vma_mapped_walk
  6.12%  do_raw_spin_lock
  5.23%  vma_interval_tree_iter_next
  4.23%  vma_interval_tree_subtree_search
  2.97%  page_referenced_one
  2.29%  lzo1x_1_do_compress

For what we profile, page_vma_mapped_walk() consumes the highest
amount of CPU among all kernel functions.

Remark 2: for optimal job scheduling, users need to predict whether a
job can land on a machine successfully without actually impacting the
existing jobs.

For example, given a pool of candidates, a job scheduler periodically
calls an aging interface provided by the kernel, in order to estimate
the working set of each candidate. And it ranks the candidates based
on their working sets. Candidates can be individual machines or nodes,
in case this job scheduler is NUMA aware. (Ours is.)

This means that working set estimation and proactive reclaim have to
be separate functions. If we bundle them, this job scheduler would
have to sacrifice the performance of the existing jobs for something
that may or may not come true.
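
A sketch of the ranking step, in Python. The candidate names, the
capacities and the (capacity, working set) layout are hypothetical;
the working set figures are assumed to come from aging alone, so
ranking does not reclaim anything from the existing jobs.

```python
def rank_candidates(candidates, job_bytes):
    """Return names of candidates that can fit the job, least loaded first.

    `candidates` maps a machine or node name to a pair
    (capacity_bytes, working_set_bytes).
    """
    fits = [
        (working_set, name)
        for name, (capacity, working_set) in candidates.items()
        if capacity - working_set >= job_bytes
    ]
    return [name for _, name in sorted(fits)]


GB = 1 << 30
candidates = {
    "node0": (64 * GB, 60 * GB),
    "node1": (64 * GB, 20 * GB),
    "node2": (64 * GB, 35 * GB),
}
print(rank_candidates(candidates, job_bytes=16 * GB))  # ['node1', 'node2']
```

Eviction only needs to happen on the candidate that is finally
chosen, if at all, which is the reason for keeping the two functions
separate.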

Remark 3: for optimal fleet efficiency, users need to avoid proactive
reclaim unless they plan to use the savings for additional workloads.

Why would users want to proactively reclaim memory if they have no
plan to run additional workloads? The only reason might be that they
are not confident in the ability of the page reclaim, i.e., they do
not know whether it will give them what they need quickly enough when
they really need it. I cannot think of any other reason at the moment
:)

> > On Chrome OS, our field telemetry reports 96% fewer low-memory tab
> > discards and 59% fewer OOM kills from fully-utilized devices and no
> > regressions in monitored user experience from underutilized devices.
>
> Again, lmkd rather than kernel oom kills, right? And with or without
> the anon/file rework?

Yes, lmkd.

No, the baseline does not include the rework. But in this case it
should not matter. We have been carrying the following patch, which
protects the file lru from going below a certain threshold. Let me run
an a/b experiment on 5.10, i.e., with/without the patch, to make sure.

https://lore.kernel.org/linux-mm/20101028191523.GA14972@google.com/

> > Working set estimation
> > ----------------------
> > User space can invoke the aging by writing "+ memcg_id node_id gen
> > [swappiness]" to /sys/kernel/debug/lru_gen. This debugfs interface
> > also provides the birth time and the size of each generation.
> >
> > Proactive reclaim
> > -----------------
> > User space can invoke the eviction by writing "- memcg_id node_id gen
> > [swappiness] [nr_to_reclaim]" to /sys/kernel/debug/lru_gen. Multiple
> > command lines are supported, as is concatenation with delimiters.
>
> Can you explain a bit more how these two are supposed to be used?
>
> The memcg id is self-explanatory: Age or evict pages from this
> particular workload.
>
> The node is a bit less intuitive. In most setups, the distance to a
> remote NUMA node is much smaller than the distance to the storage
> backend, and users would prefer finding and evicting the coldest
> memory between multiple nodes, not within individual node.

But storage backends could be something fast, e.g., zram or zswap in
our case. And we prefer to save cold pages in zram or zswap, so when
they become hot, they will be brought back to the same node. If we
migrate them to a different node, we have no way to migrate them back
instantaneously when they become hot.

> Swappiness raises a similar question. Why would the user prefer one
> type of data to be reclaimed over the other? Shouldn't it want to
> reclaim the pages that are least likely to be used again soon?

We also need to consider how applications perceive the delays from an
anonymous page fault and a buffered io read differently. Even though
these two have the same cost, the delay from an anonymous page fault
may hurt applications more. For example, Chrome is aware that buffered
io reads can be blocking, and it delegates the work to io threads,
e.g., non-UI threads, so the delay will not affect user experience.
Does it make sense?
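
For reference, the two command formats quoted above can be assembled
as follows. This is a Python sketch with hypothetical helper names
and arbitrary placeholder values; it only builds the strings and does
not touch debugfs. It assumes the optional fields are positional, so
nr_to_reclaim cannot be given without swappiness.

```python
def aging_cmd(memcg_id, node_id, gen, swappiness=None):
    """Build "+ memcg_id node_id gen [swappiness]"."""
    fields = ["+", memcg_id, node_id, gen]
    if swappiness is not None:
        fields.append(swappiness)
    return " ".join(str(f) for f in fields)


def eviction_cmd(memcg_id, node_id, gen, swappiness=None, nr_to_reclaim=None):
    """Build "- memcg_id node_id gen [swappiness] [nr_to_reclaim]"."""
    if nr_to_reclaim is not None and swappiness is None:
        # The optional fields are positional, so the second needs the first.
        raise ValueError("nr_to_reclaim requires swappiness")
    fields = ["-", memcg_id, node_id, gen]
    fields += [f for f in (swappiness, nr_to_reclaim) if f is not None]
    return " ".join(str(f) for f in fields)


cmds = [
    aging_cmd(memcg_id=1, node_id=0, gen=5),
    eviction_cmd(memcg_id=1, node_id=0, gen=3, swappiness=60, nr_to_reclaim=1024),
]
print(cmds)  # ['+ 1 0 5', '- 1 0 3 60 1024']
```

The resulting strings are what would be written to
/sys/kernel/debug/lru_gen on a kernel carrying this series.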

> > FAQ
> > ===
> > Why not try to improve the existing code?
> > -----------------------------------------
> > We have tried but concluded the aforementioned problems are
> > fundamental, and therefore changes made on top of them will not result
> > in substantial gains.
>
> Realistically, I think incremental changes are unavoidable to get this
> merged upstream.
>
> Not just in the sense that they need to be smaller changes, but also
> in the sense that they need to replace old code. It would be
> impossible to maintain both, focus development and testing resources,
> and provide a reasonably stable experience with both systems tugging
> at a complicated shared code base.
>
> On the other hand, the existing code also has billions of hours of
> production testing and tuning. We can't throw this all out overnight -
> it needs to be surgical and the broader consequences of each step need
> to be well understood.
>
> We also have millions of servers relying on being able to do upgrades
> for drivers and fixes in other subsystems that we can't put on hold
> until we stabilized a new reclaim implementation from scratch.
>
> The good thing is that swap really hasn't been used much
> recently. There definitely is room to maneuver without being too
> disruptive. There *are* swap configurations today, but for the most
> part, users don't expect the kernel to swap until the machine is under
> heavy pressure. Few have expectations of it doing a nuanced and
> efficient memory offloading job under nominal loads. So the anon side
> could well be a testbed for the multigen LRU that has a more
> reasonable blast radius than doing everything at once.
>
> And if the rmap replacement for mapped pages could be split out as a
> CPU optimization for getting MMU info, without changing how those are
> interpreted in the same step, I think we'd get into a more manageable
> territory with this proposal.

Yeah, I hear you loud and clear. We are not really writing off any
options here, just weighing them in terms of opportunity cost. The
engineering effort is one of the major factors, but the performance
gain and the lead time are also very important to us.

IMO, it would be hard to make substantial progress if we just float
ideas around. We could use something concrete to keep the discussion
going. I am not saying this patchset should be the storyline. But at
least it can serve as the springboard, hopefully launching us to a
middle ground. Does it sound reasonable?

Again, thanks for the detailed review. You have made some excellent
points. I think I have made some good ones too. Hopefully you
would agree. In any case, feel free to let me know.

Konstantin Kharlamov April 29, 2021, 11:46 p.m. UTC | #38

In case you still need it, this series is:

Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>

My success story: I have Archlinux with 8G RAM + zswap + swap. While developing,
I have lots of apps opened such as multiple LSP-servers for different langs,
chats, two browsers, etc… Usually, my system gets quickly to a point of SWAP-
storms, where I have to kill LSP-servers, restart browsers to free memory, etc,
otherwise the system lags heavily and is barely usable.

1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU patchset, and I
started up by opening lots of apps to create memory pressure, and worked for a
day like this. Till now I had *not a single SWAP-storm*, and mind you I got 3.4G
in SWAP. I was never getting to the point of 3G in SWAP before without a single
SWAP-storm.

Right now my gf on Fedora 33 also suffers from SWAP-storms on her old 2013
Macbook with 4G RAM + zswap + swap. I think next week I'll build 5.12 + the
LRU patchset for her as well. We'll see how it goes; I expect it will improve
her experience a lot too.

P.S.: upon replying please keep me CCed, I'm not subscribed to the list

On Tue, 2021-04-13 at 00:56 -0600, Yu Zhao wrote:
> What's new in v2
> ================
> Special thanks to Jens Axboe for reporting a regression in buffered
> I/O and helping test the fix.
> 
> This version includes the support of tiers, which represent levels of
> usage from file descriptors only. Pages accessed N times via file
> descriptors belong to tier order_base_2(N). Each generation contains
> at most MAX_NR_TIERS tiers, and they require additional MAX_NR_TIERS-2
> bits in page->flags. In contrast to moving across generations which
> requires the lru lock, moving across tiers only involves an atomic
> operation on page->flags and therefore has a negligible cost. A
> feedback loop modeled after the well-known PID controller monitors the
> refault rates across all tiers and decides when to activate pages from
> which tiers, on the reclaim path.
> 
> This feedback model has a few advantages over the current feedforward
> model:
> 1) It has a negligible overhead in the buffered I/O access path
>    because activations are done in the reclaim path.
> 2) It takes mapped pages into account and avoids overprotecting pages
>    accessed multiple times via file descriptors.
> 3) More tiers offer better protection to pages accessed more than
>    twice when buffered-I/O-intensive workloads are under memory
>    pressure.
> 
> The fio/io_uring benchmark shows 14% improvement in IOPS when randomly
> accessing Samsung PM981a in the buffered I/O mode.
> 
> Highlights from the discussions on v1
> =====================================
> Thanks to Ying Huang and Dave Hansen for the comments and suggestions
> on page table scanning.
> 
> A simple worst-case scenario test did not find that page table
> scanning underperforms the rmap, because of the following optimizations:
> 1) It will not scan page tables from processes that have been sleeping
>    since the last scan.
> 2) It will not scan PTE tables under non-leaf PMD entries that do not
>    have the accessed bit set, when
>    CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
> 3) It will not zigzag between the PGD table and the same PMD or PTE
>    table spanning multiple VMAs. In other words, it finishes all the
>    VMAs within the range of the same PMD or PTE table before it returns
>    to the PGD table. This optimizes workloads that have large numbers
>    of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.
> 
> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and
> often makes poor choices about what to evict. We would like to offer
> an alternative framework that is performant, versatile and
> straightforward.
> 
> Repo
> ====
> git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/73/1173/1
> 
> Gerrit https://linux-mm-review.googlesource.com/c/page-reclaim/+/1173
> 
> Background
> ==========
> DRAM is a major factor in total cost of ownership, and improving
> memory overcommit brings a high return on investment. Over the past
> decade of research and experimentation in memory overcommit, we
> observed a distinct trend across millions of servers and clients: the
> size of page cache has been decreasing because of the growing
> popularity of cloud storage. Nowadays anon pages account for more than
> 90% of our memory consumption and page cache contains mostly
> executable pages.
> 
> Problems
> ========
> Notion of active/inactive
> -------------------------
> For servers equipped with hundreds of gigabytes of memory, the
> granularity of the active/inactive is too coarse to be useful for job
> scheduling. False active/inactive rates are relatively high, and thus
> the assumed savings may not materialize.
> 
> For phones and laptops, executable pages are frequently evicted
> despite the fact that there are many less recently used anon pages.
> Major faults on executable pages cause "janks" (slow UI renderings)
> and negatively impact user experience.
> 
> For lruvecs from different memcgs or nodes, comparisons are impossible
> due to the lack of a common frame of reference.
> 
> Incremental scans via rmap
> --------------------------
> Each incremental scan picks up where the last scan left off and
> stops after it has found a handful of unreferenced pages. For
> workloads using a large amount of anon memory, incremental scans lose
> the advantage under sustained memory pressure due to high ratios of
> the number of scanned pages to the number of reclaimed pages. In our
> case, the average ratio of pgscan to pgsteal is above 7.
> 
> On top of that, the rmap has poor memory locality due to its complex
> data structures. The combined effects typically result in a high
> amount of CPU usage in the reclaim path. For example, with zram, a
> typical kswapd profile on v5.11 looks like:
>   31.03%  page_vma_mapped_walk
>   25.59%  lzo1x_1_do_compress
>    4.63%  do_raw_spin_lock
>    3.89%  vma_interval_tree_iter_next
>    3.33%  vma_interval_tree_subtree_search
> 
> And with real swap, it looks like:
>   45.16%  page_vma_mapped_walk
>    7.61%  do_raw_spin_lock
>    5.69%  vma_interval_tree_iter_next
>    4.91%  vma_interval_tree_subtree_search
>    3.71%  page_referenced_one
> 
> Solutions
> =========
> Notion of generation numbers
> ----------------------------
> The notion of generation numbers introduces a quantitative approach to
> memory overcommit. A larger number of pages can be spread out across
> a configurable number of generations, and each generation includes all
> pages that have been referenced since the last generation. This
> improved granularity yields relatively low false active/inactive
> rates.
> 
> Given an lruvec, scans of anon and file types and selections between
> them are all based on direct comparisons of generation numbers, which
> are simple and yet effective. For different lruvecs, comparisons are
> still possible based on birth times of generations.
> 
> Differential scans via page tables
> ----------------------------------
> Each differential scan discovers all pages that have been referenced
> since the last scan. Specifically, it walks the mm_struct list
> associated with an lruvec to scan page tables of processes that have
> been scheduled since the last scan. The cost of each differential scan
> is roughly proportional to the number of referenced pages it
> discovers. Unless address spaces are extremely sparse, page tables
> usually have better memory locality than the rmap. The end result is
> generally a significant reduction in CPU usage, for workloads using a
> large amount of anon memory.
> 
> Our real-world benchmark that browses popular websites in multiple
> Chrome tabs demonstrates 51% less CPU usage from kswapd and 52% (full)
> less PSI on v5.11. With this patchset, kswapd profile looks like:
>   49.36%  lzo1x_1_do_compress
>    4.54%  page_vma_mapped_walk
>    4.45%  memset_erms
>    3.47%  walk_pte_range
>    2.88%  zram_bvec_rw
> 
> In addition, direct reclaim latency is reduced by 22% at 99th
> percentile and the number of refaults is reduced by 7%. Both metrics
> are important to phones and laptops as they are correlated to user
> experience.
> 
> Framework
> =========
> For each lruvec, evictable pages are divided into multiple
> generations. The youngest generation number is stored in
> lruvec->evictable.max_seq for both anon and file types as they are
> aged on an equal footing. The oldest generation numbers are stored in
> lruvec->evictable.min_seq[2] separately for anon and file types as
> clean file pages can be evicted regardless of may_swap or
> may_writepage. Generation numbers are truncated into
> order_base_2(MAX_NR_GENS+1) bits in order to fit into page->flags. The
> sliding window technique is used to prevent truncated generation
> numbers from overlapping. Each truncated generation number is an index
> to lruvec->evictable.lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES].
> Evictable pages are added to the per-zone lists indexed by max_seq or
> min_seq[2] (modulo MAX_NR_GENS), depending on whether they are being
> faulted in.
> 
> Each generation is then divided into multiple tiers. Tiers represent
> levels of usage from file descriptors only. Pages accessed N times via
> file descriptors belong to tier order_base_2(N). In contrast to moving
> across generations which requires the lru lock, moving across tiers
> only involves an atomic operation on page->flags and therefore has a
> lower cost. A feedback loop modeled after the well-known PID
> controller monitors the refault rates across all tiers and decides
> when to activate pages from which tiers on the reclaim path.
> 
> The framework comprises two conceptually independent components: the
> aging and the eviction, which can be invoked separately from user
> space.
> 
> Aging
> -----
> The aging produces young generations. Given an lruvec, the aging scans
> page tables for referenced pages of this lruvec. Upon finding one, the
> aging updates its generation number to max_seq. After each round of
> scan, the aging increments max_seq.
> 
> The aging maintains either a system-wide mm_struct list or per-memcg
> mm_struct lists and tracks whether an mm_struct is being used or has
> been used since the last scan. Multiple threads can concurrently work
> on the same mm_struct list, and each of them will be given a different
> mm_struct belonging to a process that has been scheduled since the
> last scan.
> 
> The aging is due when both of min_seq[2] reach max_seq-1, assuming
> both anon and file types are reclaimable.
> 
> Eviction
> --------
> The eviction consumes old generations. Given an lruvec, the eviction
> scans the pages on the per-zone lists indexed by either of min_seq[2].
> It first tries to select a type based on the values of min_seq[2].
> When anon and file types are both available from the same generation,
> it selects the one that has a lower refault rate.
> 
> During a scan, the eviction sorts pages according to their generation
> numbers, if the aging has found them referenced. It also moves pages
> from the tiers that have higher refault rates than tier 0 to the next
> generation.
> 
> When it finds all the per-zone lists of a selected type are empty, the
> eviction increments min_seq[2] indexed by this selected type.
> 
> Use cases
> =========
> On Android, our most advanced simulation that generates memory
> pressure from realistic user behavior shows 18% fewer low-memory
> kills, which in turn reduces cold starts by 16%.
> 
> On Borg, a similar approach enables us to identify jobs that
> underutilize their memory and downsize them considerably without
> compromising any of our service level indicators.
> 
> On Chrome OS, our field telemetry reports 96% fewer low-memory tab
> discards and 59% fewer OOM kills from fully-utilized devices and no
> regressions in monitored user experience from underutilized devices.
> 
> Working set estimation
> ----------------------
> User space can invoke the aging by writing "+ memcg_id node_id gen
> [swappiness]" to /sys/kernel/debug/lru_gen. This debugfs interface
> also provides the birth time and the size of each generation.
> 
> Proactive reclaim
> -----------------
> User space can invoke the eviction by writing "- memcg_id node_id gen
> [swappiness] [nr_to_reclaim]" to /sys/kernel/debug/lru_gen. Multiple
> command lines are supported, as is concatenation with delimiters.
> 
> Intensive buffered I/O
> ----------------------
> Tiers are specifically designed to improve the performance of
> intensive buffered I/O under memory pressure. The fio/io_uring
> benchmark shows 14% improvement in IOPS when randomly accessing
> Samsung PM981a in buffered I/O mode.
> 
> For far memory tiering and NUMA-aware job scheduling, please refer to
> the reference section.
> 
> FAQ
> ===
> Why not try to improve the existing code?
> -----------------------------------------
> We have tried but concluded the aforementioned problems are
> fundamental, and therefore changes made on top of them will not result
> in substantial gains.
> 
> What particular workloads does it help?
> ---------------------------------------
> This framework is designed to improve the performance of the page
> reclaim under any type of workload.
> 
> How would it benefit the community?
> -----------------------------------
> Google is committed to promoting sustainable development of the
> community. We hope successful adoptions of this framework will
> steadily climb over time. To that end, we would be happy to learn your
> workloads and work with you case by case, and we will do our best to
> keep the repo fully maintained. For those whose workloads rely on the
> existing code, we will make sure you will not be affected in any way.
> 
> References
> ==========
> 1. Long-term SLOs for reclaimed cloud computing resources
>    https://research.google/pubs/pub43017/
> 2. Profiling a warehouse-scale computer
>    https://research.google/pubs/pub44271/
> 3. Evaluation of NUMA-Aware Scheduling in Warehouse-Scale Clusters
>    https://research.google/pubs/pub48329/
> 4. Software-defined far memory in warehouse-scale computers
>    https://research.google/pubs/pub48551/
> 5. Borg: the Next Generation
>    https://research.google/pubs/pub49065/
> 
> Yu Zhao (16):
>   include/linux/memcontrol.h: do not warn in page_memcg_rcu() if
>     !CONFIG_MEMCG
>   include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA
>   include/linux/huge_mm.h: define is_huge_zero_pmd() if
>     !CONFIG_TRANSPARENT_HUGEPAGE
>   include/linux/cgroup.h: export cgroup_mutex
>   mm/swap.c: export activate_page()
>   mm, x86: support the access bit on non-leaf PMD entries
>   mm/vmscan.c: refactor shrink_node()
>   mm: multigenerational lru: groundwork
>   mm: multigenerational lru: activation
>   mm: multigenerational lru: mm_struct list
>   mm: multigenerational lru: aging
>   mm: multigenerational lru: eviction
>   mm: multigenerational lru: page reclaim
>   mm: multigenerational lru: user interface
>   mm: multigenerational lru: Kconfig
>   mm: multigenerational lru: documentation
> 
>  Documentation/vm/index.rst        |    1 +
>  Documentation/vm/multigen_lru.rst |  192 +++
>  arch/Kconfig                      |    9 +
>  arch/x86/Kconfig                  |    1 +
>  arch/x86/include/asm/pgtable.h    |    2 +-
>  arch/x86/mm/pgtable.c             |    5 +-
>  fs/exec.c                         |    2 +
>  fs/fuse/dev.c                     |    3 +-
>  fs/proc/task_mmu.c                |    3 +-
>  include/linux/cgroup.h            |   15 +-
>  include/linux/huge_mm.h           |    5 +
>  include/linux/memcontrol.h        |    7 +-
>  include/linux/mm.h                |    2 +
>  include/linux/mm_inline.h         |  294 ++++
>  include/linux/mm_types.h          |  117 ++
>  include/linux/mmzone.h            |  118 +-
>  include/linux/nodemask.h          |    1 +
>  include/linux/page-flags-layout.h |   20 +-
>  include/linux/page-flags.h        |    4 +-
>  include/linux/pgtable.h           |    4 +-
>  include/linux/swap.h              |    5 +-
>  kernel/bounds.c                   |    6 +
>  kernel/events/uprobes.c           |    2 +-
>  kernel/exit.c                     |    1 +
>  kernel/fork.c                     |   10 +
>  kernel/kthread.c                  |    1 +
>  kernel/sched/core.c               |    2 +
>  mm/Kconfig                        |   55 +
>  mm/huge_memory.c                  |    5 +-
>  mm/khugepaged.c                   |    2 +-
>  mm/memcontrol.c                   |   28 +
>  mm/memory.c                       |   14 +-
>  mm/migrate.c                      |    2 +-
>  mm/mm_init.c                      |   16 +-
>  mm/mmzone.c                       |    2 +
>  mm/rmap.c                         |    6 +
>  mm/swap.c                         |   54 +-
>  mm/swapfile.c                     |    6 +-
>  mm/userfaultfd.c                  |    2 +-
>  mm/vmscan.c                       | 2580 ++++++++++++++++++++++++++++-
>  mm/workingset.c                   |  179 +-
>  41 files changed, 3603 insertions(+), 180 deletions(-)
>  create mode 100644 Documentation/vm/multigen_lru.rst
>

Konstantin Kharlamov April 30, 2021, 6:37 a.m. UTC | #39

Btw, I noticed a fun thing, an improvement. I don't know yet if it can be
attributed to 5.12 (which I haven't tried on its own yet) or to the LRU patchset,
but I'd assume the latter, because 5.12 doesn't seem to have had anything
interesting regarding memory performance¹.

I usually have Skype running in the background for work purposes, and it is only
used 2-3 times a week. So one would expect it to be one of the first victims of
memory reclaim. Unfortunately, I had never seen this actually happen (till now,
that is): all skypeforlinux processes routinely had 0 bytes in SWAP, and the
only circumstance under which its processes could get into SWAP was after
experiencing many SWAP-storms. It was so hard for the kernel to move these
unused processes to SWAP that at some point I even tried to research whether
there are any odd flags userspace may have set on a process to keep it in RAM,
just in case that's what happens to Skype (A: no, that wasn't the case; running
Skype in a memory-limited cgroup makes it swap. It's just that the kernel's
decisions were lacking for some reason).

So, anyway, I am delighted to see now that while testing this patchset, and
without encountering even a single SWAP-storm yet, skypeforlinux is one of the
processes residing in SWAP!!

     λ smem -kc "name user pid pss swap" | grep skype    
    skypeforlinux            constantine  1151    60.0K     7.5M 
    skypeforlinux            constantine  1215   195.0K     8.1M 
    skypeforlinux            constantine  1149   706.0K     7.5M 
    skypeforlinux            constantine  1148   743.0K     7.3M 
    skypeforlinux            constantine  1307     1.4M     8.0M 
    skypeforlinux            constantine  1213     2.1M    46.1M 
    skypeforlinux            constantine  1206    14.0M    10.8M 
    skypeforlinux            constantine   818    38.5M    34.3M 
    skypeforlinux            constantine  1242   103.2M    46.8M 

!!!
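
A minimal sketch for totalling the swap column of output like the above, assuming Python 3, the five-column layout requested by -c "name user pid pss swap", and that the K/M/G suffixes printed by smem -k are binary multiples. The function names are made up for illustration.

```python
# Assumption: unit suffixes are binary multiples (K = 1024 bytes).
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30}


def to_bytes(field: str) -> float:
    """Convert a value such as "7.5M" or "0" to bytes."""
    suffix = field[-1].upper()
    if suffix in UNITS:
        return float(field[:-1]) * UNITS[suffix]
    return float(field)


def total_swap(smem_output: str) -> float:
    """Sum the swap column over all process rows, in bytes."""
    total = 0.0
    for line in smem_output.splitlines():
        cols = line.split()
        # Expected columns: name user pid pss swap.
        # Rows whose pid is not numeric (e.g. a header) are skipped.
        if len(cols) == 5 and cols[2].isdigit():
            total += to_bytes(cols[4])
    return total
```

Applied to the nine rows above, the swap column adds up to roughly 176.4M.
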

1: https://kernelnewbies.org/Linux_5.12#Memory_management

On Fri, 2021-04-30 at 02:46 +0300, Konstantin Kharlamov wrote:
> In case you need it yet, this series is:
> 
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> 
> My success story: I have Archlinux with 8G RAM + zswap + swap. While developing,
> I have lots of apps open, such as multiple LSP-servers for different languages,
> chats, two browsers, etc… Usually, my system quickly gets to the point of
> SWAP-storms, where I have to kill LSP-servers, restart browsers to free memory,
> etc.; otherwise the system lags heavily and is barely usable.
> 
> 1.5 days ago I migrated from the 5.11.15 kernel to 5.12 + the LRU patchset, and I
> started out by opening lots of apps to create memory pressure, and worked for a
> day like this. Till now I have had *not a single SWAP-storm*, and mind you I have
> 3.4G in SWAP. Before, I never got to the point of 3G in SWAP without a single
> SWAP-storm.
> 
> Right now my gf on Fedora 33 also suffers from SWAP-storms on her old 2013
> MacBook with 4G RAM + zswap + swap. I think next week I'll build 5.12 + the
> LRU patchset for her as well. We'll see how it goes; I expect it will improve
> her experience by a lot too.
> 
> P.S.: upon replying please keep me CCed, I'm not subscribed to the list

Yu Zhao April 30, 2021, 7:31 p.m. UTC | #40

On Fri, Apr 30, 2021 at 12:38 AM Konstantin Kharlamov
<hi-angel@yandex.ru> wrote:
>
> Btw, I noticed a fun thing, an improvement. I don't know yet if it can be
> attributed to 5.12 (which I haven't tried on its own yet) or to the LRU patchset,
> but I'd assume the latter, because 5.12 doesn't seem to have had anything
> interesting regarding memory performance¹.

I appreciate the testing and the report. They mean a lot to us.

This improvement is to be expected, and it works both ways. There are
cases where swapping is not a good idea, for example, when building
large repos. Without this patchset, some of my browser memory usually
gets swapped out while tons of memory is used to cache files I don't
really care about.

I completely agree with you on the memory cgroup part: theoretically
it could work around the problem but nobody knows how much memory to
reserve for Skype or Firefox :)

I will keep you posted on the following developments.

Thanks!

> I usually have Skype running in the background for work purposes, and it is only
> used 2-3 times a week. So one would expect it to be one of the first victims of
> memory reclaim. Unfortunately, I had never seen this actually happen (till now,
> that is): all skypeforlinux processes routinely had 0 bytes in SWAP, and the
> only circumstance under which its processes could get into SWAP was after
> experiencing many SWAP-storms. It was so hard for the kernel to move these
> unused processes to SWAP that at some point I even tried to research whether
> there are any odd flags userspace may have set on a process to keep it in RAM,
> just in case that's what happens to Skype (A: no, that wasn't the case; running
> Skype in a memory-limited cgroup makes it swap. It's just that the kernel's
> decisions were lacking for some reason).
>
> So, anyway, I am delighted to see now that while testing this patchset, and
> without encountering even a single SWAP-storm yet, skypeforlinux is one of the
> processes residing in SWAP!!
>
>      λ smem -kc "name user pid pss swap" | grep skype
>     skypeforlinux            constantine  1151    60.0K     7.5M
>     skypeforlinux            constantine  1215   195.0K     8.1M
>     skypeforlinux            constantine  1149   706.0K     7.5M
>     skypeforlinux            constantine  1148   743.0K     7.3M
>     skypeforlinux            constantine  1307     1.4M     8.0M
>     skypeforlinux            constantine  1213     2.1M    46.1M
>     skypeforlinux            constantine  1206    14.0M    10.8M
>     skypeforlinux            constantine   818    38.5M    34.3M
>     skypeforlinux            constantine  1242   103.2M    46.8M
>
> !!!
>
> 1: https://kernelnewbies.org/Linux_5.12#Memory_management
>
> On Fri, 2021-04-30 at 02:46 +0300, Konstantin Kharlamov wrote:
> > In case you need it yet, this series is:
> >
> > Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> >
> > My success story: I have Archlinux with 8G RAM + zswap + swap. While developing,
> > I have lots of apps open, such as multiple LSP-servers for different languages,
> > chats, two browsers, etc… Usually, my system quickly gets to the point of
> > SWAP-storms, where I have to kill LSP-servers, restart browsers to free memory,
> > etc.; otherwise the system lags heavily and is barely usable.
> >
> > 1.5 days ago I migrated from the 5.11.15 kernel to 5.12 + the LRU patchset, and I
> > started out by opening lots of apps to create memory pressure, and worked for a
> > day like this. Till now I have had *not a single SWAP-storm*, and mind you I have
> > 3.4G in SWAP. Before, I never got to the point of 3G in SWAP without a single
> > SWAP-storm.
> >
> > Right now my gf on Fedora 33 also suffers from SWAP-storms on her old 2013
> > MacBook with 4G RAM + zswap + swap. I think next week I'll build 5.12 + the
> > LRU patchset for her as well. We'll see how it goes; I expect it will improve
> > her experience by a lot too.
> >
> > P.S.: upon replying please keep me CCed, I'm not subscribed to the list
>
