[v34,00/13] Introduce Data Access MONitor (DAMON)

Message ID 20210716081449.22187-1-sj38.park@gmail.com (mailing list archive)

SeongJae Park July 16, 2021, 8:14 a.m. UTC
From: SeongJae Park <sjpark@amazon.de>

Changes from Previous Version (v33)
===================================

Compared to v33
(https://lore.kernel.org/linux-mm/20210713123356.6924-1-sj38.park@gmail.com/),
this version contains the minor changes below.

- Rebase on latest -mm tree (v5.14-rc1-mmots-2021-07-15-18-47)
- Remove unnecessary asterisks from the MAINTAINERS file update (Joe Perches)

Now every patch in this patchset has at least one 'Reviewed-by:' or 'Acked-by:'
tag.  Andrew, could you please consider merging this into the -mm tree?

Introduction
============

DAMON is a data access monitoring framework for the Linux kernel.  The core
mechanisms of DAMON, called 'region based sampling' and 'adaptive regions
adjustment' (refer to 'design.rst' in the 10th patch of this patchset for the
details), make it

 - accurate (The monitored information is useful for DRAM level memory
   management. It might not be appropriate for cache-level accuracy, though.),
 - light-weight (The monitoring overhead is low enough to be applied online
   while making no impact on the performance of the target workloads.), and
 - scalable (the upper-bound of the instrumentation overhead is controllable
   regardless of the size of target workloads.).

Using this framework, therefore, several memory management mechanisms such as
reclamation and THP can be optimized to be aware of real data access patterns.
Experimental access pattern aware memory management optimization works that
incurred high instrumentation overhead will be able to have another try.

Though DAMON is for kernel subsystems, it can be easily exposed to the user
space by writing a DAMON-wrapper kernel subsystem.  Then, user space users who
have some special workloads will be able to write personalized tools or
applications for deeper understanding and specialized optimizations of their
systems.

DAMON is also merged in two public Amazon Linux kernel trees that are based on
v5.4.y[1] and v5.10.y[2].

[1] https://github.com/amazonlinux/linux/tree/amazon-5.4.y/master/mm/damon
[2] https://github.com/amazonlinux/linux/tree/amazon-5.10.y/master/mm/damon

Long-term Plan
--------------

DAMON is a part of a project called Data Access-aware Operating System (DAOS).
As the name implies, I want to improve the performance and efficiency of
systems using fine-grained data access patterns.  The optimizations are for
both kernel and user spaces.  I will therefore modify or create kernel
subsystems, export some of those to user space, and implement user space
libraries and tools.  The layers and components of the project are shown below.

    ---------------------------------------------------------------------------
    Primitives:     PTE Accessed bit, PG_idle, rmap, (Intel CMT), ...
    Framework:      DAMON
    Features:       DAMOS, virtual addr, physical addr, ...
    Applications:   DAMON-debugfs, (DARC), ...
    ^^^^^^^^^^^^^^^^^^^^^^^    KERNEL SPACE    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    Raw Interface:  debugfs, (sysfs), (damonfs), tracepoints, (sys_damon), ...

    vvvvvvvvvvvvvvvvvvvvvvv    USER SPACE      vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
    Library:        (libdamon), ...
    Tools:          DAMO, (perf), ...
    ---------------------------------------------------------------------------

The components in parentheses or marked as '...' are not implemented yet but
are planned.  IOW, those are the TODO tasks of the DAOS project.  For more
detail, please refer to the plans:
https://lore.kernel.org/linux-mm/20201202082731.24828-1-sjpark@amazon.com/

Evaluations
===========

We evaluated DAMON's overhead, monitoring quality and usefulness using 24
realistic workloads on my QEMU/KVM based virtual machine running a kernel with
the v24 DAMON patchset applied.

DAMON is lightweight.  It increases system memory usage by 0.39% and slows
target workloads down by 1.16%.

DAMON is accurate and useful for memory management optimizations.  An
experimental DAMON-based operation scheme for THP, namely 'ethp', removes
76.15% of THP memory overheads while preserving 51.25% of THP speedup.  Another
experimental DAMON-based 'proactive reclamation' implementation, 'prcl',
reduces 93.38% of resident sets and 23.63% of system memory footprint while
incurring only 1.22% runtime overhead in the best case (parsec3/freqmine).

NOTE that the experimental THP optimization and proactive reclamation are not
for production but only proofs of concept.

Please refer to the official document[1] or "Documentation/admin-guide/mm: Add
a document for DAMON" patch in this patchset for detailed evaluation setup and
results.

[1] https://damonitor.github.io/doc/html/latest-damon/admin-guide/mm/damon/eval.html

Real-world User Story
=====================

In summary, DAMON has been used on production systems and has proved its
usefulness.

DAMON as a profiler
-------------------

We analyzed characteristics of large scale production systems of our
customers using DAMON.  The systems utilize 70GB DRAM and 36 CPUs.  From this,
we were able to find the interesting things below.

There were obviously different access patterns under idle workload and active
workload.  Under the idle workload, it accessed large memory regions with low
frequency, while the active workload accessed small memory regions with high
frequency.

DAMON found a 7GB memory region showing obviously high access frequency
under the active workload.  We believe this is the performance-effective
working set and needs to be protected.

There was a 4KB memory region showing the highest access frequency under not
only active but also idle workloads.  We think this must be something like the
hottest code section, which should never be paged out.

For this analysis, DAMON used only 0.3-1% of single CPU time.  Because we used
recording-based analysis, it consumed about 3-12 MB of disk space per 20
minutes.  This is only a small amount of disk space, but we can further reduce
the disk usage by using non-recording-based DAMON features.  I'd like to argue
that only DAMON can do such detailed analysis (finding the hottest 4KB region
in 70GB memory) with such light overhead.

DAMON as a system optimization tool
-----------------------------------

We also found the potential performance problems below on the systems and made
DAMON-based solutions.

The system doesn't want to make the workload suffer from the page reclamation
and thus it utilizes enough DRAM but no swap device.  However, we found the
system is actively reclaiming file-backed pages, because the system has
intensive file IO.  The file IO turned out to be not performance critical for
the workload, but the customer wanted to ensure that performance critical
file-backed pages, like the code section, are not mistakenly evicted.

Using direct IO or `mlock()` would be a straightforward solution, but
modifying the user space code is not easy for the customer.  Alternatively, we
could use DAMON-based operation scheme[1].  By using it, we can ask DAMON to
track access frequency of each region and make
'process_madvise(MADV_WILLNEED)[2]' call for regions having specific size and
access frequency for a time interval.
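
As a rough illustration of how such a scheme selects regions, below is a
minimal sketch in Python.  It is not the DAMON-based operation scheme
implementation; the 'Region' and 'Scheme' types, their field names, and
'regions_to_act_on()' are hypothetical names made up for this example.  It only
models the idea of matching regions by size, access frequency, and the time for
which the pattern has been kept.

```python
from dataclasses import dataclass


@dataclass
class Region:
    size: int           # region size in bytes
    freq_percent: int   # how often the region was seen accessed, 0-100
    age_seconds: int    # how long the region has kept this access pattern


@dataclass
class Scheme:
    # All fields are hypothetical; they model "specific size and access
    # frequency for a time interval" from the description above.
    min_size: int
    max_size: int
    min_freq_percent: int
    max_freq_percent: int
    min_age_seconds: int
    action: str         # e.g. "MADV_WILLNEED"

    def matches(self, region: Region) -> bool:
        """True if the region fulfills every condition of the scheme."""
        return (self.min_size <= region.size <= self.max_size
                and self.min_freq_percent <= region.freq_percent
                <= self.max_freq_percent
                and region.age_seconds >= self.min_age_seconds)


def regions_to_act_on(scheme, regions):
    """Return the regions to which the scheme's action would be applied."""
    return [r for r in regions if scheme.matches(r)]
```

In this model, protecting hot file-backed pages would mean a scheme whose
action is "MADV_WILLNEED" and whose frequency range covers only the hot
regions; the concrete thresholds would have to be tuned per workload.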

We also found the system has a high number of TLB misses.  We tried the
'always' THP enabled policy and it greatly reduced TLB misses, but page
reclamation also became more frequent due to memory bloat caused by THP
internal fragmentation.  We could try another DAMON-based operation scheme that
applies 'MADV_HUGEPAGE' to memory regions having >=2MB size and high access
frequency, while applying 'MADV_NOHUGEPAGE' to regions having <2MB size and low
access frequency.
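
The THP rule above can be sketched as a small decision function.  This is only
an illustration: 'choose_thp_advice()' and the two frequency thresholds are
assumptions made for the example (the text only says "high" and "low"); the
2MB boundary and the two madvise hints are the only parts taken from the
description.

```python
from typing import Optional

HUGEPAGE_SIZE = 2 * 1024 * 1024   # 2MB boundary from the description

# Hypothetical thresholds; real values would need tuning per workload.
HOT_FREQ_PERCENT = 80
COLD_FREQ_PERCENT = 5


def choose_thp_advice(size: int, freq_percent: int) -> Optional[str]:
    """Return the madvise hint for a region, or None to leave it alone."""
    if size >= HUGEPAGE_SIZE and freq_percent >= HOT_FREQ_PERCENT:
        return "MADV_HUGEPAGE"
    if size < HUGEPAGE_SIZE and freq_percent <= COLD_FREQ_PERCENT:
        return "MADV_NOHUGEPAGE"
    # Large-but-cold and small-but-hot regions are not covered by the rule.
    return None
```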

We do not own the systems so we only reported the analysis results and possible
optimization solutions to the customers.  The customers were satisfied with the
analysis results and promised to try the optimization guides.

[1] https://lore.kernel.org/linux-mm/20201006123931.5847-1-sjpark@amazon.com/
[2] https://lore.kernel.org/linux-api/20200622192900.22757-4-minchan@kernel.org/

Comparison with Idle Page Tracking
==================================

Idle Page Tracking allows users to set and read idleness of pages using a
bitmap file which represents each page with each bit of the file.  One
recommended usage of it is working set size detection.  Users can do that by

    1. finding the PFN of each page for the workloads of interest,
    2. setting all the pages as idle by doing writes to the bitmap file,
    3. waiting until the workload accesses its working set, and
    4. reading the idleness of the pages again and counting the pages that
       became not idle.
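
Steps 2 and 4 depend on how the bitmap file maps PFNs to bits.  Per the kernel's
Idle Page Tracking documentation, the file is an array of 8-byte integers in
native byte order, and the page at PFN i is bit i%64 of array element i/64.
Below is a minimal sketch of that bookkeeping in Python.  The helper names are
hypothetical, and it works on an in-memory copy of the bitmap rather than
reading /sys/kernel/mm/page_idle/bitmap, which requires root.

```python
import sys

WORD_BYTES = 8    # the bitmap is an array of 8-byte integers
WORD_BITS = 64


def bitmap_position(pfn):
    """Return (byte offset of the 8-byte word, bit index in it) for a PFN."""
    return (pfn // WORD_BITS) * WORD_BYTES, pfn % WORD_BITS


def count_accessed(bitmap, pfns):
    """Count PFNs whose idle bit is clear, i.e. pages that became not idle."""
    accessed = 0
    for pfn in pfns:
        offset, bit = bitmap_position(pfn)
        word = int.from_bytes(bitmap[offset:offset + WORD_BYTES],
                              sys.byteorder)
        if not (word >> bit) & 1:
            accessed += 1
    return accessed
```

Multiplying the returned count by the page size gives the working set size
estimate for the waiting period of step 3.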

NOTE: While Idle Page Tracking is for user space users, DAMON is primarily
designed for kernel subsystems, though it can easily be exposed to the user
space.  Hence, this section only assumes such user space use of DAMON.

For what use cases would Idle Page Tracking be better?
------------------------------------------------------

1. Flexible usecases other than hotness monitoring.

Because Idle Page Tracking allows users to control the primitive (Page
idleness) by themselves, Idle Page Tracking users can do anything they want.
Meanwhile, DAMON is primarily designed to monitor the hotness of each memory
region.  For this, DAMON asks users to provide sampling interval and
aggregation interval.  For this reason, there could be some use cases where
using Idle Page Tracking is simpler.

2. Physical memory monitoring.

Idle Page Tracking receives PFN range as input, so natively supports physical
memory monitoring.

DAMON is designed to be extensible for multiple address spaces and use cases by
implementing and using primitives for the given use case.  Therefore, in
theory, DAMON has no limitation in the type of target address space as long as
primitives for the given address space exist.  However, the default primitives
introduced by this patchset support only virtual address spaces.

Therefore, for physical memory monitoring, you should implement your own
primitives and use them, or simply use Idle Page Tracking.

Nonetheless, an RFC patchset[1] for the physical memory address space
primitives is already available.  It also supports user memory, the same as
Idle Page Tracking.

[1] https://lore.kernel.org/linux-mm/20200831104730.28970-1-sjpark@amazon.com/

For what use cases is DAMON better?
-----------------------------------

1. Hotness Monitoring.

Idle Page Tracking lets users know only if a page frame is accessed or not.  For
hotness check, the user should write more code and use more memory.  DAMON does
that by itself.

2. Low Monitoring Overhead

DAMON receives user's monitoring request with one step and then provides the
results.  So, roughly speaking, DAMON requires only O(1) user/kernel context
switches.

In case of Idle Page Tracking, however, because the interface receives
contiguous page frames, the number of user/kernel context switches increases as
the monitoring target becomes complex and huge.  As a result, the context
switch overhead could be non-negligible.
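
To get a feeling for how this grows, the sketch below groups the page frames of
a monitoring target into contiguous runs.  Assuming a simple user space tool
that issues at least one read or write on the bitmap file per contiguous run
(an assumption of this example, not a measured behavior), the number of runs is
a rough lower bound on the number of user/kernel switches per pass.
'contiguous_runs()' is a hypothetical helper name.

```python
def contiguous_runs(pfns):
    """Group PFNs into sorted (start, end) runs; both ends are inclusive."""
    runs = []
    for pfn in sorted(set(pfns)):
        if runs and pfn == runs[-1][1] + 1:
            # Extend the current run by one page frame.
            runs[-1] = (runs[-1][0], pfn)
        else:
            # A gap was found, so a new run (and a new access) starts here.
            runs.append((pfn, pfn))
    return runs
```

A target backed by one large contiguous range needs a single run, while a
fragmented target of the same size can need as many runs as it has pages.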

Moreover, DAMON is born to handle the monitoring overhead.  Because the
core mechanism is purely logical, Idle Page Tracking users might be able to
implement the mechanism on their own, but it would be time consuming and the
user/kernel context switching would still be more frequent than that of DAMON.
Also, the kernel subsystems cannot use the logic in this case.

3. Page granularity working set size detection.

Until v22 of this patchset, this was categorized as the thing Idle Page
Tracking could do better, because DAMON basically maintains additional metadata
for each of the monitoring target regions.  So, in the page granularity working
set size detection use case, DAMON would incur (number of monitoring target
pages * size of metadata) memory overhead.  Size of the single metadata item is
about 54 bytes, so assuming 4KB pages, about 1.3% of monitoring target pages
will be additionally used.
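
The 1.3% figure follows directly from the two numbers above, as the small
calculation below shows.  The constant and function names are made up for this
example.

```python
METADATA_BYTES = 54       # per-region metadata size quoted above
PAGE_BYTES = 4 * 1024     # 4KB pages


def metadata_bytes(nr_pages):
    """Metadata needed when every monitored page is its own region."""
    return nr_pages * METADATA_BYTES


def metadata_overhead_ratio(nr_pages):
    """Metadata size relative to the size of the monitored memory."""
    return metadata_bytes(nr_pages) / (nr_pages * PAGE_BYTES)


ratio = metadata_overhead_ratio(1)
print(f"{ratio * 100:.2f}%")   # 54 / 4096, about 1.32%
```

For example, monitoring 1 GiB of memory (262144 pages) at page granularity
would need 262144 * 54 bytes, which is 13.5 MiB of metadata.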

All essential metadata for Idle Page Tracking are embedded in 'struct page' and
page table entries.  Therefore, in this use case, only one counter variable for
working set size accounting is required if Idle Page Tracking is used.

There are more details to consider, but roughly speaking, this is true in most
cases.

However, the situation changed from v23.  Now DAMON supports arbitrary types of
monitoring targets, which don't use the metadata.  Using that, DAMON can do the
working set size detection with no additional space overhead and fewer
user-kernel context switches.  A first draft for the implementation of
monitoring primitives for this usage is available in a DAMON development
tree[1].  An RFC patchset for it based on this patchset will also be available
soon.

From v24, the arbitrary type support is dropped from this patchset because this
patchset doesn't introduce real use of the type.  You can still get it from the
DAMON development tree[2], though.

[1] https://github.com/sjp38/linux/tree/damon/pgidle_hack
[2] https://github.com/sjp38/linux/tree/damon/master

4. More future usecases

While Idle Page Tracking has tight coupling with base primitives (PG_idle and
page table Accessed bits), DAMON is designed to be extensible for many use
cases and address spaces.  If you need some special address type or want to use
special h/w access check primitives, you can write your own primitives for that
and configure DAMON to use those.  Therefore, if your use case could change a
lot in the future, using DAMON could be better.

Can I use both Idle Page Tracking and DAMON?
--------------------------------------------

Yes, though using them concurrently for overlapping memory regions could result
in interference to each other.  Nevertheless, such a use case would be rare or
make no sense at all.  Even in that case, the noise would not be really
significant.  So, you can choose whatever you want depending on the
characteristics of your use cases.

More Information
================

We prepared a showcase web site[1] where you can get more information.  There
are

- the official documentation[2],
- the heatmap format dynamic access pattern of various realistic workloads for
  heap area[3], mmap()-ed area[4], and stack[5] area,
- the dynamic working set size distribution[6] and chronological working set
  size changes[7], and
- the latest performance test results[8].

[1] https://damonitor.github.io/_index
[2] https://damonitor.github.io/doc/html/latest-damon
[3] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.0.png.html
[4] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
[5] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.2.png.html
[6] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
[7] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
[8] https://damonitor.github.io/test/result/perf/latest/html/index.html

Baseline and Complete Git Trees
===============================

The patches are based on the latest -mm tree, specifically
v5.14-rc1-mmots-2021-07-15-18-47 of https://github.com/hnaz/linux-mm.  You can
also clone the complete git tree:

    $ git clone git://github.com/sjp38/linux -b damon/patches/v34

A web view is also available:
https://github.com/sjp38/linux/releases/tag/damon/patches/v34

Development Trees
-----------------

There are a couple of trees for the entire DAMON patchset series and
features for future releases.

- For latest release: https://github.com/sjp38/linux/tree/damon/master
- For next release: https://github.com/sjp38/linux/tree/damon/next

Long-term Support Trees
-----------------------

For people who want to test DAMON on LTS kernels, there are another couple of
trees, based on the two latest LTS kernels respectively and containing the
'damon/master' backports.

- For v5.4.y: https://github.com/sjp38/linux/tree/damon/for-v5.4.y
- For v5.10.y: https://github.com/sjp38/linux/tree/damon/for-v5.10.y

Amazon Linux Kernel Trees
-------------------------

DAMON is also merged in two public Amazon Linux kernel trees that are based on
v5.4.y[1] and v5.10.y[2].

[1] https://github.com/amazonlinux/linux/tree/amazon-5.4.y/master/mm/damon
[2] https://github.com/amazonlinux/linux/tree/amazon-5.10.y/master/mm/damon

Git Tree for Diff of Patches
============================

For easy review of diff between different versions of each patch, I prepared a
git tree containing all versions of the DAMON patchset series:
https://github.com/sjp38/damon-patches

You can clone it and use 'diff' for easy review of changes between different
versions of the patchset.  For example:

    $ git clone https://github.com/sjp38/damon-patches && cd damon-patches
    $ diff -u damon/v33 damon/v34

Sequence Of Patches
===================

The first three patches implement the core logic of DAMON.  The 1st patch
introduces basic sampling based hotness monitoring for arbitrary types of
targets.  The following two patches implement the core mechanisms for control
of overhead and accuracy, namely region-based sampling (patch 2) and adaptive
regions adjustment (patch 3).
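
For readers new to the two mechanisms, below is a heavily simplified model of
them in Python.  It is a sketch under stated assumptions, not the code in
mm/damon/core.c: the 'Region' type, the function names, the page-index address
space, and the 'is_accessed' callback (standing in for the monitoring
primitives) are all made up for this example, and limits such as the minimum
and maximum number of regions are left out.

```python
import random
from dataclasses import dataclass


@dataclass
class Region:
    start: int            # inclusive page index
    end: int              # exclusive page index
    nr_accesses: int = 0


def sample_regions(regions, is_accessed, rng=random):
    """Region-based sampling: check one random page per region."""
    for r in regions:
        page = rng.randrange(r.start, r.end)
        if is_accessed(page):
            r.nr_accesses += 1


def merge_regions(regions, threshold):
    """Merge adjacent regions whose access counts differ by <= threshold."""
    merged = []
    for r in regions:
        last = merged[-1] if merged else None
        if (last and last.end == r.start
                and abs(last.nr_accesses - r.nr_accesses) <= threshold):
            # Size-weighted average keeps the merged count representative.
            total = (last.nr_accesses * (last.end - last.start)
                     + r.nr_accesses * (r.end - r.start))
            last.nr_accesses = total // (r.end - last.start)
            last.end = r.end
        else:
            merged.append(Region(r.start, r.end, r.nr_accesses))
    return merged


def split_regions(regions, rng=random):
    """Split each region of two or more pages at a random point."""
    out = []
    for r in regions:
        if r.end - r.start < 2:
            out.append(r)
            continue
        mid = rng.randrange(r.start + 1, r.end)
        out.append(Region(r.start, mid, r.nr_accesses))
        out.append(Region(mid, r.end, r.nr_accesses))
    return out
```

The point of the model is that the per-sample cost depends on the number of
regions rather than the number of pages, and that merging and splitting let
the region boundaries follow the access pattern over time.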

Now the essential parts of DAMON are complete, but it cannot work unless
someone provides monitoring primitives for a specific use case.  The following
two patches make it just work for virtual address space monitoring.  The 4th
patch makes 'PG_idle' usable by DAMON and the 5th patch implements the virtual
memory address space specific monitoring primitives using page table Accessed
bits and the 'PG_idle' page flag.

Now DAMON just works for virtual address space monitoring via the kernel space
API.  To let user space users use DAMON, the following four patches add
interfaces for them.  The 6th patch adds a tracepoint for monitoring results.
The 7th patch implements a DAMON application kernel module, namely damon-dbgfs,
that simply wraps DAMON and exposes DAMON interface to the user space via the
debugfs interface.  The 8th patch further exports pid of monitoring thread
(kdamond) to user space for easier cpu usage accounting, and the 9th patch
makes the debugfs interface support multiple contexts.

Three patches for maintainability follow.  The 10th patch adds documentation
for both the user space and the kernel space.  The 11th patch provides unit
tests (based on the kunit) while the 12th patch adds user space tests (based on
the kselftest).

Finally, the last patch (13th) updates the MAINTAINERS file.

Patch History
=============

Changes from v33
(https://lore.kernel.org/linux-mm/20210713123356.6924-1-sj38.park@gmail.com/)
- Rebase on latest -mm tree (v5.14-rc1-mmots-2021-07-15-18-47)
- Remove unnecessary asterisks from the MAINTAINERS file update (Joe Perches)

Changes from v32
(https://lore.kernel.org/linux-mm/20210628133355.18576-1-sj38.park@gmail.com/)
- Rebase on latest mainline (7d0fc5c62385)
- Collect 'Acked-by:' tags from Shakeel Butt

Changes from v31
(https://lore.kernel.org/linux-mm/20210621083108.17589-1-sj38.park@gmail.com/)
- Rebase on latest -mm tree (v5.13-rc7-mmots-2021-06-24-20-54)
- Add 'Acked-by:' tags from Shakeel Butt
- Use 'kthread_run()' (Shakeel Butt)
- Change default 'update_interval' to 60 seconds (Shakeel Butt)
- Utilize 'nr_regions' field in each 'damon_target' object (Shakeel Butt)
- Remove unused parameters in some functions (Shakeel Butt)
- Use variable name 'ctx' for 'damon_ctx' (Shakeel Butt)
- Make 'dbgfs' to completely manage pid reference counting (Shakeel Butt)
- Remove '.owner' of debugfs files (Shakeel Butt)

Changes from v30
(https://lore.kernel.org/linux-mm/20210616073119.16758-1-sj38.park@gmail.com/)
- Rebase on latest -mm tree (v5.13-rc6-mmots-2021-06-16-22-17)
- selftest: Fix wrong file content comparison (Markus Boehme)
- Collect 'Reviewed-by:' tags from Markus

Changes from v29
(https://lore.kernel.org/linux-mm/20210520075629.4332-1-sj38.park@gmail.com/)
- Rebase on latest -mm tree (v5.13-rc6-mmots-2021-06-15-20-28)
- Remove unnecessary documents
- Wordsmith commit message for PAGE_IDLE separation (Amit Shah)
- selftests: Fix shellcheck warnings and cleanup (Maximilian Heyne)
- Wordsmith the document (Markus Boehme)
- Fix a typo in comments (Fernand Sieber)
- Collect 'Reviewed-by:' tags from "Fernand Sieber <sieberf@amazon.com>"

Please refer to the v29 patchset to get older history.

SeongJae Park (13):
  mm: Introduce Data Access MONitor (DAMON)
  mm/damon/core: Implement region-based sampling
  mm/damon: Adaptively adjust regions
  mm/idle_page_tracking: Make PG_idle reusable
  mm/damon: Implement primitives for the virtual memory address spaces
  mm/damon: Add a tracepoint
  mm/damon: Implement a debugfs-based user space interface
  mm/damon/dbgfs: Export kdamond pid to the user space
  mm/damon/dbgfs: Support multiple contexts
  Documentation: Add documents for DAMON
  mm/damon: Add kunit tests
  mm/damon: Add user space selftests
  MAINTAINERS: Update for DAMON

 Documentation/admin-guide/mm/damon/index.rst  |  15 +
 Documentation/admin-guide/mm/damon/start.rst  | 114 +++
 Documentation/admin-guide/mm/damon/usage.rst  | 112 +++
 Documentation/admin-guide/mm/index.rst        |   1 +
 Documentation/vm/damon/api.rst                |  20 +
 Documentation/vm/damon/design.rst             | 166 ++++
 Documentation/vm/damon/faq.rst                |  51 ++
 Documentation/vm/damon/index.rst              |  30 +
 Documentation/vm/index.rst                    |   1 +
 MAINTAINERS                                   |  11 +
 include/linux/damon.h                         | 268 +++++++
 include/linux/page-flags.h                    |   4 +-
 include/linux/page_ext.h                      |   2 +-
 include/linux/page_idle.h                     |   6 +-
 include/trace/events/damon.h                  |  43 ++
 include/trace/events/mmflags.h                |   2 +-
 mm/Kconfig                                    |  10 +
 mm/Makefile                                   |   1 +
 mm/damon/Kconfig                              |  69 ++
 mm/damon/Makefile                             |   5 +
 mm/damon/core-test.h                          | 253 ++++++
 mm/damon/core.c                               | 720 ++++++++++++++++++
 mm/damon/dbgfs-test.h                         | 126 +++
 mm/damon/dbgfs.c                              | 624 +++++++++++++++
 mm/damon/vaddr-test.h                         | 329 ++++++++
 mm/damon/vaddr.c                              | 613 +++++++++++++++
 mm/page_ext.c                                 |  12 +-
 mm/page_idle.c                                |  10 -
 tools/testing/selftests/damon/Makefile        |   7 +
 .../selftests/damon/_chk_dependency.sh        |  28 +
 .../testing/selftests/damon/debugfs_attrs.sh  |  75 ++
 31 files changed, 3710 insertions(+), 18 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/damon/index.rst
 create mode 100644 Documentation/admin-guide/mm/damon/start.rst
 create mode 100644 Documentation/admin-guide/mm/damon/usage.rst
 create mode 100644 Documentation/vm/damon/api.rst
 create mode 100644 Documentation/vm/damon/design.rst
 create mode 100644 Documentation/vm/damon/faq.rst
 create mode 100644 Documentation/vm/damon/index.rst
 create mode 100644 include/linux/damon.h
 create mode 100644 include/trace/events/damon.h
 create mode 100644 mm/damon/Kconfig
 create mode 100644 mm/damon/Makefile
 create mode 100644 mm/damon/core-test.h
 create mode 100644 mm/damon/core.c
 create mode 100644 mm/damon/dbgfs-test.h
 create mode 100644 mm/damon/dbgfs.c
 create mode 100644 mm/damon/vaddr-test.h
 create mode 100644 mm/damon/vaddr.c
 create mode 100644 tools/testing/selftests/damon/Makefile
 create mode 100644 tools/testing/selftests/damon/_chk_dependency.sh
 create mode 100755 tools/testing/selftests/damon/debugfs_attrs.sh

Comments

Shakeel Butt July 27, 2021, 9:30 p.m. UTC | #1
(reduced CC list)

Hi all,

I have been asked to comment if Google is interested in using this
feature, its general usefulness and if it is sufficiently general and
non-duplicative. I will try to answer these but first I will explain
the use-cases we are particularly interested in and for which we want
a general access monitoring mechanism.

At the moment Google is particularly interested in four use-cases:

1) Working set estimation: This is used for cluster level scheduling
and controlling the knobs of memory overcommit.

2) Proactive reclaim

3) Balancing between memory tiers: Moving hot pages to fast tiers and
cold pages to slow tiers

4) Hugepage optimization: Hot memory backed by hugepages

In addition, these uses are not happening in isolation. We want a
combination of these running concurrently on a system. So, it is clear
that the first version or step of DAMON which only targets virtual
address space monitoring is not sufficient for these use-cases.

I think the more important question is if DAMON can be extended to
system level monitoring to fulfill these use-cases. Address space
monitoring is a core concept in DAMON and it has implemented address
space based optimizations (i.e. dividing address space into regions,
assuming locality within regions, random sampling within regions
instead of looking at each page and dynamically adjusting regions).
There is a followup proposal on monitoring physical address space in
DAMON. However for systems running multiple workloads, the address
space optimizations core to DAMON would be ineffective.

There are discussions/brainstorming on supporting abstract address
space based on LRUs which is somewhat similar to Multigen LRU [1]
proposal but not well articulated yet. BTW Multigen LRU [1] is another
similar proposal but targets one specific use-case i.e. memory reclaim
(proactive reclaim). Anyways I think we need more brainstorming for a
generalized solution of system level access monitoring.

Regarding merging DAMON, I personally think there are users who might
be interested in only their virtual address space and DAMON is
providing a solution for such users. SeongJae can provide more details
or knowledge if any big user other than Amazon is interested in the
feature. DAMON does not expose stable APIs at the moment, so these can
be changed later if needed. I think it is ok to merge DAMON for some
exposure. However I do want to make this clear that the solution space
is not complete. The solution of system level monitoring is still
needed which can be a future extension to DAMON or more generalized
Multigen LRU.

thanks,
Shakeel

[1] https://lore.kernel.org/lkml/20210520065355.2736558-1-yuzhao@google.com/

SeongJae Park July 28, 2021, 8:36 a.m. UTC | #2
From: SeongJae Park <sjpark@amazon.de>

Hello,


On Tue, 27 Jul 2021 14:30:38 -0700 Shakeel Butt <shakeelb@google.com> wrote:

> (reduced CC list)
> 
> Hi all,
> 
> I have been asked to comment if Google is interested in using this
> feature, its general usefulness and if it is sufficiently general and
> non-duplicative. I will try to answer these but first I will explain
> the use-cases we are particularly interested in and for which we want
> a general access monitoring mechanism.

Thank you for your great opinion below, Shakeel.

> 
> At the moment Google is particularly interested in four use-cases:
> 
> 1) Working set estimation: This is used for cluster level scheduling
> and controlling the knobs of memory overcommit.
> 
> 2) Proactive reclaim
> 
> 3) Balancing between memory tiers: Moving hot pages to fast tiers and
> cold pages to slow tiers
> 
> 4) Hugepage optimization: Hot memory backed by hugepages
> 
> In addition, these uses are not happening in isolation. We want a
> combination of these running concurrently on a system. So, it is clear
> that the first version or step of DAMON which only targets virtual
> address space monitoring is not sufficient for these use-cases.
> 
> I think the more important question is if DAMON can be extended to
> system level monitoring to fulfill these use-cases.

I also think this is the important point.  The main purpose of the DAMON
patchset is providing a flexible monitoring framework which can easily be
extended for many use cases.  Once we have the framework, I believe people will
be able to extend it for their usages, and others will be able to reuse those
(a snowball can start rolling).

> Address space monitoring is a core concept in DAMON and it has implemented
> address space based optimizations (i.e. dividing address space into regions,
> assuming locality within regions, random sampling within regions instead of
> looking at each page and dynamically adjusting regions).  There is a followup
> proposal on monitoring physical address space in DAMON. However for systems
> running multiple workloads, the address space optimizations core to DAMON
> would be ineffective.

Right.  If the system is running a huge number of different workloads (e.g.,
systems running a huge number of virtual machines or kubernetes-managed
containers), DAMON's region-based monitoring's accuracy could be lowered.

However, I'd like to note that there are many people running only a small
number of major workloads on their systems.  Even in the above case of systems
running a huge number of virtual machines, each virtual machine would have only
a small number of major workloads.  People can use DAMON inside the guests.  We
also confirmed the region-based physical address space monitoring on such
production systems achieves high accuracy (we found the hottest 4KB memory
region in 70GB memory).

Also, the region-based monitoring is not mandatory.  The followup proposal
which extends DAMON for physical address space monitoring[1] allows people to
opt out of it if they want.  In addition to that, it implements a
page-granularity monitoring.  I'm unsure if the implementation fits Google's
usage, but I'm sure you can at least implement your own on the framework
without the limitation of the regions abstraction.

> 
> There are discussions/brainstorming on supporting abstract address
> space based on LRUs which is somewhat similar to Multigen LRU [1]
> proposal but not well articulated yet. BTW Multigen LRU [1] is another
> similar proposal but targets one specific use-case i.e. memory reclaim
> (proactive reclaim). Anyways I think we need more brainstorming for a
> generalized solution of system level access monitoring.

The idea is using the positional index of each page in its LRU list as its
address.  For example, a page at the head of an LRU list will have address 0.
On the address space, we can safely assume the pages adjacent in the address
scheme will have similar access frequency, and therefore DAMON's region-based
monitoring would work.  Further, we can proactively move the pages in the LRU
list so that pages near the head of the list have higher frequencies, based on
the monitoring results.

For example, if we see the monitoring results below from the address space:

    <HEAD of a LRU list> HHHHHHHMMMMMMMCCCCHHCCCCC <TAIL of a LRU list>
    (H: Hot page, M: Mid-temperature page, C: Cold page)

We can move the hot pages near the tail to the head, as below:

    <HEAD of a LRU list> HHHHHHHHHMMMMMMMCCCCCCCCC <TAIL of a LRU list>
    (H: Hot page, M: Mid-temperature page, C: Cold page)

This will improve not only monitoring accuracy but also other mechanisms such
as reclamation, which are based on the assumption of LRU list.
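
A toy model of the idea, using the same H/M/C notation, could look like the
sketch below.  It is only an illustration of the brainstormed concept: the
function names are made up, an LRU list is modeled as a string from head to
tail, and the reordering is an idealized stable sort, while a real
implementation would move only selected pages.

```python
RANK = {"H": 0, "M": 1, "C": 2}   # hotter pages sort nearer the head


def lru_addresses(lru):
    """Use each page's position from the head as its address (head is 0)."""
    return {address: page for address, page in enumerate(lru)}


def promote_hot_pages(lru):
    """Reorder the list so hotter pages sit nearer the head.

    sorted() is stable, so pages of the same temperature keep their
    relative order.
    """
    return "".join(sorted(lru, key=RANK.__getitem__))
```

Applied to the first diagram above, 'promote_hot_pages()' produces the layout
of the second diagram, in which pages with adjacent addresses have similar
temperatures.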

As Shakeel also said, this is only in a brainstorming stage, though.

> 
> Regarding merging DAMON, I personally think there are users who might
> be interested in only their virtual address space and DAMON is
> providing a solution for such users. SeongJae can provide more details
> or knowledge if any big user other than Amazon is interested in the
> feature.

AFAIR, Huawei, Intel, and Alibaba have shown some level of interest publicly
and/or personally, so far.  They did code review and/or tests and bug reports.
Also a number of researchers and individuals have reached out to me.

> DAMON does not expose stable APIs at the moment, so these can
> be changed later if needed. I think it is ok to merge DAMON for some
> exposure. However I do want to make this clear that the solution space
> is not complete. The solution of system level monitoring is still
> needed which can be a future extension to DAMON or more generalized
> Multigen LRU.

Agreed.  We have a lot more work to do.  Some of it has already been posted as
RFC patchsets[1,2,3,4].  I promise I will happily do the work.  But why should
I alone get all the fun?  I'd like to do that together with others in this
great community.  One major purpose of this patchset is thus to provide a
flexible framework for such collaboration.  The virtual address space
monitoring, which this patchset provides in addition to the framework, is also
for real-world usage, though.

Now every patch has at least one 'Reviewed-by:' or 'Acked-by:' tag.  We have
not found serious problems since v26[5], which was posted about four months
ago, so I think this patchset has passed the minimum qualification.  If
you think there are more things to be done before this patchset is merged in
the -mm tree or mainline, please let me know.  If not, Andrew, I'd like you to
consider merging this patchset into the '-mm' tree.


Thanks,
SeongJae Park

> 
> thanks,
> Shakeel
> 
> [1] https://lore.kernel.org/lkml/20210520065355.2736558-1-yuzhao@google.com/

[1] https://lore.kernel.org/linux-mm/20201216094221.11898-1-sjpark@amazon.com/
[2] https://lore.kernel.org/linux-mm/20201216084404.23183-1-sjpark@amazon.com/
[3] https://lore.kernel.org/linux-mm/20210107120729.22328-1-sjpark@amazon.com/
[4] https://lore.kernel.org/linux-mm/20210720131309.22073-1-sj38.park@gmail.com/
[5] https://lore.kernel.org/linux-mm/20210330090537.12143-1-sj38.park@gmail.com/
SeongJae Park Aug. 2, 2021, 8:24 a.m. UTC | #3
From: SeongJae Park <sjpark@amazon.de>

Hello Andrew,

On Wed, 28 Jul 2021 08:36:43 +0000 SeongJae Park <sj38.park@gmail.com> wrote:

[...]
> Now all the patches have at least one 'Reviewed-by:' or 'Acked-by:' tags.  We
> didn't find serious problems since v26[5], which was posted about four months
> ago. so I'm thinking this patchset has passed the minimum qualification.  If
> you think there are more things to be done before this patchset is merged in
> the -mm tree or mainline, please let me know.  If not, Andrew, I'd like you to
> consider merging this patchset into '-mm' tree.

I'm wondering if you had a chance to consider that.  If you had the chance but
this patchset didn't convince you, could you please let me know your concerns
so that I can make some progress?


Thanks,
SeongJae Park

[...]
SeongJae Park Aug. 4, 2021, 7:41 a.m. UTC | #4
From: SeongJae Park <sjpark@amazon.de>

Hello Andrew,

On Mon,  2 Aug 2021 08:24:24 +0000 SeongJae Park <sj38.park@gmail.com> wrote:

> From: SeongJae Park <sjpark@amazon.de>
> 
> Hello Andrew,
> 
> On Wed, 28 Jul 2021 08:36:43 +0000 SeongJae Park <sj38.park@gmail.com> wrote:
> 
> [...]
> > Now all the patches have at least one 'Reviewed-by:' or 'Acked-by:' tags.  We
> > didn't find serious problems since v26[5], which was posted about four months
> > ago. so I'm thinking this patchset has passed the minimum qualification.  If
> > you think there are more things to be done before this patchset is merged in
> > the -mm tree or mainline, please let me know.  If not, Andrew, I'd like you to
> > consider merging this patchset into '-mm' tree.
> 
> I'm wondering if you had a chance to consider that.  If you had the chance but
> this patchset didn't convince you, could you please let me know your concerns
> so that I can make some progress?

Because nearly three weeks have passed since this patchset was posted, I
considered rebasing it on the latest -mm tree and posting it as v35.  But that
apparently does not make much sense, because we found nothing to fix or
improve, and this version can still be cleanly applied on top of the latest
-mm tree.  So, instead of merely increasing the version number, I'd like to
describe why I believe this needs to be merged into the -mm tree and
eventually the mainline.

1. Merging this patchset will not bother other developers

Most changes in this patchset are for DAMON-dedicated new source files.  There
is a change[1] to existing files, which makes PG_idle independent of Idle Page
Tracking, but it is small.  Therefore, merging this patchset will not
increase the complexity of the other parts or introduce a regression.

2. Merging this patchset will not bother other users

DAMON utilizes a mechanism that is designed to minimize and limit the
monitoring overhead.  That said, DAMON can be opted out at compile time by
users who don't want it.  Even if it is compiled in, it does nothing at all
unless a user explicitly asks it to do some work.  Therefore, merging this
patchset will not silently introduce any additional overhead to users.

3. This patchset is deployed to real users

We are currently using the DAMON patchset for profiling production workloads,
as described in the 'Real-world User Story' section of the cover letter.  It is
also deployed to real users other than us via Amazon Linux[2,3].  A few
companies and several researchers outside Amazon have publicly and/or
privately shown their interest in DAMON.

4. The downstream-only maintenance overhead is significant

Follow-up development work based on DAMON[4,5,6] is also ongoing.  Because all
of that work is currently downstream only, the maintenance overhead is not
small for us.  Once DAMON is upstreamed, the overhead will be significantly
reduced.

5. This patchset is reviewed and apparently is stabilized

Since the first version of the DAMON patchset was posted (2020-01-20), it has
evolved a lot.  All patches of this patchset got at least one 'Reviewed-by:' or
'Acked-by:' tag by v31[7], which was posted about seven weeks ago
(2021-06-21).  After that, we found and fixed only minor issues.  We also got a
few more 'Acked-by:' tags.  Since v34, which was posted about three weeks ago,
we have found no more issues.  We are also continuously running extensive
DAMON-dedicated tests.  The tests include unit tests, self tests, functional
tests, performance tests, and static code analysis.  Some of those are also
publicly available[8].

[1] https://lore.kernel.org/linux-mm/20210716081449.22187-5-sj38.park@gmail.com/
[2] https://github.com/amazonlinux/linux/tree/amazon-5.4.y/master/mm/damon
[3] https://github.com/amazonlinux/linux/tree/amazon-5.10.y/master/mm/damon
[4] https://lore.kernel.org/linux-mm/20201216084404.23183-1-sjpark@amazon.com/
[5] https://lore.kernel.org/linux-mm/20201216094221.11898-1-sjpark@amazon.com/
[6] https://lore.kernel.org/linux-mm/20210720131309.22073-1-sj38.park@gmail.com/
[7] https://lore.kernel.org/linux-mm/20210621083108.17589-1-sj38.park@gmail.com/
[8] https://github.com/awslabs/damon-tests


If you think the above explanation makes sense, please consider merging this
into the -mm tree.  Else, if this doesn't convince you, please let me know
your concerns or what I'm missing, so that I can make some progress.


Thanks,
SeongJae Park

[...]
Andrew Morton Aug. 6, 2021, 12:03 a.m. UTC | #5
On Wed, 28 Jul 2021 08:36:43 +0000 SeongJae Park <sj38.park@gmail.com> wrote:

> > DAMON does not expose stable APIs at the moment, so these can
> > be changed later if needed. I think it is ok to merge DAMON for some
> > exposure. However I do want to make this clear that the solution space
> > is not complete. The solution of system level monitoring is still
> > needed which can be a future extension to DAMON or more generalized
> > Multigen LRU.
> 
> Agreed.  We have lots more works to do.  Some of those are already posted as
> RFC patchsets[1,2,3,4].  I promise I will happily do the works.  But, how dare
> could only I get all the fun?  I'd like to do that together with others in this
> great community.  One major purpose of this patchset is thus providing a
> flexible framework for such collaboration.  The virtual address space
> monitoring, which this patchset provides in addition to the framework, is also
> for real-world usages, though.
> 
> Now all the patches have at least one 'Reviewed-by:' or 'Acked-by:' tags.  We
> didn't find serious problems since v26[5], which was posted about four months
> ago. so I'm thinking this patchset has passed the minimum qualification.  If
> you think there are more things to be done before this patchset is merged in
> the -mm tree or mainline, please let me know.  If not, Andrew, I'd like you to
> consider merging this patchset into '-mm' tree.

Shall take a look.  With some trepidation.

1-2 years from now someone will pop up with a massive patchset
implementing some monitoring scheme and we'll say "why didn't you use
DAMON" and they'll say "it's unsuitable for <reasons>".

I would like to see more thought/design go into how DAMON could be
modified to address Shakeel's other three requirements.  At least to
the point where we can confidently say "yes, we will be able to do
this".  Are you able to drive this discussion along please?
Andrew Morton Aug. 6, 2021, 12:43 a.m. UTC | #6
On Fri, 16 Jul 2021 08:14:36 +0000 SeongJae Park <sj38.park@gmail.com> wrote:

> DAMON is a data access monitoring framework for the Linux kernel.

Merged, thanks.

Presumably there are companion userspace tools for DAMON.  Are they
available?  Is there a plan to release and maintain these?
SeongJae Park Aug. 6, 2021, 11:48 a.m. UTC | #7
From: SeongJae Park <sjpark@amazon.de>

On Thu, 5 Aug 2021 17:03:44 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 28 Jul 2021 08:36:43 +0000 SeongJae Park <sj38.park@gmail.com> wrote:
> 
> > > DAMON does not expose stable APIs at the moment, so these can
> > > be changed later if needed. I think it is ok to merge DAMON for some
> > > exposure. However I do want to make this clear that the solution space
> > > is not complete. The solution of system level monitoring is still
> > > needed which can be a future extension to DAMON or more generalized
> > > Multigen LRU.
> > 
> > Agreed.  We have lots more works to do.  Some of those are already posted as
> > RFC patchsets[1,2,3,4].  I promise I will happily do the works.  But, how dare
> > could only I get all the fun?  I'd like to do that together with others in this
> > great community.  One major purpose of this patchset is thus providing a
> > flexible framework for such collaboration.  The virtual address space
> > monitoring, which this patchset provides in addition to the framework, is also
> > for real-world usages, though.
> > 
> > Now all the patches have at least one 'Reviewed-by:' or 'Acked-by:' tags.  We
> > didn't find serious problems since v26[5], which was posted about four months
> > ago. so I'm thinking this patchset has passed the minimum qualification.  If
> > you think there are more things to be done before this patchset is merged in
> > the -mm tree or mainline, please let me know.  If not, Andrew, I'd like you to
> > consider merging this patchset into '-mm' tree.
> 
> Shall take a look.  With some trepidation.
> 
> 1-2 years from now someone will pop up with a massive patchset
> implementing some monitoring scheme and we'll say "why didn't you use
> DAMON" and they'll say "it's unsuitable for <reasons>".

Agreed.  And I personally believe merging this will help avoid such a
situation, because that someone will be able to easily find the developer who
is responsible for convincing them.  I will happily and definitely do my best
for that.

> 
> I would like to see more thought/design go into how DAMON could be
> modified to address Shakeel's other three requirements.  At least to
> the point where we can confidently say "yes, we will be able to do
> this".  Are you able to drive this discussion along please?

Sure.  I will describe my plan for supporting Shakeel's use cases in detail as
a reply to this mail.


Thanks,
SeongJae Park
SeongJae Park Aug. 6, 2021, 11:48 a.m. UTC | #8
On Thu, 5 Aug 2021 17:43:24 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 16 Jul 2021 08:14:36 +0000 SeongJae Park <sj38.park@gmail.com> wrote:
> 
> > DAMON is a data access monitoring framework for the Linux kernel.
> 
> Merged, thanks.

Thank you!

> 
> Presumably there are companion userspace tools for DAMON.  Are they
> available?  Is there a plan to release and maintain these?

Yes, the userspace tool[1] is available, released under GPLv2, and actively
being maintained.  I am also planning to implement another basic user interface
in perf[2].  Also, the basic test suite for DAMON is available under GPLv2[3].

[1] https://github.com/awslabs/damo
[2] https://lore.kernel.org/linux-mm/20210107120729.22328-1-sjpark@amazon.com/
[3] https://github.com/awslabs/damon-tests


Thanks,
SeongJae Park
Andrew Morton Aug. 7, 2021, 6:28 p.m. UTC | #9
On Fri,  6 Aug 2021 11:48:30 +0000 SeongJae Park <sj38.park@gmail.com> wrote:

> > 
> > Presumably there are companion userspace tools for DAMON.  Are they
> > available?  Is there a plan to release and maintain these?
> 
> Yes, the userspace tool[1] is available, released under GPLv2, and actively
> being maintained.  I am also planning to implement another basic user interface
> in perf[2].  Also, the basic test suite for DAMON is available under GPLv2[3].
> 
> [1] https://github.com/awslabs/damo
> [2] https://lore.kernel.org/linux-mm/20210107120729.22328-1-sjpark@amazon.com/
> [3] https://github.com/awslabs/damon-tests

Ah.  Useful info to have in the changelogs!  I added the above words to the
[0/n] introduction in mm-introduce-data-access-monitor-damon.patch
SeongJae Park Aug. 9, 2021, 2:07 p.m. UTC | #10
From: SeongJae Park <sjpark@amazon.de>

On Fri,  6 Aug 2021 11:48:01 +0000 SeongJae Park <sj38.park@gmail.com> wrote:

> From: SeongJae Park <sjpark@amazon.de>
> 
> On Thu, 5 Aug 2021 17:03:44 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> 
[...]
> > 
> > I would like to see more thought/design go into how DAMON could be
> > modified to address Shakeel's other three requirements.  At least to
> > the point where we can confidently say "yes, we will be able to do
> > this".  Are you able to drive this discussion along please?
> 
> Sure.  I will describe my plan for convincing Shakeel's usages in detail as a
> reply to this mail.

Shakeel, I am explaining how DAMON will be extended and how it can be used for
your usages below.  If there is any doubt or question, please feel free to let
me know.

What information DAMON (will) provides: contiguity, frequency, and recency
--------------------------------------------------------------------------

The DAMON of this patchset informs users how frequently each memory region is
accessed.  A memory region is a set of contiguous pages that have similar
access frequency.  In addition to this, a following patch[1] will make DAMON
track how long each region has maintained its size and access frequency.  We
call this the 'age' of each region.  That is, DAMON will be extended to provide
three attributes of data access patterns: contiguity (size of each region),
frequency, and recency.
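
To make the 'age' idea a bit more concrete, here is a minimal Python sketch of
one way such tracking could work.  It is only an assumption for illustration:
the `Region` fields, the `update_age` helper, and the tolerance parameters are
hypothetical and are not taken from the patch.  The age counts how many
aggregation intervals the region has kept roughly the same size and access
frequency, and it is reset when either changes.

```python
from dataclasses import dataclass


@dataclass
class Region:
    start: int        # start address (bytes)
    end: int          # end address (bytes, exclusive)
    nr_accesses: int  # access frequency observed in the last interval
    age: int = 0      # intervals for which size and frequency have held

    @property
    def size(self):
        return self.end - self.start


def update_age(region, new_start, new_end, new_nr_accesses,
               access_tolerance=0, size_tolerance=0):
    """Update a region after one aggregation interval (hypothetical rule)."""
    # Compare against the old values before overwriting them.
    size_changed = abs((new_end - new_start) - region.size) > size_tolerance
    freq_changed = abs(new_nr_accesses - region.nr_accesses) > access_tolerance
    region.start, region.end = new_start, new_end
    region.nr_accesses = new_nr_accesses
    # Reset the age on any significant change; otherwise the region got older.
    region.age = 0 if (size_changed or freq_changed) else region.age + 1
    return region
```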

Physical Address Space support
------------------------------

This version of DAMON supports only the virtual address spaces of processes,
but it will be extended to the physical address space[2].  The extension will
be quite simple because DAMON's monitoring primitives layer is separated from
its core logic.

How DAMON can be used for Shakeel's usages
------------------------------------------

The usages described in Shakeel's prior mail[1] are:

    1) Working set estimation: This is used for cluster level scheduling
    and controlling the knobs of memory overcommit.

    2) Proactive reclaim

    3) Balancing between memory tiers: Moving hot pages to fast tiers and
    cold pages to slow tiers

    4) Hugepage optimization: Hot memory backed by hugepages

    In addition, these uses are not happening in isolation. We want a
    combination of these running concurrently on a system. So, it is clear
    that the first version or step of DAMON which only targets virtual
    address space monitoring is not sufficient for these use-cases.

DAMON can support all of these use cases, as described below.

- Working set estimation: This can be done by iterating over the regions and
  checking whether the access frequency of each is higher than a threshold.
  Our user space tool provides an implementation[3] of this.  Below is
  pseudo-code for this:

    workingsets = []
    working_set_size = 0
    for region in regions:
        if region.access_frequency > threshold:
            workingsets.append(region)
            working_set_size += region.end_address - region.start_address
    return workingsets, working_set_size

- Proactive reclaim: This can be done by iterating over the regions, checking
  whether each has zero access frequency and an age higher than a time
  threshold, and reclaiming those that do.  We implemented this as a kernel
  module with only 354 lines of code[4].  Below is pseudo-code for this:

    for region in regions:
        if region.access_frequency == 0 and region.age > threshold:
            reclaim(region)

- Balancing between memory tiers: Because DAMON provides access frequency, we
  can identify not only idle memory regions but also cold/cool/warm/hot memory
  regions.  Once the functions for migrating pages from one tier to another
  have matured, applying DAMON to this usage will be quite straightforward.
  That is, for each region, if its access frequency and age are higher than
  thresholds, migrate the pages in the region to a faster tier.  If its access
  frequency is lower than a threshold and its age is higher than a threshold,
  migrate the pages in the region to a slower tier.  Below is pseudo-code for
  this:

    for region in regions:
        if region.age > age_threshold:
            if region.access_frequency > hot_threshold:
                migrate_to_fast_tier(region)
            elif region.access_frequency < cold_threshold:
                migrate_to_slow_tier(region)

- Hugepage optimization: This will be quite similar to tiers balancing, but we
  can also use the size of regions.  That is, we monitor virtual address
  spaces first.  Then, for each region, if its access frequency, age, and size
  are higher than thresholds (the size threshold would be 2MB), make the
  region be backed by huge pages.  If the age and size are higher than
  thresholds but the access frequency is lower than a threshold, make the huge
  pages of the region be backed by regular pages.  We evaluated this idea with
  a prototype[5].  It removed 76.15% of THP memory overheads while preserving
  51.25% of THP speedup.  Below is pseudo-code for this:

    for region in regions:
        if region.age > age_threshold and region.size >= 2 * MB:
            if region.access_frequency > hot_threshold:
                use_thps_for(region)
            elif region.access_frequency < cold_threshold:
                use_regular_pages_for(region)

- Combination of these running concurrently: DAMON will be extended to be able
  to monitor both the physical address space and virtual address spaces
  simultaneously, like below.

    struct damon_ctx *ctx_for_virt = damon_new_ctx();
    struct damon_ctx *ctx_for_phys = damon_new_ctx();
    struct damon_ctx *ctxs[] = {ctx_for_virt, ctx_for_phys};
    [...]
    /* first context for virtual address spaces monitoring */
    damon_va_set_primitives(ctx_for_virt);
    /* second context for physical address space monitoring */
    damon_pa_set_primitives(ctx_for_phys);
    damon_start(ctxs, 2);
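
The region-based policies above can also be sketched as one pass over a single
monitoring snapshot.  The following Python is a minimal, runnable sketch under
stated assumptions: the `Region` class, the `plan_actions` helper, and the
action names are hypothetical, the working set uses the same threshold as the
'hot' test, and the code only plans actions instead of reclaiming or migrating
anything.

```python
from dataclasses import dataclass

MB = 1 << 20


@dataclass
class Region:
    start: int             # start address (bytes)
    end: int               # end address (bytes, exclusive)
    access_frequency: int
    age: int               # how long size and frequency have been kept

    @property
    def size(self):
        return self.end - self.start


def plan_actions(regions, hot, cold, min_age):
    """Return (working_set_size, actions) for one monitoring snapshot.

    actions is a list of (action_name, region) pairs; applying them is
    left to the caller.
    """
    working_set_size = 0
    actions = []
    for r in regions:
        if r.access_frequency > hot:
            working_set_size += r.size
        if r.age <= min_age:
            continue  # pattern has not been stable long enough to act on
        if r.access_frequency == 0:
            actions.append(("reclaim", r))
        elif r.access_frequency > hot:
            actions.append(("migrate_to_fast_tier", r))
            if r.size >= 2 * MB:
                actions.append(("use_thps", r))
        elif r.access_frequency < cold:
            actions.append(("migrate_to_slow_tier", r))
            if r.size >= 2 * MB:
                actions.append(("use_regular_pages", r))
    return working_set_size, actions


regions = [
    Region(0 * MB, 4 * MB, access_frequency=10, age=5),  # hot, old, large
    Region(4 * MB, 5 * MB, access_frequency=0, age=7),   # idle, old
    Region(5 * MB, 8 * MB, access_frequency=1, age=9),   # cold, old, large
    Region(8 * MB, 9 * MB, access_frequency=9, age=0),   # hot but young
]
wss, actions = plan_actions(regions, hot=5, cold=2, min_age=3)
```

Here idle regions are reclaimed rather than demoted, which is just one possible
way to resolve the overlap between the proactive reclaim and tier balancing
policies.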

Extending for page-granularity monitoring
-----------------------------------------

To my understanding, Shakeel wants to do the above with page-granularity
monitoring.  It will inevitably incur high overhead, but for those who can
afford the cost, I will make DAMON support it, as below.

Even with the DAMON of this patchset, users can do page-granularity monitoring
by simply setting the 'min_nr_regions' and 'max_nr_regions' of DAMON to the
number of pages in the target address space (nr_pages).  Nevertheless, this
will result in the creation of 'nr_pages' region structs.  Assuming 4K pages,
this will result in about 1% memory waste, as each region struct consumes
about 44 bytes of memory.  Our plan for removing this overhead is as below.

In the future, it will be possible to entirely opt out of the regions
abstraction[6].  In that case, no region structs will be allocated, so their
memory overhead will be zero.  Nonetheless, the user will be required to
configure DAMON to use a special monitoring primitive that saves the
monitoring results, such as access frequency and age, somewhere such as their
own data structure or page flags, like the multi-gen LRU patchset does.  If
such a data structure is commonly usable, we can extend the DAMON core to
support it.  To show how this will work, we implemented a page-granularity
idleness monitoring primitive with only 69 lines of code[6].

Also, if someone has ideas for reducing the page granularity monitoring
overhead, we can put the optimization in the monitoring primitives layer, which
is separated from the core logic.
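
For reference, the 'about 1%' estimate for one region struct per page can be
checked with a short calculation.  This is a minimal sketch: the 64 GiB
monitoring target is an arbitrary example size chosen for illustration, and
the helper name is hypothetical.

```python
PAGE_SIZE = 4096         # bytes, assuming 4K pages
REGION_STRUCT_SIZE = 44  # bytes per region struct, approximate


def region_overhead(monitored_bytes):
    """Return (nr_regions, overhead_bytes, overhead_ratio), one region per page."""
    nr_pages = monitored_bytes // PAGE_SIZE
    overhead = nr_pages * REGION_STRUCT_SIZE
    return nr_pages, overhead, overhead / monitored_bytes


nr_regions, overhead, ratio = region_overhead(64 << 30)  # 64 GiB example
print(f"{nr_regions} regions, {overhead / (1 << 20):.0f} MiB, {ratio:.2%}")
```

The ratio is simply 44 / 4096, so it does not depend on the size of the
monitoring target.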

[1] https://lore.kernel.org/linux-mm/20201216084404.23183-2-sjpark@amazon.com/
[2] https://lore.kernel.org/linux-mm/20201216094221.11898-1-sjpark@amazon.com/
[3] https://github.com/awslabs/damo/blob/master/wss.py
[4] https://lore.kernel.org/linux-mm/20210720131309.22073-15-sj38.park@gmail.com/
[5] https://damonitor.github.io/doc/html/latest/vm/damon/eval.html#efficient-thp
[6] https://github.com/sjp38/linux/commit/9e0cb168d30e
[7] https://lore.kernel.org/linux-mm/20201216094221.11898-14-sjpark@amazon.com/


Thanks,
SeongJae Park