[RFC,v4,0/7] Implement Data Access Monitoring-based Memory Operation Schemes

Message ID 20200303121406.20954-1-sjpark@amazon.com

SeongJae Park March 3, 2020, 12:13 p.m. UTC
From: SeongJae Park <sjpark@amazon.de>

DAMON[1] can be used as a primitive for data access-aware memory management
optimizations.  That said, users who want such optimizations should run DAMON,
read the monitoring results, analyze them, plan a new memory management
scheme, and apply the new scheme by themselves.  Such efforts will be
inevitable for some complicated optimizations.

However, in many other cases, the users would simply want the system to apply
a memory management action to a memory region of a specific size that has kept
a specific access frequency for a specific time.  For example, "page out a
memory region larger than 100 MiB that has kept only rare accesses for more
than 2 minutes", or "Do not use THP for a memory region larger than 2 MiB that
has been rarely accessed for more than 1 second".

This RFC patchset makes DAMON handle such data access monitoring-based
operation schemes.  With this change, users can do data access-aware
optimizations by simply specifying their schemes to DAMON.


Evaluations
===========

Efficient THP
-------------

The Transparent Huge Pages (THP) subsystem could waste memory space in some
cases because it aggressively promotes regular pages to huge pages.  For this
reason, use of THP is prohibited by a number of memory-intensive programs such
as Redis[1] and MongoDB[2].

The two simple data access monitoring-based operation schemes below might be
helpful for the problem:

    # format: <min/max size> <min/max frequency (0-100)> <min/max age> <action>

    # If a memory region larger than 2 MiB is showing access rate higher than
    # 5%, apply MADV_HUGEPAGE to the region.
    2M	null	5	null	null	null	hugepage

    # If a memory region larger than 2 MiB is showing access rate lower than 5%
    # for more than 1 second, apply MADV_NOHUGEPAGE to the region.
    2M	null	null	5	1s	null	nohugepage
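For reference, such a scheme could be installed via the debugfs file that the
fourth patch of this series introduces, roughly as below.  This is only an
illustrative sketch: the debugfs path, the invocation of the converter, and
the input file name ('ethp_schemes.damos') here are assumptions.  The
human-friendly form above would first be converted into the raw debugfs format
(the last patch adds 'tools/damon/_convert_damos.py' for that purpose):

    # a minimal sketch, assuming debugfs is mounted on /sys/kernel/debug and
    # '_convert_damos.py' reads a human-friendly schemes file; the real
    # invocation and file format are defined by the patches themselves
    $ ./tools/damon/_convert_damos.py ethp_schemes.damos | \
        sudo tee /sys/kernel/debug/damon/schemes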

We can expect the schemes to reduce the memory space overhead of THP while
preserving some of its performance benefit.  I call these schemes Efficient
THP (ETHP).

Please note that these schemes are neither highly tuned nor for general use
cases.  They are made with only my straightforward intuition, for a
demonstration of DAMOS.


Setup
-----

On my personal QEMU/KVM based virtual machine on an Intel i7 host machine
running Ubuntu 18.04, I measure the runtime and consumed memory space of
various realistic workloads with several configurations.  I use 13 and 12
workloads in the PARSEC3[3] and SPLASH-2X[4] benchmark suites, respectively.
I use a wrapper script[5] for setup and run of the workloads.

For the measurement of the amount of memory consumed in the system-global
scope, I drop caches before starting each of the workloads and monitor
'MemFree' in the '/proc/meminfo' file.
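
In other words, the measurement for each workload goes roughly as below (a
minimal sketch; my actual scripts[5] differ in details):

    # drop the page, dentry, and inode caches before starting the workload
    $ echo 3 | sudo tee /proc/sys/vm/drop_caches

    # periodically sample system-wide free memory while the workload runs;
    # the consumed amount is the drop of 'MemFree' from the idle baseline
    $ grep MemFree /proc/meminfo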

The configurations I use are as below:

    orig: Linux v5.5 with 'madvise' THP policy
    thp: Linux v5.5 with 'always' THP policy
    ethp: Linux v5.5 applying the above schemes
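
Note that the system THP policy of each configuration can be selected via
sysfs, as below:

    # use the 'madvise' policy for 'orig', or 'always' for 'thp'
    $ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled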

To minimize measurement errors, I repeat each run 5 times and average the
results.  You can find the stdev, min, and max of the numbers among the
repeated runs in the appendix below.
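
For example, the averaging could be done with a one-liner like below
('result.1' to 'result.5' are hypothetical files, each holding one measured
number from one of the repeated runs):

    # average a measured number over the 5 repeated runs
    $ awk '{ sum += $1; n += 1 } END { printf "avg: %.3f\n", sum / n }' \
        result.[1-5]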


[1] "Redis latency problems troubleshooting", https://redis.io/topics/latency
[2] "Disable Transparent Huge Pages (THP)",
    https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
[3] "The PARSEC Becnhmark Suite", https://parsec.cs.princeton.edu/index.htm
[4] "SPLASH-2x", https://parsec.cs.princeton.edu/parsec3-doc.htm#splash2x
[5] "parsec3_on_ubuntu", https://github.com/sjp38/parsec3_on_ubuntu


Results
-------

TL;DR: 'ethp' removes 97.61% of 'thp' memory space overhead while preserving
25.40% (up to 88.36%) of 'thp' performance improvement in total.

Following sections show the results of the measurements with raw numbers and
'orig'-relative overheads (percent) of each configuration.


Memory Space Overheads
~~~~~~~~~~~~~~~~~~~~~~

Below shows the measured memory space consumption and overheads.  The raw
numbers are in KiB, and the overheads in parentheses are in percent.  For
example, 'parsec3/blackscholes' consumes about 1819486 KiB and 1824921 KiB
under the 'orig' and 'thp' configurations, respectively.  The overhead of
'thp' compared to 'orig' for the workload is hence 0.30%.
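
The overheads are computed as below:

    overhead (%) = (<config> - <orig>) / <orig> * 100
    e.g., (1824921.400 - 1819486.000) / 1819486.000 * 100 =~ 0.30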

              workloads  orig         thp (overhead)        ethp (overhead)
   parsec3/blackscholes  1819486.000  1824921.400 (  0.30)  1829070.600 (  0.53)
      parsec3/bodytrack  1417885.800  1417077.600 ( -0.06)  1427560.800 (  0.68)
        parsec3/canneal  1043876.800  1039773.000 ( -0.39)  1048445.200 (  0.44)
          parsec3/dedup  2400000.400  2434625.600 (  1.44)  2417374.400 (  0.72)
        parsec3/facesim  540206.400   542422.400 (  0.41)   551485.400 (  2.09)
         parsec3/ferret  320480.200   320157.000 ( -0.10)   331470.400 (  3.43)
   parsec3/fluidanimate  573961.400   572329.600 ( -0.28)   581836.000 (  1.37)
       parsec3/freqmine  983981.200   994839.600 (  1.10)   996124.600 (  1.23)
       parsec3/raytrace  1745175.200  1742756.400 ( -0.14)  1751706.000 (  0.37)
  parsec3/streamcluster  120558.800   120309.800 ( -0.21)   131997.800 (  9.49)
      parsec3/swaptions  14820.400    23388.800 ( 57.81)    24698.000 ( 66.65)
           parsec3/vips  2956319.200  2955803.600 ( -0.02)  2977506.200 (  0.72)
           parsec3/x264  3187699.000  3184944.000 ( -0.09)  3198462.800 (  0.34)
        splash2x/barnes  1212774.800  1221892.400 (  0.75)  1212100.800 ( -0.06)
           splash2x/fft  9364725.000  9267074.000 ( -1.04)  8997901.200 ( -3.92)
         splash2x/lu_cb  515242.400   519881.400 (  0.90)   526621.600 (  2.21)
        splash2x/lu_ncb  517308.000   520396.400 (  0.60)   521732.400 (  0.86)
      splash2x/ocean_cp  3348189.400  3380799.400 (  0.97)  3328473.400 ( -0.59)
     splash2x/ocean_ncp  3908599.800  7072076.800 ( 80.94)  4449410.400 ( 13.84)
     splash2x/radiosity  1469087.800  1482244.400 (  0.90)  1471781.000 (  0.18)
         splash2x/radix  1712487.400  1385972.800 (-19.07)  1420461.800 (-17.05)
      splash2x/raytrace  45030.600    50946.600 ( 13.14)    58586.200 ( 30.10)
       splash2x/volrend  151037.800   151188.000 (  0.10)   163213.600 (  8.06)
splash2x/water_nsquared  47442.400    47257.000 ( -0.39)    59285.800 ( 24.96)
 splash2x/water_spatial  667355.200   666824.400 ( -0.08)   673274.400 (  0.89)
                  total  40083800.000 42939900.000 (  7.13) 40150600.000 (  0.17)

In total, 'thp' shows 7.13% memory space overhead while 'ethp' shows only
0.17% overhead.  In other words, 'ethp' removes 97.61% of the 'thp' memory
space overhead ((7.13 - 0.17) / 7.13 =~ 97.6%).

For almost every workload, 'ethp' consistently shows about 10-15 MiB of memory
space overhead, mainly due to the Python wrapper I used for convenient test
runs.  Using DAMON's raw interface would remove this overhead.

In the cases of 'parsec3/swaptions' and 'splash2x/raytrace', 'ethp' shows even
higher memory space overhead.  This is mainly due to the small sizes of the
workloads and the constant memory overhead of 'ethp', which comes from the
Python wrapper.  The workloads consume only about 14 MiB and 45 MiB,
respectively.  Because the constant memory consumption of the Python wrapper
(about 10-15 MiB) is relatively large compared to such small working sets, the
relative overhead becomes high.  Nonetheless, such small workloads are not
appropriate targets of 'ethp', and the overhead can be removed by avoiding use
of the wrapper.


Runtime Overheads
~~~~~~~~~~~~~~~~~

Below shows the measured runtimes in a similar way.  The raw numbers are in
seconds and the overheads are in percent.  Negative overheads mean speedups.

                runtime  orig      thp (overhead)     ethp (overhead)
   parsec3/blackscholes  107.003   106.468 ( -0.50)   107.260 (  0.24)
      parsec3/bodytrack  78.854    78.757 ( -0.12)    79.261 (  0.52)
        parsec3/canneal  137.520   120.854 (-12.12)   132.427 ( -3.70)
          parsec3/dedup  11.873    11.665 ( -1.76)    11.883 (  0.09)
        parsec3/facesim  207.895   204.215 ( -1.77)   206.170 ( -0.83)
         parsec3/ferret  190.507   189.972 ( -0.28)   190.818 (  0.16)
   parsec3/fluidanimate  211.064   208.862 ( -1.04)   211.874 (  0.38)
       parsec3/freqmine  290.157   288.831 ( -0.46)   292.495 (  0.81)
       parsec3/raytrace  118.460   118.741 (  0.24)   119.808 (  1.14)
  parsec3/streamcluster  324.524   283.709 (-12.58)   307.209 ( -5.34)
      parsec3/swaptions  154.458   154.894 (  0.28)   155.307 (  0.55)
           parsec3/vips  58.588    58.622 (  0.06)    59.037 (  0.77)
           parsec3/x264  66.493    66.604 (  0.17)    67.051 (  0.84)
        splash2x/barnes  79.769    73.886 ( -7.38)    78.737 ( -1.29)
           splash2x/fft  32.857    22.960 (-30.12)    25.808 (-21.45)
         splash2x/lu_cb  85.113    84.939 ( -0.20)    85.344 (  0.27)
        splash2x/lu_ncb  92.408    90.103 ( -2.49)    93.585 (  1.27)
      splash2x/ocean_cp  44.374    42.876 ( -3.37)    43.613 ( -1.71)
     splash2x/ocean_ncp  80.710    51.831 (-35.78)    71.498 (-11.41)
     splash2x/radiosity  90.626    90.398 ( -0.25)    91.238 (  0.68)
         splash2x/radix  30.875    25.226 (-18.30)    25.882 (-16.17)
      splash2x/raytrace  84.114    82.602 ( -1.80)    85.124 (  1.20)
       splash2x/volrend  86.796    86.347 ( -0.52)    88.223 (  1.64)
splash2x/water_nsquared  230.781   220.667 ( -4.38)   232.664 (  0.82)
 splash2x/water_spatial  88.719    90.187 (  1.65)    89.228 (  0.57)
                  total  2984.530  2854.220 ( -4.37)  2951.540 ( -1.11)

In total, 'thp' shows 4.37% speedup while 'ethp' shows 1.11% speedup.  In
other words, 'ethp' preserves about 25.40% of the THP performance benefit
(1.11 / 4.37 =~ 25.4%).

In the best case (splash2x/raytrace), 'ethp' preserves 88.36% of the benefit.

If we narrow down to the workloads showing high THP performance benefits
(splash2x/fft, splash2x/ocean_ncp, and splash2x/radix), 'thp' and 'ethp' show
30.75% and 14.71% speedup in total, respectively.  In other words, 'ethp'
preserves about 47.83% of the benefit for those.

Even in the worst case (splash2x/volrend), 'ethp' incurs only 1.64% runtime
overhead, which is similar to that of 'thp' (1.65% for
'splash2x/water_spatial').


Sequence Of Patches
===================

The patches are based on v5.5 plus the v5 DAMON patchset[1] and Minchan's
``madvise()`` factor-out patch[2].  Minchan's patch was necessary for reuse of
the ``madvise()`` code in DAMON.  You can also clone the complete git tree:

    $ git clone git://github.com/sjp38/linux -b damos/rfc/v4

A web view is also available at:
https://github.com/sjp38/linux/releases/tag/damos/rfc/v4


[1] https://lore.kernel.org/linux-mm/20200217103110.30817-1-sjpark@amazon.com/
[2] https://lore.kernel.org/linux-mm/20200128001641.5086-2-minchan@kernel.org/

The first patch allows DAMON to reuse the ``madvise()`` code for the actions.
The second patch accounts the age of each region.  The third patch implements
the handling of the schemes in DAMON and exports a kernel space programming
interface for it.  The fourth patch implements a debugfs interface for
privileged people and programs.  The fifth and sixth patches add kunit tests
and selftests for these changes, respectively, and finally the seventh patch
modifies the user space tool for DAMON to support description and application
of the schemes in a human-friendly way.


Patch History
=============

Changes from RFC v3
(https://lore.kernel.org/linux-mm/20200225102300.23895-1-sjpark@amazon.com/)
 - Add Reviewed-by from Brendan Higgins
 - Code cleanup: Modularize madvise() call
 - Fix a trivial bug in the wrapper python script
 - Add more stable and detailed evaluation results with updated ETHP scheme

Changes from RFC v2
(https://lore.kernel.org/linux-mm/20200218085309.18346-1-sjpark@amazon.com/)
 - Fix the aging mechanism for better 'old region' selection
 - Add more kunittests and kselftests for this patchset
 - Support more human friendly description and application of 'schemes'

Changes from RFC v1
(https://lore.kernel.org/linux-mm/20200210150921.32482-1-sjpark@amazon.com/)
 - Properly adjust age accounting related properties after splitting, merging,
   and action applying

SeongJae Park (7):
  mm/madvise: Export madvise_common() to mm internal code
  mm/damon: Account age of target regions
  mm/damon: Implement data access monitoring-based operation schemes
  mm/damon/schemes: Implement a debugfs interface
  mm/damon-test: Add kunit test case for regions age accounting
  mm/damon/selftests: Add 'schemes' debugfs tests
  damon/tools: Support more human friendly 'schemes' control

 include/linux/damon.h                         |  29 ++
 mm/damon-test.h                               |   5 +
 mm/damon.c                                    | 391 +++++++++++++++++-
 mm/internal.h                                 |   4 +
 mm/madvise.c                                  |   3 +-
 tools/damon/_convert_damos.py                 | 125 ++++++
 tools/damon/_damon.py                         | 143 +++++++
 tools/damon/damo                              |   7 +
 tools/damon/record.py                         | 135 +-----
 tools/damon/schemes.py                        | 105 +++++
 .../testing/selftests/damon/debugfs_attrs.sh  |  29 ++
 11 files changed, 845 insertions(+), 131 deletions(-)
 create mode 100755 tools/damon/_convert_damos.py
 create mode 100644 tools/damon/_damon.py
 create mode 100644 tools/damon/schemes.py