[RFC v6 0/7] Implement Data Access Monitoring-based Memory Operation Schemes

Message ID: 20200407100007.3894-1-sjpark@amazon.com

SeongJae Park April 7, 2020, 9:59 a.m. UTC
From: SeongJae Park <sjpark@amazon.de>

DAMON[1] can be used as a primitive for data access-aware memory management
optimizations.  That said, users who want such optimizations have to run
DAMON, read the monitoring results, analyze them, plan a new memory
management scheme, and apply the new scheme by themselves.  Such effort is
inevitable for some complicated optimizations.

However, in many other cases, the users would simply want the system to apply
a memory management action to a memory region of a specific size that shows a
specific access frequency for a specific amount of time.  For example, "page
out a memory region larger than 100 MiB that has kept only rare accesses for
more than 2 minutes", or "do not use THP for a memory region larger than
2 MiB that has rarely been accessed for more than 1 second".

This RFC patchset makes DAMON handle such data access monitoring-based
operation schemes.  With this change, users can do data access-aware
optimizations by simply specifying their schemes to DAMON.
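
For example, the first policy above could be written in the scheme
description format that this patchset's user space tool accepts (the format
is shown in the Evaluations section below; the 5% "rare access" threshold
here is only an illustrative choice of mine):

    # page out regions >100 MiB that kept a <=5% access rate for >2 minutes
    100M null    null 5    2m null    pageout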


Evaluations
===========

Setup
-----

On my personal QEMU/KVM based virtual machine on an Intel i7 host machine
running Ubuntu 18.04, I measure the runtime and the consumed system memory
while running various realistic workloads with several configurations.  I use
13 workloads from the PARSEC3[3] benchmark suite and 12 from SPLASH-2X[4].  I
use my own wrapper scripts[5] for setup and runs of the workloads.
On top of the DAMON patchset, the DAMON-based operation schemes patchset[6] is
also applied for this evaluation.

Measurement
~~~~~~~~~~~

To measure the amount of memory consumed system-wide, I drop caches before
starting each workload and monitor 'MemFree' in '/proc/meminfo'.  To make the
results more stable, I repeat each run 5 times and average the results.  The
stdev, min, and max of the numbers among the repeated runs are in the appendix
below.
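
The measurement loop is conceptually as simple as the sketch below (my own
illustration in Python, not the actual wrapper scripts[5]; the hypothetical
'./run_workload.sh' stands for a workload launcher):

    # Drop caches, run the workload while sampling 'MemFree', and report the
    # average amount of memory in use during the run.
    import statistics
    import subprocess
    import time

    def memfree_kib():
        with open('/proc/meminfo') as f:
            for line in f:
                if line.startswith('MemFree:'):
                    return int(line.split()[1])

    def measure_once(cmd):
        with open('/proc/sys/vm/drop_caches', 'w') as f:  # requires root
            f.write('3')
        baseline = memfree_kib()
        proc = subprocess.Popen(cmd)
        samples = []
        while proc.poll() is None:
            samples.append(baseline - memfree_kib())  # memory in use now
            time.sleep(0.5)
        return statistics.mean(samples) if samples else 0

    # Repeat 5 times and average, as described above.
    runs = [measure_once(['./run_workload.sh']) for _ in range(5)]
    print('memused.avg (KiB):', statistics.mean(runs))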

Configurations
~~~~~~~~~~~~~~

The configurations I use are as follows.

orig: Linux v5.5 with the 'madvise' THP policy
rec: 'orig' plus DAMON running with the record feature
thp: same as 'orig', but with the 'always' THP policy
ethp: 'orig' plus a DAMON operation scheme[6], 'efficient THP'
prcl: 'orig' plus a DAMON operation scheme, 'proactive reclaim[7]'

I use 'rec' to measure the overhead DAMON imposes on the target workloads and
the system memory.  The remaining configs, 'thp', 'ethp', and 'prcl', are for
measuring DAMON's monitoring accuracy.

'ethp' and 'prcl' are simple DAMON-based operation schemes developed as proofs
of concept of DAMON.  'ethp' reduces the memory space waste of THP by using
DAMON for the decision of promotions and demotions of huge pages, while 'prcl'
is similar to the original proactive reclamation work[7].  They are
implemented as below:

# format: <min/max size> <min/max frequency (0-100)> <min/max age> <action>
# ethp: Use huge pages if a region >2MB shows >5% access rate, use regular
# pages if a region >2MB shows <5% access rate for >1 second
2M null    5 null    null null    hugepage
2M null    null 5    1s null      nohugepage

# prcl: If a region >4KB shows <5% access rate for >5 seconds, page out.
4K null    null 5    500ms null      pageout

Note that both 'ethp' and 'prcl' are designed with only my straightforward
intuition, because they are only for proof of concept and for checking DAMON's
monitoring accuracy.  In other words, they are not for production; for
production use, they should be tuned further.
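
For reference, the scheme lines above use 'null' for "no limit", size suffixes
such as K/M/G, and time suffixes such as ms/s/m.  The sketch below is my own
illustrative parser for this human-readable format; the actual conversion in
this patchset is done by tools/damon/_convert_damos.py and may differ in
details:

    # Parse one human-readable scheme line into its seven fields.
    SIZE_UNITS = {'K': 1 << 10, 'M': 1 << 20, 'G': 1 << 30}
    TIME_UNITS_MS = {'ms': 1, 's': 1000, 'm': 60 * 1000}

    def parse_size(tok):
        if tok == 'null':
            return None  # no limit
        if tok[-1] in SIZE_UNITS:
            return int(tok[:-1]) * SIZE_UNITS[tok[-1]]
        return int(tok)

    def parse_time_ms(tok):
        if tok == 'null':
            return None  # no limit
        for suffix, mult in TIME_UNITS_MS.items():
            if tok.endswith(suffix) and tok[:-len(suffix)].isdigit():
                return int(tok[:-len(suffix)]) * mult
        return int(tok)

    def parse_scheme(line):
        min_sz, max_sz, min_freq, max_freq, min_age, max_age, action = line.split()
        return {
            'size_bytes': (parse_size(min_sz), parse_size(max_sz)),
            'freq_percent': (None if min_freq == 'null' else int(min_freq),
                             None if max_freq == 'null' else int(max_freq)),
            'age_ms': (parse_time_ms(min_age), parse_time_ms(max_age)),
            'action': action,
        }

    print(parse_scheme('2M null    5 null    null null    hugepage'))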


[1] "Redis latency problems troubleshooting", https://redis.io/topics/latency
[2] "Disable Transparent Huge Pages (THP)",
    https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
[3] "The PARSEC Becnhmark Suite", https://parsec.cs.princeton.edu/index.htm
[4] "SPLASH-2x", https://parsec.cs.princeton.edu/parsec3-doc.htm#splash2x
[5] "parsec3_on_ubuntu", https://github.com/sjp38/parsec3_on_ubuntu
[6] "[RFC v4 0/7] Implement Data Access Monitoring-based Memory Operation
    Schemes",
    https://lore.kernel.org/linux-mm/20200303121406.20954-1-sjpark@amazon.com/
[7] "Proactively reclaiming idle memory", https://lwn.net/Articles/787611/


Results
-------

The two tables below show the measurement results.  The runtimes are in
seconds, while the memory usages are in KiB.  Each configuration except 'orig'
shows its overhead relative to 'orig', in percent, within parentheses (that
is, 100 * (config - orig) / orig).

runtime                 orig     rec      (overhead) thp      (overhead) ethp     (overhead) prcl     (overhead)
parsec3/blackscholes    107.097  106.955  (-0.13)    106.352  (-0.70)    107.357  (0.24)     108.284  (1.11)
parsec3/bodytrack       79.135   79.062   (-0.09)    78.996   (-0.18)    79.261   (0.16)     79.824   (0.87)
parsec3/canneal         139.036  139.694  (0.47)     125.947  (-9.41)    131.071  (-5.73)    148.648  (6.91)
parsec3/dedup           11.914   11.905   (-0.07)    11.729   (-1.55)    11.916   (0.02)     12.613   (5.87)
parsec3/facesim         208.761  209.476  (0.34)     204.778  (-1.91)    206.157  (-1.25)    214.016  (2.52)
parsec3/ferret          190.854  191.309  (0.24)     190.223  (-0.33)    190.821  (-0.02)    191.847  (0.52)
parsec3/fluidanimate    211.317  213.798  (1.17)     208.883  (-1.15)    211.319  (0.00)     214.566  (1.54)
parsec3/freqmine        288.672  290.547  (0.65)     288.310  (-0.13)    288.727  (0.02)     292.294  (1.25)
parsec3/raytrace        118.692  119.443  (0.63)     118.625  (-0.06)    118.986  (0.25)     129.942  (9.48)
parsec3/streamcluster   323.387  327.244  (1.19)     284.931  (-11.89)   290.604  (-10.14)   330.111  (2.08)
parsec3/swaptions       154.304  154.891  (0.38)     154.373  (0.04)     155.226  (0.60)     155.338  (0.67)
parsec3/vips            58.879   59.254   (0.64)     58.459   (-0.71)    59.029   (0.25)     59.761   (1.50)
parsec3/x264            71.805   68.718   (-4.30)    67.262   (-6.33)    69.494   (-3.22)    71.291   (-0.72)
splash2x/barnes         80.624   80.680   (0.07)     74.538   (-7.55)    78.363   (-2.80)    86.373   (7.13)
splash2x/fft            33.462   33.285   (-0.53)    23.146   (-30.83)   33.306   (-0.47)    35.311   (5.53)
splash2x/lu_cb          85.474   85.681   (0.24)     84.516   (-1.12)    85.525   (0.06)     87.267   (2.10)
splash2x/lu_ncb         93.227   93.211   (-0.02)    90.939   (-2.45)    93.526   (0.32)     94.409   (1.27)
splash2x/ocean_cp       44.348   44.668   (0.72)     42.920   (-3.22)    44.128   (-0.50)    45.785   (3.24)
splash2x/ocean_ncp      81.234   81.275   (0.05)     51.441   (-36.67)   64.974   (-20.02)   94.207   (15.97)
splash2x/radiosity      90.976   91.131   (0.17)     90.325   (-0.72)    91.395   (0.46)     97.867   (7.57)
splash2x/radix          31.269   31.185   (-0.27)    25.103   (-19.72)   29.289   (-6.33)    37.713   (20.61)
splash2x/raytrace       83.945   84.242   (0.35)     82.314   (-1.94)    83.334   (-0.73)    84.655   (0.85)
splash2x/volrend        86.703   87.545   (0.97)     86.324   (-0.44)    86.717   (0.02)     87.925   (1.41)
splash2x/water_nsquared 230.426  232.979  (1.11)     219.950  (-4.55)    224.474  (-2.58)    235.770  (2.32)
splash2x/water_spatial  88.982   89.748   (0.86)     89.086   (0.12)     89.431   (0.50)     95.849   (7.72)
total                   2994.520 3007.910 (0.45)     2859.470 (-4.51)    2924.420 (-2.34)    3091.670 (3.24)


memused.avg             orig         rec          (overhead) thp          (overhead) ethp         (overhead) prcl         (overhead)
parsec3/blackscholes    1821479.200  1836018.600  (0.80)     1822020.600  (0.03)     1834214.200  (0.70)     1721607.800  (-5.48)
parsec3/bodytrack       1418698.400  1434689.800  (1.13)     1419134.400  (0.03)     1430609.800  (0.84)     1433137.600  (1.02)
parsec3/canneal         1045065.400  1052992.400  (0.76)     1042607.400  (-0.24)    1048730.400  (0.35)     1049446.000  (0.42)
parsec3/dedup           2387073.200  2425093.600  (1.59)     2398469.600  (0.48)     2416738.400  (1.24)     2433976.800  (1.96)
parsec3/facesim         540075.800   554130.000   (2.60)     544759.400   (0.87)     553325.800   (2.45)     489255.600   (-9.41)
parsec3/ferret          316932.800   331383.600   (4.56)     320355.800   (1.08)     331042.000   (4.45)     328275.600   (3.58)
parsec3/fluidanimate    576466.400   587466.600   (1.91)     582737.000   (1.09)     582560.600   (1.06)     499228.800   (-13.40)
parsec3/freqmine        985864.000   996351.800   (1.06)     990195.000   (0.44)     997435.400   (1.17)     809333.800   (-17.91)
parsec3/raytrace        1749485.600  1753601.400  (0.24)     1744385.000  (-0.29)    1755230.400  (0.33)     1597574.400  (-8.68)
parsec3/streamcluster   120976.200   133270.000   (10.16)    118688.200   (-1.89)    132846.800   (9.81)     133412.400   (10.28)
parsec3/swaptions       14953.600    28689.400    (91.86)    15826.000    (5.83)     26803.000    (79.24)    27754.400    (85.60)
parsec3/vips            2940086.400  2965866.800  (0.88)     2943217.200  (0.11)     2960823.600  (0.71)     2968121.000  (0.95)
parsec3/x264            3179843.200  3186839.600  (0.22)     3175893.600  (-0.12)    3182023.400  (0.07)     3202598.000  (0.72)
splash2x/barnes         1210899.200  1211648.600  (0.06)     1219328.800  (0.70)     1217686.000  (0.56)     1126669.000  (-6.96)
splash2x/fft            9322834.800  9142039.200  (-1.94)    9183937.800  (-1.49)    9159042.800  (-1.76)    9321729.200  (-0.01)
splash2x/lu_cb          515411.200   523698.400   (1.61)     521019.800   (1.09)     523047.400   (1.48)     461828.400   (-10.40)
splash2x/lu_ncb         514869.000   525223.000   (2.01)     521820.600   (1.35)     522588.800   (1.50)     480118.400   (-6.75)
splash2x/ocean_cp       3345433.400  3298946.800  (-1.39)    3377377.000  (0.95)     3289771.600  (-1.66)    3273329.800  (-2.16)
splash2x/ocean_ncp      3902999.600  3873302.600  (-0.76)    7069853.000  (81.14)    4962220.800  (27.14)    3772835.600  (-3.33)
splash2x/radiosity      1471551.000  1470698.600  (-0.06)    1481433.200  (0.67)     1466283.400  (-0.36)    838138.400   (-43.04)
splash2x/radix          1700185.000  1674226.400  (-1.53)    1386397.600  (-18.46)   1544387.800  (-9.16)    1957567.600  (15.14)
splash2x/raytrace       45493.800    57050.800    (25.40)    50134.000    (10.20)    60166.400    (32.25)    57634.000    (26.69)
splash2x/volrend        150549.200   165190.600   (9.73)     151509.600   (0.64)     162845.000   (8.17)     161346.000   (7.17)
splash2x/water_nsquared 46275.200    58483.600    (26.38)    71529.200    (54.57)    56770.200    (22.68)    59995.800    (29.65)
splash2x/water_spatial  666577.200   672511.800   (0.89)     667422.200   (0.13)     674555.000   (1.20)     608374.000   (-8.73)
total                   39990000.000 39959400.000 (-0.08)    42819900.000 (7.08)     40891655.000 (2.25)     38813174.000 (-2.94)


DAMON Overheads
~~~~~~~~~~~~~~~

In total, the DAMON recording feature incurs 0.45% runtime overhead (up to
1.19% in the worst case, with 'parsec3/streamcluster') and -0.08% memory space
overhead.

For convenient test runs of 'rec', I use a Python wrapper.  The wrapper
constantly consumes about 10-15 MiB of memory, which becomes a high relative
memory overhead if the target workload has a small memory footprint.  In
detail, 10%, 91%, 25%, 9%, and 26% overheads are shown for
parsec3/streamcluster (125 MiB), parsec3/swaptions (15 MiB), splash2x/raytrace
(45 MiB), splash2x/volrend (151 MiB), and splash2x/water_nsquared (46 MiB),
respectively.  Nonetheless, the overheads come not from DAMON but from the
wrapper, and thus should be ignored.  This fake memory overhead also appears
in 'ethp' and 'prcl', as those configurations use the Python wrapper as well.
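
A rough sanity check shows the listed percentages are consistent with a
constant wrapper footprint (a back-of-the-envelope computation of mine,
assuming the wrapper adds roughly 13 MiB regardless of the workload):

    # Relative overhead of an assumed ~13 MiB constant wrapper footprint.
    footprints_mib = {'parsec3/streamcluster': 125, 'parsec3/swaptions': 15,
                      'splash2x/raytrace': 45, 'splash2x/volrend': 151,
                      'splash2x/water_nsquared': 46}
    wrapper_mib = 13  # assumed constant cost of the Python wrapper
    for name, mib in footprints_mib.items():
        print(f'{name}: ~{100 * wrapper_mib / mib:.0f}% from the wrapper alone')
    # Prints roughly 10%, 87%, 29%, 9%, and 28%, close to the observed numbers.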


Efficient THP
~~~~~~~~~~~~~

The THP 'always' policy achieves a 4.51% speedup in total, but incurs 7.08%
memory overhead.  It achieves a 36.67% speedup in the best case, but 81.14%
memory overhead in the worst case.  Interestingly, both the best and the worst
case come from 'splash2x/ocean_ncp'.

The two-line implementation of the data access monitoring-based THP version
('ethp') shows a 2.34% speedup and 2.25% memory overhead.  In other words,
'ethp' removes 68.22% of the THP memory waste while preserving 51.88% of the
THP speedup in total.  In the case of 'splash2x/ocean_ncp', 'ethp' removes
66.55% of the THP memory waste while preserving 74% of the THP speedup.
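
The total-level summary numbers follow directly from the table totals; a quick
recomputation (my own, not part of the evaluation scripts):

    # Derive the 'ethp' summary figures from the total overheads above.
    thp_speedup, ethp_speedup = 4.51, 2.34    # total runtime reduction (%)
    thp_mem, ethp_mem = 7.08, 2.25            # total memused.avg overhead (%)
    waste_removed = 100 * (thp_mem - ethp_mem) / thp_mem        # ~68.22%
    speedup_preserved = 100 * ethp_speedup / thp_speedup        # ~51.88%
    print(f'{waste_removed:.2f}% of THP memory waste removed')
    print(f'{speedup_preserved:.2f}% of THP speedup preserved')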


Proactive Reclamation
~~~~~~~~~~~~~~~~~~~~~

As in the original work, I use a 'zram' swap device for this configuration.

In total, the one-line implementation of proactive reclamation, 'prcl', incurs
3.24% runtime overhead while achieving a 2.94% reduction of system memory
usage.

Nonetheless, as the memory usage is calculated from 'MemFree' in
'/proc/meminfo', it includes the SwapCached pages.  As the swap-cached pages
can be easily evicted, I also measured the resident set size (RSS) of the
workloads:

rss.avg                 orig         rec          (overhead) thp          (overhead) ethp         (overhead) prcl         (overhead)
parsec3/blackscholes    589877.400   591587.600   (0.29)     593797.000   (0.66)     591090.800   (0.21)     424841.800   (-27.98)
parsec3/bodytrack       32326.600    32289.800    (-0.11)    32284.000    (-0.13)    32249.600    (-0.24)    28931.800    (-10.50)
parsec3/canneal         839469.400   840116.600   (0.08)     838083.800   (-0.17)    837870.400   (-0.19)    833193.800   (-0.75)
parsec3/dedup           1194881.800  1207486.800  (1.05)     1217461.000  (1.89)     1225107.000  (2.53)     995459.400   (-16.69)
parsec3/facesim         311416.600   311812.800   (0.13)     314923.000   (1.13)     312525.200   (0.36)     195057.600   (-37.36)
parsec3/ferret          99787.800    99655.400    (-0.13)    101332.800   (1.55)     99820.400    (0.03)     93295.000    (-6.51)
parsec3/fluidanimate    531801.600   531784.800   (-0.00)    531775.400   (-0.00)    531928.600   (0.02)     432113.400   (-18.75)
parsec3/freqmine        552404.600   553054.400   (0.12)     555716.400   (0.60)     554045.600   (0.30)     157776.200   (-71.44)
parsec3/raytrace        894502.400   892753.600   (-0.20)    888306.200   (-0.69)    892790.600   (-0.19)    374962.600   (-58.08)
parsec3/streamcluster   110877.200   110846.400   (-0.03)    111255.400   (0.34)     111467.600   (0.53)     110063.400   (-0.73)
parsec3/swaptions       5637.600     5611.600     (-0.46)    5621.400     (-0.29)    5630.200     (-0.13)    4594.800     (-18.50)
parsec3/vips            31897.600    31803.800    (-0.29)    32336.400    (1.38)     32168.000    (0.85)     30496.800    (-4.39)
parsec3/x264            82068.400    81975.600    (-0.11)    83066.400    (1.22)     82656.400    (0.72)     80752.400    (-1.60)
splash2x/barnes         1210976.600  1215669.400  (0.39)     1224071.200  (1.08)     1219203.200  (0.68)     1047794.600  (-13.48)
splash2x/fft            9714139.000  9623503.600  (-0.93)    9523996.200  (-1.96)    9555242.400  (-1.64)    9050047.000  (-6.84)
splash2x/lu_cb          510368.800   510468.800   (0.02)     514496.800   (0.81)     510299.200   (-0.01)    445912.000   (-12.63)
splash2x/lu_ncb         510149.600   510325.600   (0.03)     513899.000   (0.73)     510331.200   (0.04)     465811.200   (-8.69)
splash2x/ocean_cp       3407224.400  3405827.200  (-0.04)    3437758.400  (0.90)     3394473.000  (-0.37)    3334869.600  (-2.12)
splash2x/ocean_ncp      3919511.200  3934023.000  (0.37)     7181317.200  (83.22)    5074390.600  (29.46)    3560788.200  (-9.15)
splash2x/radiosity      1474982.000  1476292.400  (0.09)     1485884.000  (0.74)     1474162.800  (-0.06)    695592.400   (-52.84)
splash2x/radix          1765313.200  1752605.000  (-0.72)    1440052.200  (-18.43)   1662186.600  (-5.84)    1888954.800  (7.00)
splash2x/raytrace       23277.600    23289.600    (0.05)     29185.600    (25.38)    26960.600    (15.82)    21139.400    (-9.19)
splash2x/volrend        44110.600    44069.200    (-0.09)    44321.600    (0.48)     44436.000    (0.74)     28610.400    (-35.14)
splash2x/water_nsquared 29412.800    29443.200    (0.10)     29470.000    (0.19)     29894.600    (1.64)     27927.800    (-5.05)
splash2x/water_spatial  655785.200   656694.400   (0.14)     655665.200   (-0.02)    656572.000   (0.12)     558691.000   (-14.81)
total                   28542100.000 28472900.000 (-0.24)    31386000.000 (9.96)     29467572.000 (3.24)     24887691.000 (-12.80)

In total, the resident sets were reduced by 12.80%.

For parsec3/freqmine, 'prcl' reduces the system memory usage by 17.91% and the
resident set by 71.44%, while incurring only 1.25% runtime overhead.
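
For reference, the RSS numbers above can be sampled with something as simple
as the sketch below (my own illustration, assuming the target pid is known;
the actual measurement is done by the wrapper scripts[5]):

    # Read the current resident set size (VmRSS) of a process, in KiB.
    def rss_kib(pid):
        with open(f'/proc/{pid}/status') as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])
        return 0  # e.g., kernel threads have no VmRSS entry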


Sequence Of Patches
===================

The patches are based on v5.6 plus the v8 DAMON patchset[1] and Minchan's
``do_madvise()`` patch[2].  Minchan's patch is necessary for reusing the
``madvise()`` code in DAMON.  You can also clone the complete git tree:

    $ git clone git://github.com/sjp38/linux -b damos/rfc/v6

The tree is also available on the web:
https://github.com/sjp38/linux/releases/tag/damos/rfc/v6


[1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/
[2] https://lore.kernel.org/linux-mm/20200302193630.68771-2-minchan@kernel.org/

The first patch allows DAMON to reuse the ``madvise()`` code for the actions.
The second patch accounts the age of each region.  The third patch implements
the handling of the schemes in DAMON and exports a kernel space programming
interface for it.  The fourth patch implements a debugfs interface for
privileged users and programs.  The fifth and sixth patches add kunit tests
and selftests for these changes, respectively, and finally the seventh patch
adds human-friendly schemes support to the user space tool for DAMON.


Patch History
=============

Changes from RFC v5
(https://lore.kernel.org/linux-mm/20200330115042.17431-1-sjpark@amazon.com/)
 - Rebase on DAMON v8 patchset
 - Update test results
 - Fix DAMON userspace tool crash on signal handling
 - Fix checkpatch warnings

Changes from RFC v4
(https://lore.kernel.org/linux-mm/20200303121406.20954-1-sjpark@amazon.com/)
 - Handle CONFIG_ADVISE_SYSCALL
 - Clean up code (Jonathan Cameron)
 - Update test results
 - Rebase on v5.6 + DAMON v7

Changes from RFC v3
(https://lore.kernel.org/linux-mm/20200225102300.23895-1-sjpark@amazon.com/)
 - Add Reviewed-by from Brendan Higgins
 - Code cleanup: Modularize madvise() call
 - Fix a trivial bug in the wrapper python script
 - Add more stable and detailed evaluation results with updated ETHP scheme

Changes from RFC v2
(https://lore.kernel.org/linux-mm/20200218085309.18346-1-sjpark@amazon.com/)
 - Fix the aging mechanism for better 'old region' selection
 - Add more kunittests and kselftests for this patchset
 - Support more human-friendly description and application of 'schemes'

Changes from RFC v1
(https://lore.kernel.org/linux-mm/20200210150921.32482-1-sjpark@amazon.com/)
 - Properly adjust age accounting related properties after splitting, merging,
   and action applying

SeongJae Park (7):
  mm/madvise: Export do_madvise() to external GPL modules
  mm/damon: Account age of target regions
  mm/damon: Implement data access monitoring-based operation schemes
  mm/damon/schemes: Implement a debugfs interface
  mm/damon-test: Add kunit test case for regions age accounting
  mm/damon/selftests: Add 'schemes' debugfs tests
  damon/tools: Support more human friendly 'schemes' control

 include/linux/damon.h                         |  29 ++
 mm/damon-test.h                               |   5 +
 mm/damon.c                                    | 428 +++++++++++++++++-
 mm/madvise.c                                  |   1 +
 tools/damon/_convert_damos.py                 | 126 ++++++
 tools/damon/_damon.py                         | 143 ++++++
 tools/damon/damo                              |   7 +
 tools/damon/record.py                         | 135 +-----
 tools/damon/schemes.py                        | 105 +++++
 .../testing/selftests/damon/debugfs_attrs.sh  |  29 ++
 10 files changed, 879 insertions(+), 129 deletions(-)
 create mode 100755 tools/damon/_convert_damos.py
 create mode 100644 tools/damon/_damon.py
 create mode 100644 tools/damon/schemes.py