[RFC,v3,0/7] Implement Data Access Monitoring-based Memory Operation Schemes

Message ID 20200225102300.23895-1-sjpark@amazon.com (mailing list archive)

SeongJae Park Feb. 25, 2020, 10:22 a.m. UTC
From: SeongJae Park <sjpark@amazon.de>

DAMON[1] can be used as a primitive for data access-aware memory management
optimizations.  That said, users who want such optimizations have to run DAMON,
read the monitoring results, analyze them, plan a new memory management scheme,
and apply the new scheme by themselves.  Such effort is unavoidable for some
complicated optimizations.

However, in many other cases, users simply want the system to apply a memory
management action to a memory region of a specific size that has shown a
specific access frequency for a specific time.  For example, "page out a memory
region larger than 100 MiB that has been only rarely accessed for more than 10
minutes", or "use THP for a memory region larger than 2 MiB that has been
continuously accessed for more than 1 second".

This RFC patchset makes DAMON handle such data access monitoring-based
operation schemes.  With this change, users can apply data access-aware
optimizations by simply specifying their schemes to DAMON.


Evaluations
===========

The Transparent Huge Pages (THP) subsystem can waste memory space in some cases
because it aggressively promotes regular pages to huge pages.  For this reason,
use of THP is prohibited by a number of memory-intensive programs such as
Redis[1] and MongoDB[2].

The following two simple data access monitoring-based operation schemes might
help with this problem:

    # format: <min/max size> <min/max frequency (0-100)> <min/max age> <action>

    # If a memory region larger than 2 MiB is showing access rate higher than
    # 5% for more than 1 second, apply MADV_HUGEPAGE to the region.
    2M	null	5	null	1s	null	hugepage

    # If a memory region larger than 2 MiB is showing access rate lower than 5%
    # for more than 1 second, apply MADV_NOHUGEPAGE to the region.
    2M	null	null	5	1s	null	nohugepage

We can expect that the schemes reduce the memory space overhead while
preserving some of THP's performance benefit.

Please note that I made these schemes based only on my intuition.  Therefore,
they might not be optimal.
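
To make the scheme format above concrete, below is a minimal Python sketch that
parses one scheme line and checks whether a region matches it.  This is not the
code in this patchset: the names (Scheme, parse_scheme, matches), the accepted
unit suffixes, and the inclusive bounds are all assumptions made only for
illustration.

```python
from typing import NamedTuple, Optional

# Unit suffixes are assumptions for this sketch; the examples above only
# show 'M' for sizes and 's' for ages.
SIZE_UNITS = {"B": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
# Longer suffixes come first so that "ms" is not mistaken for "s".
AGE_UNITS_US = (("us", 1), ("ms", 1_000), ("s", 1_000_000), ("m", 60_000_000))


class Scheme(NamedTuple):
    min_size: Optional[int]    # bytes; None means no bound ('null')
    max_size: Optional[int]
    min_freq: Optional[int]    # access frequency in percent (0-100)
    max_freq: Optional[int]
    min_age_us: Optional[int]  # microseconds
    max_age_us: Optional[int]
    action: str


def parse_size(token: str) -> Optional[int]:
    if token == "null":
        return None
    unit = token[-1].upper()
    if unit in SIZE_UNITS:
        return int(token[:-1]) * SIZE_UNITS[unit]
    return int(token)  # plain number: assumed to be bytes


def parse_freq(token: str) -> Optional[int]:
    if token == "null":
        return None
    value = int(token)
    if not 0 <= value <= 100:
        raise ValueError("frequency must be within 0-100: %s" % token)
    return value


def parse_age_us(token: str) -> Optional[int]:
    if token == "null":
        return None
    for suffix, factor in AGE_UNITS_US:
        if token.endswith(suffix):
            return int(token[:-len(suffix)]) * factor
    return int(token)  # plain number: assumed to be microseconds


def parse_scheme(line: str) -> Scheme:
    fields = line.split()
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got %d" % len(fields))
    return Scheme(parse_size(fields[0]), parse_size(fields[1]),
                  parse_freq(fields[2]), parse_freq(fields[3]),
                  parse_age_us(fields[4]), parse_age_us(fields[5]),
                  fields[6])


def in_range(value: int, low: Optional[int], high: Optional[int]) -> bool:
    # Bounds are treated as inclusive here, which is an assumption.
    return (low is None or value >= low) and (high is None or value <= high)


def matches(scheme: Scheme, size: int, freq: int, age_us: int) -> bool:
    return (in_range(size, scheme.min_size, scheme.max_size) and
            in_range(freq, scheme.min_freq, scheme.max_freq) and
            in_range(age_us, scheme.min_age_us, scheme.max_age_us))
```

For example, parse_scheme("2M null 5 null 1s null hugepage") yields a scheme
that a 4 MiB region with 50% access frequency and an age of 2 seconds would
match under these assumptions.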


Setup
-----

On a QEMU/KVM based virtual machine on an Intel i7 host machine running Ubuntu
18.04, I measure runtime and memory usage of various realistic workloads with
several configurations.  I use 14 and 13 workloads in PARSEC3[3] and
SPLASH-2X[4] benchmark suites, respectively.  I use separate wrapper
scripts[5] for setting up and running the workloads.

To measure memory usage, I drop caches before starting each workload and
monitor 'MemFree' in the '/proc/meminfo' file.
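
The measurement step can be sketched in Python roughly as below.  This is only
an illustration, not the wrapper scripts themselves: the function names are
hypothetical, and writing '3' to '/proc/sys/vm/drop_caches' (which requires
root) is just one common way to drop caches.

```python
import os


def parse_memfree_kib(meminfo_text: str) -> int:
    """Return the 'MemFree' value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemFree:"):
            # The line looks like "MemFree:         8192000 kB".
            return int(line.split()[1])
    raise ValueError("MemFree not found")


def read_memfree_kib(path: str = "/proc/meminfo") -> int:
    with open(path) as f:
        return parse_memfree_kib(f.read())


def drop_caches() -> None:
    """Flush dirty pages, then drop page cache, dentries and inodes."""
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")
```

A measurement loop would call drop_caches() once before starting a workload and
then sample read_memfree_kib() periodically while the workload runs.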

The configurations are as below:

orig: Linux v5.5 with 'madvise' THP policy
thp: Linux v5.5 with 'always' THP policy
ethp: Linux v5.5 applying the above schemes


[1] "Redis latency problems troubleshooting", https://redis.io/topics/latency
[2] "Disable Transparent Huge Pages (THP)",
    https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
[3] "The PARSEC Benchmark Suite", https://parsec.cs.princeton.edu/index.htm
[4] "SPLASH-2x", https://parsec.cs.princeton.edu/parsec3-doc.htm#splash2x
[5] "parsec3_on_ubuntu", https://github.com/sjp38/parsec3_on_ubuntu


Results
-------

The following sections show the results of the measurements.  For brevity, I
show only the memory space and runtime overheads of thp and ethp, in percent.
For instance, a memory space overhead of 76.94 for 'thp' on
'splash2x/ocean_ncp' means that the memory usage of 'splash2x/ocean_ncp'
measured under the 'thp' configuration was 176.94% of that measured under the
'orig' configuration.
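
In other words, the overhead is the relative difference from 'orig'.  A tiny
Python sketch of the definition follows; the function name is hypothetical and
the input values in the usage note are illustrative, not measured ones.

```python
def overhead_percent(measured: float, baseline: float) -> float:
    """Overhead of 'measured' relative to 'baseline', in percent.

    A positive value means 'measured' is larger (worse) than the baseline,
    and a negative value means it is smaller (better).
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (measured / baseline - 1.0) * 100.0
```

For instance, overhead_percent(176.94, 100.0) is about 76.94, and a run that
finishes in 90% of the baseline runtime has an overhead of about -10.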

Note that the numbers come from only one measurement.  Thus, the numbers below
might contain some measurement errors.  I will repeat the evaluations several
times and update the numbers with averages and stdevs as soon as they are
ready.


Memory Space Overheads
~~~~~~~~~~~~~~~~~~~~~~

The table below shows the measured memory space overheads (%) of 'thp' and
'ethp' compared to 'orig'.

workload		thp	ethp
parsec3	blackscholes	0.41	0.23
	bodytrack	-0.11	0.12
	canneal		-0.73	0.01
	dedup		5.41	5.80
	facesim		0.76	0.33
	ferret		-1.42	-0.24
	fluidanimate	-0.17	0.82
	freqmine	0.03	0.17
	raytrace	0.19	-0.55
	streamcluster	1.16	1.65
	swaptions	6.14	21.84
	vips		1.63	1.65
	x264		3.21	1.86
	PARSEC3/AVG	1.63	1.48
splash2	barnes		0.18	1.82
	fft		0.06	1.43
	lu_cb		0.94	0.09
	lu_ncb		1.09	0.43
	ocean_cp	1.11	0.23
	ocean_ncp	76.94	-0.08
	radiosity	0.62	-0.05
	radix		-18.37	0.28
	raytrace	14.52	3.64
	volrend		-0.69	1.48
	water_nsquared	-0.68	4.26
	water_spatial	0.11	0.68
	SPLASH2X/AVG	12.72	0.72

Averaged memory space overhead
PARSEC3: 1.62% (thp), 1.47% (ethp): ethp shows 1.10x lower overhead.
SPLASH2X: 12.717% (thp), 0.71% (ethp): ethp shows 18x lower overhead.

Best case: splash2x/ocean_ncp
Overheads: 76.94% (thp), -0.07% (ethp): ethp removes the overhead entirely; its
memory usage is even slightly below that of 'orig'.
This is apparently a memory-intensive workload, as it uses about 3.9 GiB of
memory under 'orig'.

Worst case: parsec3/swaptions
Overheads: 6.14% (thp), 21.84% (ethp): ethp shows 3.55x higher overhead.
This is not a memory-intensive workload, though, as it uses only 19 MiB of
memory under 'orig'.
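
The "Nx lower/higher" figures above are simple ratios of the two overheads.  A
short Python sketch of that arithmetic follows.  The function name is
hypothetical, and returning None when either overhead is zero or negative is my
own choice for this sketch, because a ratio is not meaningful in that case.

```python
from typing import Optional


def overhead_ratio(larger: float, smaller: float) -> Optional[float]:
    """How many times 'larger' exceeds 'smaller' (both overheads in percent).

    Returns None when either overhead is not positive, since "N times
    lower" has no sensible meaning for a zero or negative overhead.
    """
    if larger <= 0 or smaller <= 0:
        return None
    return larger / smaller


# Averaged overheads quoted above, as (thp, ethp) pairs.
parsec3 = overhead_ratio(1.62, 1.47)
splash2x = overhead_ratio(12.717, 0.71)
print("PARSEC3: %.2fx, SPLASH2X: %.0fx" % (parsec3, splash2x))
```

With the quoted averages this gives about 1.10x for PARSEC3 and about 18x for
SPLASH2X.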


Runtime Overheads
~~~~~~~~~~~~~~~~~

The table below shows the measured runtime overheads (%) of 'thp' and 'ethp'
compared to 'orig'.

workload		thp	ethp
parsec3	blackscholes	-0.29	0.60
	bodytrack	-0.25	0.61
	canneal		-18.93	-15.60
	dedup		-1.79	-0.84
	facesim		-1.69	3.36
	ferret		-0.28	1.18
	fluidanimate	-1.08	2.89
	freqmine	0.06	2.13
	raytrace	-1.35	0.79
	streamcluster	-13.69	-0.62
	swaptions	-1.83	-1.72
	vips		-2.05	-0.65
	x264		14.96	-3.11
	PARSEC3/AVG	-3.82	-0.33
splash2	barnes		-2.31	-0.95
	fft		-1.21	0.04
	lu_cb		-1.12	-0.14
	lu_ncb		-1.19	-0.24
	ocean_cp	-1.19	-0.48
	ocean_ncp	-1.29	-0.68
	radiosity	-1.29	-0.80
	radix		-0.33	-0.81
	raytrace	-0.22	-0.74
	volrend		-0.08	-0.75
	water_nsquared	-1.23	-0.57
	water_spatial	-1.16	-0.54
	SPLASH2X/AVG	-1.07	-0.51

Averaged runtime overhead
PARSEC3: -3.81% (thp), -0.38% (ethp): ethp preserves about 10% of THP speedup.
SPLASH2X: -1.07% (thp), -0.51% (ethp): ethp preserves about 50% of THP speedup.

Best case: parsec3/canneal
Overheads: -18.93% (thp), -15.60% (ethp): ethp preserves about 82% of THP
speedup.
This is apparently a memory-intensive workload, as it uses about 1 GiB of
memory on average.

Worst case: parsec3/streamcluster
Overheads: -13.69% (thp), -0.62% (ethp): ethp preserves only about 4% of THP
speedup.
This seems to be a memory-intensive workload, though it uses only about 128 MiB
of memory.
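
Here, "preserved speedup" is the ratio of the ethp runtime overhead to the thp
runtime overhead, where a negative overhead means a speedup.  A minimal Python
sketch of this calculation follows; the function name is hypothetical.

```python
def preserved_speedup_percent(thp_overhead: float,
                              ethp_overhead: float) -> float:
    """Share of the thp speedup that ethp keeps, in percent.

    Both arguments are runtime overheads in percent relative to 'orig'.
    The thp overhead has to be negative (an actual speedup) for the
    result to be meaningful.
    """
    if thp_overhead >= 0:
        raise ValueError("thp shows no speedup to preserve")
    return ethp_overhead / thp_overhead * 100.0


# (thp, ethp) runtime overheads quoted above.
for name, thp, ethp in (("parsec3/canneal", -18.93, -15.60),
                        ("parsec3/streamcluster", -13.69, -0.62)):
    print("%s: %.1f%%" % (name, preserved_speedup_percent(thp, ethp)))
```

For parsec3/canneal this gives roughly 82%, and for parsec3/streamcluster it
gives between 4% and 5%.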


In short, the straightforward data access monitoring-based operation scheme,
ethp, reduces memory space waste (1.10x lower for PARSEC3 and 18x lower for
SPLASH-2X) while preserving some amount of the THP's performance benefit (10%
for PARSEC3 and 50% for SPLASH-2X), as expected.


Sequence Of Patches
===================

The patches are based on v5.5 plus the v5 DAMON patchset[1] and Minchan's
``madvise()`` factor-out patch[2].  Minchan's patch is necessary for reusing
the ``madvise()`` code in DAMON.  You can also clone the complete git tree:

    $ git clone git://github.com/sjp38/linux -b damos/rfc/v3

The tree is also available on the web:
https://github.com/sjp38/linux/releases/tag/damos/rfc/v3


[1] https://lore.kernel.org/linux-mm/20200217103110.30817-1-sjpark@amazon.com/
[2] https://lore.kernel.org/linux-mm/20200128001641.5086-2-minchan@kernel.org/

The first patch allows DAMON to reuse ``madvise()`` code for the actions.  The
second patch accounts the age of each region.  The third patch implements the
handling of the schemes in DAMON and exports a kernel space programming
interface for it.  The fourth patch implements a debugfs interface for
privileged users and programs.  The fifth and sixth patches add kunit tests and
selftests for these changes, respectively.  Finally, the seventh patch modifies
the user space tool for DAMON to support describing and applying schemes in a
human-friendly way.


Patch History
=============

Changes from RFC v2
(https://lore.kernel.org/linux-mm/20200218085309.18346-1-sjpark@amazon.com/)
 - Fix aging mechanism for better 'old region' selection
 - Add more kunit tests and kselftests for this patchset
 - Support more human-friendly description and application of 'schemes'

Changes from RFC v1
(https://lore.kernel.org/linux-mm/20200210150921.32482-1-sjpark@amazon.com/)
 - Properly adjust age accounting related properties after splitting, merging,
   and action applying

SeongJae Park (7):
  mm/madvise: Export madvise_common() to mm internal code
  mm/damon: Account age of target regions
  mm/damon: Implement data access monitoring-based operation schemes
  mm/damon/schemes: Implement a debugfs interface
  mm/damon-test: Add kunit test case for regions age accounting
  mm/damon/selftests: Add 'schemes' debugfs tests
  damon/tools: Support more human friendly 'schemes' control

 include/linux/damon.h                         |  29 ++
 mm/damon-test.h                               |   5 +
 mm/damon.c                                    | 414 +++++++++++++++++-
 mm/internal.h                                 |   4 +
 mm/madvise.c                                  |   3 +-
 tools/damon/_convert_damos.py                 | 125 ++++++
 tools/damon/_damon.py                         | 143 ++++++
 tools/damon/damo                              |   7 +
 tools/damon/record.py                         | 133 +-----
 tools/damon/schemes.py                        | 105 +++++
 .../testing/selftests/damon/debugfs_attrs.sh  |  29 ++
 11 files changed, 867 insertions(+), 130 deletions(-)
 create mode 100755 tools/damon/_convert_damos.py
 create mode 100644 tools/damon/_damon.py
 create mode 100644 tools/damon/schemes.py