[RFC,v4,0/7] Implement Data Access Monitoring-based Memory Operation Schemes

Message ID 20200303121406.20954-1-sjpark@amazon.com

SeongJae Park March 3, 2020, 12:13 p.m. UTC
From: SeongJae Park <sjpark@amazon.de>

DAMON[1] can be used as a primitive for data access-aware memory management
optimizations.  That said, users who want such optimizations should run DAMON,
read the monitoring results, analyze them, plan a new memory management
scheme, and apply the new scheme by themselves.  Such efforts will be
inevitable for some complicated optimizations.

However, in many other cases, the users would simply want the system to apply
a memory management action to a memory region of a specific size that has kept
a specific access frequency for a specific time.  For example, "page out a
memory region larger than 100 MiB that has kept only rare accesses for more
than 2 minutes", or "Do not use THP for a memory region larger than 2 MiB that
has been rarely accessed for more than 1 second".

This RFC patchset makes DAMON handle such data access monitoring-based
operation schemes.  With this change, users can do data access-aware
optimizations by simply specifying their schemes to DAMON.


Evaluations
===========

Efficient THP
-------------

The Transparent Huge Pages (THP) subsystem could waste memory space in some
cases because it aggressively promotes regular pages to huge pages.  For this
reason, use of THP is prohibited by a number of memory-intensive programs such
as Redis[1] and MongoDB[2].

The two simple data access monitoring-based operation schemes below might be
helpful for the problem:

    # format: <min/max size> <min/max frequency (0-100)> <min/max age> <action>

    # If a memory region larger than 2 MiB is showing access rate higher than
    # 5%, apply MADV_HUGEPAGE to the region.
    2M	null	5	null	null	null	hugepage

    # If a memory region larger than 2 MiB is showing access rate lower than 5%
    # for more than 1 second, apply MADV_NOHUGEPAGE to the region.
    2M	null	null	5	1s	null	nohugepage
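For reference, such a scheme could be installed via the debugfs file that the
fourth patch of this series introduces, roughly as below.  This is only an
illustrative sketch: the debugfs path, the invocation of the converter, and
the input file name ('ethp_schemes.damos') here are assumptions.  The
human-friendly form above would first be converted into the raw debugfs format
(the last patch adds 'tools/damon/_convert_damos.py' for that purpose):

    # a minimal sketch, assuming debugfs is mounted on /sys/kernel/debug and
    # '_convert_damos.py' reads a human-friendly schemes file; the real
    # invocation and file format are defined by the patches themselves
    $ ./tools/damon/_convert_damos.py ethp_schemes.damos | \
        sudo tee /sys/kernel/debug/damon/schemes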

We can expect the schemes to reduce the memory space overhead of THP while
preserving some of its performance benefit.  I call these schemes Efficient
THP (ETHP).

Please note that these schemes are neither highly tuned nor for general use
cases.  They are made with only my straightforward intuition, for a
demonstration of DAMOS.


Setup
-----

On my personal QEMU/KVM based virtual machine on an Intel i7 host machine
running Ubuntu 18.04, I measure the runtime and consumed memory space of
various realistic workloads with several configurations.  I use 13 and 12
workloads in the PARSEC3[3] and SPLASH-2X[4] benchmark suites, respectively.
I use a wrapper script[5] for setup and run of the workloads.

For the measurement of the amount of memory consumed in the system-global
scope, I drop caches before starting each of the workloads and monitor
'MemFree' in the '/proc/meminfo' file.
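
In other words, the measurement for each workload goes roughly as below (a
minimal sketch; my actual scripts[5] differ in details):

    # drop the page, dentry, and inode caches before starting the workload
    $ echo 3 | sudo tee /proc/sys/vm/drop_caches

    # periodically sample system-wide free memory while the workload runs;
    # the consumed amount is the drop of 'MemFree' from the idle baseline
    $ grep MemFree /proc/meminfo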

The configurations I use are as below:

    orig: Linux v5.5 with 'madvise' THP policy
    thp: Linux v5.5 with 'always' THP policy
    ethp: Linux v5.5 applying the above schemes
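
Note that the system THP policy of each configuration can be selected via
sysfs, as below:

    # use the 'madvise' policy for 'orig', or 'always' for 'thp'
    $ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled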

To minimize measurement errors, I repeat each run 5 times and average the
results.  You can find the stdev, min, and max of the numbers among the
repeated runs in the appendix below.
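
For example, the averaging could be done with a one-liner like below
('result.1' to 'result.5' are hypothetical files, each holding one measured
number from one of the repeated runs):

    # average a measured number over the 5 repeated runs
    $ awk '{ sum += $1; n += 1 } END { printf "avg: %.3f\n", sum / n }' \
        result.[1-5]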


[1] "Redis latency problems troubleshooting", https://redis.io/topics/latency
[2] "Disable Transparent Huge Pages (THP)",
    https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
[3] "The PARSEC Becnhmark Suite", https://parsec.cs.princeton.edu/index.htm
[4] "SPLASH-2x", https://parsec.cs.princeton.edu/parsec3-doc.htm#splash2x
[5] "parsec3_on_ubuntu", https://github.com/sjp38/parsec3_on_ubuntu


Results
-------

TL;DR: 'ethp' removes 97.61% of 'thp' memory space overhead while preserving
25.40% (up to 88.36%) of 'thp' performance improvement in total.

Following sections show the results of the measurements with raw numbers and
'orig'-relative overheads (percent) of each configuration.


Memory Space Overheads
~~~~~~~~~~~~~~~~~~~~~~

Below shows the measured memory space consumption and overheads.  The raw
numbers are in KiB, and the overheads in parentheses are in percent.  For
example, 'parsec3/blackscholes' consumes about 1819486 KiB and 1824921 KiB
under the 'orig' and 'thp' configurations, respectively.  The overhead of
'thp' compared to 'orig' for the workload is hence 0.30%.
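
The overheads are computed as below:

    overhead (%) = (<config> - <orig>) / <orig> * 100
    e.g., (1824921.400 - 1819486.000) / 1819486.000 * 100 =~ 0.30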

              workloads  orig         thp (overhead)        ethp (overhead)
   parsec3/blackscholes  1819486.000  1824921.400 (  0.30)  1829070.600 (  0.53)
      parsec3/bodytrack  1417885.800  1417077.600 ( -0.06)  1427560.800 (  0.68)
        parsec3/canneal  1043876.800  1039773.000 ( -0.39)  1048445.200 (  0.44)
          parsec3/dedup  2400000.400  2434625.600 (  1.44)  2417374.400 (  0.72)
        parsec3/facesim  540206.400   542422.400 (  0.41)   551485.400 (  2.09)
         parsec3/ferret  320480.200   320157.000 ( -0.10)   331470.400 (  3.43)
   parsec3/fluidanimate  573961.400   572329.600 ( -0.28)   581836.000 (  1.37)
       parsec3/freqmine  983981.200   994839.600 (  1.10)   996124.600 (  1.23)
       parsec3/raytrace  1745175.200  1742756.400 ( -0.14)  1751706.000 (  0.37)
  parsec3/streamcluster  120558.800   120309.800 ( -0.21)   131997.800 (  9.49)
      parsec3/swaptions  14820.400    23388.800 ( 57.81)    24698.000 ( 66.65)
           parsec3/vips  2956319.200  2955803.600 ( -0.02)  2977506.200 (  0.72)
           parsec3/x264  3187699.000  3184944.000 ( -0.09)  3198462.800 (  0.34)
        splash2x/barnes  1212774.800  1221892.400 (  0.75)  1212100.800 ( -0.06)
           splash2x/fft  9364725.000  9267074.000 ( -1.04)  8997901.200 ( -3.92)
         splash2x/lu_cb  515242.400   519881.400 (  0.90)   526621.600 (  2.21)
        splash2x/lu_ncb  517308.000   520396.400 (  0.60)   521732.400 (  0.86)
      splash2x/ocean_cp  3348189.400  3380799.400 (  0.97)  3328473.400 ( -0.59)
     splash2x/ocean_ncp  3908599.800  7072076.800 ( 80.94)  4449410.400 ( 13.84)
     splash2x/radiosity  1469087.800  1482244.400 (  0.90)  1471781.000 (  0.18)
         splash2x/radix  1712487.400  1385972.800 (-19.07)  1420461.800 (-17.05)
      splash2x/raytrace  45030.600    50946.600 ( 13.14)    58586.200 ( 30.10)
       splash2x/volrend  151037.800   151188.000 (  0.10)   163213.600 (  8.06)
splash2x/water_nsquared  47442.400    47257.000 ( -0.39)    59285.800 ( 24.96)
 splash2x/water_spatial  667355.200   666824.400 ( -0.08)   673274.400 (  0.89)
                  total  40083800.000 42939900.000 (  7.13) 40150600.000 (  0.17)

In total, 'thp' shows 7.13% memory space overhead while 'ethp' shows only
0.17% overhead.  In other words, 'ethp' removes 97.61% of the 'thp' memory
space overhead ((7.13 - 0.17) / 7.13 =~ 97.6%).

For almost every workload, 'ethp' consistently shows about 10-15 MiB of memory
space overhead, mainly due to the Python wrapper I used for convenient test
runs.  Using DAMON's raw interface would remove this overhead.

In the cases of 'parsec3/swaptions' and 'splash2x/raytrace', 'ethp' shows even
higher memory space overhead.  This is mainly due to the small sizes of the
workloads and the constant memory overhead of 'ethp', which comes from the
Python wrapper.  The workloads consume only about 14 MiB and 45 MiB,
respectively.  Because the constant memory consumption of the Python wrapper
(about 10-15 MiB) is relatively large compared to such small working sets, the
relative overhead becomes high.  Nonetheless, such small workloads are not
appropriate targets of 'ethp', and the overhead can be removed by avoiding use
of the wrapper.


Runtime Overheads
~~~~~~~~~~~~~~~~~

Below shows the measured runtimes in a similar way.  The raw numbers are in
seconds and the overheads are in percent.  Negative overheads mean speedups.

                runtime  orig      thp (overhead)     ethp (overhead)
   parsec3/blackscholes  107.003   106.468 ( -0.50)   107.260 (  0.24)
      parsec3/bodytrack  78.854    78.757 ( -0.12)    79.261 (  0.52)
        parsec3/canneal  137.520   120.854 (-12.12)   132.427 ( -3.70)
          parsec3/dedup  11.873    11.665 ( -1.76)    11.883 (  0.09)
        parsec3/facesim  207.895   204.215 ( -1.77)   206.170 ( -0.83)
         parsec3/ferret  190.507   189.972 ( -0.28)   190.818 (  0.16)
   parsec3/fluidanimate  211.064   208.862 ( -1.04)   211.874 (  0.38)
       parsec3/freqmine  290.157   288.831 ( -0.46)   292.495 (  0.81)
       parsec3/raytrace  118.460   118.741 (  0.24)   119.808 (  1.14)
  parsec3/streamcluster  324.524   283.709 (-12.58)   307.209 ( -5.34)
      parsec3/swaptions  154.458   154.894 (  0.28)   155.307 (  0.55)
           parsec3/vips  58.588    58.622 (  0.06)    59.037 (  0.77)
           parsec3/x264  66.493    66.604 (  0.17)    67.051 (  0.84)
        splash2x/barnes  79.769    73.886 ( -7.38)    78.737 ( -1.29)
           splash2x/fft  32.857    22.960 (-30.12)    25.808 (-21.45)
         splash2x/lu_cb  85.113    84.939 ( -0.20)    85.344 (  0.27)
        splash2x/lu_ncb  92.408    90.103 ( -2.49)    93.585 (  1.27)
      splash2x/ocean_cp  44.374    42.876 ( -3.37)    43.613 ( -1.71)
     splash2x/ocean_ncp  80.710    51.831 (-35.78)    71.498 (-11.41)
     splash2x/radiosity  90.626    90.398 ( -0.25)    91.238 (  0.68)
         splash2x/radix  30.875    25.226 (-18.30)    25.882 (-16.17)
      splash2x/raytrace  84.114    82.602 ( -1.80)    85.124 (  1.20)
       splash2x/volrend  86.796    86.347 ( -0.52)    88.223 (  1.64)
splash2x/water_nsquared  230.781   220.667 ( -4.38)   232.664 (  0.82)
 splash2x/water_spatial  88.719    90.187 (  1.65)    89.228 (  0.57)
                  total  2984.530  2854.220 ( -4.37)  2951.540 ( -1.11)

In total, 'thp' shows 4.37% speedup while 'ethp' shows 1.11% speedup.  In
other words, 'ethp' preserves about 25.40% of the THP performance benefit
(1.11 / 4.37 =~ 25.4%).

In the best case (splash2x/raytrace), 'ethp' preserves 88.36% of the benefit.

If we narrow down to the workloads showing high THP performance benefits
(splash2x/fft, splash2x/ocean_ncp, and splash2x/radix), 'thp' and 'ethp' show
30.75% and 14.71% speedup in total, respectively.  In other words, 'ethp'
preserves about 47.83% of the benefit for those.

Even in the worst case (splash2x/volrend), 'ethp' incurs only 1.64% runtime
overhead, which is similar to that of 'thp' (1.65% for
'splash2x/water_spatial').


Sequence Of Patches
===================

The patches are based on v5.5 plus the v5 DAMON patchset[1] and Minchan's
``madvise()`` factor-out patch[2].  Minchan's patch was necessary for reuse of
the ``madvise()`` code in DAMON.  You can also clone the complete git tree:

    $ git clone git://github.com/sjp38/linux -b damos/rfc/v4

A web view is also available at:
https://github.com/sjp38/linux/releases/tag/damos/rfc/v4


[1] https://lore.kernel.org/linux-mm/20200217103110.30817-1-sjpark@amazon.com/
[2] https://lore.kernel.org/linux-mm/20200128001641.5086-2-minchan@kernel.org/

The first patch allows DAMON to reuse the ``madvise()`` code for the actions.
The second patch accounts the age of each region.  The third patch implements
the handling of the schemes in DAMON and exports a kernel space programming
interface for it.  The fourth patch implements a debugfs interface for
privileged people and programs.  The fifth and sixth patches add kunit tests
and selftests for these changes, respectively, and finally the seventh patch
modifies the user space tool for DAMON to support description and application
of the schemes in a human-friendly way.


Patch History
=============

Changes from RFC v3
(https://lore.kernel.org/linux-mm/20200225102300.23895-1-sjpark@amazon.com/)
 - Add Reviewed-by from Brendan Higgins
 - Code cleanup: Modularize madvise() call
 - Fix a trivial bug in the wrapper python script
 - Add more stable and detailed evaluation results with updated ETHP scheme

Changes from RFC v2
(https://lore.kernel.org/linux-mm/20200218085309.18346-1-sjpark@amazon.com/)
 - Fix the aging mechanism for better 'old region' selection
 - Add more kunittests and kselftests for this patchset
 - Support more human friendly description and application of 'schemes'

Changes from RFC v1
(https://lore.kernel.org/linux-mm/20200210150921.32482-1-sjpark@amazon.com/)
 - Properly adjust age accounting related properties after splitting, merging,
   and action applying

SeongJae Park (7):
  mm/madvise: Export madvise_common() to mm internal code
  mm/damon: Account age of target regions
  mm/damon: Implement data access monitoring-based operation schemes
  mm/damon/schemes: Implement a debugfs interface
  mm/damon-test: Add kunit test case for regions age accounting
  mm/damon/selftests: Add 'schemes' debugfs tests
  damon/tools: Support more human friendly 'schemes' control

 include/linux/damon.h                         |  29 ++
 mm/damon-test.h                               |   5 +
 mm/damon.c                                    | 391 +++++++++++++++++-
 mm/internal.h                                 |   4 +
 mm/madvise.c                                  |   3 +-
 tools/damon/_convert_damos.py                 | 125 ++++++
 tools/damon/_damon.py                         | 143 +++++++
 tools/damon/damo                              |   7 +
 tools/damon/record.py                         | 135 +-----
 tools/damon/schemes.py                        | 105 +++++
 .../testing/selftests/damon/debugfs_attrs.sh  |  29 ++
 11 files changed, 845 insertions(+), 131 deletions(-)
 create mode 100755 tools/damon/_convert_damos.py
 create mode 100644 tools/damon/_damon.py
 create mode 100644 tools/damon/schemes.py