Message ID: 20190520035254.57579-1-minchan@kernel.org (mailing list archive)
Series: introduce memory hinting API for external process
On 05/20/2019 09:22 AM, Minchan Kim wrote: > - Problem > > Naturally, cached apps were dominant consumers of memory on the system. > However, they were not significant consumers of swap even though they are > good candidate for swap. Under investigation, swapping out only begins > once the low zone watermark is hit and kswapd wakes up, but the overall > allocation rate in the system might trip lmkd thresholds and cause a cached > process to be killed(we measured performance swapping out vs. zapping the > memory by killing a process. Unsurprisingly, zapping is 10x times faster > even though we use zram which is much faster than real storage) so kill > from lmkd will often satisfy the high zone watermark, resulting in very > few pages actually being moved to swap.

Getting killed by lmkd (which is triggered by custom system memory allocation parameters) and hence never getting the chance to swap out is a problem? But isn't that a problem created by lmkd itself? Or is the objective here to reduce the number of processes killed by lmkd, by triggering swapping of the unused (user-hinted) memory sooner so that they don't get picked by lmkd? Is under-utilization of the zram hardware a concern here as well? Won't swapping memory out to zram increase the latency of a hot start? Or is it acceptable because it prevents a fresh cold start, which will anyway be slower than a slow hot start? Just being curious.
[Cc linux-api] On Mon 20-05-19 12:52:47, Minchan Kim wrote: > - Background > > The Android terminology used for forking a new process and starting an app > from scratch is a cold start, while resuming an existing app is a hot start. > While we continually try to improve the performance of cold starts, hot > starts will always be significantly less power hungry as well as faster so > we are trying to make hot start more likely than cold start. > > To increase hot start, Android userspace manages the order that apps should > be killed in a process called ActivityManagerService. ActivityManagerService > tracks every Android app or service that the user could be interacting with > at any time and translates that into a ranked list for lmkd(low memory > killer daemon). They are likely to be killed by lmkd if the system has to > reclaim memory. In that sense they are similar to entries in any other cache. > Those apps are kept alive for opportunistic performance improvements but > those performance improvements will vary based on the memory requirements of > individual workloads. > > - Problem > > Naturally, cached apps were dominant consumers of memory on the system. > However, they were not significant consumers of swap even though they are > good candidate for swap. Under investigation, swapping out only begins > once the low zone watermark is hit and kswapd wakes up, but the overall > allocation rate in the system might trip lmkd thresholds and cause a cached > process to be killed(we measured performance swapping out vs. zapping the > memory by killing a process. Unsurprisingly, zapping is 10x times faster > even though we use zram which is much faster than real storage) so kill > from lmkd will often satisfy the high zone watermark, resulting in very > few pages actually being moved to swap. > > - Approach > > The approach we chose was to use a new interface to allow userspace to > proactively reclaim entire processes by leveraging platform information. > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages > that are known to be cold from userspace and to avoid races with lmkd > by reclaiming apps as soon as they entered the cached state. Additionally, > it could provide many chances for platform to use much information to > optimize memory efficiency. > > IMHO we should spell it out that this patchset complements MADV_WONTNEED > and MADV_FREE by adding non-destructive ways to gain some free memory > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the > kernel that memory region is not currently needed and should be reclaimed > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the > kernel that memory region is not currently needed and should be reclaimed > when memory pressure rises. > > To achieve the goal, the patchset introduce two new options for madvise. > One is MADV_COOL which will deactive activated pages and the other is > MADV_COLD which will reclaim private pages instantly. These new options > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to > gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way > that it hints the kernel that memory region is not currently needed and > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way > that it hints the kernel that memory region is not currently needed and > should be reclaimed when memory pressure rises. 
> > This approach is similar in spirit to madvise(MADV_WONTNEED), but the > information required to make the reclaim decision is not known to the app. > Instead, it is known to a centralized userspace daemon, and that daemon > must be able to initiate reclaim on its own without any app involvement. > To solve the concern, this patch introduces new syscall - > > struct pr_madvise_param { > int size; > const struct iovec *vec; > } > > int process_madvise(int pidfd, ssize_t nr_elem, int *behavior, > struct pr_madvise_param *restuls, > struct pr_madvise_param *ranges, > unsigned long flags); > > The syscall get pidfd to give hints to external process and provides > pair of result/ranges vector arguments so that it could give several > hints to each address range all at once. > > I guess others have different ideas about the naming of syscall and options > so feel free to suggest better naming. > > - Experiment > > We did bunch of testing with several hundreds of real users, not artificial > benchmark on android. We saw about 17% cold start decreasement without any > significant battery/app startup latency issues. And with artificial benchmark > which launches and switching apps, we saw average 7% app launching improvement, > 18% less lmkd kill and good stat from vmstat. > > A is vanilla and B is process_madvise. > > > A B delta ratio(%) > allocstall_dma 0 0 0 0.00 > allocstall_movable 1464 457 -1007 -69.00 > allocstall_normal 263210 190763 -72447 -28.00 > allocstall_total 264674 191220 -73454 -28.00 > compact_daemon_wake 26912 25294 -1618 -7.00 > compact_fail 17885 14151 -3734 -21.00 > compact_free_scanned 4204766409 3835994922 -368771487 -9.00 > compact_isolated 3446484 2967618 -478866 -14.00 > compact_migrate_scanned 1621336411 1324695710 -296640701 -19.00 > compact_stall 19387 15343 -4044 -21.00 > compact_success 1502 1192 -310 -21.00 > kswapd_high_wmark_hit_quickly 234 184 -50 -22.00 > kswapd_inodesteal 221635 233093 11458 5.00 > kswapd_low_wmark_hit_quickly 66065 54009 -12056 -19.00 > nr_dirtied 259934 296476 36542 14.00 > nr_vmscan_immediate_reclaim 2587 2356 -231 -9.00 > nr_vmscan_write 1274232 2661733 1387501 108.00 > nr_written 1514060 2937560 1423500 94.00 > pageoutrun 67561 55133 -12428 -19.00 > pgactivate 2335060 1984882 -350178 -15.00 > pgalloc_dma 13743011 14096463 353452 2.00 > pgalloc_movable 0 0 0 0.00 > pgalloc_normal 18742440 16802065 -1940375 -11.00 > pgalloc_total 32485451 30898528 -1586923 -5.00 > pgdeactivate 4262210 2930670 -1331540 -32.00 > pgfault 30812334 31085065 272731 0.00 > pgfree 33553970 31765164 -1788806 -6.00 > pginodesteal 33411 15084 -18327 -55.00 > pglazyfreed 0 0 0 0.00 > pgmajfault 551312 1508299 956987 173.00 > pgmigrate_fail 43927 29330 -14597 -34.00 > pgmigrate_success 1399851 1203922 -195929 -14.00 > pgpgin 24141776 19032156 -5109620 -22.00 > pgpgout 959344 1103316 143972 15.00 > pgpgoutclean 4639732 3765868 -873864 -19.00 > pgrefill 4884560 3006938 -1877622 -39.00 > pgrotated 37828 25897 -11931 -32.00 > pgscan_direct 1456037 957567 -498470 -35.00 > pgscan_direct_throttle 0 0 0 0.00 > pgscan_kswapd 6667767 5047360 -1620407 -25.00 > pgscan_total 8123804 6004927 -2118877 -27.00 > pgskip_dma 0 0 0 0.00 > pgskip_movable 0 0 0 0.00 > pgskip_normal 14907 25382 10475 70.00 > pgskip_total 14907 25382 10475 70.00 > pgsteal_direct 1118986 690215 -428771 -39.00 > pgsteal_kswapd 4750223 3657107 -1093116 -24.00 > pgsteal_total 5869209 4347322 -1521887 -26.00 > pswpin 417613 1392647 975034 233.00 > pswpout 1274224 2661731 1387507 108.00 > slabs_scanned 13686905 
10807200 -2879705 -22.00 > workingset_activate 668966 569444 -99522 -15.00 > workingset_nodereclaim 38957 32621 -6336 -17.00 > workingset_refault 2816795 2179782 -637013 -23.00 > workingset_restore 294320 168601 -125719 -43.00 > > pgmajfault is increased by 173% because swapin is increased by 200% by > process_madvise hint. However, swap read based on zram is much cheaper > than file IO in performance point of view and app hot start by swapin is > also cheaper than cold start from the beginning of app which needs many IO > from storage and initialization steps. > > This patchset is against on next-20190517. > > Minchan Kim (7): > mm: introduce MADV_COOL > mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM > mm: introduce MADV_COLD > mm: factor out madvise's core functionality > mm: introduce external memory hinting API > mm: extend process_madvise syscall to support vector arrary > mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER > > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > include/linux/page-flags.h | 1 + > include/linux/page_idle.h | 15 + > include/linux/proc_fs.h | 1 + > include/linux/swap.h | 2 + > include/linux/syscalls.h | 2 + > include/uapi/asm-generic/mman-common.h | 12 + > include/uapi/asm-generic/unistd.h | 2 + > kernel/signal.c | 2 +- > kernel/sys_ni.c | 1 + > mm/madvise.c | 600 +++++++++++++++++++++---- > mm/swap.c | 43 ++ > mm/vmscan.c | 80 +++- > 14 files changed, 680 insertions(+), 83 deletions(-) > > -- > 2.21.0.1020.gf2820cf01a-goog >
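As a reference for the interface quoted above, here is a minimal userspace sketch of how a daemon might drive the proposed process_madvise(). It is a hypothetical illustration only: the syscall number, the MADV_*_DRAFT values, the wrapper name and the per-range result convention are assumptions made for the example and are not part of any released kernel ABI; only the struct layout and argument order follow the draft in the cover letter.

```c
/*
 * Hypothetical sketch only: the syscall number, the MADV_*_DRAFT values and
 * the wrapper below are invented for illustration; the struct layout and
 * argument order follow the draft quoted in the cover letter above.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#define __NR_process_madvise_draft	436	/* placeholder, no number allocated */
#define MADV_COOL_DRAFT			20	/* placeholder value: deactivate pages */
#define MADV_COLD_DRAFT			21	/* placeholder value: reclaim pages now */

/* Mirrors the struct from the cover letter: a vector of address ranges. */
struct pr_madvise_param {
	int size;			/* number of entries in vec */
	const struct iovec *vec;	/* ranges (or per-range result buffers) */
};

static long process_madvise_draft(int pidfd, ssize_t nr_elem, int *behavior,
				  struct pr_madvise_param *results,
				  struct pr_madvise_param *ranges,
				  unsigned long flags)
{
	return syscall(__NR_process_madvise_draft, pidfd, nr_elem, behavior,
		       results, ranges, flags);
}

int main(void)
{
	/* pidfd of the cached app, obtained elsewhere (e.g. via CLONE_PIDFD). */
	int pidfd = -1;
	int behavior[2] = { MADV_COOL_DRAFT, MADV_COLD_DRAFT };
	long result_buf[2] = { 0, 0 };

	/* Two illustrative ranges in the *target* process's address space. */
	struct iovec range_vec[2] = {
		{ .iov_base = (void *)0x7f0000000000UL, .iov_len = 1UL << 20 },
		{ .iov_base = (void *)0x7f0000200000UL, .iov_len = 1UL << 20 },
	};
	struct iovec result_vec[2] = {
		{ .iov_base = &result_buf[0], .iov_len = sizeof(result_buf[0]) },
		{ .iov_base = &result_buf[1], .iov_len = sizeof(result_buf[1]) },
	};
	struct pr_madvise_param ranges  = { .size = 2, .vec = range_vec };
	struct pr_madvise_param results = { .size = 2, .vec = result_vec };

	/* One hint per range, applied in a single call. */
	if (process_madvise_draft(pidfd, 2, behavior, &results, &ranges, 0) < 0)
		perror("process_madvise");
	return 0;
}
```

The point of the vectored form is visible here: a monitoring daemon can hand the kernel several address ranges of another process, each with its own hint, in one syscall.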
Hi.

On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> [ ... full cover letter snipped; quoted in full above ... ]

Please Cc me for the next iteration since I was working on the very same thing recently [1].

Thank you.

[1] https://gitlab.com/post-factum/pf-kernel/commits/remote-madvise-v3
On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote: > - Approach > > The approach we chose was to use a new interface to allow userspace to > proactively reclaim entire processes by leveraging platform information. > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages > that are known to be cold from userspace and to avoid races with lmkd > by reclaiming apps as soon as they entered the cached state. Additionally, > it could provide many chances for platform to use much information to > optimize memory efficiency. > > IMHO we should spell it out that this patchset complements MADV_WONTNEED > and MADV_FREE by adding non-destructive ways to gain some free memory > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the > kernel that memory region is not currently needed and should be reclaimed > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the > kernel that memory region is not currently needed and should be reclaimed > when memory pressure rises. I agree with this approach and the semantics. But these names are very vague and extremely easy to confuse since they're so similar. MADV_COLD could be a good name, but for deactivating pages, not reclaiming them - marking memory "cold" on the LRU for later reclaim. For the immediate reclaim one, I think there is a better option too: In virtual memory speak, putting a page into secondary storage (or ensuring it's already there), and then freeing its in-memory copy, is called "paging out". And that's what this flag is supposed to do. So how about MADV_PAGEOUT? With that, we'd have: MADV_FREE: Mark data invalid, free memory when needed MADV_DONTNEED: Mark data invalid, free memory immediately MADV_COLD: Data is not used for a while, free memory when needed MADV_PAGEOUT: Data is not used for a while, free memory immediately What do you think?
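To make the naming proposal above concrete, here is a small sketch of how the four hints would look from an application's point of view, assuming the proposal were merged under these names. MADV_COLD and MADV_PAGEOUT do not exist in the uapi headers at this point in the thread, so the constants below are placeholders invented for the example.

```c
/*
 * Placeholder illustration of the proposed naming; the numeric values are
 * invented here purely for the example.
 */
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLD
#define MADV_COLD	20	/* proposed: data unused for a while, reclaim under pressure */
#endif
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT	21	/* proposed: data unused for a while, page out immediately */
#endif

static void hint_idle_region(void *addr, size_t len, int reclaim_now)
{
	/*
	 * Unlike MADV_DONTNEED/MADV_FREE, the data stays valid: it can be
	 * faulted back in later; the hint only affects reclaim behaviour.
	 */
	madvise(addr, len, reclaim_now ? MADV_PAGEOUT : MADV_COLD);
}
```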
On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual <anshuman.khandual@arm.com> wrote: > > Or Is the objective here is reduce the number of processes which get killed by > lmkd by triggering swapping for the unused memory (user hinted) sooner so that > they dont get picked by lmkd. Under utilization for zram hardware is a concern > here as well ? The objective is to avoid some instances of memory pressure by proactively swapping pages that userspace knows to be cold before those pages reach the end of the LRUs, which in turn can prevent some apps from being killed by lmk/lmkd. As soon as Android userspace knows that an application is not being used and is only resident to improve performance if the user returns to that app, we can kick off process_madvise on that process's pages (or some portion of those pages) in a power-efficient way to reduce memory pressure long before the system hits the free page watermark. This allows the system more time to put pages into zram versus waiting for the watermark to trigger kswapd, which decreases the likelihood that later memory allocations will cause enough pressure to trigger a kill of one of these apps. > Swapping out memory into zram wont increase the latency for a hot start ? Or > is it because as it will prevent a fresh cold start which anyway will be slower > than a slow hot start. Just being curious. First, not all swapped pages will be reloaded immediately once an app is resumed. We've found that an app's working set post-process_madvise is significantly smaller than what an app allocates when it first launches (see the delta between pswpin and pswpout in Minchan's results). Presumably because of this, faulting to fetch from zram does not seem to introduce a noticeable hot start penalty, not does it cause an increase in performance problems later in the app's lifecycle. I've measured with and without process_madvise, and the differences are within our noise bounds. Second, because we're not preemptively evicting file pages and only making them more likely to be evicted when there's already memory pressure, we avoid the case where we process_madvise an app then immediately return to the app and reload all file pages in the working set even though there was no intervening memory pressure. Our initial version of this work evicted file pages preemptively and did cause a noticeable slowdown (~15%) for that case; this patch set avoids that slowdown. Finally, the benefit from avoiding cold starts is huge. The performance improvement from having a hot start instead of a cold start ranges from 3x for very small apps to 50x+ for larger apps like high-fidelity games.
On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote: > IMHO we should spell it out that this patchset complements MADV_WONTNEED > and MADV_FREE by adding non-destructive ways to gain some free memory > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the > kernel that memory region is not currently needed and should be reclaimed > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the > kernel that memory region is not currently needed and should be reclaimed > when memory pressure rises. Do we tear down page tables for these ranges? That seems like a good way of reclaiming potentially a substantial amount of memory.
On 05/20/2019 10:29 PM, Tim Murray wrote: > On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual > <anshuman.khandual@arm.com> wrote: >> >> Or Is the objective here is reduce the number of processes which get killed by >> lmkd by triggering swapping for the unused memory (user hinted) sooner so that >> they dont get picked by lmkd. Under utilization for zram hardware is a concern >> here as well ? > > The objective is to avoid some instances of memory pressure by > proactively swapping pages that userspace knows to be cold before > those pages reach the end of the LRUs, which in turn can prevent some > apps from being killed by lmk/lmkd. As soon as Android userspace knows > that an application is not being used and is only resident to improve > performance if the user returns to that app, we can kick off > process_madvise on that process's pages (or some portion of those > pages) in a power-efficient way to reduce memory pressure long before > the system hits the free page watermark. This allows the system more > time to put pages into zram versus waiting for the watermark to > trigger kswapd, which decreases the likelihood that later memory > allocations will cause enough pressure to trigger a kill of one of > these apps. So this opens up bit of LRU management to user space hints. Also because the app in itself wont know about the memory situation of the entire system, new system call needs to be called from an external process. > >> Swapping out memory into zram wont increase the latency for a hot start ? Or >> is it because as it will prevent a fresh cold start which anyway will be slower >> than a slow hot start. Just being curious. > > First, not all swapped pages will be reloaded immediately once an app > is resumed. We've found that an app's working set post-process_madvise > is significantly smaller than what an app allocates when it first > launches (see the delta between pswpin and pswpout in Minchan's > results). Presumably because of this, faulting to fetch from zram does pswpin 417613 1392647 975034 233.00 pswpout 1274224 2661731 1387507 108.00 IIUC the swap-in ratio is way higher in comparison to that of swap out. Is that always the case ? Or it tend to swap out from an active area of the working set which faulted back again. > not seem to introduce a noticeable hot start penalty, not does it > cause an increase in performance problems later in the app's > lifecycle. I've measured with and without process_madvise, and the > differences are within our noise bounds. Second, because we're not That is assuming that post process_madvise() working set for the application is always smaller. There is another challenge. The external process should ideally have the knowledge of active areas of the working set for an application in question for it to invoke process_madvise() correctly to prevent such scenarios. > preemptively evicting file pages and only making them more likely to > be evicted when there's already memory pressure, we avoid the case > where we process_madvise an app then immediately return to the app and > reload all file pages in the working set even though there was no > intervening memory pressure. Our initial version of this work evicted That would be the worst case scenario which should be avoided. Memory pressure must be a parameter before actually doing the swap out. But pages if know to be inactive/cold can be marked high priority to be swapped out. 
> file pages preemptively and did cause a noticeable slowdown (~15%) for > that case; this patch set avoids that slowdown. Finally, the benefit > from avoiding cold starts is huge. The performance improvement from > having a hot start instead of a cold start ranges from 3x for very > small apps to 50x+ for larger apps like high-fidelity games.

Is there any other real-world scenario, apart from this app-based ecosystem, where user-hinted LRU management might be helpful? Just being curious. Thanks for the detailed explanation. I will continue looking into this series.
On Mon, May 20, 2019 at 04:42:00PM +0200, Oleksandr Natalenko wrote:
> Hi.
>
> On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > [ ... full cover letter snipped; quoted in full above ... ]
>
> Please Cc me for the next iteration since I was working on the very same thing recently [1].
>
> Thank you.
>
> [1] https://gitlab.com/post-factum/pf-kernel/commits/remote-madvise-v3

Sure, I'm happy to see others have similar requirement.
On Mon, May 20, 2019 at 12:46:05PM -0400, Johannes Weiner wrote: > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote: > > - Approach > > > > The approach we chose was to use a new interface to allow userspace to > > proactively reclaim entire processes by leveraging platform information. > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages > > that are known to be cold from userspace and to avoid races with lmkd > > by reclaiming apps as soon as they entered the cached state. Additionally, > > it could provide many chances for platform to use much information to > > optimize memory efficiency. > > > > IMHO we should spell it out that this patchset complements MADV_WONTNEED > > and MADV_FREE by adding non-destructive ways to gain some free memory > > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the > > kernel that memory region is not currently needed and should be reclaimed > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the > > kernel that memory region is not currently needed and should be reclaimed > > when memory pressure rises. > > I agree with this approach and the semantics. But these names are very > vague and extremely easy to confuse since they're so similar. > > MADV_COLD could be a good name, but for deactivating pages, not > reclaiming them - marking memory "cold" on the LRU for later reclaim. > > For the immediate reclaim one, I think there is a better option too: > In virtual memory speak, putting a page into secondary storage (or > ensuring it's already there), and then freeing its in-memory copy, is > called "paging out". And that's what this flag is supposed to do. So > how about MADV_PAGEOUT? > > With that, we'd have: > > MADV_FREE: Mark data invalid, free memory when needed > MADV_DONTNEED: Mark data invalid, free memory immediately > > MADV_COLD: Data is not used for a while, free memory when needed > MADV_PAGEOUT: Data is not used for a while, free memory immediately > > What do you think? There are several suggestions until now. Thanks, Folks! For deactivating: - MADV_COOL - MADV_RECLAIM_LAZY - MADV_DEACTIVATE - MADV_COLD - MADV_FREE_PRESERVE For reclaiming: - MADV_COLD - MADV_RECLAIM_NOW - MADV_RECLAIMING - MADV_PAGEOUT - MADV_DONTNEED_PRESERVE It seems everybody doesn't like MADV_COLD so want to go with other. For consisteny of view with other existing hints of madvise, -preserve postfix suits well. However, originally, I don't like the naming FREE vs DONTNEED from the beginning. They were easily confused. I prefer PAGEOUT to RECLAIM since it's more likely to be nuance to represent reclaim with memory pressure and is supposed to paged-in if someone need it later. So, it imply PRESERVE. If there is not strong against it, I want to go with MADV_COLD and MADV_PAGEOUT. Other opinion?
On Mon, May 20, 2019 at 06:44:52PM -0700, Matthew Wilcox wrote: > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote: > > IMHO we should spell it out that this patchset complements MADV_WONTNEED > > and MADV_FREE by adding non-destructive ways to gain some free memory > > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the > > kernel that memory region is not currently needed and should be reclaimed > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the > > kernel that memory region is not currently needed and should be reclaimed > > when memory pressure rises. > > Do we tear down page tables for these ranges? That seems like a good

True for MADV_COLD (reclaiming) but false for MADV_COOL (deactivating) in this implementation.

> way of reclaiming potentially a substantial amount of memory.

Given that refaults are spread out over time while reclaim happens in bursts, that does make sense as a way to speed up reclaim. However, a concern to me is anonymous pages, since they need swap cache insertion, which would be wasted work if they end up not being reclaimed.
On Tue, May 21, 2019 at 08:25:55AM +0530, Anshuman Khandual wrote: > > > On 05/20/2019 10:29 PM, Tim Murray wrote: > > On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual > > <anshuman.khandual@arm.com> wrote: > >> > >> Or Is the objective here is reduce the number of processes which get killed by > >> lmkd by triggering swapping for the unused memory (user hinted) sooner so that > >> they dont get picked by lmkd. Under utilization for zram hardware is a concern > >> here as well ? > > > > The objective is to avoid some instances of memory pressure by > > proactively swapping pages that userspace knows to be cold before > > those pages reach the end of the LRUs, which in turn can prevent some > > apps from being killed by lmk/lmkd. As soon as Android userspace knows > > that an application is not being used and is only resident to improve > > performance if the user returns to that app, we can kick off > > process_madvise on that process's pages (or some portion of those > > pages) in a power-efficient way to reduce memory pressure long before > > the system hits the free page watermark. This allows the system more > > time to put pages into zram versus waiting for the watermark to > > trigger kswapd, which decreases the likelihood that later memory > > allocations will cause enough pressure to trigger a kill of one of > > these apps. > > So this opens up bit of LRU management to user space hints. Also because the app > in itself wont know about the memory situation of the entire system, new system > call needs to be called from an external process.

That's why process_madvise is introduced here.

> >> Swapping out memory into zram wont increase the latency for a hot start ? Or > >> is it because as it will prevent a fresh cold start which anyway will be slower > >> than a slow hot start. Just being curious. > > > > First, not all swapped pages will be reloaded immediately once an app > > is resumed. We've found that an app's working set post-process_madvise > > is significantly smaller than what an app allocates when it first > > launches (see the delta between pswpin and pswpout in Minchan's > > results). Presumably because of this, faulting to fetch from zram does > > pswpin 417613 1392647 975034 233.00 > pswpout 1274224 2661731 1387507 108.00 > > IIUC the swap-in ratio is way higher in comparison to that of swap out. Is that > always the case ? Or it tend to swap out from an active area of the working set > which faulted back again.

I think it's because apps stay alive longer (fewer of them get killed), so page-ins that would previously have shown up as pgpgin now show up as swapin instead.

> > not seem to introduce a noticeable hot start penalty, not does it > > cause an increase in performance problems later in the app's > > lifecycle. I've measured with and without process_madvise, and the > > differences are within our noise bounds. Second, because we're not > > That is assuming that post process_madvise() working set for the application is > always smaller. There is another challenge. The external process should ideally > have the knowledge of active areas of the working set for an application in > question for it to invoke process_madvise() correctly to prevent such scenarios.

There are several ways to detect the working set more accurately at the cost of runtime, for example with idle page tracking or clear_refs. Accuracy is always a trade-off against the overhead of LRU aging.
> > > preemptively evicting file pages and only making them more likely to > > be evicted when there's already memory pressure, we avoid the case > > where we process_madvise an app then immediately return to the app and > > reload all file pages in the working set even though there was no > > intervening memory pressure. Our initial version of this work evicted > > That would be the worst case scenario which should be avoided. Memory pressure > must be a parameter before actually doing the swap out. But pages if know to be > inactive/cold can be marked high priority to be swapped out. > > > file pages preemptively and did cause a noticeable slowdown (~15%) for > > that case; this patch set avoids that slowdown. Finally, the benefit > > from avoiding cold starts is huge. The performance improvement from > > having a hot start instead of a cold start ranges from 3x for very > > small apps to 50x+ for larger apps like high-fidelity games. > > Is there any other real world scenario apart from this app based ecosystem where > user hinted LRU management might be helpful ? Just being curious. Thanks for the > detailed explanation. I will continue looking into this series.
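For reference, a sketch of the clear_refs-based working-set estimation mentioned above: clearing the referenced bits via /proc/&lt;pid&gt;/clear_refs and later summing the "Referenced:" counters from /proc/&lt;pid&gt;/smaps gives a rough picture of how much memory the target touched in the interval. These are existing kernel interfaces; the helper name and the minimal error handling are assumptions made for this illustration, and the code is not part of the patchset.

```c
/*
 * Sketch only: estimate how much of a target process's memory was touched
 * in a given interval, using /proc/<pid>/clear_refs and /proc/<pid>/smaps.
 */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static long referenced_kb_after(pid_t pid, unsigned int seconds)
{
	char path[64], line[256];
	long total_kb = 0, kb;
	FILE *f;

	/* Writing "1" clears the referenced/accessed bits for all mapped pages. */
	snprintf(path, sizeof(path), "/proc/%d/clear_refs", pid);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs("1\n", f);
	fclose(f);

	sleep(seconds);		/* let the app run and touch its hot pages */

	/* Sum the "Referenced:" counters to approximate the touched set. */
	snprintf(path, sizeof(path), "/proc/%d/smaps", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "Referenced: %ld kB", &kb) == 1)
			total_kb += kb;
	fclose(f);

	return total_kb;	/* rough working-set size in kB over the interval */
}
```

A daemon could use an estimate like this to decide which ranges (or which processes) are cold enough to be worth hinting, at the cost of the extra scanning overhead mentioned above.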
[Cc linux-api] On Tue 21-05-19 13:39:50, Minchan Kim wrote: > On Mon, May 20, 2019 at 12:46:05PM -0400, Johannes Weiner wrote: > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote: > > > - Approach > > > > > > The approach we chose was to use a new interface to allow userspace to > > > proactively reclaim entire processes by leveraging platform information. > > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages > > > that are known to be cold from userspace and to avoid races with lmkd > > > by reclaiming apps as soon as they entered the cached state. Additionally, > > > it could provide many chances for platform to use much information to > > > optimize memory efficiency. > > > > > > IMHO we should spell it out that this patchset complements MADV_WONTNEED > > > and MADV_FREE by adding non-destructive ways to gain some free memory > > > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the > > > kernel that memory region is not currently needed and should be reclaimed > > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the > > > kernel that memory region is not currently needed and should be reclaimed > > > when memory pressure rises. > > > > I agree with this approach and the semantics. But these names are very > > vague and extremely easy to confuse since they're so similar. > > > > MADV_COLD could be a good name, but for deactivating pages, not > > reclaiming them - marking memory "cold" on the LRU for later reclaim. > > > > For the immediate reclaim one, I think there is a better option too: > > In virtual memory speak, putting a page into secondary storage (or > > ensuring it's already there), and then freeing its in-memory copy, is > > called "paging out". And that's what this flag is supposed to do. So > > how about MADV_PAGEOUT? > > > > With that, we'd have: > > > > MADV_FREE: Mark data invalid, free memory when needed > > MADV_DONTNEED: Mark data invalid, free memory immediately > > > > MADV_COLD: Data is not used for a while, free memory when needed > > MADV_PAGEOUT: Data is not used for a while, free memory immediately > > > > What do you think? > > There are several suggestions until now. Thanks, Folks! > > For deactivating: > > - MADV_COOL > - MADV_RECLAIM_LAZY > - MADV_DEACTIVATE > - MADV_COLD > - MADV_FREE_PRESERVE > > > For reclaiming: > > - MADV_COLD > - MADV_RECLAIM_NOW > - MADV_RECLAIMING > - MADV_PAGEOUT > - MADV_DONTNEED_PRESERVE > > It seems everybody doesn't like MADV_COLD so want to go with other. > For consisteny of view with other existing hints of madvise, -preserve > postfix suits well. However, originally, I don't like the naming FREE > vs DONTNEED from the beginning. They were easily confused. > I prefer PAGEOUT to RECLAIM since it's more likely to be nuance to > represent reclaim with memory pressure and is supposed to paged-in > if someone need it later. So, it imply PRESERVE. > If there is not strong against it, I want to go with MADV_COLD and > MADV_PAGEOUT. > > Other opinion? I do not really care strongly. I am pretty sure we will have a lot of suggestions because people tend to be good at arguing about that... Anyway, unlike DONTNEED/FREE we do not have any other OS to implement these features, right? So we shouldn't be tight to existing names. On the other hand I kinda like the reference to the existing names but DEACTIVATE/PAGEOUT seem a good fit to me as well. Unless there is way much better name suggested I would go with one of those. Up to you.
[linux-api] On Mon 20-05-19 18:44:52, Matthew Wilcox wrote: > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote: > > IMHO we should spell it out that this patchset complements MADV_WONTNEED > > and MADV_FREE by adding non-destructive ways to gain some free memory > > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the > > kernel that memory region is not currently needed and should be reclaimed > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the > > kernel that memory region is not currently needed and should be reclaimed > > when memory pressure rises. > > Do we tear down page tables for these ranges? That seems like a good > way of reclaiming potentially a substantial amount of memory. I do not think we can in general because this is a non-destructive operation. So at least we cannot tear down anonymous ptes (they will turn into swap entries).
On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> [ ... cover letter background/problem/approach snipped; quoted in full above ... ]
> > This approach is similar in spirit to madvise(MADV_WONTNEED), but the > information required to make the reclaim decision is not known to the app. > Instead, it is known to a centralized userspace daemon, and that daemon > must be able to initiate reclaim on its own without any app involvement. > To solve the concern, this patch introduces new syscall - > > struct pr_madvise_param { > int size; > const struct iovec *vec; > } > > int process_madvise(int pidfd, ssize_t nr_elem, int *behavior, > struct pr_madvise_param *restuls, > struct pr_madvise_param *ranges, > unsigned long flags); > > The syscall get pidfd to give hints to external process and provides > pair of result/ranges vector arguments so that it could give several > hints to each address range all at once. > > I guess others have different ideas about the naming of syscall and options > so feel free to suggest better naming. Yes, all new syscalls making use of pidfds should be named pidfd_<action>. So please make this pidfd_madvise. Please make sure to Cc me on this in the future as I'm maintaining pidfds. Would be great to have Jann on this too since he's been touching both mm and parts of the pidfd stuff with me.
On Tue 21-05-19 08:25:55, Anshuman Khandual wrote: > On 05/20/2019 10:29 PM, Tim Murray wrote: [...] > > not seem to introduce a noticeable hot start penalty, not does it > > cause an increase in performance problems later in the app's > > lifecycle. I've measured with and without process_madvise, and the > > differences are within our noise bounds. Second, because we're not > > That is assuming that post process_madvise() working set for the application is > always smaller. There is another challenge. The external process should ideally > have the knowledge of active areas of the working set for an application in > question for it to invoke process_madvise() correctly to prevent such scenarios. But that doesn't really seem relevant for the API itself, right? The higher level logic the monitor's business.
On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
> On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > [ ... cover letter background/problem/approach snipped; quoted in full above ... ]
> >
> > To achieve the goal, the patchset introduce two new options for madvise.
> > One is MADV_COOL which will deactive activated pages and the other is
> > MADV_COLD which will reclaim private pages instantly. These new options
> > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> > gain some free memory space.
MADV_COLD is similar to MADV_DONTNEED in a way > > that it hints the kernel that memory region is not currently needed and > > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way > > that it hints the kernel that memory region is not currently needed and > > should be reclaimed when memory pressure rises. > > > > This approach is similar in spirit to madvise(MADV_WONTNEED), but the > > information required to make the reclaim decision is not known to the app. > > Instead, it is known to a centralized userspace daemon, and that daemon > > must be able to initiate reclaim on its own without any app involvement. > > To solve the concern, this patch introduces new syscall - > > > > struct pr_madvise_param { > > int size; > > const struct iovec *vec; > > } > > > > int process_madvise(int pidfd, ssize_t nr_elem, int *behavior, > > struct pr_madvise_param *restuls, > > struct pr_madvise_param *ranges, > > unsigned long flags); > > > > The syscall get pidfd to give hints to external process and provides > > pair of result/ranges vector arguments so that it could give several > > hints to each address range all at once. > > > > I guess others have different ideas about the naming of syscall and options > > so feel free to suggest better naming. > > Yes, all new syscalls making use of pidfds should be named > pidfd_<action>. So please make this pidfd_madvise. I don't have any particular preference but just wondering why pidfd is so special to have it as prefix of system call name. > > Please make sure to Cc me on this in the future as I'm maintaining > pidfds. Would be great to have Jann on this too since he's been touching > both mm and parts of the pidfd stuff with me. Sure!
On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote: > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote: > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote: [...] > > > I guess others have different ideas about the naming of syscall and options > > > so feel free to suggest better naming. > > > > Yes, all new syscalls making use of pidfds should be named > > pidfd_<action>. So please make this pidfd_madvise. > > I don't have any particular preference but just wondering why pidfd is > so special to have it as prefix of system call name. It's a whole new API to address processes. We already have clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you exported pidfd_to_pid(). And we're going to have pidfd_open(). Your syscall works only with pidfds so it's tied to this api as well so it should follow the naming scheme. This also makes life easier for userspace and is consistent. > > > > Please make sure to Cc me on this in the future as I'm maintaining > > pidfds. Would be great to have Jann on this too since he's been touching > > both mm and parts of the pidfd stuff with me. > > Sure! Thanks!
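For reference, the existing pidfd plumbing Christian mentions can already be exercised from userspace along these lines. This is a minimal sketch for kernels current at the time of this thread, where a pidfd is an open /proc/<pid> directory fd and pidfd_send_signal() is the only pidfd syscall; the raw syscall number below is the x86_64 one and error handling is trimmed.

/* Minimal sketch of the existing pidfd API referenced above.
 * Assumes Linux 5.1+ on x86_64; pidfd_open() does not exist yet, so the
 * pidfd is obtained by opening the /proc/<pid> directory.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424	/* x86_64 */
#endif

int main(int argc, char **argv)
{
	char path[64];
	int pidfd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s", argv[1]);

	pidfd = open(path, O_DIRECTORY | O_CLOEXEC);	/* acts as a pidfd */
	if (pidfd < 0) {
		perror("open");
		return 1;
	}

	/* Signal 0: existence/permission check through the pidfd. */
	if (syscall(__NR_pidfd_send_signal, pidfd, 0, NULL, 0) < 0) {
		perror("pidfd_send_signal");
		return 1;
	}
	printf("process is alive and signalable via its pidfd\n");
	close(pidfd);
	return 0;
}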
On Tue, May 21, 2019 at 01:30:29PM +0200, Christian Brauner wrote: > On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote: > > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote: > > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote: [...] > > > Yes, all new syscalls making use of pidfds should be named > > > pidfd_<action>. So please make this pidfd_madvise. > > > > I don't have any particular preference but just wondering why pidfd is > > so special to have it as prefix of system call name. > > It's a whole new API to address processes. We already have > clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you > exported pidfd_to_pid(). And we're going to have pidfd_open(). Your > syscall works only with pidfds so it's tied to this api as well so it > should follow the naming scheme. This also makes life easier for > userspace and is consistent. This is at least my reasoning. I'm not going to make this a whole big pedantic argument. If people have really strong feelings about not using this prefix then fine. But if syscalls can be grouped together and have consistent naming this is always a big plus.
On Tue, May 21, 2019 at 01:30:32PM +0200, Christian Brauner wrote: > On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote: > > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote: > > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote: [...] > > > Yes, all new syscalls making use of pidfds should be named > > > pidfd_<action>. So please make this pidfd_madvise. > > > > I don't have any particular preference but just wondering why pidfd is > > so special to have it as prefix of system call name. > > It's a whole new API to address processes. We already have > clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you > exported pidfd_to_pid(). And we're going to have pidfd_open(). Your > syscall works only with pidfds so it's tied to this api as well so it > should follow the naming scheme. This also makes life easier for > userspace and is consistent. Okay. I will change the API name at next revision. Thanks.
On May 21, 2019 1:41:20 PM GMT+02:00, Minchan Kim <minchan@kernel.org> wrote: >On Tue, May 21, 2019 at 01:30:32PM +0200, Christian Brauner wrote: >> On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote: >> > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote: >> > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote: [...] >> It's a whole new API to address processes. We already have >> clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you >> exported pidfd_to_pid(). And we're going to have pidfd_open(). Your >> syscall works only with pidfds so it's tied to this api as well so it >> should follow the naming scheme. This also makes life easier for >> userspace and is consistent. > >Okay. I will change the API name at next revision. >Thanks. Thanks! Fwiw, there's been a similar patch by Oleksandr for pidfd_madvise I stumbled upon a few days back: https://gitlab.com/post-factum/pf-kernel/commit/0595f874a53fa898739ac315ddf208554d9dc897 He wanted to be cc'ed but I forgot. Christian
On Tue, May 21, 2019 at 02:04:00PM +0200, Christian Brauner wrote: > On May 21, 2019 1:41:20 PM GMT+02:00, Minchan Kim <minchan@kernel.org> wrote: > >On Tue, May 21, 2019 at 01:30:32PM +0200, Christian Brauner wrote: [...] > >Okay. I will change the API name at next revision. > >Thanks. > > Thanks! > Fwiw, there's been a similar patch by Oleksandr for pidfd_madvise I stumbled upon a few days back: > https://gitlab.com/post-factum/pf-kernel/commit/0595f874a53fa898739ac315ddf208554d9dc897 > > He wanted to be cc'ed but I forgot. Thanks :). FWIW, since this submission is essentially a continuation of our discussion involving my earlier KSM submissions here, I won't move my gitlab branch forward and will be happy to assist with what we have here, be it pidfd_madvise() or a set of /proc files (or smth else). > > Christian
On Sun, May 19, 2019 at 8:53 PM Minchan Kim <minchan@kernel.org> wrote: > > - Background > [...] > > - Problem > > Naturally, cached apps were dominant consumers of memory on the system. > However, they were not significant consumers of swap even though they are > good candidate for swap. Under investigation, swapping out only begins > once the low zone watermark is hit and kswapd wakes up, but the overall > allocation rate in the system might trip lmkd thresholds and cause a cached > process to be killed(we measured performance swapping out vs. zapping the > memory by killing a process. Unsurprisingly, zapping is 10x times faster > even though we use zram which is much faster than real storage) so kill > from lmkd will often satisfy the high zone watermark, resulting in very > few pages actually being moved to swap. It is not clear what exactly is the problem from the above para. IMO low usage of swap is not the problem; rather, the problem is global memory pressure and the reactive response to it. Killing apps instead of swapping is preferred since, as you have noted, zapping frees memory faster, but it indirectly increases cold starts. Also, swapping on allocation causes latency issues for the app. So a proactive mechanism is needed to keep global pressure away, which indirectly reduces cold starts and alloc stalls. > > - Approach > > The approach we chose was to use a new interface to allow userspace to > proactively reclaim entire processes by leveraging platform information. > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages > that are known to be cold from userspace and to avoid races with lmkd > by reclaiming apps as soon as they entered the cached state. Additionally, > it could provide many chances for platform to use much information to > optimize memory efficiency. I think it would be good to have clear reasoning on why the "reclaim from userspace" approach is taken. The Android runtime clearly has more accurate stale/cold information at the app/process level and can positively influence the kernel's reclaim decisions, so the "reclaim from userspace" approach makes total sense for Android. I envision that Chrome OS would be another very obvious user of this approach. There can be tens of tabs which the user has not touched for some time. Chrome OS can proactively reclaim memory from such tabs. > > IMHO we should spell it out that this patchset complements MADV_WONTNEED MADV_DONTNEED? same at couple of places below. > and MADV_FREE by adding non-destructive ways to gain some free memory > space.
MADV_COLD is similar to MADV_WONTNEED in a way that it hints the > kernel that memory region is not currently needed and should be reclaimed > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the > kernel that memory region is not currently needed and should be reclaimed > when memory pressure rises. > > To achieve the goal, the patchset introduce two new options for madvise. > One is MADV_COOL which will deactive activated pages and the other is > MADV_COLD which will reclaim private pages instantly. These new options > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to > gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way > that it hints the kernel that memory region is not currently needed and > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way > that it hints the kernel that memory region is not currently needed and > should be reclaimed when memory pressure rises. > > This approach is similar in spirit to madvise(MADV_WONTNEED), but the > information required to make the reclaim decision is not known to the app. > Instead, it is known to a centralized userspace daemon, and that daemon > must be able to initiate reclaim on its own without any app involvement. > To solve the concern, this patch introduces new syscall - > > struct pr_madvise_param { > int size; > const struct iovec *vec; > } > > int process_madvise(int pidfd, ssize_t nr_elem, int *behavior, > struct pr_madvise_param *restuls, > struct pr_madvise_param *ranges, > unsigned long flags); > > The syscall get pidfd to give hints to external process and provides > pair of result/ranges vector arguments so that it could give several > hints to each address range all at once. > > I guess others have different ideas about the naming of syscall and options > so feel free to suggest better naming. > > - Experiment > > We did bunch of testing with several hundreds of real users, not artificial > benchmark on android. We saw about 17% cold start decreasement without any > significant battery/app startup latency issues. And with artificial benchmark > which launches and switching apps, we saw average 7% app launching improvement, > 18% less lmkd kill and good stat from vmstat. > > A is vanilla and B is process_madvise. 
>
> A B delta ratio(%)
> allocstall_dma 0 0 0 0.00
> allocstall_movable 1464 457 -1007 -69.00
> allocstall_normal 263210 190763 -72447 -28.00
> allocstall_total 264674 191220 -73454 -28.00
> compact_daemon_wake 26912 25294 -1618 -7.00
> compact_fail 17885 14151 -3734 -21.00
> compact_free_scanned 4204766409 3835994922 -368771487 -9.00
> compact_isolated 3446484 2967618 -478866 -14.00
> compact_migrate_scanned 1621336411 1324695710 -296640701 -19.00
> compact_stall 19387 15343 -4044 -21.00
> compact_success 1502 1192 -310 -21.00
> kswapd_high_wmark_hit_quickly 234 184 -50 -22.00
> kswapd_inodesteal 221635 233093 11458 5.00
> kswapd_low_wmark_hit_quickly 66065 54009 -12056 -19.00
> nr_dirtied 259934 296476 36542 14.00
> nr_vmscan_immediate_reclaim 2587 2356 -231 -9.00
> nr_vmscan_write 1274232 2661733 1387501 108.00
> nr_written 1514060 2937560 1423500 94.00
> pageoutrun 67561 55133 -12428 -19.00
> pgactivate 2335060 1984882 -350178 -15.00
> pgalloc_dma 13743011 14096463 353452 2.00
> pgalloc_movable 0 0 0 0.00
> pgalloc_normal 18742440 16802065 -1940375 -11.00
> pgalloc_total 32485451 30898528 -1586923 -5.00
> pgdeactivate 4262210 2930670 -1331540 -32.00
> pgfault 30812334 31085065 272731 0.00
> pgfree 33553970 31765164 -1788806 -6.00
> pginodesteal 33411 15084 -18327 -55.00
> pglazyfreed 0 0 0 0.00
> pgmajfault 551312 1508299 956987 173.00
> pgmigrate_fail 43927 29330 -14597 -34.00
> pgmigrate_success 1399851 1203922 -195929 -14.00
> pgpgin 24141776 19032156 -5109620 -22.00
> pgpgout 959344 1103316 143972 15.00
> pgpgoutclean 4639732 3765868 -873864 -19.00
> pgrefill 4884560 3006938 -1877622 -39.00
> pgrotated 37828 25897 -11931 -32.00
> pgscan_direct 1456037 957567 -498470 -35.00
> pgscan_direct_throttle 0 0 0 0.00
> pgscan_kswapd 6667767 5047360 -1620407 -25.00
> pgscan_total 8123804 6004927 -2118877 -27.00
> pgskip_dma 0 0 0 0.00
> pgskip_movable 0 0 0 0.00
> pgskip_normal 14907 25382 10475 70.00
> pgskip_total 14907 25382 10475 70.00
> pgsteal_direct 1118986 690215 -428771 -39.00
> pgsteal_kswapd 4750223 3657107 -1093116 -24.00
> pgsteal_total 5869209 4347322 -1521887 -26.00
> pswpin 417613 1392647 975034 233.00
> pswpout 1274224 2661731 1387507 108.00
> slabs_scanned 13686905 10807200 -2879705 -22.00
> workingset_activate 668966 569444 -99522 -15.00
> workingset_nodereclaim 38957 32621 -6336 -17.00
> workingset_refault 2816795 2179782 -637013 -23.00
> workingset_restore 294320 168601 -125719 -43.00
>
> pgmajfault is increased by 173% because swapin is increased by 200% by
> process_madvise hint. However, swap read based on zram is much cheaper
> than file IO in performance point of view and app hot start by swapin is
> also cheaper than cold start from the beginning of app which needs many IO
> from storage and initialization steps.
>
> This patchset is against on next-20190517.
> > Minchan Kim (7): > mm: introduce MADV_COOL > mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM > mm: introduce MADV_COLD > mm: factor out madvise's core functionality > mm: introduce external memory hinting API > mm: extend process_madvise syscall to support vector arrary > mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER > > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > include/linux/page-flags.h | 1 + > include/linux/page_idle.h | 15 + > include/linux/proc_fs.h | 1 + > include/linux/swap.h | 2 + > include/linux/syscalls.h | 2 + > include/uapi/asm-generic/mman-common.h | 12 + > include/uapi/asm-generic/unistd.h | 2 + > kernel/signal.c | 2 +- > kernel/sys_ni.c | 1 + > mm/madvise.c | 600 +++++++++++++++++++++---- > mm/swap.c | 43 ++ > mm/vmscan.c | 80 +++- > 14 files changed, 680 insertions(+), 83 deletions(-) > > -- > 2.21.0.1020.gf2820cf01a-goog >
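To make the shape of the proposed interface concrete, here is a compilable sketch of how a monitoring process might call it. Everything marked as a placeholder is an assumption: the syscall number is not allocated, the MADV_COOL/MADV_COLD values are not in any released uapi header, the target pid and address range are invented, and the exact meaning of the results vector is not fully specified in the cover letter.

/* Illustrative only: sketch of a call to the process_madvise() interface
 * as declared in this RFC.  All constants below are placeholders.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#define __NR_process_madvise	437	/* placeholder, no number allocated */
#define MADV_COOL		100	/* placeholder RFC value */
#define MADV_COLD		101	/* placeholder RFC value */

struct pr_madvise_param {
	int size;			/* number of iovec entries */
	const struct iovec *vec;	/* address ranges (or per-range results) */
};

int main(void)
{
	/* One hint: mark a hypothetical 2MB region in the target as "cool". */
	struct iovec range = { .iov_base = (void *)0x7f0000000000, .iov_len = 2 << 20 };
	int ret_buf = 0;
	struct iovec result = { .iov_base = &ret_buf, .iov_len = sizeof(ret_buf) };
	struct pr_madvise_param ranges = { .size = 1, .vec = &range };
	struct pr_madvise_param results = { .size = 1, .vec = &result };
	int behavior = MADV_COOL;

	/* Assumes a /proc/<pid> directory fd is accepted as the pidfd. */
	int pidfd = open("/proc/1234", O_DIRECTORY | O_CLOEXEC);
	long err = syscall(__NR_process_madvise, pidfd, (ssize_t)1, &behavior,
			   &results, &ranges, 0UL);
	if (err < 0)
		perror("process_madvise (expected to fail on kernels without the patchset)");
	close(pidfd);
	return 0;
}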
On Mon, May 20, 2019 at 7:55 PM Anshuman Khandual <anshuman.khandual@arm.com> wrote: > > > > On 05/20/2019 10:29 PM, Tim Murray wrote: > > On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual > > <anshuman.khandual@arm.com> wrote: > >> > >> Or Is the objective here is reduce the number of processes which get killed by > >> lmkd by triggering swapping for the unused memory (user hinted) sooner so that > >> they dont get picked by lmkd. Under utilization for zram hardware is a concern > >> here as well ? > > > > The objective is to avoid some instances of memory pressure by > > proactively swapping pages that userspace knows to be cold before > > those pages reach the end of the LRUs, which in turn can prevent some > > apps from being killed by lmk/lmkd. As soon as Android userspace knows > > that an application is not being used and is only resident to improve > > performance if the user returns to that app, we can kick off > > process_madvise on that process's pages (or some portion of those > > pages) in a power-efficient way to reduce memory pressure long before > > the system hits the free page watermark. This allows the system more > > time to put pages into zram versus waiting for the watermark to > > trigger kswapd, which decreases the likelihood that later memory > > allocations will cause enough pressure to trigger a kill of one of > > these apps. > > So this opens up bit of LRU management to user space hints. Also because the app > in itself wont know about the memory situation of the entire system, new system > call needs to be called from an external process. > > > > >> Swapping out memory into zram wont increase the latency for a hot start ? Or > >> is it because as it will prevent a fresh cold start which anyway will be slower > >> than a slow hot start. Just being curious. > > > > First, not all swapped pages will be reloaded immediately once an app > > is resumed. We've found that an app's working set post-process_madvise > > is significantly smaller than what an app allocates when it first > > launches (see the delta between pswpin and pswpout in Minchan's > > results). Presumably because of this, faulting to fetch from zram does > > pswpin 417613 1392647 975034 233.00 > pswpout 1274224 2661731 1387507 108.00 > > IIUC the swap-in ratio is way higher in comparison to that of swap out. Is that > always the case ? Or it tend to swap out from an active area of the working set > which faulted back again. > > > not seem to introduce a noticeable hot start penalty, not does it > > cause an increase in performance problems later in the app's > > lifecycle. I've measured with and without process_madvise, and the > > differences are within our noise bounds. Second, because we're not > > That is assuming that post process_madvise() working set for the application is > always smaller. There is another challenge. The external process should ideally > have the knowledge of active areas of the working set for an application in > question for it to invoke process_madvise() correctly to prevent such scenarios. > > > preemptively evicting file pages and only making them more likely to > > be evicted when there's already memory pressure, we avoid the case > > where we process_madvise an app then immediately return to the app and > > reload all file pages in the working set even though there was no > > intervening memory pressure. Our initial version of this work evicted > > That would be the worst case scenario which should be avoided. 
Memory pressure > must be a parameter before actually doing the swap out. But pages if know to be > inactive/cold can be marked high priority to be swapped out. > > > file pages preemptively and did cause a noticeable slowdown (~15%) for > > that case; this patch set avoids that slowdown. Finally, the benefit > > from avoiding cold starts is huge. The performance improvement from > > having a hot start instead of a cold start ranges from 3x for very > > small apps to 50x+ for larger apps like high-fidelity games. > > Is there any other real world scenario apart from this app based ecosystem where > user hinted LRU management might be helpful ? Just being curious. Thanks for the > detailed explanation. I will continue looking into this series. Chrome OS is another real world use-case for this user hinted LRU management approach by proactively reclaiming memory from tabs not accessed by the user for some time.
To expand on the ChromeOS use case we're in a very similar situation to Android. For example, the Chrome browser uses a separate process for each individual tab (with some exceptions) and over time many tabs remain open in a back-grounded or idle state. Given that we have a lot of information about the weight of a tab, when it was last active, etc, we can benefit tremendously from per-process reclaim. We're working on getting real world numbers but all of our initial testing shows very promising results. On Tue, May 21, 2019 at 5:57 AM Shakeel Butt <shakeelb@google.com> wrote: > > On Mon, May 20, 2019 at 7:55 PM Anshuman Khandual > <anshuman.khandual@arm.com> wrote: > > > > > > > > On 05/20/2019 10:29 PM, Tim Murray wrote: > > > On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual > > > <anshuman.khandual@arm.com> wrote: > > >> > > >> Or Is the objective here is reduce the number of processes which get killed by > > >> lmkd by triggering swapping for the unused memory (user hinted) sooner so that > > >> they dont get picked by lmkd. Under utilization for zram hardware is a concern > > >> here as well ? > > > > > > The objective is to avoid some instances of memory pressure by > > > proactively swapping pages that userspace knows to be cold before > > > those pages reach the end of the LRUs, which in turn can prevent some > > > apps from being killed by lmk/lmkd. As soon as Android userspace knows > > > that an application is not being used and is only resident to improve > > > performance if the user returns to that app, we can kick off > > > process_madvise on that process's pages (or some portion of those > > > pages) in a power-efficient way to reduce memory pressure long before > > > the system hits the free page watermark. This allows the system more > > > time to put pages into zram versus waiting for the watermark to > > > trigger kswapd, which decreases the likelihood that later memory > > > allocations will cause enough pressure to trigger a kill of one of > > > these apps. > > > > So this opens up bit of LRU management to user space hints. Also because the app > > in itself wont know about the memory situation of the entire system, new system > > call needs to be called from an external process. > > > > > > > >> Swapping out memory into zram wont increase the latency for a hot start ? Or > > >> is it because as it will prevent a fresh cold start which anyway will be slower > > >> than a slow hot start. Just being curious. > > > > > > First, not all swapped pages will be reloaded immediately once an app > > > is resumed. We've found that an app's working set post-process_madvise > > > is significantly smaller than what an app allocates when it first > > > launches (see the delta between pswpin and pswpout in Minchan's > > > results). Presumably because of this, faulting to fetch from zram does > > > > pswpin 417613 1392647 975034 233.00 > > pswpout 1274224 2661731 1387507 108.00 > > > > IIUC the swap-in ratio is way higher in comparison to that of swap out. Is that > > always the case ? Or it tend to swap out from an active area of the working set > > which faulted back again. > > > > > not seem to introduce a noticeable hot start penalty, not does it > > > cause an increase in performance problems later in the app's > > > lifecycle. I've measured with and without process_madvise, and the > > > differences are within our noise bounds. Second, because we're not > > > > That is assuming that post process_madvise() working set for the application is > > always smaller. 
There is another challenge. The external process should ideally > > have the knowledge of active areas of the working set for an application in > > question for it to invoke process_madvise() correctly to prevent such scenarios. > > > > > preemptively evicting file pages and only making them more likely to > > > be evicted when there's already memory pressure, we avoid the case > > > where we process_madvise an app then immediately return to the app and > > > reload all file pages in the working set even though there was no > > > intervening memory pressure. Our initial version of this work evicted > > > > That would be the worst case scenario which should be avoided. Memory pressure > > must be a parameter before actually doing the swap out. But pages if know to be > > inactive/cold can be marked high priority to be swapped out. > > > > > file pages preemptively and did cause a noticeable slowdown (~15%) for > > > that case; this patch set avoids that slowdown. Finally, the benefit > > > from avoiding cold starts is huge. The performance improvement from > > > having a hot start instead of a cold start ranges from 3x for very > > > small apps to 50x+ for larger apps like high-fidelity games. > > > > Is there any other real world scenario apart from this app based ecosystem where > > user hinted LRU management might be helpful ? Just being curious. Thanks for the > > detailed explanation. I will continue looking into this series. > > Chrome OS is another real world use-case for this user hinted LRU > management approach by proactively reclaiming reclaim from tabs not > accessed by the user for some time.
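As an aside for anyone reproducing the pswpin/pswpout comparison quoted above: those are the global counters exported in /proc/vmstat, and the deltas can be sampled around an experiment with something as small as the sketch below. The counter names are the standard vmstat fields; the surrounding program is only illustrative.

#include <stdio.h>
#include <string.h>

/* Return the named /proc/vmstat counter, or -1 if it cannot be read. */
static long long vmstat_counter(const char *name)
{
	FILE *f = fopen("/proc/vmstat", "r");
	char key[64];
	long long val;

	if (!f)
		return -1;
	while (fscanf(f, "%63s %lld", key, &val) == 2) {
		if (!strcmp(key, name)) {
			fclose(f);
			return val;
		}
	}
	fclose(f);
	return -1;
}

int main(void)
{
	long long in0 = vmstat_counter("pswpin");
	long long out0 = vmstat_counter("pswpout");

	/* ... run the reclaim experiment being measured here ... */

	printf("pswpin delta:  %lld\n", vmstat_counter("pswpin") - in0);
	printf("pswpout delta: %lld\n", vmstat_counter("pswpout") - out0);
	return 0;
}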
On Tue, May 21, 2019 at 4:39 AM Christian Brauner <christian@brauner.io> wrote: > > On Tue, May 21, 2019 at 01:30:29PM +0200, Christian Brauner wrote: > > On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote: > > > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote: > > > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote: > > > > > - Background > > > > > > > > > > The Android terminology used for forking a new process and starting an app > > > > > from scratch is a cold start, while resuming an existing app is a hot start. > > > > > While we continually try to improve the performance of cold starts, hot > > > > > starts will always be significantly less power hungry as well as faster so > > > > > we are trying to make hot start more likely than cold start. > > > > > > > > > > To increase hot start, Android userspace manages the order that apps should > > > > > be killed in a process called ActivityManagerService. ActivityManagerService > > > > > tracks every Android app or service that the user could be interacting with > > > > > at any time and translates that into a ranked list for lmkd(low memory > > > > > killer daemon). They are likely to be killed by lmkd if the system has to > > > > > reclaim memory. In that sense they are similar to entries in any other cache. > > > > > Those apps are kept alive for opportunistic performance improvements but > > > > > those performance improvements will vary based on the memory requirements of > > > > > individual workloads. > > > > > > > > > > - Problem > > > > > > > > > > Naturally, cached apps were dominant consumers of memory on the system. > > > > > However, they were not significant consumers of swap even though they are > > > > > good candidate for swap. Under investigation, swapping out only begins > > > > > once the low zone watermark is hit and kswapd wakes up, but the overall > > > > > allocation rate in the system might trip lmkd thresholds and cause a cached > > > > > process to be killed(we measured performance swapping out vs. zapping the > > > > > memory by killing a process. Unsurprisingly, zapping is 10x times faster > > > > > even though we use zram which is much faster than real storage) so kill > > > > > from lmkd will often satisfy the high zone watermark, resulting in very > > > > > few pages actually being moved to swap. > > > > > > > > > > - Approach > > > > > > > > > > The approach we chose was to use a new interface to allow userspace to > > > > > proactively reclaim entire processes by leveraging platform information. > > > > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages > > > > > that are known to be cold from userspace and to avoid races with lmkd > > > > > by reclaiming apps as soon as they entered the cached state. Additionally, > > > > > it could provide many chances for platform to use much information to > > > > > optimize memory efficiency. > > > > > > > > > > IMHO we should spell it out that this patchset complements MADV_WONTNEED > > > > > and MADV_FREE by adding non-destructive ways to gain some free memory > > > > > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the > > > > > kernel that memory region is not currently needed and should be reclaimed > > > > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the > > > > > kernel that memory region is not currently needed and should be reclaimed > > > > > when memory pressure rises. 
> > > > > > > > > > To achieve the goal, the patchset introduce two new options for madvise. > > > > > One is MADV_COOL which will deactive activated pages and the other is > > > > > MADV_COLD which will reclaim private pages instantly. These new options > > > > > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to > > > > > gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way > > > > > that it hints the kernel that memory region is not currently needed and > > > > > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way > > > > > that it hints the kernel that memory region is not currently needed and > > > > > should be reclaimed when memory pressure rises. > > > > > > > > > > This approach is similar in spirit to madvise(MADV_WONTNEED), but the > > > > > information required to make the reclaim decision is not known to the app. > > > > > Instead, it is known to a centralized userspace daemon, and that daemon > > > > > must be able to initiate reclaim on its own without any app involvement. > > > > > To solve the concern, this patch introduces new syscall - > > > > > > > > > > struct pr_madvise_param { > > > > > int size; > > > > > const struct iovec *vec; > > > > > } > > > > > > > > > > int process_madvise(int pidfd, ssize_t nr_elem, int *behavior, > > > > > struct pr_madvise_param *restuls, > > > > > struct pr_madvise_param *ranges, > > > > > unsigned long flags); > > > > > > > > > > The syscall get pidfd to give hints to external process and provides > > > > > pair of result/ranges vector arguments so that it could give several > > > > > hints to each address range all at once. > > > > > > > > > > I guess others have different ideas about the naming of syscall and options > > > > > so feel free to suggest better naming. > > > > > > > > Yes, all new syscalls making use of pidfds should be named > > > > pidfd_<action>. So please make this pidfd_madvise. > > > > > > I don't have any particular preference but just wondering why pidfd is > > > so special to have it as prefix of system call name. > > > > It's a whole new API to address processes. We already have > > clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you > > exported pidfd_to_pid(). And we're going to have pidfd_open(). Your > > syscall works only with pidfds so it's tied to this api as well so it > > should follow the naming scheme. This also makes life easier for > > userspace and is consistent. > > This is at least my reasoning. I'm not going to make this a whole big > pedantic argument. If people have really strong feelings about not using > this prefix then fine. But if syscalls can be grouped together and have > consistent naming this is always a big plus. My hope has been that pidfd use becomes normalized enough that prefixing "pidfd_" to pidfd-accepting system calls becomes redundant. We write write(), not fd_write(), right? :-) pidfd_open() makes sense because the primary purpose of this system call is to operate on a pidfd, but I think process_madvise() is fine.
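To make the calling convention quoted above concrete, here is a rough userspace sketch of how a monitor might drive the proposed interface. It mirrors the pr_madvise_param/iovec declaration from the cover letter, but the syscall number and the advice value are deliberately left as placeholders (they exist only in the patched tree), and reclaim_one_range() is a hypothetical helper, not part of the series.

#include <stdio.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/* Mirrors the RFC declaration; not present in any released uapi header. */
struct pr_madvise_param {
	int size;			/* number of entries in vec */
	const struct iovec *vec;	/* address ranges / per-range results */
};

/* Placeholders: take the real numbers from the patched kernel headers. */
#define __NR_process_madvise_rfc	-1
#define MADV_COOL_RFC			-1

/* Hypothetical helper: hint that one range of the target (a pidfd) is cold. */
static long reclaim_one_range(int pidfd, void *addr, size_t len, int advice)
{
	struct iovec range = { .iov_base = addr, .iov_len = len };
	struct iovec result = { 0 };
	struct pr_madvise_param ranges = { .size = 1, .vec = &range };
	struct pr_madvise_param results = { .size = 1, .vec = &result };

	return syscall(__NR_process_madvise_rfc, pidfd, (ssize_t)1, &advice,
		       &results, &ranges, 0UL);
}

int main(void)
{
	/* With placeholder numbers this simply fails; it only shows the shape
	 * of the call a monitor such as ActivityManagerService would issue. */
	long ret = reclaim_one_range(-1, NULL, 0, MADV_COOL_RFC);

	printf("process_madvise sketch returned %ld\n", ret);
	return 0;
}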
On Wed, May 22, 2019 at 7:12 AM Daniel Colascione <dancol@google.com> wrote: > > On Tue, May 21, 2019 at 4:39 AM Christian Brauner <christian@brauner.io> wrote: > > > > On Tue, May 21, 2019 at 01:30:29PM +0200, Christian Brauner wrote: > > > On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote: > > > > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote: > > > > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote: > > > > > > - Background > > > > > > > > > > > > The Android terminology used for forking a new process and starting an app > > > > > > from scratch is a cold start, while resuming an existing app is a hot start. > > > > > > While we continually try to improve the performance of cold starts, hot > > > > > > starts will always be significantly less power hungry as well as faster so > > > > > > we are trying to make hot start more likely than cold start. > > > > > > > > > > > > To increase hot start, Android userspace manages the order that apps should > > > > > > be killed in a process called ActivityManagerService. ActivityManagerService > > > > > > tracks every Android app or service that the user could be interacting with > > > > > > at any time and translates that into a ranked list for lmkd(low memory > > > > > > killer daemon). They are likely to be killed by lmkd if the system has to > > > > > > reclaim memory. In that sense they are similar to entries in any other cache. > > > > > > Those apps are kept alive for opportunistic performance improvements but > > > > > > those performance improvements will vary based on the memory requirements of > > > > > > individual workloads. > > > > > > > > > > > > - Problem > > > > > > > > > > > > Naturally, cached apps were dominant consumers of memory on the system. > > > > > > However, they were not significant consumers of swap even though they are > > > > > > good candidate for swap. Under investigation, swapping out only begins > > > > > > once the low zone watermark is hit and kswapd wakes up, but the overall > > > > > > allocation rate in the system might trip lmkd thresholds and cause a cached > > > > > > process to be killed(we measured performance swapping out vs. zapping the > > > > > > memory by killing a process. Unsurprisingly, zapping is 10x times faster > > > > > > even though we use zram which is much faster than real storage) so kill > > > > > > from lmkd will often satisfy the high zone watermark, resulting in very > > > > > > few pages actually being moved to swap. > > > > > > > > > > > > - Approach > > > > > > > > > > > > The approach we chose was to use a new interface to allow userspace to > > > > > > proactively reclaim entire processes by leveraging platform information. > > > > > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages > > > > > > that are known to be cold from userspace and to avoid races with lmkd > > > > > > by reclaiming apps as soon as they entered the cached state. Additionally, > > > > > > it could provide many chances for platform to use much information to > > > > > > optimize memory efficiency. > > > > > > > > > > > > IMHO we should spell it out that this patchset complements MADV_WONTNEED > > > > > > and MADV_FREE by adding non-destructive ways to gain some free memory > > > > > > space. 
MADV_COLD is similar to MADV_WONTNEED in a way that it hints the > > > > > > kernel that memory region is not currently needed and should be reclaimed > > > > > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the > > > > > > kernel that memory region is not currently needed and should be reclaimed > > > > > > when memory pressure rises. > > > > > > > > > > > > To achieve the goal, the patchset introduce two new options for madvise. > > > > > > One is MADV_COOL which will deactive activated pages and the other is > > > > > > MADV_COLD which will reclaim private pages instantly. These new options > > > > > > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to > > > > > > gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way > > > > > > that it hints the kernel that memory region is not currently needed and > > > > > > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way > > > > > > that it hints the kernel that memory region is not currently needed and > > > > > > should be reclaimed when memory pressure rises. > > > > > > > > > > > > This approach is similar in spirit to madvise(MADV_WONTNEED), but the > > > > > > information required to make the reclaim decision is not known to the app. > > > > > > Instead, it is known to a centralized userspace daemon, and that daemon > > > > > > must be able to initiate reclaim on its own without any app involvement. > > > > > > To solve the concern, this patch introduces new syscall - > > > > > > > > > > > > struct pr_madvise_param { > > > > > > int size; > > > > > > const struct iovec *vec; > > > > > > } > > > > > > > > > > > > int process_madvise(int pidfd, ssize_t nr_elem, int *behavior, > > > > > > struct pr_madvise_param *restuls, > > > > > > struct pr_madvise_param *ranges, > > > > > > unsigned long flags); > > > > > > > > > > > > The syscall get pidfd to give hints to external process and provides > > > > > > pair of result/ranges vector arguments so that it could give several > > > > > > hints to each address range all at once. > > > > > > > > > > > > I guess others have different ideas about the naming of syscall and options > > > > > > so feel free to suggest better naming. > > > > > > > > > > Yes, all new syscalls making use of pidfds should be named > > > > > pidfd_<action>. So please make this pidfd_madvise. > > > > > > > > I don't have any particular preference but just wondering why pidfd is > > > > so special to have it as prefix of system call name. > > > > > > It's a whole new API to address processes. We already have > > > clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you > > > exported pidfd_to_pid(). And we're going to have pidfd_open(). Your > > > syscall works only with pidfds so it's tied to this api as well so it > > > should follow the naming scheme. This also makes life easier for > > > userspace and is consistent. > > > > This is at least my reasoning. I'm not going to make this a whole big > > pedantic argument. If people have really strong feelings about not using > > this prefix then fine. But if syscalls can be grouped together and have > > consistent naming this is always a big plus. > > My hope has been that pidfd use becomes normalized enough that > prefixing "pidfd_" to pidfd-accepting system calls becomes redundant. > We write write(), not fd_write(), right? :-) pidfd_open() makes sense > because the primary purpose of this system call is to operate on a > pidfd, but I think process_madvise() is fine. 
This madvise syscall just operates on pidfds. It would make sense to name it process_madvise() if it were to operate on both pid_t and int pidfd. Giving specific names to system calls won't stop it from becoming normalized. The fact that people built other system calls around it is enough proof of that. :) For userspace pidfd_madvise is nicer and it clearly expresses that it only accepts pidfds. So please, Minchan, make it pidfd_madvise() in the next version. :) Christian
On Wed, May 22, 2019 at 1:22 AM Christian Brauner <christian@brauner.io> wrote: > > On Wed, May 22, 2019 at 7:12 AM Daniel Colascione <dancol@google.com> wrote: > > > > On Tue, May 21, 2019 at 4:39 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > On Tue, May 21, 2019 at 01:30:29PM +0200, Christian Brauner wrote: > > > > On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote: > > > > > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote: > > > > > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote: > > > > > > > - Background > > > > > > > > > > > > > > The Android terminology used for forking a new process and starting an app > > > > > > > from scratch is a cold start, while resuming an existing app is a hot start. > > > > > > > While we continually try to improve the performance of cold starts, hot > > > > > > > starts will always be significantly less power hungry as well as faster so > > > > > > > we are trying to make hot start more likely than cold start. > > > > > > > > > > > > > > To increase hot start, Android userspace manages the order that apps should > > > > > > > be killed in a process called ActivityManagerService. ActivityManagerService > > > > > > > tracks every Android app or service that the user could be interacting with > > > > > > > at any time and translates that into a ranked list for lmkd(low memory > > > > > > > killer daemon). They are likely to be killed by lmkd if the system has to > > > > > > > reclaim memory. In that sense they are similar to entries in any other cache. > > > > > > > Those apps are kept alive for opportunistic performance improvements but > > > > > > > those performance improvements will vary based on the memory requirements of > > > > > > > individual workloads. > > > > > > > > > > > > > > - Problem > > > > > > > > > > > > > > Naturally, cached apps were dominant consumers of memory on the system. > > > > > > > However, they were not significant consumers of swap even though they are > > > > > > > good candidate for swap. Under investigation, swapping out only begins > > > > > > > once the low zone watermark is hit and kswapd wakes up, but the overall > > > > > > > allocation rate in the system might trip lmkd thresholds and cause a cached > > > > > > > process to be killed(we measured performance swapping out vs. zapping the > > > > > > > memory by killing a process. Unsurprisingly, zapping is 10x times faster > > > > > > > even though we use zram which is much faster than real storage) so kill > > > > > > > from lmkd will often satisfy the high zone watermark, resulting in very > > > > > > > few pages actually being moved to swap. > > > > > > > > > > > > > > - Approach > > > > > > > > > > > > > > The approach we chose was to use a new interface to allow userspace to > > > > > > > proactively reclaim entire processes by leveraging platform information. > > > > > > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages > > > > > > > that are known to be cold from userspace and to avoid races with lmkd > > > > > > > by reclaiming apps as soon as they entered the cached state. Additionally, > > > > > > > it could provide many chances for platform to use much information to > > > > > > > optimize memory efficiency. > > > > > > > > > > > > > > IMHO we should spell it out that this patchset complements MADV_WONTNEED > > > > > > > and MADV_FREE by adding non-destructive ways to gain some free memory > > > > > > > space. 
MADV_COLD is similar to MADV_WONTNEED in a way that it hints the > > > > > > > kernel that memory region is not currently needed and should be reclaimed > > > > > > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the > > > > > > > kernel that memory region is not currently needed and should be reclaimed > > > > > > > when memory pressure rises. > > > > > > > > > > > > > > To achieve the goal, the patchset introduce two new options for madvise. > > > > > > > One is MADV_COOL which will deactive activated pages and the other is > > > > > > > MADV_COLD which will reclaim private pages instantly. These new options > > > > > > > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to > > > > > > > gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way > > > > > > > that it hints the kernel that memory region is not currently needed and > > > > > > > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way > > > > > > > that it hints the kernel that memory region is not currently needed and > > > > > > > should be reclaimed when memory pressure rises. > > > > > > > > > > > > > > This approach is similar in spirit to madvise(MADV_WONTNEED), but the > > > > > > > information required to make the reclaim decision is not known to the app. > > > > > > > Instead, it is known to a centralized userspace daemon, and that daemon > > > > > > > must be able to initiate reclaim on its own without any app involvement. > > > > > > > To solve the concern, this patch introduces new syscall - > > > > > > > > > > > > > > struct pr_madvise_param { > > > > > > > int size; > > > > > > > const struct iovec *vec; > > > > > > > } > > > > > > > > > > > > > > int process_madvise(int pidfd, ssize_t nr_elem, int *behavior, > > > > > > > struct pr_madvise_param *restuls, > > > > > > > struct pr_madvise_param *ranges, > > > > > > > unsigned long flags); > > > > > > > > > > > > > > The syscall get pidfd to give hints to external process and provides > > > > > > > pair of result/ranges vector arguments so that it could give several > > > > > > > hints to each address range all at once. > > > > > > > > > > > > > > I guess others have different ideas about the naming of syscall and options > > > > > > > so feel free to suggest better naming. > > > > > > > > > > > > Yes, all new syscalls making use of pidfds should be named > > > > > > pidfd_<action>. So please make this pidfd_madvise. > > > > > > > > > > I don't have any particular preference but just wondering why pidfd is > > > > > so special to have it as prefix of system call name. > > > > > > > > It's a whole new API to address processes. We already have > > > > clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you > > > > exported pidfd_to_pid(). And we're going to have pidfd_open(). Your > > > > syscall works only with pidfds so it's tied to this api as well so it > > > > should follow the naming scheme. This also makes life easier for > > > > userspace and is consistent. > > > > > > This is at least my reasoning. I'm not going to make this a whole big > > > pedantic argument. If people have really strong feelings about not using > > > this prefix then fine. But if syscalls can be grouped together and have > > > consistent naming this is always a big plus. > > > > My hope has been that pidfd use becomes normalized enough that > > prefixing "pidfd_" to pidfd-accepting system calls becomes redundant. > > We write write(), not fd_write(), right? 
:-) pidfd_open() makes sense > > because the primary purpose of this system call is to operate on a > > pidfd, but I think process_madvise() is fine. > > This madvise syscall just operates on pidfds. It would make sense to > name it process_madvise() if were to operate both on pid_t and int pidfd. The name of the function ought to encode its purpose, not its signature. The system call under discussion operates on processes and so should be called "process_madvise". That this system call happens to accept a pidfd to identify the process on which it operates is not the most interesting aspect of the system call. The argument type isn't important enough to spotlight in the permanent name of an API. Pidfds are novel now, but they won't be novel in the future. > Giving specific names to system calls won't stop it from becoming > normalized. We could name system calls with `cat /dev/urandom | xxd` and they'd still get used. It doesn't follow that all names are equally good.
On Wed, May 22, 2019 at 06:16:35AM -0700, Daniel Colascione wrote: > On Wed, May 22, 2019 at 1:22 AM Christian Brauner <christian@brauner.io> wrote: > > > > On Wed, May 22, 2019 at 7:12 AM Daniel Colascione <dancol@google.com> wrote: > > > > > > On Tue, May 21, 2019 at 4:39 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > > > On Tue, May 21, 2019 at 01:30:29PM +0200, Christian Brauner wrote: > > > > > On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote: > > > > > > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote: > > > > > > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote: > > > > > > > > - Background > > > > > > > > > > > > > > > > The Android terminology used for forking a new process and starting an app > > > > > > > > from scratch is a cold start, while resuming an existing app is a hot start. > > > > > > > > While we continually try to improve the performance of cold starts, hot > > > > > > > > starts will always be significantly less power hungry as well as faster so > > > > > > > > we are trying to make hot start more likely than cold start. > > > > > > > > > > > > > > > > To increase hot start, Android userspace manages the order that apps should > > > > > > > > be killed in a process called ActivityManagerService. ActivityManagerService > > > > > > > > tracks every Android app or service that the user could be interacting with > > > > > > > > at any time and translates that into a ranked list for lmkd(low memory > > > > > > > > killer daemon). They are likely to be killed by lmkd if the system has to > > > > > > > > reclaim memory. In that sense they are similar to entries in any other cache. > > > > > > > > Those apps are kept alive for opportunistic performance improvements but > > > > > > > > those performance improvements will vary based on the memory requirements of > > > > > > > > individual workloads. > > > > > > > > > > > > > > > > - Problem > > > > > > > > > > > > > > > > Naturally, cached apps were dominant consumers of memory on the system. > > > > > > > > However, they were not significant consumers of swap even though they are > > > > > > > > good candidate for swap. Under investigation, swapping out only begins > > > > > > > > once the low zone watermark is hit and kswapd wakes up, but the overall > > > > > > > > allocation rate in the system might trip lmkd thresholds and cause a cached > > > > > > > > process to be killed(we measured performance swapping out vs. zapping the > > > > > > > > memory by killing a process. Unsurprisingly, zapping is 10x times faster > > > > > > > > even though we use zram which is much faster than real storage) so kill > > > > > > > > from lmkd will often satisfy the high zone watermark, resulting in very > > > > > > > > few pages actually being moved to swap. > > > > > > > > > > > > > > > > - Approach > > > > > > > > > > > > > > > > The approach we chose was to use a new interface to allow userspace to > > > > > > > > proactively reclaim entire processes by leveraging platform information. > > > > > > > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages > > > > > > > > that are known to be cold from userspace and to avoid races with lmkd > > > > > > > > by reclaiming apps as soon as they entered the cached state. Additionally, > > > > > > > > it could provide many chances for platform to use much information to > > > > > > > > optimize memory efficiency. 
> > > > > > > > > > > > > > > > IMHO we should spell it out that this patchset complements MADV_WONTNEED > > > > > > > > and MADV_FREE by adding non-destructive ways to gain some free memory > > > > > > > > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the > > > > > > > > kernel that memory region is not currently needed and should be reclaimed > > > > > > > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the > > > > > > > > kernel that memory region is not currently needed and should be reclaimed > > > > > > > > when memory pressure rises. > > > > > > > > > > > > > > > > To achieve the goal, the patchset introduce two new options for madvise. > > > > > > > > One is MADV_COOL which will deactive activated pages and the other is > > > > > > > > MADV_COLD which will reclaim private pages instantly. These new options > > > > > > > > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to > > > > > > > > gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way > > > > > > > > that it hints the kernel that memory region is not currently needed and > > > > > > > > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way > > > > > > > > that it hints the kernel that memory region is not currently needed and > > > > > > > > should be reclaimed when memory pressure rises. > > > > > > > > > > > > > > > > This approach is similar in spirit to madvise(MADV_WONTNEED), but the > > > > > > > > information required to make the reclaim decision is not known to the app. > > > > > > > > Instead, it is known to a centralized userspace daemon, and that daemon > > > > > > > > must be able to initiate reclaim on its own without any app involvement. > > > > > > > > To solve the concern, this patch introduces new syscall - > > > > > > > > > > > > > > > > struct pr_madvise_param { > > > > > > > > int size; > > > > > > > > const struct iovec *vec; > > > > > > > > } > > > > > > > > > > > > > > > > int process_madvise(int pidfd, ssize_t nr_elem, int *behavior, > > > > > > > > struct pr_madvise_param *restuls, > > > > > > > > struct pr_madvise_param *ranges, > > > > > > > > unsigned long flags); > > > > > > > > > > > > > > > > The syscall get pidfd to give hints to external process and provides > > > > > > > > pair of result/ranges vector arguments so that it could give several > > > > > > > > hints to each address range all at once. > > > > > > > > > > > > > > > > I guess others have different ideas about the naming of syscall and options > > > > > > > > so feel free to suggest better naming. > > > > > > > > > > > > > > Yes, all new syscalls making use of pidfds should be named > > > > > > > pidfd_<action>. So please make this pidfd_madvise. > > > > > > > > > > > > I don't have any particular preference but just wondering why pidfd is > > > > > > so special to have it as prefix of system call name. > > > > > > > > > > It's a whole new API to address processes. We already have > > > > > clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you > > > > > exported pidfd_to_pid(). And we're going to have pidfd_open(). Your > > > > > syscall works only with pidfds so it's tied to this api as well so it > > > > > should follow the naming scheme. This also makes life easier for > > > > > userspace and is consistent. > > > > > > > > This is at least my reasoning. I'm not going to make this a whole big > > > > pedantic argument. If people have really strong feelings about not using > > > > this prefix then fine. 
But if syscalls can be grouped together and have > > > > consistent naming this is always a big plus. > > > > > > My hope has been that pidfd use becomes normalized enough that > > > prefixing "pidfd_" to pidfd-accepting system calls becomes redundant. > > > We write write(), not fd_write(), right? :-) pidfd_open() makes sense > > > because the primary purpose of this system call is to operate on a > > > pidfd, but I think process_madvise() is fine. > > > > This madvise syscall just operates on pidfds. It would make sense to > > name it process_madvise() if were to operate both on pid_t and int pidfd. > > The name of the function ought to encode its purpose, not its > signature. The system call under discussion operates on processes and > so should be called "process_madvise". That this system call happens > to accept a pidfd to identify the process on which it operates is not > its most interesting aspect of the system call. The argument type > isn't important enough to spotlight in the permanent name of an API. > Pidfds are novel now, but they won't be novel in the future. I'm not going to go into yet another long argument. I prefer pidfd_*. It's tied to the api, transparent for userspace, and disambiguates it from process_vm_{read,write}v that both take a pid_t.
On Wed, May 22, 2019 at 7:52 AM Christian Brauner <christian@brauner.io> wrote: > I'm not going to go into yet another long argument. I prefer pidfd_*. Ok. We're each allowed our opinion. > It's tied to the api, transparent for userspace, and disambiguates it > from process_vm_{read,write}v that both take a pid_t. Speaking of process_vm_readv and process_vm_writev: both have a currently-unused flags argument. Both should grow a flag that tells them to interpret the pid argument as a pidfd. Or do you support adding pidfd_vm_readv and pidfd_vm_writev system calls? If not, why should process_madvise be called pidfd_madvise while process_vm_readv isn't called pidfd_vm_readv?
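For readers who have not used them: the existing process_vm_readv()/process_vm_writev() calls already carry a trailing flags argument that currently must be zero, which is the slot such a pidfd flag could use. A minimal example of today's interface follows; the target pid and remote address are placeholders you would replace with something real.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/uio.h>

int main(void)
{
	pid_t target = 1;			/* placeholder pid */
	char buf[64];
	struct iovec local = { .iov_base = buf, .iov_len = sizeof(buf) };
	struct iovec remote = {
		.iov_base = (void *)0x400000,	/* placeholder remote address */
		.iov_len = sizeof(buf),
	};

	/* The final argument is 'flags'; today it must be 0, and the thread
	 * above is about giving it (and process_madvise) a pidfd mode. */
	ssize_t n = process_vm_readv(target, &local, 1, &remote, 1, 0);

	if (n < 0)
		perror("process_vm_readv");
	else
		printf("read %zd bytes from pid %d\n", n, (int)target);
	return 0;
}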
On Wed, May 22, 2019 at 08:17:23AM -0700, Daniel Colascione wrote: > On Wed, May 22, 2019 at 7:52 AM Christian Brauner <christian@brauner.io> wrote: > > I'm not going to go into yet another long argument. I prefer pidfd_*. > > Ok. We're each allowed our opinion. > > > It's tied to the api, transparent for userspace, and disambiguates it > > from process_vm_{read,write}v that both take a pid_t. > > Speaking of process_vm_readv and process_vm_writev: both have a > currently-unused flags argument. Both should grow a flag that tells > them to interpret the pid argument as a pidfd. Or do you support > adding pidfd_vm_readv and pidfd_vm_writev system calls? If not, why > should process_madvise be called pidfd_madvise while process_vm_readv > isn't called pidfd_vm_readv? Actually, you should then do the same with process_madvise() and give it a flag for that too if that's not too crazy. Christian
On Wed, May 22, 2019 at 8:48 AM Christian Brauner <christian@brauner.io> wrote: > > On Wed, May 22, 2019 at 08:17:23AM -0700, Daniel Colascione wrote: > > On Wed, May 22, 2019 at 7:52 AM Christian Brauner <christian@brauner.io> wrote: > > > I'm not going to go into yet another long argument. I prefer pidfd_*. > > > > Ok. We're each allowed our opinion. > > > > > It's tied to the api, transparent for userspace, and disambiguates it > > > from process_vm_{read,write}v that both take a pid_t. > > > > Speaking of process_vm_readv and process_vm_writev: both have a > > currently-unused flags argument. Both should grow a flag that tells > > them to interpret the pid argument as a pidfd. Or do you support > > adding pidfd_vm_readv and pidfd_vm_writev system calls? If not, why > > should process_madvise be called pidfd_madvise while process_vm_readv > > isn't called pidfd_vm_readv? > > Actually, you should then do the same with process_madvise() and give it > a flag for that too if that's not too crazy. I don't know what you mean. My gut feeling is that for the sake of consistency, process_madvise, process_vm_readv, and process_vm_writev should all accept a first argument interpreted as either a numeric PID or a pidfd depending on a flag --- ideally the same flag. Is that what you have in mind?
On Wed, May 22, 2019 at 08:57:47AM -0700, Daniel Colascione wrote: > On Wed, May 22, 2019 at 8:48 AM Christian Brauner <christian@brauner.io> wrote: > > > > On Wed, May 22, 2019 at 08:17:23AM -0700, Daniel Colascione wrote: > > > On Wed, May 22, 2019 at 7:52 AM Christian Brauner <christian@brauner.io> wrote: > > > > I'm not going to go into yet another long argument. I prefer pidfd_*. > > > > > > Ok. We're each allowed our opinion. > > > > > > > It's tied to the api, transparent for userspace, and disambiguates it > > > > from process_vm_{read,write}v that both take a pid_t. > > > > > > Speaking of process_vm_readv and process_vm_writev: both have a > > > currently-unused flags argument. Both should grow a flag that tells > > > them to interpret the pid argument as a pidfd. Or do you support > > > adding pidfd_vm_readv and pidfd_vm_writev system calls? If not, why > > > should process_madvise be called pidfd_madvise while process_vm_readv > > > isn't called pidfd_vm_readv? > > > > Actually, you should then do the same with process_madvise() and give it > > a flag for that too if that's not too crazy. > > I don't know what you mean. My gut feeling is that for the sake of > consistency, process_madvise, process_vm_readv, and process_vm_writev > should all accept a first argument interpreted as either a numeric PID > or a pidfd depending on a flag --- ideally the same flag. Is that what > you have in mind? Yes. For the sake of consistency they should probably all default to interpret as pid and if say PROCESS_{VM_}PIDFD is passed as flag interpret as pidfd.
On Wed, May 22, 2019 at 9:01 AM Christian Brauner <christian@brauner.io> wrote: > > On Wed, May 22, 2019 at 08:57:47AM -0700, Daniel Colascione wrote: > > On Wed, May 22, 2019 at 8:48 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > On Wed, May 22, 2019 at 08:17:23AM -0700, Daniel Colascione wrote: > > > > On Wed, May 22, 2019 at 7:52 AM Christian Brauner <christian@brauner.io> wrote: > > > > > I'm not going to go into yet another long argument. I prefer pidfd_*. > > > > > > > > Ok. We're each allowed our opinion. > > > > > > > > > It's tied to the api, transparent for userspace, and disambiguates it > > > > > from process_vm_{read,write}v that both take a pid_t. > > > > > > > > Speaking of process_vm_readv and process_vm_writev: both have a > > > > currently-unused flags argument. Both should grow a flag that tells > > > > them to interpret the pid argument as a pidfd. Or do you support > > > > adding pidfd_vm_readv and pidfd_vm_writev system calls? If not, why > > > > should process_madvise be called pidfd_madvise while process_vm_readv > > > > isn't called pidfd_vm_readv? > > > > > > Actually, you should then do the same with process_madvise() and give it > > > a flag for that too if that's not too crazy. > > > > I don't know what you mean. My gut feeling is that for the sake of > > consistency, process_madvise, process_vm_readv, and process_vm_writev > > should all accept a first argument interpreted as either a numeric PID > > or a pidfd depending on a flag --- ideally the same flag. Is that what > > you have in mind? > > Yes. For the sake of consistency they should probably all default to > interpret as pid and if say PROCESS_{VM_}PIDFD is passed as flag > interpret as pidfd. Sounds good to me!
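Sketching what the convention agreed on here would look like from userspace; both PROCESS_PIDFD and the syscall number below are hypothetical names standing in for this thread's proposal, and nothing here exists in mainline.

#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical: flag and syscall number exist only in this discussion. */
#define PROCESS_PIDFD			(1UL << 0)
#define __NR_process_madvise_rfc	-1

/* With the flag clear, the first argument is a classic numeric pid ... */
static long advise_pid(pid_t pid, int advice)
{
	return syscall(__NR_process_madvise_rfc, (long)pid, (ssize_t)0,
		       &advice, NULL, NULL, 0UL);
}

/* ... and with PROCESS_PIDFD set, the very same argument is a pidfd. */
static long advise_pidfd(int pidfd, int advice)
{
	return syscall(__NR_process_madvise_rfc, (long)pidfd, (ssize_t)0,
		       &advice, NULL, NULL, PROCESS_PIDFD);
}

int main(void)
{
	/* Placeholder numbers, so both calls just fail; the point is that one
	 * entry point serves pid and pidfd callers, selected by the flag. */
	printf("pid form:   %ld\n", advise_pid(1, 0));
	printf("pidfd form: %ld\n", advise_pidfd(-1, 0));
	return 0;
}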
On Wed, May 22, 2019 at 09:01:33AM -0700, Daniel Colascione wrote: > On Wed, May 22, 2019 at 9:01 AM Christian Brauner <christian@brauner.io> wrote: > > > > On Wed, May 22, 2019 at 08:57:47AM -0700, Daniel Colascione wrote: > > > On Wed, May 22, 2019 at 8:48 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > > > On Wed, May 22, 2019 at 08:17:23AM -0700, Daniel Colascione wrote: > > > > > On Wed, May 22, 2019 at 7:52 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > I'm not going to go into yet another long argument. I prefer pidfd_*. > > > > > > > > > > Ok. We're each allowed our opinion. > > > > > > > > > > > It's tied to the api, transparent for userspace, and disambiguates it > > > > > > from process_vm_{read,write}v that both take a pid_t. > > > > > > > > > > Speaking of process_vm_readv and process_vm_writev: both have a > > > > > currently-unused flags argument. Both should grow a flag that tells > > > > > them to interpret the pid argument as a pidfd. Or do you support > > > > > adding pidfd_vm_readv and pidfd_vm_writev system calls? If not, why > > > > > should process_madvise be called pidfd_madvise while process_vm_readv > > > > > isn't called pidfd_vm_readv? > > > > > > > > Actually, you should then do the same with process_madvise() and give it > > > > a flag for that too if that's not too crazy. > > > > > > I don't know what you mean. My gut feeling is that for the sake of > > > consistency, process_madvise, process_vm_readv, and process_vm_writev > > > should all accept a first argument interpreted as either a numeric PID > > > or a pidfd depending on a flag --- ideally the same flag. Is that what > > > you have in mind? > > > > Yes. For the sake of consistency they should probably all default to > > interpret as pid and if say PROCESS_{VM_}PIDFD is passed as flag > > interpret as pidfd. > > Sounds good to me! Then, I want to change from pidfd to pid in the next revision and stick with process_madvise as the name. Later, you guys could define a PROCESS_PIDFD flag and convert all of the process_xxx syscalls at once. If you are faster and PROCESS_PIDFD lands earlier, I am happy to use it. Thanks.
On Thu, May 23, 2019 at 10:07:17PM +0900, Minchan Kim wrote: > On Wed, May 22, 2019 at 09:01:33AM -0700, Daniel Colascione wrote: > > On Wed, May 22, 2019 at 9:01 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > On Wed, May 22, 2019 at 08:57:47AM -0700, Daniel Colascione wrote: > > > > On Wed, May 22, 2019 at 8:48 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > > > > > On Wed, May 22, 2019 at 08:17:23AM -0700, Daniel Colascione wrote: > > > > > > On Wed, May 22, 2019 at 7:52 AM Christian Brauner <christian@brauner.io> wrote: > > > > > > > I'm not going to go into yet another long argument. I prefer pidfd_*. > > > > > > > > > > > > Ok. We're each allowed our opinion. > > > > > > > > > > > > > It's tied to the api, transparent for userspace, and disambiguates it > > > > > > > from process_vm_{read,write}v that both take a pid_t. > > > > > > > > > > > > Speaking of process_vm_readv and process_vm_writev: both have a > > > > > > currently-unused flags argument. Both should grow a flag that tells > > > > > > them to interpret the pid argument as a pidfd. Or do you support > > > > > > adding pidfd_vm_readv and pidfd_vm_writev system calls? If not, why > > > > > > should process_madvise be called pidfd_madvise while process_vm_readv > > > > > > isn't called pidfd_vm_readv? > > > > > > > > > > Actually, you should then do the same with process_madvise() and give it > > > > > a flag for that too if that's not too crazy. > > > > > > > > I don't know what you mean. My gut feeling is that for the sake of > > > > consistency, process_madvise, process_vm_readv, and process_vm_writev > > > > should all accept a first argument interpreted as either a numeric PID > > > > or a pidfd depending on a flag --- ideally the same flag. Is that what > > > > you have in mind? > > > > > > Yes. For the sake of consistency they should probably all default to > > > interpret as pid and if say PROCESS_{VM_}PIDFD is passed as flag > > > interpret as pidfd. > > > > Sounds good to me! > > Then, I want to change from pidfd to pid at next revsion and stick to > process_madvise as naming. Later, you guys could define PROCESS_PIDFD > flag and change all at once every process_xxx syscall friends. > > If you are faster so that I see PROCESS_PIDFD earlier, I am happy to > use it. Hi Folks, I don't want to consume a new API argument too early, so I will use process_madvise with a pidfd argument, because I agree with Daniel that we don't need to expose the implementation in the syscall name. I hope every upcoming process syscall takes a pidfd by default, so that people slowly become familiar with pidfds and, in the long run, pid is naturally replaced by pidfd.
On 05/21/2019 04:04 PM, Michal Hocko wrote: > On Tue 21-05-19 08:25:55, Anshuman Khandual wrote: >> On 05/20/2019 10:29 PM, Tim Murray wrote: > [...] >>> not seem to introduce a noticeable hot start penalty, not does it >>> cause an increase in performance problems later in the app's >>> lifecycle. I've measured with and without process_madvise, and the >>> differences are within our noise bounds. Second, because we're not >> >> That is assuming that post process_madvise() working set for the application is >> always smaller. There is another challenge. The external process should ideally >> have the knowledge of active areas of the working set for an application in >> question for it to invoke process_madvise() correctly to prevent such scenarios. > > But that doesn't really seem relevant for the API itself, right? The > higher level logic the monitor's business. Right. I was just wondering how the monitor would even decide which areas of the target application are active or inactive. The target application is still just an opaque entity for the monitor unless there is some sort of communication. But you are right, this is not relevant to the API itself.
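To put the "opaque entity" point in concrete terms: without extra coordination, about the only cheap per-range view a monitor has of a target is its VMA layout, for example from /proc/<pid>/maps (actual hot/cold information would need something like the idle page tracking interface, or cooperation from the app itself). The sketch below only enumerates candidate ranges; deciding which of them are genuinely inactive is exactly the open question raised here.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	char path[64], line[512];
	unsigned long start, end;
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s/maps", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	/* Each line is "start-end perms offset dev inode [path]"; the ranges
	 * are what a monitor could feed to the proposed process_madvise(). */
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%lx-%lx", &start, &end) == 2)
			printf("candidate range: %#lx-%#lx (%lu KB)\n",
			       start, end, (end - start) / 1024);
	}
	fclose(f);
	return 0;
}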