[RFC,0/7] introduce memory hinting API for external process

Message ID 20190520035254.57579-1-minchan@kernel.org (mailing list archive)

Minchan Kim May 20, 2019, 3:52 a.m. UTC
- Background

The Android terminology for forking a new process and starting an app
from scratch is a cold start, while resuming an existing app is a hot start.
While we continually try to improve the performance of cold starts, hot
starts will always be significantly faster and less power hungry, so we
try to make hot starts more likely than cold starts.

To increase the likelihood of hot starts, Android userspace manages the
order in which apps should be killed in a process called
ActivityManagerService. ActivityManagerService tracks every Android app or
service that the user could be interacting with at any time and translates
that into a ranked list for lmkd (the low memory killer daemon). Apps on
that list are likely to be killed by lmkd if the system has to reclaim
memory; in that sense they are similar to entries in any other cache.
These apps are kept alive for opportunistic performance improvements, but
those improvements will vary based on the memory requirements of
individual workloads.

- Problem

Naturally, cached apps are the dominant consumers of memory on the system.
However, they are not significant consumers of swap even though they are
good candidates for swap. Upon investigation, we found that swapping out
only begins once the low zone watermark is hit and kswapd wakes up, but
the overall allocation rate in the system might trip lmkd thresholds first
and cause a cached process to be killed. (We measured the performance of
swapping out vs. zapping the memory by killing a process; unsurprisingly,
zapping is 10x faster even though we use zram, which is much faster than
real storage.) Kills from lmkd will therefore often satisfy the high zone
watermark, resulting in very few pages actually being moved to swap.

- Approach

The approach we chose was a new interface that allows userspace to
proactively reclaim entire processes by leveraging platform information.
This lets us bypass the inaccuracy of the kernel's LRUs for pages that are
known from userspace to be cold, and avoid races with lmkd by reclaiming
apps as soon as they enter the cached state. Additionally, it gives the
platform many more opportunities to use its knowledge of app state to
optimize memory efficiency.

To achieve this, the patchset introduces two new options for madvise.
One is MADV_COOL, which deactivates active pages, and the other is
MADV_COLD, which reclaims private pages instantly. These new options
complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
gain some free memory: MADV_COLD is similar to MADV_DONTNEED in that it
hints to the kernel that the memory region is not currently needed and
should be reclaimed immediately, while MADV_COOL is similar to MADV_FREE
in that it hints that the region is not currently needed and can be
reclaimed when memory pressure rises.

This approach is similar in spirit to madvise(MADV_DONTNEED), but the
information required to make the reclaim decision is not known to the app.
Instead, it is known to a centralized userspace daemon, and that daemon
must be able to initiate reclaim on its own without any app involvement.
To address this, the patchset introduces a new syscall:

	struct pr_madvise_param {
		int size;
		const struct iovec *vec;
	};

	int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
				struct pr_madvise_param *results,
				struct pr_madvise_param *ranges,
				unsigned long flags);

The syscall takes a pidfd to target an external process and a pair of
results/ranges vector arguments so that it can deliver several hints, one
per address range, all at once.

I guess others will have different ideas about the naming of the syscall
and the options, so feel free to suggest better names.

- Experiment

We did a bunch of testing with several hundred real users on Android, not
just artificial benchmarks. We saw about a 17% reduction in cold starts
without any significant battery or app startup latency regressions. With
an artificial benchmark that launches and switches apps, we saw on average
a 7% improvement in app launch time, 18% fewer lmkd kills, and good
numbers from vmstat.

A is vanilla and B is process_madvise.


                                       A          B      delta   ratio(%)
               allocstall_dma          0          0          0       0.00
           allocstall_movable       1464        457      -1007     -69.00
            allocstall_normal     263210     190763     -72447     -28.00
             allocstall_total     264674     191220     -73454     -28.00
          compact_daemon_wake      26912      25294      -1618      -7.00
                 compact_fail      17885      14151      -3734     -21.00
         compact_free_scanned 4204766409 3835994922 -368771487      -9.00
             compact_isolated    3446484    2967618    -478866     -14.00
      compact_migrate_scanned 1621336411 1324695710 -296640701     -19.00
                compact_stall      19387      15343      -4044     -21.00
              compact_success       1502       1192       -310     -21.00
kswapd_high_wmark_hit_quickly        234        184        -50     -22.00
            kswapd_inodesteal     221635     233093      11458       5.00
 kswapd_low_wmark_hit_quickly      66065      54009     -12056     -19.00
                   nr_dirtied     259934     296476      36542      14.00
  nr_vmscan_immediate_reclaim       2587       2356       -231      -9.00
              nr_vmscan_write    1274232    2661733    1387501     108.00
                   nr_written    1514060    2937560    1423500      94.00
                   pageoutrun      67561      55133     -12428     -19.00
                   pgactivate    2335060    1984882    -350178     -15.00
                  pgalloc_dma   13743011   14096463     353452       2.00
              pgalloc_movable          0          0          0       0.00
               pgalloc_normal   18742440   16802065   -1940375     -11.00
                pgalloc_total   32485451   30898528   -1586923      -5.00
                 pgdeactivate    4262210    2930670   -1331540     -32.00
                      pgfault   30812334   31085065     272731       0.00
                       pgfree   33553970   31765164   -1788806      -6.00
                 pginodesteal      33411      15084     -18327     -55.00
                  pglazyfreed          0          0          0       0.00
                   pgmajfault     551312    1508299     956987     173.00
               pgmigrate_fail      43927      29330     -14597     -34.00
            pgmigrate_success    1399851    1203922    -195929     -14.00
                       pgpgin   24141776   19032156   -5109620     -22.00
                      pgpgout     959344    1103316     143972      15.00
                 pgpgoutclean    4639732    3765868    -873864     -19.00
                     pgrefill    4884560    3006938   -1877622     -39.00
                    pgrotated      37828      25897     -11931     -32.00
                pgscan_direct    1456037     957567    -498470     -35.00
       pgscan_direct_throttle          0          0          0       0.00
                pgscan_kswapd    6667767    5047360   -1620407     -25.00
                 pgscan_total    8123804    6004927   -2118877     -27.00
                   pgskip_dma          0          0          0       0.00
               pgskip_movable          0          0          0       0.00
                pgskip_normal      14907      25382      10475      70.00
                 pgskip_total      14907      25382      10475      70.00
               pgsteal_direct    1118986     690215    -428771     -39.00
               pgsteal_kswapd    4750223    3657107   -1093116     -24.00
                pgsteal_total    5869209    4347322   -1521887     -26.00
                       pswpin     417613    1392647     975034     233.00
                      pswpout    1274224    2661731    1387507     108.00
                slabs_scanned   13686905   10807200   -2879705     -22.00
          workingset_activate     668966     569444     -99522     -15.00
       workingset_nodereclaim      38957      32621      -6336     -17.00
           workingset_refault    2816795    2179782    -637013     -23.00
           workingset_restore     294320     168601    -125719     -43.00

pgmajfault increases by 173% because swap-ins increase by more than 200%
due to the process_madvise hints. However, a swap read backed by zram is
much cheaper than file IO from a performance point of view, and an app hot
start via swap-in is also cheaper than a cold start from scratch, which
requires many IOs from storage plus app initialization steps.

This patchset is against next-20190517.

Minchan Kim (7):
  mm: introduce MADV_COOL
  mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
  mm: introduce MADV_COLD
  mm: factor out madvise's core functionality
  mm: introduce external memory hinting API
  mm: extend process_madvise syscall to support vector array
  mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER

 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/page-flags.h             |   1 +
 include/linux/page_idle.h              |  15 +
 include/linux/proc_fs.h                |   1 +
 include/linux/swap.h                   |   2 +
 include/linux/syscalls.h               |   2 +
 include/uapi/asm-generic/mman-common.h |  12 +
 include/uapi/asm-generic/unistd.h      |   2 +
 kernel/signal.c                        |   2 +-
 kernel/sys_ni.c                        |   1 +
 mm/madvise.c                           | 600 +++++++++++++++++++++----
 mm/swap.c                              |  43 ++
 mm/vmscan.c                            |  80 +++-
 14 files changed, 680 insertions(+), 83 deletions(-)

Comments

Anshuman Khandual May 20, 2019, 6:37 a.m. UTC | #1
On 05/20/2019 09:22 AM, Minchan Kim wrote:
> - Problem
> 
> Naturally, cached apps were dominant consumers of memory on the system.
> However, they were not significant consumers of swap even though they are
> good candidate for swap. Under investigation, swapping out only begins
> once the low zone watermark is hit and kswapd wakes up, but the overall
> allocation rate in the system might trip lmkd thresholds and cause a cached
> process to be killed(we measured performance swapping out vs. zapping the
> memory by killing a process. Unsurprisingly, zapping is 10x times faster
> even though we use zram which is much faster than real storage) so kill
> from lmkd will often satisfy the high zone watermark, resulting in very
> few pages actually being moved to swap.

Getting killed by lmkd, which is triggered by custom system memory
allocation parameters, and hence not being able to swap out is a problem?
But is that problem not created by lmkd itself?

Or is the objective here to reduce the number of processes that get killed
by lmkd, by triggering swapping of the unused (user-hinted) memory sooner
so that they don't get picked by lmkd? Is under-utilization of the zram
hardware a concern here as well?

Won't swapping memory out to zram increase the latency of a hot start? Or
is it that it prevents a fresh cold start, which will anyway be slower
than a slow hot start? Just being curious.
Michal Hocko May 20, 2019, 9:28 a.m. UTC | #2
[Cc linux-api]

On Mon 20-05-19 12:52:47, Minchan Kim wrote:
> [...]
Oleksandr Natalenko May 20, 2019, 2:42 p.m. UTC | #3
Hi.

On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> - Background
> 
> The Android terminology used for forking a new process and starting an app
> from scratch is a cold start, while resuming an existing app is a hot start.
> While we continually try to improve the performance of cold starts, hot
> starts will always be significantly less power hungry as well as faster so
> we are trying to make hot start more likely than cold start.
> 
> To increase hot start, Android userspace manages the order that apps should
> be killed in a process called ActivityManagerService. ActivityManagerService
> tracks every Android app or service that the user could be interacting with
> at any time and translates that into a ranked list for lmkd(low memory
> killer daemon). They are likely to be killed by lmkd if the system has to
> reclaim memory. In that sense they are similar to entries in any other cache.
> Those apps are kept alive for opportunistic performance improvements but
> those performance improvements will vary based on the memory requirements of
> individual workloads.
> 
> - Problem
> 
> Naturally, cached apps were dominant consumers of memory on the system.
> However, they were not significant consumers of swap even though they are
> good candidate for swap. Under investigation, swapping out only begins
> once the low zone watermark is hit and kswapd wakes up, but the overall
> allocation rate in the system might trip lmkd thresholds and cause a cached
> process to be killed(we measured performance swapping out vs. zapping the
> memory by killing a process. Unsurprisingly, zapping is 10x times faster
> even though we use zram which is much faster than real storage) so kill
> from lmkd will often satisfy the high zone watermark, resulting in very
> few pages actually being moved to swap.
> 
> - Approach
> 
> The approach we chose was to use a new interface to allow userspace to
> proactively reclaim entire processes by leveraging platform information.
> This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> that are known to be cold from userspace and to avoid races with lmkd
> by reclaiming apps as soon as they entered the cached state. Additionally,
> it could provide many chances for platform to use much information to
> optimize memory efficiency.
> 
> IMHO we should spell it out that this patchset complements MADV_WONTNEED
> and MADV_FREE by adding non-destructive ways to gain some free memory
> space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> when memory pressure rises.
> 
> To achieve the goal, the patchset introduce two new options for madvise.
> One is MADV_COOL which will deactive activated pages and the other is
> MADV_COLD which will reclaim private pages instantly. These new options
> complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way
> that it hints the kernel that memory region is not currently needed and
> should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way
> that it hints the kernel that memory region is not currently needed and
> should be reclaimed when memory pressure rises.
> 
> This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> information required to make the reclaim decision is not known to the app.
> Instead, it is known to a centralized userspace daemon, and that daemon
> must be able to initiate reclaim on its own without any app involvement.
> To solve the concern, this patch introduces new syscall -
> 
> 	struct pr_madvise_param {
> 		int size;
> 		const struct iovec *vec;
> 	}
> 
> 	int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
> 				struct pr_madvise_param *restuls,
> 				struct pr_madvise_param *ranges,
> 				unsigned long flags);
> 
> The syscall get pidfd to give hints to external process and provides
> pair of result/ranges vector arguments so that it could give several
> hints to each address range all at once.
> 
> I guess others have different ideas about the naming of syscall and options
> so feel free to suggest better naming.
> 
> - Experiment
> 
> We did bunch of testing with several hundreds of real users, not artificial
> benchmark on android. We saw about 17% cold start decreasement without any
> significant battery/app startup latency issues. And with artificial benchmark
> which launches and switching apps, we saw average 7% app launching improvement,
> 18% less lmkd kill and good stat from vmstat.
> 
> A is vanilla and B is process_madvise.
> 
> 
>                                        A          B      delta   ratio(%)
>                allocstall_dma          0          0          0       0.00
>            allocstall_movable       1464        457      -1007     -69.00
>             allocstall_normal     263210     190763     -72447     -28.00
>              allocstall_total     264674     191220     -73454     -28.00
>           compact_daemon_wake      26912      25294      -1618      -7.00
>                  compact_fail      17885      14151      -3734     -21.00
>          compact_free_scanned 4204766409 3835994922 -368771487      -9.00
>              compact_isolated    3446484    2967618    -478866     -14.00
>       compact_migrate_scanned 1621336411 1324695710 -296640701     -19.00
>                 compact_stall      19387      15343      -4044     -21.00
>               compact_success       1502       1192       -310     -21.00
> kswapd_high_wmark_hit_quickly        234        184        -50     -22.00
>             kswapd_inodesteal     221635     233093      11458       5.00
>  kswapd_low_wmark_hit_quickly      66065      54009     -12056     -19.00
>                    nr_dirtied     259934     296476      36542      14.00
>   nr_vmscan_immediate_reclaim       2587       2356       -231      -9.00
>               nr_vmscan_write    1274232    2661733    1387501     108.00
>                    nr_written    1514060    2937560    1423500      94.00
>                    pageoutrun      67561      55133     -12428     -19.00
>                    pgactivate    2335060    1984882    -350178     -15.00
>                   pgalloc_dma   13743011   14096463     353452       2.00
>               pgalloc_movable          0          0          0       0.00
>                pgalloc_normal   18742440   16802065   -1940375     -11.00
>                 pgalloc_total   32485451   30898528   -1586923      -5.00
>                  pgdeactivate    4262210    2930670   -1331540     -32.00
>                       pgfault   30812334   31085065     272731       0.00
>                        pgfree   33553970   31765164   -1788806      -6.00
>                  pginodesteal      33411      15084     -18327     -55.00
>                   pglazyfreed          0          0          0       0.00
>                    pgmajfault     551312    1508299     956987     173.00
>                pgmigrate_fail      43927      29330     -14597     -34.00
>             pgmigrate_success    1399851    1203922    -195929     -14.00
>                        pgpgin   24141776   19032156   -5109620     -22.00
>                       pgpgout     959344    1103316     143972      15.00
>                  pgpgoutclean    4639732    3765868    -873864     -19.00
>                      pgrefill    4884560    3006938   -1877622     -39.00
>                     pgrotated      37828      25897     -11931     -32.00
>                 pgscan_direct    1456037     957567    -498470     -35.00
>        pgscan_direct_throttle          0          0          0       0.00
>                 pgscan_kswapd    6667767    5047360   -1620407     -25.00
>                  pgscan_total    8123804    6004927   -2118877     -27.00
>                    pgskip_dma          0          0          0       0.00
>                pgskip_movable          0          0          0       0.00
>                 pgskip_normal      14907      25382      10475      70.00
>                  pgskip_total      14907      25382      10475      70.00
>                pgsteal_direct    1118986     690215    -428771     -39.00
>                pgsteal_kswapd    4750223    3657107   -1093116     -24.00
>                 pgsteal_total    5869209    4347322   -1521887     -26.00
>                        pswpin     417613    1392647     975034     233.00
>                       pswpout    1274224    2661731    1387507     108.00
>                 slabs_scanned   13686905   10807200   -2879705     -22.00
>           workingset_activate     668966     569444     -99522     -15.00
>        workingset_nodereclaim      38957      32621      -6336     -17.00
>            workingset_refault    2816795    2179782    -637013     -23.00
>            workingset_restore     294320     168601    -125719     -43.00
> 
> pgmajfault increases by 173% because the process_madvise hint increases
> swap-ins by roughly 200%. However, a swap read backed by zram is much
> cheaper than file IO from a performance point of view, and an app hot
> start via swap-in is also cheaper than a cold start from scratch, which
> needs many IOs from storage plus initialization steps.
> 
> This patchset is based on next-20190517.
> 
> Minchan Kim (7):
>   mm: introduce MADV_COOL
>   mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
>   mm: introduce MADV_COLD
>   mm: factor out madvise's core functionality
>   mm: introduce external memory hinting API
>   mm: extend process_madvise syscall to support vector array
>   mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
> 
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/page-flags.h             |   1 +
>  include/linux/page_idle.h              |  15 +
>  include/linux/proc_fs.h                |   1 +
>  include/linux/swap.h                   |   2 +
>  include/linux/syscalls.h               |   2 +
>  include/uapi/asm-generic/mman-common.h |  12 +
>  include/uapi/asm-generic/unistd.h      |   2 +
>  kernel/signal.c                        |   2 +-
>  kernel/sys_ni.c                        |   1 +
>  mm/madvise.c                           | 600 +++++++++++++++++++++----
>  mm/swap.c                              |  43 ++
>  mm/vmscan.c                            |  80 +++-
>  14 files changed, 680 insertions(+), 83 deletions(-)
> 
> -- 
> 2.21.0.1020.gf2820cf01a-goog
> 

Please Cc me for the next iteration since I was working on the very same
thing recently [1].

Thank you.

[1] https://gitlab.com/post-factum/pf-kernel/commits/remote-madvise-v3
Johannes Weiner May 20, 2019, 4:46 p.m. UTC | #4
On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> - Approach
> 
> The approach we chose was to use a new interface to allow userspace to
> proactively reclaim entire processes by leveraging platform information.
> This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> that are known to be cold from userspace and to avoid races with lmkd
> by reclaiming apps as soon as they entered the cached state. Additionally,
> it could give the platform many more chances to use its information to
> optimize memory efficiency.
> 
> IMHO we should spell it out that this patchset complements MADV_DONTNEED
> and MADV_FREE by adding non-destructive ways to gain some free memory
> space. MADV_COLD is similar to MADV_DONTNEED in that it hints to the
> kernel that the memory region is not currently needed and should be
> reclaimed immediately; MADV_COOL is similar to MADV_FREE in that it
> hints that the region is not currently needed and should be reclaimed
> when memory pressure rises.

I agree with this approach and the semantics. But these names are very
vague and extremely easy to confuse since they're so similar.

MADV_COLD could be a good name, but for deactivating pages, not
reclaiming them - marking memory "cold" on the LRU for later reclaim.

For the immediate reclaim one, I think there is a better option too:
In virtual memory speak, putting a page into secondary storage (or
ensuring it's already there), and then freeing its in-memory copy, is
called "paging out". And that's what this flag is supposed to do. So
how about MADV_PAGEOUT?

With that, we'd have:

MADV_FREE: Mark data invalid, free memory when needed
MADV_DONTNEED: Mark data invalid, free memory immediately

MADV_COLD: Data is not used for a while, free memory when needed
MADV_PAGEOUT: Data is not used for a while, free memory immediately

What do you think?
Tim Murray May 20, 2019, 4:59 p.m. UTC | #5
On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
> Or is the objective here to reduce the number of processes which get killed by
> lmkd, by triggering swapping of the unused (user-hinted) memory sooner so that
> they don't get picked by lmkd? Is under-utilization of the zram hardware a
> concern here as well?

The objective is to avoid some instances of memory pressure by
proactively swapping pages that userspace knows to be cold before
those pages reach the end of the LRUs, which in turn can prevent some
apps from being killed by lmk/lmkd. As soon as Android userspace knows
that an application is not being used and is only resident to improve
performance if the user returns to that app, we can kick off
process_madvise on that process's pages (or some portion of those
pages) in a power-efficient way to reduce memory pressure long before
the system hits the free page watermark. This allows the system more
time to put pages into zram versus waiting for the watermark to
trigger kswapd, which decreases the likelihood that later memory
allocations will cause enough pressure to trigger a kill of one of
these apps.

> Won't swapping memory out to zram increase the latency of a hot start? Or
> is it that it prevents a fresh cold start, which will anyway be slower than
> a slow hot start? Just being curious.

First, not all swapped pages will be reloaded immediately once an app
is resumed. We've found that an app's working set post-process_madvise
is significantly smaller than what an app allocates when it first
launches (see the delta between pswpin and pswpout in Minchan's
results). Presumably because of this, faulting to fetch from zram does
not seem to introduce a noticeable hot start penalty, nor does it
cause an increase in performance problems later in the app's
lifecycle. I've measured with and without process_madvise, and the
differences are within our noise bounds. Second, because we're not
preemptively evicting file pages and only making them more likely to
be evicted when there's already memory pressure, we avoid the case
where we process_madvise an app then immediately return to the app and
reload all file pages in the working set even though there was no
intervening memory pressure. Our initial version of this work evicted
file pages preemptively and did cause a noticeable slowdown (~15%) for
that case; this patch set avoids that slowdown. Finally, the benefit
from avoiding cold starts is huge. The performance improvement from
having a hot start instead of a cold start ranges from 3x for very
small apps to 50x+ for larger apps like high-fidelity games.
Matthew Wilcox May 21, 2019, 1:44 a.m. UTC | #6
On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > IMHO we should spell it out that this patchset complements MADV_DONTNEED
> and MADV_FREE by adding non-destructive ways to gain some free memory
> > space. MADV_COLD is similar to MADV_DONTNEED in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> when memory pressure rises.

Do we tear down page tables for these ranges?  That seems like a good
way of reclaiming potentially a substantial amount of memory.
Anshuman Khandual May 21, 2019, 2:55 a.m. UTC | #7
On 05/20/2019 10:29 PM, Tim Murray wrote:
> On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
>>
>> Or is the objective here to reduce the number of processes which get killed by
>> lmkd, by triggering swapping of the unused (user-hinted) memory sooner so that
>> they don't get picked by lmkd? Is under-utilization of the zram hardware a
>> concern here as well?
> 
> The objective is to avoid some instances of memory pressure by
> proactively swapping pages that userspace knows to be cold before
> those pages reach the end of the LRUs, which in turn can prevent some
> apps from being killed by lmk/lmkd. As soon as Android userspace knows
> that an application is not being used and is only resident to improve
> performance if the user returns to that app, we can kick off
> process_madvise on that process's pages (or some portion of those
> pages) in a power-efficient way to reduce memory pressure long before
> the system hits the free page watermark. This allows the system more
> time to put pages into zram versus waiting for the watermark to
> trigger kswapd, which decreases the likelihood that later memory
> allocations will cause enough pressure to trigger a kill of one of
> these apps.

So this opens up a bit of LRU management to userspace hints. Also, because the
app itself won't know about the memory situation of the entire system, the new
system call needs to be invoked from an external process.

> 
>> Won't swapping memory out to zram increase the latency of a hot start? Or
>> is it that it prevents a fresh cold start, which will anyway be slower than
>> a slow hot start? Just being curious.
> 
> First, not all swapped pages will be reloaded immediately once an app
> is resumed. We've found that an app's working set post-process_madvise
> is significantly smaller than what an app allocates when it first
> launches (see the delta between pswpin and pswpout in Minchan's
> results). Presumably because of this, faulting to fetch from zram does

pswpin      417613    1392647     975034     233.00
pswpout    1274224    2661731    1387507     108.00

IIUC the swap-in growth ratio is much higher than that of swap-out. Is that
always the case? Or does it tend to swap out from an active area of the working
set, which then faults back in again?

> not seem to introduce a noticeable hot start penalty, nor does it
> cause an increase in performance problems later in the app's
> lifecycle. I've measured with and without process_madvise, and the
> differences are within our noise bounds. Second, because we're not

That assumes the application's post-process_madvise() working set is always
smaller. There is another challenge: the external process should ideally know
the active areas of the working set of the application in question, in order to
invoke process_madvise() correctly and prevent such scenarios.

> preemptively evicting file pages and only making them more likely to
> be evicted when there's already memory pressure, we avoid the case
> where we process_madvise an app then immediately return to the app and
> reload all file pages in the working set even though there was no
> intervening memory pressure. Our initial version of this work evicted

That would be the worst-case scenario, which should be avoided. Memory pressure
must be a parameter before actually doing the swap-out, but pages known to be
inactive/cold can be marked high priority for swap-out.

> file pages preemptively and did cause a noticeable slowdown (~15%) for
> that case; this patch set avoids that slowdown. Finally, the benefit
> from avoiding cold starts is huge. The performance improvement from
> having a hot start instead of a cold start ranges from 3x for very
> small apps to 50x+ for larger apps like high-fidelity games.

Is there any other real-world scenario, apart from this app-based ecosystem,
where user-hinted LRU management might be helpful? Just being curious. Thanks
for the detailed explanation. I will continue looking into this series.
Minchan Kim May 21, 2019, 2:56 a.m. UTC | #8
On Mon, May 20, 2019 at 04:42:00PM +0200, Oleksandr Natalenko wrote:
> Hi.
> 
> On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > - Background
> > 
> > The Android terminology used for forking a new process and starting an app
> > from scratch is a cold start, while resuming an existing app is a hot start.
> > While we continually try to improve the performance of cold starts, hot
> > starts will always be significantly less power hungry as well as faster so
> > we are trying to make hot start more likely than cold start.
> > 
> > To increase hot start, Android userspace manages the order that apps should
> > be killed in a process called ActivityManagerService. ActivityManagerService
> > tracks every Android app or service that the user could be interacting with
> > at any time and translates that into a ranked list for lmkd(low memory
> > killer daemon). They are likely to be killed by lmkd if the system has to
> > reclaim memory. In that sense they are similar to entries in any other cache.
> > Those apps are kept alive for opportunistic performance improvements but
> > those performance improvements will vary based on the memory requirements of
> > individual workloads.
> > 
> > - Problem
> > 
> > Naturally, cached apps were dominant consumers of memory on the system.
> > However, they were not significant consumers of swap even though they are
> > good candidate for swap. Under investigation, swapping out only begins
> > once the low zone watermark is hit and kswapd wakes up, but the overall
> > allocation rate in the system might trip lmkd thresholds and cause a cached
> > process to be killed(we measured performance swapping out vs. zapping the
> > memory by killing a process. Unsurprisingly, zapping is 10x times faster
> > even though we use zram which is much faster than real storage) so kill
> > from lmkd will often satisfy the high zone watermark, resulting in very
> > few pages actually being moved to swap.
> > 
> > - Approach
> > 
> > The approach we chose was to use a new interface to allow userspace to
> > proactively reclaim entire processes by leveraging platform information.
> > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> > that are known to be cold from userspace and to avoid races with lmkd
> > by reclaiming apps as soon as they entered the cached state. Additionally,
> > it could provide many chances for platform to use much information to
> > optimize memory efficiency.
> > 
> > IMHO we should spell it out that this patchset complements MADV_DONTNEED
> > and MADV_FREE by adding non-destructive ways to gain some free memory
> > space. MADV_COLD is similar to MADV_DONTNEED in a way that it hints the
> > kernel that memory region is not currently needed and should be reclaimed
> > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> > kernel that memory region is not currently needed and should be reclaimed
> > when memory pressure rises.
> > 
> > To achieve the goal, the patchset introduces two new options for madvise.
> > One is MADV_COOL, which will deactivate active pages, and the other is
> > MADV_COLD, which will reclaim private pages instantly. These new options
> > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> > gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way
> > that it hints the kernel that memory region is not currently needed and
> > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way
> > that it hints the kernel that memory region is not currently needed and
> > should be reclaimed when memory pressure rises.
> > 
> > This approach is similar in spirit to madvise(MADV_DONTNEED), but the
> > information required to make the reclaim decision is not known to the app.
> > Instead, it is known to a centralized userspace daemon, and that daemon
> > must be able to initiate reclaim on its own without any app involvement.
> > To solve the concern, this patch introduces new syscall -
> > 
> > 	struct pr_madvise_param {
> > 		int size;
> > 		const struct iovec *vec;
> > 	}
> > 
> > 	int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
> > 				struct pr_madvise_param *results,
> > 				struct pr_madvise_param *ranges,
> > 				unsigned long flags);
> > 
> > The syscall takes a pidfd to give hints to an external process and provides
> > a pair of result/ranges vector arguments so that it can give several
> > hints, one per address range, all at once.
> > 
> > I guess others will have different ideas about the naming of the syscall
> > and its options, so feel free to suggest better names.
> > 
> > - Experiment
> > 
> > We did a bunch of testing with several hundred real users on Android, not
> > an artificial benchmark. We saw about a 17% reduction in cold starts without
> > any significant battery or app-startup-latency issues. And with an artificial
> > benchmark that launches and switches apps, we saw on average a 7% improvement
> > in app launches, 18% fewer lmkd kills, and good numbers from vmstat.
> > 
> > A is vanilla and B is process_madvise.
> > 
> > 
> >                                        A          B      delta   ratio(%)
> >                allocstall_dma          0          0          0       0.00
> >            allocstall_movable       1464        457      -1007     -69.00
> >             allocstall_normal     263210     190763     -72447     -28.00
> >              allocstall_total     264674     191220     -73454     -28.00
> >           compact_daemon_wake      26912      25294      -1618      -7.00
> >                  compact_fail      17885      14151      -3734     -21.00
> >          compact_free_scanned 4204766409 3835994922 -368771487      -9.00
> >              compact_isolated    3446484    2967618    -478866     -14.00
> >       compact_migrate_scanned 1621336411 1324695710 -296640701     -19.00
> >                 compact_stall      19387      15343      -4044     -21.00
> >               compact_success       1502       1192       -310     -21.00
> > kswapd_high_wmark_hit_quickly        234        184        -50     -22.00
> >             kswapd_inodesteal     221635     233093      11458       5.00
> >  kswapd_low_wmark_hit_quickly      66065      54009     -12056     -19.00
> >                    nr_dirtied     259934     296476      36542      14.00
> >   nr_vmscan_immediate_reclaim       2587       2356       -231      -9.00
> >               nr_vmscan_write    1274232    2661733    1387501     108.00
> >                    nr_written    1514060    2937560    1423500      94.00
> >                    pageoutrun      67561      55133     -12428     -19.00
> >                    pgactivate    2335060    1984882    -350178     -15.00
> >                   pgalloc_dma   13743011   14096463     353452       2.00
> >               pgalloc_movable          0          0          0       0.00
> >                pgalloc_normal   18742440   16802065   -1940375     -11.00
> >                 pgalloc_total   32485451   30898528   -1586923      -5.00
> >                  pgdeactivate    4262210    2930670   -1331540     -32.00
> >                       pgfault   30812334   31085065     272731       0.00
> >                        pgfree   33553970   31765164   -1788806      -6.00
> >                  pginodesteal      33411      15084     -18327     -55.00
> >                   pglazyfreed          0          0          0       0.00
> >                    pgmajfault     551312    1508299     956987     173.00
> >                pgmigrate_fail      43927      29330     -14597     -34.00
> >             pgmigrate_success    1399851    1203922    -195929     -14.00
> >                        pgpgin   24141776   19032156   -5109620     -22.00
> >                       pgpgout     959344    1103316     143972      15.00
> >                  pgpgoutclean    4639732    3765868    -873864     -19.00
> >                      pgrefill    4884560    3006938   -1877622     -39.00
> >                     pgrotated      37828      25897     -11931     -32.00
> >                 pgscan_direct    1456037     957567    -498470     -35.00
> >        pgscan_direct_throttle          0          0          0       0.00
> >                 pgscan_kswapd    6667767    5047360   -1620407     -25.00
> >                  pgscan_total    8123804    6004927   -2118877     -27.00
> >                    pgskip_dma          0          0          0       0.00
> >                pgskip_movable          0          0          0       0.00
> >                 pgskip_normal      14907      25382      10475      70.00
> >                  pgskip_total      14907      25382      10475      70.00
> >                pgsteal_direct    1118986     690215    -428771     -39.00
> >                pgsteal_kswapd    4750223    3657107   -1093116     -24.00
> >                 pgsteal_total    5869209    4347322   -1521887     -26.00
> >                        pswpin     417613    1392647     975034     233.00
> >                       pswpout    1274224    2661731    1387507     108.00
> >                 slabs_scanned   13686905   10807200   -2879705     -22.00
> >           workingset_activate     668966     569444     -99522     -15.00
> >        workingset_nodereclaim      38957      32621      -6336     -17.00
> >            workingset_refault    2816795    2179782    -637013     -23.00
> >            workingset_restore     294320     168601    -125719     -43.00
> > 
> > [...]
> 
> Please Cc me for the next iteration since I was working on the very same
> thing recently [1].
> 
> Thank you.
> 
> [1] https://gitlab.com/post-factum/pf-kernel/commits/remote-madvise-v3

Sure, I'm happy to see others have a similar requirement.
Minchan Kim May 21, 2019, 4:39 a.m. UTC | #9
On Mon, May 20, 2019 at 12:46:05PM -0400, Johannes Weiner wrote:
> On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > - Approach
> > 
> > The approach we chose was to use a new interface to allow userspace to
> > proactively reclaim entire processes by leveraging platform information.
> > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> > that are known to be cold from userspace and to avoid races with lmkd
> > by reclaiming apps as soon as they entered the cached state. Additionally,
> > it could provide many chances for platform to use much information to
> > optimize memory efficiency.
> > 
> > IMHO we should spell it out that this patchset complements MADV_WONTNEED
> > and MADV_FREE by adding non-destructive ways to gain some free memory
> > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> > kernel that memory region is not currently needed and should be reclaimed
> > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> > kernel that memory region is not currently needed and should be reclaimed
> > when memory pressure rises.
> 
> I agree with this approach and the semantics. But these names are very
> vague and extremely easy to confuse since they're so similar.
> 
> MADV_COLD could be a good name, but for deactivating pages, not
> reclaiming them - marking memory "cold" on the LRU for later reclaim.
> 
> For the immediate reclaim one, I think there is a better option too:
> In virtual memory speak, putting a page into secondary storage (or
> ensuring it's already there), and then freeing its in-memory copy, is
> called "paging out". And that's what this flag is supposed to do. So
> how about MADV_PAGEOUT?
> 
> With that, we'd have:
> 
> MADV_FREE: Mark data invalid, free memory when needed
> MADV_DONTNEED: Mark data invalid, free memory immediately
> 
> MADV_COLD: Data is not used for a while, free memory when needed
> MADV_PAGEOUT: Data is not used for a while, free memory immediately
> 
> What do you think?

There have been several suggestions so far. Thanks, folks!

For deactivating:

- MADV_COOL
- MADV_RECLAIM_LAZY
- MADV_DEACTIVATE
- MADV_COLD
- MADV_FREE_PRESERVE


For reclaiming:

- MADV_COLD
- MADV_RECLAIM_NOW
- MADV_RECLAIMING
- MADV_PAGEOUT
- MADV_DONTNEED_PRESERVE

It seems nobody likes the original MADV_COOL/MADV_COLD naming, so we should
go with something else. For consistency with the other existing madvise
hints, a -PRESERVE suffix would fit well. However, I have never liked the
FREE vs. DONTNEED naming in the first place; they are easily confused.
I prefer PAGEOUT to RECLAIM since it better carries the nuance of reclaim
under memory pressure, with the data paged back in if someone needs it
later, so it already implies PRESERVE.
If there is no strong objection, I want to go with MADV_COLD and
MADV_PAGEOUT.

Other opinion?
Minchan Kim May 21, 2019, 5:01 a.m. UTC | #10
On Mon, May 20, 2019 at 06:44:52PM -0700, Matthew Wilcox wrote:
> On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > IMHO we should spell it out that this patchset complements MADV_DONTNEED
> > and MADV_FREE by adding non-destructive ways to gain some free memory
> > space. MADV_COLD is similar to MADV_DONTNEED in a way that it hints the
> > kernel that memory region is not currently needed and should be reclaimed
> > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> > kernel that memory region is not currently needed and should be reclaimed
> > when memory pressure rises.
> 
> Do we tear down page tables for these ranges?  That seems like a good

True for MADV_COLD (reclaiming) but false for MADV_COOL (deactivating) in
this implementation.

> way of reclaiming potentially a substantial amount of memory.

Given that refaults are spread out over time while reclaim occurs in bursts,
tearing down page tables does make sense to speed up reclaim. However, a
concern to me is anonymous pages, since they need swap cache insertion,
which would be wasteful if they end up not being reclaimed.
Minchan Kim May 21, 2019, 5:14 a.m. UTC | #11
On Tue, May 21, 2019 at 08:25:55AM +0530, Anshuman Khandual wrote:
> 
> 
> On 05/20/2019 10:29 PM, Tim Murray wrote:
> > On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual
> > <anshuman.khandual@arm.com> wrote:
> >>
> >> Or is the objective here to reduce the number of processes which get killed by
> >> lmkd, by triggering swapping of the unused (user-hinted) memory sooner so that
> >> they don't get picked by lmkd? Is under-utilization of the zram hardware a
> >> concern here as well?
> > 
> > The objective is to avoid some instances of memory pressure by
> > proactively swapping pages that userspace knows to be cold before
> > those pages reach the end of the LRUs, which in turn can prevent some
> > apps from being killed by lmk/lmkd. As soon as Android userspace knows
> > that an application is not being used and is only resident to improve
> > performance if the user returns to that app, we can kick off
> > process_madvise on that process's pages (or some portion of those
> > pages) in a power-efficient way to reduce memory pressure long before
> > the system hits the free page watermark. This allows the system more
> > time to put pages into zram versus waiting for the watermark to
> > trigger kswapd, which decreases the likelihood that later memory
> > allocations will cause enough pressure to trigger a kill of one of
> > these apps.
> 
> So this opens up a bit of LRU management to userspace hints. Also, because the
> app itself won't know about the memory situation of the entire system, the new
> system call needs to be invoked from an external process.

That's why process_madvise is introduced here.

> 
> > 
> >> Won't swapping memory out to zram increase the latency of a hot start? Or
> >> is it that it prevents a fresh cold start, which will anyway be slower than
> >> a slow hot start? Just being curious.
> > 
> > First, not all swapped pages will be reloaded immediately once an app
> > is resumed. We've found that an app's working set post-process_madvise
> > is significantly smaller than what an app allocates when it first
> > launches (see the delta between pswpin and pswpout in Minchan's
> > results). Presumably because of this, faulting to fetch from zram does
> 
> pswpin      417613    1392647     975034     233.00
> pswpout    1274224    2661731    1387507     108.00
> 
> IIUC the swap-in growth ratio is much higher than that of swap-out. Is that
> always the case? Or does it tend to swap out from an active area of the working
> set, which then faults back in again?

I think it's because apps stay alive longer (fewer of them are killed), so
page-ins from storage (pgpgin) turn into swap-ins (pswpin).

> 
> > not seem to introduce a noticeable hot start penalty, nor does it
> > cause an increase in performance problems later in the app's
> > lifecycle. I've measured with and without process_madvise, and the
> > differences are within our noise bounds. Second, because we're not
> 
> That assumes the application's post-process_madvise() working set is always
> smaller. There is another challenge: the external process should ideally know
> the active areas of the working set of the application in question, in order to
> invoke process_madvise() correctly and prevent such scenarios.

There are several ways to detect the working set more accurately at the cost
of runtime overhead, for example idle page tracking or clear_refs. Accuracy
is always a trade-off against the overhead of LRU aging.

> 
> > preemptively evicting file pages and only making them more likely to
> > be evicted when there's already memory pressure, we avoid the case
> > where we process_madvise an app then immediately return to the app and
> > reload all file pages in the working set even though there was no
> > intervening memory pressure. Our initial version of this work evicted
> 
> That would be the worst-case scenario, which should be avoided. Memory pressure
> must be a parameter before actually doing the swap-out, but pages known to be
> inactive/cold can be marked high priority for swap-out.
> 
> > file pages preemptively and did cause a noticeable slowdown (~15%) for
> > that case; this patch set avoids that slowdown. Finally, the benefit
> > from avoiding cold starts is huge. The performance improvement from
> > having a hot start instead of a cold start ranges from 3x for very
> > small apps to 50x+ for larger apps like high-fidelity games.
> 
> Is there any other real-world scenario, apart from this app-based ecosystem,
> where user-hinted LRU management might be helpful? Just being curious. Thanks
> for the detailed explanation. I will continue looking into this series.
Michal Hocko May 21, 2019, 6:32 a.m. UTC | #12
[Cc linux-api]

On Tue 21-05-19 13:39:50, Minchan Kim wrote:
> On Mon, May 20, 2019 at 12:46:05PM -0400, Johannes Weiner wrote:
> > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > > - Approach
> > > 
> > > The approach we chose was to use a new interface to allow userspace to
> > > proactively reclaim entire processes by leveraging platform information.
> > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> > > that are known to be cold from userspace and to avoid races with lmkd
> > > by reclaiming apps as soon as they entered the cached state. Additionally,
> > > it gives the platform many opportunities to apply its richer information
> > > to optimize memory efficiency.
> > > 
> > > IMHO we should spell it out that this patchset complements MADV_DONTNEED
> > > and MADV_FREE by adding non-destructive ways to gain some free memory
> > > space. MADV_COLD is similar to MADV_DONTNEED in that it hints to the
> > > kernel that the memory region is not currently needed and should be
> > > reclaimed immediately; MADV_COOL is similar to MADV_FREE in that it
> > > hints to the kernel that the memory region is not currently needed and
> > > should be reclaimed when memory pressure rises.
> > 
> > I agree with this approach and the semantics. But these names are very
> > vague and extremely easy to confuse since they're so similar.
> > 
> > MADV_COLD could be a good name, but for deactivating pages, not
> > reclaiming them - marking memory "cold" on the LRU for later reclaim.
> > 
> > For the immediate reclaim one, I think there is a better option too:
> > In virtual memory speak, putting a page into secondary storage (or
> > ensuring it's already there), and then freeing its in-memory copy, is
> > called "paging out". And that's what this flag is supposed to do. So
> > how about MADV_PAGEOUT?
> > 
> > With that, we'd have:
> > 
> > MADV_FREE: Mark data invalid, free memory when needed
> > MADV_DONTNEED: Mark data invalid, free memory immediately
> > 
> > MADV_COLD: Data is not used for a while, free memory when needed
> > MADV_PAGEOUT: Data is not used for a while, free memory immediately
> > 
> > What do you think?
> 
> There have been several suggestions so far. Thanks, folks!
> 
> For deactivating:
> 
> - MADV_COOL
> - MADV_RECLAIM_LAZY
> - MADV_DEACTIVATE
> - MADV_COLD
> - MADV_FREE_PRESERVE
> 
> 
> For reclaiming:
> 
> - MADV_COLD
> - MADV_RECLAIM_NOW
> - MADV_RECLAIMING
> - MADV_PAGEOUT
> - MADV_DONTNEED_PRESERVE
> 
> It seems nobody likes MADV_COOL, so we should go with something else.
> For consistency with the other existing madvise hints, a -PRESERVE
> postfix would fit well. However, I never liked the FREE vs. DONTNEED
> naming in the first place; the two are easily confused.
> I prefer PAGEOUT to RECLAIM since it better carries the nuance of
> reclaim under memory pressure, with pages expected to be paged back in
> if someone needs them later, so it already implies PRESERVE.
> If there are no strong objections, I want to go with MADV_COLD and
> MADV_PAGEOUT.
> 
> Any other opinions?

I do not really care strongly. I am pretty sure we will have a lot of
suggestions, because people tend to be good at arguing about that...
Anyway, unlike DONTNEED/FREE, there is no other OS implementing these
features, right? So we shouldn't be tied to existing names. On the other
hand I kind of like the reference to the existing names, but
DEACTIVATE/PAGEOUT seem a good fit to me as well. Unless a much better
name is suggested I would go with one of those. Up to you.
Michal Hocko May 21, 2019, 6:34 a.m. UTC | #13
[Cc linux-api]

On Mon 20-05-19 18:44:52, Matthew Wilcox wrote:
> On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > MADV_COLD is similar to MADV_DONTNEED in that it hints to the kernel that
> > the memory region is not currently needed and should be reclaimed
> > immediately; MADV_COOL is similar to MADV_FREE in that it hints to the
> > kernel that the memory region is not currently needed and should be
> > reclaimed when memory pressure rises.
> 
> Do we tear down page tables for these ranges?  That seems like a good
> way of reclaiming potentially a substantial amount of memory.

I do not think we can in general because this is a non-destructive
operation. So at least we cannot tear down anonymous ptes (they will
turn into swap entries).
Christian Brauner May 21, 2019, 8:42 a.m. UTC | #14
On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> - Background
> 
> The Android terminology used for forking a new process and starting an app
> from scratch is a cold start, while resuming an existing app is a hot start.
> While we continually try to improve the performance of cold starts, hot
> starts will always be significantly less power-hungry as well as faster, so
> we are trying to make hot starts more likely than cold starts.
> 
> To increase hot starts, Android userspace manages the order in which apps
> should be killed in a process called ActivityManagerService.
> ActivityManagerService tracks every Android app or service that the user
> could be interacting with at any time and translates that into a ranked
> list for lmkd (the low memory killer daemon). These apps are likely to be
> killed by lmkd if the system has to reclaim memory. In that sense they are
> similar to entries in any other cache. Those apps are kept alive for
> opportunistic performance improvements, but those improvements will vary
> based on the memory requirements of individual workloads.
> 
> - Problem
> 
> Naturally, cached apps are the dominant consumers of memory on the system.
> However, they are not significant consumers of swap even though they are
> good candidates for it. On investigation we found that swapping out only
> begins once the low zone watermark is hit and kswapd wakes up, but the
> overall allocation rate in the system might trip lmkd thresholds first and
> cause a cached process to be killed (we measured the performance of
> swapping out vs. zapping the memory by killing a process; unsurprisingly,
> zapping is 10x faster even though we use zram, which is much faster than
> real storage). A kill from lmkd will often satisfy the high zone watermark,
> resulting in very few pages actually being moved to swap.
> 
> - Approach
> 
> The approach we chose was to use a new interface to allow userspace to
> proactively reclaim entire processes by leveraging platform information.
> This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> that are known to be cold from userspace and to avoid races with lmkd
> by reclaiming apps as soon as they entered the cached state. Additionally,
> it gives the platform many opportunities to apply its richer information
> to optimize memory efficiency.
> 
> IMHO we should spell it out that this patchset complements MADV_DONTNEED
> and MADV_FREE by adding non-destructive ways to gain some free memory
> space.
> 
> To achieve that goal, the patchset introduces two new madvise options:
> MADV_COOL, which deactivates active pages, and MADV_COLD, which reclaims
> private pages immediately. MADV_COLD is similar to MADV_DONTNEED in that
> it hints to the kernel that the memory region is not currently needed and
> should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in that
> it hints to the kernel that the memory region is not currently needed and
> should be reclaimed when memory pressure rises.
> 
> This approach is similar in spirit to madvise(MADV_DONTNEED), but the
> information required to make the reclaim decision is not known to the app.
> Instead, it is known to a centralized userspace daemon, and that daemon
> must be able to initiate reclaim on its own without any app involvement.
> To address that, this patch introduces a new syscall:
> 
> 	struct pr_madvise_param {
> 		int size;
> 		const struct iovec *vec;
> 	};
> 
> 	int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
> 				struct pr_madvise_param *results,
> 				struct pr_madvise_param *ranges,
> 				unsigned long flags);
> 
> The syscall takes a pidfd identifying the external process to hint, plus a
> pair of result/range vector arguments, so that several hints can be given
> for multiple address ranges all at once.
> 
> I guess others have different ideas about the naming of syscall and options
> so feel free to suggest better naming.

Yes, all new syscalls making use of pidfds should be named
pidfd_<action>. So please make this pidfd_madvise.

Please make sure to Cc me on this in the future as I'm maintaining
pidfds. Would be great to have Jann on this too since he's been touching
both mm and parts of the pidfd stuff with me.
Michal Hocko May 21, 2019, 10:34 a.m. UTC | #15
On Tue 21-05-19 08:25:55, Anshuman Khandual wrote:
> On 05/20/2019 10:29 PM, Tim Murray wrote:
[...]
> > not seem to introduce a noticeable hot start penalty, nor does it
> > cause an increase in performance problems later in the app's
> > lifecycle. I've measured with and without process_madvise, and the
> > differences are within our noise bounds. Second, because we're not
> 
> That assumes the post-process_madvise() working set for the application is
> always smaller. There is another challenge. The external process should ideally
> have the knowledge of active areas of the working set for an application in
> question for it to invoke process_madvise() correctly to prevent such scenarios.

But that doesn't really seem relevant for the API itself, right? The
higher-level logic is the monitor's business.
Minchan Kim May 21, 2019, 11:05 a.m. UTC | #16
On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
> On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > [...]
> > I guess others have different ideas about the naming of syscall and options
> > so feel free to suggest better naming.
> 
> Yes, all new syscalls making use of pidfds should be named
> pidfd_<action>. So please make this pidfd_madvise.

I don't have any particular preference but just wondering why pidfd is
so special to have it as prefix of system call name.

> 
> Please make sure to Cc me on this in the future as I'm maintaining
> pidfds. Would be great to have Jann on this too since he's been touching
> both mm and parts of the pidfd stuff with me.

Sure!
Christian Brauner May 21, 2019, 11:30 a.m. UTC | #17
On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote:
> On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
> > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > > [...]
> > > I guess others have different ideas about the naming of syscall and options
> > > so feel free to suggest better naming.
> > 
> > Yes, all new syscalls making use of pidfds should be named
> > pidfd_<action>. So please make this pidfd_madvise.
> 
> I don't have any particular preference but just wondering why pidfd is
> so special to have it as prefix of system call name.

It's a whole new API to address processes. We already have
clone(CLONE_PIDFD) and pidfd_send_signal(), as you have seen since you
exported pidfd_to_pid(). And we're going to have pidfd_open(). Your
syscall works only with pidfds, so it's tied to this API and should
follow the naming scheme. This also makes life easier for userspace and
is consistent.

> 
> > 
> > Please make sure to Cc me on this in the future as I'm maintaining
> > pidfds. Would be great to have Jann on this too since he's been touching
> > both mm and parts of the pidfd stuff with me.
> 
> Sure!

Thanks!
Christian Brauner May 21, 2019, 11:39 a.m. UTC | #18
On Tue, May 21, 2019 at 01:30:29PM +0200, Christian Brauner wrote:
> On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote:
> > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
> > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > > > [...]
> > > > I guess others have different ideas about the naming of syscall and options
> > > > so feel free to suggest better naming.
> > > 
> > > Yes, all new syscalls making use of pidfds should be named
> > > pidfd_<action>. So please make this pidfd_madvise.
> > 
> > I don't have any particular preference but just wondering why pidfd is
> > so special to have it as prefix of system call name.
> 
> It's a whole new API to address processes. We already have
> clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you
> exported pidfd_to_pid(). And we're going to have pidfd_open(). Your
> syscall works only with pidfds so it's tied to this api as well so it
> should follow the naming scheme. This also makes life easier for
> userspace and is consistent.

This is at least my reasoning. I'm not going to make this a whole big
pedantic argument. If people have really strong feelings about not using
this prefix then fine. But if syscalls can be grouped together and have
consistent naming this is always a big plus.
Minchan Kim May 21, 2019, 11:41 a.m. UTC | #19
On Tue, May 21, 2019 at 01:30:32PM +0200, Christian Brauner wrote:
> On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote:
> > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
> > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > > > [...]
> > > > I guess others have different ideas about the naming of syscall and options
> > > > so feel free to suggest better naming.
> > > 
> > > Yes, all new syscalls making use of pidfds should be named
> > > pidfd_<action>. So please make this pidfd_madvise.
> > 
> > I don't have any particular preference but just wondering why pidfd is
> > so special to have it as prefix of system call name.
> 
> It's a whole new API to address processes. We already have
> clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you
> exported pidfd_to_pid(). And we're going to have pidfd_open(). Your
> syscall works only with pidfds so it's tied to this api as well so it
> should follow the naming scheme. This also makes life easier for
> userspace and is consistent.

Okay. I will change the API name at next revision.
Thanks.
Christian Brauner May 21, 2019, 12:04 p.m. UTC | #20
On May 21, 2019 1:41:20 PM GMT+02:00, Minchan Kim <minchan@kernel.org> wrote:
>On Tue, May 21, 2019 at 01:30:32PM +0200, Christian Brauner wrote:
>> On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote:
>> > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
>> > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
>> > > > [ full cover letter snipped ]
>> > > > 
>> > > > I guess others have different ideas about the naming of syscall and options
>> > > > so feel free to suggest better naming.
>> > > 
>> > > Yes, all new syscalls making use of pidfds should be named
>> > > pidfd_<action>. So please make this pidfd_madvise.
>> > 
>> > I don't have any particular preference but just wondering why pidfd is
>> > so special to have it as prefix of system call name.
>> 
>> It's a whole new API to address processes. We already have
>> clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you
>> exported pidfd_to_pid(). And we're going to have pidfd_open(). Your
>> syscall works only with pidfds so it's tied to this api as well so it
>> should follow the naming scheme. This also makes life easier for
>> userspace and is consistent.
>
>Okay. I will change the API name at next revision.
>Thanks.

Thanks!
Fwiw, there's been a similar patch by Oleksandr for pidfd_madvise that I stumbled upon a few days back:
https://gitlab.com/post-factum/pf-kernel/commit/0595f874a53fa898739ac315ddf208554d9dc897

He wanted to be cc'ed but I forgot.

Christian
Oleksandr Natalenko May 21, 2019, 12:15 p.m. UTC | #21
On Tue, May 21, 2019 at 02:04:00PM +0200, Christian Brauner wrote:
> On May 21, 2019 1:41:20 PM GMT+02:00, Minchan Kim <minchan@kernel.org> wrote:
> >[ earlier messages quoted in full snipped ]
> >
> >Okay. I will change the API name at next revision.
> >Thanks.
> 
> Thanks!
> Fwiw, there's been a similar patch by Oleksandr for pidfd_madvise I stumbled upon a few days back:
> https://gitlab.com/post-factum/pf-kernel/commit/0595f874a53fa898739ac315ddf208554d9dc897
> 
> He wanted to be cc'ed but I forgot.

Thanks :).

FWIW, since this submission is essentially a continuation of our discussion
involving my earlier KSM submissions here, I won't move my gitlab branch
forward and will be happy to assist with what we have here, be it
pidfd_madvise() or a set of /proc files (or something else).

> 
> Christian
>
Shakeel Butt May 21, 2019, 12:53 p.m. UTC | #22
On Sun, May 19, 2019 at 8:53 PM Minchan Kim <minchan@kernel.org> wrote:
>
> - Background
>
> The Android terminology used for forking a new process and starting an app
> from scratch is a cold start, while resuming an existing app is a hot start.
> While we continually try to improve the performance of cold starts, hot
> starts will always be significantly less power hungry as well as faster so
> we are trying to make hot start more likely than cold start.
>
> To increase hot start, Android userspace manages the order that apps should
> be killed in a process called ActivityManagerService. ActivityManagerService
> tracks every Android app or service that the user could be interacting with
> at any time and translates that into a ranked list for lmkd(low memory
> killer daemon). They are likely to be killed by lmkd if the system has to
> reclaim memory. In that sense they are similar to entries in any other cache.
> Those apps are kept alive for opportunistic performance improvements but
> those performance improvements will vary based on the memory requirements of
> individual workloads.
>
> - Problem
>
> Naturally, cached apps were dominant consumers of memory on the system.
> However, they were not significant consumers of swap even though they are
> good candidate for swap. Under investigation, swapping out only begins
> once the low zone watermark is hit and kswapd wakes up, but the overall
> allocation rate in the system might trip lmkd thresholds and cause a cached
> process to be killed(we measured performance swapping out vs. zapping the
> memory by killing a process. Unsurprisingly, zapping is 10x times faster
> even though we use zram which is much faster than real storage) so kill
> from lmkd will often satisfy the high zone watermark, resulting in very
> few pages actually being moved to swap.

It is not clear from the above paragraph what exactly the problem is. IMO
low usage of swap is not the problem; rather, global memory pressure and
the reactive response to it are the problem. Killing apps is preferred
over swapping since, as you have noted, zapping frees memory faster, but
it indirectly increases cold starts. Also, swapping on allocation causes
latency issues for the app. So, a proactive mechanism is needed to keep
global pressure away, which indirectly reduces cold starts and alloc
stalls.

>
> - Approach
>
> The approach we chose was to use a new interface to allow userspace to
> proactively reclaim entire processes by leveraging platform information.
> This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> that are known to be cold from userspace and to avoid races with lmkd
> by reclaiming apps as soon as they entered the cached state. Additionally,
> it could provide many chances for platform to use much information to
> optimize memory efficiency.

I think it would be good to have clear reasoning on why the "reclaim from
userspace" approach is taken. The Android runtime clearly has more
accurate stale/cold information at the app/process level and can
positively influence the kernel's reclaim decisions. So, the "reclaim from
userspace" approach makes total sense for Android. I envision that
Chrome OS would be another very obvious user of this approach. There
can be tens of tabs which the user has not touched for some time.
Chrome OS can proactively reclaim memory from such tabs.

>
> IMHO we should spell it out that this patchset complements MADV_WONTNEED

MADV_DONTNEED? same at couple of places below.

> and MADV_FREE by adding non-destructive ways to gain some free memory
> space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> kernel that memory region is not currently needed and should be reclaimed
> when memory pressure rises.
>
> To achieve the goal, the patchset introduce two new options for madvise.
> One is MADV_COOL which will deactive activated pages and the other is
> MADV_COLD which will reclaim private pages instantly. These new options
> complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way
> that it hints the kernel that memory region is not currently needed and
> should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way
> that it hints the kernel that memory region is not currently needed and
> should be reclaimed when memory pressure rises.
>
> This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> information required to make the reclaim decision is not known to the app.
> Instead, it is known to a centralized userspace daemon, and that daemon
> must be able to initiate reclaim on its own without any app involvement.
> To solve the concern, this patch introduces new syscall -
>
>         struct pr_madvise_param {
>                 int size;
>                 const struct iovec *vec;
>         }
>
>         int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
>                                 struct pr_madvise_param *results,
>                                 struct pr_madvise_param *ranges,
>                                 unsigned long flags);
>
> The syscall get pidfd to give hints to external process and provides
> pair of result/ranges vector arguments so that it could give several
> hints to each address range all at once.
>
> I guess others have different ideas about the naming of syscall and options
> so feel free to suggest better naming.
>
> - Experiment
>
> We did a bunch of testing with several hundred real users, not an artificial
> benchmark, on Android. We saw about a 17% decrease in cold starts without any
> significant battery/app startup latency issues. And with an artificial benchmark
> which launches and switches apps, we saw an average 7% app launching improvement,
> 18% fewer lmkd kills and good stats from vmstat.
>
> A is vanilla and B is process_madvise.
>
>
>                                        A          B      delta   ratio(%)
>                allocstall_dma          0          0          0       0.00
>            allocstall_movable       1464        457      -1007     -69.00
>             allocstall_normal     263210     190763     -72447     -28.00
>              allocstall_total     264674     191220     -73454     -28.00
>           compact_daemon_wake      26912      25294      -1618      -7.00
>                  compact_fail      17885      14151      -3734     -21.00
>          compact_free_scanned 4204766409 3835994922 -368771487      -9.00
>              compact_isolated    3446484    2967618    -478866     -14.00
>       compact_migrate_scanned 1621336411 1324695710 -296640701     -19.00
>                 compact_stall      19387      15343      -4044     -21.00
>               compact_success       1502       1192       -310     -21.00
> kswapd_high_wmark_hit_quickly        234        184        -50     -22.00
>             kswapd_inodesteal     221635     233093      11458       5.00
>  kswapd_low_wmark_hit_quickly      66065      54009     -12056     -19.00
>                    nr_dirtied     259934     296476      36542      14.00
>   nr_vmscan_immediate_reclaim       2587       2356       -231      -9.00
>               nr_vmscan_write    1274232    2661733    1387501     108.00
>                    nr_written    1514060    2937560    1423500      94.00
>                    pageoutrun      67561      55133     -12428     -19.00
>                    pgactivate    2335060    1984882    -350178     -15.00
>                   pgalloc_dma   13743011   14096463     353452       2.00
>               pgalloc_movable          0          0          0       0.00
>                pgalloc_normal   18742440   16802065   -1940375     -11.00
>                 pgalloc_total   32485451   30898528   -1586923      -5.00
>                  pgdeactivate    4262210    2930670   -1331540     -32.00
>                       pgfault   30812334   31085065     272731       0.00
>                        pgfree   33553970   31765164   -1788806      -6.00
>                  pginodesteal      33411      15084     -18327     -55.00
>                   pglazyfreed          0          0          0       0.00
>                    pgmajfault     551312    1508299     956987     173.00
>                pgmigrate_fail      43927      29330     -14597     -34.00
>             pgmigrate_success    1399851    1203922    -195929     -14.00
>                        pgpgin   24141776   19032156   -5109620     -22.00
>                       pgpgout     959344    1103316     143972      15.00
>                  pgpgoutclean    4639732    3765868    -873864     -19.00
>                      pgrefill    4884560    3006938   -1877622     -39.00
>                     pgrotated      37828      25897     -11931     -32.00
>                 pgscan_direct    1456037     957567    -498470     -35.00
>        pgscan_direct_throttle          0          0          0       0.00
>                 pgscan_kswapd    6667767    5047360   -1620407     -25.00
>                  pgscan_total    8123804    6004927   -2118877     -27.00
>                    pgskip_dma          0          0          0       0.00
>                pgskip_movable          0          0          0       0.00
>                 pgskip_normal      14907      25382      10475      70.00
>                  pgskip_total      14907      25382      10475      70.00
>                pgsteal_direct    1118986     690215    -428771     -39.00
>                pgsteal_kswapd    4750223    3657107   -1093116     -24.00
>                 pgsteal_total    5869209    4347322   -1521887     -26.00
>                        pswpin     417613    1392647     975034     233.00
>                       pswpout    1274224    2661731    1387507     108.00
>                 slabs_scanned   13686905   10807200   -2879705     -22.00
>           workingset_activate     668966     569444     -99522     -15.00
>        workingset_nodereclaim      38957      32621      -6336     -17.00
>            workingset_refault    2816795    2179782    -637013     -23.00
>            workingset_restore     294320     168601    -125719     -43.00
>
> pgmajfault is increased by 173% because swapin is increased by over 200% by
> the process_madvise hint. However, a swap read backed by zram is much cheaper
> than file IO from a performance point of view, and an app hot start via swapin
> is also cheaper than a cold start from the beginning of the app, which needs
> many IOs from storage and initialization steps.
>
> This patchset is against on next-20190517.
>
> Minchan Kim (7):
>   mm: introduce MADV_COOL
>   mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
>   mm: introduce MADV_COLD
>   mm: factor out madvise's core functionality
>   mm: introduce external memory hinting API
>   mm: extend process_madvise syscall to support vector arrary
>   mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
>
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/page-flags.h             |   1 +
>  include/linux/page_idle.h              |  15 +
>  include/linux/proc_fs.h                |   1 +
>  include/linux/swap.h                   |   2 +
>  include/linux/syscalls.h               |   2 +
>  include/uapi/asm-generic/mman-common.h |  12 +
>  include/uapi/asm-generic/unistd.h      |   2 +
>  kernel/signal.c                        |   2 +-
>  kernel/sys_ni.c                        |   1 +
>  mm/madvise.c                           | 600 +++++++++++++++++++++----
>  mm/swap.c                              |  43 ++
>  mm/vmscan.c                            |  80 +++-
>  14 files changed, 680 insertions(+), 83 deletions(-)
>
> --
> 2.21.0.1020.gf2820cf01a-goog
>
Shakeel Butt May 21, 2019, 12:56 p.m. UTC | #23
On Mon, May 20, 2019 at 7:55 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
>
> On 05/20/2019 10:29 PM, Tim Murray wrote:
> > On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual
> > <anshuman.khandual@arm.com> wrote:
> >>
> >> Or Is the objective here is reduce the number of processes which get killed by
> >> lmkd by triggering swapping for the unused memory (user hinted) sooner so that
> >> they dont get picked by lmkd. Under utilization for zram hardware is a concern
> >> here as well ?
> >
> > The objective is to avoid some instances of memory pressure by
> > proactively swapping pages that userspace knows to be cold before
> > those pages reach the end of the LRUs, which in turn can prevent some
> > apps from being killed by lmk/lmkd. As soon as Android userspace knows
> > that an application is not being used and is only resident to improve
> > performance if the user returns to that app, we can kick off
> > process_madvise on that process's pages (or some portion of those
> > pages) in a power-efficient way to reduce memory pressure long before
> > the system hits the free page watermark. This allows the system more
> > time to put pages into zram versus waiting for the watermark to
> > trigger kswapd, which decreases the likelihood that later memory
> > allocations will cause enough pressure to trigger a kill of one of
> > these apps.
>
> So this opens up bit of LRU management to user space hints. Also because the app
> in itself wont know about the memory situation of the entire system, new system
> call needs to be called from an external process.
>
> >
> >> Swapping out memory into zram wont increase the latency for a hot start ? Or
> >> is it because as it will prevent a fresh cold start which anyway will be slower
> >> than a slow hot start. Just being curious.
> >
> > First, not all swapped pages will be reloaded immediately once an app
> > is resumed. We've found that an app's working set post-process_madvise
> > is significantly smaller than what an app allocates when it first
> > launches (see the delta between pswpin and pswpout in Minchan's
> > results). Presumably because of this, faulting to fetch from zram does
>
> pswpin      417613    1392647     975034     233.00
> pswpout    1274224    2661731    1387507     108.00
>
> IIUC the swap-in ratio is way higher in comparison to that of swap out. Is that
> always the case ? Or it tend to swap out from an active area of the working set
> which faulted back again.
>
> > not seem to introduce a noticeable hot start penalty, not does it
> > cause an increase in performance problems later in the app's
> > lifecycle. I've measured with and without process_madvise, and the
> > differences are within our noise bounds. Second, because we're not
>
> That is assuming that post process_madvise() working set for the application is
> always smaller. There is another challenge. The external process should ideally
> have the knowledge of active areas of the working set for an application in
> question for it to invoke process_madvise() correctly to prevent such scenarios.
>
> > preemptively evicting file pages and only making them more likely to
> > be evicted when there's already memory pressure, we avoid the case
> > where we process_madvise an app then immediately return to the app and
> > reload all file pages in the working set even though there was no
> > intervening memory pressure. Our initial version of this work evicted
>
> That would be the worst case scenario which should be avoided. Memory pressure
> must be a parameter before actually doing the swap out. But pages if know to be
> inactive/cold can be marked high priority to be swapped out.
>
> > file pages preemptively and did cause a noticeable slowdown (~15%) for
> > that case; this patch set avoids that slowdown. Finally, the benefit
> > from avoiding cold starts is huge. The performance improvement from
> > having a hot start instead of a cold start ranges from 3x for very
> > small apps to 50x+ for larger apps like high-fidelity games.
>
> Is there any other real world scenario apart from this app based ecosystem where
> user hinted LRU management might be helpful ? Just being curious. Thanks for the
> detailed explanation. I will continue looking into this series.

Chrome OS is another real-world use case for this user-hinted LRU
management approach, proactively reclaiming memory from tabs not
accessed by the user for some time.
Brian Geffon May 22, 2019, 4:15 a.m. UTC | #24
To expand on the ChromeOS use case we're in a very similar situation to
Android. For example, the Chrome browser uses a separate process for each
individual tab (with some exceptions) and over time many tabs remain open
in a back-grounded or idle state. Given that we have a lot of information
about the weight of a tab, when it was last active, etc, we can benefit
tremendously from per-process reclaim. We're working on getting real world
numbers but all of our initial testing shows very promising results.

Brian Geffon May 22, 2019, 4:23 a.m. UTC | #25
To expand on the ChromeOS use case: we're in a very similar situation
to Android. For example, the Chrome browser uses a separate process
for each individual tab (with some exceptions), and over time many tabs
remain open in a backgrounded or idle state. Given that we have a lot
of information about the weight of a tab, when it was last active,
etc., we can benefit tremendously from per-process reclaim. We're
working on getting real-world numbers, but all of our initial testing
shows very promising results.


On Tue, May 21, 2019 at 5:57 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Mon, May 20, 2019 at 7:55 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
> >
> >
> >
> > On 05/20/2019 10:29 PM, Tim Murray wrote:
> > > On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual
> > > <anshuman.khandual@arm.com> wrote:
> > >>
> > >> Or is the objective here to reduce the number of processes that get killed by
> > >> lmkd, by triggering swapping of the unused (user-hinted) memory sooner so that
> > >> they don't get picked by lmkd? Is underutilization of the zram hardware a concern
> > >> here as well?
> > >
> > > The objective is to avoid some instances of memory pressure by
> > > proactively swapping pages that userspace knows to be cold before
> > > those pages reach the end of the LRUs, which in turn can prevent some
> > > apps from being killed by lmk/lmkd. As soon as Android userspace knows
> > > that an application is not being used and is only resident to improve
> > > performance if the user returns to that app, we can kick off
> > > process_madvise on that process's pages (or some portion of those
> > > pages) in a power-efficient way to reduce memory pressure long before
> > > the system hits the free page watermark. This allows the system more
> > > time to put pages into zram versus waiting for the watermark to
> > > trigger kswapd, which decreases the likelihood that later memory
> > > allocations will cause enough pressure to trigger a kill of one of
> > > these apps.
> >
> > So this opens up a bit of LRU management to user-space hints. Also, because the app
> > itself won't know about the memory situation of the entire system, the new system
> > call needs to be invoked from an external process.
> >
> > >
> > >> Won't swapping memory out into zram increase the latency for a hot start? Or
> > >> is it acceptable because it prevents a fresh cold start, which will anyway be
> > >> slower than a slow hot start? Just being curious.
> > >
> > > First, not all swapped pages will be reloaded immediately once an app
> > > is resumed. We've found that an app's working set post-process_madvise
> > > is significantly smaller than what an app allocates when it first
> > > launches (see the delta between pswpin and pswpout in Minchan's
> > > results). Presumably because of this, faulting to fetch from zram does
> >
> > pswpin      417613    1392647     975034     233.00
> > pswpout    1274224    2661731    1387507     108.00
> >
> > IIUC the swap-in ratio is way higher in comparison to that of swap-out. Is that
> > always the case? Or does it tend to swap out from an active area of the working set,
> > which then faults back in again?
> >
> > > not seem to introduce a noticeable hot start penalty, nor does it
> > > cause an increase in performance problems later in the app's
> > > lifecycle. I've measured with and without process_madvise, and the
> > > differences are within our noise bounds. Second, because we're not
> >
> > That is assuming that the post-process_madvise() working set for the application is
> > always smaller. There is another challenge: the external process should ideally
> > know the active areas of the working set for the application in
> > question, so that it can invoke process_madvise() correctly and prevent such scenarios.
> >
> > > preemptively evicting file pages and only making them more likely to
> > > be evicted when there's already memory pressure, we avoid the case
> > > where we process_madvise an app then immediately return to the app and
> > > reload all file pages in the working set even though there was no
> > > intervening memory pressure. Our initial version of this work evicted
> >
> > That would be the worst-case scenario, which should be avoided. Memory pressure
> > must be a parameter before actually doing the swap-out. But pages known to be
> > inactive/cold could be marked high priority to be swapped out.
> >
> > > file pages preemptively and did cause a noticeable slowdown (~15%) for
> > > that case; this patch set avoids that slowdown. Finally, the benefit
> > > from avoiding cold starts is huge. The performance improvement from
> > > having a hot start instead of a cold start ranges from 3x for very
> > > small apps to 50x+ for larger apps like high-fidelity games.
> >
> > Is there any other real-world scenario, apart from this app-based ecosystem, where
> > user-hinted LRU management might be helpful? Just being curious. Thanks for the
> > detailed explanation. I will continue looking into this series.
>
> Chrome OS is another real-world use case for this user-hinted LRU
> management approach: proactively reclaiming memory from tabs not
> accessed by the user for some time.
Daniel Colascione May 22, 2019, 5:11 a.m. UTC | #26
On Tue, May 21, 2019 at 4:39 AM Christian Brauner <christian@brauner.io> wrote:
>
> On Tue, May 21, 2019 at 01:30:29PM +0200, Christian Brauner wrote:
> > On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote:
> > > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
> > > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > > > > [...]
> > > > >
> > > > > This approach is similar in spirit to madvise(MADV_DONTNEED), but the
> > > > > information required to make the reclaim decision is not known to the app.
> > > > > Instead, it is known to a centralized userspace daemon, and that daemon
> > > > > must be able to initiate reclaim on its own without any app involvement.
> > > > > To solve the concern, this patch introduces new syscall -
> > > > >
> > > > >         struct pr_madvise_param {
> > > > >                 int size;
> > > > >                 const struct iovec *vec;
> > > > >         }
> > > > >
> > > > >         int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
> > > > >                                 struct pr_madvise_param *results,
> > > > >                                 struct pr_madvise_param *ranges,
> > > > >                                 unsigned long flags);
> > > > >
> > > > > The syscall takes a pidfd to give hints to an external process, and provides a
> > > > > pair of results/ranges vector arguments so that it can give several
> > > > > hints, one for each address range, all at once.
> > > > >
> > > > > I guess others have different ideas about the naming of syscall and options
> > > > > so feel free to suggest better naming.
> > > >
> > > > Yes, all new syscalls making use of pidfds should be named
> > > > pidfd_<action>. So please make this pidfd_madvise.
> > >
> > > I don't have any particular preference, but I'm just wondering why pidfd
> > > is so special that it becomes the prefix of the system call name.
> >
> > It's a whole new API for addressing processes. We already have
> > clone(CLONE_PIDFD) and pidfd_send_signal(), as you have seen since you
> > exported pidfd_to_pid(), and we're going to have pidfd_open(). Your
> > syscall works only with pidfds, so it's tied to this API and
> > should follow the naming scheme. This also makes life easier for
> > userspace and keeps things consistent.
>
> This is at least my reasoning. I'm not going to make this a whole big
> pedantic argument. If people have really strong feelings about not using
> this prefix then fine. But if syscalls can be grouped together and have
> consistent naming this is always a big plus.

My hope has been that pidfd use becomes normalized enough that
prefixing "pidfd_" to pidfd-accepting system calls becomes redundant.
We write write(), not fd_write(), right? :-) pidfd_open() makes sense
because the primary purpose of this system call is to operate on a
pidfd, but I think process_madvise() is fine.
Christian Brauner May 22, 2019, 8:22 a.m. UTC | #27
On Wed, May 22, 2019 at 7:12 AM Daniel Colascione <dancol@google.com> wrote:
>
> On Tue, May 21, 2019 at 4:39 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > On Tue, May 21, 2019 at 01:30:29PM +0200, Christian Brauner wrote:
> > > On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote:
> > > > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
> > > > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > > > > > [...]
> > > > > >
> > > > > > I guess others have different ideas about the naming of syscall and options
> > > > > > so feel free to suggest better naming.
> > > > >
> > > > > Yes, all new syscalls making use of pidfds should be named
> > > > > pidfd_<action>. So please make this pidfd_madvise.
> > > >
> > > > I don't have any particular preference but just wondering why pidfd is
> > > > so special to have it as prefix of system call name.
> > >
> > > It's a whole new API to address processes. We already have
> > > clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you
> > > exported pidfd_to_pid(). And we're going to have pidfd_open(). Your
> > > syscall works only with pidfds so it's tied to this api as well so it
> > > should follow the naming scheme. This also makes life easier for
> > > userspace and is consistent.
> >
> > This is at least my reasoning. I'm not going to make this a whole big
> > pedantic argument. If people have really strong feelings about not using
> > this prefix then fine. But if syscalls can be grouped together and have
> > consistent naming this is always a big plus.
>
> My hope has been that pidfd use becomes normalized enough that
> prefixing "pidfd_" to pidfd-accepting system calls becomes redundant.
> We write write(), not fd_write(), right? :-) pidfd_open() makes sense
> because the primary purpose of this system call is to operate on a
> pidfd, but I think process_madvise() is fine.

This madvise syscall just operates on pidfds. It would make sense to
name it process_madvise() if it were to operate on both pid_t and int pidfd.
Giving specific names to system calls won't stop them from becoming
normalized; the fact that people have built other system calls around
the pidfd API is proof enough of that. :)
For userspace, pidfd_madvise() is nicer, and it clearly expresses
that it only accepts pidfds.
So please, Minchan, make it pidfd_madvise() in the next version. :)

Christian
Daniel Colascione May 22, 2019, 1:16 p.m. UTC | #28
On Wed, May 22, 2019 at 1:22 AM Christian Brauner <christian@brauner.io> wrote:
>
> On Wed, May 22, 2019 at 7:12 AM Daniel Colascione <dancol@google.com> wrote:
> >
> > On Tue, May 21, 2019 at 4:39 AM Christian Brauner <christian@brauner.io> wrote:
> > >
> > > On Tue, May 21, 2019 at 01:30:29PM +0200, Christian Brauner wrote:
> > > > On Tue, May 21, 2019 at 08:05:52PM +0900, Minchan Kim wrote:
> > > > > On Tue, May 21, 2019 at 10:42:00AM +0200, Christian Brauner wrote:
> > > > > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote:
> > > > > > > - Background
> > > > > > >
> > > > > > > The Android terminology used for forking a new process and starting an app
> > > > > > > from scratch is a cold start, while resuming an existing app is a hot start.
> > > > > > > While we continually try to improve the performance of cold starts, hot
> > > > > > > starts will always be significantly less power hungry as well as faster so
> > > > > > > we are trying to make hot start more likely than cold start.
> > > > > > >
> > > > > > > To increase hot start, Android userspace manages the order that apps should
> > > > > > > be killed in a process called ActivityManagerService. ActivityManagerService
> > > > > > > tracks every Android app or service that the user could be interacting with
> > > > > > > at any time and translates that into a ranked list for lmkd(low memory
> > > > > > > killer daemon). They are likely to be killed by lmkd if the system has to
> > > > > > > reclaim memory. In that sense they are similar to entries in any other cache.
> > > > > > > Those apps are kept alive for opportunistic performance improvements but
> > > > > > > those performance improvements will vary based on the memory requirements of
> > > > > > > individual workloads.
> > > > > > >
> > > > > > > - Problem
> > > > > > >
> > > > > > > Naturally, cached apps were dominant consumers of memory on the system.
> > > > > > > However, they were not significant consumers of swap even though they are
> > > > > > > good candidate for swap. Under investigation, swapping out only begins
> > > > > > > once the low zone watermark is hit and kswapd wakes up, but the overall
> > > > > > > allocation rate in the system might trip lmkd thresholds and cause a cached
> > > > > > > process to be killed(we measured performance swapping out vs. zapping the
> > > > > > > memory by killing a process. Unsurprisingly, zapping is 10x times faster
> > > > > > > even though we use zram which is much faster than real storage) so kill
> > > > > > > from lmkd will often satisfy the high zone watermark, resulting in very
> > > > > > > few pages actually being moved to swap.
> > > > > > >
> > > > > > > - Approach
> > > > > > >
> > > > > > > The approach we chose was to use a new interface to allow userspace to
> > > > > > > proactively reclaim entire processes by leveraging platform information.
> > > > > > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> > > > > > > that are known to be cold from userspace and to avoid races with lmkd
> > > > > > > by reclaiming apps as soon as they entered the cached state. Additionally,
> > > > > > > it could provide many chances for platform to use much information to
> > > > > > > optimize memory efficiency.
> > > > > > >
> > > > > > > IMHO we should spell it out that this patchset complements MADV_DONTNEED
> > > > > > > and MADV_FREE by adding non-destructive ways to gain some free memory
> > > > > > > space. MADV_COLD is similar to MADV_DONTNEED in a way that it hints the
> > > > > > > kernel that memory region is not currently needed and should be reclaimed
> > > > > > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the
> > > > > > > kernel that memory region is not currently needed and should be reclaimed
> > > > > > > when memory pressure rises.
> > > > > > >
> > > > > > > To achieve the goal, the patchset introduces two new options for madvise.
> > > > > > > One is MADV_COOL, which will deactivate active pages, and the other is
> > > > > > > MADV_COLD, which will reclaim private pages instantly. These new options
> > > > > > > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> > > > > > > gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way
> > > > > > > that it hints the kernel that memory region is not currently needed and
> > > > > > > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way
> > > > > > > that it hints the kernel that memory region is not currently needed and
> > > > > > > should be reclaimed when memory pressure rises.
> > > > > > >
> > > > > > > This approach is similar in spirit to madvise(MADV_DONTNEED), but the
> > > > > > > information required to make the reclaim decision is not known to the app.
> > > > > > > Instead, it is known to a centralized userspace daemon, and that daemon
> > > > > > > must be able to initiate reclaim on its own without any app involvement.
> > > > > > > To address this, the patch introduces a new syscall:
> > > > > > >
> > > > > > >         struct pr_madvise_param {
> > > > > > >                 int size;
> > > > > > >                 const struct iovec *vec;
> > > > > > >         }
> > > > > > >
> > > > > > >         int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
> > > > > > >                                 struct pr_madvise_param *results,
> > > > > > >                                 struct pr_madvise_param *ranges,
> > > > > > >                                 unsigned long flags);
> > > > > > >
> > > > > > > The syscall takes a pidfd to give hints to an external process, and a
> > > > > > > pair of results/ranges vector arguments so that it can apply several
> > > > > > > hints to multiple address ranges at once.
> > > > > > >
> > > > > > > I guess others have different ideas about the naming of the syscall and
> > > > > > > options, so feel free to suggest better names.
> > > > > >
> > > > > > Yes, all new syscalls making use of pidfds should be named
> > > > > > pidfd_<action>. So please make this pidfd_madvise.
> > > > >
> > > > > I don't have any particular preference but just wondering why pidfd is
> > > > > so special to have it as prefix of system call name.
> > > >
> > > > It's a whole new API to address processes. We already have
> > > > clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you
> > > > exported pidfd_to_pid(). And we're going to have pidfd_open(). Your
> > > > syscall works only with pidfds, so it's tied to this API as well and
> > > > should follow the naming scheme. This also makes life easier for
> > > > userspace and is consistent.
> > >
> > > This is at least my reasoning. I'm not going to make this a whole big
> > > pedantic argument. If people have really strong feelings about not using
> > > this prefix then fine. But if syscalls can be grouped together and have
> > > consistent naming this is always a big plus.
> >
> > My hope has been that pidfd use becomes normalized enough that
> > prefixing "pidfd_" to pidfd-accepting system calls becomes redundant.
> > We write write(), not fd_write(), right? :-) pidfd_open() makes sense
> > because the primary purpose of this system call is to operate on a
> > pidfd, but I think process_madvise() is fine.
>
> > This madvise syscall operates only on pidfds. It would make sense to
> > name it process_madvise() if it were to operate on both pid_t and int pidfd.

The name of the function ought to encode its purpose, not its
signature. The system call under discussion operates on processes and
so should be called "process_madvise". That this system call happens
to accept a pidfd to identify the process on which it operates is not
the most interesting aspect of the system call. The argument type
isn't important enough to spotlight in the permanent name of an API.
Pidfds are novel now, but they won't be novel in the future.

> Giving specific names to system calls won't stop it from becoming
> normalized.

We could name system calls with `cat /dev/urandom | xxd` and they'd
still get used. It doesn't follow that all names are equally good.
Christian Brauner May 22, 2019, 2:52 p.m. UTC | #29
On Wed, May 22, 2019 at 06:16:35AM -0700, Daniel Colascione wrote:
> [...]
> The name of the function ought to encode its purpose, not its
> signature. The system call under discussion operates on processes and
> so should be called "process_madvise". That this system call happens
> to accept a pidfd to identify the process on which it operates is not
> the most interesting aspect of the system call. The argument type
> isn't important enough to spotlight in the permanent name of an API.
> Pidfds are novel now, but they won't be novel in the future.

I'm not going to go into yet another long argument. I prefer pidfd_*.
It's tied to the api, transparent for userspace, and disambiguates it
from process_vm_{read,write}v that both take a pid_t.
Daniel Colascione May 22, 2019, 3:17 p.m. UTC | #30
On Wed, May 22, 2019 at 7:52 AM Christian Brauner <christian@brauner.io> wrote:
> I'm not going to go into yet another long argument. I prefer pidfd_*.

Ok. We're each allowed our opinion.

> It's tied to the api, transparent for userspace, and disambiguates it
> from process_vm_{read,write}v that both take a pid_t.

Speaking of process_vm_readv and process_vm_writev: both have a
currently-unused flags argument. Both should grow a flag that tells
them to interpret the pid argument as a pidfd. Or do you support
adding pidfd_vm_readv and pidfd_vm_writev system calls? If not, why
should process_madvise be called pidfd_madvise while process_vm_readv
isn't called pidfd_vm_readv?
Christian Brauner May 22, 2019, 3:48 p.m. UTC | #31
On Wed, May 22, 2019 at 08:17:23AM -0700, Daniel Colascione wrote:
> [...]
> Speaking of process_vm_readv and process_vm_writev: both have a
> currently-unused flags argument. Both should grow a flag that tells
> them to interpret the pid argument as a pidfd. Or do you support
> adding pidfd_vm_readv and pidfd_vm_writev system calls? If not, why
> should process_madvise be called pidfd_madvise while process_vm_readv
> isn't called pidfd_vm_readv?

Actually, you should then do the same with process_madvise() and give it
a flag for that too if that's not too crazy.

Christian
Daniel Colascione May 22, 2019, 3:57 p.m. UTC | #32
On Wed, May 22, 2019 at 8:48 AM Christian Brauner <christian@brauner.io> wrote:
> [...]
>
> Actually, you should then do the same with process_madvise() and give it
> a flag for that too if that's not too crazy.

I don't know what you mean. My gut feeling is that for the sake of
consistency, process_madvise, process_vm_readv, and process_vm_writev
should all accept a first argument interpreted as either a numeric PID
or a pidfd depending on a flag --- ideally the same flag. Is that what
you have in mind?
Christian Brauner May 22, 2019, 4:01 p.m. UTC | #33
On Wed, May 22, 2019 at 08:57:47AM -0700, Daniel Colascione wrote:
> [...]
>
> I don't know what you mean. My gut feeling is that for the sake of
> consistency, process_madvise, process_vm_readv, and process_vm_writev
> should all accept a first argument interpreted as either a numeric PID
> or a pidfd depending on a flag --- ideally the same flag. Is that what
> you have in mind?

Yes. For the sake of consistency they should probably all default to
interpret as pid and if say PROCESS_{VM_}PIDFD is passed as flag
interpret as pidfd.
Daniel Colascione May 22, 2019, 4:01 p.m. UTC | #34
On Wed, May 22, 2019 at 9:01 AM Christian Brauner <christian@brauner.io> wrote:
> [...]
>
> Yes. For the sake of consistency they should probably all default to
> interpret as pid and if say PROCESS_{VM_}PIDFD is passed as flag
> interpret as pidfd.

Sounds good to me!
Minchan Kim May 23, 2019, 1:07 p.m. UTC | #35
On Wed, May 22, 2019 at 09:01:33AM -0700, Daniel Colascione wrote:
> On Wed, May 22, 2019 at 9:01 AM Christian Brauner <christian@brauner.io> wrote:
> > [...]
> >
> > Yes. For the sake of consistency they should probably all default to
> > interpret as pid and if say PROCESS_{VM_}PIDFD is passed as flag
> > interpret as pidfd.
> 
> Sounds good to me!

Then, I want to change from pidfd to pid in the next revision and stick
with process_madvise as the name. Later, you guys could define a
PROCESS_PIDFD flag and change every process_xxx syscall all at once.

If you are faster, so that I see PROCESS_PIDFD earlier, I am happy to
use it.

Thanks.
Minchan Kim May 27, 2019, 8:06 a.m. UTC | #36
On Thu, May 23, 2019 at 10:07:17PM +0900, Minchan Kim wrote:
> On Wed, May 22, 2019 at 09:01:33AM -0700, Daniel Colascione wrote:
> > [...]
> > 
> > Sounds good to me!
> 
> Then, I want to change from pidfd to pid in the next revision and stick
> with process_madvise as the name. Later, you guys could define a
> PROCESS_PIDFD flag and change every process_xxx syscall all at once.
> 
> If you are faster, so that I see PROCESS_PIDFD earlier, I am happy to
> use it.

Hi Folks,

I don't want to consume a new API flag too early, so I will stick with
process_madvise taking a pidfd argument, because I agree with Daniel
that we don't need to expose the implementation in the syscall name.

I hope every upcoming process-related syscall takes a pidfd by default,
so that people gradually become familiar with pidfds and, in the long
run, pid is naturally replaced with pidfd.
Anshuman Khandual May 28, 2019, 10:50 a.m. UTC | #37
On 05/21/2019 04:04 PM, Michal Hocko wrote:
> On Tue 21-05-19 08:25:55, Anshuman Khandual wrote:
>> On 05/20/2019 10:29 PM, Tim Murray wrote:
> [...]
>>> not seem to introduce a noticeable hot start penalty, nor does it
>>> cause an increase in performance problems later in the app's
>>> lifecycle. I've measured with and without process_madvise, and the
>>> differences are within our noise bounds. Second, because we're not
>>
>> That assumes that the application's working set after process_madvise()
>> is always smaller. There is another challenge: the external process should
>> ideally know the active areas of the application's working set in order to
>> invoke process_madvise() correctly and prevent such scenarios.
> 
> But that doesn't really seem relevant for the API itself, right? The
> higher-level logic is the monitor's business.

Right. I was just wondering how the monitor would even decide which areas of
the target application are active or inactive. The target application is still
just an opaque entity to the monitor unless there is some sort of communication.
But you are right, this is not relevant to the API itself.