Message ID: 20200413215750.7239-1-lmoiseichuk@magicleap.com (mailing list archive)
Series: memcg, vmpressure: expose vmpressure controls
On Mon 13-04-20 17:57:48, svc_lmoiseichuk@magicleap.com wrote:
> From: Leonid Moiseichuk <lmoiseichuk@magicleap.com>
>
> Small tweak to populate vmpressure parameters to userspace without
> any built-in logic change.
>
> The vmpressure is used actively (e.g. on Android) to track mm stress.
> vmpressure parameters were selected empirically quite a long time ago
> and are not always suitable for modern memory configurations.

This needs much more detail. Why is it not suitable? What are the usual
numbers you need to set up for it to work properly? Why wouldn't those be
generally applicable?

Anyway, I have to confess I am not a big fan of this. vmpressure turned
out to be a very weak interface to measure memory pressure. Not only is
it not NUMA aware, which makes it unusable on many systems, it also gives
data way too late in practice.

Btw. why don't you use /proc/pressure/memory, resp. its memcg counterpart,
to measure the memory pressure in the first place?

> Leonid Moiseichuk (2):
>   memcg: expose vmpressure knobs
>   memcg, vmpressure: expose vmpressure controls
>
>  .../admin-guide/cgroup-v1/memory.rst |  12 +-
>  include/linux/vmpressure.h           |  35 ++++++
>  mm/memcontrol.c                      | 113 ++++++++++++++++++
>  mm/vmpressure.c                      | 101 +++++++---------
>  4 files changed, 200 insertions(+), 61 deletions(-)
>
> --
> 2.17.1
Thanks Michal for the quick response, see my answers below. I will update the
commit message with numbers for 8 GB memory swapless devices.

On Tue, Apr 14, 2020 at 7:37 AM Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 13-04-20 17:57:48, svc_lmoiseichuk@magicleap.com wrote:
> > From: Leonid Moiseichuk <lmoiseichuk@magicleap.com>
> >
> > Small tweak to populate vmpressure parameters to userspace without
> > any built-in logic change.
> >
> > The vmpressure is used actively (e.g. on Android) to track mm stress.
> > vmpressure parameters were selected empirically quite a long time ago
> > and are not always suitable for modern memory configurations.
>
> This needs much more detail. Why is it not suitable? What are the usual
> numbers you need to set up for it to work properly? Why wouldn't those be
> generally applicable?

As far as I can see, the numbers vmpressure uses track closely with the RSS
of userspace processes, i.e. with memory utilization. The default calibration
of memory.pressure_level_medium at 60% makes an 8 GB device hit the medium
threshold when RSS utilization reaches ~5 GB, and that is a bit too early; I
observe it happening immediately after boot. A reasonable level would be in
the 70-80% range, depending on the software preloaded on the device.

From another point of view, with memory.pressure_level_critical set to 95%
the critical level may never trigger, because by then the OOM killer already
starts killing processes, and in some cases that is even worse than the now
removed Android low memory killer. For such cases it makes sense to shift the
threshold down to 85-90% so the device handles low memory situations reliably
and does not rely only on oom_score_adj hints.

The next important parameter for tweaking is memory.pressure_window, which it
makes sense to double in order to reduce the number of userspace activations
and save some power by reducing sensitivity.

For 12 and 16 GB devices the situation will be similar but worse; with the
current settings they will hit medium memory pressure while ~5 or 6.5 GB of
memory is still free.

> Anyway, I have to confess I am not a big fan of this. vmpressure turned
> out to be a very weak interface to measure memory pressure. Not only is
> it not NUMA aware, which makes it unusable on many systems, it also gives
> data way too late in practice.
>
> Btw. why don't you use /proc/pressure/memory, resp. its memcg counterpart,
> to measure the memory pressure in the first place?

According to our checks PSI produces numbers only when swap is enabled, e.g.
on a swapless device at 75% RAM utilization:

==> /proc/pressure/io <==
some avg10=0.00 avg60=1.18 avg300=1.51 total=9642648
full avg10=0.00 avg60=1.11 avg300=1.47 total=9271174

==> /proc/pressure/memory <==
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

It is probably possible to activate PSI by introducing heavy IO with swap
enabled, but that is not a typical case for mobile devices.

In the swap-enabled case, memory pressure follows IO pressure at some
fraction, i.e. memory is io/2 ... io/10 depending on the pattern. A light
sysbench case with swap enabled:

==> /proc/pressure/io <==
some avg10=0.00 avg60=0.00 avg300=0.11 total=155383820
full avg10=0.00 avg60=0.00 avg300=0.05 total=100516966

==> /proc/pressure/memory <==
some avg10=0.00 avg60=0.00 avg300=0.06 total=465916397
full avg10=0.00 avg60=0.00 avg300=0.00 total=368664282

Since not all devices have zram or swap enabled, it makes sense to have a
vmpressure tuning option: vmpressure is widely used in Android and the
related issues are well understood.
> > Leonid Moiseichuk (2):
> >   memcg: expose vmpressure knobs
> >   memcg, vmpressure: expose vmpressure controls
> >
> >  .../admin-guide/cgroup-v1/memory.rst |  12 +-
> >  include/linux/vmpressure.h           |  35 ++++++
> >  mm/memcontrol.c                      | 113 ++++++++++++++++++
> >  mm/vmpressure.c                      | 101 +++++++---------
> >  4 files changed, 200 insertions(+), 61 deletions(-)
> >
> > --
> > 2.17.1
>
> --
> Michal Hocko
> SUSE Labs
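(For context: the series above tunes the existing cgroup-v1 vmpressure
notification interface documented in admin-guide/cgroup-v1/memory.rst. A
minimal userspace sketch of how a daemon such as lmkd subscribes to those
events follows; the /sys/fs/cgroup/memory mount point is an assumption and
error handling is trimmed.)

/*
 * Minimal sketch of how userspace consumes the existing cgroup-v1
 * vmpressure events that the proposed knobs would tune. The mount
 * point /sys/fs/cgroup/memory is an assumption.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
	const char *cg = "/sys/fs/cgroup/memory";	/* assumed mount point */
	char path[256], ctl[64];
	uint64_t hits;

	int efd = eventfd(0, 0);

	snprintf(path, sizeof(path), "%s/memory.pressure_level", cg);
	int lfd = open(path, O_RDONLY);

	/* registration string: "<event_fd> <pressure_level_fd> <level>" */
	snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
	int cfd = open(path, O_WRONLY);
	snprintf(ctl, sizeof(ctl), "%d %d medium", efd, lfd);
	if (efd < 0 || lfd < 0 || cfd < 0 || write(cfd, ctl, strlen(ctl)) < 0) {
		perror("vmpressure event registration");
		return 1;
	}

	/* each read blocks until the kernel signals "medium" pressure */
	while (read(efd, &hits, sizeof(hits)) == sizeof(hits))
		printf("medium vmpressure reported %llu time(s)\n",
		       (unsigned long long)hits);
	return 0;
}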
On Tue 14-04-20 12:42:44, Leonid Moiseichuk wrote:
> Thanks Michal for the quick response, see my answers below. I will update
> the commit message with numbers for 8 GB memory swapless devices.
>
> On Tue, Apr 14, 2020 at 7:37 AM Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 13-04-20 17:57:48, svc_lmoiseichuk@magicleap.com wrote:
> > > From: Leonid Moiseichuk <lmoiseichuk@magicleap.com>
> > >
> > > Small tweak to populate vmpressure parameters to userspace without
> > > any built-in logic change.
> > >
> > > The vmpressure is used actively (e.g. on Android) to track mm stress.
> > > vmpressure parameters were selected empirically quite a long time ago
> > > and are not always suitable for modern memory configurations.
> >
> > This needs much more detail. Why is it not suitable? What are the usual
> > numbers you need to set up for it to work properly? Why wouldn't those be
> > generally applicable?
>
> As far as I can see, the numbers vmpressure uses track closely with the RSS
> of userspace processes, i.e. with memory utilization. The default calibration
> of memory.pressure_level_medium at 60% makes an 8 GB device hit the medium
> threshold when RSS utilization reaches ~5 GB, and that is a bit too early; I
> observe it happening immediately after boot. A reasonable level would be in
> the 70-80% range, depending on the software preloaded on the device.

I am not sure I follow. The levels are based on reclaim ineffectivity, not on
overall memory utilization. So it only takes 40% reclaim effectivity to
trigger the medium level. While you are right that the threshold for the
event is pretty arbitrary, I would like to hear why that doesn't work in your
environment. It shouldn't really depend on the amount of memory as this is a
percentage, right?

> From another point of view, with memory.pressure_level_critical set to 95%
> the critical level may never trigger, because by then the OOM killer already
> starts killing processes, and in some cases that is even worse than the now
> removed Android low memory killer. For such cases it makes sense to shift the
> threshold down to 85-90% so the device handles low memory situations reliably
> and does not rely only on oom_score_adj hints.
>
> The next important parameter for tweaking is memory.pressure_window, which it
> makes sense to double in order to reduce the number of userspace activations
> and save some power by reducing sensitivity.

Could you be more specific, please?

> For 12 and 16 GB devices the situation will be similar but worse; with the
> current settings they will hit medium memory pressure while ~5 or 6.5 GB of
> memory is still free.
>
> > Anyway, I have to confess I am not a big fan of this. vmpressure turned
> > out to be a very weak interface to measure memory pressure. Not only is
> > it not NUMA aware, which makes it unusable on many systems, it also gives
> > data way too late in practice.
> >
> > Btw. why don't you use /proc/pressure/memory, resp. its memcg counterpart,
> > to measure the memory pressure in the first place?
>
> According to our checks PSI produces numbers only when swap is enabled, e.g.
> on a swapless device at 75% RAM utilization:

I believe you should discuss that with the people familiar with PSI
internals (Johannes already in the CC list).
On Tue, Apr 14, 2020 at 12:42:44PM -0400, Leonid Moiseichuk wrote:
> On Tue, Apr 14, 2020 at 7:37 AM Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 13-04-20 17:57:48, svc_lmoiseichuk@magicleap.com wrote:
> > Anyway, I have to confess I am not a big fan of this. vmpressure turned
> > out to be a very weak interface to measure memory pressure. Not only is
> > it not NUMA aware, which makes it unusable on many systems, it also gives
> > data way too late in practice.

Yes, it's late in the game for vmpressure, and also a bit too late for
extensive changes in cgroup1.

> > Btw. why don't you use /proc/pressure/memory, resp. its memcg counterpart,
> > to measure the memory pressure in the first place?
>
> According to our checks PSI produces numbers only when swap is enabled, e.g.
> on a swapless device at 75% RAM utilization:
> ==> /proc/pressure/io <==
> some avg10=0.00 avg60=1.18 avg300=1.51 total=9642648
> full avg10=0.00 avg60=1.11 avg300=1.47 total=9271174
>
> ==> /proc/pressure/memory <==
> some avg10=0.00 avg60=0.00 avg300=0.00 total=0
> full avg10=0.00 avg60=0.00 avg300=0.00 total=0

That doesn't look right. With total=0, there couldn't have been any
reclaim activity, which means that vmpressure couldn't have reported
anything either.

By the time vmpressure reports a drop in reclaim efficiency, psi
should have already been reporting time spent doing reclaim. It
reports a superset of the information conveyed by vmpressure.

> It is probably possible to activate PSI by introducing heavy IO with swap
> enabled, but that is not a typical case for mobile devices.
>
> In the swap-enabled case, memory pressure follows IO pressure at some
> fraction, i.e. memory is io/2 ... io/10 depending on the pattern. A light
> sysbench case with swap enabled:
> ==> /proc/pressure/io <==
> some avg10=0.00 avg60=0.00 avg300=0.11 total=155383820
> full avg10=0.00 avg60=0.00 avg300=0.05 total=100516966
> ==> /proc/pressure/memory <==
> some avg10=0.00 avg60=0.00 avg300=0.06 total=465916397
> full avg10=0.00 avg60=0.00 avg300=0.00 total=368664282
>
> Since not all devices have zram or swap enabled, it makes sense to have a
> vmpressure tuning option: vmpressure is widely used in Android and the
> related issues are well understood.

Android (since 10 afaik) uses psi to make low memory / OOM decisions.
See the introduction of the psi poll() support:
https://lwn.net/Articles/782662/

It's true that with swap you may see a more gradual increase in
pressure, whereas without swap you may go from idle to OOM much
faster, depending on what type of memory is being allocated. But psi
will still report it. You may just have to use poll() to get in-time
notification like you do with vmpressure.
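(The psi trigger/poll() mechanism referred to above is documented in
Documentation/accounting/psi.rst; a minimal sketch follows. The 100ms-per-1s
threshold is an arbitrary example, not a recommendation.)

/*
 * Minimal sketch of a psi trigger: ask for a notification when at
 * least 100ms of "some" memory stall accumulates within any 1s window.
 */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char trig[] = "some 100000 1000000";	/* 100ms stall / 1s window */
	int fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };

	if (fd < 0 || write(fd, trig, strlen(trig) + 1) < 0) {
		perror("psi trigger setup");
		return 1;
	}

	for (;;) {
		if (poll(&pfd, 1, -1) < 0 || (pfd.revents & POLLERR))
			break;				/* monitored fd went away */
		if (pfd.revents & POLLPRI)
			printf("memory stall threshold crossed\n");
	}
	close(fd);
	return 0;
}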
It would be nice if you could specify the exact numbers you would like to see.

On Tue, Apr 14, 2020 at 2:49 PM Michal Hocko <mhocko@kernel.org> wrote:
> ....
> > As far as I can see, the numbers vmpressure uses track closely with the RSS
> > of userspace processes, i.e. with memory utilization. The default calibration
> > of memory.pressure_level_medium at 60% makes an 8 GB device hit the medium
> > threshold when RSS utilization reaches ~5 GB, and that is a bit too early; I
> > observe it happening immediately after boot. A reasonable level would be in
> > the 70-80% range, depending on the software preloaded on the device.
>
> I am not sure I follow. The levels are based on reclaim ineffectivity, not on
> overall memory utilization. So it only takes 40% reclaim effectivity to
> trigger the medium level. While you are right that the threshold for the
> event is pretty arbitrary, I would like to hear why that doesn't work in your
> environment. It shouldn't really depend on the amount of memory as this is a
> percentage, right?

It depends not only on the amount of memory or on reclaim, but also on what
software is running.

As I see it from vmscan.c, vmpressure is activated from the various
shrink_node() paths, basically from do_try_to_free_pages(). To hit that state
you need to be short of memory for whatever reason, so the amount of memory
does play a role here. My case in particular is heavily impacted by GPU
(CMA-backed) consumption, which can easily take gigabytes, and apps can take
a gigabyte as well. So reclaim is called quite often when memory runs short
(4K calls are possible).

A level change is only processed once the number of scanned pages exceeds the
window size; 512 pages is too little, as that is only 2 MB. Such small slices
are a source of false triggers.

Next, pressure is computed as:

	unsigned long scale = scanned + reclaimed;
	pressure = scale - (reclaimed * scale / scanned);
	pressure = pressure * 100 / scale;

For a 512-page window (the minimum), that means 204 reclaimed pages land at
the 60% (medium) threshold and 25 reclaimed pages at 95% (critical).

When pressure does build up (usually at 85% of memory used, hitting the
critical level) I rarely see anything close to the real numbers:

	vmpressure_work_fn: scanned 545, reclaimed 144     <-- 73%
	vmpressure_work_fn: scanned 16283, reclaimed 2495  <-- same session but 83%

Most of the time it is looping between kswapd and lmkd reclaiming failures,
consuming quite a lot of CPU.

On the vmscan calls everything looks as expected:

	[  312.410938] vmpressure: tree 0 scanned 4, reclaimed 2
	[  312.410939] vmpressure: tree 0 scanned 120, reclaimed 62
	[  312.410939] vmpressure: tree 1 scanned 2, reclaimed 1
	[  312.410940] vmpressure: tree 1 scanned 120, reclaimed 62
	[  312.410941] vmpressure: tree 0 scanned 0, reclaimed 0

> > From another point of view, with memory.pressure_level_critical set to 95%
> > the critical level may never trigger, because by then the OOM killer already
> > starts killing processes, and in some cases that is even worse than the now
> > removed Android low memory killer. For such cases it makes sense to shift the
> > threshold down to 85-90% so the device handles low memory situations reliably
> > and does not rely only on oom_score_adj hints.
> >
> > The next important parameter for tweaking is memory.pressure_window, which it
> > makes sense to double in order to reduce the number of userspace activations
> > and save some power by reducing sensitivity.
>
> Could you be more specific, please?

Those are the parameters that are most sensitive to tweaking for me.
At least those who use vmpressure will be able to tune it up or down
depending on their combination of apps.

> > For 12 and 16 GB devices the situation will be similar but worse; with the
> > current settings they will hit medium memory pressure while ~5 or 6.5 GB of
> > memory is still free.
> >
> > > Anyway, I have to confess I am not a big fan of this. vmpressure turned
> > > out to be a very weak interface to measure memory pressure. Not only is
> > > it not NUMA aware, which makes it unusable on many systems, it also gives
> > > data way too late in practice.
> > >
> > > Btw. why don't you use /proc/pressure/memory, resp. its memcg counterpart,
> > > to measure the memory pressure in the first place?
> >
> > According to our checks PSI produces numbers only when swap is enabled, e.g.
> > on a swapless device at 75% RAM utilization:
>
> I believe you should discuss that with the people familiar with PSI
> internals (Johannes already in the CC list).

Thanks for the pointer, I will reply to his mails.

> --
> Michal Hocko
> SUSE Labs
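(The integer arithmetic quoted above can be checked in isolation; the small
userspace sketch below, not kernel code, reproduces the 204-page and 25-page
figures claimed for a 512-page window.)

/*
 * Standalone check of the vmpressure arithmetic quoted above, which is
 * roughly 100 * (1 - reclaimed/scanned) done in integer math.
 */
#include <stdio.h>

static unsigned long vmpressure_calc(unsigned long scanned, unsigned long reclaimed)
{
	unsigned long scale = scanned + reclaimed;
	unsigned long pressure = scale - (reclaimed * scale / scanned);

	return pressure * 100 / scale;		/* caller guarantees scanned > 0 */
}

int main(void)
{
	printf("204/512 reclaimed -> %lu%%\n", vmpressure_calc(512, 204));	/* 60: medium */
	printf(" 25/512 reclaimed -> %lu%%\n", vmpressure_calc(512, 25));	/* 95: critical */
	printf("144/545 reclaimed -> %lu%%\n", vmpressure_calc(545, 144));	/* 73, as reported above */
	return 0;
}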
I do not agree with all of the comments, see below.

On Tue, Apr 14, 2020 at 3:23 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Tue, Apr 14, 2020 at 12:42:44PM -0400, Leonid Moiseichuk wrote:
> > On Tue, Apr 14, 2020 at 7:37 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > On Mon 13-04-20 17:57:48, svc_lmoiseichuk@magicleap.com wrote:
> > > Anyway, I have to confess I am not a big fan of this. vmpressure turned
> > > out to be a very weak interface to measure memory pressure. Not only is
> > > it not NUMA aware, which makes it unusable on many systems, it also gives
> > > data way too late in practice.
>
> Yes, it's late in the game for vmpressure, and also a bit too late for
> extensive changes in cgroup1.

200 lines just to move functionality from one place to another without a
logic change? That does not seem like an extensive change to me.

> > > Btw. why don't you use /proc/pressure/memory, resp. its memcg counterpart,
> > > to measure the memory pressure in the first place?
> >
> > According to our checks PSI produces numbers only when swap is enabled, e.g.
> > on a swapless device at 75% RAM utilization:
> > ==> /proc/pressure/io <==
> > some avg10=0.00 avg60=1.18 avg300=1.51 total=9642648
> > full avg10=0.00 avg60=1.11 avg300=1.47 total=9271174
> >
> > ==> /proc/pressure/memory <==
> > some avg10=0.00 avg60=0.00 avg300=0.00 total=0
> > full avg10=0.00 avg60=0.00 avg300=0.00 total=0
>
> That doesn't look right. With total=0, there couldn't have been any
> reclaim activity, which means that vmpressure couldn't have reported
> anything either.

Unfortunately that is not what I see: vmpressure does report reclaim activity
(I shared the numbers and call counts in a parallel mail), and kswapd + lmkd
consume quite a lot of CPU cycles. That is the same device with swap disabled.
If I enable swap (zram based, as Android usually does) PSI starts to produce
some numbers below 0.1, which does not look like huge pressure.

> By the time vmpressure reports a drop in reclaim efficiency, psi
> should have already been reporting time spent doing reclaim. It
> reports a superset of the information conveyed by vmpressure.
>
> > It is probably possible to activate PSI by introducing heavy IO with swap
> > enabled, but that is not a typical case for mobile devices.
> >
> > In the swap-enabled case, memory pressure follows IO pressure at some
> > fraction, i.e. memory is io/2 ... io/10 depending on the pattern. A light
> > sysbench case with swap enabled:
> > ==> /proc/pressure/io <==
> > some avg10=0.00 avg60=0.00 avg300=0.11 total=155383820
> > full avg10=0.00 avg60=0.00 avg300=0.05 total=100516966
> > ==> /proc/pressure/memory <==
> > some avg10=0.00 avg60=0.00 avg300=0.06 total=465916397
> > full avg10=0.00 avg60=0.00 avg300=0.00 total=368664282
> >
> > Since not all devices have zram or swap enabled, it makes sense to have a
> > vmpressure tuning option: vmpressure is widely used in Android and the
> > related issues are well understood.
>
> Android (since 10 afaik) uses psi to make low memory / OOM decisions.
> See the introduction of the psi poll() support:
> https://lwn.net/Articles/782662/

Android makes a selection between PSI (primary) and vmpressure (backup), see
line 2872+ of
https://android.googlesource.com/platform/system/memory/lmkd/+/refs/heads/master/lmkd.cpp#2872

> It's true that with swap you may see a more gradual increase in
> pressure, whereas without swap you may go from idle to OOM much
> faster, depending on what type of memory is being allocated. But psi
> will still report it. You may just have to use poll() to get in-time
> notification like you do with vmpressure.

I expected that any spikes would be visible at the previous averaging level,
e.g. 10s. I cannot confirm that right now, but I could play around with it.
If you have preferences about use-cases, please let me know.
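(Sampling the averages instead of using triggers looks roughly like the
sketch below: it reads avg10 from /proc/pressure/memory every 10 seconds,
which illustrates why spikes shorter than the averaging window may not show
up without poll().)

/*
 * Minimal sketch of the "sample the averages" alternative to triggers.
 * Spikes shorter than the averaging window can be smoothed away, which
 * is why the trigger + poll() approach earlier reacts faster.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	for (;;) {
		FILE *f = fopen("/proc/pressure/memory", "r");
		char line[256];
		float some10 = -1.0f, full10 = -1.0f;

		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f)) {
			sscanf(line, "some avg10=%f", &some10);
			sscanf(line, "full avg10=%f", &full10);
		}
		fclose(f);

		printf("memory stall avg10: some=%.2f%% full=%.2f%%\n", some10, full10);
		sleep(10);
	}
	return 0;
}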
On Tue 14-04-20 16:53:55, Leonid Moiseichuk wrote:
> It would be nice if you could specify the exact numbers you would like to see.

You are proposing an interface which allows thresholds to be tuned from
userspace, which suggests that you want to tune them. I am asking what kind
of tuning you are using and why we cannot use those values as defaults in
the kernel.

> On Tue, Apr 14, 2020 at 2:49 PM Michal Hocko <mhocko@kernel.org> wrote:
> > ....
> > > As far as I can see, the numbers vmpressure uses track closely with the
> > > RSS of userspace processes, i.e. with memory utilization. The default
> > > calibration of memory.pressure_level_medium at 60% makes an 8 GB device
> > > hit the medium threshold when RSS utilization reaches ~5 GB, and that is
> > > a bit too early; I observe it happening immediately after boot. A
> > > reasonable level would be in the 70-80% range, depending on the software
> > > preloaded on the device.
> >
> > I am not sure I follow. The levels are based on reclaim ineffectivity, not
> > on overall memory utilization. So it only takes 40% reclaim effectivity to
> > trigger the medium level. While you are right that the threshold for the
> > event is pretty arbitrary, I would like to hear why that doesn't work in
> > your environment. It shouldn't really depend on the amount of memory as
> > this is a percentage, right?
>
> It depends not only on the amount of memory or on reclaim, but also on what
> software is running.
>
> As I see it from vmscan.c, vmpressure is activated from the various
> shrink_node() paths, basically from do_try_to_free_pages(). To hit that
> state you need to be short of memory for whatever reason, so the amount of
> memory does play a role here. My case in particular is heavily impacted by
> GPU (CMA-backed) consumption, which can easily take gigabytes, and apps can
> take a gigabyte as well. So reclaim is called quite often when memory runs
> short (4K calls are possible).
>
> A level change is only processed once the number of scanned pages exceeds
> the window size; 512 pages is too little, as that is only 2 MB. Such small
> slices are a source of false triggers.
>
> Next, pressure is computed as:
>	unsigned long scale = scanned + reclaimed;
>	pressure = scale - (reclaimed * scale / scanned);
>	pressure = pressure * 100 / scale;

Just to make this more obvious, this is essentially
	100 * (1 - reclaimed/scanned)

> For a 512-page window (the minimum), that means 204 reclaimed pages land at
> the 60% (medium) threshold and 25 reclaimed pages at 95% (critical).
>
> When pressure does build up (usually at 85% of memory used, hitting the
> critical level)

I still find this very confusing because the amount of used memory is not
really important. It really only depends on the reclaim activity, and that
is either the memcg or the global reclaim. And you only get critical levels
if reclaim is failing on far too many pages.

> I rarely see anything close to the real numbers:
> vmpressure_work_fn: scanned 545, reclaimed 144     <-- 73%
> vmpressure_work_fn: scanned 16283, reclaimed 2495  <-- same session but 83%
> Most of the time it is looping between kswapd and lmkd reclaiming failures,
> consuming quite a lot of CPU.
>
> On the vmscan calls everything looks as expected:
> [  312.410938] vmpressure: tree 0 scanned 4, reclaimed 2
> [  312.410939] vmpressure: tree 0 scanned 120, reclaimed 62
> [  312.410939] vmpressure: tree 1 scanned 2, reclaimed 1
> [  312.410940] vmpressure: tree 1 scanned 120, reclaimed 62
> [  312.410941] vmpressure: tree 0 scanned 0, reclaimed 0

This looks more like a problem of the vmpressure implementation to me than
something you want to work around by tuning.
On Tue 14-04-20 18:12:47, Leonid Moiseichuk wrote:
> I do not agree with all of the comments, see below.
>
> On Tue, Apr 14, 2020 at 3:23 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Tue, Apr 14, 2020 at 12:42:44PM -0400, Leonid Moiseichuk wrote:
> > > On Tue, Apr 14, 2020 at 7:37 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > On Mon 13-04-20 17:57:48, svc_lmoiseichuk@magicleap.com wrote:
> > > > Anyway, I have to confess I am not a big fan of this. vmpressure turned
> > > > out to be a very weak interface to measure memory pressure. Not only is
> > > > it not NUMA aware, which makes it unusable on many systems, it also
> > > > gives data way too late in practice.
> >
> > Yes, it's late in the game for vmpressure, and also a bit too late for
> > extensive changes in cgroup1.
>
> 200 lines just to move functionality from one place to another without a
> logic change? That does not seem like an extensive change to me.

Any user visible API is a big change. We have to maintain any API forever,
so there has to be a really strong reason / use case for inclusion. I haven't
heard any strong justification so far. It all seems to me that you are trying
to work around real vmpressure issues by fine-tuning parameters, and that is
almost always a bad reason for adding a new tunable.
As Chris Down stated, cgroups v1 is frozen, so no API changes in the mainline
kernel. If opinions change in the future I can continue polishing this
change. I will focus on PSI bugs for swapless/zram-swapped devices :)

The rest is below.

On Wed, Apr 15, 2020 at 3:51 AM Michal Hocko <mhocko@kernel.org> wrote:
> On Tue 14-04-20 16:53:55, Leonid Moiseichuk wrote:
> > It would be nice if you could specify the exact numbers you would like to see.
>
> You are proposing an interface which allows thresholds to be tuned from
> userspace, which suggests that you want to tune them. I am asking what kind
> of tuning you are using and why we cannot use those values as defaults in
> the kernel.

Yes, that kind of hack is the obvious one, but parameters selected at one
moment in time might not be good later. Plus these patches can be applied by
a vendor to e.g. Android 8 or 9, which has no PSI and is tweaked in its own
way; some products stick to old kernel versions. I made the documentation a
separate change to cover a wider set of older kernels. The patches are
transparent, tested and working fine.

> > On Tue, Apr 14, 2020 at 2:49 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > ....
> > > > As far as I can see, the numbers vmpressure uses track closely with
> > > > the RSS of userspace processes, i.e. with memory utilization. The
> > > > default calibration of memory.pressure_level_medium at 60% makes an
> > > > 8 GB device hit the medium threshold when RSS utilization reaches
> > > > ~5 GB, and that is a bit too early; I observe it happening immediately
> > > > after boot. A reasonable level would be in the 70-80% range, depending
> > > > on the software preloaded on the device.
> > >
> > > I am not sure I follow. The levels are based on reclaim ineffectivity,
> > > not on overall memory utilization. So it only takes 40% reclaim
> > > effectivity to trigger the medium level. While you are right that the
> > > threshold for the event is pretty arbitrary, I would like to hear why
> > > that doesn't work in your environment. It shouldn't really depend on the
> > > amount of memory as this is a percentage, right?
> >
> > It depends not only on the amount of memory or on reclaim, but also on
> > what software is running.
> >
> > As I see it from vmscan.c, vmpressure is activated from the various
> > shrink_node() paths, basically from do_try_to_free_pages(). To hit that
> > state you need to be short of memory for whatever reason, so the amount
> > of memory does play a role here. My case in particular is heavily
> > impacted by GPU (CMA-backed) consumption, which can easily take
> > gigabytes, and apps can take a gigabyte as well. So reclaim is called
> > quite often when memory runs short (4K calls are possible).
> >
> > A level change is only processed once the number of scanned pages exceeds
> > the window size; 512 pages is too little, as that is only 2 MB. Such
> > small slices are a source of false triggers.
> >
> > Next, pressure is computed as:
> >	unsigned long scale = scanned + reclaimed;
> >	pressure = scale - (reclaimed * scale / scanned);
> >	pressure = pressure * 100 / scale;
>
> Just to make this more obvious, this is essentially
>	100 * (1 - reclaimed/scanned)
>
> > For a 512-page window (the minimum), that means 204 reclaimed pages land
> > at the 60% (medium) threshold and 25 reclaimed pages at 95% (critical).
> >
> > When pressure does build up (usually at 85% of memory used, hitting the
> > critical level)
>
> I still find this very confusing because the amount of used memory is not
> really important.
> It really only depends on the reclaim activity, and that is either the
> memcg or the global reclaim. And you only get critical levels if reclaim
> is failing on far too many pages.

OK, agreed from that point of view. But on larger systems reclaim happens
less often, and we could use larger window sizes to get a better
approximation of memory utilization.

> > I rarely see anything close to the real numbers:
> > vmpressure_work_fn: scanned 545, reclaimed 144     <-- 73%
> > vmpressure_work_fn: scanned 16283, reclaimed 2495  <-- same session but 83%
> > Most of the time it is looping between kswapd and lmkd reclaiming
> > failures, consuming quite a lot of CPU.
> >
> > On the vmscan calls everything looks as expected:
> > [  312.410938] vmpressure: tree 0 scanned 4, reclaimed 2
> > [  312.410939] vmpressure: tree 0 scanned 120, reclaimed 62
> > [  312.410939] vmpressure: tree 1 scanned 2, reclaimed 1
> > [  312.410940] vmpressure: tree 1 scanned 120, reclaimed 62
> > [  312.410941] vmpressure: tree 0 scanned 0, reclaimed 0
>
> This looks more like a problem of the vmpressure implementation to me than
> something you want to work around by tuning.

Basically that is how it works: collect the scanned pages and kick the
worker to update the current level.

> --
> Michal Hocko
> SUSE Labs
On Wed 15-04-20 08:17:42, Leonid Moiseichuk wrote:
> As Chris Down stated, cgroups v1 is frozen, so no API changes in the
> mainline kernel.

Yes, this is true, _but_ if there are clear shortcomings in the existing
vmpressure implementation which could be addressed reasonably then there is
no reason to ignore them.

[...]

> > I still find this very confusing because the amount of used memory is not
> > really important. It really only depends on the reclaim activity, and
> > that is either the memcg or the global reclaim. And you only get critical
> > levels if reclaim is failing on far too many pages.
>
> OK, agreed from that point of view. But on larger systems reclaim happens
> less often, and we could use larger window sizes to get a better
> approximation of memory utilization.

Nobody is saying the window size has to be fixed. This all can be auto-tuned
in the kernel. It would, however, require defining what "better utilization
approximation" means much more specifically.

[...]

> > This looks more like a problem of the vmpressure implementation to me
> > than something you want to work around by tuning.
>
> Basically that is how it works: collect the scanned pages and kick the
> worker to update the current level.

That is the case only for some vmpressure invocations, and your data suggest
that those might lead to misleading results. So this is likely a good thing
to focus on and find out whether it can be addressed.
Good point, but at the moment I am not ready to implement auto-tuning of the
window size, because my device only has 8 GB of RAM and a very custom version
of Android-based SW, so I basically cannot test all possible combinations.

On Wed, Apr 15, 2020 at 8:29 AM Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 15-04-20 08:17:42, Leonid Moiseichuk wrote:
> > As Chris Down stated, cgroups v1 is frozen, so no API changes in the
> > mainline kernel.
>
> Yes, this is true, _but_ if there are clear shortcomings in the existing
> vmpressure implementation which could be addressed reasonably then there is
> no reason to ignore them.
>
> [...]
>
> > > I still find this very confusing because the amount of used memory is
> > > not really important. It really only depends on the reclaim activity,
> > > and that is either the memcg or the global reclaim. And you only get
> > > critical levels if reclaim is failing on far too many pages.
> >
> > OK, agreed from that point of view. But on larger systems reclaim happens
> > less often, and we could use larger window sizes to get a better
> > approximation of memory utilization.
>
> Nobody is saying the window size has to be fixed. This all can be
> auto-tuned in the kernel. It would, however, require defining what "better
> utilization approximation" means much more specifically.
>
> [...]
>
> > > This looks more like a problem of the vmpressure implementation to me
> > > than something you want to work around by tuning.
> >
> > Basically that is how it works: collect the scanned pages and kick the
> > worker to update the current level.
>
> That is the case only for some vmpressure invocations, and your data
> suggest that those might lead to misleading results. So this is likely a
> good thing to focus on and find out whether it can be addressed.
> --
> Michal Hocko
> SUSE Labs
From: Leonid Moiseichuk <lmoiseichuk@magicleap.com>

Small tweak to populate vmpressure parameters to userspace without
any built-in logic change.

The vmpressure is used actively (e.g. on Android) to track mm stress.
vmpressure parameters were selected empirically quite a long time ago
and are not always suitable for modern memory configurations.

Leonid Moiseichuk (2):
  memcg: expose vmpressure knobs
  memcg, vmpressure: expose vmpressure controls

 .../admin-guide/cgroup-v1/memory.rst |  12 +-
 include/linux/vmpressure.h           |  35 ++++++
 mm/memcontrol.c                      | 113 ++++++++++++++++++
 mm/vmpressure.c                      | 101 +++++++---------
 4 files changed, 200 insertions(+), 61 deletions(-)
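(Purely hypothetical sketch: if this series were applied, tuning the proposed
knobs from userspace might look like the code below. The file names come from
the discussion above; the mount point and the example values 75 / 90 / 1024
are assumptions, not recommendations.)

/*
 * Hypothetical sketch only: adjust the proposed cgroup-v1 vmpressure
 * knobs, assuming the series is applied and the memory controller is
 * mounted at /sys/fs/cgroup/memory.
 */
#include <stdio.h>

static int write_knob(const char *name, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/memory/%s", name);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	write_knob("memory.pressure_level_medium", "75");	/* default 60% */
	write_knob("memory.pressure_level_critical", "90");	/* default 95% */
	write_knob("memory.pressure_window", "1024");		/* default 512 pages */
	return 0;
}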