[0/2] memcg, vmpressure: expose vmpressure controls

Message ID: 20200413215750.7239-1-lmoiseichuk@magicleap.com

Message

svc_lmoiseichuk@magicleap.com April 13, 2020, 9:57 p.m. UTC
From: Leonid Moiseichuk <lmoiseichuk@magicleap.com>

A small tweak to expose vmpressure parameters to userspace without
any built-in logic change.

vmpressure is actively used (e.g. on Android) to track memory stress.
Its parameters were selected empirically quite a long time ago and are
not always suitable for modern memory configurations.

Leonid Moiseichuk (2):
  memcg: expose vmpressure knobs
  memcg, vmpressure: expose vmpressure controls

 .../admin-guide/cgroup-v1/memory.rst          |  12 +-
 include/linux/vmpressure.h                    |  35 ++++++
 mm/memcontrol.c                               | 113 ++++++++++++++++++
 mm/vmpressure.c                               | 101 +++++++---------
 4 files changed, 200 insertions(+), 61 deletions(-)

Comments

Michal Hocko April 14, 2020, 11:37 a.m. UTC | #1
On Mon 13-04-20 17:57:48, svc_lmoiseichuk@magicleap.com wrote:
> From: Leonid Moiseichuk <lmoiseichuk@magicleap.com>
> 
> A small tweak to expose vmpressure parameters to userspace without
> any built-in logic change.
> 
> vmpressure is actively used (e.g. on Android) to track memory stress.
> Its parameters were selected empirically quite a long time ago and are
> not always suitable for modern memory configurations.

This needs much more detail. Why is it not suitable? What are the usual
numbers you need to set for it to work properly? Why wouldn't those be
generally applicable?

Anyway, I have to confess I am not a big fan of this. vmpressure turned
out to be a very weak interface for measuring memory pressure. Not only
is it not NUMA aware, which makes it unusable on many systems, in
practice it also delivers its data way too late.

Btw. why don't you use /proc/pressure/memory resp. its memcg counterpart
to measure the memory pressure in the first place?

> Leonid Moiseichuk (2):
>   memcg: expose vmpressure knobs
>   memcg, vmpressure: expose vmpressure controls
> 
>  .../admin-guide/cgroup-v1/memory.rst          |  12 +-
>  include/linux/vmpressure.h                    |  35 ++++++
>  mm/memcontrol.c                               | 113 ++++++++++++++++++
>  mm/vmpressure.c                               | 101 +++++++---------
>  4 files changed, 200 insertions(+), 61 deletions(-)
> 
> -- 
> 2.17.1
>
Leonid Moiseichuk April 14, 2020, 4:42 p.m. UTC | #2
Thanks Michal for the quick response, see my answers below.
I will update the commit message with numbers for 8 GB swapless
devices.

On Tue, Apr 14, 2020 at 7:37 AM Michal Hocko <mhocko@kernel.org> wrote:

> On Mon 13-04-20 17:57:48, svc_lmoiseichuk@magicleap.com wrote:
> > From: Leonid Moiseichuk <lmoiseichuk@magicleap.com>
> >
> > A small tweak to expose vmpressure parameters to userspace without
> > any built-in logic change.
> >
> > vmpressure is actively used (e.g. on Android) to track memory stress.
> > Its parameters were selected empirically quite a long time ago and are
> > not always suitable for modern memory configurations.
>
> This needs much more detail. Why is it not suitable? What are the usual
> numbers you need to set for it to work properly? Why wouldn't those be
> generally applicable?
>
As far as I can see, the numbers vmpressure uses are close to the RSS of
userspace processes as a measure of memory utilization.
The default calibration of memory.pressure_level_medium at 60% makes an
8 GB device hit the medium threshold when RSS utilization reaches ~5 GB,
which is a bit too early; I observe it happening immediately after boot.
A reasonable level would be in the 70-80% range, depending on the SW
preloaded on the device.

From another point of view, a memory.pressure_level_critical set to 95%
may never fire, as by that point the OOM killer has already started to
kill processes, and in some cases that is even worse than the now-removed
Android low memory killer. For such cases it makes sense to shift the
threshold down to 85-90% so the device handles low-memory situations
reliably instead of relying only on oom_score_adj hints.

The next important parameter to tweak is memory.pressure_window, which
it makes sense to double in order to reduce the number of userspace
activations and save some power by reducing sensitivity.
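
To make this concrete, here is a minimal sketch (assumptions: the v1
memory controller is mounted at /sys/fs/cgroup/memory, error handling
omitted) of how a userspace listener consumes vmpressure today. The
memory.pressure_level eventfd registration is the existing, documented
cgroup-v1 API; only the thresholds at which the events fire would change
with the knobs proposed here:

/* Minimal vmpressure listener sketch; error handling omitted. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
        int efd = eventfd(0, 0);
        int lfd = open("/sys/fs/cgroup/memory/memory.pressure_level",
                       O_RDONLY);
        int cfd = open("/sys/fs/cgroup/memory/cgroup.event_control",
                       O_WRONLY);
        char buf[32];
        uint64_t n;

        /* Existing cgroup-v1 API: subscribe efd to "medium" events. */
        snprintf(buf, sizeof(buf), "%d %d medium", efd, lfd);
        write(cfd, buf, strlen(buf));

        for (;;) {
                read(efd, &n, sizeof(n));       /* blocks until an event */
                fprintf(stderr, "medium pressure, %llu event(s)\n",
                        (unsigned long long)n);
        }
}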

For 12 and 16 GB devices the situation will be similar but worse: with
the current settings they will hit the medium level while ~5 or 6.5 GB
of memory is still free.


>
> Anyway, I have to confess I am not a big fan of this. vmpressure turned
> out to be a very weak interface for measuring memory pressure. Not only
> is it not NUMA aware, which makes it unusable on many systems, in
> practice it also delivers its data way too late.
>
> Btw. why don't you use /proc/pressure/memory resp. its memcg counterpart
> to measure the memory pressure in the first place?
>

According to our checks PSI produces numbers only when swap is enabled,
e.g. on a swapless device at 75% RAM utilization:
==> /proc/pressure/io <==
some avg10=0.00 avg60=1.18 avg300=1.51 total=9642648
full avg10=0.00 avg60=1.11 avg300=1.47 total=9271174

==> /proc/pressure/memory <==
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

It is probably possible to trigger PSI by introducing high IO with swap
enabled, but that is not a typical case for mobile devices.

In the swap-enabled case memory pressure follows IO pressure at some
fraction, i.e. memory is io/2 ... io/10 depending on the pattern.
A light sysbench case with swap enabled:
==> /proc/pressure/io <==
some avg10=0.00 avg60=0.00 avg300=0.11 total=155383820
full avg10=0.00 avg60=0.00 avg300=0.05 total=100516966
==> /proc/pressure/memory <==
some avg10=0.00 avg60=0.00 avg300=0.06 total=465916397
full avg10=0.00 avg60=0.00 avg300=0.00 total=368664282

Since not all devices have zram or swap enabled, it makes sense to keep
a vmpressure tuning option available: it is widely used on Android and
the related issues are well understood.


> > Leonid Moiseichuk (2):
> >   memcg: expose vmpressure knobs
> >   memcg, vmpressure: expose vmpressure controls
> >
> >  .../admin-guide/cgroup-v1/memory.rst          |  12 +-
> >  include/linux/vmpressure.h                    |  35 ++++++
> >  mm/memcontrol.c                               | 113 ++++++++++++++++++
> >  mm/vmpressure.c                               | 101 +++++++---------
> >  4 files changed, 200 insertions(+), 61 deletions(-)
> >
> > --
> > 2.17.1
> >
>
> --
> Michal Hocko
> SUSE Labs
>
Michal Hocko April 14, 2020, 6:49 p.m. UTC | #3
On Tue 14-04-20 12:42:44, Leonid Moiseichuk wrote:
> Thanks Michal for the quick response, see my answers below.
> I will update the commit message with numbers for 8 GB swapless
> devices.
> 
> On Tue, Apr 14, 2020 at 7:37 AM Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Mon 13-04-20 17:57:48, svc_lmoiseichuk@magicleap.com wrote:
> > > From: Leonid Moiseichuk <lmoiseichuk@magicleap.com>
> > >
> > > A small tweak to expose vmpressure parameters to userspace without
> > > any built-in logic change.
> > >
> > > vmpressure is actively used (e.g. on Android) to track memory stress.
> > > Its parameters were selected empirically quite a long time ago and are
> > > not always suitable for modern memory configurations.
> >
> > This needs much more detail. Why is it not suitable? What are the usual
> > numbers you need to set for it to work properly? Why wouldn't those be
> > generally applicable?
> >
> As far as I can see, the numbers vmpressure uses are close to the RSS of
> userspace processes as a measure of memory utilization.
> The default calibration of memory.pressure_level_medium at 60% makes an
> 8 GB device hit the medium threshold when RSS utilization reaches ~5 GB,
> which is a bit too early; I observe it happening immediately after boot.
> A reasonable level would be in the 70-80% range, depending on the SW
> preloaded on the device.

I am not sure I follow. Levels are based on reclaim ineffectivity, not
on overall memory utilization. So it only takes reclaim effectivity
dropping to 40% to trigger the medium level. While you are right that
the threshold for the event is pretty arbitrary, I would like to hear
why it doesn't work in your environment. It shouldn't really depend on
the amount of memory, as this is a percentage, right?

> From another point of view, a memory.pressure_level_critical set to 95%
> may never fire, as by that point the OOM killer has already started to
> kill processes, and in some cases that is even worse than the now-removed
> Android low memory killer. For such cases it makes sense to shift the
> threshold down to 85-90% so the device handles low-memory situations
> reliably instead of relying only on oom_score_adj hints.
> 
> The next important parameter to tweak is memory.pressure_window, which
> it makes sense to double in order to reduce the number of userspace
> activations and save some power by reducing sensitivity.

Could you be more specific, please?

> For 12 and 16 GB devices the situation will be similar but worse: with
> the current settings they will hit the medium level while ~5 or 6.5 GB
> of memory is still free.
> 
> 
> >
> > Anyway, I have to confess I am not a big fan of this. vmpressure turned
> > out to be a very weak interface for measuring memory pressure. Not only
> > is it not NUMA aware, which makes it unusable on many systems, in
> > practice it also delivers its data way too late.
> >
> > Btw. why don't you use /proc/pressure/memory resp. its memcg counterpart
> > to measure the memory pressure in the first place?
> >
> 
> According to our checks PSI produces numbers only when swap is enabled,
> e.g. on a swapless device at 75% RAM utilization:

I believe you should discuss that with the people familiar with PSI
internals (Johannes already in the CC list).
Johannes Weiner April 14, 2020, 7:23 p.m. UTC | #4
On Tue, Apr 14, 2020 at 12:42:44PM -0400, Leonid Moiseichuk wrote:
> On Tue, Apr 14, 2020 at 7:37 AM Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 13-04-20 17:57:48, svc_lmoiseichuk@magicleap.com wrote:
> > Anyway, I have to confess I am not a big fan of this. vmpressure turned
> > out to be a very weak interface for measuring memory pressure. Not only
> > is it not NUMA aware, which makes it unusable on many systems, in
> > practice it also delivers its data way too late.

Yes, it's late in the game for vmpressure, and also a bit too late for
extensive changes in cgroup1.

> > Btw. why don't you use /proc/pressure/memory resp. its memcg counterpart
> > to measure the memory pressure in the first place?
> >
> 
> According to our checks PSI produces numbers only when swap is enabled,
> e.g. on a swapless device at 75% RAM utilization:
> ==> /proc/pressure/io <==
> some avg10=0.00 avg60=1.18 avg300=1.51 total=9642648
> full avg10=0.00 avg60=1.11 avg300=1.47 total=9271174
> 
> ==> /proc/pressure/memory <==
> some avg10=0.00 avg60=0.00 avg300=0.00 total=0
> full avg10=0.00 avg60=0.00 avg300=0.00 total=0

That doesn't look right. With total=0, there couldn't have been any
reclaim activity, which means that vmpressure couldn't have reported
anything either.

By the time vmpressure reports a drop in reclaim efficiency, psi
should have already been reporting time spent doing reclaim. It
reports a superset of the information conveyed by vmpressure.

> It is probably possible to trigger PSI by introducing high IO with swap
> enabled, but that is not a typical case for mobile devices.
> 
> In the swap-enabled case memory pressure follows IO pressure at some
> fraction, i.e. memory is io/2 ... io/10 depending on the pattern.
> A light sysbench case with swap enabled:
> ==> /proc/pressure/io <==
> some avg10=0.00 avg60=0.00 avg300=0.11 total=155383820
> full avg10=0.00 avg60=0.00 avg300=0.05 total=100516966
> ==> /proc/pressure/memory <==
> some avg10=0.00 avg60=0.00 avg300=0.06 total=465916397
> full avg10=0.00 avg60=0.00 avg300=0.00 total=368664282
> 
> Since not all devices have zram or swap enabled, it makes sense to keep
> a vmpressure tuning option available: it is widely used on Android and
> the related issues are well understood.

Android (since 10 afaik) uses psi to make low memory / OOM
decisions. See the introduction of the psi poll() support:
https://lwn.net/Articles/782662/

It's true that with swap you may see a more gradual increase in
pressure, whereas without swap you may go from idle to OOM much
faster, depending on what type of memory is being allocated. But psi
will still report it. You may just have to use poll() to get in-time
notification like you do with vmpressure.
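
For illustration, a minimal sketch of that trigger/poll() flow as
documented in Documentation/accounting/psi.rst (kernel 5.2+; the
150ms-of-stall-per-1s-window threshold below is an arbitrary example
value, and error handling is omitted):

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* Fire when "some" tasks are stalled on memory for >150ms
         * within any 1s window (both values in microseconds). */
        const char trig[] = "some 150000 1000000";
        int fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
        struct pollfd pfd = { .fd = fd, .events = POLLPRI };

        write(fd, trig, strlen(trig) + 1);
        for (;;)
                if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLPRI))
                        fprintf(stderr, "memory stall threshold crossed\n");
}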
Leonid Moiseichuk April 14, 2020, 8:53 p.m. UTC | #5
It would be nice if you could specify the exact numbers you would like to see.

On Tue, Apr 14, 2020 at 2:49 PM Michal Hocko <mhocko@kernel.org> wrote:

> ....

> > As far as I can see, the numbers vmpressure uses are close to the RSS of
> > userspace processes as a measure of memory utilization.
> > The default calibration of memory.pressure_level_medium at 60% makes an
> > 8 GB device hit the medium threshold when RSS utilization reaches ~5 GB,
> > which is a bit too early; I observe it happening immediately after boot.
> > A reasonable level would be in the 70-80% range, depending on the SW
> > preloaded on the device.
>
> I am not sure I follow. Levels are based on reclaim ineffectivity, not
> on overall memory utilization. So it only takes reclaim effectivity
> dropping to 40% to trigger the medium level. While you are right that
> the threshold for the event is pretty arbitrary, I would like to hear
> why it doesn't work in your environment. It shouldn't really depend on
> the amount of memory, as this is a percentage, right?
>
It depends not only on the amount of memory or on reclaims but also on
what software is running.

As I see from vmscan.c, vmpressure is activated from the various
shrink_node() paths, basically do_try_to_free_pages().
To hit this state you need to be short of memory for whatever reason,
so the amount of memory does play a role here.
My case in particular is heavily impacted by GPU (CMA) consumption,
which can easily take gigs.
Apps can take a gigabyte as well.
So reclaim will be called quite often when memory runs short (4K calls
are possible).

A level change is only handled once the number of scanned pages exceeds
the window size; 512 is too little, as that is only 2 MB with 4 KB pages.
Such small slices are a source of false triggers.

Next, pressure is counted as
        unsigned long scale = scanned + reclaimed;
        pressure = scale - (reclaimed * scale / scanned);
        pressure = pressure * 100 / scale;
So for 512 pages (let's use the minimum) reclaimed would have to be 204
pages for the 60% threshold and 25 pages for 95% (critical).
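
(A quick standalone check of that arithmetic, reusing the same integer
math:)

#include <stdio.h>

/* Same integer math as above, i.e. 100 * (1 - reclaimed/scanned). */
static unsigned long vmpr_calc(unsigned long scanned, unsigned long reclaimed)
{
        unsigned long scale = scanned + reclaimed;
        unsigned long pressure = scale - (reclaimed * scale / scanned);

        return pressure * 100 / scale;
}

int main(void)
{
        printf("%lu\n", vmpr_calc(512, 204));   /* 60 -> medium default */
        printf("%lu\n", vmpr_calc(512, 25));    /* 95 -> critical default */
        return 0;
}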

When pressure does happen (usually at 85% of memory used, hitting the
critical level) I rarely see numbers close to the real ones:
vmpressure_work_fn: scanned 545, reclaimed 144   <-- 73%
vmpressure_work_fn: scanned 16283, reclaimed 2495  <-- same session but 83%
Most of the time it is looping between kswapd and lmkd reclaim failures,
consuming quite a high amount of cpu.

On vmscan calls everything looks as expected
[  312.410938] vmpressure: tree 0 scanned 4, reclaimed 2
[  312.410939] vmpressure: tree 0 scanned 120, reclaimed 62
[  312.410939] vmpressure: tree 1 scanned 2, reclaimed 1
[  312.410940] vmpressure: tree 1 scanned 120, reclaimed 62
[  312.410941] vmpressure: tree 0 scanned 0, reclaimed 0


>
> > From another point of view, a memory.pressure_level_critical set to 95%
> > may never fire, as by that point the OOM killer has already started to
> > kill processes, and in some cases that is even worse than the now-removed
> > Android low memory killer. For such cases it makes sense to shift the
> > threshold down to 85-90% so the device handles low-memory situations
> > reliably instead of relying only on oom_score_adj hints.
> >
> > The next important parameter to tweak is memory.pressure_window, which
> > it makes sense to double in order to reduce the number of userspace
> > activations and save some power by reducing sensitivity.
>
> Could you be more specific, please?
>
Those are the parameters most sensitive to tweaking for me.
At least someone who uses vmpressure will be able to tune them up or
down depending on the combination of apps.



>
> > For 12 and 16 GB devices the situation will be similar but worse: with
> > the current settings they will hit the medium level while ~5 or 6.5 GB
> > of memory is still free.
> >
> >
> > >
> > > Anyway, I have to confess I am not a big fan of this. vmpressure turned
> > > out to be a very weak interface for measuring memory pressure. Not only
> > > is it not NUMA aware, which makes it unusable on many systems, in
> > > practice it also delivers its data way too late.
> > >
> > > Btw. why don't you use /proc/pressure/memory resp. its memcg counterpart
> > > to measure the memory pressure in the first place?
> > >
> >
> > According to our checks PSI produces numbers only when swap is enabled,
> > e.g. on a swapless device at 75% RAM utilization:
>
> I believe you should discuss that with the people familiar with PSI
> internals (Johannes already in the CC list).
>

Thanks for the pointer, I will reply to his emails.

> --
> Michal Hocko
> SUSE Labs
>
Leonid Moiseichuk April 14, 2020, 10:12 p.m. UTC | #6
I do not agree with all the comments, see below.

On Tue, Apr 14, 2020 at 3:23 PM Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Tue, Apr 14, 2020 at 12:42:44PM -0400, Leonid Moiseichuk wrote:
> > On Tue, Apr 14, 2020 at 7:37 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > On Mon 13-04-20 17:57:48, svc_lmoiseichuk@magicleap.com wrote:
> > > Anyway, I have to confess I am not a big fan of this. vmpressure turned
> > > out to be a very weak interface for measuring memory pressure. Not only
> > > is it not NUMA aware, which makes it unusable on many systems, in
> > > practice it also delivers its data way too late.
>
> Yes, it's late in the game for vmpressure, and also a bit too late for
> extensive changes in cgroup1.
>
200 lines just to move functionality from one place to another without
a logic change?
Those do not seem like extensive changes.


>
> > > Btw. why don't you use /proc/pressure/memory resp. its memcg counterpart
> > > to measure the memory pressure in the first place?
> > >
> >
> > According to our checks PSI produces numbers only when swap is enabled,
> > e.g. on a swapless device at 75% RAM utilization:
> > ==> /proc/pressure/io <==
> > some avg10=0.00 avg60=1.18 avg300=1.51 total=9642648
> > full avg10=0.00 avg60=1.11 avg300=1.47 total=9271174
> >
> > ==> /proc/pressure/memory <==
> > some avg10=0.00 avg60=0.00 avg300=0.00 total=0
> > full avg10=0.00 avg60=0.00 avg300=0.00 total=0
>
> That doesn't look right. With total=0, there couldn't have been any
> reclaim activity, which means that vmpressure couldn't have reported
> anything either.
>
Unfortunately not: vmpressure does report reclaim activity; I shared the
numbers/calls in a parallel email.
And I see kswapd+lmkd consuming quite a lot of cpu cycles.
That is the same device, swap disabled.
If I enable swap (zram-based, as Android usually does) PSI starts to
produce some numbers below 0.1, which does not look like huge pressure.


> By the time vmpressure reports a drop in reclaim efficiency, psi
> should have already been reporting time spent doing reclaim. It
> reports a superset of the information conveyed by vmpressure.
>


> > It is probably possible to trigger PSI by introducing high IO with swap
> > enabled, but that is not a typical case for mobile devices.
> >
> > In the swap-enabled case memory pressure follows IO pressure at some
> > fraction, i.e. memory is io/2 ... io/10 depending on the pattern.
> > A light sysbench case with swap enabled:
> > ==> /proc/pressure/io <==
> > some avg10=0.00 avg60=0.00 avg300=0.11 total=155383820
> > full avg10=0.00 avg60=0.00 avg300=0.05 total=100516966
> > ==> /proc/pressure/memory <==
> > some avg10=0.00 avg60=0.00 avg300=0.06 total=465916397
> > full avg10=0.00 avg60=0.00 avg300=0.00 total=368664282
> >
> > Since not all devices have zram or swap enabled, it makes sense to keep
> > a vmpressure tuning option available: it is widely used on Android and
> > the related issues are well understood.
>
> Android (since 10 afaik) uses psi to make low memory / OOM
> decisions. See the introduction of the psi poll() support:
>
> https://lwn.net/Articles/782662/
>

Android selects between PSI (primary) and vmpressure (backup), see line
2872+:
https://android.googlesource.com/platform/system/memory/lmkd/+/refs/heads/master/lmkd.cpp#2872


>
> It's true that with swap you may see a more gradual increase in
> pressure, whereas without swap you may go from idle to OOM much
> faster, depending on what type of memory is being allocated. But psi
> will still report it. You may just have to use poll() to get in-time
> notification like you do with vmpressure.
>
I expected that any spikes would be visible at the shortest avg level,
e.g. 10s. I cannot confirm that right now but I could play around. If
you have preferences about use cases please let me know.
Michal Hocko April 15, 2020, 7:51 a.m. UTC | #7
On Tue 14-04-20 16:53:55, Leonid Moiseichuk wrote:
> It would be nice if you could specify the exact numbers you would like to
> see.

You are proposing an interface which allows tuning the thresholds from
userspace, which suggests that you want to tune them. I am asking what
kind of tuning you are using and why we cannot use those values as
defaults in the kernel.

> On Tue, Apr 14, 2020 at 2:49 PM Michal Hocko <mhocko@kernel.org> wrote:
> 
> > ....
> 
> > > As far as I can see, the numbers vmpressure uses are close to the RSS of
> > > userspace processes as a measure of memory utilization.
> > > The default calibration of memory.pressure_level_medium at 60% makes an
> > > 8 GB device hit the medium threshold when RSS utilization reaches ~5 GB,
> > > which is a bit too early; I observe it happening immediately after boot.
> > > A reasonable level would be in the 70-80% range, depending on the SW
> > > preloaded on the device.
> >
> > I am not sure I follow. Levels are based on reclaim ineffectivity, not
> > on overall memory utilization. So it only takes reclaim effectivity
> > dropping to 40% to trigger the medium level. While you are right that
> > the threshold for the event is pretty arbitrary, I would like to hear
> > why it doesn't work in your environment. It shouldn't really depend on
> > the amount of memory, as this is a percentage, right?
> >
> It depends not only on the amount of memory or on reclaims but also on
> what software is running.
> 
> As I see from vmscan.c, vmpressure is activated from the various
> shrink_node() paths, basically do_try_to_free_pages().
> To hit this state you need to be short of memory for whatever reason,
> so the amount of memory does play a role here.
> My case in particular is heavily impacted by GPU (CMA) consumption,
> which can easily take gigs.
> Apps can take a gigabyte as well.
> So reclaim will be called quite often when memory runs short (4K calls
> are possible).
> 
> A level change is only handled once the number of scanned pages exceeds
> the window size; 512 is too little, as that is only 2 MB with 4 KB pages.
> Such small slices are a source of false triggers.
> 
> Next, pressure is counted as
>         unsigned long scale = scanned + reclaimed;
>         pressure = scale - (reclaimed * scale / scanned);
>         pressure = pressure * 100 / scale;

Just to make this more obvious this is essentially 
	100 * (1 - reclaimed/scanned)

> So for 512 pages (let's use the minimum) reclaimed would have to be 204
> pages for the 60% threshold and 25 pages for 95% (critical).
>
> When pressure does happen (usually at 85% of memory used, hitting the
> critical level)

I still find this very confusing because the amount of used memory is
not really important. It really only depends on the reclaim activity and
that is either the memcg or the global reclaim. And you are getting
critical levels only if the reclaim is failing to reclaim way too many
pages. 

> I rarely see numbers close to the real ones:
> vmpressure_work_fn: scanned 545, reclaimed 144   <-- 73%
> vmpressure_work_fn: scanned 16283, reclaimed 2495  <-- same session but 83%
> Most of the time it is looping between kswapd and lmkd reclaim failures,
> consuming quite a high amount of cpu.
> 
> On vmscan calls everything looks as expected
> [  312.410938] vmpressure: tree 0 scanned 4, reclaimed 2
> [  312.410939] vmpressure: tree 0 scanned 120, reclaimed 62
> [  312.410939] vmpressure: tree 1 scanned 2, reclaimed 1
> [  312.410940] vmpressure: tree 1 scanned 120, reclaimed 62
> [  312.410941] vmpressure: tree 0 scanned 0, reclaimed 0

To me this looks more like a problem in the vmpressure implementation
than something you want to work around by tuning.
Michal Hocko April 15, 2020, 7:55 a.m. UTC | #8
On Tue 14-04-20 18:12:47, Leonid Moiseichuk wrote:
> I do not agree with all the comments, see below.
> 
> On Tue, Apr 14, 2020 at 3:23 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Tue, Apr 14, 2020 at 12:42:44PM -0400, Leonid Moiseichuk wrote:
> > > On Tue, Apr 14, 2020 at 7:37 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > On Mon 13-04-20 17:57:48, svc_lmoiseichuk@magicleap.com wrote:
> > > > Anyway, I have to confess I am not a big fan of this. vmpressure turned
> > > > out to be a very weak interface for measuring memory pressure. Not only
> > > > is it not NUMA aware, which makes it unusable on many systems, in
> > > > practice it also delivers its data way too late.
> >
> > Yes, it's late in the game for vmpressure, and also a bit too late for
> > extensive changes in cgroup1.
> >
> 200 lines just to move functionality from one place to another without
> a logic change?
> Those do not seem like extensive changes.

Any user-visible API is a big change. We have to maintain any API
forever. So there has to be a really strong reason/use case for
inclusion. I haven't heard any strong justification so far. It all
seems to me that you are trying to work around real vmpressure issues
by fine-tuning parameters, and that is almost always a bad reason for
adding a new tunable.
Leonid Moiseichuk April 15, 2020, 12:17 p.m. UTC | #9
As Chris Down stated, cgroups v1 is frozen, so no API changes in the
mainline kernel.
If opinions change in the future I can continue polishing this change.
I will focus on PSI bugs for swapless/zram-swapped devices :)

The rest is below.

On Wed, Apr 15, 2020 at 3:51 AM Michal Hocko <mhocko@kernel.org> wrote:

> On Tue 14-04-20 16:53:55, Leonid Moiseichuk wrote:
> > It would be nice if you could specify the exact numbers you would like
> > to see.
>
> You are proposing an interface which allows tuning the thresholds from
> userspace, which suggests that you want to tune them. I am asking what
> kind of tuning you are using and why we cannot use those values as
> defaults in the kernel.
>

Yes, this type of hack is obvious. But parameters selected at one
moment in time might not be good later.
Plus, these patches can be applied by a vendor to e.g. Android 8 or 9,
which have no PSI and are tweaked in their own way.
Some products stick to old kernel versions. I put the docs in a
separate change to cover a wider set of older kernels.
The patches are transparent, tested and working fine.


> > On Tue, Apr 14, 2020 at 2:49 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > > ....
> >
> > > > As far as I can see, the numbers vmpressure uses are close to the RSS of
> > > > userspace processes as a measure of memory utilization.
> > > > The default calibration of memory.pressure_level_medium at 60% makes an
> > > > 8 GB device hit the medium threshold when RSS utilization reaches ~5 GB,
> > > > which is a bit too early; I observe it happening immediately after boot.
> > > > A reasonable level would be in the 70-80% range, depending on the SW
> > > > preloaded on the device.
> > >
> > > I am not sure I follow. Levels are based on reclaim ineffectivity, not
> > > on overall memory utilization. So it only takes reclaim effectivity
> > > dropping to 40% to trigger the medium level. While you are right that
> > > the threshold for the event is pretty arbitrary, I would like to hear
> > > why it doesn't work in your environment. It shouldn't really depend on
> > > the amount of memory, as this is a percentage, right?
> > >
> > It depends not only on the amount of memory or on reclaims but also on
> > what software is running.
> >
> > As I see from vmscan.c, vmpressure is activated from the various
> > shrink_node() paths, basically do_try_to_free_pages().
> > To hit this state you need to be short of memory for whatever reason,
> > so the amount of memory does play a role here.
> > My case in particular is heavily impacted by GPU (CMA) consumption,
> > which can easily take gigs.
> > Apps can take a gigabyte as well.
> > So reclaim will be called quite often when memory runs short (4K calls
> > are possible).
> >
> > A level change is only handled once the number of scanned pages exceeds
> > the window size; 512 is too little, as that is only 2 MB with 4 KB pages.
> > Such small slices are a source of false triggers.
> >
> > Next, pressure is counted as
> >         unsigned long scale = scanned + reclaimed;
> >         pressure = scale - (reclaimed * scale / scanned);
> >         pressure = pressure * 100 / scale;
>
> Just to make this more obvious this is essentially
>         100 * (1 - reclaimed/scanned)
>
> > So for 512 pages (let's use the minimum) reclaimed would have to be 204
> > pages for the 60% threshold and 25 pages for 95% (critical).
> >
> > When pressure does happen (usually at 85% of memory used, hitting the
> > critical level)
>
> I still find this very confusing because the amount of used memory is
> not really important. It really only depends on the reclaim activity and
> that is either the memcg or the global reclaim. And you are getting
> critical levels only if the reclaim is failing to reclaim way too many
> pages.
>

OK, agreed from that point of view.
But on larger systems reclaim does not happen so often, and we could
use larger window sizes to get a better approximation of memory
utilization.


>
> > I rarely see numbers close to the real ones:
> > vmpressure_work_fn: scanned 545, reclaimed 144   <-- 73%
> > vmpressure_work_fn: scanned 16283, reclaimed 2495  <-- same session but 83%
> > Most of the time it is looping between kswapd and lmkd reclaim failures,
> > consuming quite a high amount of cpu.
> >
> > On vmscan calls everything looks as expected
> > [  312.410938] vmpressure: tree 0 scanned 4, reclaimed 2
> > [  312.410939] vmpressure: tree 0 scanned 120, reclaimed 62
> > [  312.410939] vmpressure: tree 1 scanned 2, reclaimed 1
> > [  312.410940] vmpressure: tree 1 scanned 120, reclaimed 62
> > [  312.410941] vmpressure: tree 0 scanned 0, reclaimed 0
>
> To me this looks more like a problem in the vmpressure implementation
> than something you want to work around by tuning.
>
Basically that is how it works - collect the scanned pages and then
activate a worker to update the current level.
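
For reference, a simplified paraphrase of that flow from mm/vmpressure.c
(locking, gfp checks and the non-tree mode omitted):

void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
                unsigned long scanned, unsigned long reclaimed)
{
        struct vmpressure *vmpr = memcg_to_vmpressure(memcg);

        /* Per-reclaim-call accounting only accumulates counters ... */
        vmpr->tree_scanned += scanned;
        vmpr->tree_reclaimed += reclaimed;

        /* ... level evaluation and eventfd signalling are deferred to
         * a worker once a full window (512 pages) has been scanned. */
        if (vmpr->tree_scanned < vmpressure_win)
                return;
        schedule_work(&vmpr->work);
}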


>
> --
> Michal Hocko
> SUSE Labs
>
Michal Hocko April 15, 2020, 12:28 p.m. UTC | #10
On Wed 15-04-20 08:17:42, Leonid Moiseichuk wrote:
> As Chris Down stated, cgroups v1 is frozen, so no API changes in the
> mainline kernel.

Yes, this is true, _but_ if there are clear shortcomings in the existing
vmpressure implementation which could be addressed reasonably then there
is no reason to ignore them.

[...]

> > I still find this very confusing because the amount of used memory is
> > not really important. It really only depends on the reclaim activity and
> > that is either the memcg or the global reclaim. And you are getting
> > critical levels only if the reclaim is failing to reclaim way too many
> > pages.
> >
> 
> OK, agreed from that point of view.
> But on larger systems reclaim does not happen so often, and we could
> use larger window sizes to get a better approximation of memory
> utilization.

Nobody is saying that the window size has to be fixed. This all can be
auto-tuned in the kernel. It would, however, require defining what
"better utilization approximation" means much more specifically.

[...]
> > To me this looks more like a problem in the vmpressure implementation
> > than something you want to work around by tuning.
> >
> Basically that is how it works - collect the scanned pages and then
> activate a worker to update the current level.

That is the case only for some vmpressure invocations. And your data
suggest that those might lead to misleading results. So this is likely
good to focus on and find out whether this can be addressed.
Leonid Moiseichuk April 15, 2020, 12:33 p.m. UTC | #11
Good point, but at the moment I am not ready to implement an auto-tuned
window size because my device only has 8 GB RAM and a very custom version
of Android-based SW.
So basically I cannot test all possible combinations.

On Wed, Apr 15, 2020 at 8:29 AM Michal Hocko <mhocko@kernel.org> wrote:

> On Wed 15-04-20 08:17:42, Leonid Moiseichuk wrote:
> > As Chris Down stated, cgroups v1 is frozen, so no API changes in the
> > mainline kernel.
>
> Yes, this is true, _but_ if there are clear shortcomings in the existing
> vmpressure implementation which could be addressed reasonably then there
> is no reason to ignore them.
>
> [...]
>
> > > I still find this very confusing because the amount of used memory is
> > > not really important. It really only depends on the reclaim activity and
> > > that is either the memcg or the global reclaim. And you are getting
> > > critical levels only if the reclaim is failing to reclaim way too many
> > > pages.
> > >
> >
> > OK, agreed from that point of view.
> > But on larger systems reclaim does not happen so often, and we could
> > use larger window sizes to get a better approximation of memory
> > utilization.
>
> Nobody is saying that the window size has to be fixed. This all can be
> auto-tuned in the kernel. It would, however, require defining what
> "better utilization approximation" means much more specifically.
>
> [...]
> > > To me this looks more like a problem in the vmpressure implementation
> > > than something you want to work around by tuning.
> > >
> > Basically that is how it works - collect the scanned pages and then
> > activate a worker to update the current level.
>
> That is the case only for some vmpressure invocations. And your data
> suggest that those might lead to misleading results. So this is likely
> good to focus on and find out whether this can be addressed.
> --
> Michal Hocko
> SUSE Labs
>