
mm, oom: make the calculation of oom badness more accurate

Message ID 1594214649-9837-1-git-send-email-laoar.shao@gmail.com (mailing list archive)
State New, archived
Series mm, oom: make the calculation of oom badness more accurate

Commit Message

Yafang Shao July 8, 2020, 1:24 p.m. UTC
Recently we found an issue in our production environment: when memcg
oom is triggered, the oom killer doesn't choose the process with the
largest resident memory but instead chooses the first scanned process.
Note that all processes in this memcg have the same oom_score_adj, so
the oom killer should choose the process with the largest resident memory.

Below is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
[7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
[7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
[7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
[7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
[7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
[7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
[7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
[7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
[7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
[7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
[7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
[7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
[7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
[7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
[7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
[7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
[7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
[7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
[7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
[7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We can see that the first scanned process, 5740 (pause), was killed even
though its rss is only one page. That is because, when we calculate the oom
badness in oom_badness(), we always ignore negative points and convert all
of them to 1. Since the oom_score_adj of all the processes in this targeted
memcg has the same value of -998, the points of these processes are all
negative. As a result, the first scanned process will be killed.

The oom_score_adj (-998) in this memcg is set by kubelet, because it is a
Guaranteed pod, which has a higher priority and should be protected from
being killed by a system oom.

To fix this issue, we should make the calculation of oom points more
accurate. We can achieve that by converting chosen_points from 'unsigned
long' to 'long'.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/oom.h |  4 ++--
 mm/oom_kill.c       | 20 +++++++++-----------
 2 files changed, 11 insertions(+), 13 deletions(-)
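
The behavior described above comes from the final clamp in oom_badness().
A simplified sketch of the pre-patch logic (eligibility checks omitted and
the exact signature varies across kernel versions, so treat this as an
illustration rather than the patched code):

    unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
    {
            long points, adj;

            points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
                     mm_pgtables_bytes(p->mm) / PAGE_SIZE;

            adj = (long)p->signal->oom_score_adj;
            adj *= totalpages / 1000;       /* bias in units of totalpages/1000 */
            points += adj;

            /*
             * Any non-positive result collapses to 1, so all tasks with a
             * sufficiently negative oom_score_adj end up tied.
             */
            return points > 0 ? points : 1;
    }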

Comments

Michal Hocko July 8, 2020, 2:28 p.m. UTC | #1
On Wed 08-07-20 09:24:09, Yafang Shao wrote:
> Recently we found an issue on our production environment that when memcg
> oom is triggered the oom killer doesn't chose the process with largest
> resident memory but chose the first scanned process. Note that all
> processes in this memcg have the same oom_score_adj, so the oom killer
> should chose the process with largest resident memory.
> 
> Bellow is part of the oom info, which is enough to analyze this issue.
> [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
> [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
> [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
> [...]
> [7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> [7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
> [7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
> [7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
> [7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
> [7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
> [7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
> [7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
> [7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
> [7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
> [7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
> [7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
> [7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
> [7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
> [7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
> [7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
> [7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
> [7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
> [7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
> [7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
> [7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
> [7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
> [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
> [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> 
> We can find that the first scanned process 5740 (pause) was killed, but its
> rss is only one page. That is because, when we calculate the oom badness in
> oom_badness(), we always ignore the negtive point and convert all of these
> negtive points to 1. Now as oom_score_adj of all the processes in this
> targeted memcg have the same value -998, the points of these processes are
> all negtive value. As a result, the first scanned process will be killed.

Such a large bias can skew results quite considerably. 

> The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
> a Guaranteed pod, which has higher priority to prevent from being killed by
> system oom.

This is really interesting! I assume that the oom_score_adj is set to
protect against the global oom situation, right? I am struggling to
understand what the expected behavior is when the oom is internal to
such a group, though. Is killing a single task from such a group a
sensible choice? I am not really familiar with kubelet, but can it cope
with data_sim going away from under it while the rest would still run?
Wouldn't it make more sense to simply tear down the whole thing?

But that is a separate thing.

> To fix this issue, we should make the calculation of oom point more
> accurate. We can achieve it by convert the chosen_point from 'unsigned
> long' to 'long'.

oom_score has very coarse units because it maps all the consumed
memory onto a 0 - 1000 scale, effectively per-mille of the usable
memory. oom_score_adj acts on top of that as a bias. This is
exported to userspace and I do not think we can change that (see
Documentation/filesystems/proc.rst), unfortunately. So your patch cannot
really be accepted as is, because it would start reporting values outside
of the allowed range, unless I am doing some math incorrectly.
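
As a rough sketch of this scaling (not verbatim kernel code): the badness is
computed in pages, oom_score_adj is applied as a bias in the same page domain,
and the value userspace sees in /proc/<pid>/oom_score is the result normalized
to per-mille of the usable memory:

    points    = rss + pgtable_pages + swapents;        /* pages            */
    points   += oom_score_adj * totalpages / 1000;     /* bias, also pages */
    oom_score = points * 1000 / totalpages;            /* per-mille scale  */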

On the other hand, in this particular case I believe the existing
calculation is just wrong. Usable memory is 16777216kB (4194304 pages),
the top consumer is 3976380 pages, so 94.8%, while the lowest memory
consumer is effectively 0%. Even if we discount 94.8% by 99.8%, we should
still be left with something like 7950 pages. So the normalization that
oom_badness does cuts results too aggressively. There was quite some churn
in the calculation in the past fixing weird rounding bugs, so I have to
think some more about how to fix this properly.

That being said, even though the configuration is weird, I do agree that
the oom_badness scaling is really unexpected and that the memory consumption
in this particular example should be quite telling about whom to choose as
an oom victim.
Michal Hocko July 8, 2020, 2:32 p.m. UTC | #2
I have only now realized that David is not on Cc. Add him here. The
patch is http://lkml.kernel.org/r/1594214649-9837-1-git-send-email-laoar.shao@gmail.com.

I believe the main problem is that we are normalizing to oom_score_adj
units rather than usage/total. I have a very vague recollection this has
been done in the past but I didn't get to dig into details yet.

On Wed 08-07-20 16:28:08, Michal Hocko wrote:
> On Wed 08-07-20 09:24:09, Yafang Shao wrote:
> > Recently we found an issue on our production environment that when memcg
> > oom is triggered the oom killer doesn't chose the process with largest
> > resident memory but chose the first scanned process. Note that all
> > processes in this memcg have the same oom_score_adj, so the oom killer
> > should chose the process with largest resident memory.
> > 
> > Bellow is part of the oom info, which is enough to analyze this issue.
> > [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
> > [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
> > [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
> > [...]
> > [7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> > [7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
> > [7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
> > [7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
> > [7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
> > [7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
> > [7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
> > [7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
> > [7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
> > [7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
> > [7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
> > [7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
> > [7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
> > [7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
> > [7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
> > [7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
> > [7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
> > [7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
> > [7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
> > [7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
> > [7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
> > [7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
> > [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
> > [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> > [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > 
> > We can find that the first scanned process 5740 (pause) was killed, but its
> > rss is only one page. That is because, when we calculate the oom badness in
> > oom_badness(), we always ignore the negtive point and convert all of these
> > negtive points to 1. Now as oom_score_adj of all the processes in this
> > targeted memcg have the same value -998, the points of these processes are
> > all negtive value. As a result, the first scanned process will be killed.
> 
> Such a large bias can skew results quite considerably. 
> 
> > The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
> > a Guaranteed pod, which has higher priority to prevent from being killed by
> > system oom.
> 
> This is really interesting! I assume that the oom_score_adj is set to
> protect from the global oom situation right? I am struggling to
> understand what is the expected behavior when the oom is internal for
> such a group though. Does killing a single task from such a group is a
> sensible choice? I am not really familiar with kubelet but can it cope
> with data_sim going away from under it while the rest would still run?
> Wouldn't it make more sense to simply tear down the whole thing?
> 
> But that is a separate thing.
> 
> > To fix this issue, we should make the calculation of oom point more
> > accurate. We can achieve it by convert the chosen_point from 'unsigned
> > long' to 'long'.
> 
> oom_score has a very coarse units because it maps all the consumed
> memory into 0 - 1000 scale so effectively per-mille of the usable
> memory. oom_score_adj acts on top of that as a bias. This is
> exported to the userspace and I do not think we can change that (see
> Documentation/filesystems/proc.rst) unfortunately. So you patch cannot
> be really accepted as is because it would start reporting values outside
> of the allowed range unless I am doing some math incorrectly.
> 
> On the other hand, in this particular case I believe the existing
> calculation is just wrong. Usable memory is 16777216kB (4194304 pages),
> the top consumer is 3976380 pages so 94.8% the lowest memory consumer is
> effectively 0%. Even if we discount 94.8% by 99.8% then we should be
> still having something like 7950 pages. So the normalization oom_badness
> does cuts results too aggressively. There was quite some churn in the
> calculation in the past fixing weird rounding bugs so I have to think
> about how to fix this properly some more.
> 
> That being said, even though the configuration is weird I do agree that
> oom_badness scaling is really unexpected and the memory consumption
> in this particular example should be quite telling about who to chose as
> an oom victim.
> -- 
> Michal Hocko
> SUSE Labs
Yafang Shao July 8, 2020, 3:11 p.m. UTC | #3
On Wed, Jul 8, 2020 at 10:28 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 08-07-20 09:24:09, Yafang Shao wrote:
> > Recently we found an issue on our production environment that when memcg
> > oom is triggered the oom killer doesn't chose the process with largest
> > resident memory but chose the first scanned process. Note that all
> > processes in this memcg have the same oom_score_adj, so the oom killer
> > should chose the process with largest resident memory.
> >
> > Bellow is part of the oom info, which is enough to analyze this issue.
> > [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
> > [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
> > [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
> > [...]
> > [7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> > [7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
> > [7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
> > [7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
> > [7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
> > [7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
> > [7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
> > [7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
> > [7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
> > [7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
> > [7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
> > [7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
> > [7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
> > [7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
> > [7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
> > [7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
> > [7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
> > [7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
> > [7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
> > [7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
> > [7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
> > [7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
> > [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
> > [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> > [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> >
> > We can find that the first scanned process 5740 (pause) was killed, but its
> > rss is only one page. That is because, when we calculate the oom badness in
> > oom_badness(), we always ignore the negtive point and convert all of these
> > negtive points to 1. Now as oom_score_adj of all the processes in this
> > targeted memcg have the same value -998, the points of these processes are
> > all negtive value. As a result, the first scanned process will be killed.
>
> Such a large bias can skew results quite considerably.
>

Right.
Please refer to the kubernetes doc[1] for more information about this large bias.

[1]. https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/

> > The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
> > a Guaranteed pod, which has higher priority to prevent from being killed by
> > system oom.
>
> This is really interesting! I assume that the oom_score_adj is set to
> protect from the global oom situation right?

Right. See also the kubernetes doc.

> I am struggling to
> understand what is the expected behavior when the oom is internal for
> such a group though. Does killing a single task from such a group is a
> sensible choice? I am not really familiar with kubelet but can it cope
> with data_sim going away from under it while the rest would still run?
> Wouldn't it make more sense to simply tear down the whole thing?
>

There are two containers in one kubernetes pod: one is the pause
container, which has only one process - pause - that manages the netns;
the other is the docker-init container, in which all the other processes
are running.
Once the pause process is killed, the kubelet will rebuild all the
containers in this pod, while if one of the processes in the
docker-init container is killed, the kubelet will only try to re-run
that process. So tearing down the whole thing is more costly than
re-running a single process.
I'm not that familiar with kubernetes either; that is my understanding.

> But that is a separate thing.

Right.

>
> > To fix this issue, we should make the calculation of oom point more
> > accurate. We can achieve it by convert the chosen_point from 'unsigned
> > long' to 'long'.
>
> oom_score has a very coarse units because it maps all the consumed
> memory into 0 - 1000 scale so effectively per-mille of the usable
> memory. oom_score_adj acts on top of that as a bias. This is
> exported to the userspace and I do not think we can change that (see
> Documentation/filesystems/proc.rst) unfortunately.

In this doc, I only find that oom_score and oom_score_adj are exposed to
userspace, while this patch only changes oom_control->chosen_points, which
is only used internally by the oom killer.
So I don't see why we can't change oom_control->chosen_points.


> So you patch cannot
> be really accepted as is because it would start reporting values outside
> of the allowed range unless I am doing some math incorrectly.
>

See above; my patch will not break userspace at all.

> On the other hand, in this particular case I believe the existing
> calculation is just wrong. Usable memory is 16777216kB (4194304 pages),
> the top consumer is 3976380 pages so 94.8% the lowest memory consumer is
> effectively 0%. Even if we discount 94.8% by 99.8% then we should be
> still having something like 7950 pages. So the normalization oom_badness
> does cuts results too aggressively. There was quite some churn in the
> calculation in the past fixing weird rounding bugs so I have to think
> about how to fix this properly some more.
>
> That being said, even though the configuration is weird I do agree that
> oom_badness scaling is really unexpected and the memory consumption
> in this particular example should be quite telling about who to chose as
> an oom victim.
> --
> Michal Hocko
> SUSE Labs
Michal Hocko July 8, 2020, 4:09 p.m. UTC | #4
On Wed 08-07-20 23:11:43, Yafang Shao wrote:
> On Wed, Jul 8, 2020 at 10:28 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Wed 08-07-20 09:24:09, Yafang Shao wrote:
> > > Recently we found an issue on our production environment that when memcg
> > > oom is triggered the oom killer doesn't chose the process with largest
> > > resident memory but chose the first scanned process. Note that all
> > > processes in this memcg have the same oom_score_adj, so the oom killer
> > > should chose the process with largest resident memory.
> > >
> > > Bellow is part of the oom info, which is enough to analyze this issue.
> > > [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
> > > [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
> > > [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
> > > [...]
> > > [7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> > > [7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
> > > [7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
> > > [7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
> > > [7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
> > > [7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
> > > [7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
> > > [7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
> > > [7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
> > > [7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
> > > [7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
> > > [7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
> > > [7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
> > > [7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
> > > [7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
> > > [7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
> > > [7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
> > > [7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
> > > [7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
> > > [7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
> > > [7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
> > > [7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
> > > [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
> > > [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> > > [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > >
> > > We can find that the first scanned process 5740 (pause) was killed, but its
> > > rss is only one page. That is because, when we calculate the oom badness in
> > > oom_badness(), we always ignore the negtive point and convert all of these
> > > negtive points to 1. Now as oom_score_adj of all the processes in this
> > > targeted memcg have the same value -998, the points of these processes are
> > > all negtive value. As a result, the first scanned process will be killed.
> >
> > Such a large bias can skew results quite considerably.
> >
> 
> Right.
> Pls. refer the kubernetes doc[1] for more information about this large bias .
> 
> [1]. https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
> 
> > > The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
> > > a Guaranteed pod, which has higher priority to prevent from being killed by
> > > system oom.
> >
> > This is really interesting! I assume that the oom_score_adj is set to
> > protect from the global oom situation right?
> 
> Right. See also the kubernetes doc.
> 
> > I am struggling to
> > understand what is the expected behavior when the oom is internal for
> > such a group though. Does killing a single task from such a group is a
> > sensible choice? I am not really familiar with kubelet but can it cope
> > with data_sim going away from under it while the rest would still run?
> > Wouldn't it make more sense to simply tear down the whole thing?
> >
> 
> There are two containers in one kubernetes pod, one of which is a
> pause-container, which has only one process - the pause, which is
> managing the netns, and the other is the docker-init-container, in
> which all other processes are running.
> Once the pause process is killed, the kubelet will rebuild all the
> containers in this pod, while if one of the processes in the
> docker-init-container is killed, the kubelet will try to re-run it.
> So tearing down the whole thing is more costly than only trying to
> re-running one process.
> I'm not familiar with kubernetes as well, that is my understanding.

Thanks for the clarification!
 
[...]
> > oom_score has a very coarse units because it maps all the consumed
> > memory into 0 - 1000 scale so effectively per-mille of the usable
> > memory. oom_score_adj acts on top of that as a bias. This is
> > exported to the userspace and I do not think we can change that (see
> > Documentation/filesystems/proc.rst) unfortunately.
> 
> In this doc, I only find the oom_score and oom_score_adj is exposed to
> the userspace.
> While this patch only changes the oom_control->chosen_points, which is
> only for oom internally use.
> So I don't think we can't change oom_control->chosen_points.

Unless I am misreading the patch, you are allowing negative values to be
returned from oom_badness(), and that is used by proc_oom_score(), which is
exported to userspace.
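
The path in question looks roughly like this (simplified; the four-argument
oom_badness() form matches the diff Yafang posts later in this thread, and
signatures vary across kernel versions):

    static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
                              struct pid *pid, struct task_struct *task)
    {
            unsigned long totalpages = totalram_pages + total_swap_pages;
            unsigned long points = 0;

            /*
             * If oom_badness() may return a negative value, it would show
             * up (or wrap through the unsigned type) right here.
             */
            points = oom_badness(task, NULL, NULL, totalpages) * 1000 / totalpages;
            seq_printf(m, "%lu\n", points);     /* /proc/<pid>/oom_score */
            return 0;
    }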
David Rientjes July 8, 2020, 5:57 p.m. UTC | #5
On Wed, 8 Jul 2020, Michal Hocko wrote:

> I have only now realized that David is not on Cc. Add him here. The
> patch is http://lkml.kernel.org/r/1594214649-9837-1-git-send-email-laoar.shao@gmail.com.
> 
> I believe the main problem is that we are normalizing to oom_score_adj
> units rather than usage/total. I have a very vague recollection this has
> been done in the past but I didn't get to dig into details yet.
> 

The memcg max is 4194304 pages, and an oom_score_adj of -998 would yield a 
page adjustment of:

adj = -998 * 4194304 / 1000 = −4185915 pages

The largest pid 58406 (data_sim) has rss 3967322 pages,
pgtables 37101568 / 4096 = 9058 pages, and swapents 0.  So its unadjusted
badness is

3967322 + 9058 pages = 3976380 pages

Factoring in oom_score_adj, all of these processes will have a badness of 
1 because oom_badness() doesn't underflow, which I think is the point of 
Yafang's proposal.
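
Plugging in those numbers (illustration only; the kernel computes the bias as
adj *= totalpages / 1000, which rounds slightly differently but underflows all
the same):

    long totalpages = 4194304;                    /* 16777216kB / 4kB pages        */
    long adj        = -998 * totalpages / 1000;   /* -4185915 pages                */
    long unadjusted = 3967322 + 37101568 / 4096;  /* rss + pgtable pages = 3976380 */
    long points     = unadjusted + adj;           /* -209535, i.e. negative        */
    /* the current 'return points > 0 ? points : 1' then clamps this to 1 */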

I think the patch can work but, as you mention, also needs an update to 
proc_oom_score().  proc_oom_score() is using the global amount of memory 
so Yafang is likely not seeing it go negative for that reason but it could 
happen.

> On Wed 08-07-20 16:28:08, Michal Hocko wrote:
> > On Wed 08-07-20 09:24:09, Yafang Shao wrote:
> > > Recently we found an issue on our production environment that when memcg
> > > oom is triggered the oom killer doesn't chose the process with largest
> > > resident memory but chose the first scanned process. Note that all
> > > processes in this memcg have the same oom_score_adj, so the oom killer
> > > should chose the process with largest resident memory.
> > > 
> > > Bellow is part of the oom info, which is enough to analyze this issue.
> > > [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
> > > [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
> > > [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
> > > [...]
> > > [7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> > > [7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
> > > [7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
> > > [7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
> > > [7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
> > > [7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
> > > [7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
> > > [7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
> > > [7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
> > > [7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
> > > [7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
> > > [7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
> > > [7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
> > > [7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
> > > [7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
> > > [7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
> > > [7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
> > > [7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
> > > [7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
> > > [7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
> > > [7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
> > > [7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
> > > [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
> > > [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> > > [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > 
> > > We can find that the first scanned process 5740 (pause) was killed, but its
> > > rss is only one page. That is because, when we calculate the oom badness in
> > > oom_badness(), we always ignore the negtive point and convert all of these
> > > negtive points to 1. Now as oom_score_adj of all the processes in this
> > > targeted memcg have the same value -998, the points of these processes are
> > > all negtive value. As a result, the first scanned process will be killed.
> > 
> > Such a large bias can skew results quite considerably. 
> > 
> > > The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
> > > a Guaranteed pod, which has higher priority to prevent from being killed by
> > > system oom.
> > 
> > This is really interesting! I assume that the oom_score_adj is set to
> > protect from the global oom situation right? I am struggling to
> > understand what is the expected behavior when the oom is internal for
> > such a group though. Does killing a single task from such a group is a
> > sensible choice? I am not really familiar with kubelet but can it cope
> > with data_sim going away from under it while the rest would still run?
> > Wouldn't it make more sense to simply tear down the whole thing?
> > 
> > But that is a separate thing.
> > 
> > > To fix this issue, we should make the calculation of oom point more
> > > accurate. We can achieve it by convert the chosen_point from 'unsigned
> > > long' to 'long'.
> > 
> > oom_score has a very coarse units because it maps all the consumed
> > memory into 0 - 1000 scale so effectively per-mille of the usable
> > memory. oom_score_adj acts on top of that as a bias. This is
> > exported to the userspace and I do not think we can change that (see
> > Documentation/filesystems/proc.rst) unfortunately. So you patch cannot
> > be really accepted as is because it would start reporting values outside
> > of the allowed range unless I am doing some math incorrectly.
> > 
> > On the other hand, in this particular case I believe the existing
> > calculation is just wrong. Usable memory is 16777216kB (4194304 pages),
> > the top consumer is 3976380 pages so 94.8% the lowest memory consumer is
> > effectively 0%. Even if we discount 94.8% by 99.8% then we should be
> > still having something like 7950 pages. So the normalization oom_badness
> > does cuts results too aggressively. There was quite some churn in the
> > calculation in the past fixing weird rounding bugs so I have to think
> > about how to fix this properly some more.
> > 
> > That being said, even though the configuration is weird I do agree that
> > oom_badness scaling is really unexpected and the memory consumption
> > in this particular example should be quite telling about who to chose as
> > an oom victim.
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> -- 
> Michal Hocko
> SUSE Labs
>
Michal Hocko July 8, 2020, 7:02 p.m. UTC | #6
On Wed 08-07-20 10:57:27, David Rientjes wrote:
> On Wed, 8 Jul 2020, Michal Hocko wrote:
> 
> > I have only now realized that David is not on Cc. Add him here. The
> > patch is http://lkml.kernel.org/r/1594214649-9837-1-git-send-email-laoar.shao@gmail.com.
> > 
> > I believe the main problem is that we are normalizing to oom_score_adj
> > units rather than usage/total. I have a very vague recollection this has
> > been done in the past but I didn't get to dig into details yet.
> > 
> 
> The memcg max is 4194304 pages, and an oom_score_adj of -998 would yield a 
> page adjustment of:
> 
> adj = -998 * 4194304 / 1000 = −4185915 pages
> 
> The largest pid 58406 (data_sim) has rss 3967322 pages,
> pgtables 37101568 / 4096 = 9058 pages, and swapents 0.  So it's unadjusted 
> badness is
> 
> 3967322 + 9058 pages = 3976380 pages
> 
> Factoring in oom_score_adj, all of these processes will have a badness of 
> 1 because oom_badness() doesn't underflow, which I think is the point of 
> Yafang's proposal.
> 
> I think the patch can work but, as you mention, also needs an update to 
> proc_oom_score().  proc_oom_score() is using the global amount of memory 
> so Yafang is likely not seeing it go negative for that reason but it could 
> happen.

Yes, memcg just makes it more obvious, but the same might happen in the
global case. I am not sure how we can both allow underflow and present a
value that fits the existing model. The exported value should really
reflect what the oom killer is using for the calculation, or we are going
to see discrepancies between the real oom decision and the presented
values. So I believe we really have to change the calculation rather than
just make it tolerant to underflows.

But I have to think about that much more.
Yafang Shao July 9, 2020, 1:57 a.m. UTC | #7
On Thu, Jul 9, 2020 at 12:09 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 08-07-20 23:11:43, Yafang Shao wrote:
> > On Wed, Jul 8, 2020 at 10:28 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Wed 08-07-20 09:24:09, Yafang Shao wrote:
> > > > Recently we found an issue on our production environment that when memcg
> > > > oom is triggered the oom killer doesn't chose the process with largest
> > > > resident memory but chose the first scanned process. Note that all
> > > > processes in this memcg have the same oom_score_adj, so the oom killer
> > > > should chose the process with largest resident memory.
> > > >
> > > > Bellow is part of the oom info, which is enough to analyze this issue.
> > > > [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
> > > > [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
> > > > [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
> > > > [...]
> > > > [7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> > > > [7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
> > > > [7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
> > > > [7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
> > > > [7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
> > > > [7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
> > > > [7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
> > > > [7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
> > > > [7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
> > > > [7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
> > > > [7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
> > > > [7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
> > > > [7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
> > > > [7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
> > > > [7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
> > > > [7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
> > > > [7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
> > > > [7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
> > > > [7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
> > > > [7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
> > > > [7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
> > > > [7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
> > > > [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
> > > > [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> > > > [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > >
> > > > We can find that the first scanned process 5740 (pause) was killed, but its
> > > > rss is only one page. That is because, when we calculate the oom badness in
> > > > oom_badness(), we always ignore the negtive point and convert all of these
> > > > negtive points to 1. Now as oom_score_adj of all the processes in this
> > > > targeted memcg have the same value -998, the points of these processes are
> > > > all negtive value. As a result, the first scanned process will be killed.
> > >
> > > Such a large bias can skew results quite considerably.
> > >
> >
> > Right.
> > Pls. refer the kubernetes doc[1] for more information about this large bias .
> >
> > [1]. https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
> >
> > > > The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
> > > > a Guaranteed pod, which has higher priority to prevent from being killed by
> > > > system oom.
> > >
> > > This is really interesting! I assume that the oom_score_adj is set to
> > > protect from the global oom situation right?
> >
> > Right. See also the kubernetes doc.
> >
> > > I am struggling to
> > > understand what is the expected behavior when the oom is internal for
> > > such a group though. Does killing a single task from such a group is a
> > > sensible choice? I am not really familiar with kubelet but can it cope
> > > with data_sim going away from under it while the rest would still run?
> > > Wouldn't it make more sense to simply tear down the whole thing?
> > >
> >
> > There are two containers in one kubernetes pod, one of which is a
> > pause-container, which has only one process - the pause, which is
> > managing the netns, and the other is the docker-init-container, in
> > which all other processes are running.
> > Once the pause process is killed, the kubelet will rebuild all the
> > containers in this pod, while if one of the processes in the
> > docker-init-container is killed, the kubelet will try to re-run it.
> > So tearing down the whole thing is more costly than only trying to
> > re-running one process.
> > I'm not familiar with kubernetes as well, that is my understanding.
>
> Thanks for the clarification!
>
> [...]
> > > oom_score has a very coarse units because it maps all the consumed
> > > memory into 0 - 1000 scale so effectively per-mille of the usable
> > > memory. oom_score_adj acts on top of that as a bias. This is
> > > exported to the userspace and I do not think we can change that (see
> > > Documentation/filesystems/proc.rst) unfortunately.
> >
> > In this doc, I only find the oom_score and oom_score_adj is exposed to
> > the userspace.
> > While this patch only changes the oom_control->chosen_points, which is
> > only for oom internally use.
> > So I don't think we can't change oom_control->chosen_points.
>
> Unless I am misreading the patch you are allowing negative values to be
> returned from proc_oom_score and that is used by proc_oom_score which is
> exported to the userspace.

Thanks for pointing it out. I missed it.
Yafang Shao July 9, 2020, 1:57 a.m. UTC | #8
On Thu, Jul 9, 2020 at 1:57 AM David Rientjes <rientjes@google.com> wrote:
>
> On Wed, 8 Jul 2020, Michal Hocko wrote:
>
> > I have only now realized that David is not on Cc. Add him here. The
> > patch is http://lkml.kernel.org/r/1594214649-9837-1-git-send-email-laoar.shao@gmail.com.
> >
> > I believe the main problem is that we are normalizing to oom_score_adj
> > units rather than usage/total. I have a very vague recollection this has
> > been done in the past but I didn't get to dig into details yet.
> >
>
> The memcg max is 4194304 pages, and an oom_score_adj of -998 would yield a
> page adjustment of:
>
> adj = -998 * 4194304 / 1000 = −4185915 pages
>
> The largest pid 58406 (data_sim) has rss 3967322 pages,
> pgtables 37101568 / 4096 = 9058 pages, and swapents 0.  So it's unadjusted
> badness is
>
> 3967322 + 9058 pages = 3976380 pages
>
> Factoring in oom_score_adj, all of these processes will have a badness of
> 1 because oom_badness() doesn't underflow, which I think is the point of
> Yafang's proposal.
>

Right. Thanks for helping clarify it.

> I think the patch can work but, as you mention, also needs an update to
> proc_oom_score().  proc_oom_score() is using the global amount of memory
> so Yafang is likely not seeing it go negative for that reason but it could
> happen.
>

I missed proc_oom_score(). I will think about how to correct it.

> > On Wed 08-07-20 16:28:08, Michal Hocko wrote:
> > > On Wed 08-07-20 09:24:09, Yafang Shao wrote:
> > > > Recently we found an issue on our production environment that when memcg
> > > > oom is triggered the oom killer doesn't chose the process with largest
> > > > resident memory but chose the first scanned process. Note that all
> > > > processes in this memcg have the same oom_score_adj, so the oom killer
> > > > should chose the process with largest resident memory.
> > > >
> > > > Bellow is part of the oom info, which is enough to analyze this issue.
> > > > [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
> > > > [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
> > > > [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
> > > > [...]
> > > > [7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> > > > [7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
> > > > [7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
> > > > [7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
> > > > [7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
> > > > [7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
> > > > [7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
> > > > [7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
> > > > [7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
> > > > [7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
> > > > [7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
> > > > [7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
> > > > [7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
> > > > [7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
> > > > [7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
> > > > [7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
> > > > [7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
> > > > [7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
> > > > [7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
> > > > [7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
> > > > [7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
> > > > [7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
> > > > [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
> > > > [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
> > > > [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > >
> > > > We can find that the first scanned process 5740 (pause) was killed, but its
> > > > rss is only one page. That is because, when we calculate the oom badness in
> > > > oom_badness(), we always ignore the negtive point and convert all of these
> > > > negtive points to 1. Now as oom_score_adj of all the processes in this
> > > > targeted memcg have the same value -998, the points of these processes are
> > > > all negtive value. As a result, the first scanned process will be killed.
> > >
> > > Such a large bias can skew results quite considerably.
> > >
> > > > The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
> > > > a Guaranteed pod, which has higher priority to prevent from being killed by
> > > > system oom.
> > >
> > > This is really interesting! I assume that the oom_score_adj is set to
> > > protect from the global oom situation right? I am struggling to
> > > understand what is the expected behavior when the oom is internal for
> > > such a group though. Does killing a single task from such a group is a
> > > sensible choice? I am not really familiar with kubelet but can it cope
> > > with data_sim going away from under it while the rest would still run?
> > > Wouldn't it make more sense to simply tear down the whole thing?
> > >
> > > But that is a separate thing.
> > >
> > > > To fix this issue, we should make the calculation of oom point more
> > > > accurate. We can achieve it by convert the chosen_point from 'unsigned
> > > > long' to 'long'.
> > >
> > > oom_score has a very coarse units because it maps all the consumed
> > > memory into 0 - 1000 scale so effectively per-mille of the usable
> > > memory. oom_score_adj acts on top of that as a bias. This is
> > > exported to the userspace and I do not think we can change that (see
> > > Documentation/filesystems/proc.rst) unfortunately. So you patch cannot
> > > be really accepted as is because it would start reporting values outside
> > > of the allowed range unless I am doing some math incorrectly.
> > >
> > > On the other hand, in this particular case I believe the existing
> > > calculation is just wrong. Usable memory is 16777216kB (4194304 pages),
> > > the top consumer is 3976380 pages so 94.8% the lowest memory consumer is
> > > effectively 0%. Even if we discount 94.8% by 99.8% then we should be
> > > still having something like 7950 pages. So the normalization oom_badness
> > > does cuts results too aggressively. There was quite some churn in the
> > > calculation in the past fixing weird rounding bugs so I have to think
> > > about how to fix this properly some more.
> > >
> > > That being said, even though the configuration is weird I do agree that
> > > oom_badness scaling is really unexpected and the memory consumption
> > > in this particular example should be quite telling about who to chose as
> > > an oom victim.
Yafang Shao July 9, 2020, 2:14 a.m. UTC | #9
On Thu, Jul 9, 2020 at 3:02 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 08-07-20 10:57:27, David Rientjes wrote:
> > On Wed, 8 Jul 2020, Michal Hocko wrote:
> >
> > > I have only now realized that David is not on Cc. Add him here. The
> > > patch is http://lkml.kernel.org/r/1594214649-9837-1-git-send-email-laoar.shao@gmail.com.
> > >
> > > I believe the main problem is that we are normalizing to oom_score_adj
> > > units rather than usage/total. I have a very vague recollection this has
> > > been done in the past but I didn't get to dig into details yet.
> > >
> >
> > The memcg max is 4194304 pages, and an oom_score_adj of -998 would yield a
> > page adjustment of:
> >
> > adj = -998 * 4194304 / 1000 = −4185915 pages
> >
> > The largest pid 58406 (data_sim) has rss 3967322 pages,
> > pgtables 37101568 / 4096 = 9058 pages, and swapents 0. So its unadjusted
> > badness is
> >
> > 3967322 + 9058 pages = 3976380 pages
> >
> > Factoring in oom_score_adj, all of these processes will have a badness of
> > 1 because oom_badness() doesn't underflow, which I think is the point of
> > Yafang's proposal.
> >
> > I think the patch can work but, as you mention, also needs an update to
> > proc_oom_score().  proc_oom_score() is using the global amount of memory
> > so Yafang is likely not seeing it go negative for that reason but it could
> > happen.
>
> Yes, memcg just makes it more obvious but the same might happen for the
> global case. I am not sure how we can both allow underflow and present
> the value that would fit the existing model. The exported value should
> really reflect what the oom killer is using for the calculation or we
> are going to see discrepancies between the real oom decision and
> presented values. So I believe we really have to change the calculation
> rather than just make it tolerant to underflows.
>

Hi Michal,

- Before my patch,
The result of oom_badness() is [1,  2 * totalpages),
and the result of proc_oom_score() is  [0, 2000).

While the badness score in the Documentation/filesystems/proc.rst is: [0, 1000]
"The badness heuristic assigns a value to each candidate task ranging from 0
(never kill) to 1000 (always kill) to determine which process is targeted"

That means, we need to update the documentation anyway unless my
calculation is wrong.
So the point will be how to change it?

- After my patch
oom_badness():  (-totalpages, 2 * totalpages)
proc_oom_score(): (-1000, 2000)

If we allow underflow, we can change the documentation as "from -1000
(never kill) to 2000(always kill)".
While if we don't allow underflow, we can make the simple change below,

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 774784587..0da8efa41 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -528,7 +528,7 @@ static int proc_oom_score(struct seq_file *m,
struct pid_namespace *ns,
        unsigned long totalpages = totalram_pages + total_swap_pages;
        unsigned long points = 0;

-       points = oom_badness(task, NULL, NULL, totalpages) *
+       points = 1000 + oom_badness(task, NULL, NULL, totalpages) *
                                      1000 / totalpages;
        seq_printf(m, "%lu\n", points);

And then update the documentation as "from 0 (never kill) to 3000
(always kill)"
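
As a rough illustration of the second option above (the +1000 shift), here is a
userspace sketch; the function name and example inputs are made up for
illustration, and it assumes oom_badness() may now return anything in
(-totalpages, 2 * totalpages):

#include <stdio.h>

/* Sketch of the "+1000" idea: shift a signed badness so the reported
 * /proc/<pid>/oom_score would land in roughly (0, 3000). */
static long shifted_oom_score(long badness, long totalpages)
{
	return 1000 + badness * 1000 / totalpages;
}

int main(void)
{
	long totalpages = 4194304;

	printf("%ld\n", shifted_oom_score(1 - totalpages, totalpages));	/* close to 0 */
	printf("%ld\n", shifted_oom_score(-209232, totalpages));		/* 951, a data_sim-like negative score */
	printf("%ld\n", shifted_oom_score(2 * totalpages - 1, totalpages));	/* close to 3000 */
	return 0;
}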
Michal Hocko July 9, 2020, 6:26 a.m. UTC | #10
On Thu 09-07-20 10:14:14, Yafang Shao wrote:
> On Thu, Jul 9, 2020 at 3:02 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Wed 08-07-20 10:57:27, David Rientjes wrote:
> > > On Wed, 8 Jul 2020, Michal Hocko wrote:
> > >
> > > > I have only now realized that David is not on Cc. Add him here. The
> > > > patch is http://lkml.kernel.org/r/1594214649-9837-1-git-send-email-laoar.shao@gmail.com.
> > > >
> > > > I believe the main problem is that we are normalizing to oom_score_adj
> > > > units rather than usage/total. I have a very vague recollection this has
> > > > been done in the past but I didn't get to dig into details yet.
> > > >
> > >
> > > The memcg max is 4194304 pages, and an oom_score_adj of -998 would yield a
> > > page adjustment of:
> > >
> > > adj = -998 * 4194304 / 1000 = −4185915 pages
> > >
> > > The largest pid 58406 (data_sim) has rss 3967322 pages,
> > > pgtables 37101568 / 4096 = 9058 pages, and swapents 0. So its unadjusted
> > > badness is
> > >
> > > 3967322 + 9058 pages = 3976380 pages
> > >
> > > Factoring in oom_score_adj, all of these processes will have a badness of
> > > 1 because oom_badness() doesn't underflow, which I think is the point of
> > > Yafang's proposal.
> > >
> > > I think the patch can work but, as you mention, also needs an update to
> > > proc_oom_score().  proc_oom_score() is using the global amount of memory
> > > so Yafang is likely not seeing it go negative for that reason but it could
> > > happen.
> >
> > Yes, memcg just makes it more obvious but the same might happen for the
> > global case. I am not sure how we can both allow underflow and present
> > the value that would fit the existing model. The exported value should
> > really reflect what the oom killer is using for the calculation or we
> > are going to see discrepancies between the real oom decision and
> > presented values. So I believe we really have to change the calculation
> > rather than just make it tolerant to underflows.
> >
> 
> Hi Michal,
> 
> - Before my patch,
> The result of oom_badness() is [1,  2 * totalpages),
> and the result of proc_oom_score() is  [0, 2000).
> 
> While the badness score in the Documentation/filesystems/proc.rst is: [0, 1000]
> "The badness heuristic assigns a value to each candidate task ranging from 0
> (never kill) to 1000 (always kill) to determine which process is targeted"
> 
> That means, we need to update the documentation anyway unless my
> calculation is wrong.

No, your calculation is correct. The documentation is correct albeit
slightly misleading. The net score calculation is indeed in range of [0, 1000].
It is the oom_score_adj added on top which skews it. This is documented
as
"The value of /proc/<pid>/oom_score_adj is added to the badness score before it
is used to determine which task to kill."

This is the exported value but paragraph "3.2 /proc/<pid>/oom_score" only says
"This file can be used to check the current score used by the oom-killer is for
any given <pid>." which is not really explicit about the exported range.

Maybe clarifying that would be helpful. I will post a patch. There are
a few other things to sync up with the current state.

> So the point will be how to change it?
> 
> - After my patch
> oom_badness():  (-totalpages, 2 * totalpages)
> proc_oom_score(): (-1000, 2000)
> 
> If we allow underflow, we can change the documentation as "from -1000
> (never kill) to 2000(always kill)".
> While if we don't allow underflow, we can make the simple change below,
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 774784587..0da8efa41 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -528,7 +528,7 @@ static int proc_oom_score(struct seq_file *m,
> struct pid_namespace *ns,
>         unsigned long totalpages = totalram_pages + total_swap_pages;
>         unsigned long points = 0;
> 
> -       points = oom_badness(task, NULL, NULL, totalpages) *
> +       points = 1000 + oom_badness(task, NULL, NULL, totalpages) *
>                                       1000 / totalpages;
>         seq_printf(m, "%lu\n", points);
> 
> And then update the documentation as "from 0 (never kill) to 3000
> (always kill)"

This is still not quite there yet, I am afraid. OOM_SCORE_ADJ_MIN tasks have
always reported 0 and I can imagine somebody might depend on this fact.
So you need to special case LONG_MIN at least. It would be also better
to stick with [0, 2000] range.
Michal Hocko July 9, 2020, 6:41 a.m. UTC | #11
On Thu 09-07-20 08:26:46, Michal Hocko wrote:
> On Thu 09-07-20 10:14:14, Yafang Shao wrote:
[...]
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 774784587..0da8efa41 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -528,7 +528,7 @@ static int proc_oom_score(struct seq_file *m,
> > struct pid_namespace *ns,
> >         unsigned long totalpages = totalram_pages + total_swap_pages;
> >         unsigned long points = 0;
> > 
> > -       points = oom_badness(task, NULL, NULL, totalpages) *
> > +       points = 1000 + oom_badness(task, NULL, NULL, totalpages) *
> >                                       1000 / totalpages;
> >         seq_printf(m, "%lu\n", points);
> > 
> > And then update the documentation as "from 0 (never kill) to 3000
> > (always kill)"
> 
> This is still not quite there yet, I am afraid. OOM_SCORE_ADJ_MIN tasks have
> always reported 0 and I can imagine somebody might depend on this fact.
> So you need to special case LONG_MIN at least. It would be also better
> to stick with [0, 2000] range.

usage * 1000 / total -> [0, 1000]

adj -> [-1000, 1000]

max (0, usage * 1000 / total + adj) -> [0, 2000]

So effectively something like this.

diff --git a/fs/proc/base.c b/fs/proc/base.c
index d86c0afc8a85..fdf5a0be1cc3 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -550,9 +550,9 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
 			  struct pid *pid, struct task_struct *task)
 {
 	unsigned long totalpages = totalram_pages() + total_swap_pages;
-	unsigned long points = 0;
+	long points = 0;
 
-	points = oom_badness(task, totalpages) * 1000 / totalpages;
+	points = max(0, oom_badness(task, totalpages));
 	seq_printf(m, "%lu\n", points);
 
 	return 0;
diff --git a/mm/gup.c b/mm/gup.c
index de9e36262ccb..75980dd5a2fc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1794,7 +1794,6 @@ static long __gup_longterm_locked(struct task_struct *tsk,
 				     vmas_tmp, NULL, gup_flags);
 
 	if (gup_flags & FOLL_LONGTERM) {
-		memalloc_nocma_restore(flags);
 		if (rc < 0)
 			goto out;
 
@@ -1802,11 +1801,13 @@ static long __gup_longterm_locked(struct task_struct *tsk,
 			for (i = 0; i < rc; i++)
 				put_page(pages[i]);
 			rc = -EOPNOTSUPP;
+			memalloc_nocma_restore(flags);
 			goto out;
 		}
 
 		rc = check_and_migrate_cma_pages(tsk, mm, start, rc, pages,
 						 vmas_tmp, gup_flags);
+		memalloc_nocma_restore(flags);
 	}
 
 out:
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6e94962893ee..30dbc302a677 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -196,13 +196,13 @@ static bool is_dump_unreclaim_slabs(void)
  * predictable as possible.  The goal is to return the highest value for the
  * task consuming the most memory to avoid subsequent oom failures.
  */
-unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
+long oom_badness(struct task_struct *p, unsigned long totalpages)
 {
 	long points;
 	long adj;
 
 	if (oom_unkillable_task(p))
-		return 0;
+		return -1;
 
 	p = find_lock_task_mm(p);
 	if (!p)
@@ -229,15 +229,10 @@ unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
 		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
 	task_unlock(p);
 
-	/* Normalize to oom_score_adj units */
-	adj *= totalpages / 1000;
+	points *= 1000 / totalpages;
 	points += adj;
 
-	/*
-	 * Never return 0 for an eligible task regardless of the root bonus and
-	 * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
-	 */
-	return points > 0 ? points : 1;
+	return points;
 }
 
 static const char * const oom_constraint_text[] = {
@@ -341,7 +336,7 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
 	}
 
 	points = oom_badness(task, oc->totalpages);
-	if (!points || points < oc->chosen_points)
+	if (points > 0 || points < oc->chosen_points)
 		goto next;
 
 select:
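
The draft diff above is explicitly a quick sketch. As a plain userspace
illustration of the clamped mapping it aims for, max(0, usage * 1000 / total + adj)
within [0, 2000], here is a made-up helper using the numbers from this thread:

#include <stdio.h>

/* Sketch of the clamped score described above:
 * score = max(0, usage * 1000 / total + adj), always within [0, 2000]. */
static long clamped_oom_score(long usage_pages, long totalpages, long adj)
{
	long score = usage_pages * 1000 / totalpages + adj;

	return score > 0 ? score : 0;
}

int main(void)
{
	long totalpages = 4194304;

	/* data_sim: ~948 per-mille of the memcg, adj -998 -> clamped to 0 */
	printf("%ld\n", clamped_oom_score(3976380, totalpages, -998));
	/* same usage with adj 0 -> 948 */
	printf("%ld\n", clamped_oom_score(3976380, totalpages, 0));
	/* full usage with adj +1000 -> 2000 */
	printf("%ld\n", clamped_oom_score(totalpages, totalpages, 1000));
	return 0;
}

This only covers what /proc/<pid>/oom_score would report; the victim selection
itself would presumably still compare the unclamped signed badness values, as
in the draft above.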
Yafang Shao July 9, 2020, 7:31 a.m. UTC | #12
On Thu, Jul 9, 2020 at 2:26 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 09-07-20 10:14:14, Yafang Shao wrote:
> > On Thu, Jul 9, 2020 at 3:02 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Wed 08-07-20 10:57:27, David Rientjes wrote:
> > > > On Wed, 8 Jul 2020, Michal Hocko wrote:
> > > >
> > > > > I have only now realized that David is not on Cc. Add him here. The
> > > > > patch is http://lkml.kernel.org/r/1594214649-9837-1-git-send-email-laoar.shao@gmail.com.
> > > > >
> > > > > I believe the main problem is that we are normalizing to oom_score_adj
> > > > > units rather than usage/total. I have a very vague recollection this has
> > > > > been done in the past but I didn't get to dig into details yet.
> > > > >
> > > >
> > > > The memcg max is 4194304 pages, and an oom_score_adj of -998 would yield a
> > > > page adjustment of:
> > > >
> > > > adj = -998 * 4194304 / 1000 = −4185915 pages
> > > >
> > > > The largest pid 58406 (data_sim) has rss 3967322 pages,
> > > > pgtables 37101568 / 4096 = 9058 pages, and swapents 0. So its unadjusted
> > > > badness is
> > > >
> > > > 3967322 + 9058 pages = 3976380 pages
> > > >
> > > > Factoring in oom_score_adj, all of these processes will have a badness of
> > > > 1 because oom_badness() doesn't underflow, which I think is the point of
> > > > Yafang's proposal.
> > > >
> > > > I think the patch can work but, as you mention, also needs an update to
> > > > proc_oom_score().  proc_oom_score() is using the global amount of memory
> > > > so Yafang is likely not seeing it go negative for that reason but it could
> > > > happen.
> > >
> > > Yes, memcg just makes it more obvious but the same might happen for the
> > > global case. I am not sure how we can both allow underflow and present
> > > the value that would fit the existing model. The exported value should
> > > really reflect what the oom killer is using for the calculation or we
> > > are going to see discrepancies between the real oom decision and
> > > presented values. So I believe we really have to change the calculation
> > > rather than just make it tolerant to underflows.
> > >
> >
> > Hi Michal,
> >
> > - Before my patch,
> > The result of oom_badness() is [1,  2 * totalpages),
> > and the result of proc_oom_score() is  [0, 2000).
> >
> > While the badness score in the Documentation/filesystems/proc.rst is: [0, 1000]
> > "The badness heuristic assigns a value to each candidate task ranging from 0
> > (never kill) to 1000 (always kill) to determine which process is targeted"
> >
> > That means, we need to update the documentation anyway unless my
> > calculation is wrong.
>
> No, your calculation is correct. The documentation is correct albeit
> slightly misleading. The net score calculation is indeed in range of [0, 1000].
> It is the oom_score_adj added on top which skews it. This is documented
> as
> "The value of /proc/<pid>/oom_score_adj is added to the badness score before it
> is used to determine which task to kill."
>
> This is the exported value but paragraph "3.2 /proc/<pid>/oom_score" only says
> "This file can be used to check the current score used by the oom-killer is for
> any given <pid>." which is not really explicit about the exported range.
>
> Maybe clarifying that would be helpful. I will post a patch. There are
> a few other things to sync up with the current state.
>
> > So the point will be how to change it?
> >
> > - After my patch
> > oom_badness():  (-totalpages, 2 * totalpages)
> > proc_oom_score(): (-1000, 2000)
> >
> > If we allow underflow, we can change the documentation as "from -1000
> > (never kill) to 2000(always kill)".
> > While if we don't allow underflow, we can make the simple change below,
> >
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 774784587..0da8efa41 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -528,7 +528,7 @@ static int proc_oom_score(struct seq_file *m,
> > struct pid_namespace *ns,
> >         unsigned long totalpages = totalram_pages + total_swap_pages;
> >         unsigned long points = 0;
> >
> > -       points = oom_badness(task, NULL, NULL, totalpages) *
> > +       points = 1000 + oom_badness(task, NULL, NULL, totalpages) *
> >                                       1000 / totalpages;
> >         seq_printf(m, "%lu\n", points);
> >
> > And then update the documentation as "from 0 (never kill) to 3000
> > (always kill)"
>
> This is still not quite there yet, I am afraid. OOM_SCORE_ADJ_MIN tasks have
> always reported 0 and I can imagine somebody might depend on this fact.

No, I don't think anybody will use the reported 0 to get the
conclusion that it is an OOM_SCORE_ADJ_MIN task.
Because,
    points = oom_badness(task, totalpages) * 1000 / totalpages;
so the points will always be 0 if the return value of
oom_badness(task, totalpages) is less than totalpages/1000.

If the user wants to know whether it is an OOM_SCORE_ADJ_MIN task, he
will always use /proc/[pid]/oom_score_adj to get it, which is more
reliable.

> So you need to special case LONG_MIN at least. It would be also better
> to stick with [0, 2000] range.

I don't know why it must stick with [0, 2000] range.
As oom_score_adj sticks with the [-1000, 1000] range, I think
proc_oom_score() could report a negative value as well.
Michal Hocko July 9, 2020, 8:17 a.m. UTC | #13
On Thu 09-07-20 15:31:23, Yafang Shao wrote:
> On Thu, Jul 9, 2020 at 2:26 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 09-07-20 10:14:14, Yafang Shao wrote:
> > > On Thu, Jul 9, 2020 at 3:02 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Wed 08-07-20 10:57:27, David Rientjes wrote:
> > > > > On Wed, 8 Jul 2020, Michal Hocko wrote:
> > > > >
> > > > > > I have only now realized that David is not on Cc. Add him here. The
> > > > > > patch is http://lkml.kernel.org/r/1594214649-9837-1-git-send-email-laoar.shao@gmail.com.
> > > > > >
> > > > > > I believe the main problem is that we are normalizing to oom_score_adj
> > > > > > units rather than usage/total. I have a very vague recollection this has
> > > > > > been done in the past but I didn't get to dig into details yet.
> > > > > >
> > > > >
> > > > > The memcg max is 4194304 pages, and an oom_score_adj of -998 would yield a
> > > > > page adjustment of:
> > > > >
> > > > > adj = -998 * 4194304 / 1000 = −4185915 pages
> > > > >
> > > > > The largest pid 58406 (data_sim) has rss 3967322 pages,
> > > > > pgtables 37101568 / 4096 = 9058 pages, and swapents 0. So its unadjusted
> > > > > badness is
> > > > >
> > > > > 3967322 + 9058 pages = 3976380 pages
> > > > >
> > > > > Factoring in oom_score_adj, all of these processes will have a badness of
> > > > > 1 because oom_badness() doesn't underflow, which I think is the point of
> > > > > Yafang's proposal.
> > > > >
> > > > > I think the patch can work but, as you mention, also needs an update to
> > > > > proc_oom_score().  proc_oom_score() is using the global amount of memory
> > > > > so Yafang is likely not seeing it go negative for that reason but it could
> > > > > happen.
> > > >
> > > > Yes, memcg just makes it more obvious but the same might happen for the
> > > > global case. I am not sure how we can both allow underflow and present
> > > > the value that would fit the existing model. The exported value should
> > > > really reflect what the oom killer is using for the calculation or we
> > > > are going to see discrepancies between the real oom decision and
> > > > presented values. So I believe we really have to change the calculation
> > > > rather than just make it tolerant to underflows.
> > > >
> > >
> > > Hi Michal,
> > >
> > > - Before my patch,
> > > The result of oom_badness() is [1,  2 * totalpages),
> > > and the result of proc_oom_score() is  [0, 2000).
> > >
> > > While the badness score in the Documentation/filesystems/proc.rst is: [0, 1000]
> > > "The badness heuristic assigns a value to each candidate task ranging from 0
> > > (never kill) to 1000 (always kill) to determine which process is targeted"
> > >
> > > That means, we need to update the documentation anyway unless my
> > > calculation is wrong.
> >
> > No, your calculation is correct. The documentation is correct albeit
> > slightly misleading. The net score calculation is indeed in range of [0, 1000].
> > It is the oom_score_adj added on top which skews it. This is documented
> > as
> > "The value of /proc/<pid>/oom_score_adj is added to the badness score before it
> > is used to determine which task to kill."
> >
> > This is the exported value but paragraph "3.2 /proc/<pid>/oom_score" only says
> > "This file can be used to check the current score used by the oom-killer is for
> > any given <pid>." which is not really explicit about the exported range.
> >
> > Maybe clarifying that would be helpful. I will post a patch. There are
> > a few other things to sync up with the current state.
> >
> > > So the point will be how to change it?
> > >
> > > - After my patch
> > > oom_badness():  (-totalpages, 2 * totalpages)
> > > proc_oom_score(): (-1000, 2000)
> > >
> > > If we allow underflow, we can change the documentation as "from -1000
> > > (never kill) to 2000(always kill)".
> > > While if we don't allow underflow, we can make the simple change below,
> > >
> > > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > > index 774784587..0da8efa41 100644
> > > --- a/fs/proc/base.c
> > > +++ b/fs/proc/base.c
> > > @@ -528,7 +528,7 @@ static int proc_oom_score(struct seq_file *m,
> > > struct pid_namespace *ns,
> > >         unsigned long totalpages = totalram_pages + total_swap_pages;
> > >         unsigned long points = 0;
> > >
> > > -       points = oom_badness(task, NULL, NULL, totalpages) *
> > > +       points = 1000 + oom_badness(task, NULL, NULL, totalpages) *
> > >                                       1000 / totalpages;
> > >         seq_printf(m, "%lu\n", points);
> > >
> > > And then update the documentation as "from 0 (never kill) to 3000
> > > (always kill)"
> >
> > This is still not quite there yet, I am afraid. OOM_SCORE_ADJ_MIN tasks have
> > always reported 0 and I can imagine somebody might depend on this fact.
> 
> No, I don't think anybody will use the reported 0 to get the
> conclusion that it is an OOM_SCORE_ADJ_MIN task.
> Because,
>     points = oom_badness(task, totalpages) * 1000 / totalpages;
> so the points will always be 0 if the return value of
> oom_badness(task, totalpages) is less than totalpages/1000.

Correct, I've had the other implication in mind though.
OOM_SCORE_ADJ_MIN -> score == 0

> If the user wants to know whether it is an OOM_SCORE_ADJ_MIN task, he
> will always use /proc/[pid]/oom_score_adj to get it, which is more
> reliable.
> 
> > So you need to special case LONG_MIN at least. It would be also better
> > to stick with [0, 2000] range.
> 
> I don't know why it must stick with [0, 2000] range.

Because this is a long term exported API and userspace might depend on
it.
diff mbox series

Patch

diff --git a/include/linux/oom.h b/include/linux/oom.h
index c696c26..f022f58 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -48,7 +48,7 @@  struct oom_control {
 	/* Used by oom implementation, do not set */
 	unsigned long totalpages;
 	struct task_struct *chosen;
-	unsigned long chosen_points;
+	long chosen_points;
 
 	/* Used to print the constraint info. */
 	enum oom_constraint constraint;
@@ -107,7 +107,7 @@  static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
 
 bool __oom_reap_task_mm(struct mm_struct *mm);
 
-extern unsigned long oom_badness(struct task_struct *p,
+long oom_badness(struct task_struct *p,
 		unsigned long totalpages);
 
 extern bool out_of_memory(struct oom_control *oc);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6e94962..e740695 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -196,17 +196,17 @@  static bool is_dump_unreclaim_slabs(void)
  * predictable as possible.  The goal is to return the highest value for the
  * task consuming the most memory to avoid subsequent oom failures.
  */
-unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
+long oom_badness(struct task_struct *p, unsigned long totalpages)
 {
 	long points;
 	long adj;
 
 	if (oom_unkillable_task(p))
-		return 0;
+		return LONG_MIN;
 
 	p = find_lock_task_mm(p);
 	if (!p)
-		return 0;
+		return LONG_MIN;
 
 	/*
 	 * Do not even consider tasks which are explicitly marked oom
@@ -218,7 +218,7 @@  unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
 			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
 			in_vfork(p)) {
 		task_unlock(p);
-		return 0;
+		return LONG_MIN;
 	}
 
 	/*
@@ -233,11 +233,7 @@  unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
 	adj *= totalpages / 1000;
 	points += adj;
 
-	/*
-	 * Never return 0 for an eligible task regardless of the root bonus and
-	 * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
-	 */
-	return points > 0 ? points : 1;
+	return points;
 }
 
 static const char * const oom_constraint_text[] = {
@@ -258,6 +254,8 @@  static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	bool cpuset_limited = false;
 	int nid;
 
+	oc->chosen_points = LONG_MIN;
+
 	if (is_memcg_oom(oc)) {
 		oc->totalpages = mem_cgroup_get_max(oc->memcg) ?: 1;
 		return CONSTRAINT_MEMCG;
@@ -336,12 +334,12 @@  static int oom_evaluate_task(struct task_struct *task, void *arg)
 	 * killed first if it triggers an oom, then select it.
 	 */
 	if (oom_task_origin(task)) {
-		points = ULONG_MAX;
+		points = LONG_MAX;
 		goto select;
 	}
 
 	points = oom_badness(task, oc->totalpages);
-	if (!points || points < oc->chosen_points)
+	if (points == LONG_MIN || points < oc->chosen_points)
 		goto next;
 
 select:
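
For completeness, a toy userspace model of why the signed return value and the
LONG_MIN sentinel in this patch let the selection pick the largest consumer
even when every eligible task scores negative. The task names and numbers
merely echo the earlier example; the structure is purely illustrative, not the
kernel API:

#include <limits.h>
#include <stdio.h>

/* Toy model of the post-patch selection: badness is a signed long,
 * ineligible tasks report LONG_MIN, and the highest (possibly negative)
 * score wins. */
struct task {
	const char *name;
	long badness;		/* rss + swapents + pgtables + adj * total / 1000 */
};

int main(void)
{
	struct task tasks[] = {
		{ "pause",      9 - 4185612 },		/* tiny rss, adj -998 */
		{ "data_sim",   3976380 - 4185612 },	/* largest consumer, adj -998 */
		{ "unkillable", LONG_MIN },		/* e.g. oom_unkillable_task() */
	};
	const struct task *chosen = NULL;
	long chosen_points = LONG_MIN;

	for (unsigned int i = 0; i < sizeof(tasks) / sizeof(tasks[0]); i++) {
		long points = tasks[i].badness;

		if (points == LONG_MIN || points < chosen_points)
			continue;	/* mirrors the "goto next" above */
		chosen = &tasks[i];
		chosen_points = points;
	}

	printf("chosen: %s (badness %ld)\n",
	       chosen ? chosen->name : "none", chosen_points);
	return 0;
}

With the old unsigned clamp both pause and data_sim would have reported 1 and
the first task scanned would have been picked; with signed scores, data_sim
(-209232) now beats pause (-4185603).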