[021/165] mm, oom: make the calculation of oom badness more accurate

From: Yafang Shao <laoar.shao@gmail.com>
Subject: mm, oom: make the calculation of oom badness more accurate

Recently we found an issue in our production environment: when a memcg
oom is triggered, the oom killer doesn't choose the process with the largest
resident memory but chooses the first scanned process.  Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should choose the process with the largest resident memory.

Below is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
[7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
[7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
[7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
[7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
[7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
[7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
[7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
[7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
[7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
[7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
[7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
[7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
[7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
[7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
[7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
[7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
[7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
[7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
[7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
[7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We can see that the first scanned process 5740 (pause) was killed, but
its rss is only one page.  That is because, when we calculate the oom
badness in oom_badness(), we always ignore negative points and convert
all of them to 1.  Since all the processes in the targeted memcg have
the same oom_score_adj of -998, their points are all negative.  As a
result, the first scanned process is killed.
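
The clamping described above can be modelled in a few lines of user-space C.
This is a simplified sketch, not the kernel source: struct task_sample,
old_badness() and pick_victim_old() are hypothetical names, totalpages is
assumed to be the 16777216kB memcg limit expressed in 4kB pages (4194304),
and a tie is assumed to keep the earlier task, matching the behaviour
reported here.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096UL	/* assumption: 4kB pages */

/* Hypothetical stand-in for the per-task columns of the oom dump. */
struct task_sample {
	unsigned long rss;		/* resident pages */
	unsigned long swapents;		/* swapped-out pages */
	unsigned long pgtables_bytes;	/* page table size in bytes */
	long oom_score_adj;
};

/* Model of the pre-patch scoring: any non-positive score becomes 1. */
static unsigned long old_badness(const struct task_sample *t,
				 unsigned long totalpages)
{
	long points = (long)(t->rss + t->swapents +
			     t->pgtables_bytes / PAGE_SIZE_BYTES);

	/* Normalize oom_score_adj to pages and apply it. */
	points += t->oom_score_adj * (long)(totalpages / 1000);

	return points > 0 ? (unsigned long)points : 1;
}

/* Return the index of the highest score; a tie keeps the earlier task. */
static size_t pick_victim_old(const struct task_sample *tasks, size_t n,
			      unsigned long totalpages)
{
	unsigned long chosen_points = 0;
	size_t chosen = 0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		unsigned long points = old_badness(&tasks[i], totalpages);

		if (points > chosen_points) {
			chosen_points = points;
			chosen = i;
		}
	}
	return chosen;
}
```

With pause (rss 1, pgtables_bytes 32768) listed before data_sim (rss
3967322, pgtables_bytes 37101568), both at oom_score_adj -998, old_badness()
returns 1 for each and pick_victim_old() returns the index of pause.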

The oom_score_adj (-998) in this memcg is set by kubelet, because it is
a Guaranteed pod, which has higher priority to avoid being killed by
the system oom.

To fix this issue, we should make the calculation of oom points more
accurate.  We can achieve this by converting the chosen_point from
'unsigned long' to 'long'.
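
The effect of a signed score can be sketched in user-space C as follows.
This is an illustrative model under stated assumptions, not the patch
itself: struct task_sample, signed_badness() and pick_victim_signed() are
hypothetical names, totalpages is assumed to be 4194304 (the 16777216kB
limit in 4kB pages), and LONG_MIN is used here as the "nothing chosen yet"
starting value.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096UL	/* assumption: 4kB pages */

/* Hypothetical stand-in for the per-task columns of the oom dump. */
struct task_sample {
	unsigned long rss;		/* resident pages */
	unsigned long swapents;		/* swapped-out pages */
	unsigned long pgtables_bytes;	/* page table size in bytes */
	long oom_score_adj;
};

/* Signed score: negative results are kept instead of being clamped to 1. */
static long signed_badness(const struct task_sample *t,
			   unsigned long totalpages)
{
	long points = (long)(t->rss + t->swapents +
			     t->pgtables_bytes / PAGE_SIZE_BYTES);

	/* Normalize oom_score_adj to pages and apply it. */
	points += t->oom_score_adj * (long)(totalpages / 1000);

	return points;
}

/* Return the index of the task with the highest (least negative) score. */
static size_t pick_victim_signed(const struct task_sample *tasks, size_t n,
				 unsigned long totalpages)
{
	long chosen_points = LONG_MIN;
	size_t chosen = 0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		long points = signed_badness(&tasks[i], totalpages);

		if (points > chosen_points) {
			chosen_points = points;
			chosen = i;
		}
	}
	return chosen;
}
```

Under these assumptions pause scores 9 - 4185612 = -4185603 and data_sim
scores 3976380 - 4185612 = -209232, so the comparison now prefers data_sim
even though both scores are negative.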

[cai@lca.pw: reported an issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
[laoar.shao@gmail.com: v3]
  Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com
Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/base.c      |   11 ++++++++++-
 include/linux/oom.h |    4 ++--
 mm/oom_kill.c       |   22 ++++++++++------------
 3 files changed, 22 insertions(+), 15 deletions(-)

Message ID	20200812013122.Uv5D8Wwt_%akpm@linux-foundation.org (mailing list archive)
State	New, archived
Headers
Return-Path: <SRS0=d60w=BW=kvack.org=owner-linux-mm@kernel.org>
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 77ED92076C
Date: Tue, 11 Aug 2020 18:31:22 -0700
From: Andrew Morton <akpm@linux-foundation.org>
To: akpm@linux-foundation.org, cai@lca.pw, laoar.shao@gmail.com, linux-mm@kvack.org, mhocko@suse.com, mm-commits@vger.kernel.org, naresh.kamboju@linaro.org, rientjes@google.com, torvalds@linux-foundation.org
Subject: [patch 021/165] mm, oom: make the calculation of oom badness more accurate
Message-ID: <20200812013122.Uv5D8Wwt_%akpm@linux-foundation.org>
In-Reply-To: <20200811182949.e12ae9a472e3b5e27e16ad6c@linux-foundation.org>
User-Agent: s-nail v14.8.16
Sender: owner-linux-mm@kvack.org
Precedence: bulk