OOM: Better, but still there on

On Sat 17-12-16 22:06:47, Nils Holland wrote:
[...]
> Unfortunately, the reclaim trace messages stopped a while after the first
> OOM messages show up - most likely my "cat" had been killed at that
> point or became unresponsive. :-/

The latter is more probable because I do not see the OOM killer killing
any cat process, and the first bash was killed about 10s after the first
OOM.

2016-12-17 21:36:56 192.168.17.23:6665 [ 1276.828639] Killed process 3894 (xz) total-vm:68640kB, anon-rss:65920kB, file-rss:1696kB, shmem-rss:0kB
2016-12-17 21:36:57 192.168.17.23:6665 [ 1277.598271] Killed process 3864 (sandbox) total-vm:2192kB, anon-rss:128kB, file-rss:1400kB, shmem-rss:0kB
2016-12-17 21:36:57 192.168.17.23:6665 [ 1278.222416] Killed process 3086 (emerge) total-vm:65064kB, anon-rss:52768kB, file-rss:7216kB, shmem-rss:0kB
2016-12-17 21:36:58 192.168.17.23:6665 [ 1278.846902] Killed process 2705 (NetworkManager) total-vm:104376kB, anon-rss:4172kB, file-rss:10516kB, shmem-rss:0kB
2016-12-17 21:36:59 192.168.17.23:6665 [ 1279.862150] Killed process 2823 (polkitd) total-vm:65536kB, anon-rss:2192kB, file-rss:8656kB, shmem-rss:0kB
2016-12-17 21:37:00 192.168.17.23:6665 [ 1280.496988] Killed process 3885 (ebuild.sh) total-vm:10640kB, anon-rss:3340kB, file-rss:2244kB, shmem-rss:0kB
2016-12-17 21:37:04 192.168.17.23:6665 [ 1285.126052] Killed process 2824 (wpa_supplicant) total-vm:8580kB, anon-rss:540kB, file-rss:5092kB, shmem-rss:0kB
2016-12-17 21:37:05 192.168.17.23:6665 [ 1286.124687] Killed process 2943 (bash) total-vm:7320kB, anon-rss:368kB, file-rss:3240kB, shmem-rss:0kB
2016-12-17 21:37:07 192.168.17.23:6665 [ 1287.974353] Killed process 2878 (sshd) total-vm:10524kB, anon-rss:700kB, file-rss:4908kB, shmem-rss:4kB
2016-12-17 21:37:16 192.168.17.23:6665 [ 1296.953350] Killed process 4048 (ebuild.sh) total-vm:10640kB, anon-rss:3352kB, file-rss:1892kB, shmem-rss:0kB
2016-12-17 21:37:24 192.168.17.23:6665 [ 1304.398944] Killed process 1980 (systemd-journal) total-vm:24640kB, anon-rss:332kB, file-rss:4608kB, shmem-rss:4kB
2016-12-17 21:37:25 192.168.17.23:6665 [ 1305.934472] Killed process 2918 ((sd-pam)) total-vm:9152kB, anon-rss:964kB, file-rss:1536kB, shmem-rss:0kB
2016-12-17 21:37:28 192.168.17.23:6665 [ 1308.878775] Killed process 2888 (systemd) total-vm:7856kB, anon-rss:528kB, file-rss:4388kB, shmem-rss:0kB
2016-12-17 21:37:34 192.168.17.23:6665 [ 1314.268177] Killed process 2711 (rsyslogd) total-vm:25200kB, anon-rss:1084kB, file-rss:2908kB, shmem-rss:0kB
2016-12-17 21:37:39 192.168.17.23:6665 [ 1319.634561] Killed process 2704 (systemd-logind) total-vm:5980kB, anon-rss:340kB, file-rss:3568kB, shmem-rss:0kB
2016-12-17 21:37:43 192.168.17.23:6665 [ 1323.488894] Killed process 3103 (htop) total-vm:7532kB, anon-rss:1024kB, file-rss:2872kB, shmem-rss:0kB
2016-12-17 21:38:42 192.168.17.23:6665 [ 1379.556282] Killed process 2701 (systemd-timesyn) total-vm:15480kB, anon-rss:356kB, file-rss:3292kB, shmem-rss:0kB
2016-12-17 21:39:05 192.168.17.23:6665 [ 1403.130435] Killed process 3082 (bash) total-vm:7324kB, anon-rss:380kB, file-rss:3324kB, shmem-rss:0kB
2016-12-17 21:39:17 192.168.17.23:6665 [ 1417.600367] Killed process 3077 (start_trace) total-vm:6948kB, anon-rss:184kB, file-rss:2524kB, shmem-rss:0kB
2016-12-17 21:39:24 192.168.17.23:6665 [ 1423.955452] Killed process 3073 (bash) total-vm:7324kB, anon-rss:380kB, file-rss:3284kB, shmem-rss:0kB
2016-12-17 21:39:27 192.168.17.23:6665 [ 1425.338670] Killed process 3099 (bash) total-vm:7324kB, anon-rss:376kB, file-rss:3176kB, shmem-rss:0kB
2016-12-17 21:39:27 192.168.17.23:6665 [ 1426.800677] Killed process 3070 (screen) total-vm:7440kB, anon-rss:960kB, file-rss:2360kB, shmem-rss:0kB
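
For reference, a list like the one above can be reduced to just the
interesting columns with something like the following. This is only a
sketch: the killed_summary name is made up for illustration, and it assumes
every kill line has the "[ uptime] Killed process PID (comm) total-vm:..."
shape shown above.

```shell
# Hypothetical helper: print "uptime pid comm" for every OOM kill on stdin.
# Assumes the "[ 1276.828639] Killed process 3894 (xz) total-vm:..." layout;
# lines that do not match are silently dropped.
killed_summary() {
    sed -n 's/.*\[ *\([0-9.]*\)] Killed process \([0-9]*\) (\(.*\)) total-vm.*/\1 \2 \3/p'
}
```

It would be fed with e.g. "xzcat teela_2016-12-17.log.xz | killed_summary".
From the lines quoted above, the first kill (xz) is at 1276.828639 and the
first bash at 1286.124687, so roughly 9.3s apart.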

> In the end, the machine didn't completely panic, but after nothing new
> showed up being logged via the network, I walked up to the
> machine and found it in a state where I couldn't really log in to it
> anymore, but all that worked was, as always, a magic SysRequest reboot.
> 
> The complete log, from machine boot right up to the point where it
> wouldn't really do anything anymore, is up again on my web server (~42
> MB, 928 KB packed):
> 
> http://ftp.tisys.org/pub/misc/teela_2016-12-17.log.xz

$ xzgrep invoked teela_2016-12-17.log.xz | sed 's@.*gfp_mask=0x[0-9a-f]*(\(.*\)), .*@\1@' | sort | uniq -c
      2 GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK
      1 GFP_KERNEL|__GFP_NOTRACK
      6 GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_NOTRACK
      1 GFP_KERNEL|__GFP_NOWARN|__GFP_REPEAT|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_NOTRACK
      2 GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK
      2 GFP_TEMPORARY
      5 GFP_TEMPORARY|__GFP_NOTRACK
      3 GFP_USER|__GFP_COLD

so all of them are lowmem requests, which is in line with your previous
report. This basically means that only zone Normal is usable, as I've
already mentioned before. In general, lowmem problems are inherent to
32b kernels, but in this case we still have a _lot of_ page cache to
reclaim, so we shouldn't really blow up.
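
A quick way to double check the lowmem claim is to look for a highmem-capable
mask among the decoded ones. Sketch only: count_highmem is an illustrative
name, and the pattern is my assumption about how such requests show up in the
decoded gfp_mask (either __GFP_HIGHMEM itself or one of the GFP_HIGHUSER* /
GFP_TRANSHUGE* shorthands).

```shell
# Hypothetical helper: count decoded gfp masks on stdin that may use highmem.
# The pattern is an assumption about the decoded flag names, not taken
# from the log itself.
count_highmem() {
    # grep -c prints 0 but exits non-zero when nothing matches; hide that.
    grep -cE '__GFP_HIGHMEM|GFP_HIGHUSER|GFP_TRANSHUGE' || true
}
```

Piping the sed output from above into count_highmem instead of sort should
print 0 if all the failing requests really are lowmem ones.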

Normal free:41260kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532676kB inactive_file:100kB unevictable:0kB writepending:124kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:157428kB slab_unreclaimable:68940kB kernel_stack:1160kB pagetables:1336kB bounce:0kB free_pcp:484kB local_pcp:240kB free_cma:0kB

and this looks very similar to your previous report as well. No
anonymous pages and the whole file LRU sitting on the active list, so
there is nothing immediately reclaimable. This is very weird because
we should rotate the active list to the inactive one if the latter is low,
which it obviously is here, and this seems to be the case in other cases
as well (inactive_is_low.sh is a simple and dirty script to subtract
Highmem active/inactive counters from the node ones).

$ xzgrep -f zones teela_2016-12-17.log.xz | sh inactive_is_low.sh
total_active 1094600 active 541424 total_inactive 1117512 inactive 104 ratio 1 low 1
total_active 1094744 active 541568 total_inactive 1117524 inactive 116 ratio 1 low 1
total_active 1094864 active 541564 total_inactive 1117512 inactive 108 ratio 1 low 1
total_active 1095188 active 541564 total_inactive 1117220 inactive 116 ratio 1 low 1
total_active 1097520 active 541596 total_inactive 1115048 inactive 120 ratio 1 low 1
total_active 1097836 active 541612 total_inactive 1114764 inactive 136 ratio 1 low 1
total_active 1098692 active 542384 total_inactive 1114688 inactive 100 ratio 1 low 1
total_active 1098964 active 542504 total_inactive 1114480 inactive 24 ratio 1 low 1
total_active 1099108 active 542620 total_inactive 1114544 inactive 92 ratio 1 low 1
total_active 1099180 active 542548 total_inactive 1114564 inactive 236 ratio 1 low 1
[...]
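
To make the "ratio" and "low" columns less magic, here is a rough sketch of
the check itself. It is NOT the inactive_is_low.sh used above, just a
stand-in modeled on my reading of the kernel's inactive_list_is_low()
heuristic, and it works on kB values rather than pages. The function name
and argument order are made up for illustration.

```shell
# Hypothetical stand-in for the check, not the real inactive_is_low.sh.
# $1 = lowmem active file LRU in kB, $2 = lowmem inactive file LRU in kB.
inactive_is_low() {
    awk -v active="$1" -v inactive="$2" 'BEGIN {
        # LRU size in whole GB (1048576 kB per GB)
        gb = int((active + inactive) / 1048576)
        # assumed to mirror int_sqrt(10 * gb), with 1 for lists under 1GB
        ratio = gb ? int(sqrt(10 * gb)) : 1
        # the inactive list counts as low when it is ratio times smaller
        low = (inactive * ratio < active) ? 1 : 0
        printf "active %d inactive %d ratio %d low %d\n", active, inactive, ratio, low
    }'
}
```

With the numbers from the first line above, "inactive_is_low 541424 104"
prints "active 541424 inactive 104 ratio 1 low 1": the lowmem file LRU is
well under 1GB so the ratio is 1, and 104kB inactive is obviously less than
541424kB active.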

Unfortunately, shrink_active_list doesn't have any tracepoint, so we do
not know whether we managed to rotate those pages. If they are referenced
quickly enough we might just keep refaulting them... Could you try to apply
the following diff on top of what you have currently? It should add some more
tracepoint data which might tell us more. We can reduce the amount of
tracing data by enabling only mm_vmscan_lru_isolate,
mm_vmscan_lru_shrink_inactive and mm_vmscan_lru_shrink_active.
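
Enabling just those three events could look something like this. It is a
sketch under a few assumptions: tracefs is reachable under
/sys/kernel/debug/tracing, the events live in the vmscan group, and
mm_vmscan_lru_shrink_active is only there once the diff is applied. The
enable_vmscan_events name is made up for illustration.

```shell
# Hypothetical helper: enable only the three vmscan events named above.
# $1 = tracing directory (assumed default: /sys/kernel/debug/tracing).
enable_vmscan_events() {
    tracing_dir=${1:-/sys/kernel/debug/tracing}
    for ev in mm_vmscan_lru_isolate \
              mm_vmscan_lru_shrink_inactive \
              mm_vmscan_lru_shrink_active; do
        f="$tracing_dir/events/vmscan/$ev/enable"
        if [ -w "$f" ]; then
            echo 1 > "$f"
        else
            # e.g. not root, or the tracepoint is missing on this kernel
            echo "skipping $ev (cannot write $f)" >&2
        fi
    done
}
```

Run it as root before starting the cat on trace_pipe; any event it cannot
enable is reported on stderr and otherwise ignored.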
---
