diff mbox series

[v2] page_alloc: consider highatomic reserve in wmartermark fast

Message ID 20200613025102.12880-1-jaewon31.kim@samsung.com (mailing list archive)
State New, archived
Headers show
Series [v2] page_alloc: consider highatomic reserve in wmartermark fast | expand

Commit Message

Jaewon Kim June 13, 2020, 2:51 a.m. UTC
zone_watermark_fast was introduced by commit 48ee5f3696f6 ("mm,
page_alloc: shortcut watermark checks for order-0 pages"). The commit
simply checks if free pages is bigger than watermark without additional
calculation such like reducing watermark.

It considered free cma pages but it did not consider highatomic
reserved. This may incur exhaustion of free pages except high order
atomic free pages.

Assume that reserved_highatomic pageblock is bigger than watermark min,
and there are only few free pages except high order atomic free. Because
zone_watermark_fast passes the allocation without considering high order
atomic free, normal reclaimable allocation like GFP_HIGHUSER will
consume all the free pages. Then finally order-0 atomic allocation may
fail on allocation.

This means watermark min is not protected against non-atomic allocation.
The order-0 atomic allocation with ALLOC_HARDER unwantedly can be
failed. Additionally the __GFP_MEMALLOC allocation with
ALLOC_NO_WATERMARKS also can be failed.

To avoid the problem, zone_watermark_fast should consider highatomic
reserve. If the actual size of high atomic free is counted accurately
like cma free, we may use it. On this patch just use
nr_reserved_highatomic. Additionally introduce
__zone_watermark_unusable_free to factor out common parts between
zone_watermark_fast and __zone_watermark_ok.

This is trace log which shows GFP_HIGHUSER consumes free pages right
before ALLOC_NO_WATERMARKS.

  <...>-22275 [006] ....   889.213383: mm_page_alloc: page=00000000d2be5665 pfn=970744 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213385: mm_page_alloc: page=000000004b2335c2 pfn=970745 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213387: mm_page_alloc: page=00000000017272e1 pfn=970278 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213389: mm_page_alloc: page=00000000c4be79fb pfn=970279 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213391: mm_page_alloc: page=00000000f8a51d4f pfn=970260 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213393: mm_page_alloc: page=000000006ba8f5ac pfn=970261 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213395: mm_page_alloc: page=00000000819f1cd3 pfn=970196 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
  <...>-22275 [006] ....   889.213396: mm_page_alloc: page=00000000f6b72a64 pfn=970197 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
kswapd0-1207  [005] ...1   889.213398: mm_page_alloc: page= (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE

This is an example of ALLOC_HARDER allocation failure.

<4>[ 6207.637280]  [3:  Binder:9343_3:22875] Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
<4>[ 6207.637311]  [3:  Binder:9343_3:22875] Call trace:
<4>[ 6207.637346]  [3:  Binder:9343_3:22875] [<ffffff8008f40f8c>] dump_stack+0xb8/0xf0
<4>[ 6207.637356]  [3:  Binder:9343_3:22875] [<ffffff8008223320>] warn_alloc+0xd8/0x12c
<4>[ 6207.637365]  [3:  Binder:9343_3:22875] [<ffffff80082245e4>] __alloc_pages_nodemask+0x120c/0x1250
<4>[ 6207.637374]  [3:  Binder:9343_3:22875] [<ffffff800827f6e8>] new_slab+0x128/0x604
<4>[ 6207.637381]  [3:  Binder:9343_3:22875] [<ffffff800827b0cc>] ___slab_alloc+0x508/0x670
<4>[ 6207.637387]  [3:  Binder:9343_3:22875] [<ffffff800827ba00>] __kmalloc+0x2f8/0x310
<4>[ 6207.637396]  [3:  Binder:9343_3:22875] [<ffffff80084ac3e0>] context_struct_to_string+0x104/0x1cc
<4>[ 6207.637404]  [3:  Binder:9343_3:22875] [<ffffff80084ad8fc>] security_sid_to_context_core+0x74/0x144
<4>[ 6207.637412]  [3:  Binder:9343_3:22875] [<ffffff80084ad880>] security_sid_to_context+0x10/0x18
<4>[ 6207.637421]  [3:  Binder:9343_3:22875] [<ffffff800849bd80>] selinux_secid_to_secctx+0x20/0x28
<4>[ 6207.637430]  [3:  Binder:9343_3:22875] [<ffffff800849109c>] security_secid_to_secctx+0x3c/0x70
<4>[ 6207.637442]  [3:  Binder:9343_3:22875] [<ffffff8008bfe118>] binder_transaction+0xe68/0x454c
<4>[ 6207.637569]  [3:  Binder:9343_3:22875] Mem-Info:
<4>[ 6207.637595]  [3:  Binder:9343_3:22875] active_anon:102061 inactive_anon:81551 isolated_anon:0
<4>[ 6207.637595]  [3:  Binder:9343_3:22875]  active_file:59102 inactive_file:68924 isolated_file:64
<4>[ 6207.637595]  [3:  Binder:9343_3:22875]  unevictable:611 dirty:63 writeback:0 unstable:0
<4>[ 6207.637595]  [3:  Binder:9343_3:22875]  slab_reclaimable:13324 slab_unreclaimable:44354
<4>[ 6207.637595]  [3:  Binder:9343_3:22875]  mapped:83015 shmem:4858 pagetables:26316 bounce:0
<4>[ 6207.637595]  [3:  Binder:9343_3:22875]  free:2727 free_pcp:1035 free_cma:178
<4>[ 6207.637616]  [3:  Binder:9343_3:22875] Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
<4>[ 6207.637627]  [3:  Binder:9343_3:22875] Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB
<4>[ 6207.637632]  [3:  Binder:9343_3:22875] lowmem_reserve[]: 0 0
<4>[ 6207.637637]  [3:  Binder:9343_3:22875] Normal: 505*4kB (H) 357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10236kB
<4>[ 6207.637655]  [3:  Binder:9343_3:22875] 138826 total pagecache pages
<4>[ 6207.637663]  [3:  Binder:9343_3:22875] 5460 pages in swap cache
<4>[ 6207.637668]  [3:  Binder:9343_3:22875] Swap cache stats: add 8273090, delete 8267506, find 1004381/4060142

This is an example of ALLOC_NO_WATERMARKS allocation failure.

<6>[  156.701551]  [4:        kswapd0: 1209] kswapd0 cpuset=/ mems_allowed=0
<4>[  156.701563]  [4:        kswapd0: 1209] CPU: 4 PID: 1209 Comm: kswapd0 Tainted: G        W       4.14.113-18113966 #1
<4>[  156.701572]  [4:        kswapd0: 1209] Call trace:
<4>[  156.701605]  [4:        kswapd0: 1209] [<0000000000000000>] dump_stack+0x68/0x90
<4>[  156.701612]  [4:        kswapd0: 1209] [<0000000000000000>] warn_alloc+0x104/0x198
<4>[  156.701617]  [4:        kswapd0: 1209] [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0
<4>[  156.701623]  [4:        kswapd0: 1209] [<0000000000000000>] zs_malloc+0x148/0x3d0
<4>[  156.701630]  [4:        kswapd0: 1209] [<0000000000000000>] zram_bvec_rw+0x250/0x568
<4>[  156.701634]  [4:        kswapd0: 1209] [<0000000000000000>] zram_rw_page+0x8c/0xe0
<4>[  156.701640]  [4:        kswapd0: 1209] [<0000000000000000>] bdev_write_page+0x70/0xbc
<4>[  156.701645]  [4:        kswapd0: 1209] [<0000000000000000>] __swap_writepage+0x58/0x37c
<4>[  156.701649]  [4:        kswapd0: 1209] [<0000000000000000>] swap_writepage+0x40/0x4c
<4>[  156.701654]  [4:        kswapd0: 1209] [<0000000000000000>] shrink_page_list+0xc3c/0xf54
<4>[  156.701659]  [4:        kswapd0: 1209] [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c
<4>[  156.701664]  [4:        kswapd0: 1209] [<0000000000000000>] shrink_node_memcg+0x23c/0x618
<4>[  156.701668]  [4:        kswapd0: 1209] [<0000000000000000>] shrink_node+0x1c8/0x304
<4>[  156.701673]  [4:        kswapd0: 1209] [<0000000000000000>] kswapd+0x680/0x7c4
<4>[  156.701679]  [4:        kswapd0: 1209] [<0000000000000000>] kthread+0x110/0x120
<4>[  156.701684]  [4:        kswapd0: 1209] [<0000000000000000>] ret_from_fork+0x10/0x18
<4>[  156.701689]  [4:        kswapd0: 1209] Mem-Info:
<4>[  156.701712]  [4:        kswapd0: 1209] active_anon:88690 inactive_anon:88630 isolated_anon:0
<4>[  156.701712]  [4:        kswapd0: 1209]  active_file:99173 inactive_file:169305 isolated_file:32
<4>[  156.701712]  [4:        kswapd0: 1209]  unevictable:48292 dirty:538 writeback:38 unstable:0
<4>[  156.701712]  [4:        kswapd0: 1209]  slab_reclaimable:15131 slab_unreclaimable:47762
<4>[  156.701712]  [4:        kswapd0: 1209]  mapped:274654 shmem:2824 pagetables:25088 bounce:0
<4>[  156.701712]  [4:        kswapd0: 1209]  free:2489 free_pcp:444 free_cma:3
<4>[  156.701728]  [4:        kswapd0: 1209] Node 0 active_anon:354760kB inactive_anon:354520kB active_file:396692kB inactive_file:677220kB unevictable:193168kB isolated(anon):0kB isolated(file):128kB mapped:1098616kB dirty:2152kB writeback:152kB shmem:11296kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
<4>[  156.701738]  [4:        kswapd0: 1209] Normal free:9956kB min:7428kB low:93440kB high:97032kB active_anon:355176kB inactive_anon:354580kB active_file:396196kB inactive_file:677284kB unevictable:193168kB writepending:2304kB present:4081664kB managed:3593324kB mlocked:193168kB kernel_stack:55008kB pagetables:100352kB bounce:0kB free_pcp:1776kB local_pcp:656kB free_cma:12kB
<4>[  156.701741]  [4:        kswapd0: 1209] lowmem_reserve[]: 0 0
<4>[  156.701747]  [4:        kswapd0: 1209] Normal: 196*4kB (H) 141*8kB (H) 109*16kB (H) 63*32kB (H) 20*64kB (H) 8*128kB (H) 2*256kB (H) 1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 9000kB

Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
---
v2: factor out common part
v1: consider highatomic reserve
---
 mm/page_alloc.c | 61 ++++++++++++++++++++++++++++---------------------
 1 file changed, 35 insertions(+), 26 deletions(-)

Comments

Vlastimil Babka June 12, 2020, 2:34 p.m. UTC | #1
On 6/13/20 4:51 AM, Jaewon Kim wrote:
> zone_watermark_fast was introduced by commit 48ee5f3696f6 ("mm,
> page_alloc: shortcut watermark checks for order-0 pages"). The commit
> simply checks if free pages is bigger than watermark without additional
> calculation such like reducing watermark.
> 
> It considered free cma pages but it did not consider highatomic
> reserved. This may incur exhaustion of free pages except high order
> atomic free pages.
> 
> Assume that reserved_highatomic pageblock is bigger than watermark min,
> and there are only few free pages except high order atomic free. Because
> zone_watermark_fast passes the allocation without considering high order
> atomic free, normal reclaimable allocation like GFP_HIGHUSER will
> consume all the free pages. Then finally order-0 atomic allocation may
> fail on allocation.

I don't understand why order-0 atomic allocation will fail. Is it because of
watermark check, or finding no suitable pages?
- watermark check should be OK as atomic allocations can use reserves
- suitable pages should be OK, even if all free pages are in the highatomic
reserves, because rmqueue() contains:

if (alloc_flags & ALLOC_HARDER)
	page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);

So what am I missing?

> This means watermark min is not protected against non-atomic allocation.
> The order-0 atomic allocation with ALLOC_HARDER unwantedly can be
> failed. Additionally the __GFP_MEMALLOC allocation with
> ALLOC_NO_WATERMARKS also can be failed.
> 
> To avoid the problem, zone_watermark_fast should consider highatomic
> reserve. If the actual size of high atomic free is counted accurately
> like cma free, we may use it. On this patch just use
> nr_reserved_highatomic. Additionally introduce
> __zone_watermark_unusable_free to factor out common parts between
> zone_watermark_fast and __zone_watermark_ok.
> 
> This is trace log which shows GFP_HIGHUSER consumes free pages right
> before ALLOC_NO_WATERMARKS.
> 
>   <...>-22275 [006] ....   889.213383: mm_page_alloc: page=00000000d2be5665 pfn=970744 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213385: mm_page_alloc: page=000000004b2335c2 pfn=970745 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213387: mm_page_alloc: page=00000000017272e1 pfn=970278 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213389: mm_page_alloc: page=00000000c4be79fb pfn=970279 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213391: mm_page_alloc: page=00000000f8a51d4f pfn=970260 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213393: mm_page_alloc: page=000000006ba8f5ac pfn=970261 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213395: mm_page_alloc: page=00000000819f1cd3 pfn=970196 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213396: mm_page_alloc: page=00000000f6b72a64 pfn=970197 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> kswapd0-1207  [005] ...1   889.213398: mm_page_alloc: page= (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE
> 
> This is an example of ALLOC_HARDER allocation failure.
> 
> <4>[ 6207.637280]  [3:  Binder:9343_3:22875] Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
> <4>[ 6207.637311]  [3:  Binder:9343_3:22875] Call trace:
> <4>[ 6207.637346]  [3:  Binder:9343_3:22875] [<ffffff8008f40f8c>] dump_stack+0xb8/0xf0
> <4>[ 6207.637356]  [3:  Binder:9343_3:22875] [<ffffff8008223320>] warn_alloc+0xd8/0x12c
> <4>[ 6207.637365]  [3:  Binder:9343_3:22875] [<ffffff80082245e4>] __alloc_pages_nodemask+0x120c/0x1250
> <4>[ 6207.637374]  [3:  Binder:9343_3:22875] [<ffffff800827f6e8>] new_slab+0x128/0x604
> <4>[ 6207.637381]  [3:  Binder:9343_3:22875] [<ffffff800827b0cc>] ___slab_alloc+0x508/0x670
> <4>[ 6207.637387]  [3:  Binder:9343_3:22875] [<ffffff800827ba00>] __kmalloc+0x2f8/0x310
> <4>[ 6207.637396]  [3:  Binder:9343_3:22875] [<ffffff80084ac3e0>] context_struct_to_string+0x104/0x1cc
> <4>[ 6207.637404]  [3:  Binder:9343_3:22875] [<ffffff80084ad8fc>] security_sid_to_context_core+0x74/0x144
> <4>[ 6207.637412]  [3:  Binder:9343_3:22875] [<ffffff80084ad880>] security_sid_to_context+0x10/0x18
> <4>[ 6207.637421]  [3:  Binder:9343_3:22875] [<ffffff800849bd80>] selinux_secid_to_secctx+0x20/0x28
> <4>[ 6207.637430]  [3:  Binder:9343_3:22875] [<ffffff800849109c>] security_secid_to_secctx+0x3c/0x70
> <4>[ 6207.637442]  [3:  Binder:9343_3:22875] [<ffffff8008bfe118>] binder_transaction+0xe68/0x454c
> <4>[ 6207.637569]  [3:  Binder:9343_3:22875] Mem-Info:
> <4>[ 6207.637595]  [3:  Binder:9343_3:22875] active_anon:102061 inactive_anon:81551 isolated_anon:0
> <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  active_file:59102 inactive_file:68924 isolated_file:64
> <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  unevictable:611 dirty:63 writeback:0 unstable:0
> <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  slab_reclaimable:13324 slab_unreclaimable:44354
> <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  mapped:83015 shmem:4858 pagetables:26316 bounce:0
> <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  free:2727 free_pcp:1035 free_cma:178
> <4>[ 6207.637616]  [3:  Binder:9343_3:22875] Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> <4>[ 6207.637627]  [3:  Binder:9343_3:22875] Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB
> <4>[ 6207.637632]  [3:  Binder:9343_3:22875] lowmem_reserve[]: 0 0
> <4>[ 6207.637637]  [3:  Binder:9343_3:22875] Normal: 505*4kB (H) 357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10236kB

OK this shows we are well above min watermark, and indeed only free pages are
highatomic. Why doesn't the rmqueue() part quoted above work as expected and
allow this allocation to use those highatomic blocks?

> <4>[ 6207.637655]  [3:  Binder:9343_3:22875] 138826 total pagecache pages
> <4>[ 6207.637663]  [3:  Binder:9343_3:22875] 5460 pages in swap cache
> <4>[ 6207.637668]  [3:  Binder:9343_3:22875] Swap cache stats: add 8273090, delete 8267506, find 1004381/4060142
> 
> This is an example of ALLOC_NO_WATERMARKS allocation failure.
> 
> <6>[  156.701551]  [4:        kswapd0: 1209] kswapd0 cpuset=/ mems_allowed=0
> <4>[  156.701563]  [4:        kswapd0: 1209] CPU: 4 PID: 1209 Comm: kswapd0 Tainted: G        W       4.14.113-18113966 #1
> <4>[  156.701572]  [4:        kswapd0: 1209] Call trace:
> <4>[  156.701605]  [4:        kswapd0: 1209] [<0000000000000000>] dump_stack+0x68/0x90
> <4>[  156.701612]  [4:        kswapd0: 1209] [<0000000000000000>] warn_alloc+0x104/0x198
> <4>[  156.701617]  [4:        kswapd0: 1209] [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0
> <4>[  156.701623]  [4:        kswapd0: 1209] [<0000000000000000>] zs_malloc+0x148/0x3d0
> <4>[  156.701630]  [4:        kswapd0: 1209] [<0000000000000000>] zram_bvec_rw+0x250/0x568
> <4>[  156.701634]  [4:        kswapd0: 1209] [<0000000000000000>] zram_rw_page+0x8c/0xe0
> <4>[  156.701640]  [4:        kswapd0: 1209] [<0000000000000000>] bdev_write_page+0x70/0xbc
> <4>[  156.701645]  [4:        kswapd0: 1209] [<0000000000000000>] __swap_writepage+0x58/0x37c
> <4>[  156.701649]  [4:        kswapd0: 1209] [<0000000000000000>] swap_writepage+0x40/0x4c
> <4>[  156.701654]  [4:        kswapd0: 1209] [<0000000000000000>] shrink_page_list+0xc3c/0xf54
> <4>[  156.701659]  [4:        kswapd0: 1209] [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c
> <4>[  156.701664]  [4:        kswapd0: 1209] [<0000000000000000>] shrink_node_memcg+0x23c/0x618
> <4>[  156.701668]  [4:        kswapd0: 1209] [<0000000000000000>] shrink_node+0x1c8/0x304
> <4>[  156.701673]  [4:        kswapd0: 1209] [<0000000000000000>] kswapd+0x680/0x7c4
> <4>[  156.701679]  [4:        kswapd0: 1209] [<0000000000000000>] kthread+0x110/0x120
> <4>[  156.701684]  [4:        kswapd0: 1209] [<0000000000000000>] ret_from_fork+0x10/0x18
> <4>[  156.701689]  [4:        kswapd0: 1209] Mem-Info:
> <4>[  156.701712]  [4:        kswapd0: 1209] active_anon:88690 inactive_anon:88630 isolated_anon:0
> <4>[  156.701712]  [4:        kswapd0: 1209]  active_file:99173 inactive_file:169305 isolated_file:32
> <4>[  156.701712]  [4:        kswapd0: 1209]  unevictable:48292 dirty:538 writeback:38 unstable:0
> <4>[  156.701712]  [4:        kswapd0: 1209]  slab_reclaimable:15131 slab_unreclaimable:47762
> <4>[  156.701712]  [4:        kswapd0: 1209]  mapped:274654 shmem:2824 pagetables:25088 bounce:0
> <4>[  156.701712]  [4:        kswapd0: 1209]  free:2489 free_pcp:444 free_cma:3
> <4>[  156.701728]  [4:        kswapd0: 1209] Node 0 active_anon:354760kB inactive_anon:354520kB active_file:396692kB inactive_file:677220kB unevictable:193168kB isolated(anon):0kB isolated(file):128kB mapped:1098616kB dirty:2152kB writeback:152kB shmem:11296kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> <4>[  156.701738]  [4:        kswapd0: 1209] Normal free:9956kB min:7428kB low:93440kB high:97032kB active_anon:355176kB inactive_anon:354580kB active_file:396196kB inactive_file:677284kB unevictable:193168kB writepending:2304kB present:4081664kB managed:3593324kB mlocked:193168kB kernel_stack:55008kB pagetables:100352kB bounce:0kB free_pcp:1776kB local_pcp:656kB free_cma:12kB
> <4>[  156.701741]  [4:        kswapd0: 1209] lowmem_reserve[]: 0 0
> <4>[  156.701747]  [4:        kswapd0: 1209] Normal: 196*4kB (H) 141*8kB (H) 109*16kB (H) 63*32kB (H) 20*64kB (H) 8*128kB (H) 2*256kB (H) 1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 9000kB

Same here, although here AFAICS ALLOC_NO_WATERMARKS doesn't imply ALLOC_HARDER,
so the rmqueue() check wouldn't kick in? That would be something to fix... but
doesn't explain the GFP_ATOMIC case above.

...

> @@ -3598,9 +3604,12 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
        /*
         * Fast check for order-0 only. If this fails then the reserves
         * need to be calculated. There is a corner case where the check
         * passes but only the high-order atomic reserve are free. If
>  	 * the caller is !atomic then it'll uselessly search the free
>  	 * list. That corner case is then slower but it is harmless.
>  	 */

The comment stops being true after this patch? It also suggests that Mel
anticipated this corner case, but that it should only cause a false positive
zone_watermark_fast() and then rmqueue() fails for !ALLOC_HARDER as it cannot
use MIGRATE_HIGHATOMIC blocks. It expects atomic order-0 still works. So what's
going on?

> -	if (!order && (free_pages - cma_pages) >
> -				mark + z->lowmem_reserve[highest_zoneidx])
> -		return true;
> +	if (!order) {
> +		long fast_free = free_pages - unusable_free;
> +
> +		if (fast_free > mark + z->lowmem_reserve[highest_zoneidx])
> +			return true;
> +	}
>  
>  	return __zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags,
>  					free_pages);
>
Jaewon Kim June 13, 2020, 4:16 a.m. UTC | #2
2020년 6월 12일 (금) 오후 11:34, Vlastimil Babka <vbabka@suse.cz>님이 작성:
>
> On 6/13/20 4:51 AM, Jaewon Kim wrote:
> > zone_watermark_fast was introduced by commit 48ee5f3696f6 ("mm,
> > page_alloc: shortcut watermark checks for order-0 pages"). The commit
> > simply checks if free pages is bigger than watermark without additional
> > calculation such like reducing watermark.
> >
> > It considered free cma pages but it did not consider highatomic
> > reserved. This may incur exhaustion of free pages except high order
> > atomic free pages.
> >
> > Assume that reserved_highatomic pageblock is bigger than watermark min,
> > and there are only few free pages except high order atomic free. Because
> > zone_watermark_fast passes the allocation without considering high order
> > atomic free, normal reclaimable allocation like GFP_HIGHUSER will
> > consume all the free pages. Then finally order-0 atomic allocation may
> > fail on allocation.
>
> I don't understand why order-0 atomic allocation will fail. Is it because
of
> watermark check, or finding no suitable pages?
> - watermark check should be OK as atomic allocations can use reserves
> - suitable pages should be OK, even if all free pages are in the
highatomic
> reserves, because rmqueue() contains:
>
> if (alloc_flags & ALLOC_HARDER)
>         page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
>
> So what am I missing?
>
Hello
The order-0 atomic allocation can be failed because of depletion of
suitable free page.
Watermark check passes order-0 atomic allocation but it will be failed at
finding a free page.
The  __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC) can be used
only for highorder.

> > This means watermark min is not protected against non-atomic allocation.
> > The order-0 atomic allocation with ALLOC_HARDER unwantedly can be
> > failed. Additionally the __GFP_MEMALLOC allocation with
> > ALLOC_NO_WATERMARKS also can be failed.
> >
> > To avoid the problem, zone_watermark_fast should consider highatomic
> > reserve. If the actual size of high atomic free is counted accurately
> > like cma free, we may use it. On this patch just use
> > nr_reserved_highatomic. Additionally introduce
> > __zone_watermark_unusable_free to factor out common parts between
> > zone_watermark_fast and __zone_watermark_ok.
> >
> > This is trace log which shows GFP_HIGHUSER consumes free pages right
> > before ALLOC_NO_WATERMARKS.
> >
> >   <...>-22275 [006] ....   889.213383: mm_page_alloc:
page=00000000d2be5665 pfn=970744 order=0 migratetype=0 nr_free=3650
gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213385: mm_page_alloc:
page=000000004b2335c2 pfn=970745 order=0 migratetype=0 nr_free=3650
gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213387: mm_page_alloc:
page=00000000017272e1 pfn=970278 order=0 migratetype=0 nr_free=3650
gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213389: mm_page_alloc:
page=00000000c4be79fb pfn=970279 order=0 migratetype=0 nr_free=3650
gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213391: mm_page_alloc:
page=00000000f8a51d4f pfn=970260 order=0 migratetype=0 nr_free=3650
gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213393: mm_page_alloc:
page=000000006ba8f5ac pfn=970261 order=0 migratetype=0 nr_free=3650
gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213395: mm_page_alloc:
page=00000000819f1cd3 pfn=970196 order=0 migratetype=0 nr_free=3650
gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213396: mm_page_alloc:
page=00000000f6b72a64 pfn=970197 order=0 migratetype=0 nr_free=3650
gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> > kswapd0-1207  [005] ...1   889.213398: mm_page_alloc: page= (null)
pfn=0 order=0 migratetype=1 nr_free=3650
gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE
> >
> > This is an example of ALLOC_HARDER allocation failure.
> >
> > <4>[ 6207.637280]  [3:  Binder:9343_3:22875] Binder:9343_3: page
allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
> > <4>[ 6207.637311]  [3:  Binder:9343_3:22875] Call trace:
> > <4>[ 6207.637346]  [3:  Binder:9343_3:22875] [<ffffff8008f40f8c>]
dump_stack+0xb8/0xf0
> > <4>[ 6207.637356]  [3:  Binder:9343_3:22875] [<ffffff8008223320>]
warn_alloc+0xd8/0x12c
> > <4>[ 6207.637365]  [3:  Binder:9343_3:22875] [<ffffff80082245e4>]
__alloc_pages_nodemask+0x120c/0x1250
> > <4>[ 6207.637374]  [3:  Binder:9343_3:22875] [<ffffff800827f6e8>]
new_slab+0x128/0x604
> > <4>[ 6207.637381]  [3:  Binder:9343_3:22875] [<ffffff800827b0cc>]
___slab_alloc+0x508/0x670
> > <4>[ 6207.637387]  [3:  Binder:9343_3:22875] [<ffffff800827ba00>]
__kmalloc+0x2f8/0x310
> > <4>[ 6207.637396]  [3:  Binder:9343_3:22875] [<ffffff80084ac3e0>]
context_struct_to_string+0x104/0x1cc
> > <4>[ 6207.637404]  [3:  Binder:9343_3:22875] [<ffffff80084ad8fc>]
security_sid_to_context_core+0x74/0x144
> > <4>[ 6207.637412]  [3:  Binder:9343_3:22875] [<ffffff80084ad880>]
security_sid_to_context+0x10/0x18
> > <4>[ 6207.637421]  [3:  Binder:9343_3:22875] [<ffffff800849bd80>]
selinux_secid_to_secctx+0x20/0x28
> > <4>[ 6207.637430]  [3:  Binder:9343_3:22875] [<ffffff800849109c>]
security_secid_to_secctx+0x3c/0x70
> > <4>[ 6207.637442]  [3:  Binder:9343_3:22875] [<ffffff8008bfe118>]
binder_transaction+0xe68/0x454c
> > <4>[ 6207.637569]  [3:  Binder:9343_3:22875] Mem-Info:
> > <4>[ 6207.637595]  [3:  Binder:9343_3:22875] active_anon:102061
inactive_anon:81551 isolated_anon:0
> > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  active_file:59102
inactive_file:68924 isolated_file:64
> > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  unevictable:611 dirty:63
writeback:0 unstable:0
> > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  slab_reclaimable:13324
slab_unreclaimable:44354
> > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  mapped:83015 shmem:4858
pagetables:26316 bounce:0
> > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  free:2727 free_pcp:1035
free_cma:178
> > <4>[ 6207.637616]  [3:  Binder:9343_3:22875] Node 0
active_anon:408244kB inactive_anon:326204kB active_file:236408kB
inactive_file:275696kB unevictable:2444kB isolated(anon):0kB
isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB
shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> > <4>[ 6207.637627]  [3:  Binder:9343_3:22875] Normal free:10908kB
min:6192kB low:44388kB high:47060kB active_anon:409160kB
inactive_anon:325924kB active_file:235820kB inactive_file:276628kB
unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB
mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB
free_pcp:4140kB local_pcp:40kB free_cma:712kB
> > <4>[ 6207.637632]  [3:  Binder:9343_3:22875] lowmem_reserve[]: 0 0
> > <4>[ 6207.637637]  [3:  Binder:9343_3:22875] Normal: 505*4kB (H)
357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H) 0*128kB 0*256kB 0*512kB
0*1024kB 0*2048kB 0*4096kB = 10236kB
>
> OK this shows we are well above min watermark, and indeed only free pages
are
> highatomic. Why doesn't the rmqueue() part quoted above work as expected
and
> allow this allocation to use those highatomic blocks?
>
Because highatomic free is reserved free only for high atomic, actually
ALLOC_HARDER.
Order-0 atomic allocation cannot use the highatomic free.

> > <4>[ 6207.637655]  [3:  Binder:9343_3:22875] 138826 total pagecache
pages
> > <4>[ 6207.637663]  [3:  Binder:9343_3:22875] 5460 pages in swap cache
> > <4>[ 6207.637668]  [3:  Binder:9343_3:22875] Swap cache stats: add
8273090, delete 8267506, find 1004381/4060142
> >
> > This is an example of ALLOC_NO_WATERMARKS allocation failure.
> >
> > <6>[  156.701551]  [4:        kswapd0: 1209] kswapd0 cpuset=/
mems_allowed=0
> > <4>[  156.701563]  [4:        kswapd0: 1209] CPU: 4 PID: 1209 Comm:
kswapd0 Tainted: G        W       4.14.113-18113966 #1
> > <4>[  156.701572]  [4:        kswapd0: 1209] Call trace:
> > <4>[  156.701605]  [4:        kswapd0: 1209] [<0000000000000000>]
dump_stack+0x68/0x90
> > <4>[  156.701612]  [4:        kswapd0: 1209] [<0000000000000000>]
warn_alloc+0x104/0x198
> > <4>[  156.701617]  [4:        kswapd0: 1209] [<0000000000000000>]
__alloc_pages_nodemask+0xdc0/0xdf0
> > <4>[  156.701623]  [4:        kswapd0: 1209] [<0000000000000000>]
zs_malloc+0x148/0x3d0
> > <4>[  156.701630]  [4:        kswapd0: 1209] [<0000000000000000>]
zram_bvec_rw+0x250/0x568
> > <4>[  156.701634]  [4:        kswapd0: 1209] [<0000000000000000>]
zram_rw_page+0x8c/0xe0
> > <4>[  156.701640]  [4:        kswapd0: 1209] [<0000000000000000>]
bdev_write_page+0x70/0xbc
> > <4>[  156.701645]  [4:        kswapd0: 1209] [<0000000000000000>]
__swap_writepage+0x58/0x37c
> > <4>[  156.701649]  [4:        kswapd0: 1209] [<0000000000000000>]
swap_writepage+0x40/0x4c
> > <4>[  156.701654]  [4:        kswapd0: 1209] [<0000000000000000>]
shrink_page_list+0xc3c/0xf54
> > <4>[  156.701659]  [4:        kswapd0: 1209] [<0000000000000000>]
shrink_inactive_list+0x2b0/0x61c
> > <4>[  156.701664]  [4:        kswapd0: 1209] [<0000000000000000>]
shrink_node_memcg+0x23c/0x618
> > <4>[  156.701668]  [4:        kswapd0: 1209] [<0000000000000000>]
shrink_node+0x1c8/0x304
> > <4>[  156.701673]  [4:        kswapd0: 1209] [<0000000000000000>]
kswapd+0x680/0x7c4
> > <4>[  156.701679]  [4:        kswapd0: 1209] [<0000000000000000>]
kthread+0x110/0x120
> > <4>[  156.701684]  [4:        kswapd0: 1209] [<0000000000000000>]
ret_from_fork+0x10/0x18
> > <4>[  156.701689]  [4:        kswapd0: 1209] Mem-Info:
> > <4>[  156.701712]  [4:        kswapd0: 1209] active_anon:88690
inactive_anon:88630 isolated_anon:0
> > <4>[  156.701712]  [4:        kswapd0: 1209]  active_file:99173
inactive_file:169305 isolated_file:32
> > <4>[  156.701712]  [4:        kswapd0: 1209]  unevictable:48292
dirty:538 writeback:38 unstable:0
> > <4>[  156.701712]  [4:        kswapd0: 1209]  slab_reclaimable:15131
slab_unreclaimable:47762
> > <4>[  156.701712]  [4:        kswapd0: 1209]  mapped:274654 shmem:2824
pagetables:25088 bounce:0
> > <4>[  156.701712]  [4:        kswapd0: 1209]  free:2489 free_pcp:444
free_cma:3
> > <4>[  156.701728]  [4:        kswapd0: 1209] Node 0
active_anon:354760kB inactive_anon:354520kB active_file:396692kB
inactive_file:677220kB unevictable:193168kB isolated(anon):0kB
isolated(file):128kB mapped:1098616kB dirty:2152kB writeback:152kB
shmem:11296kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> > <4>[  156.701738]  [4:        kswapd0: 1209] Normal free:9956kB
min:7428kB low:93440kB high:97032kB active_anon:355176kB
inactive_anon:354580kB active_file:396196kB inactive_file:677284kB
unevictable:193168kB writepending:2304kB present:4081664kB
managed:3593324kB mlocked:193168kB kernel_stack:55008kB pagetables:100352kB
bounce:0kB free_pcp:1776kB local_pcp:656kB free_cma:12kB
> > <4>[  156.701741]  [4:        kswapd0: 1209] lowmem_reserve[]: 0 0
> > <4>[  156.701747]  [4:        kswapd0: 1209] Normal: 196*4kB (H)
141*8kB (H) 109*16kB (H) 63*32kB (H) 20*64kB (H) 8*128kB (H) 2*256kB (H)
1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 9000kB
>
> Same here, although here AFAICS ALLOC_NO_WATERMARKS doesn't imply
ALLOC_HARDER,
> so the rmqueue() check wouldn't kick in? That would be something to
fix... but
> doesn't explain the GFP_ATOMIC case above.

I might miss some log above, AFAIK it was order-0 allocation.
For an order-0 allocation with ALLOC_NO_WATERMARKS, rmqueue will search free
a page but it cannot find one because other direct reclaimable allocation
already took
the suitable free without direct reclaimatiion.
>
> ...
>
> > @@ -3598,9 +3604,12 @@ static inline bool zone_watermark_fast(struct
zone *z, unsigned int order,
>         /*
>          * Fast check for order-0 only. If this fails then the reserves
>          * need to be calculated. There is a corner case where the check
>          * passes but only the high-order atomic reserve are free. If
> >        * the caller is !atomic then it'll uselessly search the free
> >        * list. That corner case is then slower but it is harmless.
> >        */
>
> The comment stops being true after this patch? It also suggests that Mel
> anticipated this corner case, but that it should only cause a false
positive
> zone_watermark_fast() and then rmqueue() fails for !ALLOC_HARDER as it
cannot
> use MIGRATE_HIGHATOMIC blocks. It expects atomic order-0 still works. So
what's
> going on?

As Mel also agreed with me in v1 mail thread, this highatomic reserved
should
be considered even in watermark fast.

The comment, I think, may need to be changed. Prior to this patch, non
highatomic
allocation may do useless search, but it also can take ALL non
highatomic free.

With this patch, non highatomic allocation will NOT do useless search.
Rather,
it may be required direct reclamation even when there are some non high
atomic free.

i.e)
In following situation, watermark check fails (9MB - 8MB < 4MB) though
there are
enough free (9MB - 4MB > 4MB). If this is really matter, we need to count
highatomic
free accurately.

min : 4MB,
highatomic reserved : 8MB
Total free : 9MB
  actual highatomic free : 4MB
  non highatomic free : 5MB
>
> > -     if (!order && (free_pages - cma_pages) >
> > -                             mark + z->lowmem_reserve[highest_zoneidx])
> > -             return true;
> > +     if (!order) {
> > +             long fast_free = free_pages - unusable_free;
> > +
> > +             if (fast_free > mark + z->lowmem_reserve[highest_zoneidx])
> > +                     return true;
> > +     }
> >
> >       return __zone_watermark_ok(z, order, mark, highest_zoneidx,
alloc_flags,
> >                                       free_pages);
> >
>
Baoquan He June 13, 2020, 9:42 a.m. UTC | #3
On 06/13/20 at 11:51am, Jaewon Kim wrote:
> zone_watermark_fast was introduced by commit 48ee5f3696f6 ("mm,
> page_alloc: shortcut watermark checks for order-0 pages"). The commit
> simply checks if free pages is bigger than watermark without additional
> calculation such like reducing watermark.
> 
> It considered free cma pages but it did not consider highatomic
> reserved. This may incur exhaustion of free pages except high order
> atomic free pages.
> 
> Assume that reserved_highatomic pageblock is bigger than watermark min,
> and there are only few free pages except high order atomic free. Because
> zone_watermark_fast passes the allocation without considering high order
> atomic free, normal reclaimable allocation like GFP_HIGHUSER will
> consume all the free pages. Then finally order-0 atomic allocation may
> fail on allocation.
> 
> This means watermark min is not protected against non-atomic allocation.
> The order-0 atomic allocation with ALLOC_HARDER unwantedly can be
> failed. Additionally the __GFP_MEMALLOC allocation with
> ALLOC_NO_WATERMARKS also can be failed.
> 
> To avoid the problem, zone_watermark_fast should consider highatomic
> reserve. If the actual size of high atomic free is counted accurately
> like cma free, we may use it. On this patch just use
> nr_reserved_highatomic. Additionally introduce
> __zone_watermark_unusable_free to factor out common parts between
> zone_watermark_fast and __zone_watermark_ok.
> 
> This is trace log which shows GFP_HIGHUSER consumes free pages right
> before ALLOC_NO_WATERMARKS.
> 
>   <...>-22275 [006] ....   889.213383: mm_page_alloc: page=00000000d2be5665 pfn=970744 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213385: mm_page_alloc: page=000000004b2335c2 pfn=970745 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213387: mm_page_alloc: page=00000000017272e1 pfn=970278 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213389: mm_page_alloc: page=00000000c4be79fb pfn=970279 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213391: mm_page_alloc: page=00000000f8a51d4f pfn=970260 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213393: mm_page_alloc: page=000000006ba8f5ac pfn=970261 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213395: mm_page_alloc: page=00000000819f1cd3 pfn=970196 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
>   <...>-22275 [006] ....   889.213396: mm_page_alloc: page=00000000f6b72a64 pfn=970197 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> kswapd0-1207  [005] ...1   889.213398: mm_page_alloc: page= (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE
> 
> This is an example of ALLOC_HARDER allocation failure.
> 
> <4>[ 6207.637280]  [3:  Binder:9343_3:22875] Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
> <4>[ 6207.637311]  [3:  Binder:9343_3:22875] Call trace:
> <4>[ 6207.637346]  [3:  Binder:9343_3:22875] [<ffffff8008f40f8c>] dump_stack+0xb8/0xf0
> <4>[ 6207.637356]  [3:  Binder:9343_3:22875] [<ffffff8008223320>] warn_alloc+0xd8/0x12c
> <4>[ 6207.637365]  [3:  Binder:9343_3:22875] [<ffffff80082245e4>] __alloc_pages_nodemask+0x120c/0x1250
> <4>[ 6207.637374]  [3:  Binder:9343_3:22875] [<ffffff800827f6e8>] new_slab+0x128/0x604
> <4>[ 6207.637381]  [3:  Binder:9343_3:22875] [<ffffff800827b0cc>] ___slab_alloc+0x508/0x670
> <4>[ 6207.637387]  [3:  Binder:9343_3:22875] [<ffffff800827ba00>] __kmalloc+0x2f8/0x310
> <4>[ 6207.637396]  [3:  Binder:9343_3:22875] [<ffffff80084ac3e0>] context_struct_to_string+0x104/0x1cc
> <4>[ 6207.637404]  [3:  Binder:9343_3:22875] [<ffffff80084ad8fc>] security_sid_to_context_core+0x74/0x144
> <4>[ 6207.637412]  [3:  Binder:9343_3:22875] [<ffffff80084ad880>] security_sid_to_context+0x10/0x18
> <4>[ 6207.637421]  [3:  Binder:9343_3:22875] [<ffffff800849bd80>] selinux_secid_to_secctx+0x20/0x28
> <4>[ 6207.637430]  [3:  Binder:9343_3:22875] [<ffffff800849109c>] security_secid_to_secctx+0x3c/0x70
> <4>[ 6207.637442]  [3:  Binder:9343_3:22875] [<ffffff8008bfe118>] binder_transaction+0xe68/0x454c
> <4>[ 6207.637569]  [3:  Binder:9343_3:22875] Mem-Info:
> <4>[ 6207.637595]  [3:  Binder:9343_3:22875] active_anon:102061 inactive_anon:81551 isolated_anon:0
> <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  active_file:59102 inactive_file:68924 isolated_file:64
> <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  unevictable:611 dirty:63 writeback:0 unstable:0
> <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  slab_reclaimable:13324 slab_unreclaimable:44354
> <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  mapped:83015 shmem:4858 pagetables:26316 bounce:0
> <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  free:2727 free_pcp:1035 free_cma:178
> <4>[ 6207.637616]  [3:  Binder:9343_3:22875] Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> <4>[ 6207.637627]  [3:  Binder:9343_3:22875] Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB
> <4>[ 6207.637632]  [3:  Binder:9343_3:22875] lowmem_reserve[]: 0 0
> <4>[ 6207.637637]  [3:  Binder:9343_3:22875] Normal: 505*4kB (H) 357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10236kB
> <4>[ 6207.637655]  [3:  Binder:9343_3:22875] 138826 total pagecache pages
> <4>[ 6207.637663]  [3:  Binder:9343_3:22875] 5460 pages in swap cache
> <4>[ 6207.637668]  [3:  Binder:9343_3:22875] Swap cache stats: add 8273090, delete 8267506, find 1004381/4060142
> 
> This is an example of ALLOC_NO_WATERMARKS allocation failure.
> 
> <6>[  156.701551]  [4:        kswapd0: 1209] kswapd0 cpuset=/ mems_allowed=0
> <4>[  156.701563]  [4:        kswapd0: 1209] CPU: 4 PID: 1209 Comm: kswapd0 Tainted: G        W       4.14.113-18113966 #1
> <4>[  156.701572]  [4:        kswapd0: 1209] Call trace:
> <4>[  156.701605]  [4:        kswapd0: 1209] [<0000000000000000>] dump_stack+0x68/0x90
> <4>[  156.701612]  [4:        kswapd0: 1209] [<0000000000000000>] warn_alloc+0x104/0x198
> <4>[  156.701617]  [4:        kswapd0: 1209] [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0
> <4>[  156.701623]  [4:        kswapd0: 1209] [<0000000000000000>] zs_malloc+0x148/0x3d0
> <4>[  156.701630]  [4:        kswapd0: 1209] [<0000000000000000>] zram_bvec_rw+0x250/0x568
> <4>[  156.701634]  [4:        kswapd0: 1209] [<0000000000000000>] zram_rw_page+0x8c/0xe0
> <4>[  156.701640]  [4:        kswapd0: 1209] [<0000000000000000>] bdev_write_page+0x70/0xbc
> <4>[  156.701645]  [4:        kswapd0: 1209] [<0000000000000000>] __swap_writepage+0x58/0x37c
> <4>[  156.701649]  [4:        kswapd0: 1209] [<0000000000000000>] swap_writepage+0x40/0x4c
> <4>[  156.701654]  [4:        kswapd0: 1209] [<0000000000000000>] shrink_page_list+0xc3c/0xf54
> <4>[  156.701659]  [4:        kswapd0: 1209] [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c
> <4>[  156.701664]  [4:        kswapd0: 1209] [<0000000000000000>] shrink_node_memcg+0x23c/0x618
> <4>[  156.701668]  [4:        kswapd0: 1209] [<0000000000000000>] shrink_node+0x1c8/0x304
> <4>[  156.701673]  [4:        kswapd0: 1209] [<0000000000000000>] kswapd+0x680/0x7c4
> <4>[  156.701679]  [4:        kswapd0: 1209] [<0000000000000000>] kthread+0x110/0x120
> <4>[  156.701684]  [4:        kswapd0: 1209] [<0000000000000000>] ret_from_fork+0x10/0x18
> <4>[  156.701689]  [4:        kswapd0: 1209] Mem-Info:
> <4>[  156.701712]  [4:        kswapd0: 1209] active_anon:88690 inactive_anon:88630 isolated_anon:0
> <4>[  156.701712]  [4:        kswapd0: 1209]  active_file:99173 inactive_file:169305 isolated_file:32
> <4>[  156.701712]  [4:        kswapd0: 1209]  unevictable:48292 dirty:538 writeback:38 unstable:0
> <4>[  156.701712]  [4:        kswapd0: 1209]  slab_reclaimable:15131 slab_unreclaimable:47762
> <4>[  156.701712]  [4:        kswapd0: 1209]  mapped:274654 shmem:2824 pagetables:25088 bounce:0
> <4>[  156.701712]  [4:        kswapd0: 1209]  free:2489 free_pcp:444 free_cma:3
> <4>[  156.701728]  [4:        kswapd0: 1209] Node 0 active_anon:354760kB inactive_anon:354520kB active_file:396692kB inactive_file:677220kB unevictable:193168kB isolated(anon):0kB isolated(file):128kB mapped:1098616kB dirty:2152kB writeback:152kB shmem:11296kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> <4>[  156.701738]  [4:        kswapd0: 1209] Normal free:9956kB min:7428kB low:93440kB high:97032kB active_anon:355176kB inactive_anon:354580kB active_file:396196kB inactive_file:677284kB unevictable:193168kB writepending:2304kB present:4081664kB managed:3593324kB mlocked:193168kB kernel_stack:55008kB pagetables:100352kB bounce:0kB free_pcp:1776kB local_pcp:656kB free_cma:12kB
> <4>[  156.701741]  [4:        kswapd0: 1209] lowmem_reserve[]: 0 0
> <4>[  156.701747]  [4:        kswapd0: 1209] Normal: 196*4kB (H) 141*8kB (H) 109*16kB (H) 63*32kB (H) 20*64kB (H) 8*128kB (H) 2*256kB (H) 1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 9000kB
> 
> Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
> Suggested-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
> ---
> v2: factor out common part
> v1: consider highatomic reserve
> ---
>  mm/page_alloc.c | 61 ++++++++++++++++++++++++++++---------------------
>  1 file changed, 35 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48eb0f1410d4..c2177e056f19 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3487,6 +3487,29 @@ static noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
>  }
>  ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);
>  
> +static inline long __zone_watermark_unusable_free(struct zone *z,
> +				unsigned int order, unsigned int alloc_flags)
> +{
> +	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
> +	long unusable_free = (1 << order) - 1;
> +
> +	/*
> +	 * If the caller does not have rights to ALLOC_HARDER then subtract
> +	 * the high-atomic reserves. This will over-estimate the size of the
> +	 * atomic reserve but it avoids a search.
> +	 */
> +	if (likely(!alloc_harder))
> +		unusable_free += z->nr_reserved_highatomic;
> +
> +#ifdef CONFIG_CMA
> +	/* If allocation can't use CMA areas don't use free CMA pages */
> +	if (!(alloc_flags & ALLOC_CMA))
> +		unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
> +#endif
> +
> +	return unusable_free;
> +}
> +
>  /*
>   * Return true if free base pages are above 'mark'. For high-order checks it
>   * will return true of the order-0 watermark is reached and there is at least
> @@ -3502,19 +3525,12 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>  	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
>  
>  	/* free_pages may go negative - that's OK */
> -	free_pages -= (1 << order) - 1;
> +	free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
>  
>  	if (alloc_flags & ALLOC_HIGH)
>  		min -= min / 2;
>  
> -	/*
> -	 * If the caller does not have rights to ALLOC_HARDER then subtract
> -	 * the high-atomic reserves. This will over-estimate the size of the
> -	 * atomic reserve but it avoids a search.
> -	 */
> -	if (likely(!alloc_harder)) {
> -		free_pages -= z->nr_reserved_highatomic;
> -	} else {
> +	if (unlikely(alloc_harder)) {
>  		/*
>  		 * OOM victims can try even harder than normal ALLOC_HARDER
>  		 * users on the grounds that it's definitely going to be in
> @@ -3527,13 +3543,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>  			min -= min / 4;
>  	}
>  
> -
> -#ifdef CONFIG_CMA
> -	/* If allocation can't use CMA areas don't use free CMA pages */
> -	if (!(alloc_flags & ALLOC_CMA))
> -		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
> -#endif
> -
>  	/*
>  	 * Check watermarks for an order-0 allocation request. If these
>  	 * are not met, then a high-order request also cannot go ahead
> @@ -3582,14 +3591,11 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
>  				unsigned long mark, int highest_zoneidx,
>  				unsigned int alloc_flags)
>  {
> -	long free_pages = zone_page_state(z, NR_FREE_PAGES);
> -	long cma_pages = 0;
> +	long free_pages;
> +	long unusable_free;
>  
> -#ifdef CONFIG_CMA
> -	/* If allocation can't use CMA areas don't use free CMA pages */
> -	if (!(alloc_flags & ALLOC_CMA))
> -		cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
> -#endif
> +	free_pages = zone_page_state(z, NR_FREE_PAGES);
> +	unusable_free =  __zone_watermark_unusable_free(z, order, alloc_flags);
>  
>  	/*
>  	 * Fast check for order-0 only. If this fails then the reserves
> @@ -3598,9 +3604,12 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
>  	 * the caller is !atomic then it'll uselessly search the free
>  	 * list. That corner case is then slower but it is harmless.

Do we need remove or adjust the code comment at this place? So Mel have
foreseen the corner case, just reclaiming to unreserve the highatomic 
might be ignored.

>  	 */
> -	if (!order && (free_pages - cma_pages) >
> -				mark + z->lowmem_reserve[highest_zoneidx])
> -		return true;
> +	if (!order) {
> +		long fast_free = free_pages - unusable_free;
> +
> +		if (fast_free > mark + z->lowmem_reserve[highest_zoneidx])
> +			return true;
> +	}
>  
>  	return __zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags,
>  					free_pages);
> -- 
> 2.17.1
> 
>
Jaewon Kim June 13, 2020, 1:08 p.m. UTC | #4
2020년 6월 13일 (토) 오후 6:42, Baoquan He <bhe@redhat.com>님이 작성:
>
> On 06/13/20 at 11:51am, Jaewon Kim wrote:
> > zone_watermark_fast was introduced by commit 48ee5f3696f6 ("mm,
> > page_alloc: shortcut watermark checks for order-0 pages"). The commit
> > simply checks if free pages is bigger than watermark without additional
> > calculation such like reducing watermark.
> >
> > It considered free cma pages but it did not consider highatomic
> > reserved. This may incur exhaustion of free pages except high order
> > atomic free pages.
> >
> > Assume that reserved_highatomic pageblock is bigger than watermark min,
> > and there are only few free pages except high order atomic free. Because
> > zone_watermark_fast passes the allocation without considering high order
> > atomic free, normal reclaimable allocation like GFP_HIGHUSER will
> > consume all the free pages. Then finally order-0 atomic allocation may
> > fail on allocation.
> >
> > This means watermark min is not protected against non-atomic allocation.
> > The order-0 atomic allocation with ALLOC_HARDER unwantedly can be
> > failed. Additionally the __GFP_MEMALLOC allocation with
> > ALLOC_NO_WATERMARKS also can be failed.
> >
> > To avoid the problem, zone_watermark_fast should consider highatomic
> > reserve. If the actual size of high atomic free is counted accurately
> > like cma free, we may use it. On this patch just use
> > nr_reserved_highatomic. Additionally introduce
> > __zone_watermark_unusable_free to factor out common parts between
> > zone_watermark_fast and __zone_watermark_ok.
> >
> > This is trace log which shows GFP_HIGHUSER consumes free pages right
> > before ALLOC_NO_WATERMARKS.
> >
> >   <...>-22275 [006] ....   889.213383: mm_page_alloc: page=00000000d2be5665 pfn=970744 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213385: mm_page_alloc: page=000000004b2335c2 pfn=970745 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213387: mm_page_alloc: page=00000000017272e1 pfn=970278 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213389: mm_page_alloc: page=00000000c4be79fb pfn=970279 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213391: mm_page_alloc: page=00000000f8a51d4f pfn=970260 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213393: mm_page_alloc: page=000000006ba8f5ac pfn=970261 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213395: mm_page_alloc: page=00000000819f1cd3 pfn=970196 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> >   <...>-22275 [006] ....   889.213396: mm_page_alloc: page=00000000f6b72a64 pfn=970197 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
> > kswapd0-1207  [005] ...1   889.213398: mm_page_alloc: page= (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE
> >
> > This is an example of ALLOC_HARDER allocation failure.
> >
> > <4>[ 6207.637280]  [3:  Binder:9343_3:22875] Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
> > <4>[ 6207.637311]  [3:  Binder:9343_3:22875] Call trace:
> > <4>[ 6207.637346]  [3:  Binder:9343_3:22875] [<ffffff8008f40f8c>] dump_stack+0xb8/0xf0
> > <4>[ 6207.637356]  [3:  Binder:9343_3:22875] [<ffffff8008223320>] warn_alloc+0xd8/0x12c
> > <4>[ 6207.637365]  [3:  Binder:9343_3:22875] [<ffffff80082245e4>] __alloc_pages_nodemask+0x120c/0x1250
> > <4>[ 6207.637374]  [3:  Binder:9343_3:22875] [<ffffff800827f6e8>] new_slab+0x128/0x604
> > <4>[ 6207.637381]  [3:  Binder:9343_3:22875] [<ffffff800827b0cc>] ___slab_alloc+0x508/0x670
> > <4>[ 6207.637387]  [3:  Binder:9343_3:22875] [<ffffff800827ba00>] __kmalloc+0x2f8/0x310
> > <4>[ 6207.637396]  [3:  Binder:9343_3:22875] [<ffffff80084ac3e0>] context_struct_to_string+0x104/0x1cc
> > <4>[ 6207.637404]  [3:  Binder:9343_3:22875] [<ffffff80084ad8fc>] security_sid_to_context_core+0x74/0x144
> > <4>[ 6207.637412]  [3:  Binder:9343_3:22875] [<ffffff80084ad880>] security_sid_to_context+0x10/0x18
> > <4>[ 6207.637421]  [3:  Binder:9343_3:22875] [<ffffff800849bd80>] selinux_secid_to_secctx+0x20/0x28
> > <4>[ 6207.637430]  [3:  Binder:9343_3:22875] [<ffffff800849109c>] security_secid_to_secctx+0x3c/0x70
> > <4>[ 6207.637442]  [3:  Binder:9343_3:22875] [<ffffff8008bfe118>] binder_transaction+0xe68/0x454c
> > <4>[ 6207.637569]  [3:  Binder:9343_3:22875] Mem-Info:
> > <4>[ 6207.637595]  [3:  Binder:9343_3:22875] active_anon:102061 inactive_anon:81551 isolated_anon:0
> > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  active_file:59102 inactive_file:68924 isolated_file:64
> > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  unevictable:611 dirty:63 writeback:0 unstable:0
> > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  slab_reclaimable:13324 slab_unreclaimable:44354
> > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  mapped:83015 shmem:4858 pagetables:26316 bounce:0
> > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  free:2727 free_pcp:1035 free_cma:178
> > <4>[ 6207.637616]  [3:  Binder:9343_3:22875] Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> > <4>[ 6207.637627]  [3:  Binder:9343_3:22875] Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB
> > <4>[ 6207.637632]  [3:  Binder:9343_3:22875] lowmem_reserve[]: 0 0
> > <4>[ 6207.637637]  [3:  Binder:9343_3:22875] Normal: 505*4kB (H) 357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10236kB
> > <4>[ 6207.637655]  [3:  Binder:9343_3:22875] 138826 total pagecache pages
> > <4>[ 6207.637663]  [3:  Binder:9343_3:22875] 5460 pages in swap cache
> > <4>[ 6207.637668]  [3:  Binder:9343_3:22875] Swap cache stats: add 8273090, delete 8267506, find 1004381/4060142
> >
> > This is an example of ALLOC_NO_WATERMARKS allocation failure.
> >
> > <6>[  156.701551]  [4:        kswapd0: 1209] kswapd0 cpuset=/ mems_allowed=0
> > <4>[  156.701563]  [4:        kswapd0: 1209] CPU: 4 PID: 1209 Comm: kswapd0 Tainted: G        W       4.14.113-18113966 #1
> > <4>[  156.701572]  [4:        kswapd0: 1209] Call trace:
> > <4>[  156.701605]  [4:        kswapd0: 1209] [<0000000000000000>] dump_stack+0x68/0x90
> > <4>[  156.701612]  [4:        kswapd0: 1209] [<0000000000000000>] warn_alloc+0x104/0x198
> > <4>[  156.701617]  [4:        kswapd0: 1209] [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0
> > <4>[  156.701623]  [4:        kswapd0: 1209] [<0000000000000000>] zs_malloc+0x148/0x3d0
> > <4>[  156.701630]  [4:        kswapd0: 1209] [<0000000000000000>] zram_bvec_rw+0x250/0x568
> > <4>[  156.701634]  [4:        kswapd0: 1209] [<0000000000000000>] zram_rw_page+0x8c/0xe0
> > <4>[  156.701640]  [4:        kswapd0: 1209] [<0000000000000000>] bdev_write_page+0x70/0xbc
> > <4>[  156.701645]  [4:        kswapd0: 1209] [<0000000000000000>] __swap_writepage+0x58/0x37c
> > <4>[  156.701649]  [4:        kswapd0: 1209] [<0000000000000000>] swap_writepage+0x40/0x4c
> > <4>[  156.701654]  [4:        kswapd0: 1209] [<0000000000000000>] shrink_page_list+0xc3c/0xf54
> > <4>[  156.701659]  [4:        kswapd0: 1209] [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c
> > <4>[  156.701664]  [4:        kswapd0: 1209] [<0000000000000000>] shrink_node_memcg+0x23c/0x618
> > <4>[  156.701668]  [4:        kswapd0: 1209] [<0000000000000000>] shrink_node+0x1c8/0x304
> > <4>[  156.701673]  [4:        kswapd0: 1209] [<0000000000000000>] kswapd+0x680/0x7c4
> > <4>[  156.701679]  [4:        kswapd0: 1209] [<0000000000000000>] kthread+0x110/0x120
> > <4>[  156.701684]  [4:        kswapd0: 1209] [<0000000000000000>] ret_from_fork+0x10/0x18
> > <4>[  156.701689]  [4:        kswapd0: 1209] Mem-Info:
> > <4>[  156.701712]  [4:        kswapd0: 1209] active_anon:88690 inactive_anon:88630 isolated_anon:0
> > <4>[  156.701712]  [4:        kswapd0: 1209]  active_file:99173 inactive_file:169305 isolated_file:32
> > <4>[  156.701712]  [4:        kswapd0: 1209]  unevictable:48292 dirty:538 writeback:38 unstable:0
> > <4>[  156.701712]  [4:        kswapd0: 1209]  slab_reclaimable:15131 slab_unreclaimable:47762
> > <4>[  156.701712]  [4:        kswapd0: 1209]  mapped:274654 shmem:2824 pagetables:25088 bounce:0
> > <4>[  156.701712]  [4:        kswapd0: 1209]  free:2489 free_pcp:444 free_cma:3
> > <4>[  156.701728]  [4:        kswapd0: 1209] Node 0 active_anon:354760kB inactive_anon:354520kB active_file:396692kB inactive_file:677220kB unevictable:193168kB isolated(anon):0kB isolated(file):128kB mapped:1098616kB dirty:2152kB writeback:152kB shmem:11296kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> > <4>[  156.701738]  [4:        kswapd0: 1209] Normal free:9956kB min:7428kB low:93440kB high:97032kB active_anon:355176kB inactive_anon:354580kB active_file:396196kB inactive_file:677284kB unevictable:193168kB writepending:2304kB present:4081664kB managed:3593324kB mlocked:193168kB kernel_stack:55008kB pagetables:100352kB bounce:0kB free_pcp:1776kB local_pcp:656kB free_cma:12kB
> > <4>[  156.701741]  [4:        kswapd0: 1209] lowmem_reserve[]: 0 0
> > <4>[  156.701747]  [4:        kswapd0: 1209] Normal: 196*4kB (H) 141*8kB (H) 109*16kB (H) 63*32kB (H) 20*64kB (H) 8*128kB (H) 2*256kB (H) 1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 9000kB
> >
> > Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
> > Suggested-by: Minchan Kim <minchan@kernel.org>
> > Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
> > ---
> > v2: factor out common part
> > v1: consider highatomic reserve
> > ---
> >  mm/page_alloc.c | 61 ++++++++++++++++++++++++++++---------------------
> >  1 file changed, 35 insertions(+), 26 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 48eb0f1410d4..c2177e056f19 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3487,6 +3487,29 @@ static noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
> >  }
> >  ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);
> >
> > +static inline long __zone_watermark_unusable_free(struct zone *z,
> > +                             unsigned int order, unsigned int alloc_flags)
> > +{
> > +     const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
> > +     long unusable_free = (1 << order) - 1;
> > +
> > +     /*
> > +      * If the caller does not have rights to ALLOC_HARDER then subtract
> > +      * the high-atomic reserves. This will over-estimate the size of the
> > +      * atomic reserve but it avoids a search.
> > +      */
> > +     if (likely(!alloc_harder))
> > +             unusable_free += z->nr_reserved_highatomic;
> > +
> > +#ifdef CONFIG_CMA
> > +     /* If allocation can't use CMA areas don't use free CMA pages */
> > +     if (!(alloc_flags & ALLOC_CMA))
> > +             unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
> > +#endif
> > +
> > +     return unusable_free;
> > +}
> > +
> >  /*
> >   * Return true if free base pages are above 'mark'. For high-order checks it
> >   * will return true of the order-0 watermark is reached and there is at least
> > @@ -3502,19 +3525,12 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
> >       const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
> >
> >       /* free_pages may go negative - that's OK */
> > -     free_pages -= (1 << order) - 1;
> > +     free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
> >
> >       if (alloc_flags & ALLOC_HIGH)
> >               min -= min / 2;
> >
> > -     /*
> > -      * If the caller does not have rights to ALLOC_HARDER then subtract
> > -      * the high-atomic reserves. This will over-estimate the size of the
> > -      * atomic reserve but it avoids a search.
> > -      */
> > -     if (likely(!alloc_harder)) {
> > -             free_pages -= z->nr_reserved_highatomic;
> > -     } else {
> > +     if (unlikely(alloc_harder)) {
> >               /*
> >                * OOM victims can try even harder than normal ALLOC_HARDER
> >                * users on the grounds that it's definitely going to be in
> > @@ -3527,13 +3543,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
> >                       min -= min / 4;
> >       }
> >
> > -
> > -#ifdef CONFIG_CMA
> > -     /* If allocation can't use CMA areas don't use free CMA pages */
> > -     if (!(alloc_flags & ALLOC_CMA))
> > -             free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
> > -#endif
> > -
> >       /*
> >        * Check watermarks for an order-0 allocation request. If these
> >        * are not met, then a high-order request also cannot go ahead
> > @@ -3582,14 +3591,11 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
> >                               unsigned long mark, int highest_zoneidx,
> >                               unsigned int alloc_flags)
> >  {
> > -     long free_pages = zone_page_state(z, NR_FREE_PAGES);
> > -     long cma_pages = 0;
> > +     long free_pages;
> > +     long unusable_free;
> >
> > -#ifdef CONFIG_CMA
> > -     /* If allocation can't use CMA areas don't use free CMA pages */
> > -     if (!(alloc_flags & ALLOC_CMA))
> > -             cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
> > -#endif
> > +     free_pages = zone_page_state(z, NR_FREE_PAGES);
> > +     unusable_free =  __zone_watermark_unusable_free(z, order, alloc_flags);
> >
> >       /*
> >        * Fast check for order-0 only. If this fails then the reserves
> > @@ -3598,9 +3604,12 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
> >        * the caller is !atomic then it'll uselessly search the free
> >        * list. That corner case is then slower but it is harmless.
>
> Do we need remove or adjust the code comment at this place? So Mel have
> foreseen the corner case, just reclaiming to unreserve the highatomic
> might be ignored.
>

Hello thank you for your comment.

My previous mail to Vlastimil Babka seems to have html.
Let me put  again here because I also think the comment should be changed.
I'd like to hear opinions from the open source community.

Additionally actually I think we need accurate counting of highatomic
free after this patch.

----------------------------------------------------------------------------------------

As Mel also agreed with me in v1 mail thread, this highatomic reserved should
be considered even in watermark fast.

The comment, I think, may need to be changed. Prior to this patch, non
highatomic
allocation may do useless search, but it also can take ALL non highatomic free.

With this patch, non highatomic allocation will NOT do useless search. Rather,
it may be required direct reclamation even when there are some non
high atomic free.

i.e)
In following situation, watermark check fails (9MB - 8MB < 4MB) though there are
enough free (9MB - 4MB > 4MB). If this is really matter, we need to
count highatomic
free accurately.

min : 4MB,
highatomic reserved : 8MB
Total free : 9MB
  actual highatomic free : 4MB
  non highatomic free : 5MB

> >        */
> > -     if (!order && (free_pages - cma_pages) >
> > -                             mark + z->lowmem_reserve[highest_zoneidx])
> > -             return true;
> > +     if (!order) {
> > +             long fast_free = free_pages - unusable_free;
> > +
> > +             if (fast_free > mark + z->lowmem_reserve[highest_zoneidx])
> > +                     return true;
> > +     }
> >
> >       return __zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags,
> >                                       free_pages);
> > --
> > 2.17.1
> >
> >
>
Baoquan He June 13, 2020, 11:17 p.m. UTC | #5
On 06/13/20 at 10:08pm, Jaewon Kim wrote:
...
> > > This is an example of ALLOC_HARDER allocation failure.
> > >
> > > <4>[ 6207.637280]  [3:  Binder:9343_3:22875] Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
> > > <4>[ 6207.637311]  [3:  Binder:9343_3:22875] Call trace:
> > > <4>[ 6207.637346]  [3:  Binder:9343_3:22875] [<ffffff8008f40f8c>] dump_stack+0xb8/0xf0
> > > <4>[ 6207.637356]  [3:  Binder:9343_3:22875] [<ffffff8008223320>] warn_alloc+0xd8/0x12c
> > > <4>[ 6207.637365]  [3:  Binder:9343_3:22875] [<ffffff80082245e4>] __alloc_pages_nodemask+0x120c/0x1250
> > > <4>[ 6207.637374]  [3:  Binder:9343_3:22875] [<ffffff800827f6e8>] new_slab+0x128/0x604
> > > <4>[ 6207.637381]  [3:  Binder:9343_3:22875] [<ffffff800827b0cc>] ___slab_alloc+0x508/0x670
> > > <4>[ 6207.637387]  [3:  Binder:9343_3:22875] [<ffffff800827ba00>] __kmalloc+0x2f8/0x310
> > > <4>[ 6207.637396]  [3:  Binder:9343_3:22875] [<ffffff80084ac3e0>] context_struct_to_string+0x104/0x1cc
> > > <4>[ 6207.637404]  [3:  Binder:9343_3:22875] [<ffffff80084ad8fc>] security_sid_to_context_core+0x74/0x144
> > > <4>[ 6207.637412]  [3:  Binder:9343_3:22875] [<ffffff80084ad880>] security_sid_to_context+0x10/0x18
> > > <4>[ 6207.637421]  [3:  Binder:9343_3:22875] [<ffffff800849bd80>] selinux_secid_to_secctx+0x20/0x28
> > > <4>[ 6207.637430]  [3:  Binder:9343_3:22875] [<ffffff800849109c>] security_secid_to_secctx+0x3c/0x70
> > > <4>[ 6207.637442]  [3:  Binder:9343_3:22875] [<ffffff8008bfe118>] binder_transaction+0xe68/0x454c
> > > <4>[ 6207.637569]  [3:  Binder:9343_3:22875] Mem-Info:
> > > <4>[ 6207.637595]  [3:  Binder:9343_3:22875] active_anon:102061 inactive_anon:81551 isolated_anon:0
> > > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  active_file:59102 inactive_file:68924 isolated_file:64
> > > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  unevictable:611 dirty:63 writeback:0 unstable:0
> > > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  slab_reclaimable:13324 slab_unreclaimable:44354
> > > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  mapped:83015 shmem:4858 pagetables:26316 bounce:0
> > > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  free:2727 free_pcp:1035 free_cma:178
> > > <4>[ 6207.637616]  [3:  Binder:9343_3:22875] Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> > > <4>[ 6207.637627]  [3:  Binder:9343_3:22875] Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB

Checked this mem info, wondering why there's no 'reserved_highatomic'
printing in all these examples.

> > > <4>[ 6207.637632]  [3:  Binder:9343_3:22875] lowmem_reserve[]: 0 0
> > > <4>[ 6207.637637]  [3:  Binder:9343_3:22875] Normal: 505*4kB (H) 357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10236kB
> > > <4>[ 6207.637655]  [3:  Binder:9343_3:22875] 138826 total pagecache pages
> > > <4>[ 6207.637663]  [3:  Binder:9343_3:22875] 5460 pages in swap cache
> > > <4>[ 6207.637668]  [3:  Binder:9343_3:22875] Swap cache stats: add 8273090, delete 8267506, find 1004381/4060142
> > >
...
> > >  mm/page_alloc.c | 61 ++++++++++++++++++++++++++++---------------------
> > >  1 file changed, 35 insertions(+), 26 deletions(-)
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 48eb0f1410d4..c2177e056f19 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -3487,6 +3487,29 @@ static noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
> > >  }
> > >  ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);
> > >
> > > +static inline long __zone_watermark_unusable_free(struct zone *z,
> > > +                             unsigned int order, unsigned int alloc_flags)
> > > +{
> > > +     const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
> > > +     long unusable_free = (1 << order) - 1;
> > > +
> > > +     /*
> > > +      * If the caller does not have rights to ALLOC_HARDER then subtract
> > > +      * the high-atomic reserves. This will over-estimate the size of the
> > > +      * atomic reserve but it avoids a search.
> > > +      */
> > > +     if (likely(!alloc_harder))
> > > +             unusable_free += z->nr_reserved_highatomic;
> > > +
> > > +#ifdef CONFIG_CMA
> > > +     /* If allocation can't use CMA areas don't use free CMA pages */
> > > +     if (!(alloc_flags & ALLOC_CMA))
> > > +             unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
> > > +#endif
> > > +
> > > +     return unusable_free;
> > > +}
> > > +
> > >  /*
> > >   * Return true if free base pages are above 'mark'. For high-order checks it
> > >   * will return true of the order-0 watermark is reached and there is at least
> > > @@ -3502,19 +3525,12 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
> > >       const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
> > >
> > >       /* free_pages may go negative - that's OK */
> > > -     free_pages -= (1 << order) - 1;
> > > +     free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
> > >
> > >       if (alloc_flags & ALLOC_HIGH)
> > >               min -= min / 2;
> > >
> > > -     /*
> > > -      * If the caller does not have rights to ALLOC_HARDER then subtract
> > > -      * the high-atomic reserves. This will over-estimate the size of the
> > > -      * atomic reserve but it avoids a search.
> > > -      */
> > > -     if (likely(!alloc_harder)) {
> > > -             free_pages -= z->nr_reserved_highatomic;
> > > -     } else {
> > > +     if (unlikely(alloc_harder)) {
> > >               /*
> > >                * OOM victims can try even harder than normal ALLOC_HARDER
> > >                * users on the grounds that it's definitely going to be in
> > > @@ -3527,13 +3543,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
> > >                       min -= min / 4;
> > >       }
> > >
> > > -
> > > -#ifdef CONFIG_CMA
> > > -     /* If allocation can't use CMA areas don't use free CMA pages */
> > > -     if (!(alloc_flags & ALLOC_CMA))
> > > -             free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
> > > -#endif
> > > -
> > >       /*
> > >        * Check watermarks for an order-0 allocation request. If these
> > >        * are not met, then a high-order request also cannot go ahead
> > > @@ -3582,14 +3591,11 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
> > >                               unsigned long mark, int highest_zoneidx,
> > >                               unsigned int alloc_flags)
> > >  {
> > > -     long free_pages = zone_page_state(z, NR_FREE_PAGES);
> > > -     long cma_pages = 0;
> > > +     long free_pages;
> > > +     long unusable_free;
> > >
> > > -#ifdef CONFIG_CMA
> > > -     /* If allocation can't use CMA areas don't use free CMA pages */
> > > -     if (!(alloc_flags & ALLOC_CMA))
> > > -             cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
> > > -#endif
> > > +     free_pages = zone_page_state(z, NR_FREE_PAGES);
> > > +     unusable_free =  __zone_watermark_unusable_free(z, order, alloc_flags);
> > >
> > >       /*
> > >        * Fast check for order-0 only. If this fails then the reserves
> > > @@ -3598,9 +3604,12 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
> > >        * the caller is !atomic then it'll uselessly search the free
> > >        * list. That corner case is then slower but it is harmless.
> >
> > Do we need remove or adjust the code comment at this place? So Mel have
> > foreseen the corner case, just reclaiming to unreserve the highatomic
> > might be ignored.
> >
> 
> Hello thank you for your comment.
> 
> My previous mail to Vlastimil Babka seems to have html.
> Let me put  again here because I also think the comment should be changed.
> I'd like to hear opinions from the open source community.

Yeah, your replying mail to Vlastimil looks a little messy on format, I
didn't scroll down to check.

> 
> Additionally actually I think we need accurate counting of highatomic
> free after this patch.
> 
> ----------------------------------------------------------------------------------------
> 
> As Mel also agreed with me in v1 mail thread, this highatomic reserved should
> be considered even in watermark fast.
> 
> The comment, I think, may need to be changed. Prior to this patch, non
> highatomic
> allocation may do useless search, but it also can take ALL non highatomic free.
> 
> With this patch, non highatomic allocation will NOT do useless search. Rather,
> it may be required direct reclamation even when there are some non
> high atomic free.
> 
> i.e)
> In following situation, watermark check fails (9MB - 8MB < 4MB) though there are
> enough free (9MB - 4MB > 4MB). If this is really matter, we need to
> count highatomic
> free accurately.

Seems to make sense. We only use zone->nr_reserved_highatomic to account
the reserved highatomic, don't track how much have been used for
highatomic allocation. But not sure if this will happen often, or just a
very rare case, whether taking that into account will impact anything.

> 
> min : 4MB,
> highatomic reserved : 8MB
> Total free : 9MB
>   actual highatomic free : 4MB
>   non highatomic free : 5MB
> 
> > >        */
> > > -     if (!order && (free_pages - cma_pages) >
> > > -                             mark + z->lowmem_reserve[highest_zoneidx])
> > > -             return true;
> > > +     if (!order) {
> > > +             long fast_free = free_pages - unusable_free;
> > > +
> > > +             if (fast_free > mark + z->lowmem_reserve[highest_zoneidx])
> > > +                     return true;
> > > +     }
> > >
> > >       return __zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags,
> > >                                       free_pages);
> > > --
> > > 2.17.1
> > >
> > >
> >
>
Vlastimil Babka June 15, 2020, 11:25 a.m. UTC | #6
On 6/13/20 6:16 AM, Jaewon Kim wrote:
> 
> 
> 2020년 6월 12일 (금) 오후 11:34, Vlastimil Babka <vbabka@suse.cz
> <mailto:vbabka@suse.cz>>님이 작성:
>>
>> On 6/13/20 4:51 AM, Jaewon Kim wrote:
>> > zone_watermark_fast was introduced by commit 48ee5f3696f6 ("mm,
>> > page_alloc: shortcut watermark checks for order-0 pages"). The commit
>> > simply checks if free pages is bigger than watermark without additional
>> > calculation such like reducing watermark.
>> >
>> > It considered free cma pages but it did not consider highatomic
>> > reserved. This may incur exhaustion of free pages except high order
>> > atomic free pages.
>> >
>> > Assume that reserved_highatomic pageblock is bigger than watermark min,
>> > and there are only few free pages except high order atomic free. Because
>> > zone_watermark_fast passes the allocation without considering high order
>> > atomic free, normal reclaimable allocation like GFP_HIGHUSER will
>> > consume all the free pages. Then finally order-0 atomic allocation may
>> > fail on allocation.
>>
>> I don't understand why order-0 atomic allocation will fail. Is it because of
>> watermark check, or finding no suitable pages?
>> - watermark check should be OK as atomic allocations can use reserves
>> - suitable pages should be OK, even if all free pages are in the highatomic
>> reserves, because rmqueue() contains:
>>
>> if (alloc_flags & ALLOC_HARDER)
>>         page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
>>
>> So what am I missing?
>>
> Hello
> The order-0 atomic allocation can be failed because of depletion of suitable
> free page.
> Watermark check passes order-0 atomic allocation but it will be failed at
> finding a free page.
> The  __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC) can be used
> only for highorder.

Ah, that's what I missed, rmqueue() will divert all order-0 allocations to
rmqueue_pcplist() so those will not reach the hunk above. Thanks.

>> > @@ -3598,9 +3604,12 @@ static inline bool zone_watermark_fast(struct zone
> *z, unsigned int order,
>>         /*
>>          * Fast check for order-0 only. If this fails then the reserves
>>          * need to be calculated. There is a corner case where the check
>>          * passes but only the high-order atomic reserve are free. If
>> >        * the caller is !atomic then it'll uselessly search the free
>> >        * list. That corner case is then slower but it is harmless.
>> >        */
>>
>> The comment stops being true after this patch? It also suggests that Mel
>> anticipated this corner case, but that it should only cause a false positive
>> zone_watermark_fast() and then rmqueue() fails for !ALLOC_HARDER as it cannot
>> use MIGRATE_HIGHATOMIC blocks. It expects atomic order-0 still works. So what's
>> going on?
> 
> As Mel also agreed with me in v1 mail thread, this highatomic reserved should
> be considered even in watermark fast.
> 
> The comment, I think, may need to be changed. Prior to this patch, non highatomic
> allocation may do useless search, but it also can take ALL non highatomic free.
> 
> With this patch, non highatomic allocation will NOT do useless search. Rather,

Yes, that's what I meant.
Vlastimil Babka June 15, 2020, 11:36 a.m. UTC | #7
On 6/14/20 1:17 AM, Baoquan He wrote:
> On 06/13/20 at 10:08pm, Jaewon Kim wrote:
> ...
>> > > This is an example of ALLOC_HARDER allocation failure.
>> > >
>> > > <4>[ 6207.637280]  [3:  Binder:9343_3:22875] Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
>> > > <4>[ 6207.637311]  [3:  Binder:9343_3:22875] Call trace:
>> > > <4>[ 6207.637346]  [3:  Binder:9343_3:22875] [<ffffff8008f40f8c>] dump_stack+0xb8/0xf0
>> > > <4>[ 6207.637356]  [3:  Binder:9343_3:22875] [<ffffff8008223320>] warn_alloc+0xd8/0x12c
>> > > <4>[ 6207.637365]  [3:  Binder:9343_3:22875] [<ffffff80082245e4>] __alloc_pages_nodemask+0x120c/0x1250
>> > > <4>[ 6207.637374]  [3:  Binder:9343_3:22875] [<ffffff800827f6e8>] new_slab+0x128/0x604
>> > > <4>[ 6207.637381]  [3:  Binder:9343_3:22875] [<ffffff800827b0cc>] ___slab_alloc+0x508/0x670
>> > > <4>[ 6207.637387]  [3:  Binder:9343_3:22875] [<ffffff800827ba00>] __kmalloc+0x2f8/0x310
>> > > <4>[ 6207.637396]  [3:  Binder:9343_3:22875] [<ffffff80084ac3e0>] context_struct_to_string+0x104/0x1cc
>> > > <4>[ 6207.637404]  [3:  Binder:9343_3:22875] [<ffffff80084ad8fc>] security_sid_to_context_core+0x74/0x144
>> > > <4>[ 6207.637412]  [3:  Binder:9343_3:22875] [<ffffff80084ad880>] security_sid_to_context+0x10/0x18
>> > > <4>[ 6207.637421]  [3:  Binder:9343_3:22875] [<ffffff800849bd80>] selinux_secid_to_secctx+0x20/0x28
>> > > <4>[ 6207.637430]  [3:  Binder:9343_3:22875] [<ffffff800849109c>] security_secid_to_secctx+0x3c/0x70
>> > > <4>[ 6207.637442]  [3:  Binder:9343_3:22875] [<ffffff8008bfe118>] binder_transaction+0xe68/0x454c
>> > > <4>[ 6207.637569]  [3:  Binder:9343_3:22875] Mem-Info:
>> > > <4>[ 6207.637595]  [3:  Binder:9343_3:22875] active_anon:102061 inactive_anon:81551 isolated_anon:0
>> > > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  active_file:59102 inactive_file:68924 isolated_file:64
>> > > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  unevictable:611 dirty:63 writeback:0 unstable:0
>> > > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  slab_reclaimable:13324 slab_unreclaimable:44354
>> > > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  mapped:83015 shmem:4858 pagetables:26316 bounce:0
>> > > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  free:2727 free_pcp:1035 free_cma:178
>> > > <4>[ 6207.637616]  [3:  Binder:9343_3:22875] Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
>> > > <4>[ 6207.637627]  [3:  Binder:9343_3:22875] Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB
> 
> Checked this mem info, wondering why there's no 'reserved_highatomic'
> printing in all these examples.

Yeah, it better be printed, especially after it's included in watermark
calculation, so we're less confused by reports of allocation failure where
watermarks are seemingly ok.

...

>> > >       /*
>> > >        * Fast check for order-0 only. If this fails then the reserves
>> > > @@ -3598,9 +3604,12 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
>> > >        * the caller is !atomic then it'll uselessly search the free
>> > >        * list. That corner case is then slower but it is harmless.
>> >
>> > Do we need remove or adjust the code comment at this place? So Mel have
>> > foreseen the corner case, just reclaiming to unreserve the highatomic
>> > might be ignored.
>> >
>> 
>> Hello thank you for your comment.
>> 
>> My previous mail to Vlastimil Babka seems to have html.
>> Let me put  again here because I also think the comment should be changed.
>> I'd like to hear opinions from the open source community.
> 
> Yeah, your replying mail to Vlastimil looks a little messy on format, I
> didn't scroll down to check.
> 
>> 
>> Additionally actually I think we need accurate counting of highatomic
>> free after this patch.
>> 
>> ----------------------------------------------------------------------------------------
>> 
>> As Mel also agreed with me in v1 mail thread, this highatomic reserved should
>> be considered even in watermark fast.
>> 
>> The comment, I think, may need to be changed. Prior to this patch, non
>> highatomic
>> allocation may do useless search, but it also can take ALL non highatomic free.
>> 
>> With this patch, non highatomic allocation will NOT do useless search. Rather,
>> it may be required direct reclamation even when there are some non
>> high atomic free.
>> 
>> i.e)
>> In following situation, watermark check fails (9MB - 8MB < 4MB) though there are
>> enough free (9MB - 4MB > 4MB). If this is really matter, we need to
>> count highatomic
>> free accurately.
> 
> Seems to make sense. We only use zone->nr_reserved_highatomic to account
> the reserved highatomic, don't track how much have been used for
> highatomic allocation. But not sure if this will happen often, or just a
> very rare case, whether taking that into account will impact anything.

Unfortunately there's a problem when trying to account free pages of a migrate
type exactly, as e.g. during reserve_highatomic_pageblock(), some pages might be
in pcplist of other cpu with other migratetype, and once they are freed, the
buddy merging will merge the different migratetypes and distort the accounting.
Fixing this for all migratetypes would have performance overhead, so we only do
that for MIGRATE_ISOLATE which is not that frequent (and it took a while to
eliminate all corner cases), and CMA which doesn't change pageblocks dynamically.

So either we live with the possible overreclaim due to inaccurate counting per
your example above, or we instead let order-0 atomic allocations use highatomic
reserves.
Jaewon Kim June 16, 2020, 7:30 a.m. UTC | #8
>On 6/14/20 1:17 AM, Baoquan He wrote:
>> On 06/13/20 at 10:08pm, Jaewon Kim wrote:
>> ...
>>> > > This is an example of ALLOC_HARDER allocation failure.
>>> > >
>>> > > <4>[ 6207.637280]  [3:  Binder:9343_3:22875] Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
>>> > > <4>[ 6207.637311]  [3:  Binder:9343_3:22875] Call trace:
>>> > > <4>[ 6207.637346]  [3:  Binder:9343_3:22875] [<ffffff8008f40f8c>] dump_stack+0xb8/0xf0
>>> > > <4>[ 6207.637356]  [3:  Binder:9343_3:22875] [<ffffff8008223320>] warn_alloc+0xd8/0x12c
>>> > > <4>[ 6207.637365]  [3:  Binder:9343_3:22875] [<ffffff80082245e4>] __alloc_pages_nodemask+0x120c/0x1250
>>> > > <4>[ 6207.637374]  [3:  Binder:9343_3:22875] [<ffffff800827f6e8>] new_slab+0x128/0x604
>>> > > <4>[ 6207.637381]  [3:  Binder:9343_3:22875] [<ffffff800827b0cc>] ___slab_alloc+0x508/0x670
>>> > > <4>[ 6207.637387]  [3:  Binder:9343_3:22875] [<ffffff800827ba00>] __kmalloc+0x2f8/0x310
>>> > > <4>[ 6207.637396]  [3:  Binder:9343_3:22875] [<ffffff80084ac3e0>] context_struct_to_string+0x104/0x1cc
>>> > > <4>[ 6207.637404]  [3:  Binder:9343_3:22875] [<ffffff80084ad8fc>] security_sid_to_context_core+0x74/0x144
>>> > > <4>[ 6207.637412]  [3:  Binder:9343_3:22875] [<ffffff80084ad880>] security_sid_to_context+0x10/0x18
>>> > > <4>[ 6207.637421]  [3:  Binder:9343_3:22875] [<ffffff800849bd80>] selinux_secid_to_secctx+0x20/0x28
>>> > > <4>[ 6207.637430]  [3:  Binder:9343_3:22875] [<ffffff800849109c>] security_secid_to_secctx+0x3c/0x70
>>> > > <4>[ 6207.637442]  [3:  Binder:9343_3:22875] [<ffffff8008bfe118>] binder_transaction+0xe68/0x454c
>>> > > <4>[ 6207.637569]  [3:  Binder:9343_3:22875] Mem-Info:
>>> > > <4>[ 6207.637595]  [3:  Binder:9343_3:22875] active_anon:102061 inactive_anon:81551 isolated_anon:0
>>> > > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  active_file:59102 inactive_file:68924 isolated_file:64
>>> > > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  unevictable:611 dirty:63 writeback:0 unstable:0
>>> > > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  slab_reclaimable:13324 slab_unreclaimable:44354
>>> > > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  mapped:83015 shmem:4858 pagetables:26316 bounce:0
>>> > > <4>[ 6207.637595]  [3:  Binder:9343_3:22875]  free:2727 free_pcp:1035 free_cma:178
>>> > > <4>[ 6207.637616]  [3:  Binder:9343_3:22875] Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
>>> > > <4>[ 6207.637627]  [3:  Binder:9343_3:22875] Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB
>> 
>> Checked this mem info, wondering why there's no 'reserved_highatomic'
>> printing in all these examples.
> 
>Yeah, it better be printed, especially after it's included in watermark
>calculation, so we're less confused by reports of allocation failure where
>watermarks are seemingly ok.
> 


Hello Vlastimil and Baoquan

The log in previous mail was captured from kernel based on v4.14.
After adding the reserved_highatomic log, I finally got a new log below
Let me change description in next patch.

There seems be reserved_highatomic:32768KB and actually 14232kB free.

[ 4738.329298] kswapd0: page allocation failure: order:0, mode:0x140000a(GFP_NOIO|__GFP_HIGHMEM|__GFP_MOVABLE), nodemask=(null)
[ 4738.329325] kswapd0 cpuset=/ mems_allowed=0
[ 4738.329339] CPU: 4 PID: 1221 Comm: kswapd0 Not tainted 4.14.113-18770262-userdebug #1
[ 4738.329350] Call trace: 
[ 4738.329366] [<0000000000000000>] dump_backtrace+0x0/0x248
[ 4738.329377] [<0000000000000000>] show_stack+0x18/0x20
[ 4738.329390] [<0000000000000000>] __dump_stack+0x20/0x28
[ 4738.329398] [<0000000000000000>] dump_stack+0x68/0x90
[ 4738.329409] [<0000000000000000>] warn_alloc+0x104/0x198
[ 4738.329417] [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0
[ 4738.329427] [<0000000000000000>] zs_malloc+0x148/0x3d0
[ 4738.329438] [<0000000000000000>] zram_bvec_rw+0x410/0x798
[ 4738.329446] [<0000000000000000>] zram_rw_page+0x88/0xdc
[ 4738.329455] [<0000000000000000>] bdev_write_page+0x70/0xbc
[ 4738.329463] [<0000000000000000>] __swap_writepage+0x58/0x37c
[ 4738.329469] [<0000000000000000>] swap_writepage+0x40/0x4c
[ 4738.329478] [<0000000000000000>] shrink_page_list+0xc30/0xf48
[ 4738.329486] [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c
[ 4738.329494] [<0000000000000000>] shrink_node_memcg+0x23c/0x618
[ 4738.329501] [<0000000000000000>] shrink_node+0x1c8/0x304
[ 4738.329509] [<0000000000000000>] kswapd+0x680/0x7c4
[ 4738.329518] [<0000000000000000>] kthread+0x110/0x120
[ 4738.329527] [<0000000000000000>] ret_from_fork+0x10/0x18
[ 4738.329538] Mem-Info:
[ 4738.329574] active_anon:111826 inactive_anon:65557 isolated_anon:0\x0a active_file:44260 inactive_file:83422 isolated_file:0\x0a unevictable:4158 dirty:117 writeback:0 unstable:0\x0a            slab_reclaimable:13943 slab_unreclaimable:43315\x0a mapped:102511 shmem:3299 pagetables:19566 bounce:0\x0a free:3510 free_pcp:553 free_cma:0
[ 4738.329593] Node 0 active_anon:447304kB inactive_anon:262228kB active_file:177040kB inactive_file:333688kB unevictable:16632kB isolated(anon):0kB isolated(file):0kB mapped:410044kB d           irty:468kB writeback:0kB shmem:13196kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 4738.329603] Normal free:14040kB min:7440kB low:94500kB high:98136kB reserved_highatomic:32768KB active_anon:447336kB inactive_anon:261668kB active_file:177572kB inactive_file:333768k           B unevictable:16632kB writepending:480kB present:4081664kB managed:3637088kB mlocked:16632kB kernel_stack:47072kB pagetables:78264kB bounce:0kB free_pcp:2280kB local_pcp:720kB free_cma:0kB        [ 4738.329607] lowmem_reserve[]: 0 0
[ 4738.329615] Normal: 860*4kB (H) 453*8kB (H) 180*16kB (H) 26*32kB (H) 34*64kB (H) 6*128kB (H) 2*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14232kB


>...
> 
>>> > >       /*
>>> > >        * Fast check for order-0 only. If this fails then the reserves
>>> > > @@ -3598,9 +3604,12 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
>>> > >        * the caller is !atomic then it'll uselessly search the free
>>> > >        * list. That corner case is then slower but it is harmless.
>>> >
>>> > Do we need remove or adjust the code comment at this place? So Mel have
>>> > foreseen the corner case, just reclaiming to unreserve the highatomic
>>> > might be ignored.
>>> >
>>> 
>>> Hello thank you for your comment.
>>> 
>>> My previous mail to Vlastimil Babka seems to have html.
>>> Let me put  again here because I also think the comment should be changed.
>>> I'd like to hear opinions from the open source community.
>> 
>> Yeah, your replying mail to Vlastimil looks a little messy on format, I
>> didn't scroll down to check.
>> 
>>> 
>>> Additionally actually I think we need accurate counting of highatomic
>>> free after this patch.
>>> 
>>> ----------------------------------------------------------------------------------------
>>> 
>>> As Mel also agreed with me in v1 mail thread, this highatomic reserved should
>>> be considered even in watermark fast.
>>> 
>>> The comment, I think, may need to be changed. Prior to this patch, non
>>> highatomic
>>> allocation may do useless search, but it also can take ALL non highatomic free.
>>> 
>>> With this patch, non highatomic allocation will NOT do useless search. Rather,
>>> it may be required direct reclamation even when there are some non
>>> high atomic free.
>>> 
>>> i.e)
>>> In following situation, watermark check fails (9MB - 8MB < 4MB) though there are
>>> enough free (9MB - 4MB > 4MB). If this is really matter, we need to
>>> count highatomic
>>> free accurately.
>> 
>> Seems to make sense. We only use zone->nr_reserved_highatomic to account
>> the reserved highatomic, don't track how much have been used for
>> highatomic allocation. But not sure if this will happen often, or just a
>> very rare case, whether taking that into account will impact anything.
> 
>Unfortunately there's a problem when trying to account free pages of a migrate
>type exactly, as e.g. during reserve_highatomic_pageblock(), some pages might be
>in pcplist of other cpu with other migratetype, and once they are freed, the
>buddy merging will merge the different migratetypes and distort the accounting.
>Fixing this for all migratetypes would have performance overhead, so we only do
>that for MIGRATE_ISOLATE which is not that frequent (and it took a while to
>eliminate all corner cases), and CMA which doesn't change pageblocks dynamically.

AFAIK we do not account free cma in pcp either. But yes accurate check could be 
overhead. For example, __mod_zone_freepage_state should account highatomic free 
as cma free. And we may see some incorrect accounting issue.

> 
>So either we live with the possible overreclaim due to inaccurate counting per
>your example above, or we instead let order-0 atomic allocations use highatomic
>reserves.
>

Additionally regarding existing Mel's comment, let me remove some of it if you
don't mind.

        /*
         * Fast check for order-0 only. If this fails then the reserves
-        * need to be calculated. There is a corner case where the check
-        * passes but only the high-order atomic reserve are free. If
-        * the caller is !atomic then it'll uselessly search the free
-        * list. That corner case is then slower but it is harmless.
+        * need to be calculated.
         */

I will prepare v3 patch.
Thank you again.
Baoquan He June 16, 2020, 2:17 p.m. UTC | #9
On 06/16/20 at 04:30pm, 김재원 wrote:
> >>> > > <4>[ 6207.637627]  [3:  Binder:9343_3:22875] Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB
> >> 
> >> Checked this mem info, wondering why there's no 'reserved_highatomic'
> >> printing in all these examples.
> > 
> >Yeah, it better be printed, especially after it's included in watermark
> >calculation, so we're less confused by reports of allocation failure where
> >watermarks are seemingly ok.
> > 
> 
> 
> Hello Vlastimil and Baoquan
> 
> The log in previous mail was captured from kernel based on v4.14.
> After adding the reserved_highatomic log, I finally got a new log below
> Let me change description in next patch.
> 
> There seems be reserved_highatomic:32768KB and actually 14232kB free.
> 
> [ 4738.329298] kswapd0: page allocation failure: order:0, mode:0x140000a(GFP_NOIO|__GFP_HIGHMEM|__GFP_MOVABLE), nodemask=(null)
> [ 4738.329325] kswapd0 cpuset=/ mems_allowed=0
> [ 4738.329339] CPU: 4 PID: 1221 Comm: kswapd0 Not tainted 4.14.113-18770262-userdebug #1
> [ 4738.329350] Call trace: 
> [ 4738.329366] [<0000000000000000>] dump_backtrace+0x0/0x248
> [ 4738.329377] [<0000000000000000>] show_stack+0x18/0x20
> [ 4738.329390] [<0000000000000000>] __dump_stack+0x20/0x28
> [ 4738.329398] [<0000000000000000>] dump_stack+0x68/0x90
> [ 4738.329409] [<0000000000000000>] warn_alloc+0x104/0x198
> [ 4738.329417] [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0
> [ 4738.329427] [<0000000000000000>] zs_malloc+0x148/0x3d0
> [ 4738.329438] [<0000000000000000>] zram_bvec_rw+0x410/0x798
> [ 4738.329446] [<0000000000000000>] zram_rw_page+0x88/0xdc
> [ 4738.329455] [<0000000000000000>] bdev_write_page+0x70/0xbc
> [ 4738.329463] [<0000000000000000>] __swap_writepage+0x58/0x37c
> [ 4738.329469] [<0000000000000000>] swap_writepage+0x40/0x4c
> [ 4738.329478] [<0000000000000000>] shrink_page_list+0xc30/0xf48
> [ 4738.329486] [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c
> [ 4738.329494] [<0000000000000000>] shrink_node_memcg+0x23c/0x618
> [ 4738.329501] [<0000000000000000>] shrink_node+0x1c8/0x304
> [ 4738.329509] [<0000000000000000>] kswapd+0x680/0x7c4
> [ 4738.329518] [<0000000000000000>] kthread+0x110/0x120
> [ 4738.329527] [<0000000000000000>] ret_from_fork+0x10/0x18
> [ 4738.329538] Mem-Info:
> [ 4738.329574] active_anon:111826 inactive_anon:65557 isolated_anon:0\x0a active_file:44260 inactive_file:83422 isolated_file:0\x0a unevictable:4158 dirty:117 writeback:0 unstable:0\x0a            slab_reclaimable:13943 slab_unreclaimable:43315\x0a mapped:102511 shmem:3299 pagetables:19566 bounce:0\x0a free:3510 free_pcp:553 free_cma:0
> [ 4738.329593] Node 0 active_anon:447304kB inactive_anon:262228kB active_file:177040kB inactive_file:333688kB unevictable:16632kB isolated(anon):0kB isolated(file):0kB mapped:410044kB d           irty:468kB writeback:0kB shmem:13196kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [ 4738.329603] Normal free:14040kB min:7440kB low:94500kB high:98136kB reserved_highatomic:32768KB active_anon:447336kB inactive_anon:261668kB active_file:177572kB inactive_file:333768k           B unevictable:16632kB writepending:480kB present:4081664kB managed:3637088kB mlocked:16632kB kernel_stack:47072kB pagetables:78264kB bounce:0kB free_pcp:2280kB local_pcp:720kB free_cma:0kB        [ 4738.329607] lowmem_reserve[]: 0 0
> [ 4738.329615] Normal: 860*4kB (H) 453*8kB (H) 180*16kB (H) 26*32kB (H) 34*64kB (H) 6*128kB (H) 2*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14232kB
> 
> 
> >...
> > 
> >>> > >       /*
> >>> > >        * Fast check for order-0 only. If this fails then the reserves
> >>> > > @@ -3598,9 +3604,12 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
> >>> > >        * the caller is !atomic then it'll uselessly search the free
> >>> > >        * list. That corner case is then slower but it is harmless.
> >>> >
> >>> > Do we need remove or adjust the code comment at this place? So Mel have
> >>> > foreseen the corner case, just reclaiming to unreserve the highatomic
> >>> > might be ignored.
> >>> >
> >>> 
> >>> Hello thank you for your comment.
> >>> 
> >>> My previous mail to Vlastimil Babka seems to have html.
> >>> Let me put  again here because I also think the comment should be changed.
> >>> I'd like to hear opinions from the open source community.
> >> 
> >> Yeah, your replying mail to Vlastimil looks a little messy on format, I
> >> didn't scroll down to check.
> >> 
> >>> 
> >>> Additionally actually I think we need accurate counting of highatomic
> >>> free after this patch.
> >>> 
> >>> ----------------------------------------------------------------------------------------
> >>> 
> >>> As Mel also agreed with me in v1 mail thread, this highatomic reserved should
> >>> be considered even in watermark fast.
> >>> 
> >>> The comment, I think, may need to be changed. Prior to this patch, non
> >>> highatomic
> >>> allocation may do useless search, but it also can take ALL non highatomic free.
> >>> 
> >>> With this patch, non highatomic allocation will NOT do useless search. Rather,
> >>> it may be required direct reclamation even when there are some non
> >>> high atomic free.
> >>> 
> >>> i.e)
> >>> In following situation, watermark check fails (9MB - 8MB < 4MB) though there are
> >>> enough free (9MB - 4MB > 4MB). If this is really matter, we need to
> >>> count highatomic
> >>> free accurately.
> >> 
> >> Seems to make sense. We only use zone->nr_reserved_highatomic to account
> >> the reserved highatomic, don't track how much have been used for
> >> highatomic allocation. But not sure if this will happen often, or just a
> >> very rare case, whether taking that into account will impact anything.
> > 
> >Unfortunately there's a problem when trying to account free pages of a migrate
> >type exactly, as e.g. during reserve_highatomic_pageblock(), some pages might be
> >in pcplist of other cpu with other migratetype, and once they are freed, the
> >buddy merging will merge the different migratetypes and distort the accounting.

Yeah, the migratetype could have been cached in page->index before we
finish the reserve_highatomic_pageblock(). Seems we take a very coarse
grained way to do the highatomic reservation and accounting. When I went
through the related code, found out we call
reserve_highatomic_pageblock() if below condition is met. So what if
order is 1, and all other pages on the page block have been used? Do we
possibly have this kind of extreme case?

From Jaewon's captured information, we can see that the available free
highatomic is even less than half the zone->nr_reserved_highatomic.
Should we at least limit the reservation to the case with a bigger
order. E.g 1/2 of pageblock_nr_pages? Please correct me if I am wrong or
miss anyting.

"reserved_highatomic:32768KB and actually 14232kB free."

get_page_from_freelist
{
	...
			if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
                                reserve_highatomic_pageblock(page, zone, order);
	...
}
> >Fixing this for all migratetypes would have performance overhead, so we only do
> >that for MIGRATE_ISOLATE which is not that frequent (and it took a while to
> >eliminate all corner cases), and CMA which doesn't change pageblocks dynamically.
> 
> AFAIK we do not account free cma in pcp either. But yes accurate check could be 
> overhead. For example, __mod_zone_freepage_state should account highatomic free 
> as cma free. And we may see some incorrect accounting issue.
> 
> > 
> >So either we live with the possible overreclaim due to inaccurate counting per
> >your example above, or we instead let order-0 atomic allocations use highatomic
> >reserves.
> >
> 
> Additionally regarding existing Mel's comment, let me remove some of it if you
> don't mind.
> 
>         /*
>          * Fast check for order-0 only. If this fails then the reserves
> -        * need to be calculated. There is a corner case where the check
> -        * passes but only the high-order atomic reserve are free. If
> -        * the caller is !atomic then it'll uselessly search the free
> -        * list. That corner case is then slower but it is harmless.
> +        * need to be calculated.
>          */
> 
> I will prepare v3 patch.
> Thank you again.
>
Jaewon Kim June 16, 2020, 3:46 p.m. UTC | #10
.,

2020년 6월 16일 (화) 오후 11:17, Baoquan He <bhe@redhat.com>님이 작성:
>
> On 06/16/20 at 04:30pm, 김재원 wrote:
> > >>> > > <4>[ 6207.637627]  [3:  Binder:9343_3:22875] Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB
> > >>
> > >> Checked this mem info, wondering why there's no 'reserved_highatomic'
> > >> printing in all these examples.
> > >
> > >Yeah, it better be printed, especially after it's included in watermark
> > >calculation, so we're less confused by reports of allocation failure where
> > >watermarks are seemingly ok.
> > >
> >
> >
> > Hello Vlastimil and Baoquan
> >
> > The log in previous mail was captured from kernel based on v4.14.
> > After adding the reserved_highatomic log, I finally got a new log below
> > Let me change description in next patch.
> >
> > There seems be reserved_highatomic:32768KB and actually 14232kB free.
> >
> > [ 4738.329298] kswapd0: page allocation failure: order:0, mode:0x140000a(GFP_NOIO|__GFP_HIGHMEM|__GFP_MOVABLE), nodemask=(null)
> > [ 4738.329325] kswapd0 cpuset=/ mems_allowed=0
> > [ 4738.329339] CPU: 4 PID: 1221 Comm: kswapd0 Not tainted 4.14.113-18770262-userdebug #1
> > [ 4738.329350] Call trace:
> > [ 4738.329366] [<0000000000000000>] dump_backtrace+0x0/0x248
> > [ 4738.329377] [<0000000000000000>] show_stack+0x18/0x20
> > [ 4738.329390] [<0000000000000000>] __dump_stack+0x20/0x28
> > [ 4738.329398] [<0000000000000000>] dump_stack+0x68/0x90
> > [ 4738.329409] [<0000000000000000>] warn_alloc+0x104/0x198
> > [ 4738.329417] [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0
> > [ 4738.329427] [<0000000000000000>] zs_malloc+0x148/0x3d0
> > [ 4738.329438] [<0000000000000000>] zram_bvec_rw+0x410/0x798
> > [ 4738.329446] [<0000000000000000>] zram_rw_page+0x88/0xdc
> > [ 4738.329455] [<0000000000000000>] bdev_write_page+0x70/0xbc
> > [ 4738.329463] [<0000000000000000>] __swap_writepage+0x58/0x37c
> > [ 4738.329469] [<0000000000000000>] swap_writepage+0x40/0x4c
> > [ 4738.329478] [<0000000000000000>] shrink_page_list+0xc30/0xf48
> > [ 4738.329486] [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c
> > [ 4738.329494] [<0000000000000000>] shrink_node_memcg+0x23c/0x618
> > [ 4738.329501] [<0000000000000000>] shrink_node+0x1c8/0x304
> > [ 4738.329509] [<0000000000000000>] kswapd+0x680/0x7c4
> > [ 4738.329518] [<0000000000000000>] kthread+0x110/0x120
> > [ 4738.329527] [<0000000000000000>] ret_from_fork+0x10/0x18
> > [ 4738.329538] Mem-Info:
> > [ 4738.329574] active_anon:111826 inactive_anon:65557 isolated_anon:0\x0a active_file:44260 inactive_file:83422 isolated_file:0\x0a unevictable:4158 dirty:117 writeback:0 unstable:0\x0a            slab_reclaimable:13943 slab_unreclaimable:43315\x0a mapped:102511 shmem:3299 pagetables:19566 bounce:0\x0a free:3510 free_pcp:553 free_cma:0
> > [ 4738.329593] Node 0 active_anon:447304kB inactive_anon:262228kB active_file:177040kB inactive_file:333688kB unevictable:16632kB isolated(anon):0kB isolated(file):0kB mapped:410044kB d           irty:468kB writeback:0kB shmem:13196kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> > [ 4738.329603] Normal free:14040kB min:7440kB low:94500kB high:98136kB reserved_highatomic:32768KB active_anon:447336kB inactive_anon:261668kB active_file:177572kB inactive_file:333768k           B unevictable:16632kB writepending:480kB present:4081664kB managed:3637088kB mlocked:16632kB kernel_stack:47072kB pagetables:78264kB bounce:0kB free_pcp:2280kB local_pcp:720kB free_cma:0kB        [ 4738.329607] lowmem_reserve[]: 0 0
> > [ 4738.329615] Normal: 860*4kB (H) 453*8kB (H) 180*16kB (H) 26*32kB (H) 34*64kB (H) 6*128kB (H) 2*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14232kB
> >
> >
> > >...
> > >
> > >>> > >       /*
> > >>> > >        * Fast check for order-0 only. If this fails then the reserves
> > >>> > > @@ -3598,9 +3604,12 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
> > >>> > >        * the caller is !atomic then it'll uselessly search the free
> > >>> > >        * list. That corner case is then slower but it is harmless.
> > >>> >
> > >>> > Do we need remove or adjust the code comment at this place? So Mel have
> > >>> > foreseen the corner case, just reclaiming to unreserve the highatomic
> > >>> > might be ignored.
> > >>> >
> > >>>
> > >>> Hello thank you for your comment.
> > >>>
> > >>> My previous mail to Vlastimil Babka seems to have html.
> > >>> Let me put  again here because I also think the comment should be changed.
> > >>> I'd like to hear opinions from the open source community.
> > >>
> > >> Yeah, your replying mail to Vlastimil looks a little messy on format, I
> > >> didn't scroll down to check.
> > >>
> > >>>
> > >>> Additionally actually I think we need accurate counting of highatomic
> > >>> free after this patch.
> > >>>
> > >>> ----------------------------------------------------------------------------------------
> > >>>
> > >>> As Mel also agreed with me in v1 mail thread, this highatomic reserved should
> > >>> be considered even in watermark fast.
> > >>>
> > >>> The comment, I think, may need to be changed. Prior to this patch, non
> > >>> highatomic
> > >>> allocation may do useless search, but it also can take ALL non highatomic free.
> > >>>
> > >>> With this patch, non highatomic allocation will NOT do useless search. Rather,
> > >>> it may be required direct reclamation even when there are some non
> > >>> high atomic free.
> > >>>
> > >>> i.e)
> > >>> In following situation, watermark check fails (9MB - 8MB < 4MB) though there are
> > >>> enough free (9MB - 4MB > 4MB). If this is really matter, we need to
> > >>> count highatomic
> > >>> free accurately.
> > >>
> > >> Seems to make sense. We only use zone->nr_reserved_highatomic to account
> > >> the reserved highatomic, don't track how much have been used for
> > >> highatomic allocation. But not sure if this will happen often, or just a
> > >> very rare case, whether taking that into account will impact anything.
> > >
> > >Unfortunately there's a problem when trying to account free pages of a migrate
> > >type exactly, as e.g. during reserve_highatomic_pageblock(), some pages might be
> > >in pcplist of other cpu with other migratetype, and once they are freed, the
> > >buddy merging will merge the different migratetypes and distort the accounting.
>
> Yeah, the migratetype could have been cached in page->index before we
> finish the reserve_highatomic_pageblock(). Seems we take a very coarse
> grained way to do the highatomic reservation and accounting. When I went
> through the related code, found out we call
> reserve_highatomic_pageblock() if below condition is met. So what if
> order is 1, and all other pages on the page block have been used? Do we
> possibly have this kind of extreme case?

If I correctly understand your question, yes I think so.
If a hight-order free page was allocated on ALLOC_HARDER context, and the page
was the last order-1, then zone->nr_reserved_highatomic will be increased by
pageblock_nr_pages even though there was actually no free page moved to the
highatomic free page list.

The highatomic logic, I think, was originally designed to reserve in
that coarse grained
way to mitigate highatomic allocation failure.

>
> From Jaewon's captured information, we can see that the available free
> highatomic is even less than half the zone->nr_reserved_highatomic.
> Should we at least limit the reservation to the case with a bigger
> order. E.g 1/2 of pageblock_nr_pages? Please correct me if I am wrong or
> miss anyting.
>

I do not know well, but I think high-order lower than 1/2 of pageblock_nr_pages
also should be considered. i.e) a system using huge order-3 atomic allocation
in a short time.

> "reserved_highatomic:32768KB and actually 14232kB free."

I think this unwanted case might happen by freed pages. The pages allocated
for non-high-atomic also would be freed back into highatomic free
list. But I guess
there was sudden huge use of highatomic and partially freed.

>
> get_page_from_freelist
> {
>         ...
>                         if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
>                                 reserve_highatomic_pageblock(page, zone, order);
>         ...
> }
> > >Fixing this for all migratetypes would have performance overhead, so we only do
> > >that for MIGRATE_ISOLATE which is not that frequent (and it took a while to
> > >eliminate all corner cases), and CMA which doesn't change pageblocks dynamically.
> >
> > AFAIK we do not account free cma in pcp either. But yes accurate check could be
> > overhead. For example, __mod_zone_freepage_state should account highatomic free
> > as cma free. And we may see some incorrect accounting issue.
> >
> > >
> > >So either we live with the possible overreclaim due to inaccurate counting per
> > >your example above, or we instead let order-0 atomic allocations use highatomic
> > >reserves.
> > >
> >
> > Additionally regarding existing Mel's comment, let me remove some of it if you
> > don't mind.
> >
> >         /*
> >          * Fast check for order-0 only. If this fails then the reserves
> > -        * need to be calculated. There is a corner case where the check
> > -        * passes but only the high-order atomic reserve are free. If
> > -        * the caller is !atomic then it'll uselessly search the free
> > -        * list. That corner case is then slower but it is harmless.
> > +        * need to be calculated.
> >          */
> >
> > I will prepare v3 patch.
> > Thank you again.
> >
>
Baoquan He June 17, 2020, 4:03 a.m. UTC | #11
On 06/17/20 at 12:46am, Jaewon Kim wrote:
...
> > > >>> i.e)
> > > >>> In following situation, watermark check fails (9MB - 8MB < 4MB) though there are
> > > >>> enough free (9MB - 4MB > 4MB). If this is really matter, we need to
> > > >>> count highatomic
> > > >>> free accurately.
> > > >>
> > > >> Seems to make sense. We only use zone->nr_reserved_highatomic to account
> > > >> the reserved highatomic, don't track how much have been used for
> > > >> highatomic allocation. But not sure if this will happen often, or just a
> > > >> very rare case, whether taking that into account will impact anything.
> > > >
> > > >Unfortunately there's a problem when trying to account free pages of a migrate
> > > >type exactly, as e.g. during reserve_highatomic_pageblock(), some pages might be
> > > >in pcplist of other cpu with other migratetype, and once they are freed, the
> > > >buddy merging will merge the different migratetypes and distort the accounting.
> >
> > Yeah, the migratetype could have been cached in page->index before we
> > finish the reserve_highatomic_pageblock(). Seems we take a very coarse
> > grained way to do the highatomic reservation and accounting. When I went
> > through the related code, found out we call
> > reserve_highatomic_pageblock() if below condition is met. So what if
> > order is 1, and all other pages on the page block have been used? Do we
> > possibly have this kind of extreme case?
> 
> If I correctly understand your question, yes I think so.
> If a hight-order free page was allocated on ALLOC_HARDER context, and the page
> was the last order-1, then zone->nr_reserved_highatomic will be increased by
> pageblock_nr_pages even though there was actually no free page moved to the
> highatomic free page list.

Exactly.

> 
> The highatomic logic, I think, was originally designed to reserve in
> that coarse grained
> way to mitigate highatomic allocation failure.
> 
> >
> > From Jaewon's captured information, we can see that the available free
> > highatomic is even less than half the zone->nr_reserved_highatomic.
> > Should we at least limit the reservation to the case with a bigger
> > order. E.g 1/2 of pageblock_nr_pages? Please correct me if I am wrong or
> > miss anyting.
> >
> 
> I do not know well, but I think high-order lower than 1/2 of pageblock_nr_pages
> also should be considered. i.e) a system using huge order-3 atomic allocation
> in a short time.
> 
> > "reserved_highatomic:32768KB and actually 14232kB free."
> 
> I think this unwanted case might happen by freed pages. The pages allocated
> for non-high-atomic also would be freed back into highatomic free
> list. But I guess
> there was sudden huge use of highatomic and partially freed.

OK, thanks for sharing. So we can leave with it, may do some improvement
if any issue is reported in the future.

> 
> >
> > get_page_from_freelist
> > {
> >         ...
> >                         if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
> >                                 reserve_highatomic_pageblock(page, zone, order);
> >         ...
> > }
> > > >Fixing this for all migratetypes would have performance overhead, so we only do
> > > >that for MIGRATE_ISOLATE which is not that frequent (and it took a while to
> > > >eliminate all corner cases), and CMA which doesn't change pageblocks dynamically.
> > >
> > > AFAIK we do not account free cma in pcp either. But yes accurate check could be
> > > overhead. For example, __mod_zone_freepage_state should account highatomic free
> > > as cma free. And we may see some incorrect accounting issue.
> > >
> > > >
> > > >So either we live with the possible overreclaim due to inaccurate counting per
> > > >your example above, or we instead let order-0 atomic allocations use highatomic
> > > >reserves.
> > > >
> > >
> > > Additionally regarding existing Mel's comment, let me remove some of it if you
> > > don't mind.
> > >
> > >         /*
> > >          * Fast check for order-0 only. If this fails then the reserves
> > > -        * need to be calculated. There is a corner case where the check
> > > -        * passes but only the high-order atomic reserve are free. If
> > > -        * the caller is !atomic then it'll uselessly search the free
> > > -        * list. That corner case is then slower but it is harmless.
> > > +        * need to be calculated.
> > >          */
> > >
> > > I will prepare v3 patch.
> > > Thank you again.
> > >
> >
>
diff mbox series

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eb0f1410d4..c2177e056f19 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3487,6 +3487,29 @@  static noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 }
 ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);
 
+static inline long __zone_watermark_unusable_free(struct zone *z,
+				unsigned int order, unsigned int alloc_flags)
+{
+	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
+	long unusable_free = (1 << order) - 1;
+
+	/*
+	 * If the caller does not have rights to ALLOC_HARDER then subtract
+	 * the high-atomic reserves. This will over-estimate the size of the
+	 * atomic reserve but it avoids a search.
+	 */
+	if (likely(!alloc_harder))
+		unusable_free += z->nr_reserved_highatomic;
+
+#ifdef CONFIG_CMA
+	/* If allocation can't use CMA areas don't use free CMA pages */
+	if (!(alloc_flags & ALLOC_CMA))
+		unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
+#endif
+
+	return unusable_free;
+}
+
 /*
  * Return true if free base pages are above 'mark'. For high-order checks it
  * will return true of the order-0 watermark is reached and there is at least
@@ -3502,19 +3525,12 @@  bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
 
 	/* free_pages may go negative - that's OK */
-	free_pages -= (1 << order) - 1;
+	free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags);
 
 	if (alloc_flags & ALLOC_HIGH)
 		min -= min / 2;
 
-	/*
-	 * If the caller does not have rights to ALLOC_HARDER then subtract
-	 * the high-atomic reserves. This will over-estimate the size of the
-	 * atomic reserve but it avoids a search.
-	 */
-	if (likely(!alloc_harder)) {
-		free_pages -= z->nr_reserved_highatomic;
-	} else {
+	if (unlikely(alloc_harder)) {
 		/*
 		 * OOM victims can try even harder than normal ALLOC_HARDER
 		 * users on the grounds that it's definitely going to be in
@@ -3527,13 +3543,6 @@  bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			min -= min / 4;
 	}
 
-
-#ifdef CONFIG_CMA
-	/* If allocation can't use CMA areas don't use free CMA pages */
-	if (!(alloc_flags & ALLOC_CMA))
-		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
-#endif
-
 	/*
 	 * Check watermarks for an order-0 allocation request. If these
 	 * are not met, then a high-order request also cannot go ahead
@@ -3582,14 +3591,11 @@  static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
 				unsigned long mark, int highest_zoneidx,
 				unsigned int alloc_flags)
 {
-	long free_pages = zone_page_state(z, NR_FREE_PAGES);
-	long cma_pages = 0;
+	long free_pages;
+	long unusable_free;
 
-#ifdef CONFIG_CMA
-	/* If allocation can't use CMA areas don't use free CMA pages */
-	if (!(alloc_flags & ALLOC_CMA))
-		cma_pages = zone_page_state(z, NR_FREE_CMA_PAGES);
-#endif
+	free_pages = zone_page_state(z, NR_FREE_PAGES);
+	unusable_free =  __zone_watermark_unusable_free(z, order, alloc_flags);
 
 	/*
 	 * Fast check for order-0 only. If this fails then the reserves
@@ -3598,9 +3604,12 @@  static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
 	 * the caller is !atomic then it'll uselessly search the free
 	 * list. That corner case is then slower but it is harmless.
 	 */
-	if (!order && (free_pages - cma_pages) >
-				mark + z->lowmem_reserve[highest_zoneidx])
-		return true;
+	if (!order) {
+		long fast_free = free_pages - unusable_free;
+
+		if (fast_free > mark + z->lowmem_reserve[highest_zoneidx])
+			return true;
+	}
 
 	return __zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags,
 					free_pages);