mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

Message ID 20180907130550.11885-1-mhocko@kernel.org (mailing list archive)
State New, archived
Series mm, thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

Commit Message

Michal Hocko Sept. 7, 2018, 1:05 p.m. UTC
From: Michal Hocko <mhocko@suse.com>

Andrea has noticed [1] that a THP allocation might be really disruptive
when allocated on a NUMA system with the local node full or hard to
reclaim. Stefan has posted an allocation stall report on a 4.12-based
SLES kernel which suggests the same issue:
[245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
[245513.363983] kvm cpuset=/ mems_allowed=0-1
[245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph #1 SLE15 (unreleased)
[245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
[245513.365905] Call Trace:
[245513.366535]  dump_stack+0x5c/0x84
[245513.367148]  warn_alloc+0xe0/0x180
[245513.367769]  __alloc_pages_slowpath+0x820/0xc90
[245513.368406]  ? __slab_free+0xa9/0x2f0
[245513.369048]  ? __slab_free+0xa9/0x2f0
[245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
[245513.370300]  alloc_pages_vma+0x1e5/0x280
[245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
[245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
[245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
[245513.372812]  __handle_mm_fault+0x93d/0x1060
[245513.373439]  handle_mm_fault+0xc6/0x1b0
[245513.374042]  __do_page_fault+0x230/0x430
[245513.374679]  ? get_vtime_delta+0x13/0xb0
[245513.375411]  do_page_fault+0x2a/0x70
[245513.376145]  ? page_fault+0x65/0x80
[245513.376882]  page_fault+0x7b/0x80
[...]
[245513.382056] Mem-Info:
[245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
                 active_file:60183 inactive_file:245285 isolated_file:0
                 unevictable:15657 dirty:286 writeback:1 unstable:0
                 slab_reclaimable:75543 slab_unreclaimable:2509111
                 mapped:81814 shmem:31764 pagetables:370616 bounce:0
                 free:32294031 free_pcp:6233 free_cma:0
[245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for a MADV_HUGEPAGE vma.

Andrea has identified that the main source of the problem is
__GFP_THISNODE usage:

: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
: hard the local node, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition, is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory. That largely depends on the remote latency though, on
: threadrippers for example the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node. This is because it will try very hard to
: unsuccessfully swap out get_user_pages pinned pages as a result of the
: __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
: allocations and instead of trying to allocate THP on other nodes (it
: would be even worse without vfio type1 GUP pins of course, except it'd
: be swapping heavily instead).

Fix this by removing the __GFP_THISNODE handling from alloc_pages_vma,
where it doesn't belong, and moving it to alloc_hugepage_direct_gfpmask,
where we juggle gfp flags for different allocation modes. The rationale is that
__GFP_THISNODE is helpful in relaxed defrag modes because falling back
to a different node might be more harmful than the benefit of a large page.
If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
a higher priority than local NUMA placement.

Be careful when the vma has an explicit numa binding though, because
__GFP_THISNODE is not playing well with it. We want to follow the
explicit numa policy rather than enforce a node which happens to be
local to the cpu we are running on.
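
To illustrate the userspace side (a minimal sketch for illustration, not
part of the patch): MADV_HUGEPAGE is how a mapping opts into THP, while
an explicit mbind() is how a hard NUMA requirement is expressed; the
latter is what the gfp juggling above must not override:

	/* minimal sketch; build with -lnuma */
	#include <sys/mman.h>
	#include <numaif.h>

	int main(void)
	{
		size_t len = 2UL << 20;		/* one PMD-sized page */
		void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED)
			return 1;

		/* prefer THP for this vma (sets VM_HUGEPAGE) */
		madvise(buf, len, MADV_HUGEPAGE);

		/* explicit binding to node 0; THP allocations must
		 * honor this rather than a __GFP_THISNODE default */
		unsigned long nodemask = 1UL << 0;
		mbind(buf, len, MPOL_BIND, &nodemask,
		      sizeof(nodemask) * 8, 0);
		return 0;
	}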

[1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Stefan Priebe <s.priebe@profihost.ag>
Tested-by: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
this is a follow-up for [1]. Andrea has proposed two approaches to solve
the regression. This is an alternative implementation of the second
approach [2]. The reason for an alternative approach is that I strongly
believe that all the subtle THP gfp manipulation should be at a single
place (alloc_hugepage_direct_gfpmask) rather than spread in multiple
places with additional fixup. There is one notable difference to [2]
and that is the defrag=always behavior, where I am preserving the original
behavior. The reason for that is that defrag=always has always had a
tendency to stall and reclaim, and we have addressed that by defining a
new default defrag mode. We can discuss this behavior later but I
believe the default mode and a regression noticed by multiple users
should be closed regardless. Hence this patch.

[2] http://lkml.kernel.org/r/20180820032640.9896-2-aarcange@redhat.com

 include/linux/mempolicy.h |  2 ++
 mm/huge_memory.c          | 26 ++++++++++++++++++--------
 mm/mempolicy.c            | 28 +---------------------------
 3 files changed, 21 insertions(+), 35 deletions(-)

Comments

Stefan Priebe - Profihost AG Sept. 8, 2018, 6:52 p.m. UTC | #1
Hello,

while using this patch I got another stall - which I never saw under
kernel 4.4. Here is the trace:
[305111.932698] INFO: task ksmtuned:1399 blocked for more than 120 seconds.
[305111.933612]       Tainted: G                   4.12.0+105-ph #1
[305111.934456] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[305111.935323] ksmtuned        D    0  1399      1 0x00080000
[305111.936207] Call Trace:
[305111.937118]  ? __schedule+0x3bc/0x830
[305111.937991]  schedule+0x32/0x80
[305111.938837]  schedule_preempt_disabled+0xa/0x10
[305111.939687]  __mutex_lock.isra.4+0x287/0x4c0
[305111.940550]  ? run_store+0x47/0x2b0
[305111.941416]  run_store+0x47/0x2b0
[305111.942284]  ? __kmalloc+0x157/0x1d0
[305111.943138]  kernfs_fop_write+0x102/0x180
[305111.943988]  __vfs_write+0x26/0x140
[305111.944827]  ? __alloc_fd+0x44/0x170
[305111.945669]  ? set_close_on_exec+0x30/0x60
[305111.946519]  vfs_write+0xb1/0x1e0
[305111.947359]  SyS_write+0x42/0x90
[305111.948193]  do_syscall_64+0x74/0x150
[305111.949014]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[305111.949854] RIP: 0033:0x7fe7cde93730
[305111.950678] RSP: 002b:00007fffab0d5e88 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[305111.951525] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
00007fe7cde93730
[305111.952358] RDX: 0000000000000002 RSI: 00000000011b1c08 RDI:
0000000000000001
[305111.953170] RBP: 00000000011b1c08 R08: 00007fe7ce153760 R09:
00007fe7ce797b40
[305111.953979] R10: 0000000000000073 R11: 0000000000000246 R12:
0000000000000002
[305111.954790] R13: 0000000000000001 R14: 00007fe7ce152600 R15:
0000000000000002
[305146.987742] khugepaged: page allocation stalls for 224236ms,
order:9,
mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM),
nodemask=(null)
[305146.989652] khugepaged cpuset=/ mems_allowed=0-1
[305146.990582] CPU: 1 PID: 405 Comm: khugepaged Tainted: G
     4.12.0+105-ph #1 SLE15 (unreleased)
[305146.991536] Hardware name: Supermicro
X9DRW-3LN4F+/X9DRW-3TF+/X9DRW-3LN4F+/X9DRW-3TF+, BIOS 3.00 07/05/2013
[305146.992524] Call Trace:
[305146.993493]  dump_stack+0x5c/0x84
[305146.994499]  warn_alloc+0xe0/0x180
[305146.995469]  __alloc_pages_slowpath+0x820/0xc90
[305146.996456]  ? get_vtime_delta+0x13/0xb0
[305146.997424]  ? sched_clock+0x5/0x10
[305146.998394]  ? del_timer_sync+0x35/0x40
[305146.999370]  __alloc_pages_nodemask+0x1cc/0x210
[305147.000369]  khugepaged_alloc_page+0x39/0x70
[305147.001326]  khugepaged+0xc0c/0x20c0
[305147.002214]  ? remove_wait_queue+0x60/0x60
[305147.003226]  kthread+0xff/0x130
[305147.004219]  ? collapse_shmem+0xba0/0xba0
[305147.005131]  ? kthread_create_on_node+0x40/0x40
[305147.005971]  ret_from_fork+0x35/0x40
[305147.006835] Mem-Info:
[305147.007681] active_anon:51674768 inactive_anon:69112 isolated_anon:21
                 active_file:47818 inactive_file:51708 isolated_file:0
                 unevictable:15710 dirty:187 writeback:0 unstable:0
                 slab_reclaimable:62499 slab_unreclaimable:1284920
                 mapped:66765 shmem:47623 pagetables:185294 bounce:0
                 free:44265934 free_pcp:23646 free_cma:0
[305147.012664] Node 0 active_anon:116919912kB inactive_anon:238824kB
active_file:157296kB inactive_file:112820kB unevictable:58364kB
isolated(anon):80kB isolated(file):0kB mapped:221548kB dirty:548kB
writeback:0kB shmem:153196kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 14430208kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[305147.015437] Node 1 active_anon:89781496kB inactive_anon:37624kB
active_file:33976kB inactive_file:94012kB unevictable:4476kB
isolated(anon):4kB isolated(file):0kB mapped:45512kB dirty:200kB
writeback:0kB shmem:37296kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 9279488kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[305147.018550] Node 0 DMA free:15816kB min:12kB low:24kB high:36kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB writepending:0kB present:15988kB managed:15816kB
mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB
pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[305147.022085] lowmem_reserve[]: 0 1922 193381 193381 193381
[305147.023208] Node 0 DMA32 free:769020kB min:1964kB low:3944kB
high:5924kB active_anon:1204144kB inactive_anon:0kB active_file:4kB
inactive_file:0kB unevictable:0kB writepending:0kB present:2046368kB
managed:1980800kB mlocked:0kB slab_reclaimable:32kB
slab_unreclaimable:5376kB kernel_stack:0kB pagetables:1560kB bounce:0kB
free_pcp:0kB local_pcp:0kB free_cma:0kB
[305147.026768] lowmem_reserve[]: 0 0 191458 191458 191458
[305147.028044] Node 0 Normal free:71769584kB min:194564kB low:390620kB
high:586676kB active_anon:115715084kB inactive_anon:238824kB
active_file:157292kB inactive_file:112820kB unevictable:58364kB
writepending:548kB present:199229440kB managed:196058772kB
mlocked:58364kB slab_reclaimable:146536kB slab_unreclaimable:2697676kB
kernel_stack:12488kB pagetables:669756kB bounce:0kB free_pcp:42284kB
local_pcp:284kB free_cma:0kB
[305147.033356] lowmem_reserve[]: 0 0 0 0 0
[305147.034754] Node 1 Normal free:104504180kB min:196664kB low:394836kB
high:593008kB active_anon:89783256kB inactive_anon:37624kB
active_file:33976kB inactive_file:94012kB unevictable:4476kB
writepending:200kB present:201326592kB managed:198175320kB
mlocked:4476kB slab_reclaimable:103428kB slab_unreclaimable:2436628kB
kernel_stack:14232kB pagetables:69860kB bounce:0kB free_pcp:51764kB
local_pcp:936kB free_cma:0kB
[305147.040667] lowmem_reserve[]: 0 0 0 0 0
[305147.042180] Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB 1*64kB (U)
1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) =
15816kB
[305147.045054] Node 0 DMA32: 6363*4kB (UM) 3686*8kB (UM) 2106*16kB (UM)
1102*32kB (UME) 608*64kB (UME) 304*128kB (UME) 146*256kB (UME) 59*512kB
(UME) 32*1024kB (UME) 4*2048kB (UME) 112*4096kB (M) = 769020kB
[305147.048041] Node 0 Normal: 29891*4kB (UME) 367874*8kB (UME)
978139*16kB (UME) 481963*32kB (UME) 173612*64kB (UME) 65480*128kB (UME)
28019*256kB (UME) 10993*512kB (UME) 5217*1024kB (UM) 0*2048kB 0*4096kB =
71771692kB
[305147.051291] Node 1 Normal: 396333*4kB (UME) 257656*8kB (UME)
276637*16kB (UME) 190234*32kB (ME) 101344*64kB (ME) 39168*128kB (UME)
19207*256kB (UME) 8599*512kB (UME) 3065*1024kB (UM) 2*2048kB (UM)
16206*4096kB (M) = 104501892kB
[305147.054836] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB
[305147.056555] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[305147.058160] Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB
[305147.059817] Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[305147.061429] 149901 total pagecache pages
[305147.063124] 2 pages in swap cache
[305147.064908] Swap cache stats: add 7, delete 5, find 0/0
[305147.066676] Free swap  = 3905020kB
[305147.068268] Total swap = 3905532kB
[305147.069955] 100654597 pages RAM
[305147.071569] 0 pages HighMem/MovableOnly
[305147.073176] 1596920 pages reserved
[305147.074946] 0 pages hwpoisoned
[326258.236694] INFO: task ksmtuned:1399 blocked for more than 120 seconds.
[326258.237723]       Tainted: G                   4.12.0+105-ph #1
[326258.238718] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[326258.239679] ksmtuned        D    0  1399      1 0x00080000
[326258.240651] Call Trace:
[326258.241602]  ? __schedule+0x3bc/0x830
[326258.242557]  schedule+0x32/0x80
[326258.243462]  schedule_preempt_disabled+0xa/0x10
[326258.244336]  __mutex_lock.isra.4+0x287/0x4c0
[326258.245205]  ? run_store+0x47/0x2b0
[326258.246064]  run_store+0x47/0x2b0
[326258.246890]  ? __kmalloc+0x157/0x1d0
[326258.247716]  kernfs_fop_write+0x102/0x180
[326258.248514]  __vfs_write+0x26/0x140
[326258.249284]  ? __alloc_fd+0x44/0x170
[326258.250062]  ? set_close_on_exec+0x30/0x60
[326258.250812]  vfs_write+0xb1/0x1e0
[326258.251548]  SyS_write+0x42/0x90
[326258.252237]  do_syscall_64+0x74/0x150
[326258.252920]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[326258.253577] RIP: 0033:0x7fe7cde93730
[326258.254263] RSP: 002b:00007fffab0d5e88 EFLAGS: 00000246 ORIG_RAX:
0000000000000001
[326258.254916] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
00007fe7cde93730
[326258.255564] RDX: 0000000000000002 RSI: 00000000011b1c08 RDI:
0000000000000001
[326258.256187] RBP: 00000000011b1c08 R08: 00007fe7ce153760 R09:
00007fe7ce797b40
[326258.256801] R10: 0000000000000073 R11: 0000000000000246 R12:
0000000000000002
[326258.257406] R13: 0000000000000001 R14: 00007fe7ce152600 R15:
0000000000000002

Greets,
Stefan
On 07.09.2018 at 15:05, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Andrea has noticed [1] that a THP allocation might be really disruptive
> when allocated on a NUMA system with the local node full or hard to
> reclaim. Stefan has posted an allocation stall report on a 4.12-based
> SLES kernel which suggests the same issue:
> [245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
> [245513.363983] kvm cpuset=/ mems_allowed=0-1
> [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph #1 SLE15 (unreleased)
> [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
> [245513.365905] Call Trace:
> [245513.366535]  dump_stack+0x5c/0x84
> [245513.367148]  warn_alloc+0xe0/0x180
> [245513.367769]  __alloc_pages_slowpath+0x820/0xc90
> [245513.368406]  ? __slab_free+0xa9/0x2f0
> [245513.369048]  ? __slab_free+0xa9/0x2f0
> [245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
> [245513.370300]  alloc_pages_vma+0x1e5/0x280
> [245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
> [245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
> [245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
> [245513.372812]  __handle_mm_fault+0x93d/0x1060
> [245513.373439]  handle_mm_fault+0xc6/0x1b0
> [245513.374042]  __do_page_fault+0x230/0x430
> [245513.374679]  ? get_vtime_delta+0x13/0xb0
> [245513.375411]  do_page_fault+0x2a/0x70
> [245513.376145]  ? page_fault+0x65/0x80
> [245513.376882]  page_fault+0x7b/0x80
> [...]
> [245513.382056] Mem-Info:
> [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
>                  active_file:60183 inactive_file:245285 isolated_file:0
>                  unevictable:15657 dirty:286 writeback:1 unstable:0
>                  slab_reclaimable:75543 slab_unreclaimable:2509111
>                  mapped:81814 shmem:31764 pagetables:370616 bounce:0
>                  free:32294031 free_pcp:6233 free_cma:0
> [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> 
> The defrag mode is "madvise" and from the above report it is clear that
> the THP has been allocated for a MADV_HUGEPAGE vma.
> 
> Andrea has identified that the main source of the problem is
> __GFP_THISNODE usage:
> 
> : The problem is that direct compaction combined with the NUMA
> : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
> : hard the local node, instead of failing the allocation if there's no
> : THP available in the local node.
> :
> : Such logic was ok until __GFP_THISNODE was added to the THP allocation
> : path even with MPOL_DEFAULT.
> :
> : The idea behind the __GFP_THISNODE addition, is that it is better to
> : provide local memory in PAGE_SIZE units than to use remote NUMA THP
> : backed memory. That largely depends on the remote latency though, on
> : threadrippers for example the overhead is relatively low in my
> : experience.
> :
> : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
> : extremely slow qemu startup with vfio, if the VM is larger than the
> : size of one host NUMA node. This is because it will try very hard to
> : unsuccessfully swap out get_user_pages pinned pages as a result of the
> : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
> : allocations and instead of trying to allocate THP on other nodes (it
> : would be even worse without vfio type1 GUP pins of course, except it'd
> : be swapping heavily instead).
> 
> Fix this by removing the __GFP_THISNODE handling from alloc_pages_vma,
> where it doesn't belong, and moving it to alloc_hugepage_direct_gfpmask,
> where we juggle gfp flags for different allocation modes. The rationale is that
> __GFP_THISNODE is helpful in relaxed defrag modes because falling back
> to a different node might be more harmful than the benefit of a large page.
> If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
> a higher priority than local NUMA placement.
> 
> Be careful when the vma has an explicit numa binding though, because
> __GFP_THISNODE is not playing well with it. We want to follow the
> explicit numa policy rather than enforce a node which happens to be
> local to the cpu we are running on.
> 
> [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com
> 
> Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
> Reported-by: Stefan Priebe <s.priebe@profihost.ag>
> Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
> Tested-by: Stefan Priebe <s.priebe@profihost.ag>
> Tested-by: Zi Yan <zi.yan@cs.rutgers.edu>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> 
> Hi,
> this is a follow-up for [1]. Andrea has proposed two approaches to solve
> the regression. This is an alternative implementation of the second
> approach [2]. The reason for an alternative approach is that I strongly
> believe that all the subtle THP gfp manipulation should be at a single
> place (alloc_hugepage_direct_gfpmask) rather than spread in multiple
> places with additional fixup. There is one notable difference to [2]
> and that is the defrag=always behavior, where I am preserving the original
> behavior. The reason for that is that defrag=always has always had a
> tendency to stall and reclaim, and we have addressed that by defining a
> new default defrag mode. We can discuss this behavior later but I
> believe the default mode and a regression noticed by multiple users
> should be closed regardless. Hence this patch.
> 
> [2] http://lkml.kernel.org/r/20180820032640.9896-2-aarcange@redhat.com
> 
>  include/linux/mempolicy.h |  2 ++
>  mm/huge_memory.c          | 26 ++++++++++++++++++--------
>  mm/mempolicy.c            | 28 +---------------------------
>  3 files changed, 21 insertions(+), 35 deletions(-)
> 
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 5228c62af416..bac395f1d00a 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
>  struct mempolicy *get_task_policy(struct task_struct *p);
>  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>  		unsigned long addr);
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +						unsigned long addr);
>  bool vma_policy_mof(struct vm_area_struct *vma);
>  
>  extern void numa_default_policy(void);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c3bc7e9c9a2a..56c9aac4dc86 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -629,21 +629,31 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>   *	    available
>   * never: never stall for any thp allocation
>   */
> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
>  {
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> +	gfp_t this_node = 0;
> +	struct mempolicy *pol;
> +
> +#ifdef CONFIG_NUMA
> +	/* __GFP_THISNODE makes sense only if there is no explicit binding */
> +	pol = get_vma_policy(vma, addr);
> +	if (pol->mode != MPOL_BIND)
> +		this_node = __GFP_THISNODE;
> +	mpol_cond_put(pol);
> +#endif
>  
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     __GFP_KSWAPD_RECLAIM);
> +							     __GFP_KSWAPD_RECLAIM | this_node);
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
> -							     0);
> -	return GFP_TRANSHUGE_LIGHT;
> +							     this_node);
> +	return GFP_TRANSHUGE_LIGHT | this_node;
>  }
>  
>  /* Caller must hold page table lock. */
> @@ -715,7 +725,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  			pte_free(vma->vm_mm, pgtable);
>  		return ret;
>  	}
> -	gfp = alloc_hugepage_direct_gfpmask(vma);
> +	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	if (unlikely(!page)) {
>  		count_vm_event(THP_FAULT_FALLBACK);
> @@ -1290,7 +1300,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  alloc:
>  	if (transparent_hugepage_enabled(vma) &&
>  	    !transparent_hugepage_debug_cow()) {
> -		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
> +		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
>  		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
>  	} else
>  		new_page = NULL;
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index da858f794eb6..75bbfc3d6233 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>   * freeing by another task.  It is the caller's responsibility to free the
>   * extra reference for shared policies.
>   */
> -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> +struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  						unsigned long addr)
>  {
>  	struct mempolicy *pol = __get_vma_policy(vma, addr);
> @@ -2026,32 +2026,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		goto out;
>  	}
>  
> -	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
> -		int hpage_node = node;
> -
> -		/*
> -		 * For hugepage allocation and non-interleave policy which
> -		 * allows the current node (or other explicitly preferred
> -		 * node) we only try to allocate from the current/preferred
> -		 * node and don't fall back to other nodes, as the cost of
> -		 * remote accesses would likely offset THP benefits.
> -		 *
> -		 * If the policy is interleave, or does not allow the current
> -		 * node in its nodemask, we allocate the standard way.
> -		 */
> -		if (pol->mode == MPOL_PREFERRED &&
> -						!(pol->flags & MPOL_F_LOCAL))
> -			hpage_node = pol->v.preferred_node;
> -
> -		nmask = policy_nodemask(gfp, pol);
> -		if (!nmask || node_isset(hpage_node, *nmask)) {
> -			mpol_cond_put(pol);
> -			page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_THISNODE, order);
> -			goto out;
> -		}
> -	}
> -
>  	nmask = policy_nodemask(gfp, pol);
>  	preferred_nid = policy_node(gfp, pol, node);
>  	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
>
Michal Hocko Sept. 10, 2018, 7:39 a.m. UTC | #2
[Cc Vlastimil. The full report is http://lkml.kernel.org/r/f7ed71c1-d599-5257-fd8f-041eb24d9f29@profihost.ag]
On Sat 08-09-18 20:52:35, Stefan Priebe - Profihost AG wrote:
> [305146.987742] khugepaged: page allocation stalls for 224236ms, order:9,
> mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)

This is certainly not a result of this patch AFAICS. khugepaged does add
__GFP_THISNODE regardless of what alloc_hugepage_khugepaged_gfpmask
thinks about that flag. Something to look into as well I guess.
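
For reference, the khugepaged path in question looked roughly like this
in kernels of that era (a sketch from memory of the source, not part of
this patch):

	/* mm/khugepaged.c: the defrag-mode aware helper ... */
	static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
	{
		return khugepaged_defrag() ? GFP_TRANSHUGE :
					     GFP_TRANSHUGE_LIGHT;
	}

	/* ... but collapse_huge_page() ORs in __GFP_THISNODE
	 * unconditionally, whatever the helper decided: */
	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;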

Anyway, I guess we want to look closer at what compaction is doing here
because such a long stall is really not acceptable at all. Maybe this is
something 4.12 kernel related. This is hard to tell. Unfortunately,
upstream has lost the stall warning so you wouldn't know this is the
case with newer kernels.

Anyway, running with compaction tracepoints might tell us more.
Vlastimil will surely help to tell you which of them to enable.
David Rientjes Sept. 10, 2018, 8:08 p.m. UTC | #3
On Fri, 7 Sep 2018, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> Andrea has noticed [1] that a THP allocation might be really disruptive
> when allocated on a NUMA system with the local node full or hard to
> reclaim. Stefan has posted an allocation stall report on a 4.12-based
> SLES kernel which suggests the same issue:
> [245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
> [245513.363983] kvm cpuset=/ mems_allowed=0-1
> [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph #1 SLE15 (unreleased)
> [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
> [245513.365905] Call Trace:
> [245513.366535]  dump_stack+0x5c/0x84
> [245513.367148]  warn_alloc+0xe0/0x180
> [245513.367769]  __alloc_pages_slowpath+0x820/0xc90
> [245513.368406]  ? __slab_free+0xa9/0x2f0
> [245513.369048]  ? __slab_free+0xa9/0x2f0
> [245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
> [245513.370300]  alloc_pages_vma+0x1e5/0x280
> [245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
> [245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
> [245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
> [245513.372812]  __handle_mm_fault+0x93d/0x1060
> [245513.373439]  handle_mm_fault+0xc6/0x1b0
> [245513.374042]  __do_page_fault+0x230/0x430
> [245513.374679]  ? get_vtime_delta+0x13/0xb0
> [245513.375411]  do_page_fault+0x2a/0x70
> [245513.376145]  ? page_fault+0x65/0x80
> [245513.376882]  page_fault+0x7b/0x80

Since we don't have __GFP_REPEAT, this suggests that 
__alloc_pages_direct_compact() took >100s to complete.  The memory 
capacity of the system isn't shown, but I assume it's around 768GB?  This 
should be with COMPACT_PRIO_ASYNC, and MIGRATE_ASYNC compaction certainly 
should abort much earlier.

> [...]
> [245513.382056] Mem-Info:
> [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
>                  active_file:60183 inactive_file:245285 isolated_file:0
>                  unevictable:15657 dirty:286 writeback:1 unstable:0
>                  slab_reclaimable:75543 slab_unreclaimable:2509111
>                  mapped:81814 shmem:31764 pagetables:370616 bounce:0
>                  free:32294031 free_pcp:6233 free_cma:0
> [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> 
> The defrag mode is "madvise" and from the above report it is clear that
> the THP has been allocated for a MADV_HUGEPAGE vma.
> 
> Andrea has identified that the main source of the problem is
> __GFP_THISNODE usage:
> 
> : The problem is that direct compaction combined with the NUMA
> : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
> : hard the local node, instead of failing the allocation if there's no
> : THP available in the local node.
> :
> : Such logic was ok until __GFP_THISNODE was added to the THP allocation
> : path even with MPOL_DEFAULT.
> :
> : The idea behind the __GFP_THISNODE addition, is that it is better to
> : provide local memory in PAGE_SIZE units than to use remote NUMA THP
> : backed memory. That largely depends on the remote latency though, on
> : threadrippers for example the overhead is relatively low in my
> : experience.
> :
> : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
> : extremely slow qemu startup with vfio, if the VM is larger than the
> : size of one host NUMA node. This is because it will try very hard to
> : unsuccessfully swap out get_user_pages pinned pages as a result of the
> : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
> : allocations and instead of trying to allocate THP on other nodes (it
> : would be even worse without vfio type1 GUP pins of course, except it'd
> : be swapping heavily instead).
> 
> Fix this by removing the __GFP_THISNODE handling from alloc_pages_vma,
> where it doesn't belong, and moving it to alloc_hugepage_direct_gfpmask,
> where we juggle gfp flags for different allocation modes. The rationale is that
> __GFP_THISNODE is helpful in relaxed defrag modes because falling back
> to a different node might be more harmful than the benefit of a large page.
> If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
> a higher priority than local NUMA placement.
> 

That's not entirely true, the remote access latency for remote thp on all 
of our platforms is greater than local small pages, this is especially 
true for remote thp that is allocated intersocket and must be accessed 
through the interconnect.

Our users of MADV_HUGEPAGE are ok with assuming the burden of increased 
allocation latency, but certainly not remote access latency.  There are 
users who remap their text segment onto transparent hugepages and are fine
with startup delay if they access all of their text from local thp.
Remote thp would be a significant performance degradation.

When Andrea brought this up, I suggested that the full solution would be a 
MPOL_F_HUGEPAGE flag that could define thp allocation policy -- the added 
benefit is that we could replace the thp "defrag" mode default by setting 
this as part of default_policy.  Right now, MADV_HUGEPAGE users are 
concerned about (1) getting thp when system-wide it is not default and (2) 
additional fault latency when direct compaction is not default.  They are 
not anticipating the degradation of remote access latency, so overloading 
the meaning of the mode is probably not a good idea.
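
To make that concrete, a purely hypothetical sketch of such a flag
(MPOL_F_HUGEPAGE does not exist; the name and semantics here are
illustrative only):

	/* hypothetical: a THP-specific policy layered over the regular
	 * one, so hugepage faults are constrained to the local node
	 * while native-page faults may still fall back remotely */
	unsigned long local_node = 1UL << 0;
	mbind(text_seg, text_len, MPOL_BIND | MPOL_F_HUGEPAGE,
	      &local_node, sizeof(local_node) * 8, 0);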
Stefan Priebe - Profihost AG Sept. 10, 2018, 8:22 p.m. UTC | #4
On 10.09.2018 at 22:08, David Rientjes wrote:
> On Fri, 7 Sep 2018, Michal Hocko wrote:
> 
>> From: Michal Hocko <mhocko@suse.com>
>>
>> Andrea has noticed [1] that a THP allocation might be really disruptive
>> when allocated on a NUMA system with the local node full or hard to
>> reclaim. Stefan has posted an allocation stall report on a 4.12-based
>> SLES kernel which suggests the same issue:
>> [245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
>> [245513.363983] kvm cpuset=/ mems_allowed=0-1
>> [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph #1 SLE15 (unreleased)
>> [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
>> [245513.365905] Call Trace:
>> [245513.366535]  dump_stack+0x5c/0x84
>> [245513.367148]  warn_alloc+0xe0/0x180
>> [245513.367769]  __alloc_pages_slowpath+0x820/0xc90
>> [245513.368406]  ? __slab_free+0xa9/0x2f0
>> [245513.369048]  ? __slab_free+0xa9/0x2f0
>> [245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
>> [245513.370300]  alloc_pages_vma+0x1e5/0x280
>> [245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
>> [245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
>> [245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
>> [245513.372812]  __handle_mm_fault+0x93d/0x1060
>> [245513.373439]  handle_mm_fault+0xc6/0x1b0
>> [245513.374042]  __do_page_fault+0x230/0x430
>> [245513.374679]  ? get_vtime_delta+0x13/0xb0
>> [245513.375411]  do_page_fault+0x2a/0x70
>> [245513.376145]  ? page_fault+0x65/0x80
>> [245513.376882]  page_fault+0x7b/0x80
> 
> Since we don't have __GFP_REPEAT, this suggests that 
> __alloc_pages_direct_compact() took >100s to complete.  The memory 
> capacity of the system isn't shown, but I assume it's around 768GB?  This 
> should be with COMPACT_PRIO_ASYNC, and MIGRATE_ASYNC compaction certainly 
> should abort much earlier.

Yes it's 768GB.

Greets,
Stefan

>> [245513.382056] Mem-Info:
>> [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
>>                  active_file:60183 inactive_file:245285 isolated_file:0
>>                  unevictable:15657 dirty:286 writeback:1 unstable:0
>>                  slab_reclaimable:75543 slab_unreclaimable:2509111
>>                  mapped:81814 shmem:31764 pagetables:370616 bounce:0
>>                  free:32294031 free_pcp:6233 free_cma:0
>> [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
>> [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
>>
>> The defrag mode is "madvise" and from the above report it is clear that
>> the THP has been allocated for a MADV_HUGEPAGE vma.
>>
>> Andrea has identified that the main source of the problem is
>> __GFP_THISNODE usage:
>>
>> : The problem is that direct compaction combined with the NUMA
>> : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
>> : hard the local node, instead of failing the allocation if there's no
>> : THP available in the local node.
>> :
>> : Such logic was ok until __GFP_THISNODE was added to the THP allocation
>> : path even with MPOL_DEFAULT.
>> :
>> : The idea behind the __GFP_THISNODE addition, is that it is better to
>> : provide local memory in PAGE_SIZE units than to use remote NUMA THP
>> : backed memory. That largely depends on the remote latency though, on
>> : threadrippers for example the overhead is relatively low in my
>> : experience.
>> :
>> : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
>> : extremely slow qemu startup with vfio, if the VM is larger than the
>> : size of one host NUMA node. This is because it will try very hard to
>> : unsuccessfully swap out get_user_pages pinned pages as a result of the
>> : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
>> : allocations and instead of trying to allocate THP on other nodes (it
>> : would be even worse without vfio type1 GUP pins of course, except it'd
>> : be swapping heavily instead).
>>
>> Fix this by removing the __GFP_THISNODE handling from alloc_pages_vma,
>> where it doesn't belong, and moving it to alloc_hugepage_direct_gfpmask,
>> where we juggle gfp flags for different allocation modes. The rationale is that
>> __GFP_THISNODE is helpful in relaxed defrag modes because falling back
>> to a different node might be more harmful than the benefit of a large page.
>> If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
>> a higher priority than local NUMA placement.
>>
> 
> That's not entirely true, the remote access latency for remote thp on all 
> of our platforms is greater than local small pages, this is especially 
> true for remote thp that is allocated intersocket and must be accessed 
> through the interconnect.
> 
> Our users of MADV_HUGEPAGE are ok with assuming the burden of increased 
> allocation latency, but certainly not remote access latency.  There are 
> users who remap their text segment onto transparent hugepages and are fine
> with startup delay if they access all of their text from local thp.
> Remote thp would be a significant performance degradation.
> 
> When Andrea brought this up, I suggested that the full solution would be a 
> MPOL_F_HUGEPAGE flag that could define thp allocation policy -- the added 
> benefit is that we could replace the thp "defrag" mode default by setting 
> this as part of default_policy.  Right now, MADV_HUGEPAGE users are 
> concerned about (1) getting thp when system-wide it is not default and (2) 
> additional fault latency when direct compaction is not default.  They are 
> not anticipating the degradation of remote access latency, so overloading 
> the meaning of the mode is probably not a good idea.
>
Vlastimil Babka Sept. 11, 2018, 8:51 a.m. UTC | #5
On 09/10/2018 10:08 PM, David Rientjes wrote:
> When Andrea brought this up, I suggested that the full solution would be a 
> MPOL_F_HUGEPAGE flag that could define thp allocation policy -- the added 

Can you elaborate on the semantics of this? You mean that a given vma
could now have two mempolicies, where one would be for hugepages only?
That's likely much easier to suggest than to implement, with all the uapi
consequences...

> benefit is that we could replace the thp "defrag" mode default by setting 
> this as part of default_policy.  Right now, MADV_HUGEPAGE users are 
> concerned about (1) getting thp when system-wide it is not default and (2) 
> additional fault latency when direct compaction is not default.  They are 
> not anticipating the degradation of remote access latency, so overloading 
> the meaning of the mode is probably not a good idea.
>
Vlastimil Babka Sept. 11, 2018, 9:03 a.m. UTC | #6
On 09/08/2018 08:52 PM, Stefan Priebe - Profihost AG wrote:
> Hello,
> 
> while using this patch I got another stall - which I never saw under
> kernel 4.4. Here is the trace:
> [305111.932698] INFO: task ksmtuned:1399 blocked for more than 120 seconds.
> [305111.933612]       Tainted: G                   4.12.0+105-ph #1
> [305111.934456] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [305111.935323] ksmtuned        D    0  1399      1 0x00080000
> [305111.936207] Call Trace:
> [305111.937118]  ? __schedule+0x3bc/0x830
> [305111.937991]  schedule+0x32/0x80
> [305111.938837]  schedule_preempt_disabled+0xa/0x10
> [305111.939687]  __mutex_lock.isra.4+0x287/0x4c0

Hmm so is this waiting on mutex_lock(&ksm_thread_mutex)? Who's holding it?

Looking at your original report [1] I see more tasks waiting on a semaphore:

[245654.463746] INFO: task pvestatd:3175 blocked for more than 120 seconds.
[245654.464730] Tainted: G W 4.12.0+98-ph #1
[245654.465666] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[245654.466591] pvestatd D 0 3175 1 0x00080000
[245654.467495] Call Trace:
[245654.468413] ? __schedule+0x3bc/0x830
[245654.469283] schedule+0x32/0x80
[245654.470108] rwsem_down_read_failed+0x121/0x170
[245654.470918] ? call_rwsem_down_read_failed+0x14/0x30
[245654.471729] ? alloc_pages_current+0x91/0x140
[245654.472530] call_rwsem_down_read_failed+0x14/0x30
[245654.473316] down_read+0x13/0x30
[245654.474064] proc_pid_cmdline_read+0xae/0x3f0

- probably down_read(&mm->mmap_sem);

[245654.485537] INFO: task service:86643 blocked for more than 120 seconds.
[245654.486015] Tainted: G W 4.12.0+98-ph #1
[245654.486502] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[245654.486960] service D 0 86643 1 0x00080000
[245654.487423] Call Trace:
[245654.487865] ? __schedule+0x3bc/0x830
[245654.488286] schedule+0x32/0x80
[245654.488763] rwsem_down_read_failed+0x121/0x170
[245654.489200] ? __handle_mm_fault+0xd67/0x1060
[245654.489668] ? call_rwsem_down_read_failed+0x14/0x30
[245654.490088] call_rwsem_down_read_failed+0x14/0x30
[245654.490538] down_read+0x13/0x30
[245654.490964] __do_page_fault+0x32b/0x430
[245654.491389] ? get_vtime_delta+0x13/0xb0
[245654.491835] do_page_fault+0x2a/0x70
[245654.492253] ? page_fault+0x65/0x80
[245654.492703] page_fault+0x7b/0x80

- page fault, so another mmap_sem for read

[245654.496050] INFO: task safe_timer:86651 blocked for more than 120
seconds.
[245654.496466] Tainted: G W 4.12.0+98-ph #1
[245654.496916] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[245654.497336] safe_timer D 0 86651 1 0x00080000
[245654.497790] Call Trace:
[245654.498211] ? __schedule+0x3bc/0x830
[245654.498646] schedule+0x32/0x80
[245654.499074] rwsem_down_read_failed+0x121/0x170
[245654.499501] ? call_rwsem_down_read_failed+0x14/0x30
[245654.499959] call_rwsem_down_read_failed+0x14/0x30
[245654.500381] down_read+0x13/0x30
[245654.500851] __do_page_fault+0x32b/0x430
[245654.501285] ? get_vtime_delta+0x13/0xb0
[245654.501722] do_page_fault+0x2a/0x70
[245654.502163] ? page_fault+0x65/0x80
[245654.502600] page_fault+0x7b/0x80

- ditto

But those are different tasks than the one that was stalling allocation
while holding mmap_sem for write.

So I'm not sure what's going on, but maybe the reclaim is also stalled
due to waiting on some locks, and is thus a victim of something else?

[1] https://lists.opensuse.org/opensuse-kernel/2018-08/msg00012.html
Michal Hocko Sept. 11, 2018, 11:56 a.m. UTC | #7
On Mon 10-09-18 13:08:34, David Rientjes wrote:
> On Fri, 7 Sep 2018, Michal Hocko wrote:
[...]
> > Fix this by removing the __GFP_THISNODE handling from alloc_pages_vma,
> > where it doesn't belong, and moving it to alloc_hugepage_direct_gfpmask,
> > where we juggle gfp flags for different allocation modes. The rationale is that
> > __GFP_THISNODE is helpful in relaxed defrag modes because falling back
> > to a different node might be more harmful than the benefit of a large page.
> > If the user really requires THP (e.g. by MADV_HUGEPAGE) then the THP has
> > a higher priority than local NUMA placement.
> > 
> 
> That's not entirely true, the remote access latency for remote thp on all 
> of our platforms is greater than local small pages, this is especially 
> true for remote thp that is allocated intersocket and must be accessed 
> through the interconnect.
> 
> Our users of MADV_HUGEPAGE are ok with assuming the burden of increased 
> allocation latency, but certainly not remote access latency.  There are 
> users who remap their text segment onto transparent hugepages and are fine
> with startup delay if they access all of their text from local thp.
> Remote thp would be a significant performance degradation.

Well, it seems that expectations differ for users. It seems that kvm
users do not really agree with your interpretation.

> When Andrea brought this up, I suggested that the full solution would be a 
> MPOL_F_HUGEPAGE flag that could define thp allocation policy -- the added 
> benefit is that we could replace the thp "defrag" mode default by setting 
> this as part of default_policy.  Right now, MADV_HUGEPAGE users are 
> concerned about (1) getting thp when system-wide it is not default and (2) 
> additional fault latency when direct compaction is not default.  They are 
> not anticipating the degradation of remote access latency, so overloading 
> the meaning of the mode is probably not a good idea.

hugepage specific MPOL flags sounds like yet another step into even more
cluttered API and semantic, I am afraid. Why should this be any
different from regular page allocations? You are getting off-node memory
once your local node is full. You have to use an explicit binding to
disallow that. THP should be similar in that regards. Once you have said
that you _really_ want THP then you are closer to what we do for regular
pages IMHO.

I do realize that this is a gray zone because nobody bothered to define
the semantics since MADV_HUGEPAGE was introduced (a826e422420b4
is exceptionally short of information). So we are left with more or less
undefined behavior and have to define it properly now. As we can see this might
regress in some workloads but I strongly suspect that an explicit
binding sounds like a more logical approach than a thp-specific mpol mode. If
anything this should be a more generic memory policy basically saying
that a zone/node reclaim mode should be enabled for the particular
allocation.
David Rientjes Sept. 11, 2018, 8:30 p.m. UTC | #8
On Tue, 11 Sep 2018, Michal Hocko wrote:

> > That's not entirely true, the remote access latency for remote thp on all 
> > of our platforms is greater than local small pages, this is especially 
> > true for remote thp that is allocated intersocket and must be accessed 
> > through the interconnect.
> > 
> > Our users of MADV_HUGEPAGE are ok with assuming the burden of increased 
> > allocation latency, but certainly not remote access latency.  There are 
> > users who remap their text segment onto transparent hugepages and are fine
> > with startup delay if they access all of their text from local thp.
> > Remote thp would be a significant performance degradation.
> 
> Well, it seems that expectations differ for users. It seems that kvm
> users do not really agree with your interpretation.
> 

If kvm is happy to allocate hugepages remotely, at least on a subset of 
platforms where it doesn't incur such a high remote access latency, then 
we probably shouldn't be lumping that together with the current
semantics of MADV_HUGEPAGE.  Otherwise, we risk it becoming a dumping 
ground where current users may regress because they would be much more 
willing to fault local pages of the native page size and lose the ability 
to require that absent using mbind() -- and in that case they would be 
affected by the policy decision of native page sizes as well.

> > When Andrea brought this up, I suggested that the full solution would be a 
> > MPOL_F_HUGEPAGE flag that could define thp allocation policy -- the added 
> > benefit is that we could replace the thp "defrag" mode default by setting 
> > this as part of default_policy.  Right now, MADV_HUGEPAGE users are 
> > concerned about (1) getting thp when system-wide it is not default and (2) 
> > additional fault latency when direct compaction is not default.  They are 
> > not anticipating the degradation of remote access latency, so overloading 
> > the meaning of the mode is probably not a good idea.
> 
> hugepage specific MPOL flags sounds like yet another step into even more
> cluttered API and semantic, I am afraid. Why should this be any
> different from regular page allocations? You are getting off-node memory
> once your local node is full. You have to use an explicit binding to
> disallow that. THP should be similar in that regards. Once you have said
> that you _really_ want THP then you are closer to what we do for regular
> pages IMHO.
> 

Saying that we really want THP isn't an all-or-nothing decision.  We 
certainly want to try hard to fault hugepages locally especially at task 
startup when remapping our .text segment to thp, and MADV_HUGEPAGE works 
very well for that.  Remote hugepages would be a regression that we now 
have no way to avoid because the kernel doesn't provide for it, if we were 
to remove __GFP_THISNODE that this patch introduces.

On Broadwell, for example, we find 7% slower access to remote hugepages 
than local native pages.  On Naples, that becomes worse: 14% slower access 
latency for intrasocket hugepages compared to local native pages and 39% 
slower for intersocket.

> I do realize that this is a gray zone because nobody bothered to define
> the semantics since MADV_HUGEPAGE was introduced (a826e422420b4
> is exceptionally short of information). So we are left with more or less
> undefined behavior and have to define it properly now. As we can see this might
> regress in some workloads but I strongly suspect that an explicit
> binding sounds like a more logical approach than a thp-specific mpol mode. If
> anything this should be a more generic memory policy basically saying
> that a zone/node reclaim mode should be enabled for the particular
> allocation.

This would be quite a serious regression with no way to actually define 
that we want local hugepages but allow fallback to remote native pages if 
we cannot allocate local native pages.  So rather than causing userspace 
to regress and giving them no alternative, I would suggest either hugepage
specific mempolicies or another madvise mode to allow remotely allocated 
hugepages.
Michal Hocko Sept. 12, 2018, 12:05 p.m. UTC | #9
On Tue 11-09-18 13:30:20, David Rientjes wrote:
> On Tue, 11 Sep 2018, Michal Hocko wrote:
[...]
> > hugepage specific MPOL flags sounds like yet another step into even more
> > cluttered API and semantic, I am afraid. Why should this be any
> > different from regular page allocations? You are getting off-node memory
> > once your local node is full. You have to use an explicit binding to
> > disallow that. THP should be similar in that regards. Once you have said
> > that you _really_ want THP then you are closer to what we do for regular
> > pages IMHO.
> > 
> 
> Saying that we really want THP isn't an all-or-nothing decision.  We 
> certainly want to try hard to fault hugepages locally especially at task 
> startup when remapping our .text segment to thp, and MADV_HUGEPAGE works 
> very well for that.  Remote hugepages would be a regression that we now 
> have no way to avoid because the kernel doesn't provide for it, if we were 
> to remove __GFP_THISNODE that this patch introduces.

Why cannot you use mempolicy to bind to local nodes if you really care
about the locality?

> On Broadwell, for example, we find 7% slower access to remote hugepages 
> than local native pages.  On Naples, that becomes worse: 14% slower access 
> latency for intrasocket hugepages compared to local native pages and 39% 
> slower for intersocket.

So, again, how does this compare to regular 4k pages? You are going to
pay for the same remote access as well.

From what you have said so far it sounds like you would like to have
something like the zone/node reclaim mode fine grained for a specific
mapping. If we really want to support something like that then it should
be a generic policy rather than THP specific thing IMHO.

As I've said, it is hard to come up with a solution that would satisfy
everybody, but considering that the existing reports are seeing this as a
regression, and considering their NUMA requirements are not as strict as
yours, I would tend to think that stronger NUMA requirements should be
expressed explicitly rather than as an implicit effect of a madvise flag. We
do have APIs for that.
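
(For example, a sketch of those existing APIs, assuming libnuma is
available to discover the local node:)

	/* pin all of this task's future allocations, THP included, to
	 * the node it is currently running on; build with -lnuma */
	#define _GNU_SOURCE
	#include <numaif.h>
	#include <numa.h>
	#include <sched.h>

	int bind_to_local_node(void)
	{
		int node = numa_node_of_cpu(sched_getcpu());
		unsigned long nodemask;

		if (node < 0)
			return -1;
		nodemask = 1UL << node;
		return set_mempolicy(MPOL_BIND, &nodemask,
				     sizeof(nodemask) * 8);
	}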
Andrea Arcangeli Sept. 12, 2018, 1:54 p.m. UTC | #10
Hello,

On Tue, Sep 11, 2018 at 01:56:13PM +0200, Michal Hocko wrote:
> Well, it seems that expectations differ for users. It seems that kvm
> users do not really agree with your interpretation.

Like David also mentioned here:

lkml.kernel.org/r/alpine.DEB.2.21.1808211021110.258924@chino.kir.corp.google.com

it depends on the hardware what is a win, so there's no one-size-fits-all.

For two sockets providing remote THP to KVM is likely a win, but
changing the defaults depending on boot-time NUMA topology makes
things less deterministic and it's also impossible to define an exact
break even point.

> I do realize that this is a gray zone because nobody bothered to define
> the semantics since MADV_HUGEPAGE was introduced (a826e422420b4
> is exceptionally short of information). So we are left with more or less
> undefined behavior and have to define it properly now. As we can see this might
> regress in some workloads but I strongly suspect that an explicit
> binding sounds like a more logical approach than a thp-specific mpol mode. If
> anything this should be a more generic memory policy basically saying
> that a zone/node reclaim mode should be enabled for the particular
> allocation.

MADV_HUGEPAGE means the allocation is long lived, so the cost of
compaction is worth it in direct reclaim. Not much else. That is not
the problem.

The problem is that even if you ignore the breakage and regression to
real life workloads, what is happening right now obviously would
require root privilege, but MADV_HUGEPAGE requires no root privilege.

Swapping heavily because of MADV_HUGEPAGE, when there are gigabytes free
on other nodes and not even 4k pages would be swapped out with THP turned
off in sysfs, is simply not what MADV_HUGEPAGE could have been about, and
it's a kernel regression that never existed until the commit that added
__GFP_THISNODE to the default THP heuristic in mempolicy.c.

I think we should defer for later the problem of which is the better
default, 4k NUMA-local or remote THP. I provided two options myself
because it didn't matter so much which option we picked in the short
term, as long as the bug was fixed.

I wasn't particularly happy with your patch because it still swaps with
certain defrag settings, which still allows things that shouldn't happen
without some kind of privileged capability.

If you can update the patch to prevent swapping in all cases it's a go
as far as I'm concerned. The main difference is that you're dropping
the THP logic in the mempolicy code, which will make things worse for
some cases; I was trying to retain it for all cases where swapping
wouldn't happen anyway, so that the logic would still provide the
behavior David prefers in those cases.

Adding a new feature to create a THP-specific mempolicy can be done
later. In the meantime the current mempolicy code can always override
whatever THP default behavior comes out of this; it will just require
the admin to set up a mempolicy to enforce the preferred behavior for
4k and THP allocations alike.
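
For instance (a sketch, not from this thread; the node mask is a
placeholder), a launcher could bind the whole process before exec'ing
the workload:

#include <numaif.h>	/* set_mempolicy(2), MPOL_BIND; link with -lnuma */

/* Sketch: bind every future allocation of this process to node 0,
 * constraining 4k and THP placement alike, whatever the THP default
 * heuristic ends up being. */
static int bind_process_to_node0(void)
{
	unsigned long nodemask = 1UL << 0;	/* hypothetical node 0 */

	return set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8);
}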

Thanks,
Andrea
Michal Hocko Sept. 12, 2018, 2:21 p.m. UTC | #11
On Wed 12-09-18 09:54:17, Andrea Arcangeli wrote:
> Hello,
> 
> On Tue, Sep 11, 2018 at 01:56:13PM +0200, Michal Hocko wrote:
> > Well, it seems that expectations differ between users. kvm users, for
> > instance, do not really agree with your interpretation.
> 
> Like David also mentioned here:
> 
> lkml.kernel.org/r/alpine.DEB.2.21.1808211021110.258924@chino.kir.corp.google.com
> 
> it depends on the hardware what is a win, so there's no one-size-fits-all
> answer.
> 
> For two sockets, providing remote THP to KVM is likely a win, but
> changing the defaults depending on boot-time NUMA topology makes
> things less deterministic, and it's also impossible to define an exact
> break-even point.
> 
> > I do realize that this is a gray zone because nobody bothered to define
> > the semantics since MADV_HUGEPAGE was introduced (a826e422420b4 is
> > exceptionally short on information). So we are left with more or less
> > undefined behavior and have to define it properly now. As we can see this
> > might regress some workloads, but I strongly suspect that an explicit
> > binding is a more logical approach than a THP-specific mpol mode. If
> > anything this should be a more generic memory policy basically saying
> > that a zone/node reclaim mode should be enabled for the particular
> > allocation.
> 
> MADV_HUGEPAGE means the allocation is long lived, so the cost of
> compaction is worth it in direct reclaim. Not much else. That is not
> the problem.

It seems there is no general agreement here. My understanding is that
this means that the user really prefers THP for whatever reasons.

> The problem is that even if you ignore the breakage and the regression to
> real-life workloads, what is happening right now would obviously require
> root privilege, yet MADV_HUGEPAGE requires no root privilege.

I do not follow.

> Swapping heavily because of MADV_HUGEPAGE, when there are gigabytes free
> on other nodes and not even a 4k page would be swapped out with THP
> turned off in sysfs, cannot possibly be what MADV_HUGEPAGE was meant to
> be about, and it's a kernel regression that never existed until the
> commit that added __GFP_THISNODE to the default THP heuristic in
> mempolicy.

agreed

> I think we should defer for later the problem of which is the better
> default, 4k NUMA-local or remote THP. I provided two options myself
> because it didn't matter so much which option we picked in the short
> term, as long as the bug was fixed.
> 
> I wasn't particularly happy with your patch because it still swaps with
> certain defrag settings, which still allows things that shouldn't happen
> without some kind of privileged capability.

Well, I am not really sure about defrag=always. I would rather focus on
the default behavior to plug the regression first, and think about the
`always' mode on top. Or is this a no-go from your POV?
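
For reference, the mode in question is selected via
/sys/kernel/mm/transparent_hugepage/defrag. A minimal sketch (standard
C, path as in mainline) for checking the active setting, which is shown
in brackets, e.g. "always defer defer+madvise [madvise] never":

#include <stdio.h>

int main(void)
{
	char buf[128];
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/defrag", "r");

	if (!f)
		return 1;
	if (fgets(buf, sizeof(buf), f))
		fputs(buf, stdout);	/* the active mode is bracketed */
	fclose(f);
	return 0;
}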
Michal Hocko Sept. 12, 2018, 3:25 p.m. UTC | #12
On Wed 12-09-18 16:21:26, Michal Hocko wrote:
> On Wed 12-09-18 09:54:17, Andrea Arcangeli wrote:
[...]
> > I wasn't particularly happy with your patch because it still swaps with
> > certain defrag settings, which still allows things that shouldn't happen
> > without some kind of privileged capability.
> 
> Well, I am not really sure about defrag=always. I would rather focus on
> the default behavior to plug the regression first, and think about the
> `always' mode on top. Or is this a no-go from your POV?

In other words the following on top
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 56c9aac4dc86..723e8d77c5ef 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -644,7 +644,7 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, un
 #endif
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
David Rientjes Sept. 12, 2018, 8:40 p.m. UTC | #13
On Wed, 12 Sep 2018, Michal Hocko wrote:

> > Saying that we really want THP isn't an all-or-nothing decision.  We
> > certainly want to try hard to fault hugepages locally, especially at task
> > startup when remapping our .text segment to thp, and MADV_HUGEPAGE works
> > very well for that.  Remote hugepages would be a regression that we would
> > have no way to avoid, because the kernel provides no interface to prevent
> > them, if we were to remove the __GFP_THISNODE that this patch introduces.
> 
> Why can't you use a mempolicy to bind to local nodes if you really care
> about locality?
> 

Because we do not want to oom kill, we want to fall back first to local
native pages and then to remote native pages.  That's the order of least
to greatest latency; we do not want to work hard to allocate a remote
hugepage when a local native page is faster.  This seems pretty
straightforward.
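
Hypothetically, that order could be sketched in kernel-style pseudocode
(not actual kernel code; helper names as in current mainline, the
function itself is made up):

/* Sketch of the desired fallback order: local THP without heavy
 * reclaim, then a native page from wherever the policy allows. */
static struct page *fault_in_order(gfp_t gfp)
{
	struct page *page;

	/* 1) try a hugepage on the local node only, without trying hard */
	page = __alloc_pages_node(numa_node_id(),
				  (gfp | __GFP_THISNODE | __GFP_NORETRY) &
				  ~__GFP_DIRECT_RECLAIM,
				  HPAGE_PMD_ORDER);
	if (page)
		return page;
	/* 2) fall back to a native 4k page; the default policy prefers
	 * the local node and only then goes remote */
	return alloc_pages(gfp, 0);
}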

> From what you have said so far it sounds like you would like to have
> something like the zone/node reclaim mode made fine-grained for a specific
> mapping. If we really want to support something like that then it should
> be a generic policy rather than a THP-specific thing IMHO.
> 
> As I've said, it is hard to come up with a solution that would satisfy
> everybody, but considering that the existing reports see this as a
> regression and that their NUMA requirements are not as strict as yours, I
> would tend to think that stronger NUMA requirements should be expressed
> explicitly rather than as an implicit effect of a madvise flag. We do
> have APIs for that.

Every process on every platform we have would need to define this explicit
mempolicy for users of libraries that remap text segments, because changing
the allocation behavior of thp out from under them would cause very
noticeable performance regressions.  I don't know of any platform where
remote hugepages are preferred over local native pages.  If such platforms
exist, it sounds reasonable to introduce a stronger variant of MADV_HUGEPAGE
that defines exactly what you want, rather than letting MADV_HUGEPAGE become
a dumping ground and causing userspace regressions.

Patch

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5228c62af416..bac395f1d00a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,6 +139,8 @@  struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+						unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c3bc7e9c9a2a..56c9aac4dc86 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -629,21 +629,31 @@  static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  *	    available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
+	gfp_t this_node = 0;
+	struct mempolicy *pol;
+
+#ifdef CONFIG_NUMA
+	/* __GFP_THISNODE makes sense only if there is no explicit binding */
+	pol = get_vma_policy(vma, addr);
+	if (pol->mode != MPOL_BIND)
+		this_node = __GFP_THISNODE;
+	mpol_cond_put(pol);
+#endif
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM);
+							     __GFP_KSWAPD_RECLAIM | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     0);
-	return GFP_TRANSHUGE_LIGHT;
+							     this_node);
+	return GFP_TRANSHUGE_LIGHT | this_node;
 }
 
 /* Caller must hold page table lock. */
@@ -715,7 +725,7 @@  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 			pte_free(vma->vm_mm, pgtable);
 		return ret;
 	}
-	gfp = alloc_hugepage_direct_gfpmask(vma);
+	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
 	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
@@ -1290,7 +1300,7 @@  vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
-		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
+		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
 		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
 	} else
 		new_page = NULL;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da858f794eb6..75bbfc3d6233 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1648,7 +1648,7 @@  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task.  It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);
@@ -2026,32 +2026,6 @@  alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
-		int hpage_node = node;
-
-		/*
-		 * For hugepage allocation and non-interleave policy which
-		 * allows the current node (or other explicitly preferred
-		 * node) we only try to allocate from the current/preferred
-		 * node and don't fall back to other nodes, as the cost of
-		 * remote accesses would likely offset THP benefits.
-		 *
-		 * If the policy is interleave, or does not allow the current
-		 * node in its nodemask, we allocate the standard way.
-		 */
-		if (pol->mode == MPOL_PREFERRED &&
-						!(pol->flags & MPOL_F_LOCAL))
-			hpage_node = pol->v.preferred_node;
-
-		nmask = policy_nodemask(gfp, pol);
-		if (!nmask || node_isset(hpage_node, *nmask)) {
-			mpol_cond_put(pol);
-			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
-			goto out;
-		}
-	}
-
 	nmask = policy_nodemask(gfp, pol);
 	preferred_nid = policy_node(gfp, pol, node);
 	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);