diff mbox series

[v3] xfs: avoid deadlock when trigger memory reclaim in ->writepages

Message ID 1592222181-9832-1-git-send-email-laoar.shao@gmail.com (mailing list archive)
State New, archived
Headers show
Series [v3] xfs: avoid deadlock when trigger memory reclaim in ->writepages | expand

Commit Message

Yafang Shao June 15, 2020, 11:56 a.m. UTC
Recently we hit an XFS deadlock on one of our servers running an old
kernel. The deadlock was caused by allocating memory in xfs_map_blocks()
while doing writeback on behalf of memory reclaim. Although this
deadlock happened on an old kernel, I think it could happen on upstream
as well. The issue has only happened once and can't be reproduced, so I
haven't tried to reproduce it on an upstream kernel.

Below is the call trace of this deadlock.
[480594.790087] INFO: task redis-server:16212 blocked for more than 120 seconds.
[480594.790087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[480594.790088] redis-server    D ffffffff8168bd60     0 16212  14347 0x00000004
[480594.790090]  ffff880da128f070 0000000000000082 ffff880f94a2eeb0 ffff880da128ffd8
[480594.790092]  ffff880da128ffd8 ffff880da128ffd8 ffff880f94a2eeb0 ffff88103f9d6c40
[480594.790094]  0000000000000000 7fffffffffffffff ffff88207ffc0ee8 ffffffff8168bd60
[480594.790096] Call Trace:
[480594.790101]  [<ffffffff8168dce9>] schedule+0x29/0x70
[480594.790103]  [<ffffffff8168b749>] schedule_timeout+0x239/0x2c0
[480594.790111]  [<ffffffff8168d28e>] io_schedule_timeout+0xae/0x130
[480594.790114]  [<ffffffff8168d328>] io_schedule+0x18/0x20
[480594.790116]  [<ffffffff8168bd71>] bit_wait_io+0x11/0x50
[480594.790118]  [<ffffffff8168b895>] __wait_on_bit+0x65/0x90
[480594.790121]  [<ffffffff811814e1>] wait_on_page_bit+0x81/0xa0
[480594.790125]  [<ffffffff81196ad2>] shrink_page_list+0x6d2/0xaf0
[480594.790130]  [<ffffffff811975a3>] shrink_inactive_list+0x223/0x710
[480594.790135]  [<ffffffff81198225>] shrink_lruvec+0x3b5/0x810
[480594.790139]  [<ffffffff8119873a>] shrink_zone+0xba/0x1e0
[480594.790141]  [<ffffffff81198c20>] do_try_to_free_pages+0x100/0x510
[480594.790143]  [<ffffffff8119928d>] try_to_free_mem_cgroup_pages+0xdd/0x170
[480594.790145]  [<ffffffff811f32de>] mem_cgroup_reclaim+0x4e/0x120
[480594.790147]  [<ffffffff811f37cc>] __mem_cgroup_try_charge+0x41c/0x670
[480594.790153]  [<ffffffff811f5cb6>] __memcg_kmem_newpage_charge+0xf6/0x180
[480594.790157]  [<ffffffff8118c72d>] __alloc_pages_nodemask+0x22d/0x420
[480594.790162]  [<ffffffff811d0c7a>] alloc_pages_current+0xaa/0x170
[480594.790165]  [<ffffffff811db8fc>] new_slab+0x30c/0x320
[480594.790168]  [<ffffffff811dd17c>] ___slab_alloc+0x3ac/0x4f0
[480594.790204]  [<ffffffff81685656>] __slab_alloc+0x40/0x5c
[480594.790206]  [<ffffffff811dfc43>] kmem_cache_alloc+0x193/0x1e0
[480594.790233]  [<ffffffffa04fab67>] kmem_zone_alloc+0x97/0x130 [xfs]
[480594.790247]  [<ffffffffa04f90ba>] _xfs_trans_alloc+0x3a/0xa0 [xfs]
[480594.790261]  [<ffffffffa04f915c>] xfs_trans_alloc+0x3c/0x50 [xfs]
[480594.790276]  [<ffffffffa04e958b>] xfs_iomap_write_allocate+0x1cb/0x390 [xfs]
[480594.790299]  [<ffffffffa04d3616>] xfs_map_blocks+0x1a6/0x210 [xfs]
[480594.790312]  [<ffffffffa04d416b>] xfs_do_writepage+0x17b/0x550 [xfs]
[480594.790314]  [<ffffffff8118d881>] write_cache_pages+0x251/0x4d0 [xfs]
[480594.790338]  [<ffffffffa04d3e05>] xfs_vm_writepages+0xc5/0xe0 [xfs]
[480594.790341]  [<ffffffff8118ebfe>] do_writepages+0x1e/0x40
[480594.790343]  [<ffffffff811837b5>] __filemap_fdatawrite_range+0x65/0x80
[480594.790346]  [<ffffffff81183901>] filemap_write_and_wait_range+0x41/0x90
[480594.790360]  [<ffffffffa04df2c6>] xfs_file_fsync+0x66/0x1e0 [xfs]
[480594.790363]  [<ffffffff81231cf5>] do_fsync+0x65/0xa0
[480594.790365]  [<ffffffff81231fe3>] SyS_fdatasync+0x13/0x20
[480594.790367]  [<ffffffff81698d09>] system_call_fastpath+0x16/0x1b

Note that xfs_iomap_write_allocate() is replaced by xfs_convert_blocks() in
commit 4ad765edb02a ("xfs: move xfs_iomap_write_allocate to xfs_aops.c")
and write_cache_pages() is replaced by iomap_writepages() in
commit 598ecfbaa742 ("iomap: lift the xfs writeback code to iomap").
So for upstream, the call trace should be:
xfs_vm_writepages
  -> iomap_writepages
     -> write_cache_pages
        -> iomap_do_writepage
           -> xfs_map_blocks
              -> xfs_convert_blocks
                 -> xfs_bmapi_convert_delalloc
                    -> xfs_trans_alloc // it should allocate pages with GFP_NOFS

I'm not sure whether it is proper to add GFP_NOFS to all ->writepages
implementations, so I only add it for XFS.

Stefan also reported that he saw this issue twice while under memory
pressure, so I added his Reported-by.

Reported-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>

---
v2 -> v3:
- retitle the subject from "iomap: avoid deadlock if memory reclaim is
  triggered in writepage path"
- set GFP_NOFS only for XFS ->writepages

v1 -> v2:
- retitle the subject from "xfs: avoid deadlock when tigger memory reclam
  in xfs_map_blocks()"
- set GFP_NOFS in iomap_do_writepage(), per Dave
---
 fs/xfs/xfs_aops.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

Comments

Holger Hoffstätte June 15, 2020, 2:25 p.m. UTC | #1
On 2020-06-15 13:56, Yafang Shao wrote:
> [...]
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index b356118..1ccfbf2 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -573,9 +573,21 @@ static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
>   	struct writeback_control *wbc)
>   {
>   	struct xfs_writepage_ctx wpc = { };
> +	unsigned int nofs_flag;
> +	int ret;
>   
>   	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
> -	return iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_ops);
> +
> +	/*
> +	 * We can allocate memory here while doing writeback on behalf of
> +	 * memory reclaim.  To avoid memory allocation deadlocks set the
> +	 * task-wide nofs context for the following operations.
> +	 */
> +	nofs_flag = memalloc_nofs_save();
> +	ret = iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_ops);
> +	memalloc_nofs_restore(nofs_flag);
> +
> +	return ret;
>   }
>   
>   STATIC int
> 

Not sure if I did something wrong, but while the previous version of this patch
worked fine, this one gave me (with v2 removed obviously):

[  +0.000004] WARNING: CPU: 0 PID: 2811 at fs/iomap/buffered-io.c:1544 iomap_do_writepage+0x6b4/0x780
[  +0.000001] Modules linked in: tcp_bbr2 sch_fq_codel nct6775 hwmon_vid btrfs blake2b_generic xor zstd_compress x86_pkg_temp_thermal drivetemp coretemp crc32_pclmul raid6_pq zstd_decompress aesni_intel i915 crypto_simd cryptd glue_helper intel_gtt bfq i2c_algo_bit iosf_mbi drm_kms_helper i2c_i801 mq_deadline syscopyarea sysfillrect usbhid sysimgblt fb_sys_fops drm drm_panel_orientation_quirks i2c_core atlantic video backlight
[  +0.000013] CPU: 0 PID: 2811 Comm: dmesg Not tainted 5.7.2 #1
[  +0.000000] Hardware name: System manufacturer System Product Name/P8Z68-V LX, BIOS 4105 07/01/2013
[  +0.000002] RIP: 0010:iomap_do_writepage+0x6b4/0x780
[  +0.000001] Code: ff e9 bf fc ff ff 48 8b 44 24 48 48 39 44 24 28 0f 84 f7 fb ff ff 0f 0b e9 f0 fb ff ff 4c 8b 64 24 10 41 89 c7 e9 d8 fb ff ff <0f> 0b eb 88 8b 54 24 24 85 d2 75 5f 49 8b 45 50 48 8b 40 10 48 85
[  +0.000001] RSP: 0018:ffffc90000277ae8 EFLAGS: 00010206
[  +0.000001] RAX: 0000000000444004 RBX: ffffc90000277bc0 RCX: 0000000000000018
[  +0.000000] RDX: 0000000000000000 RSI: ffffc90000277d58 RDI: ffffea001fd25e00
[  +0.000001] RBP: ffff8887f7edfd40 R08: ffffffffffffffff R09: 0000000000000006
[  +0.000001] R10: ffff88881f5dd000 R11: 000000000000020d R12: ffffea001fd25e00
[  +0.000000] R13: ffffc90000277c80 R14: ffff8887fb95e800 R15: ffffea001fd25e00
[  +0.000001] FS:  0000000000000000(0000) GS:ffff8887ff600000(0000) knlGS:0000000000000000
[  +0.000001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000000] CR2: 00007f836c0915e8 CR3: 0000000002009005 CR4: 00000000000606f0
[  +0.000001] Call Trace:
[  +0.000006]  ? page_mkclean+0x5e/0x90
[  +0.000003]  ? clear_page_dirty_for_io+0x18a/0x1d0
[  +0.000002]  write_cache_pages+0x196/0x400
[  +0.000001]  ? iomap_readpages+0x180/0x180
[  +0.000003]  iomap_writepages+0x1c/0x40
[  +0.000003]  xfs_vm_writepages+0x68/0x90
[  +0.000002]  do_writepages+0x25/0xa0
[  +0.000002]  __filemap_fdatawrite_range+0xa7/0xe0
[  +0.000002]  xfs_release+0x11b/0x160
[  +0.000002]  __fput+0xd7/0x240
[  +0.000003]  task_work_run+0x5c/0x80
[  +0.000001]  do_exit+0x33b/0x9c0
[  +0.000001]  do_group_exit+0x33/0x90
[  +0.000002]  __x64_sys_exit_group+0x14/0x20
[  +0.000001]  do_syscall_64+0x4e/0x310
[  +0.000003]  ? do_user_addr_fault+0x1df/0x460
[  +0.000004]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.000001] RIP: 0033:0x7f836bf73489
[  +0.000003] Code: Bad RIP value.
[  +0.000000] RSP: 002b:00007fff90ec3b98 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[  +0.000001] RAX: ffffffffffffffda RBX: 00007f836c0626e0 RCX: 00007f836bf73489
[  +0.000001] RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
[  +0.000001] RBP: 00007f836c0626e0 R08: ffffffffffffff80 R09: 00007fff90ec3a60
[  +0.000000] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
[  +0.000001] R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000000
[  +0.000001] ---[ end trace aed8763335accf60 ]---

..and hosed the fs by eating any writeback, zeroing out $things.
0/10 cats, do not want.

-h
Yafang Shao June 15, 2020, 2:51 p.m. UTC | #2
On Mon, Jun 15, 2020 at 10:25 PM Holger Hoffstätte
<holger@applied-asynchrony.com> wrote:
>
> On 2020-06-15 13:56, Yafang Shao wrote:
> > [...]
>
> Not sure if I did something wrong, but while the previous version of this patch
> worked fine, this one gave me (with v2 removed obviously):
>
> [  +0.000004] WARNING: CPU: 0 PID: 2811 at fs/iomap/buffered-io.c:1544 iomap_do_writepage+0x6b4/0x780
> [...]
>
> ..and hosed the fs by eating any writeback, zeroing out $things.
> 0/10 cats, do not want.
>
> -h


Oh my bad.
I overlooked this check in iomap_do_writepage():
if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
    goto redirty;

This check is to avoid recursive filesystem reclaim.

GFP_NOFS should be set after this check, or we should move this check
to the beginning of ->writepage(s).
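
A minimal sketch of the second option, assuming the check were hoisted
out of iomap_do_writepage() into the XFS ->writepages implementation
(hypothetical, not what was merged; error handling elided):

/*
 * Sketch only: do the recursion check before entering the task-wide
 * NOFS scope, so writeback entered from a normal context can still
 * run under NOFS without tripping the per-page WARN_ON_ONCE().
 */
STATIC int
xfs_vm_writepages(
	struct address_space		*mapping,
	struct writeback_control	*wbc)
{
	struct xfs_writepage_ctx	wpc = { };
	unsigned int			nofs_flag;
	int				ret;

	/* the check formerly done per-page in iomap_do_writepage() */
	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
		return 0;

	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);

	nofs_flag = memalloc_nofs_save();
	ret = iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_ops);
	memalloc_nofs_restore(nofs_flag);

	return ret;
}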
v2 is fine; you can apply it to your kernel, and feel free to let me
know if you catch any issues with it.

Thanks for your feedback.
Michal Hocko June 15, 2020, 2:53 p.m. UTC | #3
On Mon 15-06-20 16:25:52, Holger Hoffstätte wrote:
> On 2020-06-15 13:56, Yafang Shao wrote:
[...]
> > [...]
> 
> Not sure if I did something wrong, but while the previous version of this patch
> worked fine, this one gave me (with v2 removed obviously):
> 
> [  +0.000004] WARNING: CPU: 0 PID: 2811 at fs/iomap/buffered-io.c:1544 iomap_do_writepage+0x6b4/0x780

This corresponds to
        /*
         * Given that we do not allow direct reclaim to call us, we should
         * never be called in a recursive filesystem reclaim context.
         */
        if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
                goto redirty;

which effectively says that memalloc_nofs_save/restore cannot be used
for that code path. Your stack trace doesn't point to a reclaim path,
which shows that this path is shared and also underlines that this is
not really an intended use of the API. Please refer to
Documentation/core-api/gfp_mask-from-fs-io.rst for more details, but in
short the API should be used at the layer which defines a context that
shouldn't allow recursion, e.g. a lock which would be problematic in
the reclaim recursion path.
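
For reference, the usage pattern that document describes looks roughly
like this (illustrative sketch only; the lock name is hypothetical):

	unsigned int flags;

	/*
	 * Mark the scope that defines the no-recursion context -- e.g.
	 * a lock that is also taken in the reclaim path -- rather than
	 * wrapping an arbitrary caller such as ->writepages.
	 */
	flags = memalloc_nofs_save();
	mutex_lock(&sbi->reclaim_sensitive_lock);
	/* any allocation in here is implicitly GFP_NOFS */
	mutex_unlock(&sbi->reclaim_sensitive_lock);
	memalloc_nofs_restore(flags);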
Matthew Wilcox June 15, 2020, 3:07 p.m. UTC | #4
On Mon, Jun 15, 2020 at 04:53:31PM +0200, Michal Hocko wrote:
> On Mon 15-06-20 16:25:52, Holger Hoffstätte wrote:
> > On 2020-06-15 13:56, Yafang Shao wrote:
> [...]
> > > [...]
> > 
> > Not sure if I did something wrong, but while the previous version of this patch
> > worked fine, this one gave me (with v2 removed obviously):
> > 
> > [  +0.000004] WARNING: CPU: 0 PID: 2811 at fs/iomap/buffered-io.c:1544 iomap_do_writepage+0x6b4/0x780
> 
> This corresponds to
>         /*
>          * Given that we do not allow direct reclaim to call us, we should
>          * never be called in a recursive filesystem reclaim context.
>          */
>         if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
>                 goto redirty;
> 
> which effectively says that memalloc_nofs_save/restore cannot be used
> for that code path. Your stack trace doesn't point to a reclaim path,
> which shows that this path is shared and also underlines that this is
> not really an intended use of the API. Please refer to
> Documentation/core-api/gfp_mask-from-fs-io.rst for more details, but in
> short the API should be used at the layer which defines a context that
> shouldn't allow recursion, e.g. a lock which would be problematic in
> the reclaim recursion path.

I'm concerned that this might not be the correct approach.  Generally we
have the rule that freeing memory should not require allocating memory
to succeed.  Since XFS obviously does need to allocate memory to start
a transaction, this allocation should normally be protected by a mempool
or similar.

XFS is subtle and has its own rules around memory allocation and
writeback, so I don't want to say this is definitely wrong.  I'm just
saying that I have concerns it might not be right.
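
As a rough sketch, the mempool pattern being referred to would look
something like this (names and reserve size are hypothetical, and Dave
explains below why it is hard to apply to XFS transactions):

#include <linux/mempool.h>

static struct kmem_cache *trans_cache;	/* hypothetical slab cache */
static mempool_t *trans_pool;

static int trans_pool_init(void)
{
	/* keep a fixed reserve of objects so a writer can always get
	 * one without recursing into filesystem reclaim */
	trans_pool = mempool_create_slab_pool(16, trans_cache);
	return trans_pool ? 0 : -ENOMEM;
}

static void *trans_get(void)
{
	/* dips into the reserve when the slab allocation fails */
	return mempool_alloc(trans_pool, GFP_NOFS);
}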
Yafang Shao June 15, 2020, 3:08 p.m. UTC | #5
On Mon, Jun 15, 2020 at 10:53 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 15-06-20 16:25:52, Holger Hoffstätte wrote:
> > On 2020-06-15 13:56, Yafang Shao wrote:
> [...]
> > > [...]
> >
> > Not sure if I did something wrong, but while the previous version of this patch
> > worked fine, this one gave me (with v2 removed obviously):
> >
> > [  +0.000004] WARNING: CPU: 0 PID: 2811 at fs/iomap/buffered-io.c:1544 iomap_do_writepage+0x6b4/0x780
>
> This corresponds to
>         /*
>          * Given that we do not allow direct reclaim to call us, we should
>          * never be called in a recursive filesystem reclaim context.
>          */
>         if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
>                 goto redirty;
>
> which effectively says that memalloc_nofs_save/restore cannot be used
> for that code path.

Hi Michal,

My understanding is that this warning is meant to tell us we don't
want recursive filesystem reclaim with PF_MEMALLOC_NOFS specifically
set; but unfortunately PF_MEMALLOC_NOFS doesn't achieve that here, so
the path is reached again.

IOW, PF_MEMALLOC_NOFS can be set after this check, as I did in v2 [1].


> Your stack trace doesn't point to a reclaim path,
> which shows that this path is shared and also underlines that this is
> not really an intended use of the API. Please refer to
> Documentation/core-api/gfp_mask-from-fs-io.rst for more details, but in
> short the API should be used at the layer which defines a context that
> shouldn't allow recursion, e.g. a lock which would be problematic in
> the reclaim recursion path.

Thanks for the information.
As Dave pointed out in v1 [2], iomap_do_writepage() can be called with
a locked page cache page and calls ->map_blocks from that context.

[1]. https://lore.kernel.org/linux-xfs/1591254347-15912-1-git-send-email-laoar.shao@gmail.com/
[2] https://lore.kernel.org/linux-xfs/20200603222741.GQ2040@dread.disaster.area/
Dave Chinner June 15, 2020, 11:06 p.m. UTC | #6
On Mon, Jun 15, 2020 at 04:53:31PM +0200, Michal Hocko wrote:
> On Mon 15-06-20 16:25:52, Holger Hoffstätte wrote:
> > On 2020-06-15 13:56, Yafang Shao wrote:
> [...]
> > > [...]
> > 
> > Not sure if I did something wrong, but while the previous version of this patch
> > worked fine, this one gave me (with v2 removed obviously):
> > 
> > [  +0.000004] WARNING: CPU: 0 PID: 2811 at fs/iomap/buffered-io.c:1544 iomap_do_writepage+0x6b4/0x780
> 
> This corresponds to
>         /*
>          * Given that we do not allow direct reclaim to call us, we should
>          * never be called in a recursive filesystem reclaim context.
>          */
>         if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
>                 goto redirty;
> 
> which effectively says that memalloc_nofs_save/restore cannot be used
> for that code path.

No it doesn't. Everyone is ignoring the -context- of this code, most
especially the previous 3 lines of code and its comment:

        /*
         * Refuse to write the page out if we are called from reclaim context.
         *
         * This avoids stack overflows when called from deeply used stacks in
         * random callers for direct reclaim or memcg reclaim.  We explicitly
         * allow reclaim from kswapd as the stack usage there is relatively low.
         *
         * This should never happen except in the case of a VM regression so
         * warn about it.
         */
        if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
                        PF_MEMALLOC))
                goto redirty;

You will see that we specifically avoid this path from reclaim
context except for kswapd. And kswapd always runs with GFP_KERNEL
context so we allow writeback from it, so it will pass both this
check and the NOFS check above. 

IOWs, we can't trigger the WARN_ON_ONCE(current->flags &
PF_MEMALLOC_NOFS) check from a memory reclaim context: this
PF_MEMALLOC_NOFS check here is not doing what people think it is.

History lesson time, eh?

The recursion protection here -used- to use PF_FSTRANS, not
PF_MEMALLOC_NOFS. See commit 9070733b4efac ("xfs: abstract
PF_FSTRANS to PF_MEMALLOC_NOFS"). This hunk is most instructive
when you look at the comment:

         * Given that we do not allow direct reclaim to call us, we should
         * never be called while in a filesystem transaction.
         */
-       if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+       if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
                goto redirty;

It wasn't for memory allocation recursion protection in XFS - it was
for transaction reservation recursion protection by something trying
to flush data pages while holding a transaction reservation. Doing
this could deadlock the journal because the existing reservation
could prevent the nested reservation from being able to reserve space
in the journal and that is a self-deadlock vector.

IOWs, this check is not protecting against memory reclaim recursion
bugs at all (that's the previous check I quoted). This check is
protecting against the filesystem calling writepages directly from a
context where it can self-deadlock.
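
Schematically, the self-deadlock being guarded against looks like this
(hypothetical call chain, in the style of the one in the commit
message):

xfs_trans_alloc()                    // outer journal reservation held
  ...
    -> filemap_fdatawrite()          // flush dirty pages while holding it
       -> ->writepages
          -> xfs_map_blocks()
             -> xfs_trans_alloc()    // nested reservation can wait forever
                                     // for journal space pinned by the
                                     // outer reservation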

So what we are seeing here is that the PF_FSTRANS ->
PF_MEMALLOC_NOFS abstraction lost all the actual useful information
about what type of error this check was protecting against.

> Your stack trace doesn't point to a reclaim path,
> which shows that this path is shared and also underlines that this is
> not really an intended use of the API.

Actually, Michal, it was your PF_FSTRANS -> PF_MEMALLOC_NOFS
abstraction of this code that turned this from "exactly what
PF_FSTRANS was intended to catch" to what you call "unintended use
of the API"....

IOWs, putting the iomap_writepage path under NOFS context is the
right thing to do from a "prevent memory reclaim" perspective, but
now we are hitting against the problems of repurposing filesystem
specific flags for subtly different generic semantics...

I suspect we need to re-introduce PF_FSTRANS, set/clear/transfer it
again in all the places XFS used to transfer it, and change this
iomap check to use PF_FSTRANS and not PF_MEMALLOC_NOFS, because it's
clearly not and never has been a memory reclaim recursion warning
check....
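
As a sketch, the iomap side of that would be something like this
(hypothetical, not a posted patch):

 	/*
 	 * Given that we do not allow direct reclaim to call us, we should
-	 * never be called in a recursive filesystem reclaim context.
+	 * never be called while in a filesystem transaction.
 	 */
-	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
+	if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
 		goto redirty;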

Cheers,

Dave.
Dave Chinner June 15, 2020, 11:23 p.m. UTC | #7
On Mon, Jun 15, 2020 at 08:07:36AM -0700, Matthew Wilcox wrote:
> On Mon, Jun 15, 2020 at 04:53:31PM +0200, Michal Hocko wrote:
> > On Mon 15-06-20 16:25:52, Holger Hoffstätte wrote:
> > > On 2020-06-15 13:56, Yafang Shao wrote:
> > [...]
> > > > [...]
> > > 
> > > Not sure if I did something wrong, but while the previous version of this patch
> > > worked fine, this one gave me (with v2 removed obviously):
> > > 
> > > [  +0.000004] WARNING: CPU: 0 PID: 2811 at fs/iomap/buffered-io.c:1544 iomap_do_writepage+0x6b4/0x780
> > 
> > This corresponds to
> >         /*
> >          * Given that we do not allow direct reclaim to call us, we should
> >          * never be called in a recursive filesystem reclaim context.
> >          */
> >         if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> >                 goto redirty;
> > 
> > which effectively says that memalloc_nofs_save/restore cannot be used
> > for that code path. Your stack trace doesn't point to a reclaim path,
> > which shows that this path is shared and also underlines that this is
> > not really an intended use of the API. Please refer to
> > Documentation/core-api/gfp_mask-from-fs-io.rst for more details, but in
> > short the API should be used at the layer which defines a context that
> > shouldn't allow recursion, e.g. a lock which would be problematic in
> > the reclaim recursion path.
> 
> I'm concerned that this might not be the correct approach.  Generally we
> have the rule that freeing memory should not require allocating memory
> to succeed.  Since XFS obviously does need to allocate memory to start
> a transaction, this allocation should normally be protected by a mempool
> or similar.

<sigh>

We've been down this path before about exactly how much memory XFS
uses to do extent allocation, and why neither mempools nor any existing
mm infrastructure can actually guarantee filesystem transactions
forward progress, e.g.:

https://lwn.net/Articles/636797/

It's even more problematic now that extent allocation is not just one
transaction, but a series of overlapping linked transactions that
update things like the refcount and reverse mapping btrees, not just
the freespace and inode block map btrees...

> XFS is subtle and has its own rules around memory allocation and
> writeback, so I don't want to say this is definitely wrong.  I'm just
> saying that I have concerns it might not be right.

Almost all XFS memory allocations within the writeback path are
within transaction contexts - the transaction allocation itself is
an exception, but then xfs_trans_alloc() sets PF_MEMALLOC_NOFS
itself (used to be PF_FSTRANS, as per my previous comments in this
thread). Hence putting the entire ->writepages path under NOFS
context won't change anything materially for XFS.

-Dave.
Michal Hocko June 16, 2020, 7:56 a.m. UTC | #8
On Tue 16-06-20 09:06:05, Dave Chinner wrote:
> On Mon, Jun 15, 2020 at 04:53:31PM +0200, Michal Hocko wrote:
> > On Mon 15-06-20 16:25:52, Holger Hoffstätte wrote:
> > > On 2020-06-15 13:56, Yafang Shao wrote:
> > [...]
> > > > [...]
> > > > 
> > > 
> > > Not sure if I did something wrong, but while the previous version of this patch
> > > worked fine, this one gave me (with v2 removed obviously):
> > > 
> > > [  +0.000004] WARNING: CPU: 0 PID: 2811 at fs/iomap/buffered-io.c:1544 iomap_do_writepage+0x6b4/0x780
> > 
> > This corresponds to
> >         /*
> >          * Given that we do not allow direct reclaim to call us, we should
> >          * never be called in a recursive filesystem reclaim context.
> >          */
> >         if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> >                 goto redirty;
> > 
> > which effectivelly says that memalloc_nofs_save/restore cannot be used
> > for that code path.
> 
> No it doesn't. Everyone is ignoring the -context- of this code, most
> especially the previous 3 lines of code and its comment:
> 
>         /*
>          * Refuse to write the page out if we are called from reclaim context.
>          *
>          * This avoids stack overflows when called from deeply used stacks in
>          * random callers for direct reclaim or memcg reclaim.  We explicitly
>          * allow reclaim from kswapd as the stack usage there is relatively low.
>          *
>          * This should never happen except in the case of a VM regression so
>          * warn about it.
>          */
>         if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
>                         PF_MEMALLOC))
>                 goto redirty;
> 
> You will see that we specifically avoid this path from reclaim
> context except for kswapd. And kswapd always runs with GFP_KERNEL
> context so we allow writeback from it, so it will pass both this
> check and the NOFS check above. 
> 
> IOWs, we can't trigger the WARN_ON_ONCE(current->flags &
> PF_MEMALLOC_NOFS) check from a memory reclaim context: this
> PF_MEMALLOC_NOFS check here is not doing what people think it is.

You are right.

> History lesson time, eh?
> 
> The recursion protection here -used- to use PF_FSTRANS, not
> PF_MEMALLOC_NOFS. See commit 9070733b4efac ("xfs: abstract
> PF_FSTRANS to PF_MEMALLOC_NOFS"). This hunk is most instructive
> when you look at the comment:
> 
>          * Given that we do not allow direct reclaim to call us, we should
>          * never be called while in a filesystem transaction.
>          */
> -       if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
> +       if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
>                 goto redirty;
> 
> It wasn't for memory allocation recursion protection in XFS - it was
> for transaction reservation recursion protection by something trying
> to flush data pages while holding a transaction reservation. Doing
> this could deadlock the journal because the existing reservation
> could prevent the nested reservation from being able to reserve space
> in the journal and that is a self-deadlock vector.
> 
> IOWs, this check is not protecting against memory reclaim recursion
> bugs at all (that's the previous check I quoted). This check is
> protecting against the filesystem calling writepages directly from a
> context where it can self-deadlock.

Thanks for the clarification.

> So what we are seeing here is that the PF_FSTRANS ->
> PF_MEMALLOC_NOFS abstraction lost all the actual useful information
> about what type of error this check was protecting against.

I have to admit that I am not familiar with the XFS code, and the
PF_FSTRANS abstraction to PF_MEMALLOC_NOFS was mostly automatic; I
relied on the XFS maintainers to tell me if I was doing something
stupid. Now, after your explanation, it makes more sense that the
warning is indeed protecting against a different kind of issue.
Michal Hocko June 16, 2020, 8:16 a.m. UTC | #9
On Mon 15-06-20 07:56:21, Yafang Shao wrote:
> [...]
> [480594.790206]  [<ffffffff811dfc43>] kmem_cache_alloc+0x193/0x1e0
> [480594.790233]  [<ffffffffa04fab67>] kmem_zone_alloc+0x97/0x130 [xfs]
> [480594.790247]  [<ffffffffa04f90ba>] _xfs_trans_alloc+0x3a/0xa0 [xfs]
> [480594.790261]  [<ffffffffa04f915c>] xfs_trans_alloc+0x3c/0x50 [xfs]

Now, with the additional explanation from Dave, I have gone back to the
original backtrace. I am not sure which kernel version you are using,
but this path should have passed through xfs_trans_reserve(), which
sets PF_MEMALLOC_NOFS; this in turn should have made
__alloc_pages_nodemask() use __GFP_NOFS, and the memcg reclaim
shouldn't ever wait_on_page_writeback() (presumably this is where the
io_schedule is coming from). Does your kernel have ecf5fc6e9654c?
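
(For reference, the mechanism is roughly this, simplified from
current_gfp_context() in include/linux/sched/mm.h -- the real helper
also handles PF_MEMALLOC_NOIO:

static inline gfp_t current_gfp_context(gfp_t flags)
{
	if (unlikely(current->flags & PF_MEMALLOC_NOFS))
		flags &= ~__GFP_FS;
	return flags;
}

so any allocation issued while PF_MEMALLOC_NOFS is set behaves as if it
were GFP_NOFS.)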
Yafang Shao June 16, 2020, 9:05 a.m. UTC | #10
On Tue, Jun 16, 2020 at 4:16 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 15-06-20 07:56:21, Yafang Shao wrote:
> > [...]
> > [480594.790206]  [<ffffffff811dfc43>] kmem_cache_alloc+0x193/0x1e0
> > [480594.790233]  [<ffffffffa04fab67>] kmem_zone_alloc+0x97/0x130 [xfs]
> > [480594.790247]  [<ffffffffa04f90ba>] _xfs_trans_alloc+0x3a/0xa0 [xfs]
> > [480594.790261]  [<ffffffffa04f915c>] xfs_trans_alloc+0x3c/0x50 [xfs]
>
> Now, with the additional explanation from Dave, I have gone back to the
> original backtrace. I am not sure which kernel version you are using,
> but this path should have passed through xfs_trans_reserve(), which
> sets PF_MEMALLOC_NOFS; this in turn should have made
> __alloc_pages_nodemask() use __GFP_NOFS, and the memcg reclaim
> shouldn't ever wait_on_page_writeback() (presumably this is where the
> io_schedule is coming from).

Hi Michal,

The page is allocated before calling xfs_trans_reserve(), so
PF_MEMALLOC_NOFS hasn't been set yet. See below:

xfs_trans_alloc
    kmem_zone_zalloc()  <<< GFP_NOFS hasn't been set yet, but this may
                            trigger memory reclaim
    xfs_trans_reserve() <<< GFP_NOFS is set here (for kernels prior to
                            commit 9070733b4efac, it is PF_FSTRANS)
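
In other words, roughly (simplified sketch of that era's code, details
elided):

int
xfs_trans_alloc(struct xfs_mount *mp, struct xfs_trans_res *resp,
		uint blocks, uint rtextents, uint flags,
		struct xfs_trans **tpp)
{
	struct xfs_trans	*tp;
	int			error;

	/* runs with the caller's allocation context; if the caller is
	 * not already NOFS, this can recurse into filesystem reclaim */
	tp = kmem_zone_zalloc(xfs_trans_zone, 0);

	/* only here does the task become PF_MEMALLOC_NOFS
	 * (PF_FSTRANS before commit 9070733b4efac) */
	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
	if (error) {
		xfs_trans_cancel(tp);
		return error;
	}

	*tpp = tp;
	return 0;
}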


>  Does your kernel have
> ecf5fc6e9654c?

Yes, we have backported this commit to our in-house kernel.
Michal Hocko June 16, 2020, 9:27 a.m. UTC | #11
On Tue 16-06-20 17:05:25, Yafang Shao wrote:
> On Tue, Jun 16, 2020 at 4:16 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 15-06-20 07:56:21, Yafang Shao wrote:
> > > Recently there is a XFS deadlock on our server with an old kernel.
> > > This deadlock is caused by allocating memory in xfs_map_blocks() while
> > > doing writeback on behalf of memroy reclaim. Although this deadlock happens
> > > on an old kernel, I think it could happen on the upstream as well. This
> > > issue only happens once and can't be reproduced, so I haven't tried to
> > > reproduce it on upsteam kernel.
> > >
> > > Bellow is the call trace of this deadlock.
> > > [480594.790087] INFO: task redis-server:16212 blocked for more than 120 seconds.
> > > [480594.790087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [480594.790088] redis-server    D ffffffff8168bd60     0 16212  14347 0x00000004
> > > [480594.790090]  ffff880da128f070 0000000000000082 ffff880f94a2eeb0 ffff880da128ffd8
> > > [480594.790092]  ffff880da128ffd8 ffff880da128ffd8 ffff880f94a2eeb0 ffff88103f9d6c40
> > > [480594.790094]  0000000000000000 7fffffffffffffff ffff88207ffc0ee8 ffffffff8168bd60
> > > [480594.790096] Call Trace:
> > > [480594.790101]  [<ffffffff8168dce9>] schedule+0x29/0x70
> > > [480594.790103]  [<ffffffff8168b749>] schedule_timeout+0x239/0x2c0
> > > [480594.790111]  [<ffffffff8168d28e>] io_schedule_timeout+0xae/0x130
> > > [480594.790114]  [<ffffffff8168d328>] io_schedule+0x18/0x20
> > > [480594.790116]  [<ffffffff8168bd71>] bit_wait_io+0x11/0x50
> > > [480594.790118]  [<ffffffff8168b895>] __wait_on_bit+0x65/0x90
> > > [480594.790121]  [<ffffffff811814e1>] wait_on_page_bit+0x81/0xa0
> > > [480594.790125]  [<ffffffff81196ad2>] shrink_page_list+0x6d2/0xaf0
> > > [480594.790130]  [<ffffffff811975a3>] shrink_inactive_list+0x223/0x710
> > > [480594.790135]  [<ffffffff81198225>] shrink_lruvec+0x3b5/0x810
> > > [480594.790139]  [<ffffffff8119873a>] shrink_zone+0xba/0x1e0
> > > [480594.790141]  [<ffffffff81198c20>] do_try_to_free_pages+0x100/0x510
> > > [480594.790143]  [<ffffffff8119928d>] try_to_free_mem_cgroup_pages+0xdd/0x170
> > > [480594.790145]  [<ffffffff811f32de>] mem_cgroup_reclaim+0x4e/0x120
> > > [480594.790147]  [<ffffffff811f37cc>] __mem_cgroup_try_charge+0x41c/0x670
> > > [480594.790153]  [<ffffffff811f5cb6>] __memcg_kmem_newpage_charge+0xf6/0x180
> > > [480594.790157]  [<ffffffff8118c72d>] __alloc_pages_nodemask+0x22d/0x420
> > > [480594.790162]  [<ffffffff811d0c7a>] alloc_pages_current+0xaa/0x170
> > > [480594.790165]  [<ffffffff811db8fc>] new_slab+0x30c/0x320
> > > [480594.790168]  [<ffffffff811dd17c>] ___slab_alloc+0x3ac/0x4f0
> > > [480594.790204]  [<ffffffff81685656>] __slab_alloc+0x40/0x5c
> > > [480594.790206]  [<ffffffff811dfc43>] kmem_cache_alloc+0x193/0x1e0
> > > [480594.790233]  [<ffffffffa04fab67>] kmem_zone_alloc+0x97/0x130 [xfs]
> > > [480594.790247]  [<ffffffffa04f90ba>] _xfs_trans_alloc+0x3a/0xa0 [xfs]
> > > [480594.790261]  [<ffffffffa04f915c>] xfs_trans_alloc+0x3c/0x50 [xfs]
> >
> > Now with more explanation from Dave I have got back to the original
> > backtrace. Not sure which kernel version you are using, but this path
> > should have passed xfs_trans_reserve, which sets PF_MEMALLOC_NOFS and
> > this in turn should have made __alloc_pages_nodemask use __GFP_NOFS
> > and the memcg reclaim shouldn't ever wait_on_page_writeback (presumably
> > this is what the io_schedule is coming from).
> 
> Hi Michal,
>
> The page is allocated before calling xfs_trans_reserve(), so
> PF_MEMALLOC_NOFS hasn't been set yet.
> See below:
>
> xfs_trans_alloc
>     kmem_zone_zalloc() <<< GFP_NOFS hasn't been set yet, but it may
> trigger memory reclaim
>     xfs_trans_reserve() <<<< GFP_NOFS is set here (for kernels
> prior to commit 9070733b4efac, it is PF_FSTRANS)

You are right, I have misread the actual allocation site. 8683edb7755b8
has added KM_NOFS and 73d30d48749f8 has removed it. I cannot really
judge the correctness here.
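
Concretely, the two commits flip the transaction allocation between
roughly (a sketch, not the exact hunks):

	tp = kmem_zone_zalloc(xfs_trans_zone, KM_NOFS); /* after 8683edb7755b8 */
	tp = kmem_zone_zalloc(xfs_trans_zone, 0);       /* after 73d30d48749f8:
	                                                   plain sleeping alloc,
	                                                   no implicit GFP_NOFS */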
Yafang Shao June 16, 2020, 9:39 a.m. UTC | #12
On Tue, Jun 16, 2020 at 5:27 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 16-06-20 17:05:25, Yafang Shao wrote:
> > [...]
>
> You are right, I have misread the actual allocation site. 8683edb7755b8
> has added KM_NOFS and 73d30d48749f8 has removed it. I cannot really
> judge the correctness here.
>

The history is complicated, but it doesn't matter.
Let's turn back to the upstream kernel now. As I explained in the commit log,
xfs_vm_writepages
  -> iomap_writepages
     -> write_cache_pages
        -> lock_page <<<< This page is locked.
        -> writepages (which is iomap_do_writepage)
           -> xfs_map_blocks
              -> xfs_convert_blocks
                 -> xfs_bmapi_convert_delalloc
                    -> xfs_trans_alloc
                         -> kmem_zone_zalloc //It should alloc the page
with GFP_NOFS

If GFP_NOFS isn't set in xfs_trans_alloc(), kmem_zone_zalloc() may
trigger memory reclaim, which may then wait on the page locked in
write_cache_pages() ...
That means ->writepages should run under GFP_NOFS to avoid this
recursive filesystem reclaim.
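
To see why a task-wide NOFS context closes this hole: once
PF_MEMALLOC_NOFS is set, the page allocator strips __GFP_FS from every
allocation the task makes, so reclaim entered from those allocations
cannot recurse into the filesystem. A paraphrased sketch of the helper
in include/linux/sched/mm.h of this era:

static inline gfp_t current_gfp_context(gfp_t flags)
{
	if (unlikely(current->flags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS))) {
		/* NOIO is the stronger context and takes precedence. */
		if (current->flags & PF_MEMALLOC_NOIO)
			flags &= ~(__GFP_IO | __GFP_FS);
		else if (current->flags & PF_MEMALLOC_NOFS)
			flags &= ~__GFP_FS;
	}
	return flags;
}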
Yafang Shao June 16, 2020, 10:17 a.m. UTC | #13
On Tue, Jun 16, 2020 at 7:06 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Jun 15, 2020 at 04:53:31PM +0200, Michal Hocko wrote:
> > On Mon 15-06-20 16:25:52, Holger Hoffstätte wrote:
> > > On 2020-06-15 13:56, Yafang Shao wrote:
> > [...]
> > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > index b356118..1ccfbf2 100644
> > > > --- a/fs/xfs/xfs_aops.c
> > > > +++ b/fs/xfs/xfs_aops.c
> > > > @@ -573,9 +573,21 @@ static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
> > > >           struct writeback_control *wbc)
> > > >   {
> > > >           struct xfs_writepage_ctx wpc = { };
> > > > + unsigned int nofs_flag;
> > > > + int ret;
> > > >           xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
> > > > - return iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_ops);
> > > > +
> > > > + /*
> > > > +  * We can allocate memory here while doing writeback on behalf of
> > > > +  * memory reclaim.  To avoid memory allocation deadlocks set the
> > > > +  * task-wide nofs context for the following operations.
> > > > +  */
> > > > + nofs_flag = memalloc_nofs_save();
> > > > + ret = iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_ops);
> > > > + memalloc_nofs_restore(nofs_flag);
> > > > +
> > > > + return ret;
> > > >   }
> > > >   STATIC int
> > > >
> > >
> > > Not sure if I did something wrong, but while the previous version of this patch
> > > worked fine, this one gave me (with v2 removed obviously):
> > >
> > > [  +0.000004] WARNING: CPU: 0 PID: 2811 at fs/iomap/buffered-io.c:1544 iomap_do_writepage+0x6b4/0x780
> >
> > This corresponds to
> >         /*
> >          * Given that we do not allow direct reclaim to call us, we should
> >          * never be called in a recursive filesystem reclaim context.
> >          */
> >         if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> >                 goto redirty;
> >
> > which effectively says that memalloc_nofs_save/restore cannot be used
> > for that code path.
>
> No it doesn't. Everyone is ignoring the -context- of this code, most
> especially the previous 3 lines of code and its comment:
>
>         /*
>          * Refuse to write the page out if we are called from reclaim context.
>          *
>          * This avoids stack overflows when called from deeply used stacks in
>          * random callers for direct reclaim or memcg reclaim.  We explicitly
>          * allow reclaim from kswapd as the stack usage there is relatively low.
>          *
>          * This should never happen except in the case of a VM regression so
>          * warn about it.
>          */
>         if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
>                         PF_MEMALLOC))
>                 goto redirty;
>
> You will see that we specifically avoid this path from reclaim
> context except for kswapd. And kswapd always runs with GFP_KERNEL
> context so we allow writeback from it, so it will pass both this
> check and the NOFS check above.
>
> IOWs, we can't trigger the WARN_ON_ONCE(current->flags &
> PF_MEMALLOC_NOFS) check from a memory reclaim context: this
> PF_MEMALLOC_NOFS check here is not doing what people think it is.
>
> History lesson time, eh?
>
> The recursion protection here -used- to use PF_FSTRANS, not
> PF_MEMALLOC_NOFS. See commit 9070733b4efac ("xfs: abstract
> PF_FSTRANS to PF_MEMALLOC_NOFS"). This hunk is most instructive
> when you look at the comment:
>
>          * Given that we do not allow direct reclaim to call us, we should
>          * never be called while in a filesystem transaction.
>          */

This comment is clearer.

> -       if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
> +       if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
>                 goto redirty;
>
> It wasn't for memory allocation recursion protection in XFS - it was
> for transaction reservation recursion protection by something trying
> to flush data pages while holding a transaction reservation. Doing
> this could deadlock the journal because the existing reservation
> could prevent the nested reservation from being able to reserve space
> in the journal and that is a self-deadlock vector.
>
> IOWs, this check is not protecting against memory reclaim recursion
> bugs at all (that's the previous check I quoted). This check is
> protecting against the filesystem calling writepages directly from a
> context where it can self-deadlock.
>
> So what we are seeing here is that the PF_FSTRANS ->
> PF_MEMALLOC_NOFS abstraction lost all the actual useful information
> about what type of error this check was protecting against.
>

Agreed. The check of PF_MEMALLOC_NOFS here really confused me before.
I think it will confuse others as well.

> > Your stack trace doesn't point to a reclaim path
> > which shows that this path is shared and also underlines that this is
> > not really an intended use of the api.
>
> Actually, Michal, it was your PF_FSTRANS -> PF_MEMALLOC_NOFS
> abstraction of this code that turned this from "exactly what
> PF_FSTRANS was intended to catch" to what you call "unintended use
> of the API"....
>
> IOWs, putting the iomap_writepage path under NOFS context is the
> right thing to do from a "prevent memory reclaim" perspective, but
> now we are hitting against the problems of repurposing filesystem
> specific flags for subtly different generic semantics...
>
> I suspect we need to re-introduce PF_FSTRANS, set/clear/transfer it
> again in all the places XFS used to transfer it, and change this
> iomap check to use PF_FSTRANS and not PF_MEMALLOC_NOFS, because it's
> clearly not and never has been a memory reclaim recursion warning
> check....
>

Agree with you that it's better to convert it back to PF_FSTRANS;
otherwise it will mislead more people.
I would like to send a patch to change it back if there is no
objection to this proposal.



--
Thanks
Yafang
Michal Hocko June 16, 2020, 10:48 a.m. UTC | #14
On Tue 16-06-20 17:39:33, Yafang Shao wrote:
> [...]
> 
> The history is complicated, but it doesn't matter.
> Let's turn back to the upstream kernel now. As I explained in the commit log,
> xfs_vm_writepages
>   -> iomap_writepages
>      -> write_cache_pages
>         -> lock_page <<<< This page is locked.
>         -> writepages (which is iomap_do_writepage)
>            -> xfs_map_blocks
>               -> xfs_convert_blocks
>                  -> xfs_bmapi_convert_delalloc
>                     -> xfs_trans_alloc
>                          -> kmem_zone_zalloc //It should alloc the page
> with GFP_NOFS
>
> If GFP_NOFS isn't set in xfs_trans_alloc(), kmem_zone_zalloc() may
> trigger memory reclaim, which may then wait on the page locked in
> write_cache_pages() ...

This cannot happen because the memory reclaim backs off on locked pages.
Have a look at trylock_page at the very beginning of the shrink_page_list
loop. You are likely waiting on a different page which is not being
flushed because of the FS ordering requirement or something like that.
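
A paraphrased sketch of the loop being referred to (mm/vmscan.c,
shrink_page_list(); almost everything elided):

	while (!list_empty(page_list)) {
		struct page *page = lru_to_page(page_list);

		list_del(&page->lru);
		if (!trylock_page(page))
			goto keep;	/* pages locked by someone else are
					 * skipped, never waited on */
		...
		if (PageWriteback(page)) {
			...		/* reclaim may wait here - on the
					 * writeback bit, not the page lock */
		}
		...
keep:
		list_add(&page->lru, &ret_pages);
	}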

> > That means ->writepages should run under GFP_NOFS to avoid this
> > recursive filesystem reclaim.
Yafang Shao June 16, 2020, 11:42 a.m. UTC | #15
On Tue, Jun 16, 2020 at 6:48 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 16-06-20 17:39:33, Yafang Shao wrote:
> > [...]
> >
> > If GFP_NOFS isn't set in xfs_trans_alloc(), kmem_zone_zalloc() may
> > trigger memory reclaim, which may then wait on the page locked in
> > write_cache_pages() ...
>
> This cannot happen because the memory reclaim backs off on locked pages.
> Have a look at trylock_page at the very beginning of the shrink_page_list
> loop. You are likely waiting on a different page which is not being
> flushed because of the FS ordering requirement or something like that.
>

Right, there can be multiple pages in flight, some of which are already
under PG_writeback. When a new page in this set is being processed,
reclaim is triggered and then waits on a page in the same set that is
already under PG_writeback.

Dave Chinner June 18, 2020, 12:34 a.m. UTC | #16
On Tue, Jun 16, 2020 at 12:48:06PM +0200, Michal Hocko wrote:
> On Tue 16-06-20 17:39:33, Yafang Shao wrote:
> > [...]
> > 
> > If GFP_NOFS isn't set in xfs_trans_alloc(), kmem_zone_zalloc() may
> > trigger memory reclaim, which may then wait on the page locked in
> > write_cache_pages() ...
> 
> This cannot happen because the memory reclaim backs off on locked pages.

->writepages can hold a bio with multiple PageWriteback pages
already attached to it. Direct GFP_KERNEL page reclaim can wait on
them - if that happens the bio will never be issued and so
reclaim will deadlock waiting for the writeback state to clear...
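
Spelled out as a timeline (a sketch of the sequence just described, not
taken from an actual trace):

  ->writepages task                       same task, recursing into reclaim
  -----------------                       ---------------------------------
  lock page A
  set PageWriteback(A), attach A to
  the ioend's bio (bio not yet issued)
  lock page B, xfs_map_blocks()
    xfs_trans_alloc()
      kmem_zone_zalloc(..., GFP_KERNEL)
        -> direct reclaim          ----> shrink_page_list() picks page A,
                                         waits for PageWriteback(A) to clear
                                         ... which never happens, because the
                                         bio that would write A is held by
                                         the very task blocked here.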

> > That means ->writepages should run under GFP_NOFS to avoid this
> > recursive filesystem reclaim.

Indeed. We already have parts of the IO submission path under
PF_MEMALLOC_NOFS so we can do transaction allocation, etc. See
xfs_prepare_ioend(), which is called from iomap via:

iomap_submit_ioend()
  ->prepare_ioend()
    xfs_prepare_ioend()

we can get there from:

iomap_writepage()
  iomap_do_writepage()
    iomap_writepage_map()
      iomap_submit_ioend()
  iomap_submit_ioend()

and:

iomap_writepages()
  write_cache_pages()
    iomap_do_writepage()
      iomap_writepage_map()
	iomap_submit_ioend()
  iomap_submit_ioend()

Which says that we really should be putting both iomap_writepage()
and iomap_writepages() under PF_MEMALLOC_NOFS context so that
filesystem callouts don't have to repeatedly enter and exit
PF_MEMALLOC_NOFS context to avoid memory reclaim recursion...

Cheers,

Dave.
Yafang Shao June 18, 2020, 11:04 a.m. UTC | #17
On Thu, Jun 18, 2020 at 8:34 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Jun 16, 2020 at 12:48:06PM +0200, Michal Hocko wrote:
> > On Tue 16-06-20 17:39:33, Yafang Shao wrote:
> > > [...]
> > >
> > > If GFP_NOFS isn't set in xfs_trans_alloc(), kmem_zone_zalloc() may
> > > trigger memory reclaim, which may then wait on the page locked in
> > > write_cache_pages() ...
> >
> > This cannot happen because the memory reclaim backs off on locked pages.
>
> ->writepages can hold a bio with multiple PageWriteback pages
> already attached to it. Direct GFP_KERNEL page reclaim can wait on
> them - if that happens the bio will never be issued and so
> reclaim will deadlock waiting for the writeback state to clear...
>

Thanks for the explanation.

> > > That means ->writepages should run under GFP_NOFS to avoid this
> > > recursive filesystem reclaim.
>
> Indeed. We already have parts of the IO submission path under
> PF_MEMALLOC_NOFS so we can do transaction allocation, etc. See
> xfs_prepare_ioend(), which is called from iomap via:
>
> iomap_submit_ioend()
>   ->prepare_ioend()
>     xfs_prepare_ioend()
>
> we can get there from:
>
> iomap_writepage()
>   iomap_do_writepage()
>     iomap_writepage_map()
>       iomap_submit_ioend()
>   iomap_submit_ioend()
>
> and:
>
> iomap_writepages()
>   write_cache_pages()
>     iomap_do_writepage()
>       iomap_writepage_map()
>         iomap_submit_ioend()
>   iomap_submit_ioend()
>
> Which says that we really should be putting both iomap_writepage()
> and iomap_writepages() under PF_MEMALLOC_NOFS context so that
> filesystem callouts don't have to repeatedly enter and exit
> PF_MEMALLOC_NOFS context to avoid memory reclaim recursion...
>

Looks reasonable.
I will update this patch to put both iomap_writepage() and
iomap_writepages() under PF_MEMALLOC_NOFS context and try to remove
the usage of PF_MEMALLOC_NOFS from the filesystem callouts.
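
Roughly, that would look like the following sketch of iomap_writepages()
(illustrative only: iomap_writepage() would need the same treatment, and
the PF_MEMALLOC_NOFS WARN_ON_ONCE in iomap_do_writepage() discussed
earlier would have to be adjusted first):

int
iomap_writepages(struct address_space *mapping, struct writeback_control *wbc,
		struct iomap_writepage_ctx *wpc,
		const struct iomap_writeback_ops *ops)
{
	unsigned int nofs_flag;
	int ret;

	/*
	 * Writeback allocates memory (transactions, ioends, bios) while
	 * holding locked pages and pages under writeback; run the whole
	 * walk, including ioend submission, under task-wide NOFS context.
	 */
	nofs_flag = memalloc_nofs_save();
	wpc->ops = ops;
	ret = write_cache_pages(mapping, wbc, iomap_do_writepage, wpc);
	if (wpc->ioend)
		ret = iomap_submit_ioend(wpc, wpc->ioend, ret);
	memalloc_nofs_restore(nofs_flag);
	return ret;
}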
Yafang Shao June 22, 2020, 12:20 p.m. UTC | #18
On Mon, Jun 22, 2020 at 9:24 AM kernel test robot <rong.a.chen@intel.com> wrote:
>
> Greeting,
>
> FYI, we noticed the following commit (built with gcc-9):
>
> commit: 59d77e81c5d114f74768d05d4d5faa87232a1efe ("[PATCH v3] xfs: avoid deadlock when trigger memory reclaim in ->writepages")
> url: https://github.com/0day-ci/linux/commits/Yafang-Shao/xfs-avoid-deadlock-when-trigger-memory-reclaim-in-writepages/20200615-195749
> base: https://git.kernel.org/cgit/fs/xfs/xfs-linux.git for-next
>
> in testcase: xfstests
> with following parameters:
>
>         disk: 4HDD
>         fs: xfs
>         test: generic-group20
>
> test-description: xfstests is a regression test suite for xfs and other filesystems.
> test-url: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
>
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
> +----------------------------------------------------------------------------+------------+------------+
> |                                                                            | 8cc0072469 | 59d77e81c5 |
> +----------------------------------------------------------------------------+------------+------------+
> | boot_successes                                                             | 9          | 6          |
> | boot_failures                                                              | 2          | 14         |
> | Kernel_panic-not_syncing:VFS:Unable_to_mount_root_fs_on_unknown-block(#,#) | 2          |            |
> | WARNING:at_fs/iomap/buffered-io.c:#iomap_do_writepage                      | 0          | 14         |
> | RIP:iomap_do_writepage                                                     | 0          | 14         |
> | WARNING:at_fs/iomap/direct-io.c:#iomap_dio_actor                           | 0          | 8          |
> | RIP:iomap_dio_actor                                                        | 0          | 8          |
> | Assertion_failed                                                           | 0          | 8          |
> | kernel_BUG_at_fs/xfs/xfs_message.c                                         | 0          | 8          |
> | invalid_opcode:#[##]                                                       | 0          | 8          |
> | RIP:assfail[xfs]                                                           | 0          | 8          |
> | Kernel_panic-not_syncing:Fatal_exception                                   | 0          | 8          |
> +----------------------------------------------------------------------------+------------+------------+
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot <rong.a.chen@intel.com>
>
>
> [   52.588044] WARNING: CPU: 1 PID: 2062 at fs/iomap/buffered-io.c:1544 iomap_do_writepage+0x1d3/0x1f0
> [   52.591594] Modules linked in: xfs libcrc32c dm_mod bochs_drm drm_vram_helper drm_ttm_helper ttm sr_mod cdrom drm_kms_helper sg snd_pcm ata_generic pata_acpi intel_rapl_msr snd_timer syscopyarea ppdev sysfillrect snd sysimgblt fb_sys_fops joydev ata_piix soundcore intel_rapl_common drm crc32c_intel libata serio_raw pcspkr i2c_piix4 parport_pc floppy parport ip_tables
> [   52.602353] CPU: 1 PID: 2062 Comm: fsstress Not tainted 5.7.0-rc4-00131-g59d77e81c5d114 #1
> [   52.605523] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> [   52.608722] RIP: 0010:iomap_do_writepage+0x1d3/0x1f0
> [   52.611186] Code: 41 5e 41 5f c3 f6 c2 04 75 24 85 d2 0f 84 54 ff ff ff c6 00 00 f6 c2 02 0f 84 48 ff ff ff 31 c9 66 89 4c 10 fe e9 3c ff ff ff <0f> 0b eb b7 c7 00 00 00 00 00 c7 44 10 fc 00 00 00 00 e9 25 ff ff
> [   52.618316] RSP: 0018:ffffbe1d42927a98 EFLAGS: 00010206
> [   52.620962] RAX: 0000000000440140 RBX: ffffbe1d42927b18 RCX: 0000000000000010
> [   52.624026] RDX: 0000000000000000 RSI: ffffbe1d42927cc8 RDI: fffff7ef474323c0
> [   52.627075] RBP: ffffbe1d42927cc8 R08: ffff9966bffd5000 R09: 00000000000323d5
> [   52.630195] R10: 0000000000032380 R11: 000000000000020c R12: fffff7ef474323c0
> [   52.633406] R13: ffffbe1d42927be0 R14: fffff7ef474323c0 R15: ffff996651b56140
> [   52.636616] FS:  00007f0ed0b09b40(0000) GS:ffff9966bfd00000(0000) knlGS:0000000000000000
> [   52.639920] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   52.642665] CR2: 000055b70fba1000 CR3: 00000001cbf54000 CR4: 00000000000006e0
> [   52.645780] Call Trace:
> [   52.647878]  write_cache_pages+0x171/0x470
> [   52.650230]  ? iomap_write_begin+0x4f0/0x4f0
> [   52.652608]  iomap_writepages+0x1c/0x40
> [   52.656698]  xfs_vm_writepages+0x86/0xc0 [xfs]
> [   52.659277]  do_writepages+0x43/0xe0
> [   52.661535]  __filemap_fdatawrite_range+0xce/0x110
> [   52.664108]  filemap_write_and_wait_range+0x42/0xa0
> [   52.666830]  xfs_setattr_size+0x29d/0x490 [xfs]
> [   52.669419]  ? setattr_prepare+0x6a/0x1d0
> [   52.671908]  xfs_vn_setattr+0x70/0xb0 [xfs]
> [   52.674277]  notify_change+0x357/0x4d0
> [   52.676615]  do_truncate+0x76/0xd0
> [   52.678836]  vfs_truncate+0x161/0x1c0
> [   52.681042]  do_sys_truncate+0x8a/0xa0
> [   52.683473]  do_syscall_64+0x5b/0x1f0
> [   52.685670]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [   52.688176] RIP: 0033:0x7f0ed000fe97
> [   52.690338] Code: 00 00 00 48 8b 15 01 60 2b 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 d6 e9 20 ff ff ff b8 4c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d1 5f 2b 00 f7 d8 64 89 01 48
> [   52.697406] RSP: 002b:00007ffd47b3b828 EFLAGS: 00000206 ORIG_RAX: 000000000000004c
> [   52.700542] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0ed000fe97
> [   52.703615] RDX: 00000000000b7add RSI: 00000000000b7add RDI: 000055b70fb9b4b0
> [   52.706522] RBP: 00007ffd47b3b990 R08: 000000000000006a R09: 0000000000000003
> [   52.709555] R10: 00000000000002eb R11: 0000000000000206 R12: 00000000000b7add
> [   52.712596] R13: 0000000051eb851f R14: 00007ffd47b3be80 R15: 000055b70f5bef00
> [   52.715559] ---[ end trace bfc085456879090f ]---
>
>
> To reproduce:
>
>         # build kernel
>         cd linux
>         cp config-5.7.0-rc4-00131-g59d77e81c5d114 .config
>         make HOSTCC=gcc-9 CC=gcc-9 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
>         make HOSTCC=gcc-9 CC=gcc-9 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
>         cd <mod-install-dir>
>         find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz
>
>
>         git clone https://github.com/intel/lkp-tests.git
>         cd lkp-tests
>         bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email
>
>
>

Thanks for the report.
That issue has already been addressed in another thread [1].

[1]. https://lore.kernel.org/linux-xfs/1592222181-9832-1-git-send-email-laoar.shao@gmail.com/
diff mbox series

Patch

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b356118..1ccfbf2 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -573,9 +573,21 @@  static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
 	struct writeback_control *wbc)
 {
 	struct xfs_writepage_ctx wpc = { };
+	unsigned int nofs_flag;
+	int ret;
 
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
-	return iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_ops);
+
+	/*
+	 * We can allocate memory here while doing writeback on behalf of
+	 * memory reclaim.  To avoid memory allocation deadlocks set the
+	 * task-wide nofs context for the following operations.
+	 */
+	nofs_flag = memalloc_nofs_save();
+	ret = iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_ops);
+	memalloc_nofs_restore(nofs_flag);
+
+	return ret;
 }
 
 STATIC int